[Advanced Learning Algorithms] #10. Decision Trees

👩‍💻LEARN : ML&Data/Lecture

[Advanced Learning Algorithms] #10. Decision Trees

쟈니유 2023. 3. 28. 18:55

728x90

#10. Decision Trees

Decision Trees

▶️ Decision Tree model

Root node → Decision nodes → Leaf nodes

▶️ Learning Process

1️⃣ Decision 1 : How to choose what features to spit on at each node

Maximize purity (or minimize impurity) : 최대한 하나의 클래스만 결과값에 나오도록 하는 것

2️⃣ Decision 2 : When do you stop splitting?

When a node is 100% one class
When Splitting a node will result in the tree exceeding a maximum depth(설정값)
When improvements in purity score are below a threshold
When the number of examples in a node is below a threshold

Decision tree learning

▶️ Measuring Purity

Entropy as a measure of impurity

Entropy graph : p1의 값에 따라 Purity를 확인할 수 있음 (P1이 0 혹은 1에 가까울수록 엔트로피가 0이 됨)
H(p1) = -p1 * log2(p1) - p0*log2(p0) = -p1 * log2(p1) - (1-p1) * log2(1-p1) ※ p0 = 1-p1

▶️ Choosing a split : Information Gain

Information gain : Sample 수를 가중치로 엔트로피를 곱한 값을 Root node의 엔트로피에서 빼서 split option을 선택

Root node : 고양이 5, 강아지 5이므로 p1 = 0.5, H(0.5) = 1
차이를 본 것이므로, 엔트로피 값을 '얼마나 줄였는가'가 결과값이므로 해당 값이 가장 큰 것 = 선택값
그러므로 아래 예에서는 0.28이 가장 많이 줄였으므로 선택되는 split이 됨

w = 개수, p = 고양이의 확률

▶️ Putting it together

Decision Tree Learning 방법

1. Root node 에 모든 샘플들을 놓고 시작
2. 모든 가능한 feature에 대한 information gain을 계산한 후, 가장 높은 값을 가진 Feature선택
3. 해당 feature에 따라 dataset을 분리하고, split 하는 것을 계속 진행
4. 아래 기준을 충족하면 split을 멈춤
- 4.1 하나의 노드가 100% 하나의 클래스(분류)가 될 때
- 4.2 설정한 최대 depth가 되었을 때
- 4.3 설정한 threshold보다 information gain이 작아질때
- 4.4 설정한 threshold보다 한 노드에 있는 샘플 수가 작아질 때

▶️ Using one-hot encoding of categorical features

3개 이상의 feature로 node를 만들어야 할 때엔 원핫인코딩을 사용해서 binary feature로 만들어야 함
e.g. 귀 모양이 1) 뾰족하다 2) 접혔다 3) 둥글다 일 경우 feature를 1)뾰족여부 2)접힌 여부 3) 둥근 여부로 변환하여 0, 1 반환

▶️ Continuous valued features

연속형 변수를 사용하게 될 경우, 해당 변수의 기준 (threshold)을 잡아서 binary feature로 만들어야 함
기준점은 information gain을 활용해서 가장 높은 기준을 선택 (10개의 sample이 있다면 split할 수 있는 기준인 9번 해봐야함)

▶️ Regression Trees

Regression Tree를 Regression (Predict number)에도 사용할 수 있음

Split 기준

전체 샘플의 variance - {각 feature 분류의 (variance 값 * 샘플의 개수)의 합} 가장 high 한 것을 선택.

Variance를 가장 많이 줄이는 선택지를 택하기 위함

#1. 엔트로피 계산식 

def entropy(p):
    if p == 0 or p == 1:
        return 0
    else:
        return -p * np.log2(p) - (1- p)*np.log2(1 - p)
        
#2. Split 

def split_indices(X, index_feature):
    """Given a dataset and a index feature, return two lists for the two split nodes, the left node has the animals that have 
    that feature = 1 and the right node those that have the feature = 0 
    index feature = 0 => ear shape
    index feature = 1 => face shape
    index feature = 2 => whiskers
    """
    left_indices = []
    right_indices = []
    for i,x in enumerate(X):     #인덱스와 원소로 이루어진 튜플(tuple)반환 
        if x[index_feature] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)
    return left_indices, right_indices
    
    
#3. Weighted entropy 

def weighted_entropy(X,y,left_indices,right_indices):
    """
    This function takes the splitted dataset, the indices we chose to split and returns the weighted entropy.
    """
    w_left = len(left_indices)/len(X)
    w_right = len(right_indices)/len(X)
    p_left = sum(y[left_indices])/len(left_indices)
    p_right = sum(y[right_indices])/len(right_indices)
    
    weighted_entropy = w_left * entropy(p_left) + w_right * entropy(p_right)
    return weighted_entropy
    
    
#4. Information gain 

def information_gain(X, y, left_indices, right_indices):
    """
    Here, X has the elements in the node and y is theirs respectives classes
    """
    p_node = sum(y)/len(y)
    h_node = entropy(p_node)
    w_entropy = weighted_entropy(X,y,left_indices,right_indices)
    return h_node - w_entropy
    
    
#Computing

for i, feature_name in enumerate(['Ear Shape', 'Face Shape', 'Whiskers']):
    left_indices, right_indices = split_indices(X_train, i)
    i_gain = information_gain(X_train, y_train, left_indices, right_indices)
    print(f"Feature: {feature_name}, information gain if we split the root node using this feature: {i_gain:.2f}")

Tree ensembles

▶️ Using multiple decision trees

하나의 tree 만 있을 경우, 하나의 데이터가 바뀌면 tree가 완전 다르게 나올 수 있음

그러므로 Multiple Tree ensembles을 만들어 새로운 데이터가 들어왔을 때 과반수로 결정할 수 있게 함

▶️ Sampling with replacement

New training set : data set에서 random sampling을 통해 여러개의 데이터셋을 확보 (중복 데이터 선택 가능!)

▶️ Random forest algorithm

Generating a tree sample

New training set으로 decision tree를 만들기 * 각각의 새로운 data set (B번 반복)

B의 값이 높을수록 좋지만 100이상 되면 큰 성능 향상은 없음

Randomizing the feature choice

At each node, when choosing a feature to use to split, if n features are available, pick a random subset of k (<n) features and allow the algorithm to only choose from that subset of feature.

즉 전체 feature의 수(n)보다 적은 수로 트레이닝 샘플에 들어갈 feature의 수(k)를 선택하는데, 주로 n의 제곱근값을 취한다

▶️ XGBoost

Boosted trees intuition

여러번 tree를 만들면서 training set sampling을 할 때, 무작위로 sampling을 하되 기존 tree가 잘 분류하지 못한 original data set의 data들을 sample에 포함시키면서 tree model의 성능을 올리는 것

XGBoost (eXtreme Gradient Boosting)

#Classification 

from xgboost import XGBClassifier 

model = XGBClassifier()

model.fit(X_train,Y_train)
model.predict(X_test)

#Regression

from xgboost import XGBRegressor

model = XGBRegressor()

model.fit(X_train,Y_train)
model.predict(X_test)

'👩‍💻LEARN : ML&Data > Lecture' 카테고리의 다른 글

[Unsupervised Learning, Recommenders, Reinforcement Learning] #2. Anomaly detection (0)	2023.03.28
[Unsupervised Learning, Recommenders, Reinforcement Learning] #1. Clustering (0)	2023.03.28
[Advanced Learning Algorithms] #9. Machine learning development process (0)	2023.03.28
[Advanced Learning Algorithms] #8. Bias and variance (0)	2023.03.28
[Advanced Leaning Algorithms] #7. Advice for applying machine learning (0)	2023.03.28

현재글[Advanced Learning Algorithms] #10. Decision Trees

전방에 정체가 있어 새로운 길로 안내합니다

지도학습, 문과생살아남기, 프로퇴사러, People Analytics, 미분, 경사하강법, 코세라, 컨볼루션, 적분, 선형회귀, 딥러닝, 비지도학습, coursera, 노잼, 7차교육과정은 미적분을 안배웠어요, 머신러닝을위한수학, HR Analytics, neural network, HRD, 지금은 개념만 우겨넣자..우선...,

Today :
Yesterday :

일	월	화	수	목	금	토
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

낡고 지친 회사원