[Unsupervised Learning, Recommenders, Reinforcement Learning] #1. Clustering

쟈니유 2023. 3. 28. 20:32

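Deciding to finish this course in 7 days... was probably a very bad choice... ⭐️

For now I'm cramming it all in... and will try to understand it later... 🫥

The concepts make sense at a high level, but implementing them in code? Huh, this works? Huh, this doesn't? That's where I am.

→ I'll finish the course first... and then implement the code again from a blank page... I can do this...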

#1. Clustering

▶️ What is Clustering

Clustering works without labels y: the algorithm has to find structure in the data itself

▶️ K-means intuition

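1. Pick K random points as the initial cluster centers (cluster centroids)

2. Assign each data point to its closest centroid

3. Move each centroid to the mean of the points assigned to it

4. Go back to step 2 → any point that is now closer to a different (moved) centroid is reassigned to that centroid

5. Repeat until the assignments no longer change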

▶️ K-means algorithm

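Randomly initialize the centroids of the K clusters
Repeat the following:
- Assign each x(i) to its closest cluster centroid
  - c(i) = index of the cluster (centroid) that x(i) is assigned to
- Move each cluster centroid to the mean of the points assigned to it
  - mean = average of points assigned to cluster
  - each x is a vector with one dimension per feature, so each centroid has the same number of dimensions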
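
The two repeated steps above can be sketched as one loop. This is only a minimal sketch, not the course's reference code: `run_kmeans`, `assign`, `max_iters`, the early stop, and the choice to leave an empty cluster's centroid where it is are all my own assumptions.

```python
import numpy as np


def assign(X, centroids):
    # d[i, k] = squared Euclidean distance between X[i] and centroids[k]
    d = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    return np.argmin(d, axis=1)


def run_kmeans(X, initial_centroids, max_iters=10):
    """Hypothetical helper: repeat the assignment and update steps."""
    centroids = np.array(initial_centroids, dtype=float)
    K = centroids.shape[0]

    for _ in range(max_iters):
        # Step 1: assign every example to its closest centroid.
        idx = assign(X, centroids)

        # Step 2: move each centroid to the mean of its assigned examples.
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[idx == k]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                new_centroids[k] = members.mean(axis=0)

        # Stop early once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Recompute the assignment so it matches the returned centroids.
    return centroids, assign(X, centroids)
```

Calling `run_kmeans(X, X[:K])` would use the first K examples as starting centroids, which is just a convenient choice for experimenting.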

▶️ Optimization objective

Notation
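
The notation itself did not survive in these notes. What follows is the notation commonly used for K-means and matches the c(i) used in the algorithm section, so treat it as my reconstruction rather than a copy of the lecture slide:

```latex
\begin{aligned}
c^{(i)} &: \text{index } (1, \dots, K) \text{ of the cluster to which } x^{(i)} \text{ is currently assigned} \\
\mu_k &: \text{centroid of cluster } k \\
\mu_{c^{(i)}} &: \text{centroid of the cluster to which } x^{(i)} \text{ is assigned}
\end{aligned}
```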

Cost function (=Distortion cost function)

The average squared distance between each x(i) and the centroid of the cluster it is assigned to → the goal is to minimize this
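
Written out from the description above, the cost would look like this (my reconstruction, assuming m training examples and writing the centroid of the cluster assigned to x(i) as mu with subscript c(i)):

```latex
J\left(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K\right)
  = \frac{1}{m} \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^{2}
```
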
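Note that the cost averages the squared Euclidean distances, not the distances themselves. Be careful when coding it: np.linalg.norm returns the unsquared distance, so it has to be squared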
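
To make the "squared" part concrete, here is a small sketch of the cost in NumPy. `compute_cost` is a name I made up; it assumes `idx` holds the assigned centroid index of every example, as in the code further down.

```python
import numpy as np


def compute_cost(X, idx, centroids):
    """Hypothetical helper: mean squared distance to the assigned centroid."""
    # centroids[idx] picks, for every example, the centroid it is assigned to.
    diffs = X - centroids[idx]
    # Squared Euclidean distance per example (no square root), then the average.
    return float(np.mean(np.sum(diffs ** 2, axis=1)))
```

Using `np.linalg.norm(diffs, axis=1)` without squaring would give the average distance instead, which is a different quantity from the distortion.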

▶️ Initializing K-means

Random Initialization

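→ K-means picks random training examples as the initial centroids, so it can get stuck in a local minimum and produce a poor clustering

→ Run it several times with different random initializations and keep the clustering with the lowest cost ^^!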
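
A rough sketch of "run it many times and keep the cheapest result". The function names, `n_runs`, `max_iters`, and the fixed `seed` are my own assumptions, and the inner loop simply runs a fixed number of iterations instead of checking for convergence.

```python
import numpy as np


def kmeans_once(X, K, rng, max_iters=50):
    # Random initialization: pick K distinct training examples as centroids.
    centroids = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
    for _ in range(max_iters):
        d = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        idx = np.argmin(d, axis=1)
        for k in range(K):
            if np.any(idx == k):  # assumption: empty clusters stay put
                centroids[k] = X[idx == k].mean(axis=0)
    # Final assignment and distortion (mean squared distance).
    d = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    idx = np.argmin(d, axis=1)
    cost = float(np.mean(np.min(d, axis=1)))
    return centroids, idx, cost


def best_of_n_runs(X, K, n_runs=20, seed=0):
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, K, rng) for _ in range(n_runs)]
    # Keep the run with the lowest cost.
    return min(runs, key=lambda run: run[2])
```

Each run starts from different random examples, so runs that land in a bad local minimum simply lose the comparison on cost.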

▶️ Choosing the number of clusters

Elbow method

Run K-means for several values of K, plot the resulting cost against K, and pick the K where the curve bends (the elbow)
The limitation is that for most data the curve has no sharp bend
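
A sketch of how the cost-per-K values for an elbow plot could be collected. `lowest_cost` and `elbow_costs` are hypothetical names, the tiny dataset is made up for illustration, and the plotting itself is left out (the dictionary could be passed to any plotting library).

```python
import numpy as np


def lowest_cost(X, K, n_runs=10, max_iters=50, seed=0):
    """Lowest distortion found over several random initializations."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _run in range(n_runs):
        centroids = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
        for _step in range(max_iters):
            d = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
            idx = np.argmin(d, axis=1)
            for k in range(K):
                if np.any(idx == k):  # assumption: empty clusters stay put
                    centroids[k] = X[idx == k].mean(axis=0)
        d = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        best = min(best, float(np.mean(np.min(d, axis=1))))
    return best


def elbow_costs(X, k_values):
    # One cost per candidate K; plot these against K and look for the bend.
    return {K: lowest_cost(X, K) for K in k_values}


# Made-up data: two tight pairs of points far apart from each other.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
for K, cost in elbow_costs(X, [1, 2, 3]).items():
    print(K, cost)
```

On this toy data the cost drops sharply from K=1 to K=2 and only slightly after that, which is the kind of bend the elbow method looks for.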

Evaluate K-means based on how well it performs on that later purpose

Look at the nature of the data and what the clusters will be used for, and choose the K that seems to describe the data best for that purpose

Reference: code

np.linalg.norm
np.argmin

#1. Finding the closest cluster centroid for each example

import numpy as np

def find_closest_centroids(X, centroids):
    """
    
    Args:
        X (ndarray): (m, n) Input values      
        centroids (ndarray): (K, n) centroids
    
    Returns:
        idx (array_like): (m,) closest centroids
    
    """

    # Set K
    K = centroids.shape[0]

    # You need to return the following variables correctly
    idx = np.zeros(X.shape[0], dtype=int)

    ### START CODE HERE ###
    
    for i in range(X.shape[0]):
        distance = []

        for j in range(K):
            # Euclidean distance: square root of the sum of squared differences
            norm_ij = np.linalg.norm(X[i] - centroids[j])
            distance.append(norm_ij)

        # Index of the closest centroid (smallest distance)
        idx[i] = np.argmin(distance)

    ### END CODE HERE ###
    
    return idx
    
    
#2. Moving each centroid to the mean of its assigned points

def compute_centroids(X, idx, K):
    """
    
    Args:
        X (ndarray):   (m, n) Data points
        idx (ndarray): (m,) Array containing index of closest centroid for each 
                       example in X. Concretely, idx[i] contains the index of 
                       the centroid closest to example i
        K (int):       number of centroids
    
    Returns:
        centroids (ndarray): (K, n) New centroids computed
    """
    
    # Useful variables
    m, n = X.shape
    
    # You need to return the following variables correctly
    centroids = np.zeros((K, n))
    
    ### START CODE HERE ###
    
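    for k in range(K):
        # Mean of the points currently assigned to centroid k
        # (note: a cluster with no assigned points yields NaN here)
        points = X[idx == k]
        centroids[k] = np.mean(points, axis=0)

    ### END CODE HERE ###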
    
    return centroids
