Goal¶
In this tutorial, we explore clustering methods in Python, focusing on K-Means and Agglomerative Hierarchical Clustering. Clustering is a powerful technique used to group similar data points based on their features, making it easier to analyze and interpret complex datasets.
K-Means Clustering¶
K-Means is a popular clustering algorithm that groups an unlabeled dataset into different clusters.
Required Modules for the Jupyter Notebook¶
Before running the notebook, make sure to load the following modules.
Modules: make_blobs, numpy, matplotlib, KMeans, dendrogram, linkage, AgglomerativeClustering
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import AgglomerativeClustering
We will create a dataset of 500 points in a 2-dimensional space with 4 clusters.
# make_blobs returns a tuple: (point coordinates, cluster labels)
data = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1.2, random_state=50)
# array of (x, y) coordinates for the 500 points
points = data[0]
xvals = points[:,0]
yvals = points[:,1]
# ground-truth cluster index for each point
cluster_num = data[1]
Now we plot the points in the 2D space. They are colored based on the cluster they belong to.
# create scatter plot
plt.scatter(xvals, yvals, c=cluster_num, cmap='viridis')
plt.show()
In k-means, we first place 4 random cluster centroids. The algorithm then repeats the following two steps until convergence (a minimal sketch follows the list):
- calculate the distance from each point to the K centroids and assign the point to the closest centroid
- recalculate each centroid's position as the mean of the points in its cluster
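To make these two steps concrete, here is an illustrative NumPy sketch of a single pass. The kmeans_step name is ours, and the sketch omits details such as handling empty clusters.
# Illustrative sketch only (not scikit-learn's implementation),
# using the numpy imported above
def kmeans_step(points, centroids):
    # assignment step: distance from every point to every centroid,
    # computed by broadcasting -> shape (n_points, n_centroids)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    closest = dists.argmin(axis=1)
    # update step: move each centroid to the mean of its assigned points
    new_centroids = np.array([points[closest == k].mean(axis=0)
                              for k in range(len(centroids))])
    return closest, new_centroids
In practice we use scikit-learn's optimized implementation, as in the next cell.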
kmeans = KMeans(n_clusters = 4, n_init=10)
kmeans.fit(points)
y_kmeans = kmeans.predict(points)
import os
# Get the number of available CPUs
num_cpus = os.cpu_count()
print(f"Number of CPUs available: {num_cpus}")
Number of CPUs available: 40
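The CPU count is worth knowing because recent scikit-learn releases parallelize KMeans across cores using OpenMP threads. As a sketch (assuming an OpenMP-based scikit-learn build), you can cap the thread count with an environment variable set before the scientific libraries are imported:
import os
# assumption: KMeans' OpenMP threading honors OMP_NUM_THREADS; set this
# before importing numpy/scikit-learn for it to take effect reliably
os.environ["OMP_NUM_THREADS"] = "8"  # example value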
Now we plot the points in the 2D space, colored by the cluster K-Means assigned them to. The red dots represent the centers of the clusters.
plt.scatter(xvals, yvals, c = y_kmeans)
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c = 'red', s = 90);
plt.show()
Agglomerative Hierarchical Clustering¶
We have 15 data points in 3 clusters. These data points are stored in the data_H array.
data_H = make_blobs(n_samples=15, n_features=2, centers=3, cluster_std=1.2, random_state=10)[0]
plt.figure(figsize = (12, 6))
labels = list(range(1, len(data_H) + 1))
plt.scatter(data_H[:,0], data_H[:,1])
# annotate each point with its label so it can be matched to the dendrogram
for label, xval, yval in zip(labels, data_H[:, 0], data_H[:, 1]):
    plt.annotate(label, xy = (xval, yval), xytext = (-4, 0),
                 textcoords = 'offset points', ha = 'right', va = 'bottom')
plt.show()
Now we will create a dendrogram of the data. Points that are merged lower down are more similar (such as points 1 and 9).
# single-linkage hierarchical clustering
linked = linkage(data_H, 'single')
plt.figure(figsize = (12, 7))
dendrogram(linked, orientation = 'top', labels = labels,
           distance_sort = 'descending', show_leaf_counts = True)
plt.show()
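If you want flat cluster labels directly from this linkage matrix, SciPy's fcluster can cut the tree into a chosen number of clusters; a minimal sketch using the linked matrix above:
from scipy.cluster.hierarchy import fcluster
# cut the single-linkage tree into (at most) 3 flat clusters; labels run 1..3
flat_labels = fcluster(linked, t=3, criterion = 'maxclust')
print(flat_labels)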
Next, we will cluster the data points using AgglomerativeClustering from sklearn. Note that the clustering assigns the same members to each cluster as the dendrogram above.
# note: the 'affinity' keyword was renamed to 'metric' in newer scikit-learn releases
cluster = AgglomerativeClustering(n_clusters = 3, metric = 'euclidean', linkage = 'ward')
cluster.fit_predict(data_H)
plt.figure(figsize = (12, 6))
plt.scatter(data_H[:,0], data_H[:,1], c = cluster.labels_)
for label, xval, yval in zip(labels, data_H[:, 0], data_H[:, 1]):
    plt.annotate(label, xy = (xval, yval), xytext = (-4, 0),
                 textcoords = 'offset points', ha = 'right', va = 'bottom')
plt.show()
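One way to check that claim programmatically (cluster IDs may differ between methods, so we compare the partitions up to relabeling) is the adjusted Rand index from scikit-learn; a sketch, assuming the flat_labels from the fcluster example above:
from sklearn.metrics import adjusted_rand_score
# a score of 1.0 means the two label vectors describe identical partitions
print(adjusted_rand_score(flat_labels, cluster.labels_))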
Submit Ticket¶
If you find anything that needs to be changed, edited, or if you would like to provide feedback or contribute to the notebook, please submit a ticket by contacting us at:
Email: consult@sdsc.edu
We appreciate your input and will review your suggestions promptly!