What is Clustering?
Clustering is an unsupervised learning technique that groups similar data points together without needing labels. It's used for customer segmentation, anomaly detection, image compression, and more.
Common Applications:
- Customer segmentation for marketing
- Document categorization
- Image segmentation
- Anomaly detection
- Recommendation systems
K-Means Clustering
How K-Means Works
1. Choose K (the number of clusters)
2. Randomly initialize K centroids
3. Assign each point to its nearest centroid
4. Update each centroid to the mean of its assigned points
5. Repeat steps 3-4 until the assignments stop changing (convergence)
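To make these steps concrete, here is a minimal from-scratch sketch of the loop in NumPy (plain random initialization instead of k-means++, and no handling of empty clusters); in practice you would use scikit-learn's KMeans as shown in the Implementation below.
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids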
Implementation
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
# Generate sample data
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# Train K-Means
kmeans = KMeans(
    n_clusters=4,        # Number of clusters
    init='k-means++',    # Smart initialization (spreads out the initial centroids)
    n_init=10,           # Number of restarts with different centroid seeds
    max_iter=300,        # Max iterations per run
    random_state=42
)
kmeans.fit(X)
# Get results
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
# Plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200)
plt.title('K-Means Clustering')
plt.show()
Choosing Optimal K - Elbow Method
# Try different K values
inertias = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
# Plot elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-cluster sum of squares)')
plt.title('Elbow Method')
plt.show()
# Look for "elbow" - where curve bends
Silhouette Score
from sklearn.metrics import silhouette_score
# Evaluate clustering quality
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette Score = {score:.3f}")
# Higher score = better clustering
# Score ranges from -1 to 1
# >0.5 = good, >0.7 = strong
DBSCAN (Density-Based)
DBSCAN finds clusters of arbitrary shape and identifies outliers. Unlike K-Means, you don't need to specify the number of clusters!
Implementation
from sklearn.cluster import DBSCAN
# Train DBSCAN
dbscan = DBSCAN(
    eps=0.5,        # Max distance for two points to count as neighbors
    min_samples=5   # Min points (including the point itself) in a neighborhood to form a core point
)
labels = dbscan.fit_predict(X)
# -1 indicates outliers/noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters found: {n_clusters}")
print(f"Outliers: {n_noise}")
# Plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('DBSCAN Clustering')
plt.show()
Advantages of DBSCAN
- Finds clusters of any shape (not just circular)
- Automatically determines number of clusters
- Identifies outliers
- Robust to noise
Disadvantages
- Sensitive to eps and min_samples parameters (a k-distance plot can help choose eps; see the sketch after this list)
- Struggles with varying densities
- Doesn't work well in high dimensions
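A common heuristic for picking eps is the k-distance plot: sort every point's distance to its k-th nearest neighbor (with k set to min_samples) and look for the knee of the curve. A minimal sketch, reusing X from the examples above:
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt

min_samples = 5
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)           # each row includes the point itself at distance 0
k_distances = np.sort(distances[:, -1])   # distance within which each point has min_samples neighbors
plt.plot(k_distances)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to {min_samples}-th nearest neighbor')
plt.title('k-distance Plot for Choosing eps')
plt.show()
# The distance at the sharp bend ("knee") of the curve is a reasonable eps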
Hierarchical Clustering
Hierarchical clustering builds a tree of clusters (a dendrogram). There are two approaches: agglomerative (bottom-up, repeatedly merging the closest clusters) and divisive (top-down, repeatedly splitting); scikit-learn implements the agglomerative variant.
Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
# Train hierarchical clustering
agg = AgglomerativeClustering(
    n_clusters=4,
    linkage='ward'   # 'ward', 'complete', 'average', 'single'
)
labels = agg.fit_predict(X)
# Create dendrogram
linkage_matrix = linkage(X, method='ward')
plt.figure(figsize=(10, 7))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
# Plot clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('Agglomerative Clustering')
plt.show()
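Because the dendrogram records every merge and the distance at which it happens, you can also obtain flat clusters by cutting the tree at a chosen height instead of fixing n_clusters in advance. A minimal sketch using SciPy's fcluster and the linkage_matrix from above (the threshold of 10 is an arbitrary example; read a sensible cut height off your own dendrogram):
from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram at distance 10: merges above this height are ignored
labels_cut = fcluster(linkage_matrix, t=10, criterion='distance')
print(f"Clusters at this cut: {len(set(labels_cut))}")
# Or request a fixed number of clusters directly from the linkage matrix
labels_four = fcluster(linkage_matrix, t=4, criterion='maxclust')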
Linkage Methods
- Ward: Minimizes variance (default, usually best)
- Complete: Maximum distance between clusters
- Average: Average distance between all pairs
- Single: Minimum distance (can create chains)
Gaussian Mixture Models (GMM)
GMM performs probabilistic clustering: instead of a single hard label, each point gets a probability of belonging to each Gaussian component.
from sklearn.mixture import GaussianMixture
# Train GMM
gmm = GaussianMixture(
    n_components=4,           # Number of Gaussian components
    covariance_type='full',   # 'full', 'tied', 'diag', 'spherical'
    random_state=42
)
gmm.fit(X)
# Hard clustering (most likely cluster)
labels = gmm.predict(X)
# Soft clustering (probabilities)
probabilities = gmm.predict_proba(X)
print("Probabilities for first sample:")
print(probabilities[0])  # e.g. something like [0.9, 0.05, 0.03, 0.02]
# Generate new samples
new_samples, _ = gmm.sample(100)
# Plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.scatter(new_samples[:, 0], new_samples[:, 1], c='red', marker='x', alpha=0.3)
plt.title('Gaussian Mixture Model')
plt.show()
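As with the elbow method for K-Means, you can choose the number of components by fitting several GMMs and comparing information criteria via scikit-learn's bic() and aic() methods (lower is better). A rough sketch on the same X:
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

n_range = range(1, 11)
bics, aics = [], []
for n in n_range:
    gmm = GaussianMixture(n_components=n, covariance_type='full', random_state=42)
    gmm.fit(X)
    bics.append(gmm.bic(X))   # Bayesian Information Criterion
    aics.append(gmm.aic(X))   # Akaike Information Criterion
plt.plot(n_range, bics, 'bo-', label='BIC')
plt.plot(n_range, aics, 'go-', label='AIC')
plt.xlabel('Number of components')
plt.ylabel('Information criterion')
plt.legend()
plt.title('GMM Model Selection')
plt.show()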
Comparing Clustering Algorithms
| Algorithm | Pros | Cons | Use When |
|---|---|---|---|
| K-Means | Fast, scalable, simple | Need to specify K, assumes spherical clusters | Large datasets, spherical clusters |
| DBSCAN | Finds arbitrary shapes, detects outliers | Sensitive to parameters, struggles with varying density | Non-spherical clusters, noisy data |
| Hierarchical | No need to specify K, dendrogram visualization | Slow O(n²), memory intensive | Small datasets, need hierarchy |
| GMM | Soft clustering, probabilistic, flexible shapes | Slower than K-Means, sensitive to initialization | Need probabilities, overlapping clusters |
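To make the comparison concrete, the sketch below runs all four algorithms on the same X and reports silhouette scores (one rough signal, not a definitive ranking; silhouette favors compact, well-separated clusters, and DBSCAN's noise points are excluded before scoring):
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

models = {
    'K-Means': KMeans(n_clusters=4, random_state=42),
    'DBSCAN': DBSCAN(eps=0.5, min_samples=5),
    'Hierarchical': AgglomerativeClustering(n_clusters=4),
    'GMM': GaussianMixture(n_components=4, random_state=42),
}
for name, model in models.items():
    labels = model.fit_predict(X)
    mask = labels != -1          # drop DBSCAN noise points before scoring
    if len(set(labels[mask])) > 1:
        print(f"{name}: silhouette = {silhouette_score(X[mask], labels[mask]):.3f}")
    else:
        print(f"{name}: fewer than 2 clusters found")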
Evaluating Clusters
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
# Silhouette Score (higher is better, -1 to 1)
silhouette = silhouette_score(X, labels)
print(f"Silhouette Score: {silhouette:.3f}")
# Davies-Bouldin Index (lower is better, >= 0)
db_score = davies_bouldin_score(X, labels)
print(f"Davies-Bouldin Score: {db_score:.3f}")
# Calinski-Harabasz Index (higher is better)
ch_score = calinski_harabasz_score(X, labels)
print(f"Calinski-Harabasz Score: {ch_score:.1f}")
Real-World Example: Customer Segmentation
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Load customer data
df = pd.read_csv('customers.csv')
# Features: age, income, spending_score
# Preprocess
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[['age', 'income', 'spending_score']])
# Find optimal K
inertias = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
# Plot inertias as in the elbow section above to pick K (K=4 is used below)
# Train final model
kmeans = KMeans(n_clusters=4, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)
# Analyze segments
for cluster in range(4):
    segment = df[df['cluster'] == cluster]
    print(f"\nCluster {cluster}:")
    print(f"  Size: {len(segment)}")
    print(f"  Avg Age: {segment['age'].mean():.1f}")
    print(f"  Avg Income: ${segment['income'].mean():.0f}")
    print(f"  Avg Spending: {segment['spending_score'].mean():.1f}")
# Visualize (2D projection)
import matplotlib.pyplot as plt
plt.scatter(df['income'], df['spending_score'], c=df['cluster'], cmap='viridis')
plt.xlabel('Income')
plt.ylabel('Spending Score')
plt.title('Customer Segments')
plt.colorbar(label='Cluster')
plt.show()
Key Takeaways
- K-Means: Fast and scalable, use for large datasets
- DBSCAN: Finds arbitrary shapes, detects outliers
- Hierarchical: Creates tree structure, good for small data
- GMM: Soft clustering with probabilities
- Scale features before clustering: distance-based algorithms are sensitive to feature scales
- Evaluate with silhouette score or domain knowledge