🔵 Clustering Algorithms

Discover hidden patterns in unlabeled data

What is Clustering?

Clustering is an unsupervised learning technique that groups similar data points together without using labels. It's used for customer segmentation, anomaly detection, image compression, and more.

Common Applications:

  • Customer segmentation for marketing
  • Document categorization
  • Image segmentation
  • Anomaly detection
  • Recommendation systems

🎯 K-Means Clustering

How K-Means Works

  1. Choose K (number of clusters)
  2. Randomly initialize K centroids
  3. Assign each point to nearest centroid
  4. Update centroids (mean of assigned points)
  5. Repeat steps 3-4 until convergence
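
A minimal from-scratch sketch of these steps (assuming a NumPy array of shape (n_samples, n_features) and Euclidean distance; in practice you would use scikit-learn's KMeans as shown below):

import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids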

Implementation

from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Generate sample data
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Train K-Means
kmeans = KMeans(
    n_clusters=4,           # Number of clusters
    init='k-means++',       # Smart initialization
    n_init=10,              # Runs with different centroid seeds (best kept)
    max_iter=300,           # Max iterations
    random_state=42
)

kmeans.fit(X)

# Get results
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200)
plt.title('K-Means Clustering')
plt.show()

Choosing Optimal K - Elbow Method

# Try different K values
inertias = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-cluster sum of squares)')
plt.title('Elbow Method')
plt.show()

# Look for the "elbow" - the K where the curve bends and inertia stops dropping sharply
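
If you prefer not to eyeball the bend, the third-party kneed package (an extra dependency, not part of scikit-learn) can locate it programmatically; a rough sketch reusing the inertias computed above:

# pip install kneed   (assumed installed)
from kneed import KneeLocator

kl = KneeLocator(list(K_range), inertias, curve='convex', direction='decreasing')
print(f"Suggested K: {kl.elbow}")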

Silhouette Score

from sklearn.metrics import silhouette_score

# Evaluate clustering quality
silhouette_scores = []

for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette Score = {score:.3f}")

# Higher score = better-separated clusters
# Score ranges from -1 to 1
# > 0.5 suggests reasonable structure, > 0.7 strong structure

🌐 DBSCAN (Density-Based)

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters of arbitrary shape and flags outliers as noise. Unlike K-Means, you don't need to specify the number of clusters.

Implementation

from sklearn.cluster import DBSCAN

# Train DBSCAN
dbscan = DBSCAN(
    eps=0.5,               # Max distance for two points to count as neighbors
    min_samples=5          # Min points in a neighborhood to form a core point
)

labels = dbscan.fit_predict(X)

# -1 indicates outliers/noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)

print(f"Clusters found: {n_clusters}")
print(f"Outliers: {n_noise}")

# Plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('DBSCAN Clustering')
plt.show()

Advantages of DBSCAN

  • Finds clusters of arbitrary shape
  • Detects outliers as noise automatically
  • No need to specify the number of clusters up front

Disadvantages

  • Sensitive to the eps and min_samples parameters
  • Struggles when clusters have very different densities
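
The blob data used above is roughly spherical, so DBSCAN's main strength - handling arbitrary shapes - shows better on data like scikit-learn's make_moons. A quick sketch (the eps and min_samples values are illustrative):

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Two interleaving half-moons - K-Means would split these poorly
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

moon_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons)

plt.scatter(X_moons[:, 0], X_moons[:, 1], c=moon_labels, cmap='viridis')
plt.title('DBSCAN on Non-Spherical Data')
plt.show()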

🌳 Hierarchical Clustering

Creates a tree of clusters (dendrogram). Two approaches: Agglomerative (bottom-up) and Divisive (top-down).

Agglomerative Clustering

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Train hierarchical clustering
agg = AgglomerativeClustering(
    n_clusters=4,
    linkage='ward'     # 'ward', 'complete', 'average', 'single'
)

labels = agg.fit_predict(X)

# Create dendrogram
linkage_matrix = linkage(X, method='ward')
plt.figure(figsize=(10, 7))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

# Plot clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('Agglomerative Clustering')
plt.show()

Linkage Methods

  • ward - merges the pair of clusters that least increases within-cluster variance (default; requires Euclidean distances)
  • complete - uses the maximum distance between points of the two clusters
  • average - uses the average distance between points of the two clusters
  • single - uses the minimum distance between points of the two clusters
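
A small sketch comparing how the linkage choice affects results on the same blob data (scores will vary with the dataset):

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

for method in ['ward', 'complete', 'average', 'single']:
    labels = AgglomerativeClustering(n_clusters=4, linkage=method).fit_predict(X)
    print(f"{method:>8}: silhouette = {silhouette_score(X, labels):.3f}")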

🎨 Gaussian Mixture Models (GMM)

GMM performs probabilistic (soft) clustering: each point gets a probability of belonging to every cluster instead of a single hard label.

from sklearn.mixture import GaussianMixture

# Train GMM
gmm = GaussianMixture(
    n_components=4,        # Number of Gaussian components
    covariance_type='full', # 'full', 'tied', 'diag', 'spherical'
    random_state=42
)

gmm.fit(X)

# Hard clustering (most likely cluster)
labels = gmm.predict(X)

# Soft clustering (probabilities)
probabilities = gmm.predict_proba(X)
print("Probabilities for first sample:")
print(probabilities[0])  # e.g. [0.9, 0.05, 0.03, 0.02]

# Generate new samples
new_samples, _ = gmm.sample(100)

# Plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.scatter(new_samples[:, 0], new_samples[:, 1], c='red', marker='x', alpha=0.3)
plt.title('Gaussian Mixture Model')
plt.show()
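
The number of components for a GMM is often chosen with an information criterion such as BIC (lower is better); a brief sketch on the same data:

# Compare BIC across candidate component counts
for n in range(2, 8):
    gmm_n = GaussianMixture(n_components=n, random_state=42).fit(X)
    print(f"n_components={n}: BIC = {gmm_n.bic(X):.1f}")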

📊 Comparing Clustering Algorithms

K-Means
  • Pros: Fast, scalable, simple
  • Cons: Must specify K; assumes spherical clusters
  • Use when: Large datasets with roughly spherical clusters

DBSCAN
  • Pros: Finds arbitrary shapes, detects outliers
  • Cons: Sensitive to parameters; struggles with varying density
  • Use when: Non-spherical clusters, noisy data

Hierarchical
  • Pros: No need to specify K, dendrogram visualization
  • Cons: Slow (O(n²)), memory intensive
  • Use when: Small datasets, need a hierarchy

GMM
  • Pros: Soft clustering, probabilistic, flexible cluster shapes
  • Cons: Slower than K-Means, sensitive to initialization
  • Use when: Need probabilities, overlapping clusters
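
To make the comparison concrete, a rough sketch that fits all four algorithms on the same blob data and reports silhouette scores (the DBSCAN parameters and cluster counts are illustrative, and a model is skipped if it finds fewer than two clusters):

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import numpy as np

models = {
    'K-Means': KMeans(n_clusters=4, random_state=42),
    'DBSCAN': DBSCAN(eps=0.5, min_samples=5),
    'Hierarchical': AgglomerativeClustering(n_clusters=4),
    'GMM': GaussianMixture(n_components=4, random_state=42),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    if len(np.unique(labels)) < 2:
        print(f"{name:>12}: fewer than 2 clusters found")
        continue
    print(f"{name:>12}: silhouette = {silhouette_score(X, labels):.3f}")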

✅ Evaluating Clusters

from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Silhouette Score (higher is better, -1 to 1)
silhouette = silhouette_score(X, labels)
print(f"Silhouette Score: {silhouette:.3f}")

# Davies-Bouldin Index (lower is better, ≥ 0)
db_score = davies_bouldin_score(X, labels)
print(f"Davies-Bouldin Score: {db_score:.3f}")

# Calinski-Harabasz Index (higher is better)
ch_score = calinski_harabasz_score(X, labels)
print(f"Calinski-Harabasz Score: {ch_score:.1f}")

🎯 Real-World Example: Customer Segmentation

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load customer data
df = pd.read_csv('customers.csv')
# Features: age, income, spending_score

# Preprocess
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[['age', 'income', 'spending_score']])

# Find optimal K
inertias = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

# Train final model with the chosen K (4 here, picked from the elbow)
kmeans = KMeans(n_clusters=4, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# Analyze segments
for cluster in range(4):
    segment = df[df['cluster'] == cluster]
    print(f"\nCluster {cluster}:")
    print(f"  Size: {len(segment)}")
    print(f"  Avg Age: {segment['age'].mean():.1f}")
    print(f"  Avg Income: ${segment['income'].mean():.0f}")
    print(f"  Avg Spending: {segment['spending_score'].mean():.1f}")

# Visualize (2D projection)
import matplotlib.pyplot as plt
plt.scatter(df['income'], df['spending_score'], c=df['cluster'], cmap='viridis')
plt.xlabel('Income')
plt.ylabel('Spending Score')
plt.title('Customer Segments')
plt.colorbar(label='Cluster')
plt.show()

🎯 Key Takeaways