🎯 K-Nearest Neighbors

Simple yet powerful instance-based learning

What is K-Nearest Neighbors?

KNN is a simple, intuitive algorithm that classifies data points based on their K nearest neighbors. It's non-parametric and makes no assumptions about data distribution.

How KNN Works:

  1. Choose K (number of neighbors)
  2. Calculate distance to all training points
  3. Find K nearest neighbors
  4. Classification: take a majority vote of the neighbors' labels
  5. Regression: average the neighbors' target values (see the from-scratch sketch below)
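
Before reaching for scikit-learn, the steps above fit in a few lines of NumPy. This is a minimal illustrative sketch, not the library implementation; knn_predict and the toy arrays X_demo and y_demo are made up for the example.

import numpy as np
from collections import Counter

def knn_predict(X_train_pts, y_train_labels, x_query, k=3):
    # Steps 1-2: Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train_pts - x_query) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 4: classification by majority vote over the neighbors' labels
    return Counter(y_train_labels[nearest]).most_common(1)[0][0]

# Toy data (hypothetical)
X_demo = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_demo = np.array([0, 0, 1, 1])
print(knn_predict(X_demo, y_demo, np.array([1.1, 0.9]), k=3))  # -> 0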

🎯 KNN Classification

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Scale features (important for distance-based algorithms!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN
knn = KNeighborsClassifier(
    n_neighbors=5,        # K = 5
    weights='uniform',    # or 'distance'
    algorithm='auto',     # 'ball_tree', 'kd_tree', 'brute'
    metric='minkowski',   # distance metric
    p=2                   # p=2 for Euclidean, p=1 for Manhattan
)

knn.fit(X_train_scaled, y_train)

# Predict
y_pred = knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Get Probabilities

# Probability estimates
probabilities = knn.predict_proba(X_test_scaled)
print("Probabilities for first sample:")
print(probabilities[0])  # e.g. [0.2, 0.6, 0.2] → class 1 most likely

# Get nearest neighbors
distances, indices = knn.kneighbors(X_test_scaled[:1])
print(f"\nNearest {knn.n_neighbors} neighbors:")
print(f"Indices: {indices[0]}")
print(f"Distances: {distances[0]}")

📈 KNN Regression

from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Generate regression data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN Regressor
knn_reg = KNeighborsRegressor(n_neighbors=5, weights='distance')
knn_reg.fit(X_train_scaled, y_train)

# Predict
y_pred = knn_reg.predict(X_test_scaled)

# Evaluate
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}")
print(f"R² Score: {r2:.3f}")

🔢 Choosing K

Finding Optimal K

import matplotlib.pyplot as plt

# Try different K values (reusing the scaled iris split from the classification example)
k_values = range(1, 31)
train_scores = []
test_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    
    train_scores.append(knn.score(X_train_scaled, y_train))
    test_scores.append(knn.score(X_test_scaled, y_test))

# Plot
plt.figure(figsize=(10, 6))
plt.plot(k_values, train_scores, 'bo-', label='Training Accuracy')
plt.plot(k_values, test_scores, 'ro-', label='Test Accuracy')
plt.xlabel('K (Number of Neighbors)')
plt.ylabel('Accuracy')
plt.title('K vs Accuracy')
plt.legend()
plt.grid(True)
plt.show()

# Best K
best_k = k_values[np.argmax(test_scores)]
print(f"Best K: {best_k}")
print(f"Best Test Accuracy: {max(test_scores):.3f}")

K Selection Guidelines

  • Small K (e.g. 1-3): flexible decision boundary, but sensitive to noise (low bias, high variance)
  • Large K: smoother boundary, but can wash out real class borders (high bias, low variance)
  • Prefer an odd K for binary classification to avoid tied votes
  • A common starting point is K ≈ √n; then tune with cross-validation (see the sketch below)
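
Selecting K by test-set accuracy, as in the plot above, lets the test set influence the choice of K. A more robust alternative is to cross-validate over K on the training data only. A brief sketch, assuming the scaled iris training split (X_train_scaled, y_train) from the classification example is still in scope:

from sklearn.model_selection import cross_val_score
import numpy as np

k_values = range(1, 31)
cv_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k),
                    X_train_scaled, y_train, cv=5).mean()
    for k in k_values
]

best_k = k_values[int(np.argmax(cv_scores))]
print(f"Best K by 5-fold CV: {best_k} (mean accuracy: {max(cv_scores):.3f})")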

📏 Distance Metrics

# Euclidean Distance (L2, default)
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
# Distance = √[(x₁-x₂)² + (y₁-y₂)²]

# Manhattan Distance (L1)
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
# Distance = |x₁-x₂| + |y₁-y₂|

# Minkowski Distance (generalized)
knn_minkowski = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=3)
# p sets the norm: p=1 is Manhattan, p=2 is Euclidean; p=3 here uses the L3 norm

# Chebyshev Distance (max difference)
knn_chebyshev = KNeighborsClassifier(n_neighbors=5, metric='chebyshev')

# Cosine distance (useful for text/sparse data; not supported by the tree algorithms, so brute-force search is used)
knn_cosine = KNeighborsClassifier(n_neighbors=5, metric='cosine')

# Compare metrics
for name, knn in [('Euclidean', knn_euclidean), 
                   ('Manhattan', knn_manhattan),
                   ('Chebyshev', knn_chebyshev)]:
    knn.fit(X_train_scaled, y_train)
    score = knn.score(X_test_scaled, y_test)
    print(f"{name}: {score:.3f}")

⚖️ Weighting Schemes

# Uniform weights (all neighbors contribute equally)
knn_uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn_uniform.fit(X_train_scaled, y_train)
print(f"Uniform: {knn_uniform.score(X_test_scaled, y_test):.3f}")

# Distance weights (closer neighbors have more influence)
knn_distance = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_distance.fit(X_train_scaled, y_train)
print(f"Distance: {knn_distance.score(X_test_scaled, y_test):.3f}")

# Custom weight function
def custom_weight(distances):
    return 1 / (distances + 1e-5)  # Inverse distance

knn_custom = KNeighborsClassifier(n_neighbors=5, weights=custom_weight)
knn_custom.fit(X_train_scaled, y_train)
print(f"Custom: {knn_custom.score(X_test_scaled, y_test):.3f}")

🚀 Optimization Techniques

Algorithm Selection

# Brute Force (exact, O(n) per query)
knn_brute = KNeighborsClassifier(n_neighbors=5, algorithm='brute')
# Good for: small datasets, high dimensions (where tree structures stop helping)

# KD-Tree (roughly O(log n) per query in low dimensions)
knn_kd = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
# Good for: low dimensions (< 20), medium-sized datasets

# Ball Tree (roughly O(log n) per query)
knn_ball = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')
# Good for: moderately high dimensions, larger datasets

# Auto (chooses best based on data)
knn_auto = KNeighborsClassifier(n_neighbors=5, algorithm='auto')
# Recommended: Let scikit-learn decide

import time

for name, knn in [('Brute', knn_brute), ('KD-Tree', knn_kd), 
                   ('Ball Tree', knn_ball), ('Auto', knn_auto)]:
    start = time.time()
    knn.fit(X_train_scaled, y_train)
    knn.predict(X_test_scaled)
    elapsed = time.time() - start
    print(f"{name}: {elapsed:.4f}s")

⚙️ Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

# Parameter grid
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11, 15],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski'],
    'p': [1, 2]  # only used when metric='minkowski' (p=1 Manhattan, p=2 Euclidean)
}

# Grid search
knn = KNeighborsClassifier()
grid_search = GridSearchCV(
    knn, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1
)
grid_search.fit(X_train_scaled, y_train)

# Best parameters
print("Best parameters:", grid_search.best_params_)
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Test best model
best_knn = grid_search.best_estimator_
test_score = best_knn.score(X_test_scaled, y_test)
print(f"Test accuracy: {test_score:.3f}")

✅ Advantages & Disadvantages

Advantages ✅

  • Simple and intuitive
  • No training phase (lazy learning)
  • Makes no assumptions about the data
  • Works for classification and regression
  • Naturally handles multiclass problems
  • Easy to update (just add new points)

Disadvantages ❌

  • Slow prediction (must search for neighbors)
  • High memory usage (stores all training data)
  • Sensitive to feature scaling
  • Suffers from the curse of dimensionality
  • Sensitive to irrelevant features
  • Struggles with imbalanced data

⚠️ Important Considerations

1. Feature Scaling is Critical

# Example: Age (20-80) vs Income (20k-200k)
# Income dominates distance calculation!

# Without scaling (reusing the unscaled iris split from the classification example)
knn_no_scale = KNeighborsClassifier()
knn_no_scale.fit(X_train, y_train)
print(f"No scaling: {knn_no_scale.score(X_test, y_test):.3f}")

# With scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X_train_scaled, y_train)
print(f"With scaling: {knn_scaled.score(X_test_scaled, y_test):.3f}")

2. Curse of Dimensionality

KNN tends to perform poorly in high dimensions (roughly > 20 features) because pairwise distances become nearly uniform, so "nearest" loses its meaning. Reduce dimensionality first (PCA, feature selection), as sketched below.
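
A minimal sketch of that mitigation: the 100-feature dataset from make_classification and the names knn_pca, X_hd, y_hd are made up purely for illustration.

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Hypothetical high-dimensional data
X_hd, y_hd = make_classification(n_samples=1000, n_features=100,
                                 n_informative=10, random_state=42)

knn_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),      # project to 10 components before KNN
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

print(f"CV accuracy: {cross_val_score(knn_pca, X_hd, y_hd, cv=5).mean():.3f}")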

3. Imbalanced Data

# KNeighborsClassifier.fit() does not accept sample weights, so class imbalance
# has to be handled another way: distance weighting or resampling the training set.

# Distance weighting: closer neighbors count more, which softens the pull
# of an over-represented majority class
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn.fit(X_train_scaled, y_train)
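
If distance weighting is not enough, the training set can be resampled so the classes are balanced before fitting. A rough sketch with sklearn.utils.resample; it assumes a binary problem where class 1 is the minority and reuses the scaled training arrays from above.

import numpy as np
from sklearn.utils import resample

# Split the training data by class (assumes class 1 is the minority)
minority_mask = (y_train == 1)
X_min, y_min = X_train_scaled[minority_mask], y_train[minority_mask]
X_maj, y_maj = X_train_scaled[~minority_mask], y_train[~minority_mask]

# Oversample the minority class with replacement up to the majority class size
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=42)

X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.concatenate([y_maj, y_min_up])

knn_balanced = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_balanced.fit(X_balanced, y_balanced)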

💡 Best Practices

🎯 When to Use KNN

✅ Use KNN When:

  • Small to medium datasets (< 10,000 samples)
  • Low dimensionality (< 20 features)
  • Non-linear decision boundaries
  • Need simple baseline model
  • Want interpretable results
  • Training speed matters (fitting only stores the data)

❌ Avoid KNN When:

  • Large datasets (> 100,000 samples)
  • High dimensionality (> 50 features)
  • Need fast predictions
  • Limited memory
  • Many irrelevant features
  • Highly imbalanced classes

🎯 Key Takeaways