What is K-Nearest Neighbors?
KNN is a simple, intuitive algorithm that classifies data points based on their K nearest neighbors. It's non-parametric and makes no assumptions about data distribution.
How KNN Works:
- Choose K (number of neighbors)
- Calculate distance to all training points
- Find K nearest neighbors
- Classification: Majority vote
- Regression: Average of neighbors
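As a minimal sketch of these steps (plain NumPy on a tiny hypothetical dataset, Euclidean distance, majority vote; the scikit-learn examples below are what you would use in practice):
import numpy as np
from collections import Counter
def knn_predict(X_train, y_train, x_query, k=3):
    # 1-2. Distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # 3. Indices of the K closest training points
    nearest = np.argsort(distances)[:k]
    # 4. Classification: majority vote (regression would return y_train[nearest].mean())
    return Counter(y_train[nearest]).most_common(1)[0][0]
X_toy = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_toy = np.array([0, 0, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([1.5, 1.5]), k=3))  # -> 0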
🎯 KNN Classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
# Scale features (important for distance-based algorithms!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train KNN
knn = KNeighborsClassifier(
    n_neighbors=5,        # K = 5
    weights='uniform',    # or 'distance'
    algorithm='auto',     # 'ball_tree', 'kd_tree', 'brute'
    metric='minkowski',   # distance metric
    p=2                   # p=2 for Euclidean, p=1 for Manhattan
)
knn.fit(X_train_scaled, y_train)
# Predict
y_pred = knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Get Probabilities
# Probability estimates
probabilities = knn.predict_proba(X_test_scaled)
print("Probabilities for first sample:")
print(probabilities[0])  # e.g. [0.2, 0.6, 0.2] means class 1 is most likely
# Get the nearest training points for the first test sample
distances, indices = knn.kneighbors(X_test_scaled[:1])
print(f"\nNearest {knn.n_neighbors} neighbors:")
print(f"Indices: {indices[0]}")
print(f"Distances: {distances[0]}")
📈 KNN Regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Generate regression data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train KNN Regressor
knn_reg = KNeighborsRegressor(n_neighbors=5, weights='distance')
knn_reg.fit(X_train_scaled, y_train)
# Predict
y_pred = knn_reg.predict(X_test_scaled)
# Evaluate
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}")
print(f"R² Score: {r2:.3f}")
🔢 Choosing K
Finding Optimal K
import matplotlib.pyplot as plt
# Try different K values (reusing the scaled iris split from the classification example)
k_values = range(1, 31)
train_scores = []
test_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    train_scores.append(knn.score(X_train_scaled, y_train))
    test_scores.append(knn.score(X_test_scaled, y_test))
# Plot
plt.figure(figsize=(10, 6))
plt.plot(k_values, train_scores, 'bo-', label='Training Accuracy')
plt.plot(k_values, test_scores, 'ro-', label='Test Accuracy')
plt.xlabel('K (Number of Neighbors)')
plt.ylabel('Accuracy')
plt.title('K vs Accuracy')
plt.legend()
plt.grid(True)
plt.show()
# Best K
best_k = k_values[np.argmax(test_scores)]
print(f"Best K: {best_k}")
print(f"Best Test Accuracy: {max(test_scores):.3f}")
K Selection Guidelines
- Small K (1-3): Sensitive to noise, complex boundaries, risk overfitting
- Large K (20-50): Smoother boundaries, more stable, risk underfitting
- Odd K: Preferred for binary classification (avoids ties)
- Rule of thumb: K = √n (square root of training samples)
- Best practice: Use cross-validation to find optimal K
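For instance, a cross-validated search over K (a minimal sketch that reuses the scaled iris training split from the classification example above):
from sklearn.model_selection import cross_val_score
k_candidates = range(1, 31)
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train_scaled, y_train, cv=5).mean()
             for k in k_candidates]
best_k_cv = k_candidates[int(np.argmax(cv_scores))]
print(f"Best K by 5-fold CV: {best_k_cv} (accuracy {max(cv_scores):.3f})")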
📏 Distance Metrics
# Euclidean Distance (L2, default)
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
# Distance = √[(x₁-x₂)² + (y₁-y₂)²]
# Manhattan Distance (L1)
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
# Distance = |x₁-x₂| + |y₁-y₂|
# Minkowski Distance (generalized)
knn_minkowski = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=3)
# p=1: Manhattan, p=2: Euclidean
# Chebyshev Distance (max difference)
knn_chebyshev = KNeighborsClassifier(n_neighbors=5, metric='chebyshev')
# Cosine distance (useful for text/sparse data; handled via brute-force search)
knn_cosine = KNeighborsClassifier(n_neighbors=5, metric='cosine')
# Compare metrics
for name, knn in [('Euclidean', knn_euclidean),
                  ('Manhattan', knn_manhattan),
                  ('Chebyshev', knn_chebyshev)]:
    knn.fit(X_train_scaled, y_train)
    score = knn.score(X_test_scaled, y_test)
    print(f"{name}: {score:.3f}")
⚖️ Weighting Schemes
# Uniform weights (all neighbors contribute equally)
knn_uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn_uniform.fit(X_train_scaled, y_train)
print(f"Uniform: {knn_uniform.score(X_test_scaled, y_test):.3f}")
# Distance weights (closer neighbors have more influence)
knn_distance = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_distance.fit(X_train_scaled, y_train)
print(f"Distance: {knn_distance.score(X_test_scaled, y_test):.3f}")
# Custom weight function
def custom_weight(distances):
    return 1 / (distances + 1e-5)  # inverse distance
knn_custom = KNeighborsClassifier(n_neighbors=5, weights=custom_weight)
knn_custom.fit(X_train_scaled, y_train)
print(f"Custom: {knn_custom.score(X_test_scaled, y_test):.3f}")
🚀 Optimization Techniques
Algorithm Selection
# Brute Force (exact; O(n) distance computations per query)
knn_brute = KNeighborsClassifier(n_neighbors=5, algorithm='brute')
# Good for: Small datasets, high dimensions
# KD-Tree (roughly O(log n) queries in low dimensions)
knn_kd = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
# Good for: Low dimensions (< 20), medium datasets
# Ball Tree (tree-based search that copes with more dimensions than KD-Tree)
knn_ball = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')
# Good for: Moderate-to-high dimensions, larger datasets
# Auto (chooses best based on data)
knn_auto = KNeighborsClassifier(n_neighbors=5, algorithm='auto')
# Recommended: Let scikit-learn decide
import time
for name, knn in [('Brute', knn_brute), ('KD-Tree', knn_kd),
                  ('Ball Tree', knn_ball), ('Auto', knn_auto)]:
    start = time.time()
    knn.fit(X_train_scaled, y_train)
    knn.predict(X_test_scaled)
    elapsed = time.time() - start
    print(f"{name}: {elapsed:.4f}s")
⚙️ Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# Parameter grid
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11, 15],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski'],
    'p': [1, 2]  # only used when metric='minkowski'
}
# Grid search
knn = KNeighborsClassifier()
grid_search = GridSearchCV(
    knn, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1
)
grid_search.fit(X_train_scaled, y_train)
# Best parameters
print("Best parameters:", grid_search.best_params_)
print(f"Best CV score: {grid_search.best_score_:.3f}")
# Test best model
best_knn = grid_search.best_estimator_
test_score = best_knn.score(X_test_scaled, y_test)
print(f"Test accuracy: {test_score:.3f}")
✅ Advantages & Disadvantages
| Advantages ✅ | Disadvantages ❌ |
|---|---|
| Simple and intuitive | Slow prediction (must search neighbors) |
| No training phase (lazy learning) | High memory usage (stores all data) |
| No assumptions about data | Sensitive to feature scaling |
| Works for classification & regression | Curse of dimensionality (high dimensions) |
| Naturally handles multiclass | Sensitive to irrelevant features |
| Can update the model easily (just add points) | Struggles with imbalanced classes |
⚠️ Important Considerations
1. Feature Scaling is Critical
# Example: Age (20-80) vs Income (20k-200k)
# Income would dominate the distance calculation!
# Without scaling (using the raw, unscaled iris classification split)
knn_no_scale = KNeighborsClassifier()
knn_no_scale.fit(X_train, y_train)
print(f"No scaling: {knn_no_scale.score(X_test, y_test):.3f}")
# With scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X_train_scaled, y_train)
print(f"With scaling: {knn_scaled.score(X_test_scaled, y_test):.3f}")
2. Curse of Dimensionality
KNN performs poorly in high dimensions (>20 features) because distances become meaningless. Use dimensionality reduction first (PCA, feature selection).
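A minimal sketch of that idea on synthetic data (make_classification here is just a stand-in for a real high-dimensional dataset): scale, project onto a few principal components, then run KNN in the reduced space.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
# Hypothetical high-dimensional data: 100 features, only 10 informative
X_hd, y_hd = make_classification(n_samples=1000, n_features=100,
                                 n_informative=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_hd, y_hd, test_size=0.2, random_state=42)
# Scale -> PCA down to 10 components -> KNN on the reduced features
knn_pca = make_pipeline(StandardScaler(), PCA(n_components=10),
                        KNeighborsClassifier(n_neighbors=5))
knn_pca.fit(X_tr, y_tr)
print(f"KNN after PCA: {knn_pca.score(X_te, y_te):.3f}")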
3. Imbalanced Data
# KNeighborsClassifier has no class_weight option and fit() takes no sample weights,
# so handle imbalanced classes by resampling the training set or by giving
# closer neighbors more influence with distance weighting
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn.fit(X_train_scaled, y_train)
💡 Best Practices
- Always scale: Use StandardScaler or MinMaxScaler (see the pipeline sketch after this list)
- Start with K=5: Good default, then tune with cross-validation
- Use distance weights: Often performs better than uniform
- Remove irrelevant features: They add noise to distance calculations
- Consider dimensionality: Use PCA if features > 20
- Algorithm='auto': Let scikit-learn choose optimal search method
- Small datasets: KNN works well with 100-10,000 samples
- Real-time prediction needed? Consider other algorithms (KNN is slow)
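Several of these practices fit together in a Pipeline: scaling happens inside each cross-validation fold, so the scaler never sees the validation data. A minimal sketch, assuming the raw iris split from the classification example above:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
pipe_grid = {'knn__n_neighbors': [3, 5, 7, 9, 11],
             'knn__weights': ['uniform', 'distance']}
search = GridSearchCV(pipe, pipe_grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(X_train, y_train)  # raw (unscaled) iris training split
print("Best params:", search.best_params_)
print(f"Test accuracy: {search.score(X_test, y_test):.3f}")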
🎯 When to Use KNN
✅ Use KNN When:
- Small to medium datasets (< 10,000)
- Low dimensionality (< 20 features)
- Non-linear decision boundaries
- Need simple baseline model
- Want interpretable results
- Training speed is important (fitting is essentially instant)
❌ Avoid KNN When:
- Large datasets (> 100,000)
- High dimensionality (> 50 features)
- Need fast predictions
- Limited memory
- Many irrelevant features
- Highly imbalanced classes
🎯 Key Takeaways
- KNN classifies based on K nearest neighbors
- Lazy learning - no training, all computation at prediction
- Distance-based - feature scaling is critical
- Choose K wisely - use cross-validation
- Distance weights often better than uniform
- Curse of dimensionality - poor in high dimensions
- Simple and interpretable - great for baselines