🎛️ Hyperparameter Tuning

Optimize your model's configuration

What is Hyperparameter Tuning?

Hyperparameter tuning is the process of finding the optimal configuration for your model. Unlike parameters (learned during training), hyperparameters are set before training begins.

Examples of Hyperparameters:

  • Random Forest: n_estimators, max_depth, min_samples_split
  • Neural Networks: learning rate, batch size, number of layers
  • SVM: C, gamma, kernel type
  • KNN: K (number of neighbors), distance metric
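
To make the distinction concrete, here is a minimal sketch (RandomForestClassifier on the iris data, chosen purely for illustration): hyperparameters are fixed when the estimator is constructed, while parameters, such as the fitted trees, are learned by fit().

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters: chosen before training, passed to the constructor
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

# Parameters: learned from the data during fit()
model.fit(X, y)

print(model.get_params()['max_depth'])   # 5 -- unchanged by training
print(len(model.estimators_))            # 100 fitted trees (learned from the data)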

🔍 Grid Search

Try all combinations of hyperparameters in a predefined grid.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Total combinations: 3 × 3 × 3 × 3 = 81

# Create grid search
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,                    # 5-fold cross-validation
    scoring='accuracy',      # Metric to optimize
    n_jobs=-1,              # Use all CPU cores
    verbose=2,              # Print progress
    return_train_score=True
)

# Fit (this trains 81 × 5 = 405 models!)
grid_search.fit(X_train, y_train)

# Best parameters
print("Best parameters:", grid_search.best_params_)
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Test best model
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print(f"Test score: {test_score:.3f}")

Inspect Results

import pandas as pd

# Convert results to DataFrame
results = pd.DataFrame(grid_search.cv_results_)

# View top 5 configurations
print(results[['params', 'mean_test_score', 'rank_test_score']].head())

# Sort by score
results_sorted = results.sort_values('mean_test_score', ascending=False)
print("\nTop 3 configurations:")
for idx, row in results_sorted.head(3).iterrows():
    print(f"{row['params']}: {row['mean_test_score']:.3f}")

🎲 Random Search

Sample random combinations from parameter distributions. Often finds good solutions faster than Grid Search!

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define parameter distributions
param_distributions = {
    'n_estimators': randint(50, 500),           # Random integers
    'max_depth': [5, 10, 15, 20, None],        # Discrete choices
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9),         # Continuous uniform on [0.1, 1.0]
}

# Random search
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=100,              # Number of random combinations to try
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=2,
    random_state=42,
    return_train_score=True
)

random_search.fit(X_train, y_train)

print("Best parameters:", random_search.best_params_)
print(f"Best CV score: {random_search.best_score_:.3f}")

# Random Search is better when:
# - Large search space (many hyperparameters)
# - Limited computational budget
# - Some hyperparameters more important than others

🎯 Grid Search vs Random Search

Aspect              | Grid Search                         | Random Search
--------------------|-------------------------------------|---------------------------------
Search Strategy     | Exhaustive (all combinations)       | Random sampling
Computational Cost  | High (grows exponentially)          | Controllable (set n_iter)
Best For            | Small search space, few params      | Large search space, many params
Guarantee           | Finds the best combination in grid  | May miss the optimum
Efficiency          | Can waste time on bad regions       | Better exploration of the space
Continuous Params   | Must discretize                     | Can sample continuously
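
The last row matters in practice. As a minimal sketch (reusing X_train/y_train from the grid-search example above, with max_features chosen purely for illustration), a grid has to pin a continuous hyperparameter to a few fixed values, while random search can draw it from a distribution:

from scipy.stats import uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Grid search: the continuous hyperparameter must be discretized up front
grid_version = {'max_features': [0.3, 0.5, 0.7, 0.9]}

# Random search: sample max_features anywhere in [0.1, 1.0]
random_version = {'max_features': uniform(0.1, 0.9)}

grid_cv = GridSearchCV(RandomForestClassifier(random_state=42), grid_version, cv=5)
random_cv = RandomizedSearchCV(RandomForestClassifier(random_state=42), random_version,
                               n_iter=20, cv=5, random_state=42)
grid_cv.fit(X_train, y_train)
random_cv.fit(X_train, y_train)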

🚀 Bayesian Optimization

Bayesian optimization builds a probabilistic model and intelligently selects which hyperparameters to try next. More efficient than random search!

# Install: pip install scikit-optimize
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical

# Define search space
search_space = {
    'n_estimators': Integer(50, 500),
    'max_depth': Integer(5, 50),
    'min_samples_split': Integer(2, 20),
    'min_samples_leaf': Integer(1, 10),
    'max_features': Real(0.1, 1.0)
}

# Bayesian optimization
bayes_search = BayesSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    search_spaces=search_space,
    n_iter=50,               # Number of parameter settings sampled
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=2,
    random_state=42
)

bayes_search.fit(X_train, y_train)

print("Best parameters:", bayes_search.best_params_)
print(f"Best CV score: {bayes_search.best_score_:.3f}")

# Advantages:
# - Learns from previous evaluations
# - Focuses on promising regions
# - More efficient than random search
# - Good for expensive-to-evaluate models

⚡ Halving Grid/Random Search

Start by evaluating many configurations on a small amount of data, then successively eliminate the weak performers and give the survivors more data.

from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV, HalvingRandomSearchCV

# Halving Grid Search
halving_grid = HalvingGridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    factor=3,                # Reduce candidates by this factor each iteration
    resource='n_samples',    # Resource increased each iteration ('n_estimators' also works for forests)
    max_resources='auto',    # Max resource to use
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=2,
    random_state=42
)

halving_grid.fit(X_train, y_train)

print("Best parameters:", halving_grid.best_params_)
print(f"Best score: {halving_grid.best_score_:.3f}")

# How it works (with factor=3):
# Iteration 1: evaluate all candidates on a small subset of the training samples
# Iteration 2: keep the best 1/3 of candidates, triple the number of samples
# Iteration 3: keep the best 1/3 again, until the survivors use the full budget

# Much faster than regular GridSearchCV!
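
After fitting, the search object records the schedule it actually used (n_candidates_ and n_resources_ per iteration), which is a quick way to see the elimination in action:

# Inspect the successive-halving schedule
for it, (n_cand, n_res) in enumerate(zip(halving_grid.n_candidates_,
                                         halving_grid.n_resources_)):
    print(f"Iteration {it}: {n_cand} candidates, {n_res} samples each")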

📊 Multiple Metrics

# Optimize for one metric, but track others
from sklearn.metrics import make_scorer, f1_score

scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision_macro',
    'recall': 'recall_macro',
    'f1': 'f1_macro'
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring=scoring,
    refit='f1',              # Optimize for F1, but track all metrics
    n_jobs=-1,
    return_train_score=True
)

grid_search.fit(X_train, y_train)

# Best model selected based on F1
print(f"Best F1: {grid_search.best_score_:.3f}")

# But can see other metrics too
results = pd.DataFrame(grid_search.cv_results_)
print("\nFor best params:")
best_idx = grid_search.best_index_
for metric in ['accuracy', 'precision', 'recall', 'f1']:
    score = results.loc[best_idx, f'mean_test_{metric}']
    print(f"{metric}: {score:.3f}")

🎛️ Tuning Different Models

Random Forest

param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False]
}
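
Note that this grid is already large. A quick count (sketch below) shows why grids like this are usually handed to RandomizedSearchCV or a halving search rather than an exhaustive GridSearchCV:

# Count the candidates an exhaustive grid search would evaluate
n_candidates = 1
for values in param_grid_rf.values():
    n_candidates *= len(values)
print(n_candidates)       # 3 × 4 × 3 × 3 × 3 × 2 = 648 candidates
print(n_candidates * 5)   # 3,240 model fits with 5-fold cross-validation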

Gradient Boosting (XGBoost)

param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.3],
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}
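
A minimal usage sketch, assuming the separate xgboost package is installed (pip install xgboost); XGBClassifier follows the scikit-learn estimator API, so it plugs straight into RandomizedSearchCV:

# Assumes: pip install xgboost
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

xgb_search = RandomizedSearchCV(
    XGBClassifier(random_state=42),
    param_distributions=param_grid_xgb,
    n_iter=30, cv=5, scoring='accuracy', n_jobs=-1, random_state=42
)
xgb_search.fit(X_train, y_train)
print("Best parameters:", xgb_search.best_params_)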

SVM

param_grid_svm = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.001, 0.01, 0.1],
    'kernel': ['rbf', 'poly', 'sigmoid']
}
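
SVMs are sensitive to feature scale, so in practice you would normally tune them inside a Pipeline with a scaler; the parameter names then take the step-name prefix. A sketch reusing param_grid_svm from above:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svm_pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# Prefix each key with the pipeline step name: 'C' -> 'svc__C', etc.
param_grid_svm_pipe = {f'svc__{key}': values for key, values in param_grid_svm.items()}

svm_search = GridSearchCV(svm_pipe, param_grid_svm_pipe, cv=5, n_jobs=-1)
svm_search.fit(X_train, y_train)
print("Best parameters:", svm_search.best_params_)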

Neural Network

from sklearn.neural_network import MLPClassifier

param_grid_nn = {
    'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50)],
    'activation': ['relu', 'tanh'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.001, 0.01, 0.1],
    'batch_size': [32, 64, 128]
}

💡 Best Practices

⏱️ Efficient Tuning Strategy

  1. Start with defaults: Baseline performance
  2. Random search (wide): 50-100 iterations, broad ranges
  3. Analyze results: Which parameters matter most?
  4. Grid search (narrow): Fine-tune around best values
  5. Validate on test set: Final performance check

# Example workflow
# Step 1: Baseline
rf_default = RandomForestClassifier()
rf_default.fit(X_train, y_train)
baseline = rf_default.score(X_test, y_test)
print(f"Baseline: {baseline:.3f}")

# Step 2: Random search (broad)
random_search = RandomizedSearchCV(...)
random_search.fit(X_train, y_train)

# Step 3: Analyze and refine
best_params = random_search.best_params_
print(f"Random search best: {random_search.best_score_:.3f}")

# Step 4: Grid search (narrow)
param_grid_refined = {
    'n_estimators': [best_params['n_estimators'] - 50,
                     best_params['n_estimators'],
                     best_params['n_estimators'] + 50],
    # ... refine other parameters
}
grid_search = GridSearchCV(...)
grid_search.fit(X_train, y_train)

# Step 5: Final test
final_score = grid_search.best_estimator_.score(X_test, y_test)
print(f"Final test: {final_score:.3f}")

⚠️ Common Pitfalls

  • Tuning on the test set: use the test set only once, for the final check; otherwise its score stops being an honest estimate
  • Overfitting the CV score: trying huge numbers of configurations can overfit the validation folds; nested cross-validation gives an unbiased estimate
  • Data leakage in preprocessing: fit scalers and encoders inside a Pipeline so they are refit on each training fold
  • Ignoring compute cost: grid size grows exponentially with each added hyperparameter; prefer random or halving search for large spaces
  • Forgetting random_state: without it, results are not reproducible
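
As a sketch of the nested-CV point above (reusing the iris data and RandomForestClassifier from earlier): the inner search tunes hyperparameters, the outer loop estimates performance on data the search never saw.

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Inner loop: hyperparameter search; outer loop: unbiased performance estimate
inner_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [100, 200], 'max_depth': [5, None]},
    cv=3
)
nested_scores = cross_val_score(inner_search, iris.data, iris.target, cv=5)
print(f"Nested CV accuracy: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")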

💾 Saving and Loading

import joblib

# Save best model
joblib.dump(grid_search.best_estimator_, 'best_model.pkl')

# Save entire grid search object
joblib.dump(grid_search, 'grid_search_results.pkl')

# Load later
loaded_model = joblib.load('best_model.pkl')
loaded_grid = joblib.load('grid_search_results.pkl')

# Use the loaded model on new data (X_new: samples with the same feature layout as the training data)
predictions = loaded_model.predict(X_new)

🎯 Key Takeaways