What is Cross-Validation?
Cross-validation is a resampling technique that provides a more reliable estimate of model performance by training and testing on different data splits multiple times.
Why Cross-Validation?
- More reliable than a single train-test split (see the sketch after this list)
- Uses all data for both training and testing
- Reduces variance in performance estimates
- Helps detect overfitting
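To see what "more reliable" means in practice, here is a minimal sketch (using the same iris data and random forest as the examples below) that compares a few single train-test splits against one 5-fold CV run; the single-split scores move around with the random seed, while the CV mean comes with a spread you can report.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Single train-test splits: the score depends on which rows land in the test set
for seed in (0, 1, 2):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    print(f"Single split (seed={seed}): {rf.fit(X_tr, y_tr).score(X_te, y_te):.3f}")

# 5-fold CV: every sample is tested exactly once, and we get a mean with a spread
scores = cross_val_score(rf, X, y, cv=5)
print(f"5-fold CV: {scores.mean():.3f} +/- {scores.std():.3f}")
```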
📊 K-Fold Cross-Validation
Divide data into K folds, train on K-1 folds, test on the remaining fold. Repeat K times.
```python
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import numpy as np

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Create model
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"Scores for each fold: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

# Output example:
# Scores: [0.967, 1.000, 0.933, 0.933, 0.933]
# Mean: 0.953 (+/- 0.054)
```
Manual K-Fold
```python
# More control with a manual loop over the folds
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Train
    rf.fit(X_train, y_train)

    # Evaluate
    score = rf.score(X_test, y_test)
    scores.append(score)
    print(f"Fold {fold + 1}: {score:.3f}")

print(f"\nMean: {np.mean(scores):.3f}")
print(f"Std: {np.std(scores):.3f}")
```
🎯 Stratified K-Fold
Maintains the class distribution in each fold, which is essential for imbalanced datasets (a sketch on an imbalanced dataset follows the code below).
```python
from sklearn.model_selection import StratifiedKFold

# Stratified K-Fold preserves class ratios
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Check the class distribution in each fold
for fold, (train_idx, test_idx) in enumerate(skfold.split(X, y)):
    train_labels = y[train_idx]
    test_labels = y[test_idx]
    print(f"Fold {fold + 1}:")
    print(f"  Train: {np.bincount(train_labels)}")
    print(f"  Test:  {np.bincount(test_labels)}")

# Use with cross_val_score
scores = cross_val_score(rf, X, y, cv=skfold, scoring='accuracy')
print(f"\nStratified Mean: {scores.mean():.3f}")
```
⏰ Time Series Split
For time series data - never shuffle! Test set must be in the future.
```python
from sklearn.model_selection import TimeSeriesSplit

# Time series cross-validation: training data always precedes test data
tscv = TimeSeriesSplit(n_splits=5)

# Visualize the splits
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold + 1}:")
    print(f"  Train: samples {train_idx[0]} to {train_idx[-1]}")
    print(f"  Test:  samples {test_idx[0]} to {test_idx[-1]}")

# Output (150 samples, 5 splits):
# Fold 1: Train [0-24], Test [25-49]
# Fold 2: Train [0-49], Test [50-74]
# Fold 3: Train [0-74], Test [75-99]
# etc.

# Use for scoring (iris is not a real time series; this only illustrates the API).
# With a classifier, score with accuracy; for a time-series regressor you would
# typically use scoring='neg_mean_squared_error' and negate the result to get MSE.
scores = cross_val_score(rf, X, y, cv=tscv, scoring='accuracy')
print(f"\nAccuracy per split: {scores}")
```
🎲 Leave-One-Out (LOO)
K = n (number of samples). Each sample is a test set once. Computationally expensive!
```python
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
print(f"Number of folds: {loo.get_n_splits(X)}")  # Equals n_samples

# Use for small datasets
scores = cross_val_score(rf, X, y, cv=loo, scoring='accuracy')
print(f"LOO Accuracy: {scores.mean():.3f}")

# Warning: very slow for large datasets!
# Use only when:
# - the dataset is very small (< 100 samples)
# - you need maximum data for training
# - the computational cost is acceptable
```
📈 Multiple Metrics
```python
from sklearn.model_selection import cross_validate

# Evaluate multiple metrics at once
scoring = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']

results = cross_validate(
    rf, X, y,
    cv=5,
    scoring=scoring,
    return_train_score=True
)

# Print test results
for metric in scoring:
    test_scores = results[f'test_{metric}']
    print(f"{metric}:")
    print(f"  Test: {test_scores.mean():.3f} (+/- {test_scores.std() * 2:.3f})")

# Also check training scores to detect overfitting
print(f"\nTrain accuracy: {results['train_accuracy'].mean():.3f}")
print(f"Test accuracy: {results['test_accuracy'].mean():.3f}")
```
🔧 Nested Cross-Validation
Use nested CV when you combine hyperparameter tuning with model evaluation; tuning and scoring on the same folds gives optimistically biased estimates.
```python
from sklearn.model_selection import GridSearchCV

# Outer loop: model evaluation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Inner loop: hyperparameter tuning
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)

# Parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, None]
}

# Nested CV
nested_scores = []
for train_idx, test_idx in outer_cv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Inner CV: find the best parameters using the training portion only
    grid_search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        cv=inner_cv,
        scoring='accuracy'
    )
    grid_search.fit(X_train, y_train)

    # Outer CV: evaluate the tuned model on the held-out fold
    score = grid_search.score(X_test, y_test)
    nested_scores.append(score)
    print(f"Best params: {grid_search.best_params_}, Score: {score:.3f}")

print(f"\nNested CV Mean: {np.mean(nested_scores):.3f}")
print(f"Nested CV Std: {np.std(nested_scores):.3f}")
```
🎯 Choosing K
| K Value | Train Size | Bias | Variance | Computation |
|---|---|---|---|---|
| K = 2 | 50% | High | Low | Fast |
| K = 5 | 80% | Medium | Medium | Moderate (good balance ⭐) |
| K = 10 | 90% | Low | Medium-High | Slower (common choice) |
| K = n (LOO) | ~100% | Lowest | Highest | Very slow |
Recommendation: K=5 or K=10 for most cases. K=5 is faster and often sufficient.
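To see the trade-off on your own data, a small loop over candidate K values makes the comparison concrete (a sketch reusing rf, X, and y from the examples above):

```python
import time
from sklearn.model_selection import KFold, cross_val_score

# Compare a few K values on the same model and data
for k in (2, 5, 10):
    start = time.perf_counter()
    scores = cross_val_score(rf, X, y,
                             cv=KFold(n_splits=k, shuffle=True, random_state=42))
    elapsed = time.perf_counter() - start
    print(f"K={k:>2}: mean={scores.mean():.3f}, std={scores.std():.3f}, "
          f"time={elapsed:.2f}s")
```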
⚡ Repeated Cross-Validation
Run K-fold multiple times with different random splits for more robust estimates.
```python
from sklearn.model_selection import RepeatedStratifiedKFold

# Repeat stratified 5-fold CV 10 times (50 fits in total)
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)

scores = cross_val_score(rf, X, y, cv=rskf, scoring='accuracy')
print(f"Number of scores (5 folds x 10 repeats): {len(scores)}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")

# Rough 95% range of fold scores (fold scores are correlated, so this is a
# spread of results rather than a formal confidence interval for the mean)
print(f"95% range: [{scores.mean() - 1.96 * scores.std():.3f}, "
      f"{scores.mean() + 1.96 * scores.std():.3f}]")

# When to use:
# - small datasets
# - you need very stable estimates
# - the computational cost is acceptable
```
💡 Best Practices
- Use Stratified K-Fold: For classification, especially imbalanced data
- K = 5 or 10: Good balance of bias-variance and computation
- Shuffle data: Set shuffle=True (except for time series!)
- Set random_state: For reproducibility
- Time series: Use TimeSeriesSplit, never shuffle
- Report mean ± std: Show variability in results
- Nested CV: For hyperparameter tuning + evaluation
- Small datasets: Use higher K or repeated CV
⚠️ Common Mistakes
- Data leakage: Fitting a scaler/imputer on the full dataset before CV

```python
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# WRONG: the scaler is fit on the entire dataset, so information from the
# test folds leaks into the training folds
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
cross_val_score(rf, X_scaled, y, cv=5)  # Leakage!

# CORRECT: put preprocessing inside a Pipeline so it is refit on each
# training fold only
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
cross_val_score(pipe, X, y, cv=5)  # No leakage!
```

- Using LOO for large data: Too slow, use K-fold instead
- Not stratifying: Can get unlucky splits with imbalanced data
- Shuffling time series: Breaks temporal dependency
- Tuning on CV results: Use nested CV or separate validation set
🎯 Key Takeaways
- Cross-validation gives more reliable performance estimates
- K-fold is most common (K=5 or 10)
- Stratified K-fold for classification (maintains class balance)
- TimeSeriesSplit for time series (no shuffling)
- Use Pipeline to avoid data leakage
- Nested CV for hyperparameter tuning + evaluation
- Report mean ± std for transparency