What is Cross-Validation?
Cross-validation is a resampling technique that provides a more reliable estimate of model performance by training and testing on different data splits multiple times.
Why Cross-Validation?
- More reliable than a single train-test split (see the sketch after this list)
- Uses all data for both training and testing
- Reduces variance in performance estimates
- Helps detect overfitting
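To see what "more reliable" means in practice, here is a minimal sketch (using the same iris data and random forest as the examples below) that compares a few single train-test splits against one 5-fold CV run; the single-split scores move around with the random seed, while the CV mean comes with a spread you can report.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Single train-test splits: the score depends on which rows land in the test set
for seed in (0, 1, 2):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    print(f"Single split (seed={seed}): {rf.fit(X_tr, y_tr).score(X_te, y_te):.3f}")

# 5-fold CV: every sample is tested exactly once, and we get a mean with a spread
scores = cross_val_score(rf, X, y, cv=5)
print(f"5-fold CV: {scores.mean():.3f} +/- {scores.std():.3f}")
```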
📊 K-Fold Cross-Validation
Divide data into K folds, train on K-1 folds, test on the remaining fold. Repeat K times.
```python
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import numpy as np

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Create model
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"Scores for each fold: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

# Output example:
# Scores: [0.967, 1.000, 0.933, 0.933, 0.933]
# Mean: 0.953 (+/- 0.054)
```
Manual K-Fold
```python
# More control with a manual loop over the folds
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Train
    rf.fit(X_train, y_train)

    # Evaluate
    score = rf.score(X_test, y_test)
    scores.append(score)
    print(f"Fold {fold + 1}: {score:.3f}")

print(f"\nMean: {np.mean(scores):.3f}")
print(f"Std: {np.std(scores):.3f}")
```
🎯 Stratified K-Fold
Maintains the class distribution in each fold, which is essential for imbalanced datasets (a sketch on an imbalanced dataset follows the code below).
```python
from sklearn.model_selection import StratifiedKFold

# Stratified K-Fold preserves class ratios
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Check the class distribution in each fold
for fold, (train_idx, test_idx) in enumerate(skfold.split(X, y)):
    train_labels = y[train_idx]
    test_labels = y[test_idx]
    print(f"Fold {fold + 1}:")
    print(f"  Train: {np.bincount(train_labels)}")
    print(f"  Test:  {np.bincount(test_labels)}")

# Use with cross_val_score
scores = cross_val_score(rf, X, y, cv=skfold, scoring='accuracy')
print(f"\nStratified Mean: {scores.mean():.3f}")
```
⏰ Time Series Split
For time series data - never shuffle! Test set must be in the future.
```python
from sklearn.model_selection import TimeSeriesSplit

# Time series cross-validation: training data always precedes test data
tscv = TimeSeriesSplit(n_splits=5)

# Visualize the splits
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold + 1}:")
    print(f"  Train: samples {train_idx[0]} to {train_idx[-1]}")
    print(f"  Test:  samples {test_idx[0]} to {test_idx[-1]}")

# Output (150 samples, 5 splits):
# Fold 1: Train [0-24], Test [25-49]
# Fold 2: Train [0-49], Test [50-74]
# Fold 3: Train [0-74], Test [75-99]
# etc.

# Use for scoring (iris is not a real time series; this only illustrates the API).
# With a classifier, score with accuracy; for a time-series regressor you would
# typically use scoring='neg_mean_squared_error' and negate the result to get MSE.
scores = cross_val_score(rf, X, y, cv=tscv, scoring='accuracy')
print(f"\nAccuracy per split: {scores}")
```
🎲 Leave-One-Out (LOO)
K = n (number of samples). Each sample is a test set once. Computationally expensive!
```python
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
print(f"Number of folds: {loo.get_n_splits(X)}")  # Equals n_samples

# Use for small datasets
scores = cross_val_score(rf, X, y, cv=loo, scoring='accuracy')
print(f"LOO Accuracy: {scores.mean():.3f}")

# Warning: very slow for large datasets!
# Use only when:
# - the dataset is very small (< 100 samples)
# - you need maximum data for training
# - the computational cost is acceptable
```
📈 Multiple Metrics
```python
from sklearn.model_selection import cross_validate

# Evaluate multiple metrics at once
scoring = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']

results = cross_validate(
    rf, X, y,
    cv=5,
    scoring=scoring,
    return_train_score=True
)

# Print test results
for metric in scoring:
    test_scores = results[f'test_{metric}']
    print(f"{metric}:")
    print(f"  Test: {test_scores.mean():.3f} (+/- {test_scores.std() * 2:.3f})")

# Also check training scores to detect overfitting
print(f"\nTrain accuracy: {results['train_accuracy'].mean():.3f}")
print(f"Test accuracy: {results['test_accuracy'].mean():.3f}")
```
🔧 Nested Cross-Validation
Use nested CV when you combine hyperparameter tuning with model evaluation; tuning and scoring on the same folds gives optimistically biased estimates.
```python
from sklearn.model_selection import GridSearchCV

# Outer loop: model evaluation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Inner loop: hyperparameter tuning
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)

# Parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, None]
}

# Nested CV
nested_scores = []
for train_idx, test_idx in outer_cv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Inner CV: find the best parameters using the training portion only
    grid_search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        cv=inner_cv,
        scoring='accuracy'
    )
    grid_search.fit(X_train, y_train)

    # Outer CV: evaluate the tuned model on the held-out fold
    score = grid_search.score(X_test, y_test)
    nested_scores.append(score)
    print(f"Best params: {grid_search.best_params_}, Score: {score:.3f}")

print(f"\nNested CV Mean: {np.mean(nested_scores):.3f}")
print(f"Nested CV Std: {np.std(nested_scores):.3f}")
```
🎯 Choosing K
| K Value | Train Size | Bias | Variance | Computation |
|---|---|---|---|---|
| K = 2 | 50% | High | Low | Fast |
| K = 5 | 80% | Medium | Medium | Moderate (good balance ⭐) |
| K = 10 | 90% | Low | Medium-High | Slower (common choice) |
| K = n (LOO) | ~100% | Lowest | Highest | Very slow |
Recommendation: K=5 or K=10 for most cases. K=5 is faster and often sufficient.
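To see the trade-off on your own data, a small loop over candidate K values makes the comparison concrete (a sketch reusing rf, X, and y from the examples above):

```python
import time
from sklearn.model_selection import KFold, cross_val_score

# Compare a few K values on the same model and data
for k in (2, 5, 10):
    start = time.perf_counter()
    scores = cross_val_score(rf, X, y,
                             cv=KFold(n_splits=k, shuffle=True, random_state=42))
    elapsed = time.perf_counter() - start
    print(f"K={k:>2}: mean={scores.mean():.3f}, std={scores.std():.3f}, "
          f"time={elapsed:.2f}s")
```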
⚡ Repeated Cross-Validation
Run K-fold multiple times with different random splits for more robust estimates.
```python
from sklearn.model_selection import RepeatedStratifiedKFold

# Repeat stratified 5-fold CV 10 times (50 fits in total)
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)

scores = cross_val_score(rf, X, y, cv=rskf, scoring='accuracy')
print(f"Number of scores (5 folds x 10 repeats): {len(scores)}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")

# Rough 95% range of fold scores (fold scores are correlated, so this is a
# spread of results rather than a formal confidence interval for the mean)
print(f"95% range: [{scores.mean() - 1.96 * scores.std():.3f}, "
      f"{scores.mean() + 1.96 * scores.std():.3f}]")

# When to use:
# - small datasets
# - you need very stable estimates
# - the computational cost is acceptable
```
💡 Best Practices
- Use Stratified K-Fold: For classification, especially imbalanced data
- K = 5 or 10: Good balance of bias-variance and computation
- Shuffle data: Set shuffle=True (except for time series!)
- Set random_state: For reproducibility
- Time series: Use TimeSeriesSplit, never shuffle
- Report mean ± std: Show variability in results
- Nested CV: For hyperparameter tuning + evaluation
- Small datasets: Use higher K or repeated CV
⚠️ Common Mistakes
- Data leakage: Fitting a scaler/imputer on the full dataset before CV

```python
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# WRONG: the scaler is fit on the entire dataset, so information from the
# test folds leaks into the training folds
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
cross_val_score(rf, X_scaled, y, cv=5)  # Leakage!

# CORRECT: put preprocessing inside a Pipeline so it is refit on each
# training fold only
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
cross_val_score(pipe, X, y, cv=5)  # No leakage!
```

- Using LOO for large data: Too slow, use K-fold instead
- Not stratifying: Can get unlucky splits with imbalanced data
- Shuffling time series: Breaks temporal dependency
- Tuning on CV results: Use nested CV or separate validation set
🎯 Key Takeaways
- Cross-validation gives more reliable performance estimates
- K-fold is most common (K=5 or 10)
- Stratified K-fold for classification (maintains class balance)
- TimeSeriesSplit for time series (no shuffling)
- Use Pipeline to avoid data leakage
- Nested CV for hyperparameter tuning + evaluation
- Report mean ± std for transparency