🌳 Decision Trees & Random Forests

Powerful tree-based models for classification and regression

What are Decision Trees?

Decision Trees are intuitive models that make decisions by asking a series of questions. They're easy to interpret and work well for both classification and regression tasks.

Key Advantages:

  • Easy to understand and visualize
  • No feature scaling needed
  • Handles both numerical and categorical data (scikit-learn's implementation requires categorical features to be numerically encoded)
  • Captures non-linear relationships
  • Feature importance built-in

🌲 Decision Tree Basics

How It Works

A decision tree splits data based on features to create homogeneous groups:

  1. Start with all data at root node
  2. Find best feature and split point
  3. Divide data into child nodes
  4. Repeat recursively until a stopping criterion is met (e.g., maximum depth or minimum samples)
  5. Assign predictions at leaf nodes
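
To make steps 2 and 3 concrete, here is a minimal, illustrative sketch of the greedy split search on a single numeric feature using Gini impurity. The gini and best_split helpers are hypothetical, not part of scikit-learn:

import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(feature, labels):
    """Try every observed value as a threshold and return the one
    with the lowest weighted Gini impurity of the two children."""
    best_threshold, best_score = None, float("inf")
    for threshold in np.unique(feature):
        left = labels[feature <= threshold]
        right = labels[feature > threshold]
        if len(left) == 0 or len(right) == 0:
            continue  # skip splits that leave one side empty
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Toy data: the feature separates the classes around 4-6
feature = np.array([2.0, 3.0, 4.0, 6.0, 7.0, 8.0])
labels = np.array([0, 0, 0, 1, 1, 1])
threshold, score = best_split(feature, labels)
print(f"Best threshold: {threshold}, weighted Gini: {score:.3f}")  # perfect split -> 0.000

A real tree repeats this search over every feature, picks the best overall split, and then recurses on each child.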

Classification Tree Example

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train decision tree
clf = DecisionTreeClassifier(
    max_depth=3,           # Limit tree depth
    min_samples_split=5,   # Min samples to split node
    min_samples_leaf=2,    # Min samples in leaf
    random_state=42
)

clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")

Regression Tree Example

from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes

# Regression needs a continuous target, so load a regression dataset
X_reg, y_reg = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Train regression tree
reg = DecisionTreeRegressor(max_depth=5, random_state=42)
reg.fit(Xr_train, yr_train)

# Predict continuous values
predictions = reg.predict(Xr_test)

# Feature importance
importances = reg.feature_importances_
for i, importance in enumerate(importances):
    print(f"Feature {i}: {importance:.3f}")

📊 Splitting Criteria

Classification: Gini Impurity

# Gini = 1 - Σ(p_i²)
# where p_i is probability of class i

# Example: Node with 40 class A, 60 class B
# Gini = 1 - (0.4² + 0.6²) = 1 - 0.52 = 0.48

# Pure node (all one class): Gini = 0
# Maximum impurity (50-50): Gini = 0.5
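
As a quick check of the numbers above, a small illustrative helper (not part of scikit-learn) that computes Gini impurity from raw class counts:

import numpy as np

def gini_impurity(class_counts):
    """Gini impurity from raw class counts."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(f"{gini_impurity([40, 60]):.2f}")   # 0.48 (the 40/60 node above)
print(f"{gini_impurity([100, 0]):.2f}")   # 0.00 (pure node)
print(f"{gini_impurity([50, 50]):.2f}")   # 0.50 (maximum impurity for two classes)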

Classification: Entropy (Information Gain)

# Entropy = -Σ(p_i * log2(p_i))

clf = DecisionTreeClassifier(criterion='entropy')  # Use entropy instead of gini
clf.fit(X_train, y_train)
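
For comparison with the Gini example, an illustrative helper (again not scikit-learn code) computing entropy for the same 40/60 node:

import numpy as np

def entropy(class_counts):
    """Shannon entropy (in bits) from raw class counts."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                       # 0 * log2(0) is treated as 0
    return -np.sum(p * np.log2(p))

print(f"{entropy([40, 60]):.3f}")      # ~0.971 bits for the 40/60 node
print(f"{entropy([50, 50]):.3f}")      # 1.000 bits (maximum for two classes)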

Regression: Mean Squared Error

# MSE = (1/n) * Σ(y_i - ȳ)², where ȳ is the mean target in the node
# The tree chooses the split that minimizes the weighted MSE of the child nodes

reg = DecisionTreeRegressor(criterion='squared_error')  # Default criterion
reg.fit(Xr_train, yr_train)  # regression split from the example above
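
To see the criterion in action, a tiny illustrative calculation (plain NumPy, not scikit-learn internals) comparing node MSE before and after a candidate split:

import numpy as np

# Targets in a node: two clearly separated groups
y_node = np.array([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])

def node_mse(y):
    """MSE of a node when it predicts its own mean."""
    return np.mean((y - y.mean()) ** 2)

# Candidate split: first three samples go left, last three go right
left, right = y_node[:3], y_node[3:]
weighted = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(y_node)

print(f"MSE before split: {node_mse(y_node):.3f}")
print(f"Weighted MSE after split: {weighted:.3f}")  # much lower -> a good split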

🎨 Visualizing Decision Trees

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Visualize tree
plt.figure(figsize=(20, 10))
plot_tree(clf, 
          feature_names=['sepal length', 'sepal width', 'petal length', 'petal width'],
          class_names=['setosa', 'versicolor', 'virginica'],
          filled=True,
          rounded=True)
plt.show()

# Export to text
from sklearn.tree import export_text
tree_rules = export_text(clf, feature_names=['sepal length', 'sepal width', 'petal length', 'petal width'])
print(tree_rules)

⚠️ Overfitting Problem

Deep, unconstrained trees keep splitting until they effectively memorize the training data, so they score near-perfectly in training but generalize poorly to new data.
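
A quick way to see this is to compare training and test accuracy for an unconstrained tree versus a shallow one. A small sketch on noisy synthetic data (dataset and variable names here are purely illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data where a fully grown tree tends to overfit
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, flip_y=0.2, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

for depth in [None, 3]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(Xd_train, yd_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(Xd_train, yd_train):.3f}, "
          f"test={tree.score(Xd_test, yd_test):.3f}")
# A large train/test gap for the unconstrained tree signals overfitting.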

Pruning Techniques

# Pre-pruning (stop tree growth early)
clf = DecisionTreeClassifier(
    max_depth=5,              # Maximum tree depth
    min_samples_split=10,     # Min samples required to split
    min_samples_leaf=5,       # Min samples in leaf node
    max_leaf_nodes=20,        # Limit number of leaves
    min_impurity_decrease=0.01  # Min impurity decrease to split
)

# Post-pruning (cost complexity pruning)
# Compute the pruning path on a fresh, unconstrained tree
clf = DecisionTreeClassifier(random_state=42)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Train trees with different alpha values
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)

# Select best alpha based on validation score
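
One way to finish that last step is cross-validation on the training set; a minimal sketch reusing X_train/y_train from above (pruned_clf is just an illustrative name):

import numpy as np
from sklearn.model_selection import cross_val_score

# Cross-validated accuracy for each candidate alpha
cv_scores = [
    cross_val_score(DecisionTreeClassifier(random_state=42, ccp_alpha=a),
                    X_train, y_train, cv=5).mean()
    for a in ccp_alphas
]
best_alpha = ccp_alphas[int(np.argmax(cv_scores))]
print(f"Best ccp_alpha: {best_alpha:.4f} (CV accuracy {max(cv_scores):.3f})")

# Refit the pruned tree with the selected alpha
pruned_clf = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
pruned_clf.fit(X_train, y_train)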

🌲🌲🌲 Random Forests

Random Forest combines multiple decision trees to reduce overfitting and improve accuracy.

How Random Forests Work

  1. Bootstrap: Draw N random samples of the training data (with replacement)
  2. Random features: At each split, only a random subset of features is considered
  3. Train: Build a full decision tree on each bootstrap sample
  4. Aggregate: Majority vote (classification) or average (regression)
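
These four steps can be sketched by hand with plain scikit-learn trees. The snippet below reuses X_train/X_test and accuracy_score from the iris example and is only an illustration, not how RandomForestClassifier is implemented internally:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n_trees = 25
trees = []

# Steps 1-3: bootstrap samples + one tree per sample (max_features adds per-split randomness)
for i in range(n_trees):
    idx = rng.integers(0, len(X_train), size=len(X_train))   # sample with replacement
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 4: aggregate by majority vote across trees
all_preds = np.array([t.predict(X_test) for t in trees])      # shape (n_trees, n_samples)
majority_vote = np.array([np.bincount(col).argmax() for col in all_preds.T])

print(f"Manual-bagging accuracy: {accuracy_score(y_test, majority_vote):.3f}")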

Classification with Random Forest

from sklearn.ensemble import RandomForestClassifier

# Train random forest
rf = RandomForestClassifier(
    n_estimators=100,        # Number of trees
    max_depth=10,            # Max depth per tree
    min_samples_split=5,
    min_samples_leaf=2,
    max_features='sqrt',     # Sqrt(n_features) at each split
    bootstrap=True,          # Use bootstrap samples
    n_jobs=-1,              # Use all CPU cores
    random_state=42
)

rf.fit(X_train, y_train)

# Predictions
y_pred = rf.predict(X_test)
y_proba = rf.predict_proba(X_test)  # Probability estimates

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")

Regression with Random Forest

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Reuse the regression split (Xr_train, yr_train) from the regression tree example
rf_reg = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    min_samples_split=10,
    random_state=42,
    n_jobs=-1
)

rf_reg.fit(Xr_train, yr_train)
predictions = rf_reg.predict(Xr_test)

print(f"RMSE: {np.sqrt(mean_squared_error(yr_test, predictions)):.3f}")
print(f"R² Score: {r2_score(yr_test, predictions):.3f}")

🎯 Feature Importance

import pandas as pd
import matplotlib.pyplot as plt

# Get feature importance (rf was trained on the iris data above)
importances = rf.feature_importances_
feature_names = ['sepal length', 'sepal width', 'petal length', 'petal width']

# Create dataframe
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values('importance', ascending=False)

# Plot
plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'], importance_df['importance'])
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.show()

# Top features
print("Top 5 features:")
print(importance_df.head())

⚙️ Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

# Grid search
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    rf, 
    param_grid, 
    cv=5,              # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,
    verbose=2
)

grid_search.fit(X_train, y_train)

# Best parameters
print("Best parameters:")
print(grid_search.best_params_)
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Use best model
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)

📊 Out-of-Bag (OOB) Evaluation

# Random Forest has built-in validation using OOB samples
rf = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,  # Enable OOB evaluation
    random_state=42
)

rf.fit(X_train, y_train)

print(f"OOB Score: {rf.oob_score_:.3f}")
# The OOB score approximates test accuracy without holding out a separate validation set

🆚 Decision Tree vs Random Forest

Aspect             | Decision Tree           | Random Forest
-------------------|-------------------------|--------------------------
Overfitting        | High (without pruning)  | Low (ensemble averaging)
Training Time      | Fast                    | Slower (multiple trees)
Prediction Time    | Fast                    | Moderate
Interpretability   | High (visualizable)     | Low (black box)
Accuracy           | Good                    | Excellent
Feature Importance | Yes                     | Yes (more robust)

💡 Best Practices

🎯 Key Takeaways