What are Decision Trees?
Decision Trees are intuitive models that make decisions by asking a series of questions. They're easy to interpret and work well for both classification and regression tasks.
Key Advantages:
- Easy to understand and visualize
- No feature scaling needed
- Handles both numerical and categorical data (note: scikit-learn requires categorical features to be encoded numerically)
- Captures non-linear relationships
- Feature importance built-in
🌲 Decision Tree Basics
How It Works
A decision tree splits data based on features to create homogeneous groups:
- Start with all the data at the root node
- Find the best feature and split threshold (a toy version of this search is sketched after this list)
- Divide the data into child nodes
- Repeat recursively until a stopping criterion is met
- Assign predictions at the leaf nodes
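A toy version of the split search (illustrative only, not scikit-learn's actual implementation): for each feature it scans candidate thresholds and keeps the split with the lowest weighted Gini impurity. The gini and best_split names exist only for this sketch.
import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Greedy search: try every feature and threshold, keep the split
    # with the lowest weighted Gini impurity of the two children
    best_feature, best_threshold, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_feature, best_threshold, best_score = j, t, score
    return best_feature, best_threshold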
Classification Tree Example
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train decision tree
clf = DecisionTreeClassifier(
    max_depth=3,           # Limit tree depth
    min_samples_split=5,   # Min samples to split node
    min_samples_leaf=2,    # Min samples in leaf
    random_state=42
)
clf.fit(X_train, y_train)
# Predict
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
Regression Tree Example
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes

# Load a dataset with continuous targets (the iris split above is a classification task)
X_reg, y_reg = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Train regression tree
reg = DecisionTreeRegressor(max_depth=5, random_state=42)
reg.fit(Xr_train, yr_train)

# Predict continuous values
predictions = reg.predict(Xr_test)

# Feature importance
importances = reg.feature_importances_
for i, importance in enumerate(importances):
    print(f"Feature {i}: {importance:.3f}")
📊 Splitting Criteria
Classification: Gini Impurity
# Gini = 1 - Σ(p_i²)
# where p_i is probability of class i
# Example: Node with 40 class A, 60 class B
# Gini = 1 - (0.4² + 0.6²) = 1 - 0.52 = 0.48
# Pure node (all one class): Gini = 0
# Maximum impurity for two classes (50-50 split): Gini = 0.5
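A quick numeric check of the 40/60 example above:
p_a, p_b = 40 / 100, 60 / 100
gini_node = 1 - (p_a ** 2 + p_b ** 2)
print(gini_node)  # 0.48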
Classification: Entropy (Information Gain)
# Entropy = -Σ(p_i * log2(p_i))
clf = DecisionTreeClassifier(criterion='entropy') # Use entropy instead of gini
clf.fit(X_train, y_train)
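Information gain is the drop in entropy from a split: parent entropy minus the weighted average entropy of the children. A small sketch (the helper names are illustrative, not part of scikit-learn's API):
import numpy as np

def entropy(counts):
    # Entropy (in bits) of a node given its class counts
    p = np.asarray(counts) / np.sum(counts)
    return -np.sum(p * np.log2(p))

def information_gain(parent_counts, children_counts):
    # Parent entropy minus the weighted average entropy of the child nodes
    n = np.sum(parent_counts)
    weighted = sum(np.sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

print(entropy([40, 60]))                                  # ≈ 0.971 bits
print(information_gain([40, 60], [[30, 10], [10, 50]]))   # gain of a hypothetical split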
Regression: Mean Squared Error
# MSE = (1/n) * Σ(y_i - ȳ)², where ȳ is the mean target value in the node
# The tree chooses the split that minimizes the weighted MSE of the child nodes
reg = DecisionTreeRegressor(criterion='squared_error') # Default
reg.fit(X_train, y_train)
🎨 Visualizing Decision Trees
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
# Visualize tree
plt.figure(figsize=(20, 10))
plot_tree(clf,
          feature_names=['sepal length', 'sepal width', 'petal length', 'petal width'],
          class_names=['setosa', 'versicolor', 'virginica'],
          filled=True,
          rounded=True)
plt.show()
# Export to text
from sklearn.tree import export_text
tree_rules = export_text(clf, feature_names=['sepal length', 'sepal width', 'petal length', 'petal width'])
print(tree_rules)
⚠️ Overfitting Problem
Deep, unpruned trees memorize the training data and generalize poorly to new data.
Pruning Techniques
# Pre-pruning (stop tree growth early)
clf = DecisionTreeClassifier(
    max_depth=5,                 # Maximum tree depth
    min_samples_split=10,        # Min samples required to split
    min_samples_leaf=5,          # Min samples in leaf node
    max_leaf_nodes=20,           # Limit number of leaves
    min_impurity_decrease=0.01   # Min impurity decrease to split
)
# Post-pruning (cost complexity pruning)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
# Train trees with different alpha values
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
# Select best alpha based on validation score
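One way to do that selection (a sketch using 5-fold cross-validation on the training data; the best alpha is the one with the highest mean score):
from sklearn.model_selection import cross_val_score

# Cross-validate each pruned tree and keep the alpha with the best mean accuracy
cv_scores = [cross_val_score(c, X_train, y_train, cv=5).mean() for c in clfs]
best_alpha = ccp_alphas[cv_scores.index(max(cv_scores))]
print(f"Best alpha: {best_alpha:.4f} (CV accuracy: {max(cv_scores):.3f})")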
🌲🌲🌲 Random Forests
Random Forest combines multiple decision trees to reduce overfitting and improve accuracy.
How Random Forests Work
- Bootstrap: Draw one random sample of the training data (with replacement) per tree
- Random features: At each split, only a random subset of features is considered
- Train: Build a full (typically unpruned) decision tree on each bootstrap sample
- Aggregate: Majority vote (classification) or average (regression); see the from-scratch sketch below
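To make the mechanism concrete, here is a minimal from-scratch sketch of bagged trees with per-split feature randomization (illustrative only, not how RandomForestClassifier is implemented internally; it reuses the iris split from the classification example above):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n_trees, n_samples = 25, len(X_train)
trees = []
for _ in range(n_trees):
    # Bootstrap: sample training rows with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    # max_features='sqrt' randomizes the features considered at each split
    tree = DecisionTreeClassifier(max_features='sqrt')
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Aggregate: majority vote across the individual trees
all_preds = np.array([t.predict(X_test) for t in trees])
majority = np.array([np.bincount(col).argmax() for col in all_preds.T])
print(f"Manual bagging accuracy: {accuracy_score(y_test, majority):.3f}")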
Classification with Random Forest
from sklearn.ensemble import RandomForestClassifier
# Train random forest
rf = RandomForestClassifier(
    n_estimators=100,       # Number of trees
    max_depth=10,           # Max depth per tree
    min_samples_split=5,
    min_samples_leaf=2,
    max_features='sqrt',    # sqrt(n_features) considered at each split
    bootstrap=True,         # Use bootstrap samples
    n_jobs=-1,              # Use all CPU cores
    random_state=42
)
rf.fit(X_train, y_train)
# Predictions
y_pred = rf.predict(X_test)
y_proba = rf.predict_proba(X_test) # Probability estimates
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
Regression with Random Forest
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Reuse the regression split (Xr_train, yr_train) from the regression tree example
rf_reg = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    min_samples_split=10,
    random_state=42,
    n_jobs=-1
)
rf_reg.fit(Xr_train, yr_train)
predictions = rf_reg.predict(Xr_test)
print(f"RMSE: {np.sqrt(mean_squared_error(yr_test, predictions)):.3f}")
print(f"R² Score: {r2_score(yr_test, predictions):.3f}")
🎯 Feature Importance
import pandas as pd
import matplotlib.pyplot as plt
# Get feature importance
importances = rf.feature_importances_
feature_names = ['sepal length', 'sepal width', 'petal length', 'petal width']  # iris feature names
# Create dataframe
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values('importance', ascending=False)
# Plot
plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'], importance_df['importance'])
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.show()
# Top features
print("Top 5 features:")
print(importance_df.head())
⚙️ Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}
# Grid search
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    rf,
    param_grid,
    cv=5,                 # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,
    verbose=2
)
grid_search.fit(X_train, y_train)
# Best parameters
print("Best parameters:")
print(grid_search.best_params_)
print(f"Best CV score: {grid_search.best_score_:.3f}")
# Use best model
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
📊 Out-of-Bag (OOB) Evaluation
# Random Forest has built-in validation using OOB samples
rf = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,   # Enable OOB evaluation
    random_state=42
)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_:.3f}")
# No need for separate validation set!
🆚 Decision Tree vs Random Forest
| Aspect | Decision Tree | Random Forest |
|---|---|---|
| Overfitting | High (without pruning) | Low (ensemble averaging) |
| Training Time | Fast | Slower (multiple trees) |
| Prediction Time | Fast | Moderate |
| Interpretability | High (visualizable) | Low (black box) |
| Accuracy | Good | Excellent |
| Feature Importance | Yes | Yes (more robust) |
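A quick way to check the accuracy comparison on the data used above (a sketch with 5-fold cross-validation; numbers will vary by dataset):
from sklearn.model_selection import cross_val_score

for name, model in [("Decision Tree", DecisionTreeClassifier(random_state=42)),
                    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42))]:
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")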
💡 Best Practices
- Decision Trees: Use when interpretability is crucial
- Random Forests: Use for best predictive performance
- Start with defaults: 100-200 trees usually sufficient
- max_features: Use 'sqrt' for classification, n_features/3 for regression (see the configuration sketch after this list)
- Avoid overly deep trees: Limit max_depth if individual trees overfit
- Use OOB score: Quick validation estimate without a separate validation set
- Parallelize: Set n_jobs=-1 for faster training
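A starting-point configuration that reflects the suggestions above (reasonable defaults to tune from, not universal settings):
clf_rf = RandomForestClassifier(
    n_estimators=200,        # 100-200 trees is usually enough
    max_features='sqrt',     # sqrt(n_features) per split for classification
    oob_score=True,          # free validation estimate
    n_jobs=-1,
    random_state=42
)

reg_rf = RandomForestRegressor(
    n_estimators=200,
    max_features=1/3,        # ~n_features/3 per split for regression (passed as a fraction)
    n_jobs=-1,
    random_state=42
)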
🎯 Key Takeaways
- Decision Trees are intuitive and interpretable
- Random Forests reduce overfitting via ensemble
- No scaling needed - works with raw features
- Feature importance helps understand model
- Prune trees to control complexity
- 100-200 trees usually sufficient for RF