What are Decision Trees?
Decision Trees are intuitive models that make decisions by asking a series of questions. They're easy to interpret and work well for both classification and regression tasks.
Key Advantages:
- Easy to understand and visualize
- No feature scaling needed
- Handles both numerical and categorical data (note: scikit-learn requires categorical features to be encoded numerically)
- Captures non-linear relationships
- Feature importance built-in
🌲 Decision Tree Basics
How It Works
A decision tree splits data based on features to create homogeneous groups:
- Start with all the data at the root node
- Find the best feature and split threshold (a toy version of this search is sketched after this list)
- Divide the data into child nodes
- Repeat recursively until a stopping criterion is met
- Assign predictions at the leaf nodes
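A toy version of the split search (illustrative only, not scikit-learn's actual implementation): for each feature it scans candidate thresholds and keeps the split with the lowest weighted Gini impurity. The gini and best_split names exist only for this sketch.
import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Greedy search: try every feature and threshold, keep the split
    # with the lowest weighted Gini impurity of the two children
    best_feature, best_threshold, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_feature, best_threshold, best_score = j, t, score
    return best_feature, best_threshold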
Classification Tree Example
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train decision tree
clf = DecisionTreeClassifier(
    max_depth=3,           # Limit tree depth
    min_samples_split=5,   # Min samples to split node
    min_samples_leaf=2,    # Min samples in leaf
    random_state=42
)
clf.fit(X_train, y_train)
# Predict
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
Regression Tree Example
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes

# Load a dataset with continuous targets (the iris split above is a classification task)
X_reg, y_reg = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Train regression tree
reg = DecisionTreeRegressor(max_depth=5, random_state=42)
reg.fit(Xr_train, yr_train)

# Predict continuous values
predictions = reg.predict(Xr_test)

# Feature importance
importances = reg.feature_importances_
for i, importance in enumerate(importances):
    print(f"Feature {i}: {importance:.3f}")
📊 Splitting Criteria
Classification: Gini Impurity
# Gini = 1 - Σ(p_i²)
# where p_i is probability of class i
# Example: Node with 40 class A, 60 class B
# Gini = 1 - (0.4² + 0.6²) = 1 - 0.52 = 0.48
# Pure node (all one class): Gini = 0
# Maximum impurity for two classes (50-50 split): Gini = 0.5
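A quick numeric check of the 40/60 example above:
p_a, p_b = 40 / 100, 60 / 100
gini_node = 1 - (p_a ** 2 + p_b ** 2)
print(gini_node)  # 0.48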
Classification: Entropy (Information Gain)
# Entropy = -Σ(p_i * log2(p_i))
clf = DecisionTreeClassifier(criterion='entropy') # Use entropy instead of gini
clf.fit(X_train, y_train)
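Information gain is the drop in entropy from a split: parent entropy minus the weighted average entropy of the children. A small sketch (the helper names are illustrative, not part of scikit-learn's API):
import numpy as np

def entropy(counts):
    # Entropy (in bits) of a node given its class counts
    p = np.asarray(counts) / np.sum(counts)
    return -np.sum(p * np.log2(p))

def information_gain(parent_counts, children_counts):
    # Parent entropy minus the weighted average entropy of the child nodes
    n = np.sum(parent_counts)
    weighted = sum(np.sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

print(entropy([40, 60]))                                  # ≈ 0.971 bits
print(information_gain([40, 60], [[30, 10], [10, 50]]))   # gain of a hypothetical split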
Regression: Mean Squared Error
# MSE = (1/n) * Σ(y_i - ȳ)², where ȳ is the mean target value in the node
# The tree chooses the split that minimizes the weighted MSE of the child nodes
reg = DecisionTreeRegressor(criterion='squared_error') # Default
reg.fit(X_train, y_train)
🎨 Visualizing Decision Trees
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
# Visualize tree
plt.figure(figsize=(20, 10))
plot_tree(clf,
          feature_names=['sepal length', 'sepal width', 'petal length', 'petal width'],
          class_names=['setosa', 'versicolor', 'virginica'],
          filled=True,
          rounded=True)
plt.show()
# Export to text
from sklearn.tree import export_text
tree_rules = export_text(clf, feature_names=['sepal length', 'sepal width', 'petal length', 'petal width'])
print(tree_rules)
⚠️ Overfitting Problem
Deep, unpruned trees memorize the training data and generalize poorly to new data.
Pruning Techniques
# Pre-pruning (stop tree growth early)
clf = DecisionTreeClassifier(
    max_depth=5,                 # Maximum tree depth
    min_samples_split=10,        # Min samples required to split
    min_samples_leaf=5,          # Min samples in leaf node
    max_leaf_nodes=20,           # Limit number of leaves
    min_impurity_decrease=0.01   # Min impurity decrease to split
)
# Post-pruning (cost complexity pruning)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
# Train trees with different alpha values
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
# Select best alpha based on validation score
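One way to do that selection (a sketch using 5-fold cross-validation on the training data; the best alpha is the one with the highest mean score):
from sklearn.model_selection import cross_val_score

# Cross-validate each pruned tree and keep the alpha with the best mean accuracy
cv_scores = [cross_val_score(c, X_train, y_train, cv=5).mean() for c in clfs]
best_alpha = ccp_alphas[cv_scores.index(max(cv_scores))]
print(f"Best alpha: {best_alpha:.4f} (CV accuracy: {max(cv_scores):.3f})")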
🌲🌲🌲 Random Forests
Random Forest combines multiple decision trees to reduce overfitting and improve accuracy.
How Random Forests Work
- Bootstrap: Draw one random sample of the training data (with replacement) per tree
- Random features: At each split, only a random subset of features is considered
- Train: Build a full (typically unpruned) decision tree on each bootstrap sample
- Aggregate: Majority vote (classification) or average (regression); see the from-scratch sketch below
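To make the mechanism concrete, here is a minimal from-scratch sketch of bagged trees with per-split feature randomization (illustrative only, not how RandomForestClassifier is implemented internally; it reuses the iris split from the classification example above):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n_trees, n_samples = 25, len(X_train)
trees = []
for _ in range(n_trees):
    # Bootstrap: sample training rows with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    # max_features='sqrt' randomizes the features considered at each split
    tree = DecisionTreeClassifier(max_features='sqrt')
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Aggregate: majority vote across the individual trees
all_preds = np.array([t.predict(X_test) for t in trees])
majority = np.array([np.bincount(col).argmax() for col in all_preds.T])
print(f"Manual bagging accuracy: {accuracy_score(y_test, majority):.3f}")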
Classification with Random Forest
from sklearn.ensemble import RandomForestClassifier
# Train random forest
rf = RandomForestClassifier(
    n_estimators=100,       # Number of trees
    max_depth=10,           # Max depth per tree
    min_samples_split=5,
    min_samples_leaf=2,
    max_features='sqrt',    # sqrt(n_features) considered at each split
    bootstrap=True,         # Use bootstrap samples
    n_jobs=-1,              # Use all CPU cores
    random_state=42
)
rf.fit(X_train, y_train)
# Predictions
y_pred = rf.predict(X_test)
y_proba = rf.predict_proba(X_test) # Probability estimates
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
Regression with Random Forest
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Reuse the regression split (Xr_train, yr_train) from the regression tree example
rf_reg = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    min_samples_split=10,
    random_state=42,
    n_jobs=-1
)
rf_reg.fit(Xr_train, yr_train)
predictions = rf_reg.predict(Xr_test)
print(f"RMSE: {np.sqrt(mean_squared_error(yr_test, predictions)):.3f}")
print(f"R² Score: {r2_score(yr_test, predictions):.3f}")
🎯 Feature Importance
import pandas as pd
import matplotlib.pyplot as plt
# Get feature importance
importances = rf.feature_importances_
feature_names = ['sepal length', 'sepal width', 'petal length', 'petal width']  # iris feature names
# Create dataframe
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values('importance', ascending=False)
# Plot
plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'], importance_df['importance'])
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.show()
# Top features
print("Top 5 features:")
print(importance_df.head())
⚙️ Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}
# Grid search
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    rf,
    param_grid,
    cv=5,                 # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,
    verbose=2
)
grid_search.fit(X_train, y_train)
# Best parameters
print("Best parameters:")
print(grid_search.best_params_)
print(f"Best CV score: {grid_search.best_score_:.3f}")
# Use best model
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
📊 Out-of-Bag (OOB) Evaluation
# Random Forest has built-in validation using OOB samples
rf = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,   # Enable OOB evaluation
    random_state=42
)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_:.3f}")
# No need for separate validation set!
🆚 Decision Tree vs Random Forest
| Aspect | Decision Tree | Random Forest |
|---|---|---|
| Overfitting | High (without pruning) | Low (ensemble averaging) |
| Training Time | Fast | Slower (multiple trees) |
| Prediction Time | Fast | Moderate |
| Interpretability | High (visualizable) | Low (black box) |
| Accuracy | Good | Excellent |
| Feature Importance | Yes | Yes (more robust) |
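A quick way to check the accuracy comparison on the data used above (a sketch with 5-fold cross-validation; numbers will vary by dataset):
from sklearn.model_selection import cross_val_score

for name, model in [("Decision Tree", DecisionTreeClassifier(random_state=42)),
                    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42))]:
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")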
💡 Best Practices
- Decision Trees: Use when interpretability is crucial
- Random Forests: Use for best predictive performance
- Start with defaults: 100-200 trees usually sufficient
- max_features: Use 'sqrt' for classification, n_features/3 for regression (see the configuration sketch after this list)
- Avoid overly deep trees: Limit max_depth if individual trees overfit
- Use OOB score: Quick validation estimate without a separate validation set
- Parallelize: Set n_jobs=-1 for faster training
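A starting-point configuration that reflects the suggestions above (reasonable defaults to tune from, not universal settings):
clf_rf = RandomForestClassifier(
    n_estimators=200,        # 100-200 trees is usually enough
    max_features='sqrt',     # sqrt(n_features) per split for classification
    oob_score=True,          # free validation estimate
    n_jobs=-1,
    random_state=42
)

reg_rf = RandomForestRegressor(
    n_estimators=200,
    max_features=1/3,        # ~n_features/3 per split for regression (passed as a fraction)
    n_jobs=-1,
    random_state=42
)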
🎯 Key Takeaways
- Decision Trees are intuitive and interpretable
- Random Forests reduce overfitting via ensemble
- No scaling needed - works with raw features
- Feature importance helps understand model
- Prune trees to control complexity
- 100-200 trees usually sufficient for RF