What is Feature Selection?
Feature selection is the process of identifying and keeping the features most relevant to your model. Done well, it can reduce overfitting, improve accuracy, and speed up training.
Benefits of Feature Selection:
- Reduces overfitting: Fewer features = less noise
- Improves accuracy: Removes irrelevant/redundant features
- Speeds up training: Fewer computations needed
- Better interpretability: Simpler models easier to understand
- Reduces storage: Less data to store and process
🔍 Filter Methods
Evaluate features independently using statistical tests. Fast and model-agnostic.
1. Variance Threshold
from sklearn.feature_selection import VarianceThreshold
from sklearn.datasets import load_iris
import numpy as np
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Remove low-variance features
selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(X)
print(f"Original features: {X.shape[1]}")
print(f"Selected features: {X_selected.shape[1]}")
print(f"Feature variances: {selector.variances_}")
print(f"Selected mask: {selector.get_support()}")
2. Correlation with Target
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Create dataframe
df = pd.DataFrame(X, columns=iris.feature_names)
df['target'] = y
# Correlation matrix
corr = df.corr()
# Correlation with target (treats the class codes 0/1/2 as numeric, so use it as a rough screen only)
target_corr = corr['target'].drop('target').abs().sort_values(ascending=False)
print("Correlation with target:")
print(target_corr)
# Visualize
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()
# Select features whose absolute correlation with the target exceeds the threshold
threshold = 0.3
selected_features = target_corr[target_corr > threshold].index.tolist()
print(f"\nSelected features (|corr| > {threshold}): {selected_features}")
3. Statistical Tests
from sklearn.feature_selection import SelectKBest, f_classif, chi2, mutual_info_classif
# ANOVA F-test (for classification)
selector = SelectKBest(f_classif, k=2) # Select top 2 features
X_selected = selector.fit_transform(X, y)
print(f"F-scores: {selector.scores_}")
print(f"P-values: {selector.pvalues_}")
print(f"Selected features: {np.array(iris.feature_names)[selector.get_support()]}")
# Chi-squared test (for non-negative features)
selector_chi2 = SelectKBest(chi2, k=2)
X_selected_chi2 = selector_chi2.fit_transform(X, y)
print(f"\nChi2 scores: {selector_chi2.scores_}")
# Mutual Information (captures non-linear relationships)
selector_mi = SelectKBest(mutual_info_classif, k=2)
X_selected_mi = selector_mi.fit_transform(X, y)
print(f"\nMutual Information scores: {selector_mi.scores_}")
4. SelectPercentile
from sklearn.feature_selection import SelectPercentile
# Select top 50% of features
selector = SelectPercentile(f_classif, percentile=50)
X_selected = selector.fit_transform(X, y)
print(f"Selected {X_selected.shape[1]} out of {X.shape[1]} features")
print(f"Selected features: {np.array(iris.feature_names)[selector.get_support()]}")
🔄 Wrapper Methods
Use a model to evaluate feature subsets. More accurate but computationally expensive.
1. Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE, RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# RFE: Recursively remove least important features
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
selector = RFE(estimator, n_features_to_select=2)
selector.fit(X_train, y_train)
print(f"Selected features: {np.array(iris.feature_names)[selector.get_support()]}")
print(f"Feature ranking: {selector.ranking_}")
# Evaluate
from sklearn.metrics import accuracy_score
y_pred = selector.predict(X_test)
print(f"Accuracy with {selector.n_features_} features: {accuracy_score(y_test, y_pred):.3f}")
2. RFE with Cross-Validation
# RFECV: Automatically find optimal number of features
selector = RFECV(
    estimator=RandomForestClassifier(random_state=42),
    step=1,  # Remove 1 feature per iteration
    cv=5,  # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1
)
selector.fit(X_train, y_train)
print(f"Optimal number of features: {selector.n_features_}")
print(f"Selected features: {np.array(iris.feature_names)[selector.get_support()]}")
print(f"Cross-validation scores: {selector.cv_results_['mean_test_score']}")
# Plot number of features vs accuracy
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(selector.cv_results_['mean_test_score']) + 1),
         selector.cv_results_['mean_test_score'], 'bo-')
plt.xlabel('Number of Features')
plt.ylabel('Cross-Validation Score')
plt.title('RFECV: Optimal Number of Features')
plt.grid(True)
plt.show()
3. Sequential Feature Selection
from sklearn.feature_selection import SequentialFeatureSelector
# Forward selection: Start with 0, add features one by one
sfs_forward = SequentialFeatureSelector(
    RandomForestClassifier(random_state=42),
    n_features_to_select=2,
    direction='forward',
    cv=5,
    n_jobs=-1
)
sfs_forward.fit(X_train, y_train)
print(f"Forward selection: {np.array(iris.feature_names)[sfs_forward.get_support()]}")
# Backward selection: Start with all, remove features one by one
sfs_backward = SequentialFeatureSelector(
    RandomForestClassifier(random_state=42),
    n_features_to_select=2,
    direction='backward',
    cv=5,
    n_jobs=-1
)
sfs_backward.fit(X_train, y_train)
print(f"Backward selection: {np.array(iris.feature_names)[sfs_backward.get_support()]}")
🌳 Embedded Methods
Feature selection built into the model training process.
1. Tree-Based Feature Importance
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# Random Forest importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Get feature importances
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]
print("Feature ranking:")
for i in range(X.shape[1]):
    print(f"{i+1}. {iris.feature_names[indices[i]]} ({importances[indices[i]]:.3f})")
# Plot
plt.figure(figsize=(10, 6))
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), [iris.feature_names[i] for i in indices], rotation=45)
plt.title('Feature Importances')
plt.ylabel('Importance')
plt.tight_layout()
plt.show()
# Keep features whose importance is at or above the median
from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(rf, threshold='median', prefit=True)
X_selected = selector.transform(X_train)
print(f"\nSelected {X_selected.shape[1]} features at or above median importance")
2. L1 Regularization (Lasso)
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
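# L1 regularization drives some coefficients to exactly zero, which acts as feature selection
# saga supports the L1 penalty with multiclass targets (multiclass use of liblinear is deprecated in recent scikit-learn releases)
logreg_l1 = LogisticRegression(penalty='l1', C=0.1, solver='saga', max_iter=1000, random_state=42)
logreg_l1.fit(X_train_scaled, y_train)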
# Features with non-zero coefficients
selected_features = np.abs(logreg_l1.coef_).sum(axis=0) > 0
print(f"Selected features: {np.array(iris.feature_names)[selected_features]}")
print(f"Number of selected features: {selected_features.sum()}")
# Use SelectFromModel
selector = SelectFromModel(logreg_l1, prefit=True)
X_selected = selector.transform(X_train_scaled)
print(f"Selected shape: {X_selected.shape}")
3. Regularization Path
from sklearn.linear_model import LassoCV
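# Note: Lasso is a regression model, so this treats the class labels (0, 1, 2) as a numeric target; for illustration only
# Try different regularization strengths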
alphas = np.logspace(-3, 1, 50)
lasso_cv = LassoCV(alphas=alphas, cv=5)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Best alpha: {lasso_cv.alpha_:.4f}")
print(f"Coefficients: {lasso_cv.coef_}")
print(f"Non-zero features: {np.sum(lasso_cv.coef_ != 0)}")
# Plot regularization path
from sklearn.linear_model import lasso_path
alphas_path, coefs_path, _ = lasso_path(X_train_scaled, y_train, alphas=alphas)
plt.figure(figsize=(10, 6))
for i, feature_name in enumerate(iris.feature_names):
    plt.plot(alphas_path, coefs_path[i, :], label=feature_name)
plt.xscale('log')
plt.xlabel('Alpha (Regularization Strength)')
plt.ylabel('Coefficients')
plt.title('Lasso Regularization Path')
plt.legend()
plt.grid(True)
plt.show()
📊 Feature Selection Comparison
| Method | Type | Speed | Accuracy | Best For |
|---|---|---|---|---|
| Variance Threshold | Filter | Very Fast | Low | Remove constants |
| Correlation | Filter | Fast | Medium | Linear relationships |
| Statistical Tests | Filter | Fast | Medium | Quick screening |
| RFE/RFECV | Wrapper | Slow | High | Optimal subset |
| Sequential Selection | Wrapper | Very Slow | High | Small datasets |
| Tree Importance | Embedded | Fast | High | Non-linear data |
| L1 Regularization | Embedded | Medium | High | Linear models |
🔗 Removing Correlated Features
# Remove highly correlated features
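def remove_correlated_features(df, threshold=0.9):
    # Absolute pairwise correlations
    corr_matrix = df.corr().abs()
    # Keep only the upper triangle so each pair is checked once
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    # Drop the later column of every pair above the threshold
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
    return df.drop(columns=to_drop), to_drop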
# Example
df = pd.DataFrame(X, columns=iris.feature_names)
df_reduced, dropped = remove_correlated_features(df, threshold=0.9)
print(f"Dropped features: {dropped}")
print(f"Remaining features: {df_reduced.columns.tolist()}")
# Visualize correlation before and after
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', ax=axes[0])
axes[0].set_title('Before Removing Correlated Features')
sns.heatmap(df_reduced.corr(), annot=True, cmap='coolwarm', ax=axes[1])
axes[1].set_title('After Removing Correlated Features')
plt.tight_layout()
plt.show()
🎯 Complete Feature Selection Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_classif, k=2)),
    ('classifier', RandomForestClassifier(random_state=42))
])
# Fit and evaluate
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {score:.3f}")
# Get selected features
selector = pipeline.named_steps['feature_selection']
selected_features = np.array(iris.feature_names)[selector.get_support()]
print(f"Selected features: {selected_features}")
# Compare with all features
rf_all = RandomForestClassifier(random_state=42)
rf_all.fit(X_train, y_train)
score_all = rf_all.score(X_test, y_test)
print(f"Accuracy with all features: {score_all:.3f}")
print(f"Accuracy with selected features: {score:.3f}")
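The value k=2 in the pipeline above is a choice rather than a given. One way to pick it is to treat the number of selected features as a hyperparameter and let cross-validation decide. The following is a minimal sketch under stated assumptions: the names `search_pipeline`, `param_grid`, and `search` are illustrative, and the candidate values for k simply cover all four iris features.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same three steps as the pipeline above
search_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("feature_selection", SelectKBest(f_classif)),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Candidate numbers of features to keep (illustrative grid)
param_grid = {"feature_selection__k": [1, 2, 3, 4]}

# Each candidate k is scored with 5-fold CV on the training data only
search = GridSearchCV(search_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(f"Best k: {search.best_params_['feature_selection__k']}")
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {search.score(X_test, y_test):.3f}")
```

Because the selector sits inside the pipeline, it is refit on every training fold, so the search itself does not leak information from the held-out folds.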
💡 Best Practices
- Start with filter methods: Fast exploration
- Use wrapper methods for accuracy: Better but slower
- Try embedded methods: Good balance of speed and accuracy
- Use cross-validation: Avoid overfitting to a single validation split
- Keep domain knowledge: Don't blindly remove features
- Check for leakage: Fit selector on train data only
- Compare performance: Validate that selection helps
- Consider interpretability: Fewer features = easier to explain
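The cross-validation and leakage points above can be combined by putting the selector inside a pipeline and cross-validating the whole pipeline. A minimal sketch, assuming the iris data and k=2 (both illustrative choices, and `cv_pipeline` is a name introduced here):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# The selector is a pipeline step, so each CV fold fits it
# on that fold's training split only.
cv_pipeline = Pipeline([
    ("feature_selection", SelectKBest(f_classif, k=2)),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])

cv_scores = cross_val_score(cv_pipeline, X, y, cv=5, scoring="accuracy")
print(f"Fold accuracies: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.3f}")
```

Comparing this mean against the same classifier without the selection step is one way to check whether selection actually helps.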
⚠️ Common Mistakes
- Selecting on entire dataset: Causes data leakage
# WRONG: Select on entire dataset (assumes X has at least 5 features)
selector = SelectKBest(k=5)
X_selected = selector.fit_transform(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_selected, y)
# CORRECT: Use pipeline or fit on train only
X_train, X_test, y_train, y_test = train_test_split(X, y)
selector = SelectKBest(k=5)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
- Not scaling before L1 regularization: The penalty depends on feature scale (Pearson correlation, by contrast, is scale-invariant)
- Removing too many features: Can lose important information
- Ignoring multicollinearity: Correlated features provide redundant info
- Over-relying on importance: Can be misleading with correlated features
- Not validating results: Always check if selection improves model
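For the point about over-relying on importance, one possible cross-check is to compare the forest's impurity-based importances with permutation importance computed on held-out data. This is a sketch, not a definitive procedure: permutation importance is itself affected by correlated features (shuffling one feature leaves its correlated partner intact), so treat it as a second opinion rather than ground truth. The split and forest settings mirror the earlier examples.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature on the test set and measure the drop in accuracy
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

for name, impurity, permuted in zip(
    iris.feature_names, rf.feature_importances_, perm.importances_mean
):
    print(f"{name}: impurity={impurity:.3f}, permutation={permuted:.3f}")
```

Large disagreements between the two columns are a hint to look at the correlation structure before dropping anything.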
🔍 Feature Selection Workflow
- Remove constant features: VarianceThreshold
- Remove duplicates: Check for identical columns
- Remove highly correlated features: e.g. absolute correlation > 0.9
- Filter methods: Quick screening with statistical tests
- Embedded methods: Tree importance or L1 regularization
- Wrapper methods: RFE/RFECV for optimal subset (if time permits)
- Validate: Check model performance with selected features
- Iterate: Try different combinations and thresholds
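The first two steps of this workflow (constant and duplicate columns) have no dedicated example elsewhere in this guide, so here is a minimal sketch. The `toy` DataFrame and its column names are hypothetical, made up purely to show the mechanics.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame: "const" never changes and "a_copy" duplicates "a"
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "const": [5.0, 5.0, 5.0, 5.0],
    "a_copy": [1.0, 2.0, 3.0, 4.0],
    "b": [0.5, 0.1, 0.9, 0.3],
})

# Step 1: drop zero-variance (constant) columns
vt = VarianceThreshold(threshold=0.0)
vt.fit(toy)
no_const = toy.loc[:, vt.get_support()]

# Step 2: drop columns whose values exactly duplicate an earlier column
deduped = no_const.loc[:, ~no_const.T.duplicated()]

print(deduped.columns.tolist())  # → ['a', 'b']
```

Transposing to find duplicates is simple but can be slow on very wide frames; hashing columns is a common alternative in that case.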
🎯 Key Takeaways
- Feature selection can reduce overfitting and improve performance
- Filter methods are fast but less accurate
- Wrapper methods are accurate but computationally expensive
- Embedded methods offer good balance (tree importance, L1)
- Always use pipeline or fit on train data only to avoid leakage
- Remove correlated features to reduce redundancy
- Validate impact on model performance