🎯 Feature Selection

Choose the most important features

What is Feature Selection?

Feature selection is the process of identifying and keeping only the most relevant features for your model. Done well, it can reduce overfitting, improve accuracy, and speed up training.

Benefits of Feature Selection:

  • Reduces overfitting: Fewer features = less noise (see the sketch after this list)
  • Improves accuracy: Removes irrelevant/redundant features
  • Speeds up training: Fewer computations needed
  • Better interpretability: Simpler models easier to understand
  • Reduces storage: Less data to store and process
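
To make the "less noise" point concrete, here is a small, self-contained sketch (the noise column names, seed, and n_noise are arbitrary illustrations, not part of any library API) that appends random noise columns to the iris data and checks that a univariate filter scores them far below the real features:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Append 4 pure-noise columns to the 4 real iris features
n_noise = 4
X_noisy = np.hstack([iris.data, rng.normal(size=(iris.data.shape[0], n_noise))])

# Univariate F-test scores: the real features should score far above the noise
scores = SelectKBest(f_classif, k='all').fit(X_noisy, iris.target).scores_
feature_names = list(iris.feature_names) + [f"noise_{i}" for i in range(n_noise)]
for name, score in zip(feature_names, scores):
    print(f"{name:20s} F-score: {score:8.1f}")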

🔍 Filter Methods

Evaluate features independently using statistical tests. Fast and model-agnostic.

1. Variance Threshold

from sklearn.feature_selection import VarianceThreshold
from sklearn.datasets import load_iris
import numpy as np

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Remove low-variance features
selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(X)

print(f"Original features: {X.shape[1]}")
print(f"Selected features: {X_selected.shape[1]}")
print(f"Feature variances: {selector.variances_}")
print(f"Selected mask: {selector.get_support()}")

2. Correlation with Target

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Create dataframe
df = pd.DataFrame(X, columns=iris.feature_names)
df['target'] = y

# Correlation matrix
corr = df.corr()

# Correlation with target
target_corr = corr['target'].drop('target').abs().sort_values(ascending=False)
print("Correlation with target:")
print(target_corr)

# Visualize
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()

# Select features with correlation > threshold
threshold = 0.3
selected_features = target_corr[target_corr > threshold].index.tolist()
print(f"\nSelected features (|corr| > {threshold}): {selected_features}")

3. Statistical Tests

from sklearn.feature_selection import SelectKBest, f_classif, chi2, mutual_info_classif

# ANOVA F-test (for classification)
selector = SelectKBest(f_classif, k=2)  # Select top 2 features
X_selected = selector.fit_transform(X, y)

print(f"F-scores: {selector.scores_}")
print(f"P-values: {selector.pvalues_}")
print(f"Selected features: {np.array(iris.feature_names)[selector.get_support()]}")

# Chi-squared test (for non-negative features)
selector_chi2 = SelectKBest(chi2, k=2)
X_selected_chi2 = selector_chi2.fit_transform(X, y)
print(f"\nChi2 scores: {selector_chi2.scores_}")

# Mutual Information (captures non-linear relationships)
selector_mi = SelectKBest(mutual_info_classif, k=2)
X_selected_mi = selector_mi.fit_transform(X, y)
print(f"\nMutual Information scores: {selector_mi.scores_}")

4. SelectPercentile

from sklearn.feature_selection import SelectPercentile

# Select top 50% of features
selector = SelectPercentile(f_classif, percentile=50)
X_selected = selector.fit_transform(X, y)

print(f"Selected {X_selected.shape[1]} out of {X.shape[1]} features")
print(f"Selected features: {np.array(iris.feature_names)[selector.get_support()]}")

🔄 Wrapper Methods

Use a model to evaluate feature subsets. More accurate but computationally expensive.

1. Recursive Feature Elimination (RFE)

from sklearn.feature_selection import RFE, RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# RFE: Recursively remove least important features
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
selector = RFE(estimator, n_features_to_select=2)
selector.fit(X_train, y_train)

print(f"Selected features: {np.array(iris.feature_names)[selector.get_support()]}")
print(f"Feature ranking: {selector.ranking_}")

# Evaluate
from sklearn.metrics import accuracy_score
y_pred = selector.predict(X_test)
print(f"Accuracy with {selector.n_features_} features: {accuracy_score(y_test, y_pred):.3f}")

2. RFE with Cross-Validation

# RFECV: Automatically find optimal number of features
selector = RFECV(
    estimator=RandomForestClassifier(random_state=42),
    step=1,              # Remove 1 feature per iteration
    cv=5,                # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1
)

selector.fit(X_train, y_train)

print(f"Optimal number of features: {selector.n_features_}")
print(f"Selected features: {np.array(iris.feature_names)[selector.get_support()]}")
print(f"Cross-validation scores: {selector.cv_results_['mean_test_score']}")

# Plot number of features vs accuracy
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(selector.cv_results_['mean_test_score']) + 1),
         selector.cv_results_['mean_test_score'], 'bo-')
plt.xlabel('Number of Features')
plt.ylabel('Cross-Validation Score')
plt.title('RFECV: Optimal Number of Features')
plt.grid(True)
plt.show()

3. Sequential Feature Selection

from sklearn.feature_selection import SequentialFeatureSelector

# Forward selection: start with no features and add them one at a time
sfs_forward = SequentialFeatureSelector(
    RandomForestClassifier(random_state=42),
    n_features_to_select=2,
    direction='forward',
    cv=5,
    n_jobs=-1
)
sfs_forward.fit(X_train, y_train)
print(f"Forward selection: {np.array(iris.feature_names)[sfs_forward.get_support()]}")

# Backward selection: start with all features and remove them one at a time
sfs_backward = SequentialFeatureSelector(
    RandomForestClassifier(random_state=42),
    n_features_to_select=2,
    direction='backward',
    cv=5,
    n_jobs=-1
)
sfs_backward.fit(X_train, y_train)
print(f"Backward selection: {np.array(iris.feature_names)[sfs_backward.get_support()]}")

🌳 Embedded Methods

Feature selection built into the model training process.

1. Tree-Based Feature Importance

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Random Forest importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get feature importances
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

print("Feature ranking:")
for i in range(X.shape[1]):
    print(f"{i+1}. {iris.feature_names[indices[i]]} ({importances[indices[i]]:.3f})")

# Plot
plt.figure(figsize=(10, 6))
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), [iris.feature_names[i] for i in indices], rotation=45)
plt.title('Feature Importances')
plt.ylabel('Importance')
plt.tight_layout()
plt.show()

# Select top k features
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(rf, threshold='median', prefit=True)
X_selected = selector.transform(X_train)
print(f"\nSelected {X_selected.shape[1]} features above median importance")

2. L1 Regularization (Lasso)

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# L1 regularization performs feature selection
logreg_l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear', max_iter=1000)
logreg_l1.fit(X_train_scaled, y_train)

# Features with non-zero coefficients
selected_features = np.abs(logreg_l1.coef_).sum(axis=0) > 0
print(f"Selected features: {np.array(iris.feature_names)[selected_features]}")
print(f"Number of selected features: {selected_features.sum()}")

# Use SelectFromModel
selector = SelectFromModel(logreg_l1, prefit=True)
X_selected = selector.transform(X_train_scaled)
print(f"Selected shape: {X_selected.shape}")

3. Regularization Path

from sklearn.linear_model import LassoCV

# Try different regularization strengths
# Note: Lasso is a regression model, so fitting it on the iris class labels here
# is purely illustrative of how stronger regularization zeroes out coefficients
alphas = np.logspace(-3, 1, 50)
lasso_cv = LassoCV(alphas=alphas, cv=5)
lasso_cv.fit(X_train_scaled, y_train)

print(f"Best alpha: {lasso_cv.alpha_:.4f}")
print(f"Coefficients: {lasso_cv.coef_}")
print(f"Non-zero features: {np.sum(lasso_cv.coef_ != 0)}")

# Plot regularization path
from sklearn.linear_model import lasso_path
alphas_path, coefs_path, _ = lasso_path(X_train_scaled, y_train, alphas=alphas)

plt.figure(figsize=(10, 6))
for i, feature_name in enumerate(iris.feature_names):
    plt.plot(alphas_path, coefs_path[i, :], label=feature_name)
plt.xscale('log')
plt.xlabel('Alpha (Regularization Strength)')
plt.ylabel('Coefficients')
plt.title('Lasso Regularization Path')
plt.legend()
plt.grid(True)
plt.show()

📊 Feature Selection Comparison

Method                  Type      Speed      Accuracy  Best For
Variance Threshold      Filter    Very Fast  Low       Remove constants
Correlation             Filter    Fast       Medium    Linear relationships
Statistical Tests       Filter    Fast       Medium    Quick screening
RFE/RFECV               Wrapper   Slow       High      Optimal subset
Sequential Selection    Wrapper   Very Slow  High      Small datasets
Tree Importance         Embedded  Fast       High      Non-linear data
L1 Regularization       Embedded  Medium     High      Linear models

🔗 Removing Correlated Features

# Remove highly correlated features
def remove_correlated_features(df, threshold=0.9):
    corr_matrix = df.corr().abs()
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
    
    return df.drop(columns=to_drop), to_drop

# Example
df = pd.DataFrame(X, columns=iris.feature_names)
df_reduced, dropped = remove_correlated_features(df, threshold=0.9)

print(f"Dropped features: {dropped}")
print(f"Remaining features: {df_reduced.columns.tolist()}")

# Visualize correlation before and after
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', ax=axes[0])
axes[0].set_title('Before Removing Correlated Features')
sns.heatmap(df_reduced.corr(), annot=True, cmap='coolwarm', ax=axes[1])
axes[1].set_title('After Removing Correlated Features')
plt.tight_layout()
plt.show()

🎯 Complete Feature Selection Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_classif, k=2)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit and evaluate
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {score:.3f}")

# Get selected features
selector = pipeline.named_steps['feature_selection']
selected_features = np.array(iris.feature_names)[selector.get_support()]
print(f"Selected features: {selected_features}")

# Compare with all features
rf_all = RandomForestClassifier(random_state=42)
rf_all.fit(X_train, y_train)
score_all = rf_all.score(X_test, y_test)
print(f"Accuracy with all features: {score_all:.3f}")
print(f"Accuracy with selected features: {score:.3f}")

🔍 Feature Selection Workflow

  1. Remove constant features: VarianceThreshold
  2. Remove duplicates: Check for identical columns
  3. Remove highly correlated: Threshold > 0.9
  4. Filter methods: Quick screening with statistical tests
  5. Embedded methods: Tree importance or L1 regularization
  6. Wrapper methods: RFE/RFECV for optimal subset (if time permits)
  7. Validate: Check model performance with selected features
  8. Iterate: Try different combinations and thresholds (a minimal sketch combining steps 1-4 follows below)
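
As a rough illustration of steps 1-4, the sketch below strings them together on the iris data already loaded above, reusing the remove_correlated_features helper defined earlier; the thresholds and k are placeholder values, not recommendations.

import pandas as pd
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

# Thresholds and k below are placeholders; tune them for your data
df = pd.DataFrame(X, columns=iris.feature_names)

# Step 1: drop constant / near-constant features
vt = VarianceThreshold(threshold=0.0)
df = df.loc[:, vt.fit(df).get_support()]

# Step 2: drop exact duplicate columns
df = df.loc[:, ~df.T.duplicated()]

# Step 3: drop highly correlated features (helper from the correlation section)
df, dropped = remove_correlated_features(df, threshold=0.9)

# Step 4: quick statistical screening, keep the top k features
kbest = SelectKBest(f_classif, k=min(2, df.shape[1]))
kbest.fit(df, y)
selected = df.columns[kbest.get_support()].tolist()

print(f"Dropped as correlated: {dropped}")
print(f"Candidate features after screening: {selected}")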

🎯 Key Takeaways