Feature Selection

What is Feature Selection?

Feature selection is the process of identifying and keeping the features most relevant to your model. Done well, it can reduce overfitting, improve accuracy, and speed up training.

Benefits of Feature Selection:
Reduces overfitting: Fewer features = less noise
Improves accuracy: Removes irrelevant/redundant features
Speeds up training: Fewer computations needed
Better interpretability: Simpler models easier to understand
Reduces storage: Less data to store and process

🔍 Filter Methods

Evaluate features independently using statistical tests. Fast and model-agnostic.

1. Variance Threshold

from sklearn.feature_selection import VarianceThreshold
from sklearn.datasets import load_iris
import numpy as np

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Remove low-variance features
selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(X)

print(f"Original features: {X.shape[1]}")
print(f"Selected features: {X_selected.shape[1]}")
print(f"Feature variances: {selector.variances_}")
print(f"Selected mask: {selector.get_support()}")
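
VarianceThreshold compares raw variances, so the result depends on the units each feature is measured in. The sketch below uses made-up data (the height and income columns are hypothetical, not part of the iris example) to show the same measurement being dropped or kept depending only on its unit.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, size=500)      # hypothetical heights in meters
income = rng.normal(50_000, 10_000, size=500)  # hypothetical incomes

# Same information, different units for the first column
X_meters = np.column_stack([height_m, income])
X_centimeters = np.column_stack([height_m * 100, income])

mask_meters = VarianceThreshold(threshold=0.1).fit(X_meters).get_support()
mask_centimeters = VarianceThreshold(threshold=0.1).fit(X_centimeters).get_support()

print(f"Kept with height in meters:      {mask_meters}")
print(f"Kept with height in centimeters: {mask_centimeters}")
```

If features live on very different scales, one option is to rescale them to a common range (for example with MinMaxScaler) before thresholding. Standardizing to unit variance would make every variance equal and the threshold meaningless.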

2. Correlation with Target

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Create dataframe
df = pd.DataFrame(X, columns=iris.feature_names)
df['target'] = y

# Correlation matrix
corr = df.corr()

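# Correlation with target
# (Pearson correlation treats the integer class labels as a numeric target,
# so read these values as a rough screen rather than a formal test)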
target_corr = corr['target'].drop('target').abs().sort_values(ascending=False)
print("Correlation with target:")
print(target_corr)

# Visualize
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()

# Select features with correlation > threshold
threshold = 0.3
selected_features = target_corr[target_corr > threshold].index.tolist()
print(f"\nSelected features (|corr| > {threshold}): {selected_features}")

3. Statistical Tests

from sklearn.feature_selection import SelectKBest, f_classif, chi2, mutual_info_classif

# ANOVA F-test (for classification)
selector = SelectKBest(f_classif, k=2)  # Select top 2 features
X_selected = selector.fit_transform(X, y)

print(f"F-scores: {selector.scores_}")
print(f"P-values: {selector.pvalues_}")
print(f"Selected features: {np.array(iris.feature_names)[selector.get_support()]}")

# Chi-squared test (for non-negative features)
selector_chi2 = SelectKBest(chi2, k=2)
X_selected_chi2 = selector_chi2.fit_transform(X, y)
print(f"\nChi2 scores: {selector_chi2.scores_}")

# Mutual Information (captures non-linear relationships)
selector_mi = SelectKBest(mutual_info_classif, k=2)
X_selected_mi = selector_mi.fit_transform(X, y)
print(f"\nMutual Information scores: {selector_mi.scores_}")
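
To illustrate the "non-linear relationships" remark, here is a minimal sketch on synthetic data (the variables `x` and `y_sym` are invented for this example). The class depends on the magnitude of `x` but not its sign, so linear measures such as Pearson correlation and the ANOVA F-test see almost nothing, while mutual information picks up the dependence.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
y_sym = (np.abs(x) > 0.5).astype(int)  # class depends on |x|, not on the sign of x

X_single = x.reshape(-1, 1)
pearson = np.corrcoef(x, y_sym)[0, 1]
f_score, p_value = f_classif(X_single, y_sym)
mi = mutual_info_classif(X_single, y_sym, random_state=0)

print(f"Pearson correlation: {pearson:.3f}")
print(f"ANOVA F-score:       {f_score[0]:.3f} (p = {p_value[0]:.3f})")
print(f"Mutual information:  {mi[0]:.3f}")
```

Mutual information is estimated with a nearest-neighbor method, so the scores vary slightly between runs unless `random_state` is fixed.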

4. SelectPercentile

from sklearn.feature_selection import SelectPercentile

# Select top 50% of features
selector = SelectPercentile(f_classif, percentile=50)
X_selected = selector.fit_transform(X, y)

print(f"Selected {X_selected.shape[1]} out of {X.shape[1]} features")
print(f"Selected features: {np.array(iris.feature_names)[selector.get_support()]}")

🔄 Wrapper Methods

Use a model to evaluate feature subsets. More accurate but computationally expensive.

1. Recursive Feature Elimination (RFE)

from sklearn.feature_selection import RFE, RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# RFE: Recursively remove least important features
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
selector = RFE(estimator, n_features_to_select=2)
selector.fit(X_train, y_train)

print(f"Selected features: {np.array(iris.feature_names)[selector.get_support()]}")
print(f"Feature ranking: {selector.ranking_}")

# Evaluate
from sklearn.metrics import accuracy_score
y_pred = selector.predict(X_test)
print(f"Accuracy with {selector.n_features_} features: {accuracy_score(y_test, y_pred):.3f}")

2. RFE with Cross-Validation

# RFECV: Automatically find optimal number of features
selector = RFECV(
    estimator=RandomForestClassifier(random_state=42),
    step=1,              # Remove 1 feature per iteration
    cv=5,                # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1
)

selector.fit(X_train, y_train)

print(f"Optimal number of features: {selector.n_features_}")
print(f"Selected features: {np.array(iris.feature_names)[selector.get_support()]}")
print(f"Cross-validation scores: {selector.cv_results_['mean_test_score']}")

# Plot number of features vs accuracy
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(selector.cv_results_['mean_test_score']) + 1),
         selector.cv_results_['mean_test_score'], 'bo-')
plt.xlabel('Number of Features')
plt.ylabel('Cross-Validation Score')
plt.title('RFECV: Optimal Number of Features')
plt.grid(True)
plt.show()

3. Sequential Feature Selection

from sklearn.feature_selection import SequentialFeatureSelector

# Forward selection: Start with 0, add features one by one
sfs_forward = SequentialFeatureSelector(
    RandomForestClassifier(random_state=42),
    n_features_to_select=2,
    direction='forward',
    cv=5,
    n_jobs=-1
)
sfs_forward.fit(X_train, y_train)
print(f"Forward selection: {np.array(iris.feature_names)[sfs_forward.get_support()]}")

# Backward selection: Start with all, remove features one by one
sfs_backward = SequentialFeatureSelector(
    RandomForestClassifier(random_state=42),
    n_features_to_select=2,
    direction='backward',
    cv=5,
    n_jobs=-1
)
sfs_backward.fit(X_train, y_train)
print(f"Backward selection: {np.array(iris.feature_names)[sfs_backward.get_support()]}")

🌳 Embedded Methods

Feature selection built into the model training process.

1. Tree-Based Feature Importance

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Random Forest importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get feature importances
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

print("Feature ranking:")
for i in range(X.shape[1]):
    print(f"{i+1}. {iris.feature_names[indices[i]]} ({importances[indices[i]]:.3f})")

# Plot
plt.figure(figsize=(10, 6))
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), [iris.feature_names[i] for i in indices], rotation=45)
plt.title('Feature Importances')
plt.ylabel('Importance')
plt.tight_layout()
plt.show()

# Keep features whose importance is at or above the median
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(rf, threshold='median', prefit=True)
X_selected = selector.transform(X_train)
print(f"\nSelected {X_selected.shape[1]} features above median importance")
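
Impurity-based importances are computed from the training data and can be misleading, for example when features are correlated. A common cross-check is permutation importance on held-out data: shuffle one feature at a time and measure how much the score drops. The following is a minimal sketch that rebuilds the same split and forest so it runs on its own; `n_repeats=10` is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Shuffle each feature 10 times on the test set and record the drop in accuracy
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

for idx in perm.importances_mean.argsort()[::-1]:
    print(f"{iris.feature_names[idx]}: "
          f"{perm.importances_mean[idx]:.3f} +/- {perm.importances_std[idx]:.3f}")
```

When two features carry nearly the same information, shuffling one of them may barely change the score, so low permutation importance does not by itself prove a feature is useless.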

2. L1 Regularization (Lasso)

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

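# L1 regularization performs feature selection by driving some coefficients to exactly zero.
# The saga solver supports the L1 penalty on multiclass targets; liblinear only fits one-vs-rest models.
logreg_l1 = LogisticRegression(penalty='l1', C=0.1, solver='saga', max_iter=5000)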
logreg_l1.fit(X_train_scaled, y_train)

# Features with non-zero coefficients
selected_features = np.abs(logreg_l1.coef_).sum(axis=0) > 0
print(f"Selected features: {np.array(iris.feature_names)[selected_features]}")
print(f"Number of selected features: {selected_features.sum()}")

# Use SelectFromModel
selector = SelectFromModel(logreg_l1, prefit=True)
X_selected = selector.transform(X_train_scaled)
print(f"Selected shape: {X_selected.shape}")

3. Regularization Path

from sklearn.linear_model import LassoCV

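# Try different regularization strengths.
# Note: Lasso is a regression model, so the integer class labels are treated
# as a numeric target here purely to illustrate the regularization path.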
alphas = np.logspace(-3, 1, 50)
lasso_cv = LassoCV(alphas=alphas, cv=5)
lasso_cv.fit(X_train_scaled, y_train)

print(f"Best alpha: {lasso_cv.alpha_:.4f}")
print(f"Coefficients: {lasso_cv.coef_}")
print(f"Non-zero features: {np.sum(lasso_cv.coef_ != 0)}")

# Plot regularization path
from sklearn.linear_model import lasso_path
alphas_path, coefs_path, _ = lasso_path(X_train_scaled, y_train, alphas=alphas)

plt.figure(figsize=(10, 6))
for i, feature_name in enumerate(iris.feature_names):
    plt.plot(alphas_path, coefs_path[i, :], label=feature_name)
plt.xscale('log')
plt.xlabel('Alpha (Regularization Strength)')
plt.ylabel('Coefficients')
plt.title('Lasso Regularization Path')
plt.legend()
plt.grid(True)
plt.show()

📊 Feature Selection Comparison

Method	Type	Speed	Accuracy	Best For
Variance Threshold	Filter	Very Fast	Low	Remove constants
Correlation	Filter	Fast	Medium	Linear relationships
Statistical Tests	Filter	Fast	Medium	Quick screening
RFE/RFECV	Wrapper	Slow	High	Optimal subset
Sequential Selection	Wrapper	Very Slow	High	Small datasets
Tree Importance	Embedded	Fast	High	Non-linear data
L1 Regularization	Embedded	Medium	High	Linear models

🔗 Removing Correlated Features

# Remove highly correlated features
def remove_correlated_features(df, threshold=0.9):
    corr_matrix = df.corr().abs()
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
    
    return df.drop(columns=to_drop), to_drop

# Example
df = pd.DataFrame(X, columns=iris.feature_names)
df_reduced, dropped = remove_correlated_features(df, threshold=0.9)

print(f"Dropped features: {dropped}")
print(f"Remaining features: {df_reduced.columns.tolist()}")

# Visualize correlation before and after
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', ax=axes[0])
axes[0].set_title('Before Removing Correlated Features')
sns.heatmap(df_reduced.corr(), annot=True, cmap='coolwarm', ax=axes[1])
axes[1].set_title('After Removing Correlated Features')
plt.tight_layout()
plt.show()

🎯 Complete Feature Selection Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_classif, k=2)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit and evaluate
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {score:.3f}")

# Get selected features
selector = pipeline.named_steps['feature_selection']
selected_features = np.array(iris.feature_names)[selector.get_support()]
print(f"Selected features: {selected_features}")

# Compare with all features
rf_all = RandomForestClassifier(random_state=42)
rf_all.fit(X_train, y_train)
score_all = rf_all.score(X_test, y_test)
print(f"Accuracy with all features: {score_all:.3f}")
print(f"Accuracy with selected features: {score:.3f}")
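
The pipeline above fixes k=2 by hand. One way to choose k from the data, sketched here under the assumption that 5-fold accuracy is the right criterion, is to treat it as a hyperparameter and search over it with GridSearchCV. Because the selector sits inside the pipeline, it is refit on each training fold.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_classif)),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# Step name + double underscore + parameter name addresses a pipeline step
param_grid = {'feature_selection__k': [1, 2, 3, 4]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print(f"Best k: {search.best_params_['feature_selection__k']}")
print(f"Best cross-validation accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {search.score(X_test, y_test):.3f}")
```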

💡 Best Practices

Start with filter methods: Fast exploration
Use wrapper methods for accuracy: Better but slower
Try embedded methods: Good balance of speed and accuracy
Use cross-validation: Avoid overfitting to a single validation split
Keep domain knowledge: Don't blindly remove features
Check for leakage: Fit selector on train data only
Compare performance: Validate that selection helps
Consider interpretability: Fewer features = easier to explain

⚠️ Common Mistakes

Selecting on entire dataset: Causes data leakage

# WRONG: Select on entire dataset (test rows influence which features are kept)
selector = SelectKBest(k=2)  # k must not exceed the number of features (4 for iris)
X_selected = selector.fit_transform(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_selected, y)

# CORRECT: Use pipeline or fit on train only
X_train, X_test, y_train, y_test = train_test_split(X, y)
selector = SelectKBest(k=2)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

Not scaling before variance thresholds or L1 regularization: Both depend on feature units (Pearson correlation itself is scale-invariant)
Removing too many features: Can lose important information
Ignoring multicollinearity: Correlated features provide redundant info
Over-relying on importance: Can be misleading with correlated features
Not validating results: Always check if selection improves model
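
The leakage mistake is easiest to see on data with no signal at all. The sketch below is an illustration on pure noise (the arrays `X_noise` and `y_noise` and the sizes 100 x 2000 are arbitrary assumptions). Any honest estimate should land near chance level, yet selecting features on the full dataset before cross-validation typically reports a much higher accuracy.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))   # features are pure noise
y_noise = rng.integers(0, 2, size=100)   # labels are unrelated to the features

# Leaky: the selector sees every row, including those later used for validation
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5
).mean()

# Honest: the selector is refit inside each training fold
honest_pipeline = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_score = cross_val_score(honest_pipeline, X_noise, y_noise, cv=5).mean()

print(f"Leaky CV accuracy:  {leaky_score:.3f}")
print(f"Honest CV accuracy: {honest_score:.3f}")
```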

🔍 Feature Selection Workflow

Remove constant features: VarianceThreshold
Remove duplicates: Check for identical columns
Remove highly correlated features: e.g. absolute correlation > 0.9
Filter methods: Quick screening with statistical tests
Embedded methods: Tree importance or L1 regularization
Wrapper methods: RFE/RFECV for optimal subset (if time permits)
Validate: Check model performance with selected features
Iterate: Try different combinations and thresholds
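
The first two workflow steps (constant and duplicate columns) can be sketched with pandas alone. The toy DataFrame and its column names are made up for illustration, and the duplicate check only catches columns that are exactly identical.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "a_copy": [1, 2, 3, 4],    # exact duplicate of "a"
    "constant": [7, 7, 7, 7],  # carries no information
    "b": [4, 1, 3, 2],
})

# Step 1: drop columns with a single unique value
constant_cols = [col for col in df_demo.columns if df_demo[col].nunique() <= 1]
df_no_constants = df_demo.drop(columns=constant_cols)

# Step 2: drop columns identical to an earlier column
# (transpose so that duplicated() compares columns instead of rows)
duplicate_cols = df_no_constants.columns[df_no_constants.T.duplicated()].tolist()
df_clean = df_no_constants.drop(columns=duplicate_cols)

print(f"Constant columns:  {constant_cols}")
print(f"Duplicate columns: {duplicate_cols}")
print(f"Remaining columns: {df_clean.columns.tolist()}")
```

Transposing a wide DataFrame can be slow and memory-hungry, so for very large tables a column-hashing approach may be preferable.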

🎯 Key Takeaways

Feature selection can reduce overfitting and often improves performance
Filter methods are fast but less accurate
Wrapper methods are accurate but computationally expensive
Embedded methods offer good balance (tree importance, L1)
Always use pipeline or fit on train data only to avoid leakage
Remove correlated features to reduce redundancy
Validate impact on model performance