⚙️ Feature Engineering

Create powerful features from raw data

What is Feature Engineering?

Feature engineering is the art of creating new features from existing data to improve model performance. It's often the difference between a mediocre model and an outstanding one.

"Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering."

— Andrew Ng
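
A quick illustration (the columns here are hypothetical): a raw signup timestamp is nearly useless to most models, but the features derived from it are immediately usable:

import pandas as pd

# Hypothetical raw data: a single timestamp column
df = pd.DataFrame({'signup': pd.to_datetime(['2024-01-06 23:10', '2024-03-12 09:30'])})

# Derived features a model can actually learn from
df['signup_hour'] = df['signup'].dt.hour
df['signup_dayofweek'] = df['signup'].dt.dayofweek  # 0 = Monday
df['signup_is_weekend'] = df['signup'].dt.dayofweek.isin([5, 6]).astype(int)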

🔢 Numerical Feature Engineering

1. Binning (Discretization)

import pandas as pd
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Example: Age feature
df = pd.DataFrame({'age': [18, 25, 35, 45, 55, 65, 75, 85]})

# Manual binning
df['age_group'] = pd.cut(df['age'], 
                         bins=[0, 18, 30, 50, 100],
                         labels=['child', 'young', 'middle', 'senior'])

# Equal-width binning
df['age_bin_width'] = pd.cut(df['age'], bins=4)

# Equal-frequency binning (quantiles)
df['age_bin_freq'] = pd.qcut(df['age'], q=4)

# KBinsDiscretizer from sklearn
discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
df['age_bin_sklearn'] = discretizer.fit_transform(df[['age']])

print(df)

2. Mathematical Transformations

# Log transform (for right-skewed data)
df['income_log'] = np.log1p(df['income'])  # log(1 + x) to handle zeros

# Square root
df['distance_sqrt'] = np.sqrt(df['distance'])

# Square (polynomial features)
df['age_squared'] = df['age'] ** 2

# Reciprocal
df['time_inv'] = 1 / (df['time'] + 1e-5)  # Avoid division by zero

# Box-Cox transformation (input must be strictly positive; the +1 below shifts zeros)
from scipy.stats import boxcox
df['price_boxcox'], lambda_param = boxcox(df['price'] + 1)

# Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['feature_scaled'] = scaler.fit_transform(df[['feature']])

3. Aggregations and Statistics

# Rolling statistics (time series); window=7 counts rows, or pass window='7D' with a DatetimeIndex for calendar days
df['sales_rolling_mean_7d'] = df['sales'].rolling(window=7).mean()
df['sales_rolling_std_7d'] = df['sales'].rolling(window=7).std()
df['sales_rolling_max_30d'] = df['sales'].rolling(window=30).max()

# Lag features
df['sales_lag_1'] = df['sales'].shift(1)
df['sales_lag_7'] = df['sales'].shift(7)

# Difference features
df['sales_diff_1'] = df['sales'].diff(1)
df['sales_pct_change'] = df['sales'].pct_change()

# Expanding window statistics
df['sales_expanding_mean'] = df['sales'].expanding().mean()

# Group statistics
df['user_avg_spend'] = df.groupby('user_id')['spend'].transform('mean')
df['user_total_orders'] = df.groupby('user_id')['order_id'].transform('count')
df['product_avg_rating'] = df.groupby('product_id')['rating'].transform('mean')

📝 Categorical Feature Engineering

1. Target Encoding

# Mean target encoding (use with cross-validation!)
target_means = df.groupby('category')['target'].mean()
df['category_target_mean'] = df['category'].map(target_means)

# Smoothed target encoding (handles rare categories)
def target_encode_smooth(series, target, alpha=10):
    global_mean = target.mean()
    agg = pd.DataFrame({'count': series.groupby(series).size(),
                        'mean': target.groupby(series).mean()})
    smoothed = (agg['count'] * agg['mean'] + alpha * global_mean) / (agg['count'] + alpha)
    return series.map(smoothed)

df['category_smooth'] = target_encode_smooth(df['category'], df['target'])

# Important: fit encoders on training data only to avoid leakage!
from category_encoders import TargetEncoder
encoder = TargetEncoder(cols=['category'])
# (X_train / X_test / y_train: your train/test split)
X_train_enc = encoder.fit_transform(X_train, y_train)  # fit on the training split
X_test_enc = encoder.transform(X_test)                 # reuse training statistics
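
For training-time features, a common pattern is out-of-fold encoding: each row is encoded with target statistics computed from the other folds, so its own target never leaks in. A minimal sketch, assuming df has 'category' and 'target' columns:

from sklearn.model_selection import KFold

df['category_te_oof'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    # Statistics come only from the training folds, never the row itself
    fold_means = df.iloc[train_idx].groupby('category')['target'].mean()
    df.loc[df.index[val_idx], 'category_te_oof'] = df['category'].iloc[val_idx].map(fold_means)

# Categories unseen in a fold get NaN; fall back to the global mean
df['category_te_oof'] = df['category_te_oof'].fillna(df['target'].mean())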

2. Frequency Encoding

# Count how often each category appears
freq = df['category'].value_counts()
df['category_freq'] = df['category'].map(freq)

# Normalized frequency
df['category_freq_norm'] = df['category'].map(freq / len(df))

# Rare category indicator
df['is_rare_category'] = (df['category_freq'] < 10).astype(int)

3. Combination Features

# Concatenate multiple categoricals
df['city_state'] = df['city'] + '_' + df['state']
df['product_brand'] = df['product'] + '_' + df['brand']

# Running occurrence count per user-product pair (cumcount is 0 the first time a pair appears)
df['user_product_count'] = df.groupby(['user_id', 'product_id']).cumcount()

# Binary interactions
df['is_weekend_premium'] = ((df['is_weekend'] == 1) & (df['is_premium'] == 1)).astype(int)

📅 Date & Time Features

# Convert to datetime
df['date'] = pd.to_datetime(df['date'])

# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek  # 0=Monday
df['dayofyear'] = df['date'].dt.dayofyear
df['quarter'] = df['date'].dt.quarter
df['week'] = df['date'].dt.isocalendar().week
df['hour'] = df['date'].dt.hour

# Cyclical encoding (for periodic features: makes December adjacent to January, hour 23 adjacent to hour 0)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

# Boolean flags
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['is_month_start'] = df['date'].dt.is_month_start.astype(int)
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)
df['is_quarter_end'] = df['date'].dt.is_quarter_end.astype(int)

# Time since event
df['days_since_first'] = (df['date'] - df['date'].min()).dt.days
df['days_until_end'] = (df['date'].max() - df['date']).dt.days

# Time between events
df['days_since_last_purchase'] = df.groupby('user_id')['date'].diff().dt.days

📍 Geospatial Features

# Haversine distance between two points
def haversine_distance(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in km
    
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c

# Distance to city center
city_center_lat, city_center_lon = 40.7128, -74.0060  # NYC
df['distance_to_center'] = haversine_distance(
    df['latitude'], df['longitude'],
    city_center_lat, city_center_lon
)

# Clustering coordinates
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10, random_state=42)
df['location_cluster'] = kmeans.fit_predict(df[['latitude', 'longitude']])

# Grid binning
df['lat_bin'] = pd.cut(df['latitude'], bins=20)
df['lon_bin'] = pd.cut(df['longitude'], bins=20)
df['location_grid'] = df['lat_bin'].astype(str) + '_' + df['lon_bin'].astype(str)

🔤 Text Features

# Basic text statistics
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['avg_word_length'] = df['text_length'] / df['word_count']

# Special characters
df['num_uppercase'] = df['text'].str.count(r'[A-Z]')
df['num_digits'] = df['text'].str.count(r'\d')
df['num_punctuation'] = df['text'].str.count(r'[.,!?;:]')
df['has_url'] = df['text'].str.contains('http', case=False).astype(int)
df['has_email'] = df['text'].str.contains(r'\S+@\S+').astype(int)

# Sentiment (requires TextBlob: pip install textblob)
from textblob import TextBlob
df['sentiment_polarity'] = df['text'].apply(lambda x: TextBlob(x).sentiment.polarity)
df['sentiment_subjectivity'] = df['text'].apply(lambda x: TextBlob(x).sentiment.subjectivity)

# TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=100, stop_words='english')
tfidf_features = tfidf.fit_transform(df['text'])

# Add to dataframe (columns named after the learned vocabulary; may be fewer than max_features)
tfidf_df = pd.DataFrame(tfidf_features.toarray(),
                        columns=[f'tfidf_{w}' for w in tfidf.get_feature_names_out()],
                        index=df.index)
df = pd.concat([df, tfidf_df], axis=1)

🎨 Interaction Features

# Arithmetic interactions
df['price_per_sqft'] = df['price'] / df['sqft']
df['total_rooms'] = df['bedrooms'] + df['bathrooms']
df['room_ratio'] = df['bedrooms'] / (df['bathrooms'] + 1)

# Polynomial features (automated)
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
interactions = poly.fit_transform(df[['feature1', 'feature2', 'feature3']])
# poly.get_feature_names_out() gives the name of each generated term

# Ratio features
df['conversion_rate'] = df['conversions'] / (df['clicks'] + 1)
df['ctr'] = df['clicks'] / (df['impressions'] + 1)

# Difference features
df['price_diff_from_avg'] = df['price'] - df['price'].mean()
df['age_diff'] = df['age'] - df['age'].median()

# Domain-specific features
# Example: E-commerce
df['is_first_purchase'] = (df.groupby('user_id').cumcount() == 0).astype(int)  # assumes rows sorted by date
df['purchase_recency'] = df.groupby('user_id')['date'].diff().dt.days
df['avg_basket_size'] = df.groupby('user_id')['items'].transform('mean')

⚡ Automated Feature Engineering

Featuretools

# Install: pip install featuretools
import featuretools as ft

# Create entity set
es = ft.EntitySet(id='transactions')

# Add entities (transactions_df and customers_df are your raw dataframes)
es = es.add_dataframe(
    dataframe_name='transactions',
    dataframe=transactions_df,
    index='transaction_id',
    time_index='timestamp'
)

es = es.add_dataframe(
    dataframe_name='customers',
    dataframe=customers_df,
    index='customer_id'
)

# Define relationship
es = es.add_relationship('customers', 'customer_id',
                         'transactions', 'customer_id')

# Generate features automatically
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='transactions',
    max_depth=2,
    verbose=True
)

print(f"Generated {len(feature_defs)} features")
print(feature_matrix.head())

💡 Best Practices

  1. Fit every transformer (scaler, encoder, discretizer, vectorizer) on the training data only, then apply it to validation and test data.
  2. Start from a hypothesis about the problem; domain knowledge usually beats brute-force feature generation.
  3. Keep a new feature only if it improves cross-validated performance.

⚠️ Common Mistakes

  1. Target leakage: computing target encodings or group statistics over the full dataset, including the rows being predicted.
  2. Look-ahead bias in time series: lag and rolling features must use only past values.
  3. One-hot encoding high-cardinality categoricals and exploding dimensionality (prefer target or frequency encoding).
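
A minimal sketch of the leakage pattern (X and y here are a hypothetical feature matrix and target): the leaky version lets test-set statistics into the scaler, while the correct version fits on the training split only:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Leaky: the scaler's mean/std include the test rows
# X_scaled = StandardScaler().fit_transform(X)

# Correct: fit on train, apply the same statistics to test
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)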

🔍 Feature Engineering Checklist

  1. Understand the data: Explore distributions, missing values, outliers
  2. Handle missing values: Impute or create indicators
  3. Encode categoricals: One-hot, target, frequency encoding
  4. Scale numerical features: Standardize or normalize
  5. Create date features: Extract year, month, day, cyclical
  6. Aggregate features: Group statistics, rolling windows
  7. Interaction features: Ratios, products, domain-specific
  8. Transform skewed features: Log, sqrt, Box-Cox
  9. Bin numerical features: Sometimes helps tree models
  10. Create text features: Length, TF-IDF, sentiment
  11. Validate features: Check feature importance and model performance
  12. Remove redundant features: Highly correlated or zero variance (see the sketch after this list)
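
A minimal sketch for the last step, assuming X is a hypothetical numeric feature DataFrame: drop zero-variance columns with VarianceThreshold, then drop one feature from each highly correlated pair:

import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Drop zero-variance features
selector = VarianceThreshold(threshold=0.0)
X_reduced = pd.DataFrame(selector.fit_transform(X),
                         columns=X.columns[selector.get_support()])

# Drop one feature from each pair with |correlation| > 0.95
corr = X_reduced.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_final = X_reduced.drop(columns=to_drop)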

🎯 Key Takeaways