What is Feature Engineering?
Feature engineering is the art of creating new features from existing data to improve model performance. It is often the difference between a mediocre model and an outstanding one.
"Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering."
— Andrew Ng
🔢 Numerical Feature Engineering
1. Binning (Discretization)
import pandas as pd
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
# Example: Age feature
df = pd.DataFrame({'age': [18, 25, 35, 45, 55, 65, 75, 85]})
# Manual binning
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 30, 50, 100],
                         labels=['child', 'young', 'middle', 'senior'])
# Equal-width binning
df['age_bin_width'] = pd.cut(df['age'], bins=4)
# Equal-frequency binning (quantiles)
df['age_bin_freq'] = pd.qcut(df['age'], q=4)
# KBinsDiscretizer from sklearn
discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
df['age_bin_sklearn'] = discretizer.fit_transform(df[['age']]).ravel()  # flatten the (n, 1) output
print(df)
2. Mathematical Transformations
# Log transform (for right-skewed data)
df['income_log'] = np.log1p(df['income']) # log(1 + x) to handle zeros
# Square root
df['distance_sqrt'] = np.sqrt(df['distance'])
# Square (polynomial features)
df['age_squared'] = df['age'] ** 2
# Reciprocal
df['time_inv'] = 1 / (df['time'] + 1e-5) # Avoid division by zero
# Box-Cox transformation (input must be strictly positive; the +1 shifts zeros)
from scipy.stats import boxcox
df['price_boxcox'], lambda_param = boxcox(df['price'] + 1)
# Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['feature_scaled'] = scaler.fit_transform(df[['feature']])
3. Aggregations and Statistics
# Rolling statistics (time series)
df['sales_rolling_mean_7d'] = df['sales'].rolling(window=7).mean()
df['sales_rolling_std_7d'] = df['sales'].rolling(window=7).std()
df['sales_rolling_max_30d'] = df['sales'].rolling(window=30).max()
# Lag features
df['sales_lag_1'] = df['sales'].shift(1)
df['sales_lag_7'] = df['sales'].shift(7)
# Difference features
df['sales_diff_1'] = df['sales'].diff(1)
df['sales_pct_change'] = df['sales'].pct_change()
# Expanding window statistics
df['sales_expanding_mean'] = df['sales'].expanding().mean()
# Group statistics
df['user_avg_spend'] = df.groupby('user_id')['spend'].transform('mean')
df['user_total_orders'] = df.groupby('user_id')['order_id'].transform('count')
df['product_avg_rating'] = df.groupby('product_id')['rating'].transform('mean')
📝 Categorical Feature Engineering
1. Target Encoding
# Mean target encoding (naive version: leaks target information; see the out-of-fold sketch below)
target_means = df.groupby('category')['target'].mean()
df['category_target_mean'] = df['category'].map(target_means)
# Smoothed target encoding (handles rare categories)
def target_encode_smooth(series, target, alpha=10):
    global_mean = target.mean()
    agg = pd.DataFrame({'count': series.groupby(series).size(),
                        'mean': target.groupby(series).mean()})
    smoothed = (agg['count'] * agg['mean'] + alpha * global_mean) / (agg['count'] + alpha)
    return series.map(smoothed)
df['category_smooth'] = target_encode_smooth(df['category'], df['target'])
# Important: Use cross-validation to avoid leakage!
from category_encoders import TargetEncoder
encoder = TargetEncoder(cols=['category'])
# Fit on train, transform on test separately:
# encoder.fit(X_train, y_train); X_test_encoded = encoder.transform(X_test)
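Since the naive mean encoding above lets each row see its own target value, here is a minimal out-of-fold sketch (assuming the same df with 'category' and 'target' columns; the helper name target_encode_oof is illustrative). Each fold is encoded with statistics computed on the other folds only:
from sklearn.model_selection import KFold
def target_encode_oof(df, col, target, n_splits=5):
    # Each row is encoded with target means learned from the other folds only
    encoded = pd.Series(np.nan, index=df.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = df[col].iloc[val_idx].map(fold_means).to_numpy()
    # Categories unseen in a fold's training part fall back to the global mean
    return encoded.fillna(df[target].mean())
df['category_target_oof'] = target_encode_oof(df, 'category', 'target')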
2. Frequency Encoding
# Count how often each category appears
freq = df['category'].value_counts()
df['category_freq'] = df['category'].map(freq)
# Normalized frequency
df['category_freq_norm'] = df['category'].map(freq / len(df))
# Rare category indicator
df['is_rare_category'] = (df['category_freq'] < 10).astype(int)
3. Combination Features
# Concatenate multiple categoricals
df['city_state'] = df['city'] + '_' + df['state']
df['product_brand'] = df['product'] + '_' + df['brand']
# Running count of each user-product pair (cumcount is 0 for the first occurrence)
df['user_product_count'] = df.groupby(['user_id', 'product_id']).cumcount()
# Binary interactions
df['is_weekend_premium'] = ((df['is_weekend'] == 1) & (df['is_premium'] == 1)).astype(int)
📅 Date & Time Features
# Convert to datetime
df['date'] = pd.to_datetime(df['date'])
# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek # 0=Monday
df['dayofyear'] = df['date'].dt.dayofyear
df['quarter'] = df['date'].dt.quarter
df['week'] = df['date'].dt.isocalendar().week
df['hour'] = df['date'].dt.hour
# Cyclical encoding (for periodic features)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
# Boolean flags
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['is_month_start'] = df['date'].dt.is_month_start.astype(int)
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)
df['is_quarter_end'] = df['date'].dt.is_quarter_end.astype(int)
# Time since event
df['days_since_first'] = (df['date'] - df['date'].min()).dt.days
df['days_until_end'] = (df['date'].max() - df['date']).dt.days
# Time between events
df['days_since_last_purchase'] = df.groupby('user_id')['date'].diff().dt.days
📍 Geospatial Features
# Haversine distance between two points
def haversine_distance(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in km
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c
# Distance to city center
city_center_lat, city_center_lon = 40.7128, -74.0060 # NYC
df['distance_to_center'] = haversine_distance(
    df['latitude'], df['longitude'],
    city_center_lat, city_center_lon
)
# Clustering coordinates
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10, random_state=42)
df['location_cluster'] = kmeans.fit_predict(df[['latitude', 'longitude']])
# Grid binning
df['lat_bin'] = pd.cut(df['latitude'], bins=20)
df['lon_bin'] = pd.cut(df['longitude'], bins=20)
df['location_grid'] = df['lat_bin'].astype(str) + '_' + df['lon_bin'].astype(str)
🔤 Text Features
# Basic text statistics
df['char_count'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['avg_word_length'] = df['char_count'] / df['word_count']
# Special characters
df['num_uppercase'] = df['text'].str.count(r'[A-Z]')
df['num_digits'] = df['text'].str.count(r'\d')
df['num_punctuation'] = df['text'].str.count(r'[.,!?;:]')
df['has_url'] = df['text'].str.contains('http', case=False).astype(int)
df['has_email'] = df['text'].str.contains(r'\S+@\S+').astype(int)
# Sentiment (requires TextBlob: pip install textblob)
from textblob import TextBlob
df['sentiment_polarity'] = df['text'].apply(lambda x: TextBlob(x).sentiment.polarity)
df['sentiment_subjectivity'] = df['text'].apply(lambda x: TextBlob(x).sentiment.subjectivity)
# TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=100, stop_words='english')
tfidf_features = tfidf.fit_transform(df['text'])
# Add to dataframe
tfidf_df = pd.DataFrame(tfidf_features.toarray(),
                        columns=[f'tfidf_{w}' for w in tfidf.get_feature_names_out()],
                        index=df.index)  # align index so concat doesn't misplace rows
df = pd.concat([df, tfidf_df], axis=1)
🎨 Interaction Features
# Arithmetic interactions
df['price_per_sqft'] = df['price'] / df['sqft']
df['total_rooms'] = df['bedrooms'] + df['bathrooms']
df['room_ratio'] = df['bedrooms'] / (df['bathrooms'] + 1)
# Polynomial features (automated)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
interactions = poly.fit_transform(df[['feature1', 'feature2', 'feature3']])
# Ratio features
df['conversion_rate'] = df['conversions'] / (df['clicks'] + 1)
df['ctr'] = df['clicks'] / (df['impressions'] + 1)
# Difference features
df['price_diff_from_avg'] = df['price'] - df['price'].mean()
df['age_diff'] = df['age'] - df['age'].median()
# Domain-specific features
# Example: E-commerce
df['is_first_purchase'] = (df.groupby('user_id').cumcount() == 0).astype(int)
df['purchase_recency'] = df.groupby('user_id')['date'].diff().dt.days
df['avg_basket_size'] = df.groupby('user_id')['items'].transform('mean')
⚡ Automated Feature Engineering
Featuretools
# Install: pip install featuretools
import featuretools as ft
# Create entity set
es = ft.EntitySet(id='transactions')
# Add entities
es = es.add_dataframe(
    dataframe_name='transactions',
    dataframe=transactions_df,
    index='transaction_id',
    time_index='timestamp'
)
es = es.add_dataframe(
    dataframe_name='customers',
    dataframe=customers_df,
    index='customer_id'
)
# Define relationship (parent dataframe first, then child)
es = es.add_relationship('customers', 'customer_id',
                         'transactions', 'customer_id')
# Generate features automatically
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='transactions',
    max_depth=2,
    verbose=True
)
print(f"Generated {len(feature_defs)} features")
print(feature_matrix.head())
💡 Best Practices
- Start simple: Basic features first, then complex
- Domain knowledge: Best features come from understanding the problem
- Avoid leakage: Don't use future information or test data
- Handle missing values: Create "is_missing" indicator features (see the sketch after this list)
- Normalize/scale: After creating features, scale if needed
- Iterate: Create → Test → Analyze → Repeat
- Feature importance: Use to identify valuable features
- Document: Keep track of feature engineering logic
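As a concrete example of the missing-values practice above, a minimal sketch assuming df has a nullable 'income' column. The indicator is created before imputation so that missingness itself stays available as a signal:
# Flag missingness before imputing; the flag can be predictive on its own
df['income_missing'] = df['income'].isna().astype(int)
df['income'] = df['income'].fillna(df['income'].median())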
⚠️ Common Mistakes
- Target leakage: Using information not available at prediction time
- Look-ahead bias: Using future information in time series
- Not handling train/test separately: Fit encoders and scalers on train only (see the sketch after this list)
- Creating too many features: Causes overfitting and slow training
- Ignoring missing values: Can indicate important patterns
- Not validating impact: Always check if new features help
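To make the train/test mistake concrete, a minimal sketch assuming a numeric df['feature'] column: the scaler learns its statistics on the training split only and merely applies them to the test split. The same fit/transform discipline applies to any encoder.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test = train_test_split(df[['feature']], test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test)        # apply train statistics; never refit on test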
🔍 Feature Engineering Checklist
- ✅ Understand the data: Explore distributions, missing values, outliers
- ✅ Handle missing values: Impute or create indicators
- ✅ Encode categoricals: One-hot, target, frequency encoding
- ✅ Scale numerical features: Standardize or normalize
- ✅ Create date features: Extract year, month, day, cyclical
- ✅ Aggregate features: Group statistics, rolling windows
- ✅ Interaction features: Ratios, products, domain-specific
- ✅ Transform skewed features: Log, sqrt, Box-Cox
- ✅ Bin numerical features: Sometimes helps tree models
- ✅ Create text features: Length, TF-IDF, sentiment
- ✅ Validate features: Check feature importance and model performance
- ✅ Remove redundant features: Highly correlated or zero variance (see the sketch below)
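For the last checklist item, a minimal sketch assuming a numeric feature DataFrame X (the name and the 0.95 threshold are illustrative): drop zero-variance columns first, then drop one feature from each highly correlated pair.
from sklearn.feature_selection import VarianceThreshold
# Drop zero-variance columns
support = VarianceThreshold(threshold=0.0).fit(X).get_support()
X_reduced = X.loc[:, support]
# Drop one feature from each highly correlated pair (|r| > 0.95)
corr = X_reduced.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
X_reduced = X_reduced.drop(columns=to_drop)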
🎯 Key Takeaways
- Feature engineering often more important than algorithm choice
- Domain knowledge is crucial for creating meaningful features
- Date features: Extract components, create cyclical encodings
- Aggregations: Group statistics, rolling windows, lag features
- Avoid leakage: No future information, fit on train only
- Interaction features: Combinations can capture complex patterns
- Iterate and validate: Test impact of new features