Why Preprocessing Matters
Data preprocessing is crucial for ML success. Raw data is messy: missing values, mismatched scales, categorical variables, outliers. Careful preprocessing is often worth a noticeable accuracy gain, and sloppy preprocessing is one of the most common reasons a model underperforms.
Key Preprocessing Steps:
- Handle missing values
- Encode categorical variables
- Scale/normalize features
- Handle outliers
- Feature transformation
Loading and Exploring Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load data
df = pd.read_csv('data.csv')
# Quick overview
print(df.head())
print(df.info())
print(df.describe())
# Check for missing values
print(df.isnull().sum())
# Visualize distributions
df.hist(figsize=(12, 8))
plt.tight_layout()
plt.show()
Handling Missing Values
Strategy 1: Remove Missing Data
# Drop rows with any missing values
df_clean = df.dropna()
# Drop rows where specific column is missing
df_clean = df.dropna(subset=['important_column'])
# Drop columns with too many missing values (>50%)
threshold = len(df) * 0.5
df_clean = df.dropna(thresh=threshold, axis=1)
Strategy 2: Imputation
from sklearn.impute import SimpleImputer
# Mean imputation for numerical features
imputer = SimpleImputer(strategy='mean')
df['age'] = imputer.fit_transform(df[['age']])
# Median (better for outliers)
imputer = SimpleImputer(strategy='median')
df['income'] = imputer.fit_transform(df[['income']])
# Mode for categorical features
imputer = SimpleImputer(strategy='most_frequent')
df['category'] = imputer.fit_transform(df[['category']])
# Forward fill (time series)
df['price'] = df['price'].ffill()  # fillna(method='ffill') is deprecated in recent pandas
# Custom value
df['discount'] = df['discount'].fillna(0)
Strategy 3: Advanced Imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Use other features to predict missing values
imputer = IterativeImputer(random_state=42)
df_imputed = pd.DataFrame(
imputer.fit_transform(df),
columns=df.columns
)
Encoding Categorical Variables
Label Encoding (Ordinal)
from sklearn.preprocessing import LabelEncoder
# LabelEncoder assigns integer codes in alphabetical order,
# which may not match the intended ordinal ranking (low, medium, high)
le = LabelEncoder()
df['education'] = le.fit_transform(df['education'])
# Output: 0, 1, 2, ... (one integer per category, alphabetical order)
# Manual mapping for custom order
education_map = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
df['education'] = df['education'].map(education_map)
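If you want the ordinal encoding to live inside a scikit-learn pipeline, OrdinalEncoder with an explicit category order does the same job as the manual map. A minimal sketch, assuming the same hypothetical education column:
from sklearn.preprocessing import OrdinalEncoder
# Explicit category order preserves the ordinal meaning
# (unlike LabelEncoder's alphabetical codes)
encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
df[['education']] = encoder.fit_transform(df[['education']])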
One-Hot Encoding (Nominal)
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# Pandas method (simple)
df_encoded = pd.get_dummies(df, columns=['color', 'brand'], drop_first=True)
# Scikit-learn method (more control)
encoder = OneHotEncoder(sparse_output=False, drop='first')  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
encoded = encoder.fit_transform(df[['color', 'brand']])
encoded_df = pd.DataFrame(
encoded,
columns=encoder.get_feature_names_out()
)
# Example: color=['red', 'blue', 'green']
# Result: color_blue, color_green (red is baseline)
Frequency Encoding
# Replace categories with their frequency
freq_map = df['city'].value_counts().to_dict()
df['city_freq'] = df['city'].map(freq_map)
# Good for high-cardinality features
Target Encoding
# Replace category with mean of target variable
target_mean = df.groupby('city')['target'].mean()
df['city_encoded'] = df['city'].map(target_mean)
# Warning: Can cause overfitting, use cross-validation
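To limit that leakage, a common pattern is out-of-fold target encoding: each row is encoded with target means computed on the other folds only. A minimal sketch, assuming the same hypothetical city and target columns:
import numpy as np
from sklearn.model_selection import KFold
df['city_encoded'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    # Means computed on the training fold only, then applied to the held-out fold
    fold_means = df.iloc[train_idx].groupby('city')['target'].mean()
    df.loc[df.index[val_idx], 'city_encoded'] = df['city'].iloc[val_idx].map(fold_means)
# Cities unseen in a training fold get NaN; fall back to the global mean
df['city_encoded'] = df['city_encoded'].fillna(df['target'].mean())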
Feature Scaling
Why Scale?
Algorithms such as SVM, KNN, and neural networks are sensitive to feature scales. Scaling puts features on comparable ranges so no single feature dominates just because of its units.
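A quick illustration of the problem, with made-up numbers: when one feature spans thousands of units and another spans tens, the large one dominates any distance-based comparison.
import numpy as np
# Two customers described by (age, income) on their raw scales
a = np.array([25, 50_000])
b = np.array([60, 52_000])
print(np.linalg.norm(a - b))   # ~2000: the distance is almost entirely income
# After scaling both features into comparable ranges, age matters again
a_scaled = np.array([0.1, 0.48])
b_scaled = np.array([0.8, 0.50])
print(np.linalg.norm(a_scaled - b_scaled))  # ~0.7: both features contribute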
Min-Max Normalization
from sklearn.preprocessing import MinMaxScaler
# Scale to [0, 1]
scaler = MinMaxScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
# Scale to custom range [-1, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
df_scaled = scaler.fit_transform(df)
Standardization (Z-score)
from sklearn.preprocessing import StandardScaler
# Mean=0, Std=1
scaler = StandardScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
# Less distorted by outliers than min-max scaling (for heavy outliers see RobustScaler below)
# Works well with: SVM, Logistic Regression, Neural Networks
Robust Scaling
from sklearn.preprocessing import RobustScaler
# Uses median and IQR, so it is far less sensitive to outliers
scaler = RobustScaler()
df_scaled = scaler.fit_transform(df)
# Best when data has many outliers
When to Use Each?
- MinMaxScaler: Neural networks, image data, bounded features
- StandardScaler: Most ML algorithms (SVM, logistic regression)
- RobustScaler: Data with outliers
- No scaling: Tree-based models (Random Forest, XGBoost)
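To make the difference concrete, here is a small sketch that applies all three scalers to the same column containing one outlier (made-up values):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier
print(MinMaxScaler().fit_transform(x).ravel())    # outlier squashes the normal values toward 0
print(StandardScaler().fit_transform(x).ravel())  # mean and std are both pulled by the outlier
print(RobustScaler().fit_transform(x).ravel())    # median/IQR keep the normal values well spread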
Handling Outliers
Detect Outliers
import numpy as np
# Method 1: IQR (Interquartile Range)
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['price'] < lower_bound) | (df['price'] > upper_bound)]
print(f"Found {len(outliers)} outliers")
# Method 2: Z-score
from scipy import stats
z_scores = np.abs(stats.zscore(df['price']))
outliers = df[z_scores > 3] # >3 standard deviations
# Visualize
import seaborn as sns
sns.boxplot(x=df['price'])
plt.show()
Handle Outliers
# Option 1: Remove outliers
df_clean = df[(df['price'] >= lower_bound) & (df['price'] <= upper_bound)]
# Option 2: Cap outliers (winsorization)
df['price'] = df['price'].clip(lower=lower_bound, upper=upper_bound)
# Option 3: Transform (log, sqrt)
df['price_log'] = np.log1p(df['price']) # log(1+x) handles zeros
# Option 4: Binning
df['price_bin'] = pd.cut(df['price'], bins=[0, 100, 500, 1000, np.inf], labels=['low', 'medium', 'high', 'premium'])
Feature Transformation
Log Transform (for skewed data)
import numpy as np
# Original data is right-skewed
df['income_log'] = np.log1p(df['income']) # log(1+x)
# Square root transform (less aggressive)
df['price_sqrt'] = np.sqrt(df['price'])
# Box-Cox transform (finds optimal transformation)
from scipy.stats import boxcox
df['sales_transformed'], lambda_param = boxcox(df['sales'] + 1)  # Box-Cox needs strictly positive input
Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
# Create interaction terms and powers
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['age', 'income']])
# Original: [age, income]
# New: [age, income, age², age×income, income²]
Complete Preprocessing Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# Define column types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['gender', 'occupation', 'city']
# Numeric pipeline
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Categorical pipeline
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))
])
# Combine pipelines
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Use in a full ML pipeline
from sklearn.ensemble import RandomForestClassifier
full_pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(random_state=42))
])
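# X_train/X_test/y_train/y_test are assumed to come from an earlier split.
# A minimal sketch, assuming the label lives in a (hypothetical) 'target' column:
from sklearn.model_selection import train_test_split
X = df.drop(columns=['target'])
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)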
# Fit and predict (preprocessing happens automatically!)
full_pipeline.fit(X_train, y_train)
predictions = full_pipeline.predict(X_test)
Preprocessing Checklist
- Loaded and explored data (shape, types, distributions)
- Handled missing values (removed or imputed)
- Encoded categorical variables (label/one-hot/target encoding)
- Scaled numerical features (standardization/normalization)
- Detected and handled outliers
- Transformed skewed features (log, sqrt, Box-Cox)
- Created derived features (if needed)
- Split data into train/test sets
- Built preprocessing pipeline for reproducibility
Key Takeaways
- Missing values: Impute or remove based on percentage and importance
- Categorical encoding: One-hot for nominal, label for ordinal
- Scaling: Essential for distance-based algorithms, not for trees
- Outliers: Detect with IQR/Z-score, handle based on domain knowledge
- Pipelines: Use sklearn pipelines for reproducible preprocessing
- Train/test: Fit preprocessing only on training data (see the sketch below)!
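A minimal sketch of that last point, assuming numeric X_train/X_test from an earlier split: the scaler learns its statistics from the training set and only transforms the test set.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test data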