🎯 Project Overview
Build a machine learning model to predict house prices from features such as median income, house age, average number of rooms, and location. This is a classic regression problem, and this project uses the California Housing dataset (the older Boston Housing dataset has been removed from recent scikit-learn releases, so it is not used here).
What You'll Learn:
- Data exploration and visualization
- Feature engineering and selection
- Multiple linear regression
- Model evaluation and optimization
- Handling real-world data issues
🛠️ Technologies
Python 3.8+
Pandas
NumPy
Scikit-learn
Matplotlib
Seaborn
📦 Step 1: Setup and Import Libraries
# Install required packages
# pip install pandas numpy scikit-learn matplotlib seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
📊 Step 2: Load and Explore Data
# Load California Housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['Price'] = housing.target
# Display basic information
print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nDataset Info:")
print(df.info())
print("\nStatistical Summary:")
print(df.describe())
print("\nMissing Values:")
print(df.isnull().sum())
print("\nFeature Descriptions:")
# Note: this slice of housing.DESCR assumes a particular line layout,
# which can shift between scikit-learn versions; adjust the indices if the output looks off
for feature, description in zip(housing.feature_names, housing.DESCR.split('\n')[9:18]):
    print(f"{feature}: {description}")
Output:
Dataset Shape: (20640, 9)
First few rows:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude Price
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
Features:
- MedInc: median income in the block group
- HouseAge: median house age in the block group
- AveRooms: average number of rooms per household
- AveBedrms: average number of bedrooms per household
- Population: block group population
- AveOccup: average number of household members
- Latitude: block group latitude
- Longitude: block group longitude
- Price (target): median house value, in units of $100,000
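Before plotting anything, it is worth glancing at the raw value ranges; a handful of block groups have very large AveRooms and AveOccup values, the kind of real-world data issue this project is about:
# Check min/max of the per-household features; a few block groups are extreme
print(df[['AveRooms', 'AveBedrms', 'AveOccup', 'Population']].describe().loc[['min', 'max']])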
📈 Step 3: Exploratory Data Analysis (EDA)
# 1. Distribution of target variable
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.hist(df['Price'], bins=50, edgecolor='black')
plt.xlabel('House Price ($100k)')
plt.ylabel('Frequency')
plt.title('Distribution of House Prices')
plt.subplot(1, 2, 2)
plt.boxplot(df['Price'])
plt.ylabel('House Price ($100k)')
plt.title('Box Plot of House Prices')
plt.tight_layout()
plt.show()
# 2. Correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm',
square=True, linewidths=0.5)
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()
# 3. Key correlations with price
print("\nCorrelations with Price:")
print(correlation_matrix['Price'].sort_values(ascending=False))
# 4. Scatter plots for top features
top_features = correlation_matrix['Price'].abs().sort_values(ascending=False)[1:5].index
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()
for idx, feature in enumerate(top_features):
    axes[idx].scatter(df[feature], df['Price'], alpha=0.5)
    axes[idx].set_xlabel(feature)
    axes[idx].set_ylabel('Price')
    axes[idx].set_title(f'{feature} vs Price')
plt.tight_layout()
plt.show()
# 5. Geographic scatter plot
plt.figure(figsize=(10, 8))
scatter = plt.scatter(df['Longitude'], df['Latitude'],
c=df['Price'], s=df['Population']/100,
alpha=0.4, cmap='viridis')
plt.colorbar(scatter, label='Price ($100k)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('California Housing Prices by Location')
plt.show()
Key Insights from EDA:
- MedInc has the strongest correlation with price (0.69)
- Latitude and Longitude show geographic patterns
- Price distribution is right-skewed (most houses in lower price range)
- Values at the top of the range pile up at 5.0 ($500k) because the original survey capped median house values there; these are a data artifact rather than ordinary outliers
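To put numbers on the skew and the cap mentioned above:
# Quantify the right skew and the pile-up at the $500k cap
print("Skewness of Price:", round(df['Price'].skew(), 3))
print("Block groups at the 5.0 cap:", (df['Price'] >= 5.0).sum())
print("Share at the cap:", round((df['Price'] >= 5.0).mean(), 4))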
🔧 Step 4: Feature Engineering
# Create new features
df['RoomsPerHousehold'] = df['AveRooms'] / df['AveOccup']
df['BedroomsPerRoom'] = df['AveBedrms'] / df['AveRooms']
df['PopulationPerHousehold'] = df['Population'] / df['AveOccup']
# Create a location-based feature: rough straight-line distance, in degrees,
# from a reference point near San Francisco (37.8°N, 122.4°W)
df['DistanceToCenter'] = np.sqrt(
    (df['Latitude'] - 37.8)**2 +
    (df['Longitude'] + 122.4)**2
)
# Income categories
df['IncomeCategory'] = pd.cut(df['MedInc'],
bins=[0, 2.5, 4.5, 6.0, np.inf],
labels=['Low', 'Medium', 'High', 'Very High'])
# One-hot encode categorical features
df_encoded = pd.get_dummies(df, columns=['IncomeCategory'], drop_first=True)
print("New features created:")
print(df_encoded.columns.tolist())
print("\nCorrelations with Price (new features):")
new_feature_corr = df_encoded[[
'RoomsPerHousehold', 'BedroomsPerRoom',
'PopulationPerHousehold', 'DistanceToCenter'
]].corrwith(df_encoded['Price'])
print(new_feature_corr.sort_values(ascending=False))
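Because the new features involve division, a quick check that no infinities or NaNs slipped in is cheap insurance (none are expected with this dataset):
# Verify the ratio features contain no inf or NaN values
ratio_cols = ['RoomsPerHousehold', 'BedroomsPerRoom', 'PopulationPerHousehold']
print(np.isinf(df_encoded[ratio_cols]).sum())
print(df_encoded[ratio_cols].isna().sum())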
🎯 Step 5: Prepare Data for Training
# Separate features and target
X = df_encoded.drop('Price', axis=1)
y = df_encoded['Price']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("\nFeature scaling completed")
print("Mean after scaling:", X_train_scaled.mean())
print("Std after scaling:", X_train_scaled.std())
🤖 Step 6: Train Multiple Models
# Initialize models
models = {
'Linear Regression': LinearRegression(),
'Ridge Regression': Ridge(alpha=1.0),
'Lasso Regression': Lasso(alpha=0.1),
'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}
# Train and evaluate each model
results = {}
for name, model in models.items():
    print(f"\n{'='*50}")
    print(f"Training {name}...")
    print('='*50)

    # Train
    model.fit(X_train_scaled, y_train)

    # Predictions
    y_train_pred = model.predict(X_train_scaled)
    y_test_pred = model.predict(X_test_scaled)

    # Evaluate
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    test_mae = mean_absolute_error(y_test, y_test_pred)

    # Cross-validation
    cv_scores = cross_val_score(model, X_train_scaled, y_train,
                                cv=5, scoring='r2')

    results[name] = {
        'Train R²': train_r2,
        'Test R²': test_r2,
        'Train RMSE': train_rmse,
        'Test RMSE': test_rmse,
        'Test MAE': test_mae,
        'CV R² Mean': cv_scores.mean(),
        'CV R² Std': cv_scores.std()
    }

    print(f"Train R²: {train_r2:.4f}")
    print(f"Test R²: {test_r2:.4f}")
    print(f"Test RMSE: ${test_rmse*100:.2f}k")
    print(f"Test MAE: ${test_mae*100:.2f}k")
    print(f"Cross-Val R² (mean ± std): {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
# Create results DataFrame
results_df = pd.DataFrame(results).T
print("\n" + "="*70)
print("MODEL COMPARISON")
print("="*70)
print(results_df)
Expected Output:
==================================================
Training Linear Regression...
==================================================
Train R²: 0.6063
Test R²: 0.5979
Test RMSE: $73.42k
Test MAE: $52.91k
Cross-Val R² (mean ± std): 0.6012 ± 0.0234
==================================================
Training Random Forest...
==================================================
Train R²: 0.9756
Test R²: 0.8123
Test RMSE: $50.12k
Test MAE: $32.67k
Cross-Val R² (mean ± std): 0.8089 ± 0.0189
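In this run Random Forest comes out on top. Rather than hard-coding the winner, you can also identify it programmatically from the comparison table built in Step 6:
# Identify the model with the highest test R² from the comparison table
best_name = results_df['Test R²'].idxmax()
print(f"Best model by test R²: {best_name}")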
📊 Step 7: Model Evaluation and Visualization
# Select best model (Random Forest in this case)
best_model = models['Random Forest']
y_pred = best_model.predict(X_test_scaled)
# 1. Actual vs Predicted
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
'r--', lw=2)
plt.xlabel('Actual Price ($100k)')
plt.ylabel('Predicted Price ($100k)')
plt.title('Actual vs Predicted House Prices')
# 2. Residuals plot
residuals = y_test - y_pred
plt.subplot(1, 2, 2)
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Price ($100k)')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.tight_layout()
plt.show()
# 3. Feature importance (for Random Forest)
feature_importance = pd.DataFrame({
'Feature': X.columns,
'Importance': best_model.feature_importances_
}).sort_values('Importance', ascending=False)
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'][:10],
feature_importance['Importance'][:10])
plt.xlabel('Importance')
plt.title('Top 10 Feature Importances')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
print("\nTop 10 Important Features:")
print(feature_importance.head(10))
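Impurity-based importances from tree ensembles can be biased toward features with many distinct values. As a cross-check, permutation importance on the held-out test set measures how much shuffling each feature hurts R²; a short sketch:
from sklearn.inspection import permutation_importance

# Shuffle each feature on the test set and measure the resulting drop in R²
perm = permutation_importance(best_model, X_test_scaled, y_test,
                              n_repeats=5, random_state=42, scoring='r2')
perm_importance = pd.Series(perm.importances_mean, index=X.columns)
print("\nPermutation importances (test set):")
print(perm_importance.sort_values(ascending=False).head(10))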
🎯 Step 8: Make Predictions on New Data
# Create a function to predict house price
def predict_house_price(med_inc, house_age, ave_rooms, ave_bedrms,
                        population, ave_occup, latitude, longitude):
    """
    Predict house price from raw features.

    Relies on `X`, `scaler`, and `best_model` defined in the previous steps.
    """
    # Create a single-row feature DataFrame
    features = pd.DataFrame({
        'MedInc': [med_inc],
        'HouseAge': [house_age],
        'AveRooms': [ave_rooms],
        'AveBedrms': [ave_bedrms],
        'Population': [population],
        'AveOccup': [ave_occup],
        'Latitude': [latitude],
        'Longitude': [longitude]
    })

    # Engineer the same features as in training
    features['RoomsPerHousehold'] = features['AveRooms'] / features['AveOccup']
    features['BedroomsPerRoom'] = features['AveBedrms'] / features['AveRooms']
    features['PopulationPerHousehold'] = features['Population'] / features['AveOccup']
    features['DistanceToCenter'] = np.sqrt(
        (features['Latitude'] - 37.8)**2 +
        (features['Longitude'] + 122.4)**2
    )

    # Income category
    features['IncomeCategory'] = pd.cut(features['MedInc'],
                                        bins=[0, 2.5, 4.5, 6.0, np.inf],
                                        labels=['Low', 'Medium', 'High', 'Very High'])

    # One-hot encode
    features_encoded = pd.get_dummies(features, columns=['IncomeCategory'])

    # Ensure the columns match the training data exactly
    for col in X.columns:
        if col not in features_encoded.columns:
            features_encoded[col] = 0
    features_encoded = features_encoded[X.columns]

    # Scale and predict
    features_scaled = scaler.transform(features_encoded)
    prediction = best_model.predict(features_scaled)[0]
    return prediction
# Example predictions
examples = [
{
'name': 'Affordable Suburban House',
'med_inc': 3.5, 'house_age': 25, 'ave_rooms': 5.5,
'ave_bedrms': 1.0, 'population': 1200, 'ave_occup': 3.0,
'latitude': 34.0, 'longitude': -118.0
},
{
'name': 'Luxury Bay Area House',
'med_inc': 8.5, 'house_age': 10, 'ave_rooms': 7.0,
'ave_bedrms': 1.2, 'population': 800, 'ave_occup': 2.5,
'latitude': 37.8, 'longitude': -122.4
},
{
'name': 'Rural House',
'med_inc': 2.8, 'house_age': 35, 'ave_rooms': 4.5,
'ave_bedrms': 1.1, 'population': 500, 'ave_occup': 2.8,
'latitude': 39.5, 'longitude': -121.5
}
]
print("\n" + "="*70)
print("HOUSE PRICE PREDICTIONS")
print("="*70)
for example in examples:
    name = example.pop('name')
    price = predict_house_price(**example)
    print(f"\n{name}:")
    print(f"  Predicted Price: ${price*100:.2f}k (${price*100000:,.0f})")
Output:
======================================================================
HOUSE PRICE PREDICTIONS
======================================================================
Affordable Suburban House:
Predicted Price: $228.45k ($228,450)
Luxury Bay Area House:
Predicted Price: $487.32k ($487,320)
Rural House:
Predicted Price: $156.78k ($156,780)
💾 Step 9: Save the Model
import joblib
# Save model and scaler
joblib.dump(best_model, 'house_price_model.pkl')
joblib.dump(scaler, 'feature_scaler.pkl')
print("Model saved successfully!")
# To load later:
# loaded_model = joblib.load('house_price_model.pkl')
# loaded_scaler = joblib.load('feature_scaler.pkl')
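As a quick sanity check, reload both artifacts and confirm they reproduce the model's predictions (reusing X_test and X_test_scaled from the earlier steps):
# Reload the saved artifacts and verify the round trip on a few test rows
loaded_model = joblib.load('house_price_model.pkl')
loaded_scaler = joblib.load('feature_scaler.pkl')
print("Original :", best_model.predict(X_test_scaled[:5]).round(3))
print("Reloaded :", loaded_model.predict(X_test_scaled[:5]).round(3))
print("Scaler OK:", np.allclose(loaded_scaler.transform(X_test), X_test_scaled))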
🎓 Key Takeaways
- Feature Engineering: Derived features (ratios, distance to a reference point, income categories) gave the models extra signal; rerun the pipeline without them to quantify the gain
- Model Selection: Random Forest outperformed linear models (R² = 0.81 vs 0.60)
- Feature Importance: MedInc, Location (Lat/Long), and HouseAge were most important
- Evaluation: Used multiple metrics (R², RMSE, MAE) and cross-validation
- Visualization: Plots helped identify patterns and validate model performance
🚀 Next Steps
- Try advanced models (XGBoost, Neural Networks)
- Perform hyperparameter tuning with GridSearchCV (see the sketch after this list)
- Handle outliers more carefully
- Create a web interface with Flask or Streamlit
- Deploy model to cloud (AWS, Azure, GCP)
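For the hyperparameter tuning item above, a minimal GridSearchCV starting point; the grid values are illustrative, not tuned recommendations:
from sklearn.model_selection import GridSearchCV

# Small illustrative grid for the Random Forest; expand as compute allows
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5]
}
grid = GridSearchCV(RandomForestRegressor(random_state=42),
                    param_grid, cv=5, scoring='r2', n_jobs=-1)
grid.fit(X_train_scaled, y_train)
print("Best params:", grid.best_params_)
print(f"Best CV R²: {grid.best_score_:.4f}")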