🎯 Project Overview
Build a machine learning model to predict house prices from features such as median income, house age, average number of rooms, and location. This is a classic regression problem, and this project uses the California Housing dataset (the older Boston Housing dataset has been removed from recent scikit-learn releases, so it is not used here).
What You'll Learn:
- Data exploration and visualization
- Feature engineering and selection
- Multiple linear regression
- Model evaluation and optimization
- Handling real-world data issues
🛠️ Technologies
Python 3.8+
Pandas
NumPy
Scikit-learn
Matplotlib
Seaborn
📦 Step 1: Setup and Import Libraries
# Install required packages
# pip install pandas numpy scikit-learn matplotlib seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
📊 Step 2: Load and Explore Data
# Load California Housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['Price'] = housing.target
# Display basic information
print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nDataset Info:")
print(df.info())
print("\nStatistical Summary:")
print(df.describe())
print("\nMissing Values:")
print(df.isnull().sum())
print("\nFeature Descriptions:")
# Note: this slice of housing.DESCR assumes a particular line layout,
# which can shift between scikit-learn versions; adjust the indices if the output looks off
for feature, description in zip(housing.feature_names, housing.DESCR.split('\n')[9:18]):
    print(f"{feature}: {description}")
Output:
Dataset Shape: (20640, 9)
First few rows:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude Price
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
Features:
- MedInc: median income in the block group
- HouseAge: median house age in the block group
- AveRooms: average number of rooms per household
- AveBedrms: average number of bedrooms per household
- Population: block group population
- AveOccup: average number of household members
- Latitude: block group latitude
- Longitude: block group longitude
- Price (target): median house value, in units of $100,000
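Before plotting anything, it is worth glancing at the raw value ranges; a handful of block groups have very large AveRooms and AveOccup values, the kind of real-world data issue this project is about:
# Check min/max of the per-household features; a few block groups are extreme
print(df[['AveRooms', 'AveBedrms', 'AveOccup', 'Population']].describe().loc[['min', 'max']])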
📈 Step 3: Exploratory Data Analysis (EDA)
# 1. Distribution of target variable
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.hist(df['Price'], bins=50, edgecolor='black')
plt.xlabel('House Price ($100k)')
plt.ylabel('Frequency')
plt.title('Distribution of House Prices')
plt.subplot(1, 2, 2)
plt.boxplot(df['Price'])
plt.ylabel('House Price ($100k)')
plt.title('Box Plot of House Prices')
plt.tight_layout()
plt.show()
# 2. Correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm',
square=True, linewidths=0.5)
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()
# 3. Key correlations with price
print("\nCorrelations with Price:")
print(correlation_matrix['Price'].sort_values(ascending=False))
# 4. Scatter plots for top features
top_features = correlation_matrix['Price'].abs().sort_values(ascending=False)[1:5].index
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()
for idx, feature in enumerate(top_features):
    axes[idx].scatter(df[feature], df['Price'], alpha=0.5)
    axes[idx].set_xlabel(feature)
    axes[idx].set_ylabel('Price')
    axes[idx].set_title(f'{feature} vs Price')
plt.tight_layout()
plt.show()
# 5. Geographic scatter plot
plt.figure(figsize=(10, 8))
scatter = plt.scatter(df['Longitude'], df['Latitude'],
c=df['Price'], s=df['Population']/100,
alpha=0.4, cmap='viridis')
plt.colorbar(scatter, label='Price ($100k)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('California Housing Prices by Location')
plt.show()
Key Insights from EDA:
- MedInc has the strongest correlation with price (0.69)
- Latitude and Longitude show geographic patterns
- Price distribution is right-skewed (most houses in lower price range)
- Values at the top of the range pile up at 5.0 ($500k) because the original survey capped median house values there; these are a data artifact rather than ordinary outliers
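To put numbers on the skew and the cap mentioned above:
# Quantify the right skew and the pile-up at the $500k cap
print("Skewness of Price:", round(df['Price'].skew(), 3))
print("Block groups at the 5.0 cap:", (df['Price'] >= 5.0).sum())
print("Share at the cap:", round((df['Price'] >= 5.0).mean(), 4))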
🔧 Step 4: Feature Engineering
# Create new features
df['RoomsPerHousehold'] = df['AveRooms'] / df['AveOccup']
df['BedroomsPerRoom'] = df['AveBedrms'] / df['AveRooms']
df['PopulationPerHousehold'] = df['Population'] / df['AveOccup']
# Create a location-based feature: rough straight-line distance, in degrees,
# from a reference point near San Francisco (37.8°N, 122.4°W)
df['DistanceToCenter'] = np.sqrt(
    (df['Latitude'] - 37.8)**2 +
    (df['Longitude'] + 122.4)**2
)
# Income categories
df['IncomeCategory'] = pd.cut(df['MedInc'],
bins=[0, 2.5, 4.5, 6.0, np.inf],
labels=['Low', 'Medium', 'High', 'Very High'])
# One-hot encode categorical features
df_encoded = pd.get_dummies(df, columns=['IncomeCategory'], drop_first=True)
print("New features created:")
print(df_encoded.columns.tolist())
print("\nCorrelations with Price (new features):")
new_feature_corr = df_encoded[[
'RoomsPerHousehold', 'BedroomsPerRoom',
'PopulationPerHousehold', 'DistanceToCenter'
]].corrwith(df_encoded['Price'])
print(new_feature_corr.sort_values(ascending=False))
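Because the new features involve division, a quick check that no infinities or NaNs slipped in is cheap insurance (none are expected with this dataset):
# Verify the ratio features contain no inf or NaN values
ratio_cols = ['RoomsPerHousehold', 'BedroomsPerRoom', 'PopulationPerHousehold']
print(np.isinf(df_encoded[ratio_cols]).sum())
print(df_encoded[ratio_cols].isna().sum())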
🎯 Step 5: Prepare Data for Training
# Separate features and target
X = df_encoded.drop('Price', axis=1)
y = df_encoded['Price']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("\nFeature scaling completed")
print("Mean after scaling:", X_train_scaled.mean())
print("Std after scaling:", X_train_scaled.std())
🤖 Step 6: Train Multiple Models
# Initialize models
models = {
'Linear Regression': LinearRegression(),
'Ridge Regression': Ridge(alpha=1.0),
'Lasso Regression': Lasso(alpha=0.1),
'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}
# Train and evaluate each model
results = {}
for name, model in models.items():
    print(f"\n{'='*50}")
    print(f"Training {name}...")
    print('='*50)

    # Train
    model.fit(X_train_scaled, y_train)

    # Predictions
    y_train_pred = model.predict(X_train_scaled)
    y_test_pred = model.predict(X_test_scaled)

    # Evaluate
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    test_mae = mean_absolute_error(y_test, y_test_pred)

    # Cross-validation
    cv_scores = cross_val_score(model, X_train_scaled, y_train,
                                cv=5, scoring='r2')

    results[name] = {
        'Train R²': train_r2,
        'Test R²': test_r2,
        'Train RMSE': train_rmse,
        'Test RMSE': test_rmse,
        'Test MAE': test_mae,
        'CV R² Mean': cv_scores.mean(),
        'CV R² Std': cv_scores.std()
    }

    print(f"Train R²: {train_r2:.4f}")
    print(f"Test R²: {test_r2:.4f}")
    print(f"Test RMSE: ${test_rmse*100:.2f}k")
    print(f"Test MAE: ${test_mae*100:.2f}k")
    print(f"Cross-Val R² (mean ± std): {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
# Create results DataFrame
results_df = pd.DataFrame(results).T
print("\n" + "="*70)
print("MODEL COMPARISON")
print("="*70)
print(results_df)
Expected Output:
==================================================
Training Linear Regression...
==================================================
Train R²: 0.6063
Test R²: 0.5979
Test RMSE: $73.42k
Test MAE: $52.91k
Cross-Val R² (mean ± std): 0.6012 ± 0.0234
==================================================
Training Random Forest...
==================================================
Train R²: 0.9756
Test R²: 0.8123
Test RMSE: $50.12k
Test MAE: $32.67k
Cross-Val R² (mean ± std): 0.8089 ± 0.0189
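In this run Random Forest comes out on top. Rather than hard-coding the winner, you can also identify it programmatically from the comparison table built in Step 6:
# Identify the model with the highest test R² from the comparison table
best_name = results_df['Test R²'].idxmax()
print(f"Best model by test R²: {best_name}")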
📊 Step 7: Model Evaluation and Visualization
# Select best model (Random Forest in this case)
best_model = models['Random Forest']
y_pred = best_model.predict(X_test_scaled)
# 1. Actual vs Predicted
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
'r--', lw=2)
plt.xlabel('Actual Price ($100k)')
plt.ylabel('Predicted Price ($100k)')
plt.title('Actual vs Predicted House Prices')
# 2. Residuals plot
residuals = y_test - y_pred
plt.subplot(1, 2, 2)
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Price ($100k)')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.tight_layout()
plt.show()
# 3. Feature importance (for Random Forest)
feature_importance = pd.DataFrame({
'Feature': X.columns,
'Importance': best_model.feature_importances_
}).sort_values('Importance', ascending=False)
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'][:10],
feature_importance['Importance'][:10])
plt.xlabel('Importance')
plt.title('Top 10 Feature Importances')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
print("\nTop 10 Important Features:")
print(feature_importance.head(10))
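Impurity-based importances from tree ensembles can be biased toward features with many distinct values. As a cross-check, permutation importance on the held-out test set measures how much shuffling each feature hurts R²; a short sketch:
from sklearn.inspection import permutation_importance

# Shuffle each feature on the test set and measure the resulting drop in R²
perm = permutation_importance(best_model, X_test_scaled, y_test,
                              n_repeats=5, random_state=42, scoring='r2')
perm_importance = pd.Series(perm.importances_mean, index=X.columns)
print("\nPermutation importances (test set):")
print(perm_importance.sort_values(ascending=False).head(10))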
🎯 Step 8: Make Predictions on New Data
# Create a function to predict house price
def predict_house_price(med_inc, house_age, ave_rooms, ave_bedrms,
                        population, ave_occup, latitude, longitude):
    """
    Predict house price from raw features.

    Relies on `X`, `scaler`, and `best_model` defined in the previous steps.
    """
    # Create a single-row feature DataFrame
    features = pd.DataFrame({
        'MedInc': [med_inc],
        'HouseAge': [house_age],
        'AveRooms': [ave_rooms],
        'AveBedrms': [ave_bedrms],
        'Population': [population],
        'AveOccup': [ave_occup],
        'Latitude': [latitude],
        'Longitude': [longitude]
    })

    # Engineer the same features as in training
    features['RoomsPerHousehold'] = features['AveRooms'] / features['AveOccup']
    features['BedroomsPerRoom'] = features['AveBedrms'] / features['AveRooms']
    features['PopulationPerHousehold'] = features['Population'] / features['AveOccup']
    features['DistanceToCenter'] = np.sqrt(
        (features['Latitude'] - 37.8)**2 +
        (features['Longitude'] + 122.4)**2
    )

    # Income category
    features['IncomeCategory'] = pd.cut(features['MedInc'],
                                        bins=[0, 2.5, 4.5, 6.0, np.inf],
                                        labels=['Low', 'Medium', 'High', 'Very High'])

    # One-hot encode
    features_encoded = pd.get_dummies(features, columns=['IncomeCategory'])

    # Ensure the columns match the training data exactly
    for col in X.columns:
        if col not in features_encoded.columns:
            features_encoded[col] = 0
    features_encoded = features_encoded[X.columns]

    # Scale and predict
    features_scaled = scaler.transform(features_encoded)
    prediction = best_model.predict(features_scaled)[0]
    return prediction
# Example predictions
examples = [
{
'name': 'Affordable Suburban House',
'med_inc': 3.5, 'house_age': 25, 'ave_rooms': 5.5,
'ave_bedrms': 1.0, 'population': 1200, 'ave_occup': 3.0,
'latitude': 34.0, 'longitude': -118.0
},
{
'name': 'Luxury Bay Area House',
'med_inc': 8.5, 'house_age': 10, 'ave_rooms': 7.0,
'ave_bedrms': 1.2, 'population': 800, 'ave_occup': 2.5,
'latitude': 37.8, 'longitude': -122.4
},
{
'name': 'Rural House',
'med_inc': 2.8, 'house_age': 35, 'ave_rooms': 4.5,
'ave_bedrms': 1.1, 'population': 500, 'ave_occup': 2.8,
'latitude': 39.5, 'longitude': -121.5
}
]
print("\n" + "="*70)
print("HOUSE PRICE PREDICTIONS")
print("="*70)
for example in examples:
    name = example.pop('name')
    price = predict_house_price(**example)
    print(f"\n{name}:")
    print(f"  Predicted Price: ${price*100:.2f}k (${price*100000:,.0f})")
Output:
======================================================================
HOUSE PRICE PREDICTIONS
======================================================================
Affordable Suburban House:
Predicted Price: $228.45k ($228,450)
Luxury Bay Area House:
Predicted Price: $487.32k ($487,320)
Rural House:
Predicted Price: $156.78k ($156,780)
💾 Step 9: Save the Model
import joblib
# Save model and scaler
joblib.dump(best_model, 'house_price_model.pkl')
joblib.dump(scaler, 'feature_scaler.pkl')
print("Model saved successfully!")
# To load later:
# loaded_model = joblib.load('house_price_model.pkl')
# loaded_scaler = joblib.load('feature_scaler.pkl')
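As a quick sanity check, reload both artifacts and confirm they reproduce the model's predictions (reusing X_test and X_test_scaled from the earlier steps):
# Reload the saved artifacts and verify the round trip on a few test rows
loaded_model = joblib.load('house_price_model.pkl')
loaded_scaler = joblib.load('feature_scaler.pkl')
print("Original :", best_model.predict(X_test_scaled[:5]).round(3))
print("Reloaded :", loaded_model.predict(X_test_scaled[:5]).round(3))
print("Scaler OK:", np.allclose(loaded_scaler.transform(X_test), X_test_scaled))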
🎓 Key Takeaways
- Feature Engineering: Derived features (ratios, distance to a reference point, income categories) gave the models extra signal; rerun the pipeline without them to quantify the gain
- Model Selection: Random Forest outperformed linear models (R² = 0.81 vs 0.60)
- Feature Importance: MedInc, Location (Lat/Long), and HouseAge were most important
- Evaluation: Used multiple metrics (R², RMSE, MAE) and cross-validation
- Visualization: Plots helped identify patterns and validate model performance
🚀 Next Steps
- Try advanced models (XGBoost, Neural Networks)
- Perform hyperparameter tuning with GridSearchCV (see the sketch after this list)
- Handle outliers more carefully
- Create a web interface with Flask or Streamlit
- Deploy model to cloud (AWS, Azure, GCP)
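For the hyperparameter tuning item above, a minimal GridSearchCV starting point; the grid values are illustrative, not tuned recommendations:
from sklearn.model_selection import GridSearchCV

# Small illustrative grid for the Random Forest; expand as compute allows
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5]
}
grid = GridSearchCV(RandomForestRegressor(random_state=42),
                    param_grid, cv=5, scoring='r2', n_jobs=-1)
grid.fit(X_train_scaled, y_train)
print("Best params:", grid.best_params_)
print(f"Best CV R²: {grid.best_score_:.4f}")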