Case Study 1: Predicting Housing Prices

Overview

In this case study, you will apply the regression techniques from Chapter 6 to predict median house values in California districts. This is a classic regression problem that illustrates the full supervised learning pipeline: data exploration, feature engineering, model training, evaluation, and comparison across multiple algorithms.

Dataset: The California Housing dataset contains 20,640 samples with 8 features describing the demographics and geography of California census block groups from the 1990 Census. The target variable is the median house value (in units of $100,000).

Objective: Build and compare regression models to predict median house values, identify the most important predictive features, and select the best model based on test set performance.


Step 1: Data Exploration and Preprocessing

We begin by loading the data and understanding its structure.

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names

print(f"Dataset shape: {X.shape}")
print(f"Target range: [{y.min():.2f}, {y.max():.2f}]")
print(f"Features: {feature_names}")

Output:

Dataset shape: (20640, 8)
Target range: [0.15, 5.00]
Features: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
           'AveOccup', 'Latitude', 'Longitude']

The eight features are:

  - MedInc: Median income in the block group (in tens of thousands of dollars)
  - HouseAge: Median house age in the block group
  - AveRooms: Average number of rooms per household
  - AveBedrms: Average number of bedrooms per household
  - Population: Block group population
  - AveOccup: Average number of household members
  - Latitude: Block group latitude
  - Longitude: Block group longitude

Let us examine the basic statistics:

for i, name in enumerate(feature_names):
    col = X[:, i]
    print(f"{name:12s}: mean={col.mean():8.2f}, std={col.std():8.2f}, "
          f"min={col.min():8.2f}, max={col.max():8.2f}")

Key observations:

  - MedInc ranges from 0.5 to 15.0 (median income from $5K to $150K)
  - AveOccup has extreme outliers (mean ~3.07 but max ~1,243)
  - Population also has outliers (mean ~1,425, max ~35,682)
  - Features are on very different scales, so standardization will be important for linear models and SVMs
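The outlier observations above are easy to verify directly from the arrays we already loaded; a quick look at the upper tail (a minimal sketch, printing a couple of quantiles) makes the skew obvious:

# Compare the median to the extreme upper tail for the two skewed features
for name in ["AveOccup", "Population"]:
    col = X[:, feature_names.index(name)]
    p50, p99 = np.percentile(col, [50, 99])
    print(f"{name:12s}: median={p50:10.2f}, 99th pct={p99:10.2f}, max={col.max():10.2f}")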

Preprocessing Pipeline

# Train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set:     {X_test.shape[0]} samples")

Step 2: Baseline --- Linear Regression

We start with the simplest model as our baseline.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

def evaluate_regression(model, X_train, y_train, X_test, y_test, name="Model"):
    """Evaluate a regression model and print key metrics.

    Args:
        model: Fitted sklearn regression model.
        X_train: Training features.
        y_train: Training targets.
        X_test: Test features.
        y_test: Test targets.
        name: Display name for the model.
    """
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    test_mae = mean_absolute_error(y_test, y_test_pred)
    test_r2 = r2_score(y_test, y_test_pred)

    print(f"\n{'='*50}")
    print(f"Model: {name}")
    print(f"{'='*50}")
    print(f"  Train RMSE:  {train_rmse:.4f}")
    print(f"  Test RMSE:   {test_rmse:.4f}")
    print(f"  Test MAE:    {test_mae:.4f}")
    print(f"  Test R^2:    {test_r2:.4f}")

    return {"name": name, "train_rmse": train_rmse, "test_rmse": test_rmse,
            "test_mae": test_mae, "test_r2": test_r2}

# Linear Regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
lr_results = evaluate_regression(
    lr, X_train_scaled, y_train, X_test_scaled, y_test, "Linear Regression"
)

Output:

==================================================
Model: Linear Regression
==================================================
  Train RMSE:  0.7242
  Test RMSE:   0.7456
  Test MAE:    0.5332
  Test R^2:    0.5757

An $R^2$ of 0.576 means our simple linear model explains about 58% of the variance in house prices. The train and test RMSE are close, suggesting the model is not overfitting---it is likely underfitting.
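To check that this single train/test split is not misleading, we can also cross-validate the baseline on the training set; a minimal sketch (the fold scores will differ slightly from the test numbers above):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated RMSE on the training set; scores close to the test
# RMSE above support the "underfitting, not overfitting" reading.
cv_rmse = -cross_val_score(
    LinearRegression(), X_train_scaled, y_train,
    scoring="neg_root_mean_squared_error", cv=5,
)
print(f"CV RMSE: {cv_rmse.mean():.4f} +/- {cv_rmse.std():.4f}")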

Feature coefficients (which features matter most to the linear model):

for name, coef in sorted(
    zip(feature_names, lr.coef_), key=lambda x: abs(x[1]), reverse=True
):
    print(f"  {name:12s}: {coef:+.4f}")

Output:

  Latitude    : -0.8685
  MedInc      : +0.8296
  Longitude   : -0.8295
  AveRooms    : -0.2058
  HouseAge    : +0.1189
  AveBedrms   : +0.0592
  AveOccup    : -0.0394
  Population  : -0.0038

MedInc (median income) is the strongest positive predictor, while Latitude and Longitude capture geographic effects.


Step 3: Regularized Linear Models

from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
ridge_results = evaluate_regression(
    ridge, X_train_scaled, y_train, X_test_scaled, y_test, "Ridge (alpha=1.0)"
)

# Lasso Regression
lasso = Lasso(alpha=0.01)
lasso.fit(X_train_scaled, y_train)
lasso_results = evaluate_regression(
    lasso, X_train_scaled, y_train, X_test_scaled, y_test, "Lasso (alpha=0.01)"
)

# Elastic Net
elastic = ElasticNet(alpha=0.01, l1_ratio=0.5)
elastic.fit(X_train_scaled, y_train)
elastic_results = evaluate_regression(
    elastic, X_train_scaled, y_train, X_test_scaled, y_test, "Elastic Net"
)

Output:

Ridge (alpha=1.0)   -> Test RMSE: 0.7455, R^2: 0.5758
Lasso (alpha=0.01)  -> Test RMSE: 0.7462, R^2: 0.5748
Elastic Net         -> Test RMSE: 0.7530, R^2: 0.5671

Regularization provides minimal improvement here because the linear model is already underfitting. The bottleneck is bias, not variance.
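One way to confirm this is to search over regularization strengths and observe that no value of $\alpha$ helps much; a minimal sketch using RidgeCV (the specific alpha grid here is illustrative):

from sklearn.linear_model import RidgeCV

# Search a wide range of regularization strengths with built-in cross-validation.
# If the best alpha leaves test performance essentially unchanged, the
# bottleneck really is bias rather than variance.
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13))
ridge_cv.fit(X_train_scaled, y_train)
print(f"Best alpha: {ridge_cv.alpha_:.3f}")
print(f"Test R^2:   {ridge_cv.score(X_test_scaled, y_test):.4f}")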


Step 4: Polynomial Regression

Let us try capturing non-linear relationships with polynomial features.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Degree 2 polynomial with Ridge regularization
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=10.0),
)
poly_ridge.fit(X_train, y_train)
poly_results = evaluate_regression(
    poly_ridge, X_train, y_train, X_test, y_test, "Polynomial (deg=2) + Ridge"
)

Output:

==================================================
Model: Polynomial (deg=2) + Ridge
==================================================
  Train RMSE:  0.5707
  Test RMSE:   0.5930
  Test MAE:    0.3993
  Test R^2:    0.7317

A significant improvement! The degree-2 polynomial features capture important non-linear effects and feature interactions, boosting $R^2$ from 0.576 to 0.732.
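Part of what the pipeline above is doing is simply giving the linear model more columns to work with: a degree-2 expansion of 8 features yields 44 features (8 linear terms, 8 squares, and 28 pairwise products), which is also why the stronger Ridge penalty ($\alpha$ = 10.0) matters here. A quick way to confirm the count:

# The degree-2 expansion (bias excluded) turns 8 features into 44:
# 8 linear terms + 8 squares + 28 pairwise products.
poly = PolynomialFeatures(degree=2, include_bias=False).fit(X_train)
print(f"Expanded feature count: {poly.n_output_features_}")  # 44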


Step 5: Decision Tree and Random Forest

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Decision Tree (no scaling needed)
dt = DecisionTreeRegressor(max_depth=10, min_samples_leaf=10, random_state=42)
dt.fit(X_train, y_train)
dt_results = evaluate_regression(
    dt, X_train, y_train, X_test, y_test, "Decision Tree (depth=10)"
)

# Random Forest
rf = RandomForestRegressor(
    n_estimators=200, max_depth=15, min_samples_leaf=5,
    max_features=0.5, random_state=42, n_jobs=-1,
)
rf.fit(X_train, y_train)
rf_results = evaluate_regression(
    rf, X_train, y_train, X_test, y_test, "Random Forest (200 trees)"
)

Output:

Decision Tree (depth=10)    -> Test RMSE: 0.6431, R^2: 0.6843
Random Forest (200 trees)   -> Test RMSE: 0.5066, R^2: 0.8042

The random forest provides a substantial improvement, explaining over 80% of the variance.
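The large gap between the forest's train RMSE (0.22, see the summary table in Step 7) and its test RMSE (0.51) is expected: the individual trees overfit, and averaging is what recovers generalization. If you want a generalization estimate without touching the test set, the out-of-bag samples provide one; a minimal sketch, refitting the same configuration with oob_score enabled:

# Each tree's bootstrap sample leaves out roughly one third of the training
# rows; scoring every tree on its own left-out rows gives a test-like R^2.
rf_oob = RandomForestRegressor(
    n_estimators=200, max_depth=15, min_samples_leaf=5,
    max_features=0.5, oob_score=True, random_state=42, n_jobs=-1,
)
rf_oob.fit(X_train, y_train)
print(f"Out-of-bag R^2: {rf_oob.oob_score_:.4f}")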


Step 6: Gradient Boosting

from sklearn.ensemble import GradientBoostingRegressor

# Gradient Boosting
gb = GradientBoostingRegressor(
    n_estimators=300, learning_rate=0.1, max_depth=5,
    subsample=0.8, min_samples_leaf=10, random_state=42,
)
gb.fit(X_train, y_train)
gb_results = evaluate_regression(
    gb, X_train, y_train, X_test, y_test, "Gradient Boosting (300 trees)"
)

Output:

==================================================
Model: Gradient Boosting (300 trees)
==================================================
  Train RMSE:  0.3897
  Test RMSE:   0.4711
  Test MAE:    0.3119
  Test R^2:    0.8307

Gradient boosting achieves the best performance so far: test RMSE of 0.471 and $R^2$ of 0.831.
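To see whether 300 boosting stages is the right budget, staged_predict can trace the test error after every stage; a minimal diagnostic sketch (in practice you would monitor a validation split or use the estimator's n_iter_no_change early-stopping option rather than the test set):

# Test RMSE after each boosting stage; a curve that flattens before 300
# stages means additional trees are no longer buying accuracy.
stage_rmse = [
    np.sqrt(mean_squared_error(y_test, y_pred))
    for y_pred in gb.staged_predict(X_test)
]
best_stage = int(np.argmin(stage_rmse)) + 1
print(f"Best test RMSE {min(stage_rmse):.4f} at stage {best_stage} of {len(stage_rmse)}")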


Step 7: Model Comparison Summary

Model                   Train RMSE   Test RMSE   Test MAE   Test R^2
---------------------------------------------------------------------
Linear Regression           0.7242      0.7456     0.5332     0.5757
Ridge (alpha=1.0)           0.7242      0.7455     0.5332     0.5758
Lasso (alpha=0.01)          0.7244      0.7462     0.5338     0.5748
Polynomial + Ridge          0.5707      0.5930     0.3993     0.7317
Decision Tree               0.5384      0.6431     0.4285     0.6843
Random Forest               0.2212      0.5066     0.3338     0.8042
Gradient Boosting           0.3897      0.4711     0.3119     0.8307
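A table like this can be assembled directly from the dictionaries returned by evaluate_regression; a minimal sketch (elastic_results could be appended the same way):

# Gather the result dictionaries and print them sorted by test RMSE.
all_results = [lr_results, ridge_results, lasso_results, poly_results,
               dt_results, rf_results, gb_results]
print(f"{'Model':30s} {'Train RMSE':>11s} {'Test RMSE':>10s} {'Test MAE':>9s} {'Test R^2':>9s}")
for r in sorted(all_results, key=lambda r: r["test_rmse"]):
    print(f"{r['name']:30s} {r['train_rmse']:11.4f} {r['test_rmse']:10.4f} "
          f"{r['test_mae']:9.4f} {r['test_r2']:9.4f}")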

Step 8: Feature Importance Analysis

# Random Forest feature importances
rf_importances = rf.feature_importances_
gb_importances = gb.feature_importances_

print("Feature Importances:")
print(f"{'Feature':12s} {'RF':>8s} {'GB':>8s}")
print("-" * 30)
for name, rf_imp, gb_imp in sorted(
    zip(feature_names, rf_importances, gb_importances),
    key=lambda x: x[1], reverse=True,
):
    print(f"{name:12s} {rf_imp:8.4f} {gb_imp:8.4f}")

Output:

Feature Importances:
Feature            RF       GB
------------------------------
MedInc         0.5143   0.4987
AveOccup       0.1301   0.1156
Longitude      0.0982   0.1082
Latitude       0.0941   0.1045
HouseAge       0.0548   0.0486
AveRooms       0.0460   0.0381
Population     0.0335   0.0513
AveBedrms      0.0290   0.0350

Both models agree: median income is by far the most important predictor of house value, followed by geographic location (Latitude/Longitude) and average occupancy.
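Impurity-based importances can overstate features with many distinct values, so it is worth cross-checking against permutation importance, which measures how much the test score drops when a feature's values are shuffled; a minimal sketch for the random forest:

from sklearn.inspection import permutation_importance

# Shuffle each feature 10 times and record the average drop in test R^2.
perm = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
)
for idx in perm.importances_mean.argsort()[::-1]:
    print(f"{feature_names[idx]:12s} {perm.importances_mean[idx]:8.4f}")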


Key Takeaways

  1. Start simple: Linear regression provides an interpretable baseline but underfits this non-linear problem.
  2. Non-linearity matters: Polynomial features and tree-based models capture important non-linear relationships that linear models miss.
  3. Ensemble methods dominate: Random forests and gradient boosting significantly outperform single models.
  4. Gradient boosting wins: With proper hyperparameter tuning, gradient boosting achieved the best test performance ($R^2 = 0.831$); a tuning sketch follows this list.
  5. Feature importance is consistent: Multiple models agree on the most important features, increasing our confidence in the analysis.
  6. Domain knowledge matters: The strong importance of geographic features (Latitude, Longitude) suggests that house prices have significant spatial patterns---a domain expert would recognize this as reflecting neighborhood desirability, proximity to the coast, and other geographic factors.
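Hyperparameters like those used in Step 6 can be found with a randomized search; the grid below is an illustrative sketch, not the exact search used to produce the numbers in this case study:

from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space around the Step 6 settings; adjust as needed.
param_distributions = {
    "n_estimators": [100, 200, 300, 500],
    "learning_rate": [0.03, 0.05, 0.1],
    "max_depth": [3, 4, 5, 6],
    "subsample": [0.6, 0.8, 1.0],
    "min_samples_leaf": [5, 10, 20],
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions, n_iter=20, cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42, n_jobs=-1,
)
search.fit(X_train, y_train)
print(f"Best parameters: {search.best_params_}")
print(f"Best CV RMSE:    {-search.best_score_:.4f}")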

The complete implementation code for this case study is available in code/case-study-code.py.