Case Study 1: Predicting House Prices — The Classic Regression Problem


Tier 1 — Verified Concepts: This case study uses housing prices as a canonical regression example. The idea that square footage, location, and property features predict price is well-established in real estate economics and appraisal literature. The specific data is simulated for pedagogical purposes, but the patterns (the positive relationship between size and price, the diminishing marginal value of additional square footage, and the importance of location) are well-documented in housing market research.


The Most Famous Dataset in Machine Learning

If linear regression has a "Hello, World" problem, it's predicting house prices. Nearly every machine learning textbook, every online course, and every Kaggle competition for beginners has used housing data at some point. There's a reason: houses have features everyone understands (bedrooms, square footage, location), the target is a clear number (sale price), and the patterns are intuitive enough that you can check whether the model's answers make sense.

This case study walks through a complete regression analysis of housing data, from initial exploration to final model evaluation. Along the way, you'll see every concept from Chapter 26 in action.

The Data

We'll work with a simulated dataset inspired by real housing markets:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

np.random.seed(42)
n = 500

houses = pd.DataFrame({
    'sqft': np.random.normal(1800, 500, n).clip(600, 5000),
    'bedrooms': np.random.choice([1, 2, 3, 4, 5], n,
                p=[0.05, 0.2, 0.4, 0.25, 0.1]),
    'bathrooms': np.random.choice([1, 1.5, 2, 2.5, 3], n,
                  p=[0.1, 0.2, 0.35, 0.25, 0.1]),
    'age_years': np.random.uniform(0, 80, n),
    'neighborhood': np.random.choice(
        ['Downtown', 'Suburbs', 'Rural'], n,
        p=[0.3, 0.5, 0.2])
})

# Price based on features (with realistic relationships)
nbhd_premium = houses['neighborhood'].map({
    'Downtown': 50000, 'Suburbs': 20000, 'Rural': -10000
})

houses['price'] = (
    50000 +
    150 * houses['sqft'] +
    15000 * houses['bedrooms'] +
    25000 * houses['bathrooms'] -
    800 * houses['age_years'] +
    nbhd_premium +
    np.random.normal(0, 30000, n)
).clip(50000, None).astype(int)

print(houses.describe().round(0))
print("\nNeighborhood counts:")
print(houses['neighborhood'].value_counts())

Step 1: Explore Before You Model

Before fitting any model, look at the data. How are prices distributed? Which features seem related to price?

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

axes[0, 0].scatter(houses['sqft'], houses['price'] / 1000,
                   alpha=0.3, s=10)
axes[0, 0].set_xlabel('Square Footage')
axes[0, 0].set_ylabel('Price ($K)')
axes[0, 0].set_title('Price vs. Square Footage')

axes[0, 1].scatter(houses['age_years'], houses['price'] / 1000,
                   alpha=0.3, s=10)
axes[0, 1].set_xlabel('Age (years)')
axes[0, 1].set_ylabel('Price ($K)')
axes[0, 1].set_title('Price vs. Age')

houses.boxplot(column='price', by='bedrooms', ax=axes[1, 0])
axes[1, 0].set_ylabel('Price ($)')
axes[1, 0].set_title('Price by Bedrooms')

houses.boxplot(column='price', by='neighborhood', ax=axes[1, 1])
axes[1, 1].set_ylabel('Price ($)')
axes[1, 1].set_title('Price by Neighborhood')

plt.tight_layout()
plt.show()

The scatter plots reveal the relationships: price increases with square footage (positive linear trend), price decreases with age (negative trend), more bedrooms means higher price, and downtown properties cost more.
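Scatter plots show the shape of these relationships; a correlation check puts rough numbers on the linear ones. A minimal sketch on a synthetic stand-in for the frame (same column names, illustrative values, so it runs on its own):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in resembling the houses frame
sqft = rng.normal(1800, 500, n).clip(600, 5000)
age = rng.uniform(0, 80, n)
price = 50000 + 150 * sqft - 800 * age + rng.normal(0, 30000, n)
houses_demo = pd.DataFrame({'sqft': sqft, 'age_years': age, 'price': price})

# Pearson correlation of each numeric feature with price
corr = houses_demo.corr(numeric_only=True)['price'].drop('price')
print(corr.round(2))
```

As the scatter plots suggest, the sqft correlation comes out strongly positive and the age correlation negative.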

Step 2: Start Simple

Always start with the simplest possible model. One feature, one line.

# Simple regression: price from square footage only
X_simple = houses[['sqft']]
y = houses['price']

X_train, X_test, y_train, y_test = train_test_split(
    X_simple, y, test_size=0.2, random_state=42
)

# Baseline
baseline_pred = y_train.mean()
baseline_mae = mean_absolute_error(y_test,
    [baseline_pred] * len(y_test))

# Simple model
simple_model = LinearRegression()
simple_model.fit(X_train, y_train)
simple_pred = simple_model.predict(X_test)

print("=== Simple Model (sqft only) ===")
print(f"Slope: ${simple_model.coef_[0]:,.0f} per sqft")
print(f"Intercept: ${simple_model.intercept_:,.0f}")
print(f"Training R²: {simple_model.score(X_train, y_train):.3f}")
print(f"Test R²:     {simple_model.score(X_test, y_test):.3f}")
print(f"Test MAE:    ${mean_absolute_error(y_test, simple_pred):,.0f}")
print(f"Baseline MAE: ${baseline_mae:,.0f}")

The simple model tells us: each additional square foot adds about $150 to the price (the slope), and the model explains roughly 50-60% of the variance in prices (R²). That's not bad for one feature. But there's clearly more to the story.
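With the slope and intercept in hand, you can reproduce a prediction by hand. A sketch with round, illustrative numbers (your fitted values will differ slightly):

```python
# Illustrative coefficients, roughly matching the fitted simple model
slope = 150        # dollars per additional square foot
intercept = 80000  # the line's value at 0 sqft (an extrapolation, not a real house)

def predict_price(sqft):
    """Apply the fitted line: price = intercept + slope * sqft."""
    return intercept + slope * sqft

print(f"Predicted price for 2,000 sqft: ${predict_price(2000):,.0f}")
# -> Predicted price for 2,000 sqft: $380,000
```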

Step 3: Add Features

Now add bedrooms, bathrooms, and age:

# Multiple regression: numerical features
X_multi = houses[['sqft', 'bedrooms', 'bathrooms', 'age_years']]

X_train_m, X_test_m, y_train, y_test = train_test_split(
    X_multi, y, test_size=0.2, random_state=42
)

multi_model = LinearRegression()
multi_model.fit(X_train_m, y_train)
multi_pred = multi_model.predict(X_test_m)

print("=== Multiple Model (4 features) ===")
print(f"Training R²: {multi_model.score(X_train_m, y_train):.3f}")
print(f"Test R²:     {multi_model.score(X_test_m, y_test):.3f}")
print(f"Test MAE:    ${mean_absolute_error(y_test, multi_pred):,.0f}")
print(f"Baseline MAE: ${baseline_mae:,.0f}")

print("\nCoefficients:")
for feat, coef in zip(X_multi.columns, multi_model.coef_):
    print(f"  {feat}: ${coef:,.0f}")

With four features, the model improves. R² increases, and MAE decreases. Each coefficient tells a story:

  • sqft: Each additional square foot adds about $150 (similar to the simple model, because sqft carries the most information)
  • bedrooms: Each additional bedroom adds about $15,000, holding size constant
  • bathrooms: Each additional bathroom adds about $25,000
  • age_years: Each year of age reduces price by about $800 (depreciation)
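The "holding size constant" reading can be verified directly: predict two houses identical in every feature except one, and the difference between the predictions equals that feature's coefficient. A minimal sketch on synthetic data (not the case-study frame itself):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame({
    'sqft': rng.normal(1800, 500, n),
    'bedrooms': rng.integers(1, 6, n).astype(float),
})
y = 50000 + 150 * X['sqft'] + 15000 * X['bedrooms'] + rng.normal(0, 20000, n)

model = LinearRegression().fit(X, y)

# Two houses identical except for one extra bedroom
house_a = pd.DataFrame({'sqft': [1800.0], 'bedrooms': [3.0]})
house_b = pd.DataFrame({'sqft': [1800.0], 'bedrooms': [4.0]})
diff = model.predict(house_b)[0] - model.predict(house_a)[0]

print(f"Prediction difference: ${diff:,.0f}")
print(f"Bedroom coefficient:   ${model.coef_[1]:,.0f}")
# The two numbers match exactly: that is what the coefficient means
```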

Step 4: Handle Categorical Features

What about neighborhood? It's categorical — Downtown, Suburbs, Rural — not numerical. You can't directly multiply a word by a coefficient.

The solution is one-hot encoding: converting each category into a binary (0/1) column.
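On a toy frame, the encoding looks like this (with drop_first=True, pandas drops the alphabetically first category, which becomes the reference level):

```python
import pandas as pd

toy = pd.DataFrame({'neighborhood': ['Downtown', 'Suburbs', 'Rural', 'Suburbs']})
encoded = pd.get_dummies(toy, columns=['neighborhood'], drop_first=True)
print(encoded)
# Downtown is the dropped (reference) category; the remaining
# columns are 0/1 indicators for Rural and Suburbs, and a
# Downtown row is all zeros
```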

# One-hot encode neighborhood
houses_encoded = pd.get_dummies(houses, columns=['neighborhood'],
                                drop_first=True)

X_full = houses_encoded[['sqft', 'bedrooms', 'bathrooms',
    'age_years', 'neighborhood_Rural', 'neighborhood_Suburbs']]

X_train_f, X_test_f, y_train, y_test = train_test_split(
    X_full, y, test_size=0.2, random_state=42
)

full_model = LinearRegression()
full_model.fit(X_train_f, y_train)
full_pred = full_model.predict(X_test_f)

print("=== Full Model (with neighborhood) ===")
print(f"Training R²: {full_model.score(X_train_f, y_train):.3f}")
print(f"Test R²:     {full_model.score(X_test_f, y_test):.3f}")
print(f"Test MAE:    ${mean_absolute_error(y_test, full_pred):,.0f}")

print("\nCoefficients:")
for feat, coef in zip(X_full.columns, full_model.coef_):
    print(f"  {feat}: ${coef:,.0f}")

The neighborhood coefficients are interpreted relative to the omitted category (Downtown, since we used drop_first=True):

  • neighborhood_Suburbs: Suburbs are about $30,000 less than Downtown, all else equal
  • neighborhood_Rural: Rural is about $60,000 less than Downtown, all else equal

Adding neighborhood further improves the model because location genuinely matters for house prices.

Step 5: Check the Model's Work

Let's create the key diagnostic visualizations:

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Predicted vs. Actual
axes[0].scatter(y_test, full_pred, alpha=0.3, s=15)
lims = [y.min(), y.max()]
axes[0].plot(lims, lims, 'r--', linewidth=2)
axes[0].set_xlabel('Actual Price')
axes[0].set_ylabel('Predicted Price')
axes[0].set_title('Predicted vs. Actual')

# Residual plot
residuals = y_test - full_pred
axes[1].scatter(full_pred, residuals, alpha=0.3, s=15)
axes[1].axhline(0, color='red', linestyle='--')
axes[1].set_xlabel('Predicted Price')
axes[1].set_ylabel('Residual')
axes[1].set_title('Residual Plot')

plt.tight_layout()
plt.show()

A good predicted-vs-actual plot shows points clustered around the diagonal line. The residual plot should show random scatter around zero with no obvious pattern.
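Beyond eyeballing the plots, a couple of numbers summarize the residuals: their mean should be near zero, and they should be uncorrelated with the predictions. A sketch on synthetic data (on the training set, ordinary least squares guarantees both properties exactly; on held-out data they hold only approximately):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 300
X = rng.normal(1800, 500, (n, 1))
y = 50000 + 150 * X[:, 0] + rng.normal(0, 30000, n)

model = LinearRegression().fit(X, y)
pred = model.predict(X)           # training-set predictions
residuals = y - pred

print(f"Mean residual: {residuals.mean():.2f}")
print(f"Residual-prediction correlation: {np.corrcoef(pred, residuals)[0, 1]:.4f}")
# Both are zero up to floating-point error for OLS on its training data
```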

The Results: What Did We Learn?

Here's the progression:

Model      Features                      Test R²   Test MAE
--------   ---------------------------   -------   --------
Baseline   None (predict mean)           0.00      ~$90K
Simple     sqft only                     ~0.55     ~$55K
Multiple   sqft + beds + baths + age     ~0.70     ~$42K
Full       All + neighborhood            ~0.75     ~$38K

Each addition improved the model, and none showed serious overfitting (training and test R² stayed close). The full model reduces prediction error by more than half compared to the baseline.
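The "more than half" claim is a quick arithmetic check on the table's approximate numbers:

```python
baseline_mae = 90_000  # approximate, from the table above
full_mae = 38_000      # approximate, from the table above

reduction = 1 - full_mae / baseline_mae
print(f"Error reduction vs. baseline: {reduction:.0%}")
# -> Error reduction vs. baseline: 58%
```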

What the Model Tells Us (and What It Doesn't)

The model tells us that square footage is the strongest predictor of price, followed by location, bathrooms, bedrooms, and age. These make intuitive sense.

The model does not tell us:

  • That adding a bedroom will increase your home's value by $15,000 (the model describes an association, not a causal effect — maybe homes with more bedrooms are just bigger)
  • What the price will be in a neighborhood not in the data
  • Whether the linear assumption holds for mansions above 5,000 sqft (we have little data there)
  • Anything about factors not in the model (school district quality, view, renovation status)

The Appraiser's Perspective

Professional real estate appraisers essentially do a version of what we just did — they consider comparable sales, adjust for features, and arrive at an estimated value. Our model automates this process. An appraiser might note:

"The model gets most of the story right. Size and location dominate. But it misses the intangibles — the charm of a well-maintained historic home, the value of a corner lot with a view, the penalty for being next to a highway. Those factors are real, and they explain the 25% of variance the model doesn't capture."

This is the "all models are wrong, but some are useful" principle in action. The model captures the big patterns. The residuals — the errors — represent everything else.


Discussion Questions

  1. The model assigns a negative coefficient to house age, suggesting older houses are worth less. But in some markets, very old houses are more valuable (historic charm). How would you modify the model to capture this nonlinear relationship?

  2. Square footage and number of bedrooms are correlated. How might this affect the individual coefficients? Would you expect the sqft coefficient to change if you removed bedrooms from the model?

  3. The model was trained on data from one city. Would you trust it to predict prices in a different city? Why or why not?

  4. A homeowner asks you: "If I add a bathroom, will my house be worth $25,000 more?" How would you respond, using what you know about regression coefficients and causation?


Key Takeaways from This Case Study

  • Start simple (one feature), then add complexity incrementally
  • Each addition should be justified by improved test performance, not just training performance
  • Categorical variables need encoding (one-hot) before regression can use them
  • Coefficient interpretation requires the "holding other variables constant" qualifier
  • Diagnostic plots (predicted vs. actual, residuals) reveal model problems that R² alone doesn't capture
  • The gap between what the model explains and what it doesn't (the residuals) represents real factors that aren't in the data