Chapter 26 Exercises: Linear Regression — Your First Predictive Model

How to use these exercises: Part A focuses on understanding the intuition and interpretation of linear regression without code. Part B applies regression concepts to real-world scenarios. Part C is hands-on coding with scikit-learn. Part D challenges you to think critically about when linear regression works, when it doesn't, and what the results actually mean.

Difficulty key: ⭐ Foundational | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐⭐⭐ Extension


Part A: Conceptual Understanding


Exercise 26.1 Interpreting slope and intercept ⭐

A linear regression model predicts apartment rent (in dollars per month) from apartment size (in square feet):

Rent = 450 + 1.25 * SquareFeet
  1. What is the slope? Interpret it in plain English.
  2. What is the intercept? Interpret it. Is this interpretation realistic?
  3. What rent does the model predict for a 900 sq ft apartment?
  4. What rent does the model predict for a 1,500 sq ft apartment?
Guidance 1. Slope = 1.25. For each additional square foot of apartment size, the predicted monthly rent increases by $1.25. 2. Intercept = 450. The model predicts $450/month for a 0 sq ft apartment. This isn't realistic — no apartment has 0 sq ft. The intercept is a mathematical anchor point, not a meaningful prediction in this context. 3. 450 + 1.25 * 900 = $1,575/month. 4. 450 + 1.25 * 1500 = $2,325/month.
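The predictions in parts 3 and 4 can be checked with a short sketch (plain Python, no libraries needed):

```python
# Rent model from the exercise: Rent = 450 + 1.25 * SquareFeet
def predict_rent(square_feet):
    return 450 + 1.25 * square_feet

print(predict_rent(900))   # 1575.0
print(predict_rent(1500))  # 2325.0
```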

Exercise 26.2 Understanding residuals ⭐

Using the model from Exercise 26.1, compute the residual for each apartment:

Apartment Size (sq ft) Actual Rent Predicted Rent Residual
A 750 $1,400 ? ?
B 1,100 $1,800 ? ?
C 600 $1,350 ? ?
D 1,400 $2,100 ? ?
  1. Fill in the predicted rent and residual columns.
  2. Which apartment's rent is most underestimated by the model?
  3. Which apartment's rent is most overestimated?
  4. Do the residuals sum to approximately zero? Why might that be expected?
Guidance Predicted = 450 + 1.25 * Size; Residual = Actual - Predicted.

Apartment Size (sq ft) Actual Rent Predicted Rent Residual
A 750 $1,400 $1,387.50 +$12.50
B 1,100 $1,800 $1,825.00 -$25.00
C 600 $1,350 $1,200.00 +$150.00
D 1,400 $2,100 $2,200.00 -$100.00

1. See the table above. 2. Apartment C is most underestimated (residual +$150: actual rent is much higher than predicted). 3. Apartment D is most overestimated (residual -$100: actual rent is lower than predicted). 4. Sum = 12.50 - 25 + 150 - 100 = +$37.50. Not exactly zero, because these are only 4 of the many points the model was fit on. For the full dataset, residuals from least squares regression (with an intercept) sum to exactly zero.
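The whole residual table can be reproduced with pandas, using the model from Exercise 26.1:

```python
import pandas as pd

# Predicted = 450 + 1.25 * Size; Residual = Actual - Predicted
apts = pd.DataFrame({
    'apartment': ['A', 'B', 'C', 'D'],
    'size_sqft': [750, 1100, 600, 1400],
    'actual_rent': [1400, 1800, 1350, 2100],
})
apts['predicted_rent'] = 450 + 1.25 * apts['size_sqft']
apts['residual'] = apts['actual_rent'] - apts['predicted_rent']
print(apts)
print('Residual sum:', apts['residual'].sum())  # 37.5
```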

Exercise 26.3 Interpreting R-squared ⭐

For each scenario, interpret the R² value in plain English:

  1. A model predicting house prices from square footage: R² = 0.72
  2. A model predicting daily stock returns from yesterday's returns: R² = 0.02
  3. A model predicting student test scores from hours studied: R² = 0.45
  4. A model predicting a person's height from their arm span: R² = 0.95
Guidance 1. 72% of the variation in house prices is explained by square footage. This is a strong relationship — size explains most but not all of the price differences (location, condition, and other factors explain the remaining 28%). 2. Only 2% of the variation in today's stock returns is explained by yesterday's returns. Almost no predictive power — stock returns are nearly unpredictable from past returns, consistent with the efficient market hypothesis. 3. 45% of variation in test scores is explained by study hours. A moderate relationship — studying helps, but other factors (prior knowledge, test-taking ability, sleep quality) explain more than half the variation. 4. 95% of variation in height is explained by arm span. A very strong relationship — arm span and height are closely linked biologically.

Exercise 26.4 Least squares intuition ⭐

Explain in your own words why linear regression minimizes the sum of squared residuals rather than the sum of absolute residuals. What practical difference does squaring make?

Guidance Squaring has two key effects: (1) It ensures all residuals are positive, preventing positive and negative residuals from canceling each other out. (2) It penalizes large errors disproportionately — a residual of 10 contributes 100 to the sum, while a residual of 5 contributes only 25. This means the line is especially pulled toward reducing big errors. Practically, this makes the model sensitive to outliers — a single data point far from the line can pull it significantly. The alternative (minimizing absolute residuals) is more robust to outliers but doesn't have as clean a mathematical solution.
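The "disproportionate penalty" point can be made concrete in two lines:

```python
# A residual twice as large contributes four times as much under squaring
residuals = [5, 10]
print([r ** 2 for r in residuals])   # [25, 100] -- squared penalty
print([abs(r) for r in residuals])   # [5, 10]   -- absolute penalty
```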

Exercise 26.5 Multiple regression coefficients ⭐⭐

A multiple regression model predicts a car's fuel efficiency (MPG) from three features:

MPG = 45.0 - 0.005 * Weight_lbs - 1.2 * Engine_liters + 3.5 * Hybrid_flag

Interpret each coefficient:
  1. What does -0.005 mean for Weight_lbs?
  2. What does -1.2 mean for Engine_liters?
  3. What does 3.5 mean for Hybrid_flag (where 1 = hybrid, 0 = not)?
  4. Predict MPG for a 3,000 lb car with a 2.0L engine that is not a hybrid.
  5. Predict MPG for the same car if it were a hybrid.

Guidance 1. For each additional pound of weight, MPG decreases by 0.005, holding engine size and hybrid status constant. (Or: each additional 200 lbs reduces MPG by 1.) 2. For each additional liter of engine displacement, MPG decreases by 1.2, holding weight and hybrid status constant. 3. Hybrid vehicles get an additional 3.5 MPG compared to non-hybrid vehicles, holding weight and engine size constant. 4. MPG = 45.0 - 0.005(3000) - 1.2(2.0) + 3.5(0) = 45.0 - 15.0 - 2.4 + 0 = 27.6 MPG. 5. MPG = 45.0 - 0.005(3000) - 1.2(2.0) + 3.5(1) = 45.0 - 15.0 - 2.4 + 3.5 = 31.1 MPG.
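Parts 4 and 5 as code, with the coefficients copied from the exercise:

```python
# MPG model from the exercise
def predict_mpg(weight_lbs, engine_liters, hybrid_flag):
    return 45.0 - 0.005 * weight_lbs - 1.2 * engine_liters + 3.5 * hybrid_flag

print(round(predict_mpg(3000, 2.0, 0), 1))  # 27.6
print(round(predict_mpg(3000, 2.0, 1), 1))  # 31.1
```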

Exercise 26.6 When linear regression fails ⭐⭐

For each scenario, predict whether linear regression will work well or poorly, and explain why:

  1. Predicting temperature from altitude (temperature generally decreases linearly with altitude)
  2. Predicting population growth from time (growth is exponential)
  3. Predicting crop yield from rainfall (too little rain is bad, too much rain is bad)
  4. Predicting a person's weight from their height (roughly linear in adults)
Guidance 1. **Works well.** Temperature decreases approximately linearly with altitude (about 6.5 degrees C per 1000m). Linear regression captures this well. 2. **Works poorly.** Exponential growth is nonlinear — a straight line can't capture the accelerating curve. Try log-transforming population first, or use a nonlinear model. 3. **Works poorly.** The relationship is U-shaped (quadratic) — both extremes are bad. Linear regression, which assumes a monotonic relationship, can't capture this. Adding a rainfall-squared term could help. 4. **Works reasonably well.** The relationship is roughly linear in adults, though with significant scatter. R² will be moderate (maybe 0.3-0.5).
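Scenario 3 can be demonstrated with synthetic data (the coefficients below are made up for illustration): a plain linear fit captures almost nothing of a U-shaped relationship, while adding a squared term recovers it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
rain = rng.uniform(0, 100, 200)
# Yield peaks at 50mm of rain and falls off on both sides (inverted U)
crop_yield = 50 - 0.02 * (rain - 50) ** 2 + rng.normal(0, 2, 200)

X_linear = rain.reshape(-1, 1)
X_quadratic = np.column_stack([rain, rain ** 2])  # add rainfall-squared

r2_linear = LinearRegression().fit(X_linear, crop_yield).score(X_linear, crop_yield)
r2_quadratic = LinearRegression().fit(X_quadratic, crop_yield).score(X_quadratic, crop_yield)
print(f"Linear R²: {r2_linear:.3f}")
print(f"With squared term R²: {r2_quadratic:.3f}")
```

The linear model's R² is near zero because the relationship is symmetric around the peak; the squared term fixes this.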

Exercise 26.7 Multicollinearity ⭐⭐

You're building a model to predict house prices using these features: total square footage, number of bedrooms, number of bathrooms, and lot size.

  1. Which pairs of features are likely to be highly correlated? Why?
  2. How would multicollinearity affect your coefficient for "number of bedrooms"?
  3. Would multicollinearity affect the model's overall prediction accuracy?
  4. If your goal is prediction, should you worry about multicollinearity? What if your goal is explanation?
Guidance 1. Square footage and number of bedrooms are likely correlated (bigger houses have more bedrooms). Square footage and bathrooms similarly. These features carry overlapping information. 2. The coefficient for bedrooms becomes unstable and hard to interpret. It might be positive, negative, or near zero depending on the specific sample — not because bedrooms don't matter, but because the model can't separate the effect of bedrooms from the effect of square footage. 3. Not significantly. Multicollinearity affects individual coefficients but usually not overall prediction quality. The model distributes the effect across correlated features differently, but the combined prediction stays similar. 4. For prediction: generally don't worry about it. For explanation: it's a serious concern because individual coefficients become unreliable. You might remove one of the correlated features or use techniques designed to handle multicollinearity.
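A quick simulation (hypothetical numbers, not from the chapter) shows both effects at once: the bedroom coefficient jumps around between subsamples while overall fit quality barely moves.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
sqft = rng.uniform(800, 3000, n)
# Bedroom count tracks square footage closely, so the two features
# carry overlapping information
bedrooms = np.round(sqft / 600 + rng.normal(0, 0.4, n))
price = 100 * sqft + rng.normal(0, 20000, n)

print(f"corr(sqft, bedrooms): {np.corrcoef(sqft, bedrooms)[0, 1]:.2f}")

X = np.column_stack([sqft, bedrooms])
coefs, r2s = [], []
# Refit on two random halves of the data and compare
for seed in (1, 2):
    idx = np.random.default_rng(seed).permutation(n)[:n // 2]
    m = LinearRegression().fit(X[idx], price[idx])
    coefs.append(m.coef_[1])
    r2s.append(m.score(X[idx], price[idx]))
    print(f"bedroom coef: {coefs[-1]:10.1f}   R²: {r2s[-1]:.3f}")
```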

Part B: Applied Scenarios ⭐⭐


Exercise 26.8 Choosing features for a regression model ⭐⭐

You want to predict a restaurant's monthly revenue. You have these potential features:

  1. Number of seats
  2. Average Yelp rating
  3. Neighborhood median income
  4. Restaurant's Instagram follower count
  5. Number of menu items
  6. Distance to nearest subway station
  7. The owner's zodiac sign
  8. Average temperature that month

For each feature, state whether you'd include it in your model and why. Classify each as: "definitely include," "maybe include," or "definitely exclude."

Guidance 1. **Definitely include.** Capacity directly limits revenue potential. 2. **Definitely include.** Ratings influence customer choice and are likely correlated with quality/popularity. 3. **Maybe include.** Higher-income neighborhoods might support higher prices, but the relationship might be nonlinear and confounded by other factors. 4. **Maybe include.** Social media presence may correlate with foot traffic, but could also be a proxy for restaurant type/age. 5. **Maybe include.** Very few items might limit appeal, but more items don't necessarily mean more revenue. 6. **Maybe include.** Accessibility matters, but the relationship might be nonlinear (too close = noisy, too far = inaccessible). 7. **Definitely exclude.** No plausible causal mechanism. Including it would add noise. 8. **Maybe include.** Weather affects dining out, but this varies by restaurant type (outdoor seating vs. not).

Exercise 26.9 Reading a regression report ⭐⭐

A data science team presents the following regression results:

Target: Employee annual salary ($)
Features: years_experience, education_level (1-5), department_code

Training R²: 0.89
Test R²: 0.42

Coefficients:
  years_experience: $2,150
  education_level: $8,300
  department_code: $1,050
  intercept: $28,000
  1. What does the years_experience coefficient mean?
  2. Is this model overfitting? How can you tell?
  3. What's problematic about using department_code as a numerical feature?
  4. What would you recommend the team do before trusting these results?
Guidance 1. Each additional year of experience is associated with $2,150 higher salary, holding education level and department constant. 2. **Yes, severe overfitting.** Training R² (0.89) is much higher than test R² (0.42) — a gap of 0.47. The model has memorized training data rather than learning generalizable patterns. 3. Department codes are categorical, not ordinal — treating them as numbers implies department 4 is "more" than department 2, which is meaningless. Department codes should be one-hot encoded (converted to binary columns). 4. Fix the department encoding. Check the number of features vs. observations. Try simpler models. Consider whether 3 features can reasonably explain salary variation or whether the model needs more data or different features.
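The encoding fix in part 3 might look like this with pandas (the department codes below are hypothetical):

```python
import pandas as pd

# Department codes are labels, not quantities: one-hot encode them
# into binary columns rather than feeding the raw numbers to the model
df = pd.DataFrame({'department_code': [1, 2, 4, 2, 3]})
dummies = pd.get_dummies(df['department_code'], prefix='dept')
print(dummies)
```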

Exercise 26.10 Baseline comparison ⭐⭐

Your linear regression model predicts daily ice cream sales with the following results:

  • Baseline MAE (predict the mean): $245
  • Model MAE: $210
  • R²: 0.28

Your manager says: "R² of 0.28 is terrible — the model only explains 28% of the variance. We should scrap it."

Write a response defending or challenging this assessment. Consider: Is R² the right metric to focus on? Is the improvement over baseline meaningful? What domain context matters?

Guidance The manager's concern is understandable but potentially misguided. An R² of 0.28 means 28% of variance is explained — modest but not necessarily bad. Key counterpoints: (1) The model reduces MAE from $245 to $210 — a 14% improvement over the baseline. In a business context, consistently reducing forecast error by 14% could save significant money in inventory management. (2) Ice cream sales are influenced by many unpredictable factors (random foot traffic, competing events, social media mentions). In noisy domains, R² values above 0.20 can be practically useful. (3) R² shouldn't be the only metric — the improvement in MAE directly relates to business value. (4) The right question isn't "is R² high?" but "does this model help us make better decisions than guessing the average?" The answer appears to be yes.
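The baseline comparison in the response above is one line of arithmetic:

```python
baseline_mae, model_mae = 245, 210
improvement = (baseline_mae - model_mae) / baseline_mae
print(f"Error reduction vs. baseline: {improvement:.1%}")  # 14.3%
```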

Exercise 26.11 Feature engineering ⭐⭐

You're predicting house prices and have a feature "year_built" (e.g., 1965, 2003, 2020). Using year_built directly in linear regression has a problem: the model assumes a linear relationship between year and price, but very old houses might actually be worth more than moderately old houses (if they're "historic").

Suggest three ways to engineer better features from year_built:

Guidance 1. **Age:** Create `house_age = current_year - year_built`. This is more intuitive than the year itself and directly represents the concept of aging. 2. **Age categories:** Create bins like "new (0-10 years)," "moderate (10-30 years)," "old (30-60 years)," "historic (60+ years)" and one-hot encode them. This captures the nonlinear relationship. 3. **Decade:** Create a "decade_built" feature (1960s, 1970s, etc.) and one-hot encode. This allows the model to capture era-specific effects (e.g., houses from a particular decade might have asbestos or other issues that depress value).
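Option 2 could be sketched with `pd.cut` (the bin edges and reference year here are assumptions for illustration):

```python
import pandas as pd

current_year = 2024  # assumed reference year for computing age
year_built = pd.Series([1965, 2003, 2020, 1910, 1988])
age = current_year - year_built
# Bin ages into the categories from the guidance, then one-hot encode
age_bin = pd.cut(age, bins=[0, 10, 30, 60, 200],
                 labels=['new', 'moderate', 'old', 'historic'])
print(age_bin.tolist())
print(pd.get_dummies(age_bin, prefix='age'))
```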

Exercise 26.12 Extrapolation danger ⭐⭐

Your model was trained on countries with GDP per capita between $1,000 and $65,000, and it predicts:

Vaccination Rate = 55 + 0.0004 * GDP_per_capita
  1. What does the model predict for a country with GDP per capita of $50,000? Is this reasonable?
  2. What does the model predict for GDP per capita of $200,000? Is this reasonable?
  3. What does the model predict for GDP per capita of $300? Is this reasonable?
  4. What general lesson does this illustrate about extrapolation?
Guidance 1. 55 + 0.0004 * 50,000 = 75%. This is within the training range and could be reasonable. 2. 55 + 0.0004 * 200,000 = 135%. This exceeds 100% — physically impossible. The model is extrapolating far beyond its training data and producing nonsensical results. 3. 55 + 0.0004 * 300 = 55.12%. This might be too high for an extremely poor country. The linear assumption may not hold at the extremes. 4. Linear regression assumes the linear relationship holds everywhere, including outside the range of training data. This is often false — relationships may curve, plateau, or reverse at extreme values. Never trust predictions that extrapolate far beyond the training data range.
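The three predictions as code, with the training range noted:

```python
# Model from the exercise; trained only on GDP per capita of $1,000-$65,000
def predicted_rate(gdp_per_capita):
    return 55 + 0.0004 * gdp_per_capita

print(f"{predicted_rate(50_000):.2f}%")   # 75.00% -- inside the range
print(f"{predicted_rate(200_000):.2f}%")  # 135.00% -- impossible value
print(f"{predicted_rate(300):.2f}%")      # 55.12% -- far below the range
```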

Part C: Coding Exercises ⭐⭐


Exercise 26.13 Simple linear regression from scratch ⭐⭐

Create a scatter plot showing the relationship between study hours and exam scores, then fit a linear regression model:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

np.random.seed(42)
study_hours = np.random.uniform(1, 12, 50)
exam_scores = 40 + 4.5 * study_hours + np.random.normal(0, 8, 50)

Tasks:
  1. Create a scatter plot of the data
  2. Fit a LinearRegression model
  3. Add the regression line to the scatter plot
  4. Print and interpret the slope and intercept
  5. What exam score does the model predict for 8 hours of studying?

Guidance
import matplotlib.pyplot as plt

X = study_hours.reshape(-1, 1)
y = exam_scores

model = LinearRegression().fit(X, y)

plt.scatter(study_hours, exam_scores, alpha=0.6)
x_line = np.linspace(1, 12, 100).reshape(-1, 1)
plt.plot(x_line, model.predict(x_line), color='coral', lw=2)
plt.xlabel('Study Hours')
plt.ylabel('Exam Score')
plt.show()

print(f"Slope: {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"Predicted score for 8 hours: "
      f"{model.predict([[8]])[0]:.1f}")

Exercise 26.14 Full workflow with train-test split ⭐⭐

Using the study hours data from Exercise 26.13, implement the complete modeling workflow:

  1. Split into 80/20 train/test sets
  2. Calculate the baseline MAE (mean prediction)
  3. Fit a LinearRegression on the training set
  4. Evaluate on the test set (R² and MAE)
  5. Compare training R² to test R² (check for overfitting)
  6. Compare model MAE to baseline MAE (check for usefulness)
Guidance
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

baseline_mae = mean_absolute_error(
    y_test, [y_train.mean()] * len(y_test))

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Training R²:  {model.score(X_train, y_train):.3f}")
print(f"Test R²:      {model.score(X_test, y_test):.3f}")
print(f"Model MAE:    {mean_absolute_error(y_test, y_pred):.2f}")
print(f"Baseline MAE: {baseline_mae:.2f}")

Exercise 26.15 Multiple regression comparison ⭐⭐

Extend the model from Exercise 26.14 by adding features. Generate additional features and compare simple vs. multiple regression:

np.random.seed(42)
n = 200
data = pd.DataFrame({
    'study_hours': np.random.uniform(1, 12, n),
    'sleep_hours': np.random.uniform(4, 10, n),
    'previous_score': np.random.uniform(50, 100, n),
})
data['exam_score'] = (
    20 + 4 * data['study_hours'] +
    3 * data['sleep_hours'] +
    0.4 * data['previous_score'] +
    np.random.normal(0, 6, n)
)

Tasks:
  1. Fit a model using only study_hours. Report test R².
  2. Fit a model using all three features. Report test R².
  3. How much did adding features improve R²?
  4. Print the coefficients. Which feature is the strongest predictor?
  5. Are any features redundant?

Guidance
X1 = data[['study_hours']]
X3 = data[['study_hours', 'sleep_hours', 'previous_score']]
y = data['exam_score']

X1_tr, X1_te, y_tr, y_te = train_test_split(
    X1, y, test_size=0.2, random_state=42)
X3_tr, X3_te, _, _ = train_test_split(
    X3, y, test_size=0.2, random_state=42)

m1 = LinearRegression().fit(X1_tr, y_tr)
m3 = LinearRegression().fit(X3_tr, y_tr)

print(f"Simple R²: {m1.score(X1_te, y_te):.3f}")
print(f"Multiple R²: {m3.score(X3_te, y_te):.3f}")

for f, c in zip(X3.columns, m3.coef_):
    print(f"  {f}: {c:.3f}")

Exercise 26.16 Residual analysis ⭐⭐

Fit a linear regression model and create diagnostic residual plots:

np.random.seed(42)
x = np.random.uniform(0, 10, 100)
y = 2 * x**2 + np.random.normal(0, 10, 100)

Tasks:
  1. Fit a linear regression model to this data
  2. Create a scatter plot with the regression line
  3. Create a residual plot (residuals vs. predicted values)
  4. What pattern do you see in the residuals?
  5. What does this pattern tell you about the appropriateness of linear regression for this data?

Guidance
X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)
residuals = y - y_pred

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].scatter(x, y, alpha=0.6)
axes[0].plot(sorted(x), model.predict(
    np.sort(x).reshape(-1,1)), 'r-', lw=2)
axes[0].set_title('Data with Linear Fit')

axes[1].scatter(y_pred, residuals, alpha=0.6)
axes[1].axhline(0, color='red', linestyle='--')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')

plt.tight_layout()
plt.show()
The residuals show a clear U-shaped pattern: positive at the extremes and negative in the middle, because the straight line undershoots the convex y = 2x² curve at both ends and overshoots it in the center. This systematic pattern means the relationship is quadratic, not linear. A plain linear fit is inappropriate here; adding an x² term would fix the problem.

Exercise 26.17 Log transformation ⭐⭐

When the relationship between feature and target is logarithmic (rapid change at low values, diminishing returns at high values), transforming the feature can help:

np.random.seed(42)
gdp = np.random.lognormal(9, 1.2, 120)
vacc = 30 + 8 * np.log(gdp / 1000) + np.random.normal(0, 5, 120)
vacc = np.clip(vacc, 10, 100)

Tasks:
  1. Fit linear regression using raw GDP. Report R².
  2. Fit linear regression using log(GDP). Report R².
  3. Which model fits better? Why?
  4. Create scatter plots for both models to visualize the difference.

Guidance
X_raw = gdp.reshape(-1, 1)
X_log = np.log(gdp).reshape(-1, 1)

m_raw = LinearRegression().fit(X_raw, vacc)
m_log = LinearRegression().fit(X_log, vacc)

print(f"Raw GDP R²: {m_raw.score(X_raw, vacc):.3f}")
print(f"Log GDP R²: {m_log.score(X_log, vacc):.3f}")
The log-transformed model should have substantially higher R² because the true relationship is logarithmic. The log transformation "straightens" the curve, making it amenable to linear regression.

Exercise 26.18 Feature scaling and coefficient comparison ⭐⭐

Using the multiple regression model from Exercise 26.15, standardize the features and compare coefficients:

Tasks:
  1. Fit the model on unscaled data. Print coefficients.
  2. Fit the model on StandardScaler-transformed data. Print coefficients.
  3. Which feature has the largest standardized coefficient?
  4. Why are standardized coefficients more useful for comparing feature importance?
  5. Do the model's predictions change after scaling? Verify.

Guidance
from sklearn.preprocessing import StandardScaler

# Unscaled
m_raw = LinearRegression().fit(X3_tr, y_tr)
print("Unscaled coefficients:", dict(zip(X3.columns, m_raw.coef_.round(3))))

# Scaled
scaler = StandardScaler()
X3_tr_s = scaler.fit_transform(X3_tr)
X3_te_s = scaler.transform(X3_te)
m_scaled = LinearRegression().fit(X3_tr_s, y_tr)
print("Scaled coefficients:", dict(zip(X3.columns, m_scaled.coef_.round(3))))

# Compare predictions
pred_raw = m_raw.predict(X3_te)
pred_scaled = m_scaled.predict(X3_te_s)
print(f"Max prediction difference: "
      f"{np.abs(pred_raw - pred_scaled).max():.10f}")
Predictions should be identical (or differ by floating-point rounding only). Standardized coefficients are comparable because they measure the effect of a one-standard-deviation change in each feature, removing the arbitrary influence of measurement units.

Exercise 26.19 Overfitting with too many features ⭐⭐⭐

Demonstrate overfitting by adding random noise features to a regression model:

np.random.seed(42)
n = 50
X_real = np.random.uniform(0, 10, (n, 2))
y = 3 * X_real[:, 0] + 2 * X_real[:, 1] + np.random.normal(0, 3, n)

Tasks:
  1. Fit a model using only the 2 real features. Report train and test R².
  2. Add 20 random noise features (np.random.normal columns). Fit again. Report train and test R².
  3. Add 45 random noise features (almost as many as observations). Report train and test R².
  4. Plot training R² and test R² as a function of number of features.
  5. At what point does overfitting become severe? Why?

Guidance
results = []
for n_noise in [0, 5, 10, 20, 30, 45]:
    X_noise = np.random.normal(0, 1, (n, n_noise))
    X_all = np.hstack([X_real, X_noise]) if n_noise > 0 else X_real
    Xtr, Xte, ytr, yte = train_test_split(
        X_all, y, test_size=0.2, random_state=42)
    m = LinearRegression().fit(Xtr, ytr)
    results.append({
        'n_features': 2 + n_noise,
        'train_r2': m.score(Xtr, ytr),
        'test_r2': m.score(Xte, yte)
    })

res_df = pd.DataFrame(results)
print(res_df)
Training R² increases as you add features (even noise features — the model always finds *something* to fit). Test R² initially stays stable, then drops as noise features dominate. With 45 noise features and only 50 observations, the model can nearly perfectly fit the training data but performs terribly on the test set. Overfitting becomes severe when the number of features approaches the number of observations.

Exercise 26.20 Predicted vs. actual plot ⭐⭐

Create a "predicted vs. actual" diagnostic plot for the vaccination rate model:

np.random.seed(42)
n = 150
X = pd.DataFrame({
    'gdp': np.random.lognormal(9, 1.2, n),
    'health': np.random.uniform(2, 12, n),
    'education': np.random.uniform(0.3, 0.95, n)
})
y = 30 + 5*np.log(X['gdp']/1000) + 2*X['health'] + 25*X['education'] + np.random.normal(0, 5, n)
y = y.clip(15, 100)

Tasks:
  1. Split, train, and predict
  2. Create a predicted vs. actual scatter plot
  3. Add the diagonal "perfect prediction" line
  4. Calculate and display R² on the plot
  5. Are there any obvious patterns in the errors?

Guidance
from sklearn.metrics import r2_score

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(Xtr, ytr)
ypred = model.predict(Xte)

plt.figure(figsize=(8, 8))
plt.scatter(yte, ypred, alpha=0.6)
lims = [min(yte.min(), ypred.min()), max(yte.max(), ypred.max())]
plt.plot(lims, lims, 'r--', lw=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title(f'Predicted vs Actual (R² = {r2_score(yte, ypred):.3f})')
plt.show()

Part D: Synthesis and Critical Thinking ⭐⭐⭐


Exercise 26.21 The interpretation trap ⭐⭐⭐

A researcher builds a regression model predicting country life expectancy from GDP per capita. The coefficient is positive and significant. They write: "Increasing GDP per capita by $10,000 will increase life expectancy by 2.3 years."

Identify at least three problems with this conclusion. Connect your answer to concepts from both Chapter 24 (correlation/causation) and Chapter 26 (regression interpretation).

Guidance 1. **Causation claim from observational data.** The regression coefficient describes an association, not a causal effect. GDP might not directly cause longer lives — both might be driven by institutional quality, education, or other confounders. 2. **"Holding other variables constant" is violated.** If the model only includes GDP, the coefficient absorbs the effects of all correlated factors. The coefficient for GDP includes the effects of healthcare, education, nutrition, etc. 3. **Linearity assumed.** The model assumes each additional $10,000 has the same effect whether you're going from $5K to $15K or from $85K to $95K. In reality, the relationship likely has diminishing returns. 4. **Extrapolation.** The claim implies a general rule, but the model was trained on a specific range of GDP values. The relationship might not hold outside that range.

Exercise 26.22 When to stop adding features ⭐⭐⭐

You're building a model to predict employee performance ratings. You have access to 200 features. A colleague suggests using all of them "to capture as much information as possible."

Write a paragraph explaining why this is likely a bad idea. Reference specific concepts from Chapters 25 and 26 (overfitting, bias-variance tradeoff, multicollinearity, interpretability).

Guidance Using all 200 features is problematic for several reasons. First, with many features relative to observations, the model will almost certainly overfit — it will find patterns in the training data that are just noise, producing high training R² but low test R² (Chapter 25 bias-variance tradeoff). Second, many features will be correlated with each other (multicollinearity), making individual coefficients unreliable and uninterpretable. Third, features like "carpet color in office" or "birth month" have no plausible relationship with performance — including them adds noise without signal. Fourth, an interpretable model with 5-10 well-chosen features is far more useful for decision-making than a 200-feature black box. The recommended approach: start with domain knowledge to select the most plausible features, build a simple model, then add features incrementally only if they improve test performance.

Exercise 26.23 Model comparison report ⭐⭐⭐

Build three models of increasing complexity and write a short report comparing them:

  1. Model A: Simple regression with 1 feature
  2. Model B: Multiple regression with 3-4 features
  3. Model C: Multiple regression with 3-4 features plus polynomial terms (degree 2)

For each, report training R², test R², MAE, and whether the model is overfitting, underfitting, or well-fit. Which model would you deploy and why?

Guidance Use any dataset from the exercises above. The report should show that Model A likely underfits (low R² on both sets), Model B likely fits well (good R² on both sets, small gap), and Model C might overfit slightly (higher training R², potentially lower test R² than Model B). The choice should favor Model B if it offers good performance without overfitting, emphasizing the principle of parsimony — the simplest model that captures the important patterns.
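Model C's polynomial terms need not be constructed by hand. A pipeline sketch (the synthetic data here is made up for illustration; any dataset from the exercises works the same way):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 3))
# Target depends on one feature linearly and another quadratically
y = 2 * X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 1, 200)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)

# Model C: the same LinearRegression, but with degree-2 terms
# (squares and pairwise interactions) generated from the features first
model_c = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                        LinearRegression()).fit(Xtr, ytr)
print(f"Train R²: {model_c.score(Xtr, ytr):.3f}")
print(f"Test R²:  {model_c.score(Xte, yte):.3f}")
```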

Exercise 26.24 Real-world regression critique ⭐⭐⭐

Find a news article or blog post that presents results from a regression analysis (search for "regression analysis shows that..." or "our model predicts that..."). Critically evaluate the claims:

  1. What is the target variable? What features are used?
  2. Is this a prediction or explanation task?
  3. Do the authors confuse correlation with causation?
  4. Is R² or any evaluation metric reported? If so, is it the training or test score?
  5. Could there be important confounders not included in the model?
  6. Are the results being extrapolated beyond the range of the data?
Guidance This is an open-ended research exercise. Strong answers will identify specific claims, apply the critical thinking framework from Chapters 24-26, and note where the article falls short of rigorous analysis. Common issues: causal language for observational data, no test set evaluation, missing confounders, and extrapolation beyond the data range.

Exercise 26.25 Teaching linear regression ⭐⭐⭐⭐

Write a 200-word explanation of linear regression for someone who has never taken a statistics course. Use an analogy. Do not use any mathematical notation. Your explanation should cover: what the model does, what "best fit" means, and why we need to test on data the model hasn't seen.

Guidance A strong answer will use a clear analogy (like the "drawing a line through dots" idea from the chapter), explain the concept of minimizing errors in everyday language, and use the "memorizing vs. learning" analogy for train-test splits. Avoid jargon — no "residuals," "least squares," or "R-squared." Instead: "errors," "best line," and "how much of the pattern the line captures." The 200-word limit forces conciseness — this is harder than writing a long explanation.