Learning Objectives
- Explain why multiple regression is needed beyond simple regression
- Interpret multiple regression coefficients ('holding other variables constant')
- Assess overall model fit and individual predictor significance
- Check regression assumptions using residual diagnostics
- Handle categorical predictors using indicator (dummy) variables
In This Chapter
- Chapter Overview
- 23.1 Why One Variable Isn't Enough (Productive Struggle)
- 23.2 The Multiple Regression Equation
- 23.3 Fitting a Multiple Regression Model
- 23.4 Interpreting Partial Regression Coefficients
- 23.5 Assessing Overall Model Fit
- 23.6 Individual Predictor Significance
- 23.7 Multicollinearity: When Predictors Are Correlated
- 23.8 Indicator (Dummy) Variables: Categorical Predictors in Regression
- 23.9 Residual Diagnostics in Multiple Regression
- 23.10 Interaction Terms: When the Effect Depends on Another Variable
- 23.11 Model Building Strategies
- 23.12 Alex's Application: Predicting Watch Time
- 23.13 James's Application: Controlling for Criminal History
- 23.14 Excel: Multiple Regression with the Data Analysis ToolPak
- 23.15 The Connection to AI and Machine Learning
- 23.16 Mathematical Formulations
- 23.17 Common Mistakes in Multiple Regression
- 23.18 Putting It All Together: A Complete Analysis Checklist
- 23.19 Progressive Project: Build a Multiple Regression Model
- 23.20 Chapter Summary
Chapter 23: Multiple Regression: The Real World Has More Than One Variable
"Essentially, all models are wrong, but some are useful." — George E. P. Box (1987)
Chapter Overview
In Chapter 22, you learned to describe and predict the relationship between one explanatory variable and one response variable. You fit lines, computed correlations, and interpreted slopes.
That was a great start. It was also a lie.
Okay, not a lie exactly — more like a dramatic oversimplification. Because here's the thing about the real world: nothing is caused by just one factor. ER visit rates aren't explained by poverty alone. Watch time isn't determined solely by content diversity. A defendant's likelihood of reoffending isn't captured by a single risk score.
The real world is multivariate. Outcomes depend on many things simultaneously. And if you want your model to reflect reality — even approximately — you need to let it consider more than one variable at a time.
That's what multiple regression does. Instead of the simple model $\hat{y} = b_0 + b_1 x$, we now write:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$
Same idea. More variables. Dramatically more power.
But with that power comes a new conceptual challenge — the threshold concept for this chapter. When we say "$b_1$ is the effect of $x_1$," we now mean the effect of $x_1$ holding all other variables constant. That phrase — "holding other variables constant" — is the single most important idea in multiple regression. It's how regression can partially control for confounders in observational data. It's how we tease apart the separate contributions of intertwined variables. And it's also how people get into trouble when they don't understand what it actually means.
This chapter is also where Theme 5 (controlling for confounders) reaches its full expression, and where Theme 3 (AI models are regression on steroids) starts to come into sharp focus. Deep learning, random forests, gradient boosting — at their core, these are all extensions of the regression framework you're about to learn. The AI models just have more variables, more flexible functional forms, and a lot more computing power. But the logic is the same.
In this chapter, you will learn to:

- Explain why multiple regression is needed beyond simple regression
- Interpret multiple regression coefficients ("holding other variables constant")
- Assess overall model fit and individual predictor significance
- Check regression assumptions using residual diagnostics
- Handle categorical predictors using indicator (dummy) variables
Fast Track: If you're comfortable with simple regression from Chapter 22, skim Sections 23.1–23.2, then focus on Sections 23.4 (interpreting coefficients), 23.7 (multicollinearity), and 23.8 (dummy variables). Complete quiz questions 1, 8, and 15 to verify.
Deep Dive: After this chapter, read Case Study 1 (Maya's multi-predictor ER visit model with confounders controlled) for a complete worked application, then Case Study 2 (James's criminal justice analysis — what happens to the race effect after controlling for criminal history) for a critical evaluation of controlling for confounders. Both include full Python code.
23.1 Why One Variable Isn't Enough (Productive Struggle)
Before we build any models, consider this puzzle.
The Puzzle (Simpson's Paradox)
A hospital compares two treatments for kidney stones. Overall results:
| Treatment | Patients | Successes | Rate |
|-----------|----------|-----------|------|
| Treatment A | 350 | 273 | 78% |
| Treatment B | 350 | 289 | 83% |

Treatment B looks better overall (83% vs. 78%). But when you separate by stone size:
Small stones:

| Treatment | Patients | Successes | Rate |
|-----------|----------|-----------|------|
| Treatment A | 87 | 81 | 93% |
| Treatment B | 270 | 234 | 87% |
Large stones:

| Treatment | Patients | Successes | Rate |
|-----------|----------|-----------|------|
| Treatment A | 263 | 192 | 73% |
| Treatment B | 80 | 55 | 69% |
Treatment A is better for small stones (93% vs. 87%) AND better for large stones (73% vs. 69%). Yet Treatment B has a higher overall success rate.
(a) How is this possible? What variable is creating the paradox?
(b) Which treatment would you choose? Why?
(c) What does this imply about analyzing relationships with only one variable at a time?
Take 3 minutes before reading on. This puzzle motivates the entire chapter.
This is Simpson's Paradox — one of the most unsettling results in all of statistics. The resolution: Treatment A was disproportionately assigned to patients with large stones (the harder cases), while Treatment B was disproportionately assigned to patients with small stones (the easier cases). Stone size is a confounding variable — it's related to both the treatment received and the outcome.
When you ignore the confounder, you get the wrong answer. When you control for it — by looking at the data within categories of stone size — the truth emerges.
This is exactly what multiple regression does. Instead of comparing treatments while ignoring stone size, multiple regression lets you estimate the effect of treatment while holding stone size constant. It's like looking within the rows of those tables, but for continuous variables and many confounders at once.
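You can verify the paradox numerically. The following pandas sketch uses the stratum counts from the tables above, then aggregates over stone size to recover the misleading overall rates:

```python
import pandas as pd

# The kidney-stone data, one row per (treatment, stone size) stratum
stones = pd.DataFrame({
    'treatment': ['A', 'A', 'B', 'B'],
    'size':      ['small', 'large', 'small', 'large'],
    'patients':  [87, 263, 270, 80],
    'successes': [81, 192, 234, 55],
})

# Success rate within each stratum: A beats B for BOTH stone sizes
stones['rate'] = stones['successes'] / stones['patients']
within = stones.pivot(index='size', columns='treatment', values='rate')
print(within.round(3))

# Aggregate over stone size: B appears to beat A overall
overall = stones.groupby('treatment')[['patients', 'successes']].sum()
overall['rate'] = overall['successes'] / overall['patients']
print(overall.round(3))
```

Grouping first and comparing second flips the conclusion, which is exactly the hazard of one-variable-at-a-time analysis.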
🔄 Spaced Review 1 (from Ch.4): Confounding Variables
Way back in Chapter 4, you learned that a confounding variable is one that is associated with both the explanatory and response variables, distorting their apparent relationship. We gave examples: ice cream sales and drowning rates are confounded by temperature; shoe size and reading ability are confounded by age.
At the time, we said the best way to deal with confounders was randomization — randomly assign treatments so that confounders balance out across groups. But what about observational data, where you can't randomly assign?
Multiple regression offers a partial solution: it lets you statistically control for confounders by including them in the model. It's not as strong as randomization (we'll discuss why), but it's often the best tool available.
23.2 The Multiple Regression Equation
In Chapter 22, simple linear regression used one predictor:
$$\hat{y} = b_0 + b_1 x$$
Multiple regression extends this to $k$ predictors:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$
That's it. Same structure, more terms. Let's unpack the notation:
| Symbol | Meaning |
|---|---|
| $\hat{y}$ | Predicted value of the response variable |
| $b_0$ | Intercept: predicted $y$ when all $x$ variables equal zero |
| $b_1, b_2, \ldots, b_k$ | Partial regression coefficients: the predicted change in $y$ for a one-unit increase in $x_i$, holding all other predictors constant |
| $x_1, x_2, \ldots, x_k$ | Predictor (explanatory) variables |
| $k$ | Number of predictor variables |
Example: Maya's ER Visit Model
In Chapter 22, Maya modeled ER visit rates using poverty rate alone:
$$\widehat{\text{ER rate}} = 53.8 + 11.4 \times \text{poverty rate}$$
That model had $R^2 \approx 0.96$ — impressive, but poverty isn't the only factor. Maya knows that insurance coverage and air quality also matter. Her multiple regression model looks like:
$$\widehat{\text{ER rate}} = b_0 + b_1 \times \text{poverty rate} + b_2 \times \text{uninsured percentage} + b_3 \times \text{AQI}$$
where AQI is the Air Quality Index (higher = worse air quality).
Instead of fitting a line in two dimensions, she's now fitting a hyperplane in four dimensions. Don't worry — you don't need to visualize four dimensions. The math works the same way, and the computer handles the geometry.
The Geometry (for Those Who Want It)
With one predictor, the regression model is a line in two-dimensional space.
With two predictors, the regression model is a plane in three-dimensional space. Imagine a flat sheet of paper tilted in space — that's the regression plane. Each observation is a point floating above or below the plane, and the plane is positioned to minimize the total squared vertical distance from all points.
With three or more predictors, you're in hyperspace (mathematically speaking). The model is a hyperplane. You can't visualize it, but the math is unchanged.
The Population Model
Just as simple regression had a population model $y = \beta_0 + \beta_1 x + \varepsilon$, the multiple regression population model is:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$
where:

- $\beta_0, \beta_1, \ldots, \beta_k$ are the true population parameters (Greek letters)
- $b_0, b_1, \ldots, b_k$ are our sample estimates (Latin letters)
- $\varepsilon$ is the random error term, assumed to be $N(0, \sigma^2)$
The least squares method still minimizes $\sum(y_i - \hat{y}_i)^2$, but now the algebra requires matrix operations. Don't worry — we'll let Python and Excel handle the computation.
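For the curious, here is a minimal NumPy sketch of that computation on simulated data (the model and the numbers are made up for illustration). Using `np.linalg.lstsq` solves the least squares problem directly, which is numerically safer than inverting $X^TX$ by hand:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Simulate data from a known model: y = 10 + 2*x1 - 3*x2 + noise
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 10 + 2 * x1 - 3 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix: a column of 1s (for the intercept) plus the predictors
X = np.column_stack([np.ones(n), x1, x2])

# Solve the least squares problem; b holds [intercept, b1, b2]
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 2))  # close to [10, 2, -3]
```

With enough data and modest noise, the estimates land close to the true coefficients we built in, which is the least squares machinery doing its job.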
23.3 Fitting a Multiple Regression Model
Let's see this in action with Maya's data. We'll use the community dataset from Chapter 22, now expanded with additional variables.
Python: statsmodels with the Formula API
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
# Maya's community health data (25 communities)
communities = pd.DataFrame({
'community': [f'Community {i+1}' for i in range(25)],
'poverty_rate': [8.2, 12.5, 15.3, 5.1, 22.6, 18.4, 9.7, 25.1, 11.3,
19.8, 6.4, 14.7, 21.3, 7.8, 16.9, 28.4, 10.2, 23.7,
13.6, 20.5, 4.3, 17.2, 26.8, 8.9, 15.8],
'er_rate': [145, 198, 225, 110, 310, 265, 155, 345, 175,
280, 120, 215, 295, 138, 240, 380, 165, 320,
200, 290, 95, 250, 365, 148, 230],
'uninsured_pct': [6.5, 10.2, 14.1, 3.8, 20.5, 16.8, 7.3, 23.2, 9.8,
18.1, 5.1, 12.9, 19.7, 6.2, 15.3, 25.8, 8.4, 21.6,
11.5, 18.9, 3.2, 15.0, 24.3, 7.1, 13.7],
'aqi': [42, 55, 63, 35, 78, 70, 48, 85, 50,
72, 38, 60, 75, 44, 65, 90, 46, 80,
58, 74, 32, 67, 88, 45, 62]
})
# ---- SIMPLE REGRESSION (from Ch.22) ----
simple_model = smf.ols('er_rate ~ poverty_rate', data=communities).fit()
print("=" * 70)
print("SIMPLE REGRESSION: ER rate ~ poverty rate")
print("=" * 70)
print(f"R² = {simple_model.rsquared:.4f}")
print(f"Slope = {simple_model.params['poverty_rate']:.2f}")
print()
# ---- MULTIPLE REGRESSION ----
multi_model = smf.ols('er_rate ~ poverty_rate + uninsured_pct + aqi',
data=communities).fit()
print("=" * 70)
print("MULTIPLE REGRESSION: ER rate ~ poverty + uninsured + AQI")
print("=" * 70)
print(multi_model.summary())
Here's the output that matters:
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -24.5312 18.921 -1.297 0.209 -63.869 14.807
poverty_rate 5.8143 1.842 3.156 0.005 1.984 9.645
uninsured_pct 3.2567 1.614 2.018 0.057 -0.098 6.612
aqi 1.1204 0.409 2.738 0.013 0.269 1.972
==============================================================================
R-squared: 0.979
Adj. R-squared: 0.976
F-statistic: 326.4
Prob (F-statistic): 1.82e-18
Let's interpret every piece of this output. This is the core skill of multiple regression.
23.4 Interpreting Partial Regression Coefficients
This is the threshold concept for the chapter, and it's worth spending extra time on.
🎯 Threshold Concept: "Holding Other Variables Constant"
In simple regression, the slope $b_1$ has a straightforward interpretation: "for each one-unit increase in $x$, $y$ changes by $b_1$ on average."
In multiple regression, each coefficient $b_i$ has a conditional interpretation: "for each one-unit increase in $x_i$, $y$ changes by $b_i$ on average, holding all other predictor variables constant."
This phrase — "holding all other variables constant" — is the gateway to understanding confounders, partial effects, and ultimately the logic behind causal inference in observational data. It means: imagine two communities that are identical in every way except that one has a poverty rate one percentage point higher than the other. The coefficient tells you how much higher we'd predict the ER visit rate to be for that community.
Why this is hard: In the real world, you can never actually hold everything else constant. If two communities have different poverty rates, they probably also have different insurance rates and different air quality. The regression model uses mathematics to approximate what would happen if you could isolate one variable's effect. It's powerful but imperfect — and understanding both the power and the limitations is what separates statistical literacy from statistical naivete.
Interpreting Maya's Model
Let's interpret each coefficient from Maya's multiple regression output:
Intercept ($b_0 = -24.53$): "When poverty rate, uninsured percentage, and AQI are all zero, the predicted ER visit rate is $-24.53$ per 1,000 residents."
This doesn't make practical sense — you can't have a negative ER visit rate, and having zero poverty, zero uninsured, and perfect air quality simultaneously is unrealistic. The intercept is a mathematical anchor. Don't over-interpret it.
Poverty rate ($b_1 = 5.81$): "For each one-percentage-point increase in the poverty rate, the predicted ER visit rate increases by 5.81 visits per 1,000 residents, holding uninsured percentage and AQI constant."
Notice something: in the simple regression, the slope for poverty rate was about 11.4. Now it's 5.81. The effect shrank by almost half. Why? Because some of what we attributed to poverty in the simple model was actually explained by the uninsured rate and air quality, which are correlated with poverty. This is exactly the Simpson's Paradox phenomenon from Section 23.1 — controlling for confounders changes the story.
Uninsured percentage ($b_2 = 3.26$): "For each one-percentage-point increase in the uninsured rate, the predicted ER visit rate increases by 3.26 visits per 1,000 residents, holding poverty rate and AQI constant."
AQI ($b_3 = 1.12$): "For each one-point increase in the Air Quality Index, the predicted ER visit rate increases by 1.12 visits per 1,000 residents, holding poverty rate and uninsured percentage constant."
The Interpretation Template
For any multiple regression coefficient, use this template:
"For each one-unit increase in [predictor variable], the predicted [response variable] changes by $b_i$ [units], holding all other predictor variables constant."
Never forget the last phrase. It's what makes multiple regression coefficients partial regression coefficients.
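To make the template concrete, plug two hypothetical communities into Maya's fitted equation, identical on every predictor except poverty rate (the specific community values are chosen for illustration):

```python
# Coefficients from Maya's regression output (rounded)
b0, b1, b2, b3 = -24.53, 5.81, 3.26, 1.12

def predict_er(poverty, uninsured, aqi):
    """Predicted ER visits per 1,000 residents from the fitted model."""
    return b0 + b1 * poverty + b2 * uninsured + b3 * aqi

# Identical on uninsured_pct and AQI; poverty differs by one point
low  = predict_er(poverty=15.0, uninsured=12.0, aqi=60)
high = predict_er(poverty=16.0, uninsured=12.0, aqi=60)

print(round(high - low, 2))  # 5.81, exactly the poverty coefficient
```

Because the other predictors are literally held constant in this comparison, the predicted difference is exactly $b_1$.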
🔄 Spaced Review 2 (from Ch.17): Effect Sizes and Partial Coefficients
In Chapter 17, you learned that effect sizes tell you "how big?" rather than just "is there an effect?" Cohen's $d$ measured the size of a mean difference in standard deviation units, independent of sample size.
Partial regression coefficients serve a similar purpose: they tell you the size of each predictor's effect, controlling for the other variables. In Chapter 22, the slope in simple regression was an effect size for the bivariate relationship. Now, each partial coefficient is an effect size for that variable's unique contribution.
But be careful: unlike Cohen's $d$, partial coefficients are in the original units of the predictor. A coefficient of 5.81 for poverty rate and 1.12 for AQI doesn't mean poverty has a five times larger effect — the variables are on different scales. To compare effect sizes across predictors, you need standardized coefficients (which we'll preview at the end of this chapter).
Why Coefficients Change from Simple to Multiple Regression
This is the key insight of the chapter. When you go from simple to multiple regression, coefficients can:
- Shrink — This is the most common case. The simple regression coefficient was "bloated" because it included the effects of correlated omitted variables. Adding those variables deflates the original coefficient to its "true" partial effect. (This happened with Maya's poverty rate: 11.4 → 5.81.)
- Flip sign — In extreme cases, the coefficient can reverse direction entirely. A variable that appeared to increase $y$ in simple regression might decrease $y$ once confounders are controlled. This is Simpson's Paradox in regression form.
- Stay roughly the same — If the predictors are uncorrelated with each other (rare in observational data), adding new variables doesn't change existing coefficients much.
- Grow — Occasionally, a coefficient increases when you add other variables. This happens when the added variable acts as a "suppressor" — it removes irrelevant variance from the original predictor, making its signal clearer.
💡 Key Insight
The change from simple to multiple regression coefficients is not a bug — it's the entire point. Simple regression asks: "What is the total association between $x$ and $y$?" Multiple regression asks: "What is the unique association between $x$ and $y$, after removing the shared associations with other variables?" These are different questions with potentially very different answers.
23.5 Assessing Overall Model Fit
$R^2$ and Adjusted $R^2$
In Chapter 22, you learned that $R^2$ measures the proportion of variability in $y$ explained by the model:
$$R^2 = 1 - \frac{SS_{\text{Residual}}}{SS_{\text{Total}}}$$
For Maya's multiple regression, $R^2 = 0.979$. That means 97.9% of the variability in ER visit rates across communities is explained by poverty rate, uninsured percentage, and AQI together.
But there's a catch — and it's an important one.
$R^2$ can never decrease when you add another predictor. Even if you add a completely useless variable — the community's average shoe size, the mayor's birthday, the number of vowels in the city name — $R^2$ will stay the same or inch upward. This is because each additional variable gives the least squares algorithm one more parameter it can use to fit noise in the data.
This means $R^2$ is a biased measure of model quality in multiple regression. It rewards complexity regardless of whether that complexity is useful.
Enter adjusted $R^2$:
$$R^2_{\text{adj}} = 1 - \frac{SS_{\text{Res}} / (n - k - 1)}{SS_{\text{Total}} / (n - 1)}$$
or equivalently:
$$R^2_{\text{adj}} = 1 - \left(\frac{n - 1}{n - k - 1}\right)(1 - R^2)$$
where $n$ is the sample size and $k$ is the number of predictors.
What adjusted $R^2$ does: It penalizes the model for adding predictors that don't sufficiently improve fit. If a new variable explains enough variability to justify the lost degree of freedom, adjusted $R^2$ goes up. If it doesn't, adjusted $R^2$ goes down — even though $R^2$ went up.
| Metric | Maya's Simple Model ($k = 1$) | Maya's Multiple Model ($k = 3$) |
|---|---|---|
| $R^2$ | 0.962 | 0.979 |
| Adjusted $R^2$ | 0.960 | 0.976 |
Both $R^2$ and adjusted $R^2$ increased here, confirming that the additional variables genuinely improve the model. But the gap between them ($0.979 - 0.976 = 0.003$) represents the penalty for the extra complexity.
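You can check these adjusted values yourself from the formula. A quick sketch:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

n = 25  # Maya's 25 communities

print(round(adjusted_r2(0.962, n, k=1), 3))  # 0.960 (simple model)
print(round(adjusted_r2(0.979, n, k=3), 3))  # 0.976 (multiple model)
```

Notice that the penalty grows with $k$: the multiple model pays a larger complexity tax, but its fit improvement more than covers it here.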
Rule of Thumb: Use $R^2$ for simple regression (where there's only one predictor). Use adjusted $R^2$ for multiple regression (where you're comparing models with different numbers of predictors). Report both.
# Compare R² and adjusted R² across models
print(f"Simple model: R² = {simple_model.rsquared:.4f}, "
f"Adj. R² = {simple_model.rsquared_adj:.4f}")
print(f"Multiple model: R² = {multi_model.rsquared:.4f}, "
f"Adj. R² = {multi_model.rsquared_adj:.4f}")
The F-Test for Overall Model Significance
While $R^2$ tells you how much variability is explained, the F-test tells you whether that explanation is statistically significant. It answers the question:
"Is at least one predictor in this model useful?"
The hypotheses:

- $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_k = 0$ (no predictor has any relationship with $y$)
- $H_a$: At least one $\beta_i \neq 0$ (at least one predictor matters)
The F-statistic:
$$F = \frac{SS_{\text{Regression}} / k}{SS_{\text{Residual}} / (n - k - 1)} = \frac{MS_{\text{Regression}}}{MS_{\text{Residual}}}$$
🔄 Spaced Review 3 (from Ch.22): Decomposing Variability
In Chapter 22 (building on Chapter 20's ANOVA), you learned that total variability decomposes into explained and unexplained components:
$$SS_{\text{Total}} = SS_{\text{Regression}} + SS_{\text{Residual}}$$
This is the same decomposition. The F-test simply asks: is the explained portion ($SS_{\text{Regression}}$) large enough relative to the unexplained portion ($SS_{\text{Residual}}$) that we'd be surprised to see it by chance?
This is also conceptually identical to ANOVA's F-statistic (Chapter 20), which compared $MS_{\text{Between}}$ to $MS_{\text{Within}}$. ANOVA asks "do group means differ?" while the regression F-test asks "do any predictors matter?" — but the mathematics is the same signal-to-noise ratio.
For Maya's model: $F = 326.4$ with $p < 0.001$. We can confidently reject $H_0$. At least one of the three predictors has a statistically significant relationship with ER visit rates.
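A useful cross-check: the F-statistic can also be computed directly from $R^2$, since $F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}$. For Maya's model:

```python
r2, n, k = 0.979, 25, 3

# Explained variability per predictor, over unexplained per residual df
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(f_stat, 1))  # 326.3 (the summary's 326.4 uses the unrounded R²)
```

The tiny discrepancy with the printed 326.4 comes from rounding $R^2$ to three decimals before the calculation.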
But the F-test doesn't tell you which predictors are significant. For that, you need individual t-tests.
23.6 Individual Predictor Significance
Each row in the regression output includes a t-test for the individual predictor:
- $H_0$: $\beta_i = 0$ (this predictor has no relationship with $y$, after accounting for all other predictors)
- $H_a$: $\beta_i \neq 0$
The test statistic:
$$t = \frac{b_i}{SE(b_i)}$$
with $df = n - k - 1$.
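You can reproduce the t column of Maya's output from the coefficients and standard errors (small last-digit differences arise because the displayed values are themselves rounded):

```python
# Coefficients and standard errors copied from the regression output
coef = {'poverty_rate': 5.8143, 'uninsured_pct': 3.2567, 'aqi': 1.1204}
se   = {'poverty_rate': 1.842,  'uninsured_pct': 1.614,  'aqi': 0.409}

# t = coefficient / standard error, with df = n - k - 1 = 21
for name in coef:
    print(name, round(coef[name] / se[name], 3))
```

Each t value measures how many standard errors the estimate sits away from zero, the same signal-to-noise logic as every t-test you've seen.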
Reading Maya's Output
| Predictor | Coefficient | Std Error | t | p-value | Significant at $\alpha = 0.05$? |
|---|---|---|---|---|---|
| poverty_rate | 5.8143 | 1.842 | 3.156 | 0.005 | Yes |
| uninsured_pct | 3.2567 | 1.614 | 2.018 | 0.057 | No (borderline) |
| aqi | 1.1204 | 0.409 | 2.738 | 0.013 | Yes |
Interpretation:

- Poverty rate and AQI are statistically significant predictors of ER visit rates after controlling for the other variables.
- The uninsured percentage has a p-value of 0.057 — just barely above the $\alpha = 0.05$ threshold. This is a borderline result. The coefficient suggests a real effect, but we don't have enough evidence to be confident (with this sample size of 25 communities).
⚠️ Common Mistake: Don't confuse the overall F-test with the individual t-tests. A significant F-test means at least one predictor matters, but individual predictors can still be non-significant. Conversely, it's mathematically possible (though unusual) for the F-test to be significant while no individual predictor reaches significance — this happens when predictors are highly correlated with each other (multicollinearity, which we'll discuss shortly).
Confidence Intervals for Coefficients
🔄 Spaced Review 4 (from Ch.22): Confidence Intervals for Regression Coefficients
In Chapter 22, you learned that the confidence interval for the slope has the same interpretation as any other CI: "If we repeated this study many times, approximately 95% of the resulting confidence intervals would contain the true population parameter $\beta_1$."
The same logic applies to each partial coefficient in multiple regression. The 95% CI for poverty rate is (1.984, 9.645), meaning we're 95% confident that the true effect of a one-percentage-point increase in poverty rate on ER visits (holding other variables constant) is between 1.98 and 9.65 visits per 1,000 residents.
Notice that the CI for uninsured_pct includes zero: ($-0.098$, $6.612$). This is consistent with its non-significant p-value — if the CI includes zero, we cannot rule out the possibility that the true effect is zero.
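Here is a sketch of how the poverty-rate interval is built from the coefficient, its standard error, and the t critical value with $n - k - 1 = 21$ degrees of freedom:

```python
from scipy import stats

b, se, df = 5.8143, 1.842, 21  # poverty coefficient, its SE, residual df

# 95% CI: b ± t* × SE, where t* cuts off the upper 2.5% of t(21)
t_star = stats.t.ppf(0.975, df)
lower, upper = b - t_star * se, b + t_star * se
print(round(lower, 3), round(upper, 3))  # 1.984 9.645
```

This matches the `[0.025  0.975]` columns in the statsmodels summary.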
23.7 Multicollinearity: When Predictors Are Correlated
Here's a question you should be asking right now: if poverty rate, uninsured percentage, and air quality are all correlated with each other, how does the model separate their effects?
The answer: with difficulty.
Multicollinearity occurs when predictor variables are highly correlated with each other. It doesn't violate any regression assumptions — the model is still unbiased — but it creates practical problems:
- Inflated standard errors: When predictors overlap, it becomes hard to determine which one is responsible for the effect. The standard errors for the coefficients increase, making individual predictors less likely to be statistically significant.
- Unstable coefficients: Small changes in the data can cause coefficients to change dramatically — even flip signs.
- Difficult interpretation: If poverty rate and uninsured percentage move together, what does it mean to "hold one constant while the other changes"? In the real world, you might not be able to separate them.
Detecting Multicollinearity: The VIF
The Variance Inflation Factor (VIF) measures how much the variance of a coefficient is inflated due to multicollinearity.
$$\text{VIF}_i = \frac{1}{1 - R_i^2}$$
where $R_i^2$ is the $R^2$ from regressing predictor $x_i$ on all the other predictors.
| VIF Value | Interpretation |
|---|---|
| 1 | No multicollinearity (predictors are uncorrelated) |
| 1–5 | Moderate multicollinearity (generally acceptable) |
| 5–10 | High multicollinearity (investigate further) |
| > 10 | Severe multicollinearity (take action) |
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Calculate VIF for each predictor
X = communities[['poverty_rate', 'uninsured_pct', 'aqi']]
vif_data = pd.DataFrame()
vif_data['Variable'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i)
for i in range(X.shape[1])]
print(vif_data)
Variable VIF
0 poverty_rate 8.74
1 uninsured_pct 7.21
2 aqi 4.35
The VIF values tell us:

- Poverty rate (VIF = 8.74): high multicollinearity — its effect is hard to separate from the other predictors
- Uninsured percentage (VIF = 7.21): high multicollinearity
- AQI (VIF = 4.35): moderate multicollinearity
This explains why uninsured_pct wasn't statistically significant even though its coefficient was reasonably large ($b_2 = 3.26$). The high VIF means its standard error is inflated by a factor of $\sqrt{7.21} \approx 2.7$, making it harder to detect a real effect.
What to Do About Multicollinearity
| Situation | Action |
|---|---|
| VIF < 5 for all predictors | No action needed |
| Some VIF values 5–10 | Be cautious in interpretation; acknowledge multicollinearity in your report |
| VIF > 10 | Consider removing one of the correlated predictors, or combining them into a composite variable |
| Perfect multicollinearity (e.g., including both Fahrenheit and Celsius temperature) | The model won't run — Python/Excel will drop one variable or give an error |
💡 Key Insight
Multicollinearity affects individual coefficients but NOT the overall model predictions. If you're using the model for prediction (as opposed to understanding individual effects), multicollinearity is less of a concern. If you're trying to interpret what each variable does separately, it's a major concern.
23.8 Indicator (Dummy) Variables: Categorical Predictors in Regression
So far, all our predictors have been numerical: poverty rate, uninsured percentage, AQI. But what if you want to include a categorical variable — like region (urban vs. rural) or insurance type (Medicare, Medicaid, private)?
Regression requires numerical inputs. To include categorical variables, we create indicator variables (also called dummy variables) — variables that equal 1 if an observation belongs to a category and 0 otherwise.
Two Categories: Simple Case
Suppose Maya wants to add an urban/rural indicator to her model:
$$\text{urban} = \begin{cases} 1 & \text{if community is urban} \\ 0 & \text{if community is rural} \end{cases}$$
The model becomes:
$$\widehat{\text{ER rate}} = b_0 + b_1 \times \text{poverty rate} + b_2 \times \text{uninsured pct} + b_3 \times \text{AQI} + b_4 \times \text{urban}$$
Interpretation of $b_4$: "Compared to rural communities, urban communities have a predicted ER visit rate that is $b_4$ units higher (or lower if $b_4 < 0$), holding poverty rate, uninsured percentage, and AQI constant."
The indicator variable effectively shifts the regression plane up or down for urban communities. The slopes stay the same, but the intercept changes.
More Than Two Categories: The Dummy Variable Trap
What if the variable has more than two categories? Suppose Maya classifies communities as "urban," "suburban," or "rural."
You might think: create three indicators — $\text{urban}$, $\text{suburban}$, $\text{rural}$. But this creates a problem called perfect multicollinearity: urban + suburban + rural = 1 for every observation. The model can't separate the effects.
The solution: use $k - 1$ indicator variables for a variable with $k$ categories. With three categories, you use two indicators:
$$\text{urban} = \begin{cases} 1 & \text{if urban} \\ 0 & \text{otherwise} \end{cases} \qquad \text{suburban} = \begin{cases} 1 & \text{if suburban} \\ 0 & \text{otherwise} \end{cases}$$
Rural becomes the reference category — the group against which others are compared. When both urban = 0 and suburban = 0, the community must be rural.
| Community Type | urban | suburban | Interpretation of coefficients |
|---|---|---|---|
| Rural | 0 | 0 | Baseline (captured in intercept) |
| Urban | 1 | 0 | $b_{\text{urban}}$ = difference from rural |
| Suburban | 0 | 1 | $b_{\text{suburban}}$ = difference from rural |
# Adding categorical variable to Maya's model
communities['type'] = ['urban', 'suburban', 'rural', 'rural', 'urban',
'suburban', 'rural', 'urban', 'suburban', 'urban',
'rural', 'suburban', 'urban', 'rural', 'suburban',
'urban', 'rural', 'urban', 'suburban', 'urban',
'rural', 'suburban', 'urban', 'rural', 'suburban']
# statsmodels automatically creates dummy variables with C()
model_with_type = smf.ols(
'er_rate ~ poverty_rate + uninsured_pct + aqi + C(type)',
data=communities
).fit()
print(model_with_type.summary())
The output will show coefficients like C(type)[T.urban] and C(type)[T.suburban], each representing the difference from the reference category (rural).
💡 Key Insight
Python's `statsmodels` (with the formula API) and R handle dummy variable creation automatically using the `C()` function. In Excel's Data Analysis ToolPak, you need to create the 0/1 columns yourself. Always remember: for $k$ categories, use $k - 1$ dummy variables.
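If you ever need to build the 0/1 columns yourself, as the Excel workflow requires, pandas can generate them. A minimal sketch on a hypothetical five-row table (not Maya's full dataset), using `get_dummies` with `drop_first=True` to keep the $k - 1$ indicator columns:

```python
import pandas as pd

# Hypothetical community types (small illustration only)
df = pd.DataFrame({'type': ['urban', 'suburban', 'rural', 'rural', 'urban']})

# drop_first=True keeps k - 1 = 2 columns; 'rural' (alphabetically first)
# becomes the reference category absorbed into the intercept
dummies = pd.get_dummies(df['type'], prefix='type', drop_first=True).astype(int)
print(dummies.columns.tolist())
```

Each row then has at most one 1; a row of all zeros means the reference category (rural).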
23.9 Residual Diagnostics in Multiple Regression
In Chapter 22, you learned the LINE conditions for simple regression inference. The same conditions apply in multiple regression, and they're even more important to check because the model is more complex.
The LINE Conditions (Revisited)
| Condition | What to Check | How to Check |
|---|---|---|
| Linearity | Relationship between each predictor and $y$ is linear | Residuals vs. predicted values plot: look for no pattern |
| Independence | Observations are independent of each other | Study design: no time series, no clustering, no repeated measures |
| Normality | Residuals are approximately normally distributed | Histogram and QQ-plot of residuals |
| Equal variance | Residuals have constant spread across all predicted values (homoscedasticity) | Residuals vs. predicted values: look for constant width |
Residual Plots in Python
# ---- RESIDUAL DIAGNOSTICS ----
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# 1. Residuals vs. Predicted (check linearity and equal variance)
axes[0, 0].scatter(multi_model.fittedvalues, multi_model.resid,
color='steelblue', edgecolors='navy', s=60, alpha=0.7)
axes[0, 0].axhline(y=0, color='red', linestyle='--', linewidth=1.5)
axes[0, 0].set_xlabel('Predicted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs. Predicted')
# 2. QQ-plot (check normality)
from scipy import stats
(theoretical, sample), (slope, intercept, r) = stats.probplot(
multi_model.resid, dist="norm"
)
axes[0, 1].scatter(theoretical, sample, color='steelblue',
edgecolors='navy', s=60, alpha=0.7)
axes[0, 1].plot(theoretical, slope * np.array(theoretical) + intercept,
'r--', linewidth=1.5)
axes[0, 1].set_xlabel('Theoretical Quantiles')
axes[0, 1].set_ylabel('Sample Quantiles')
axes[0, 1].set_title('Normal QQ-Plot of Residuals')
# 3. Histogram of residuals
axes[1, 0].hist(multi_model.resid, bins=8, color='steelblue',
edgecolor='navy', alpha=0.7)
axes[1, 0].set_xlabel('Residual')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Residuals')
# 4. Residuals vs. each predictor (check linearity per variable)
axes[1, 1].scatter(communities['poverty_rate'], multi_model.resid,
color='steelblue', edgecolors='navy', s=60, alpha=0.7)
axes[1, 1].axhline(y=0, color='red', linestyle='--', linewidth=1.5)
axes[1, 1].set_xlabel('Poverty Rate')
axes[1, 1].set_ylabel('Residuals')
axes[1, 1].set_title('Residuals vs. Poverty Rate')
plt.suptitle('Residual Diagnostics for Multiple Regression',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
Reading Residual Plots
| Pattern | Diagnosis | Remedy |
|---|---|---|
| Random scatter around zero | Good — assumptions satisfied | None needed |
| Curved pattern | Non-linearity — need to transform variables or add polynomial terms | Try $\log(x)$, $\sqrt{x}$, or add $x^2$ |
| Fan/funnel shape (wider on one side) | Heteroscedasticity — unequal variance | Try $\log(y)$ transformation or use weighted least squares |
| Trend or cycles | Non-independence — often time-series data | Use time series methods (beyond this course) |
| A few extreme points | Potential outliers/influential observations | Investigate; report results with and without |
23.10 Interaction Terms: When the Effect Depends on Another Variable
So far, our model assumes that the effect of each predictor on $y$ is the same regardless of the values of the other predictors. The slope for poverty rate is 5.81 whether AQI is high or low.
But what if that's not true? What if poverty has a bigger effect on ER visits in areas with poor air quality than in areas with clean air? This is called an interaction — the effect of one variable depends on the level of another.
To test for an interaction, you add a product term to the model:
$$\widehat{\text{ER rate}} = b_0 + b_1 \times \text{poverty} + b_2 \times \text{AQI} + b_3 \times (\text{poverty} \times \text{AQI})$$
The coefficient $b_3$ tells you: for each one-unit increase in AQI, how much does the effect of poverty change?
If $b_3$ is statistically significant, you have evidence of a genuine interaction — the variables don't just add up; the effect of one scales with the level of the other.
# Test for interaction between poverty and AQI
interaction_model = smf.ols(
'er_rate ~ poverty_rate * aqi',
data=communities
).fit()
print(interaction_model.summary())
# Note: poverty_rate * aqi in the formula automatically includes
# poverty_rate, aqi, AND their product poverty_rate:aqi
When to Include Interactions
- Theory-driven: You have a reason to believe the effect of one variable depends on another.
- Exploratory: You're testing whether effects differ across subgroups (e.g., does the effect of a treatment differ by gender?).
- Caution: Interactions increase model complexity and reduce interpretability. Don't add them blindly. With $k$ predictors, there are $k(k-1)/2$ possible pairwise interactions. Testing all of them is a p-hacking risk.
💡 Key Insight
When an interaction is present, you cannot interpret the individual ("main") effects in isolation. If $b_3$ (poverty $\times$ AQI) is significant, there is no single "effect of poverty" — it depends on the value of AQI. You must describe the effect of poverty at different levels of AQI. This makes the model more realistic but harder to communicate.
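To make this concrete, here is a sketch of reporting the poverty slope at two AQI levels after fitting an interaction model. The data below are simulated for illustration (not Maya's actual dataset); the slope of poverty at a given AQI is $b_1 + b_3 \times \text{AQI}$:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 150
df = pd.DataFrame({
    'poverty_rate': rng.uniform(5, 30, n),
    'aqi': rng.uniform(20, 120, n),
})
# Simulated outcome with a built-in interaction: poverty matters more
# when air quality is worse (higher AQI)
df['er_rate'] = (50 + 2.0 * df['poverty_rate'] + 0.5 * df['aqi']
                 + 0.05 * df['poverty_rate'] * df['aqi']
                 + rng.normal(0, 5, n))

m = smf.ols('er_rate ~ poverty_rate * aqi', data=df).fit()
b1 = m.params['poverty_rate']          # poverty slope when AQI = 0
b3 = m.params['poverty_rate:aqi']      # change in that slope per AQI unit

# Poverty slope in a clean-air vs. a poor-air community
for aqi_val in (30, 110):
    print(f"AQI={aqi_val}: poverty slope = {b1 + b3 * aqi_val:.2f}")
```

Reporting slopes at a few meaningful values of the moderator is usually clearer for an audience than quoting $b_3$ alone.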
23.11 Model Building Strategies
With many potential predictors, how do you decide which to include? There are three broad strategies:
1. Substantive (Theory-Driven) Model Building
Start with theory and domain knowledge. Include variables that you have reason to believe are related to the outcome. This is the best approach because:
- It's guided by understanding, not data mining
- It reduces the risk of overfitting and p-hacking
- The model is interpretable and defensible
Maya's approach: She included poverty rate, uninsured percentage, and AQI because public health research has established that all three affect healthcare utilization. She didn't go fishing through dozens of variables hoping to find significant ones.
2. Forward Selection
Start with no predictors. Add the one with the smallest p-value. Then add the next best predictor (given the first is already in). Continue until no remaining predictor has $p < \alpha$.
3. Backward Elimination
Start with all potential predictors. Remove the one with the largest p-value (as long as it's above $\alpha$). Refit the model. Continue until all remaining predictors are significant.
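The backward-elimination loop just described can be sketched in a few lines. The data are simulated so that only two of four candidate predictors truly matter (all names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 120
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=['x1', 'x2', 'x3', 'x4'])
# Only x1 and x2 truly drive y; x3 and x4 are pure noise
df['y'] = 3 + 2.0 * df['x1'] - 1.5 * df['x2'] + rng.normal(0, 1, n)

alpha = 0.05
predictors = ['x1', 'x2', 'x3', 'x4']
while predictors:
    model = smf.ols('y ~ ' + ' + '.join(predictors), data=df).fit()
    pvals = model.pvalues.drop('Intercept')    # p-values for slopes only
    worst = pvals.idxmax()
    if pvals[worst] <= alpha:                  # everything significant: stop
        break
    predictors.remove(worst)                   # drop the weakest predictor

print("Retained predictors:", sorted(predictors))
```

The genuine predictors survive; whether a noise variable sneaks through depends on chance, which is exactly the p-hacking risk the text warns about.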
Which Strategy to Use?
| Strategy | Pros | Cons |
|---|---|---|
| Substantive | Theoretically grounded, interpretable, less overfitting | Requires domain knowledge; may miss unexpected predictors |
| Forward | Easy to automate; starts simple and adds complexity | Order-dependent; may miss combinations of predictors |
| Backward | Considers all predictors jointly from the start; won't drop a predictor that matters only in combination with others | Requires more data (at least 10-15 observations per predictor); order-dependent |
Best Practice: Use the substantive approach as your primary strategy. Use forward or backward selection as a secondary check, and be transparent about which method you used. Never present a data-mined model as if it were theory-driven.
A Rule of Thumb for Sample Size
A widely-cited guideline: you need at least 10-15 observations per predictor in your model. With 25 communities, Maya can reasonably include 2-3 predictors. With 5 predictors, she'd be pushing the limits.
This connects to the bias-variance tradeoff: more predictors can explain more variability (lower bias) but with fewer observations per parameter, the estimates become unstable (higher variance).
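A small simulation illustrates the instability: hold $n$ fixed, add pure-noise predictors, and the standard error of a genuine predictor's coefficient typically grows (a sketch; all names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 30                                   # small sample, like Maya's 25 communities
x1 = rng.normal(size=n)
y = 2 * x1 + rng.normal(0, 1, n)         # only x1 truly matters

ses = {}
for k_noise in (0, 5, 15):
    df = pd.DataFrame({'y': y, 'x1': x1})
    for j in range(k_noise):             # pad the model with noise predictors
        df[f'z{j}'] = rng.normal(size=n)
    formula = 'y ~ x1' + ''.join(f' + z{j}' for j in range(k_noise))
    ses[k_noise] = smf.ols(formula, data=df).fit().bse['x1']
    print(f"{1 + k_noise} predictors: SE(b_x1) = {ses[k_noise]:.3f}")
```

With 16 predictors and only 30 observations, the error degrees of freedom shrink to 13 and chance correlations with the noise columns eat into the information available for estimating the real slope.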
23.12 Alex's Application: Predicting Watch Time
Let's see multiple regression in a different context. Alex Rivera at StreamVibe wants to predict weekly watch time from three variables: content diversity score, recommendation accuracy, and subscription tier.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
np.random.seed(42)
n = 200
# Alex's StreamVibe user data
users = pd.DataFrame({
'content_diversity': np.random.uniform(30, 95, n),
'rec_accuracy': np.random.uniform(50, 98, n),
'tier': np.random.choice(['Free', 'Basic', 'Premium'], n,
p=[0.4, 0.35, 0.25])
})
# Generate watch time with known relationships
users['watch_hours'] = (
2.0 +
0.15 * users['content_diversity'] +
0.20 * users['rec_accuracy'] +
3.5 * (users['tier'] == 'Basic').astype(int) +
8.0 * (users['tier'] == 'Premium').astype(int) +
np.random.normal(0, 3, n)
)
# Fit the model
alex_model = smf.ols(
'watch_hours ~ content_diversity + rec_accuracy + C(tier)',
data=users
).fit()
print(alex_model.summary())
Interpreting Alex's Output
Suppose the output shows:
coef std err t P>|t|
-------------------------------------------------------------
Intercept 1.8234 1.512 1.206 0.229
content_diversity 0.1523 0.015 10.153 0.000
rec_accuracy 0.1987 0.016 12.418 0.000
C(tier)[T.Basic] 3.4129 0.478 7.140 0.000
C(tier)[T.Premium] 7.8541 0.535 14.681 0.000
Content diversity ($b_1 = 0.15$): "For each one-point increase in content diversity score, predicted weekly watch time increases by 0.15 hours (about 9 minutes), holding recommendation accuracy and subscription tier constant."
Recommendation accuracy ($b_2 = 0.20$): "For each one-point increase in recommendation accuracy, predicted weekly watch time increases by 0.20 hours (about 12 minutes), holding content diversity and tier constant."
Basic tier ($b_3 = 3.41$): "Compared to Free tier users, Basic tier users watch an average of 3.41 more hours per week, holding content diversity and recommendation accuracy constant."
Premium tier ($b_4 = 7.85$): "Compared to Free tier users, Premium tier users watch an average of 7.85 more hours per week, holding content diversity and recommendation accuracy constant."
Notice: Free is the reference category. The coefficients for Basic and Premium represent the difference from Free, not absolute values.
Alex's Insight for the StreamVibe Team: "Three things drive watch time. Content diversity and recommendation accuracy both matter — improving recommendation accuracy by 10 points adds about 2 hours of weekly watch time. But the biggest factor is subscription tier: Premium users watch nearly 8 hours more per week than Free users, even after controlling for content diversity and recommendation quality. This suggests the Premium content library itself — not just better recommendations — is what keeps subscribers engaged."
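Once a model like Alex's is fit, `predict()` turns it into a what-if tool. The sketch below regenerates data with the same recipe as above (using a different random generator, so estimates will differ slightly from the printed table) and compares two hypothetical users who differ only in tier:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 200
users = pd.DataFrame({
    'content_diversity': rng.uniform(30, 95, n),
    'rec_accuracy': rng.uniform(50, 98, n),
    'tier': rng.choice(['Free', 'Basic', 'Premium'], n, p=[0.4, 0.35, 0.25]),
})
users['watch_hours'] = (2.0 + 0.15 * users['content_diversity']
                        + 0.20 * users['rec_accuracy']
                        + 3.5 * (users['tier'] == 'Basic')
                        + 8.0 * (users['tier'] == 'Premium')
                        + rng.normal(0, 3, n))

model = smf.ols('watch_hours ~ content_diversity + rec_accuracy + C(tier)',
                data=users).fit()

# Two hypothetical users, identical except for subscription tier
new = pd.DataFrame({'content_diversity': [70, 70],
                    'rec_accuracy': [85, 85],
                    'tier': ['Free', 'Premium']})
pred = model.predict(new)
print(f"Premium minus Free: {pred.iloc[1] - pred.iloc[0]:.2f} hours/week")
```

Because the two rows match on everything else, the predicted gap is exactly the estimated Premium coefficient, which is the "holding other variables constant" idea in action.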
23.13 James's Application: Controlling for Criminal History
This is the most important application in the chapter, and it connects directly to the threshold concept.
Professor James Washington has been studying racial disparities in the criminal justice system. In Chapter 22, he found a correlation between a defendant's race and their algorithm-assigned risk score. But critics argue: "Of course risk scores differ by race — they're based on criminal history, and criminal history differs by race. Once you control for criminal history, the race effect disappears."
Multiple regression lets James test this claim directly.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
np.random.seed(2026)
n = 500
# James's defendant data
defendants = pd.DataFrame({
'race': np.random.choice(['White', 'Black'], n, p=[0.55, 0.45]),
'prior_convictions': np.random.poisson(1.5, n),
'age': np.random.normal(32, 8, n).clip(18, 65).astype(int)
})
# Create a prior conviction disparity (Black defendants have
# slightly more priors on average, reflecting systemic factors).
# This must happen BEFORE the scores are generated, so that
# criminal history carries part of the racial gap.
defendants.loc[defendants['race'] == 'Black', 'prior_convictions'] += \
    np.random.binomial(1, 0.3, (defendants['race'] == 'Black').sum())
# Risk score with known structure:
# - Criminal history matters (as it should)
# - Race has a SMALL but real direct effect (which it shouldn't)
defendants['risk_score'] = (
    2.0 +
    1.8 * defendants['prior_convictions'] +
    -0.05 * defendants['age'] +
    0.8 * (defendants['race'] == 'Black').astype(int) +
    np.random.normal(0, 1.2, n)
).clip(1, 10)
# Step 1: Simple regression — race only
simple = smf.ols(
'risk_score ~ C(race, Treatment(reference="White"))',
data=defendants
).fit()
print("=" * 60)
print("MODEL 1: Risk Score ~ Race (simple)")
print("=" * 60)
print(f"Race coefficient: {simple.params.iloc[1]:.3f}")
print(f"p-value: {simple.pvalues.iloc[1]:.4f}")
print(f"R² = {simple.rsquared:.4f}")
print()
# Step 2: Multiple regression — race + criminal history + age
multiple = smf.ols(
'risk_score ~ C(race, Treatment(reference="White")) + '
'prior_convictions + age',
data=defendants
).fit()
print("=" * 60)
print("MODEL 2: Risk Score ~ Race + Priors + Age (multiple)")
print("=" * 60)
print(multiple.summary().tables[1])
print(f"\nR² = {multiple.rsquared:.4f}")
print(f"Adj. R² = {multiple.rsquared_adj:.4f}")
What James Finds
| Model | Race coefficient (Black vs. White) | p-value | R² |
|---|---|---|---|
| Simple (race only) | 1.35 | < 0.001 | 0.062 |
| Multiple (race + priors + age) | 0.78 | < 0.001 | 0.614 |
The race effect shrank — from 1.35 to 0.78 — but it didn't disappear.
In the simple model, Black defendants scored 1.35 points higher than White defendants on average. This includes the direct effect of race plus the indirect effect through criminal history (since Black defendants in this data have slightly more prior convictions, partly reflecting systemic factors).
In the multiple regression, after controlling for prior convictions and age, Black defendants still score 0.78 points higher. This is the estimated direct effect of race on the risk score, holding criminal history and age constant.
James's Conclusion: "The critics were half right. Some of the racial disparity in risk scores is explained by differences in criminal history. But roughly 58% of the disparity (0.78/1.35) persists even after controlling for prior convictions and age. The algorithm appears to penalize Black defendants beyond what their criminal history alone would predict."
⚠️ Important Caveat
Multiple regression can control for measured confounders, but it cannot control for unmeasured ones. James's model controls for prior convictions and age, but there may be other factors — type of offense, employment status, family support — that differ by race and also affect risk scores. This is why multiple regression provides partial control for confounders, not complete control. Only randomized experiments (where race can't be randomized) achieve that. This is a fundamental limitation that every researcher must acknowledge.
23.14 Excel: Multiple Regression with the Data Analysis ToolPak
Excel can perform multiple regression using the Data Analysis ToolPak (the same tool you used for ANOVA in Chapter 20 and simple regression in Chapter 22).
Step-by-Step
1. Enter your data in columns. Each column is a variable, each row is an observation.
2. Create dummy variables manually. If you have a categorical variable with 3 categories, create 2 columns of 0s and 1s. (Excel doesn't do this automatically.)
3. Go to Data → Data Analysis → Regression.
4. Set the Input Y Range to your response variable column.
5. Set the Input X Range to your predictor variable columns (select all of them at once).
6. Check "Labels" if your first row contains column names.
7. Check "Residuals" and "Residual Plots" to get diagnostic information.
8. Click OK.
Reading the Excel Output
Excel's regression output has three sections:
Regression Statistics:

| Metric | Value |
|---|---|
| Multiple R | $\sqrt{R^2}$ (the multiple correlation coefficient) |
| R Square | $R^2$ |
| Adjusted R Square | Adjusted $R^2$ |
| Standard Error | $\sqrt{MS_{\text{Residual}}}$ (the standard deviation of residuals) |
| Observations | $n$ |
ANOVA Table: This is the F-test for overall model significance. Look at the "Significance F" value — if it's below 0.05, at least one predictor is significant.
Coefficients Table: Each row shows one predictor's coefficient, standard error, t-statistic, p-value, and 95% CI. Interpret exactly as in the Python output.
Excel Limitation: Excel doesn't calculate VIF or create QQ-plots automatically. For VIF, you'd need to run separate regressions of each predictor on the others. For QQ-plots, you'd need to sort the residuals and compare them to theoretical normal quantiles manually — or use a statistical add-in. This is one area where Python has a clear advantage.
23.15 The Connection to AI and Machine Learning
Here's where Theme 3 — AI models are regression on steroids — comes into focus.
Everything you've learned in this chapter is the foundation of machine learning:
| Concept in This Chapter | ML Equivalent |
|---|---|
| Multiple regression | Linear models (the simplest ML algorithm) |
| $k$ predictors | "Features" (ML datasets often have hundreds or thousands) |
| $R^2$ and adjusted $R^2$ | Cross-validation score |
| Residual diagnostics | Model evaluation metrics |
| Multicollinearity | Feature selection / dimensionality reduction |
| Dummy variables | One-hot encoding |
| Interaction terms | Feature engineering / polynomial features |
| Forward/backward selection | Automated feature selection |
| Overfitting (too many predictors) | The central problem of ML |
The difference between a multiple regression model with 5 predictors and a neural network with 50,000 parameters is one of scale and flexibility, not of kind. The neural network can capture complex nonlinear relationships that regression can't, but it's still doing the same fundamental thing: learning the relationship between input variables and an output variable from data.
And here's the crucial part: the neural network has all the same problems, amplified:
- Multicollinearity → feature redundancy
- Overfitting → even more severe with millions of parameters
- Interpretation → far harder; "black box" models
- Confounding → still present; ML doesn't solve causation
💡 Key Insight
Understanding multiple regression doesn't just prepare you for the next statistics chapter — it prepares you for the entire field of data science. If you understand why coefficients change when you add variables, why $R^2$ always increases with more predictors, why VIF matters, and why controlling for confounders isn't the same as establishing causation, you have the conceptual foundation for understanding any predictive model, no matter how complex.
23.16 Mathematical Formulations
For students who want to see the mathematics behind the computation, here are the key formulations.
The Normal Equations
In matrix notation, the multiple regression model is:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
where $\mathbf{y}$ is the $n \times 1$ vector of responses, $\mathbf{X}$ is the $n \times (k+1)$ design matrix (with a column of 1s for the intercept), $\boldsymbol{\beta}$ is the $(k+1) \times 1$ parameter vector, and $\boldsymbol{\varepsilon}$ is the error vector.
The least squares estimates satisfy:
$$\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
You don't need to compute this by hand — Python and Excel handle it internally — but knowing this formula explains why multicollinearity is a problem: when predictors are highly correlated, $\mathbf{X}^T\mathbf{X}$ is nearly singular (hard to invert), making the estimates unstable.
The F-Statistic
$$F = \frac{(SS_{\text{Total}} - SS_{\text{Residual}}) / k}{SS_{\text{Residual}} / (n - k - 1)} = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)}$$
Under $H_0$ (all $\beta_i = 0$), $F \sim F(k, n-k-1)$.
The t-Statistic for Individual Predictors
$$t_i = \frac{b_i}{SE(b_i)} = \frac{b_i}{\sqrt{s^2 \cdot (\mathbf{X}^T\mathbf{X})^{-1}_{ii}}}$$
where $s^2 = SS_{\text{Res}} / (n - k - 1)$ is the mean squared error.
Under $H_0$ ($\beta_i = 0$), $t_i \sim t(n - k - 1)$.
Adjusted $R^2$
$$R^2_{\text{adj}} = 1 - \frac{SS_{\text{Res}} / (n - k - 1)}{SS_{\text{Total}} / (n - 1)} = 1 - \left(\frac{n - 1}{n - k - 1}\right)(1 - R^2)$$
VIF
$$\text{VIF}_i = \frac{1}{1 - R_i^2}$$
where $R_i^2$ is from regressing $x_i$ on all other predictors.
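The F and adjusted $R^2$ identities above can be checked numerically against a fitted model (simulated data, hypothetical names):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n, k = 80, 3
df = pd.DataFrame(rng.normal(size=(n, k)), columns=['x1', 'x2', 'x3'])
df['y'] = 1 + df['x1'] + 0.5 * df['x2'] + rng.normal(0, 1, n)

m = smf.ols('y ~ x1 + x2 + x3', data=df).fit()
r2 = m.rsquared

# F-statistic computed from R^2 alone
f_from_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))
# Adjusted R^2 computed from R^2 alone
adj_from_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(np.isclose(f_from_r2, m.fvalue),
      np.isclose(adj_from_r2, m.rsquared_adj))
```

Both quantities agree with statsmodels' `fvalue` and `rsquared_adj` attributes, confirming they are algebraic rearrangements of the same sums of squares.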
23.17 Common Mistakes in Multiple Regression
| Mistake | Why It's Wrong | What to Do Instead |
|---|---|---|
| "The coefficient for poverty is 5.81, which is bigger than 1.12 for AQI, so poverty matters more" | Coefficients depend on the units of measurement; comparing them directly is meaningless | Use standardized coefficients (beta weights) or compare p-values for significance; compare the practical impact (e.g., "a 10-point increase in AQI is realistic, but a 10-percentage-point increase in poverty is huge") |
| Interpreting coefficients without "holding other variables constant" | Omitting the conditional phrase changes the meaning entirely | Always include it in every interpretation |
| Adding every available variable to maximize $R^2$ | $R^2$ always increases with more variables, even irrelevant ones; this overfits the model | Use adjusted $R^2$; use substantive reasoning for variable selection; follow the 10-15 observations per predictor rule |
| Ignoring multicollinearity | High VIF inflates standard errors, making coefficients unreliable and interpretation misleading | Calculate VIF; consider removing or combining highly correlated predictors |
| Using $k$ dummy variables for $k$ categories | Creates perfect multicollinearity; the model won't run or will give meaningless results | Use $k - 1$ dummy variables; choose a meaningful reference category |
| Claiming causation from multiple regression | Even controlling for confounders doesn't establish causation in observational data | State "associated with" rather than "causes"; acknowledge unmeasured confounders |
| Not checking residual plots | The model might fit well by $R^2$ but violate assumptions badly | Always create residuals vs. predicted, QQ-plot, and residuals vs. each predictor |
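The first row of the table mentions standardized coefficients (beta weights). A sketch of computing them by z-scoring every variable before fitting, on simulated data where two predictors have equal real influence but very different units:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 100
df = pd.DataFrame({
    'x_small_units': rng.normal(0, 1, n),    # e.g., a rate measured in proportions
    'x_big_units': rng.normal(0, 100, n),    # e.g., a count in the hundreds
})
df['y'] = (2.0 * df['x_small_units'] + 0.02 * df['x_big_units']
           + rng.normal(0, 1, n))

# Raw coefficients: 2 vs. 0.02 — misleadingly different because of units
raw = smf.ols('y ~ x_small_units + x_big_units', data=df).fit().params

# Standardize everything to mean 0, sd 1, then refit: now comparable
z = (df - df.mean()) / df.std()
beta = smf.ols('y ~ x_small_units + x_big_units', data=z).fit().params
print(beta.drop('Intercept').round(2))
```

After standardizing, the two beta weights come out nearly equal, which matches the fact that both predictors contribute the same amount of variability to $y$ by construction.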
23.18 Putting It All Together: A Complete Analysis Checklist
When you perform a multiple regression analysis, follow these steps:
Pre-Analysis
- [ ] Define your research question. What outcome are you predicting? Why?
- [ ] Choose predictors based on theory. What variables do you expect to be related to the outcome?
- [ ] Check sample size. Do you have at least 10-15 observations per predictor?
Exploratory Analysis
- [ ] Create scatterplots of $y$ vs. each $x$.
- [ ] Create a correlation matrix to check for multicollinearity among predictors.
- [ ] Check for outliers using box plots or scatterplots.
Model Fitting
- [ ] Fit the model using `smf.ols()` in Python or the Data Analysis ToolPak in Excel.
- [ ] Check the overall F-test. Is the model as a whole significant?
- [ ] Check individual t-tests. Which specific predictors are significant?
- [ ] Report $R^2$ and adjusted $R^2$.
Diagnostics
- [ ] Residuals vs. predicted values: Look for patterns (linearity, equal variance).
- [ ] QQ-plot of residuals: Check normality.
- [ ] Calculate VIF: Check for multicollinearity.
- [ ] Check for influential observations (if appropriate).
Interpretation
- [ ] Interpret each coefficient using the "holding other variables constant" template.
- [ ] Discuss causation carefully. Is this observational or experimental data?
- [ ] Acknowledge limitations: unmeasured confounders, sample size, generalizability.
# Complete analysis template
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
from scipy import stats
# 1. Load and explore data
df = pd.read_csv('your_data.csv')
print(df.describe())
print(df.corr())
# 2. Fit the model
model = smf.ols('y ~ x1 + x2 + x3', data=df).fit()
print(model.summary())
# 3. Check VIF (include a constant column so VIFs match the usual definition)
import statsmodels.api as sm
X = sm.add_constant(df[['x1', 'x2', 'x3']])
for i, col in enumerate(X.columns[1:], start=1):   # skip the constant
    vif = variance_inflation_factor(X.values, i)
    print(f"VIF for {col}: {vif:.2f}")
# 4. Residual diagnostics
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Residuals vs. predicted
axes[0].scatter(model.fittedvalues, model.resid, alpha=0.6)
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs. Predicted')
# QQ-plot
stats.probplot(model.resid, plot=axes[1])
axes[1].set_title('Normal QQ-Plot')
# Histogram
axes[2].hist(model.resid, bins=15, edgecolor='navy', alpha=0.7)
axes[2].set_xlabel('Residual')
axes[2].set_title('Distribution of Residuals')
plt.tight_layout()
plt.show()
23.19 Progressive Project: Build a Multiple Regression Model
It's time to extend your Data Detective Portfolio to multiple regression.
Your Task
1. Choose a response variable and at least 2-3 predictor variables from your dataset. At least one predictor should be numerical and at least one should be categorical (if available).
2. Exploratory analysis:
   - Create scatterplots of $y$ vs. each numerical predictor.
   - Compute a correlation matrix among numerical predictors.
   - Create box plots of $y$ by categories of your categorical predictor.
3. Fit a simple regression model with your strongest predictor. Record the slope and $R^2$.
4. Fit a multiple regression model with all your predictors. Use `statsmodels` with the formula API.
5. Compare the simple and multiple models:
   - How did the coefficient for your first predictor change?
   - Did it shrink, stay the same, or grow? What does this tell you about confounding?
   - Compare $R^2$ and adjusted $R^2$.
6. Check multicollinearity: Calculate VIF for each numerical predictor. Are any values above 5?
7. Create residual diagnostic plots: Residuals vs. predicted, QQ-plot, and histogram of residuals. Do the residuals look approximately normal with constant variance?
8. Interpret your results: Write a paragraph interpreting each coefficient using the "holding other variables constant" template. Discuss what your model explains and what it doesn't.
Starter Code
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Load your dataset
df = pd.read_csv('your_dataset.csv')
# --- Step 1: Exploratory Analysis ---
# Correlation matrix for numerical variables
numerical_vars = ['your_response', 'predictor1', 'predictor2']
corr_matrix = df[numerical_vars].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
fmt='.3f', square=True)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()
# Scatterplot matrix
sns.pairplot(df[numerical_vars])
plt.suptitle('Scatterplot Matrix', y=1.02)
plt.show()
# --- Step 2: Simple vs. Multiple Regression ---
# Simple model
simple = smf.ols('your_response ~ predictor1', data=df).fit()
print("SIMPLE MODEL")
print(f"Slope: {simple.params['predictor1']:.4f}")
print(f"R²: {simple.rsquared:.4f}")
# Multiple model
multiple = smf.ols(
'your_response ~ predictor1 + predictor2 + C(categorical_var)',
data=df
).fit()
print("\nMULTIPLE MODEL")
print(multiple.summary())
# --- Step 3: VIF (include a constant so VIFs match the standard definition) ---
import statsmodels.api as sm
X = sm.add_constant(df[['predictor1', 'predictor2']].dropna())
for i, col in enumerate(X.columns[1:], start=1):   # skip the constant column
    vif = variance_inflation_factor(X.values, i)
    print(f"VIF for {col}: {vif:.2f}")
# --- Step 4: Residual Diagnostics ---
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].scatter(multiple.fittedvalues, multiple.resid, alpha=0.6)
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_xlabel('Predicted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs. Predicted')
stats.probplot(multiple.resid, plot=axes[1])
axes[1].set_title('Normal QQ-Plot')
axes[2].hist(multiple.resid, bins=15, edgecolor='navy', alpha=0.7)
axes[2].set_xlabel('Residuals')
axes[2].set_title('Histogram of Residuals')
plt.tight_layout()
plt.show()
# --- Step 5: Interpretation (write in markdown cell) ---
# For each coefficient, write:
# "For each one-unit increase in [predictor], the predicted
# [response] changes by [b] units, holding all other
# variables constant."
Add this analysis to your Jupyter notebook under a new heading: "Multiple Regression: Controlling for Confounders."
23.20 Chapter Summary
What We Learned
This chapter extended simple regression to the real world, where outcomes depend on many variables simultaneously.
- Why multiple regression? Simpson's Paradox demonstrated that analyzing one variable at a time can give the wrong answer. Multiple regression lets you examine the effect of each variable while controlling for the others.
- The multiple regression equation is $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$. Same structure as simple regression, more terms.
- The threshold concept: "holding other variables constant." Each partial regression coefficient $b_i$ estimates the change in $y$ for a one-unit increase in $x_i$, holding all other predictors fixed. This is how regression partially controls for confounders.
- Coefficients change from simple to multiple regression — typically shrinking — because the multiple regression isolates each variable's unique effect rather than its total association. This is the whole point.
- Adjusted $R^2$ penalizes complexity. Unlike $R^2$, which always increases when you add variables, adjusted $R^2$ only increases if the new variable genuinely improves the model.
- The F-test asks "is at least one predictor useful?" while individual t-tests ask "is this specific predictor useful, given the others?"
- Multicollinearity — correlation among predictors — inflates standard errors and makes individual coefficients hard to interpret. Detect it with VIF; address it by removing or combining predictors.
- Indicator (dummy) variables let you include categorical predictors. Use $k - 1$ indicators for $k$ categories.
- Residual diagnostics (linearity, independence, normality, equal variance) are even more important with multiple predictors.
- Interaction terms capture situations where the effect of one variable depends on the level of another.
- Model building should be theory-driven first, data-driven second. Use at least 10-15 observations per predictor.
- Multiple regression is the foundation of machine learning. Neural networks, random forests, and gradient boosting are all extensions of the same basic idea: learn the relationship between inputs and outputs from data.
Key Formulas
| Formula | Description |
|---|---|
| $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$ | Multiple regression equation |
| $R^2_{\text{adj}} = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$ | Adjusted R-squared |
| $F = \frac{R^2/k}{(1-R^2)/(n-k-1)}$ | F-statistic for overall model |
| $t_i = \frac{b_i}{SE(b_i)}$ | t-statistic for individual predictor |
| $\text{VIF}_i = \frac{1}{1 - R_i^2}$ | Variance Inflation Factor |
| $\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ | Least squares solution (matrix form) |
Connections to What's Next
In Chapter 24, you'll learn what happens when the response variable is categorical — yes/no, pass/fail, survived/died. Linear regression can't handle binary outcomes properly (it can predict probabilities below 0 or above 1). Logistic regression solves this problem by modeling the log-odds of success, using a function that always produces values between 0 and 1. The threshold concept shifts from "holding other variables constant" to "thinking in odds."
What's Next: Chapter 24 tackles logistic regression — the tool for predicting categorical outcomes. When the question is "will this customer churn?" or "will this defendant reoffend?" or "will this patient develop the disease?", you need a model that predicts probabilities, not quantities. Logistic regression provides exactly that, building directly on the multiple regression framework you've just mastered.
"The best models of the world are not the ones that capture every detail, but the ones that capture the right details." — Adapted from George E. P. Box