Learning Objectives
- Explain why multiple regression is needed beyond simple regression
- Interpret multiple regression coefficients ('holding other variables constant')
- Assess overall model fit and individual predictor significance
- Check regression assumptions using residual diagnostics
- Handle categorical predictors using indicator (dummy) variables
In This Chapter
- Chapter Overview
- 23.1 Why One Variable Isn't Enough (Productive Struggle)
- 23.2 The Multiple Regression Equation
- 23.3 Fitting a Multiple Regression Model
- 23.4 Interpreting Partial Regression Coefficients
- 23.5 Assessing Overall Model Fit
- 23.6 Individual Predictor Significance
- 23.7 Multicollinearity: When Predictors Are Correlated
- 23.8 Indicator (Dummy) Variables: Categorical Predictors in Regression
- 23.9 Residual Diagnostics in Multiple Regression
- 23.10 Interaction Terms: When the Effect Depends on Another Variable
- 23.11 Model Building Strategies
- 23.12 Alex's Application: Predicting Watch Time
- 23.13 James's Application: Controlling for Criminal History
- 23.14 Excel: Multiple Regression with the Data Analysis ToolPak
- 23.15 The Connection to AI and Machine Learning
- 23.16 Mathematical Formulations
- 23.17 Common Mistakes in Multiple Regression
- 23.18 Putting It All Together: A Complete Analysis Checklist
- 23.19 Progressive Project: Build a Multiple Regression Model
- 23.20 Chapter Summary
Chapter 23: Multiple Regression: The Real World Has More Than One Variable
"Essentially, all models are wrong, but some are useful." — George E. P. Box (1987)
Chapter Overview
In Chapter 22, you learned to describe and predict the relationship between one explanatory variable and one response variable. You fit lines, computed correlations, and interpreted slopes.
That was a great start. It was also a lie.
Okay, not a lie exactly — more like a dramatic oversimplification. Because here's the thing about the real world: nothing is caused by just one factor. ER visit rates aren't explained by poverty alone. Watch time isn't determined solely by content diversity. A defendant's likelihood of reoffending isn't captured by a single risk score.
The real world is multivariate. Outcomes depend on many things simultaneously. And if you want your model to reflect reality — even approximately — you need to let it consider more than one variable at a time.
That's what multiple regression does. Instead of the simple model $\hat{y} = b_0 + b_1 x$, we now write:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$
Same idea. More variables. Dramatically more power.
But with that power comes a new conceptual challenge — the threshold concept for this chapter. When we say "$b_1$ is the effect of $x_1$," we now mean the effect of $x_1$ holding all other variables constant. That phrase — "holding other variables constant" — is the single most important idea in multiple regression. It's how regression can partially control for confounders in observational data. It's how we tease apart the separate contributions of intertwined variables. And it's also how people get into trouble when they don't understand what it actually means.
This chapter is also where Theme 5 (controlling for confounders) reaches its full expression, and where Theme 3 (AI models are regression on steroids) starts to come into sharp focus. Deep learning, random forests, gradient boosting — at their core, these are all extensions of the regression framework you're about to learn. The AI models just have more variables, more flexible functional forms, and a lot more computing power. But the logic is the same.
In this chapter, you will learn to:

- Explain why multiple regression is needed beyond simple regression
- Interpret multiple regression coefficients ("holding other variables constant")
- Assess overall model fit and individual predictor significance
- Check regression assumptions using residual diagnostics
- Handle categorical predictors using indicator (dummy) variables
Fast Track: If you're comfortable with simple regression from Chapter 22, skim Sections 23.1–23.2, then focus on Sections 23.4 (interpreting coefficients), 23.7 (multicollinearity), and 23.8 (dummy variables). Complete quiz questions 1, 8, and 15 to verify.
Deep Dive: After this chapter, read Case Study 1 (Maya's multi-predictor ER visit model with confounders controlled) for a complete worked application, then Case Study 2 (James's criminal justice analysis — what happens to the race effect after controlling for criminal history) for a critical evaluation of controlling for confounders. Both include full Python code.
23.1 Why One Variable Isn't Enough (Productive Struggle)
Before we build any models, consider this puzzle.
The Puzzle (Simpson's Paradox)
A hospital compares two treatments for kidney stones. Overall results:
| Treatment | Patients | Successes | Rate |
|-----------|----------|-----------|------|
| Treatment A | 350 | 273 | 78% |
| Treatment B | 350 | 289 | 83% |

Treatment B looks better overall (83% vs. 78%). But when you separate by stone size:
Small stones:

| Treatment | Patients | Successes | Rate |
|-----------|----------|-----------|------|
| Treatment A | 87 | 81 | 93% |
| Treatment B | 270 | 234 | 87% |
Large stones:

| Treatment | Patients | Successes | Rate |
|-----------|----------|-----------|------|
| Treatment A | 263 | 192 | 73% |
| Treatment B | 80 | 55 | 69% |
Treatment A is better for small stones (93% vs. 87%) AND better for large stones (73% vs. 69%). Yet Treatment B has a higher overall success rate.
(a) How is this possible? What variable is creating the paradox?
(b) Which treatment would you choose? Why?
(c) What does this imply about analyzing relationships with only one variable at a time?
Take 3 minutes before reading on. This puzzle motivates the entire chapter.
This is Simpson's Paradox — one of the most unsettling results in all of statistics. The resolution: Treatment A was disproportionately assigned to patients with large stones (the harder cases), while Treatment B was disproportionately assigned to patients with small stones (the easier cases). Stone size is a confounding variable — it's related to both the treatment received and the outcome.
When you ignore the confounder, you get the wrong answer. When you control for it — by looking at the data within categories of stone size — the truth emerges.
This is exactly what multiple regression does. Instead of comparing treatments while ignoring stone size, multiple regression lets you estimate the effect of treatment while holding stone size constant. It's like looking within the rows of those tables, but for continuous variables and many confounders at once.
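You can verify the paradox numerically. The following pandas sketch uses the stratum counts from the tables above, then aggregates over stone size to recover the misleading overall rates:

```python
import pandas as pd

# The kidney-stone data, one row per (treatment, stone size) stratum
stones = pd.DataFrame({
    'treatment': ['A', 'A', 'B', 'B'],
    'size':      ['small', 'large', 'small', 'large'],
    'patients':  [87, 263, 270, 80],
    'successes': [81, 192, 234, 55],
})

# Success rate within each stratum: A beats B for BOTH stone sizes
stones['rate'] = stones['successes'] / stones['patients']
within = stones.pivot(index='size', columns='treatment', values='rate')
print(within.round(3))

# Aggregate over stone size: B appears to beat A overall
overall = stones.groupby('treatment')[['patients', 'successes']].sum()
overall['rate'] = overall['successes'] / overall['patients']
print(overall.round(3))
```

Grouping first and comparing second flips the conclusion, which is exactly the hazard of one-variable-at-a-time analysis.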
🔄 Spaced Review 1 (from Ch.4): Confounding Variables
Way back in Chapter 4, you learned that a confounding variable is one that is associated with both the explanatory and response variables, distorting their apparent relationship. We gave examples: ice cream sales and drowning rates are confounded by temperature; shoe size and reading ability are confounded by age.
At the time, we said the best way to deal with confounders was randomization — randomly assign treatments so that confounders balance out across groups. But what about observational data, where you can't randomly assign?
Multiple regression offers a partial solution: it lets you statistically control for confounders by including them in the model. It's not as strong as randomization (we'll discuss why), but it's often the best tool available.
23.2 The Multiple Regression Equation
In Chapter 22, simple linear regression used one predictor:
$$\hat{y} = b_0 + b_1 x$$
Multiple regression extends this to $k$ predictors:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$
That's it. Same structure, more terms. Let's unpack the notation:
| Symbol | Meaning |
|---|---|
| $\hat{y}$ | Predicted value of the response variable |
| $b_0$ | Intercept: predicted $y$ when all $x$ variables equal zero |
| $b_1, b_2, \ldots, b_k$ | Partial regression coefficients: the predicted change in $y$ for a one-unit increase in $x_i$, holding all other predictors constant |
| $x_1, x_2, \ldots, x_k$ | Predictor (explanatory) variables |
| $k$ | Number of predictor variables |
Example: Maya's ER Visit Model
In Chapter 22, Maya modeled ER visit rates using poverty rate alone:
$$\widehat{\text{ER rate}} = 53.8 + 11.4 \times \text{poverty rate}$$
That model had $R^2 \approx 0.96$ — impressive, but poverty isn't the only factor. Maya knows that insurance coverage and air quality also matter. Her multiple regression model looks like:
$$\widehat{\text{ER rate}} = b_0 + b_1 \times \text{poverty rate} + b_2 \times \text{uninsured percentage} + b_3 \times \text{AQI}$$
where AQI is the Air Quality Index (higher = worse air quality).
Instead of fitting a line in two dimensions, she's now fitting a hyperplane in four dimensions. Don't worry — you don't need to visualize four dimensions. The math works the same way, and the computer handles the geometry.
The Geometry (for Those Who Want It)
With one predictor, the regression model is a line in two-dimensional space.
With two predictors, the regression model is a plane in three-dimensional space. Imagine a flat sheet of paper tilted in space — that's the regression plane. Each observation is a point floating above or below the plane, and the plane is positioned to minimize the total squared vertical distance from all points.
With three or more predictors, you're in hyperspace (mathematically speaking). The model is a hyperplane. You can't visualize it, but the math is unchanged.
The Population Model
Just as simple regression had a population model $y = \beta_0 + \beta_1 x + \varepsilon$, the multiple regression population model is:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$
where:

- $\beta_0, \beta_1, \ldots, \beta_k$ are the true population parameters (Greek letters)
- $b_0, b_1, \ldots, b_k$ are our sample estimates (Latin letters)
- $\varepsilon$ is the random error term, assumed to be $N(0, \sigma^2)$
The least squares method still minimizes $\sum(y_i - \hat{y}_i)^2$, but now the algebra requires matrix operations. Don't worry — we'll let Python and Excel handle the computation.
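For the curious, here is a minimal NumPy sketch of that computation on simulated data (the model and the numbers are made up for illustration). Using `np.linalg.lstsq` solves the least squares problem directly, which is numerically safer than inverting $X^TX$ by hand:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Simulate data from a known model: y = 10 + 2*x1 - 3*x2 + noise
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 10 + 2 * x1 - 3 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix: a column of 1s (for the intercept) plus the predictors
X = np.column_stack([np.ones(n), x1, x2])

# Solve the least squares problem; b holds [intercept, b1, b2]
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 2))  # close to [10, 2, -3]
```

With enough data and modest noise, the estimates land close to the true coefficients we built in, which is the least squares machinery doing its job.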
23.3 Fitting a Multiple Regression Model
Let's see this in action with Maya's data. We'll use the community dataset from Chapter 22, now expanded with additional variables.
Python: statsmodels with the Formula API
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
# Maya's community health data (25 communities)
communities = pd.DataFrame({
'community': [f'Community {i+1}' for i in range(25)],
'poverty_rate': [8.2, 12.5, 15.3, 5.1, 22.6, 18.4, 9.7, 25.1, 11.3,
19.8, 6.4, 14.7, 21.3, 7.8, 16.9, 28.4, 10.2, 23.7,
13.6, 20.5, 4.3, 17.2, 26.8, 8.9, 15.8],
'er_rate': [145, 198, 225, 110, 310, 265, 155, 345, 175,
280, 120, 215, 295, 138, 240, 380, 165, 320,
200, 290, 95, 250, 365, 148, 230],
'uninsured_pct': [6.5, 10.2, 14.1, 3.8, 20.5, 16.8, 7.3, 23.2, 9.8,
18.1, 5.1, 12.9, 19.7, 6.2, 15.3, 25.8, 8.4, 21.6,
11.5, 18.9, 3.2, 15.0, 24.3, 7.1, 13.7],
'aqi': [42, 55, 63, 35, 78, 70, 48, 85, 50,
72, 38, 60, 75, 44, 65, 90, 46, 80,
58, 74, 32, 67, 88, 45, 62]
})
# ---- SIMPLE REGRESSION (from Ch.22) ----
simple_model = smf.ols('er_rate ~ poverty_rate', data=communities).fit()
print("=" * 70)
print("SIMPLE REGRESSION: ER rate ~ poverty rate")
print("=" * 70)
print(f"R² = {simple_model.rsquared:.4f}")
print(f"Slope = {simple_model.params['poverty_rate']:.2f}")
print()
# ---- MULTIPLE REGRESSION ----
multi_model = smf.ols('er_rate ~ poverty_rate + uninsured_pct + aqi',
data=communities).fit()
print("=" * 70)
print("MULTIPLE REGRESSION: ER rate ~ poverty + uninsured + AQI")
print("=" * 70)
print(multi_model.summary())
Here's the output that matters:
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -24.5312 18.921 -1.297 0.209 -63.869 14.807
poverty_rate 5.8143 1.842 3.156 0.005 1.984 9.645
uninsured_pct 3.2567 1.614 2.018 0.057 -0.098 6.612
aqi 1.1204 0.409 2.738 0.013 0.269 1.972
==============================================================================
R-squared: 0.979
Adj. R-squared: 0.976
F-statistic: 326.4
Prob (F-statistic): 1.82e-18
Let's interpret every piece of this output. This is the core skill of multiple regression.
23.4 Interpreting Partial Regression Coefficients
This is the threshold concept for the chapter, and it's worth spending extra time on.
🎯 Threshold Concept: "Holding Other Variables Constant"
In simple regression, the slope $b_1$ has a straightforward interpretation: "for each one-unit increase in $x$, $y$ changes by $b_1$ on average."
In multiple regression, each coefficient $b_i$ has a conditional interpretation: "for each one-unit increase in $x_i$, $y$ changes by $b_i$ on average, holding all other predictor variables constant."
This phrase — "holding all other variables constant" — is the gateway to understanding confounders, partial effects, and ultimately the logic behind causal inference in observational data. It means: imagine two communities that are identical in every way except that one has a poverty rate one percentage point higher than the other. The coefficient tells you how much higher we'd predict the ER visit rate to be for that community.
Why this is hard: In the real world, you can never actually hold everything else constant. If two communities have different poverty rates, they probably also have different insurance rates and different air quality. The regression model uses mathematics to approximate what would happen if you could isolate one variable's effect. It's powerful but imperfect — and understanding both the power and the limitations is what separates statistical literacy from statistical naivete.
Interpreting Maya's Model
Let's interpret each coefficient from Maya's multiple regression output:
Intercept ($b_0 = -24.53$): "When poverty rate, uninsured percentage, and AQI are all zero, the predicted ER visit rate is $-24.53$ per 1,000 residents."
This doesn't make practical sense — you can't have a negative ER visit rate, and having zero poverty, zero uninsured, and perfect air quality simultaneously is unrealistic. The intercept is a mathematical anchor. Don't over-interpret it.
Poverty rate ($b_1 = 5.81$): "For each one-percentage-point increase in the poverty rate, the predicted ER visit rate increases by 5.81 visits per 1,000 residents, holding uninsured percentage and AQI constant."
Notice something: in the simple regression, the slope for poverty rate was about 11.4. Now it's 5.81. The effect shrank by almost half. Why? Because some of what we attributed to poverty in the simple model was actually explained by the uninsured rate and air quality, which are correlated with poverty. This is exactly the Simpson's Paradox phenomenon from Section 23.1 — controlling for confounders changes the story.
Uninsured percentage ($b_2 = 3.26$): "For each one-percentage-point increase in the uninsured rate, the predicted ER visit rate increases by 3.26 visits per 1,000 residents, holding poverty rate and AQI constant."
AQI ($b_3 = 1.12$): "For each one-point increase in the Air Quality Index, the predicted ER visit rate increases by 1.12 visits per 1,000 residents, holding poverty rate and uninsured percentage constant."
The Interpretation Template
For any multiple regression coefficient, use this template:
"For each one-unit increase in [predictor variable], the predicted [response variable] changes by $b_i$ [units], holding all other predictor variables constant."
Never forget the last phrase. It's what makes multiple regression coefficients partial regression coefficients.
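To make the template concrete, plug two hypothetical communities into Maya's fitted equation, identical on every predictor except poverty rate (the specific community values are chosen for illustration):

```python
# Coefficients from Maya's regression output (rounded)
b0, b1, b2, b3 = -24.53, 5.81, 3.26, 1.12

def predict_er(poverty, uninsured, aqi):
    """Predicted ER visits per 1,000 residents from the fitted model."""
    return b0 + b1 * poverty + b2 * uninsured + b3 * aqi

# Identical on uninsured_pct and AQI; poverty differs by one point
low  = predict_er(poverty=15.0, uninsured=12.0, aqi=60)
high = predict_er(poverty=16.0, uninsured=12.0, aqi=60)

print(round(high - low, 2))  # 5.81, exactly the poverty coefficient
```

Because the other predictors are literally held constant in this comparison, the predicted difference is exactly $b_1$.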
🔄 Spaced Review 2 (from Ch.17): Effect Sizes and Partial Coefficients
In Chapter 17, you learned that effect sizes tell you "how big?" rather than just "is there an effect?" Cohen's $d$ measured the size of a mean difference in standard deviation units, independent of sample size.
Partial regression coefficients serve a similar purpose: they tell you the size of each predictor's effect, controlling for the other variables. In Chapter 22, the slope in simple regression was an effect size for the bivariate relationship. Now, each partial coefficient is an effect size for that variable's unique contribution.
But be careful: unlike Cohen's $d$, partial coefficients are in the original units of the predictor. A coefficient of 5.81 for poverty rate and 1.12 for AQI doesn't mean poverty has a five times larger effect — the variables are on different scales. To compare effect sizes across predictors, you need standardized coefficients (which we'll preview at the end of this chapter).
Why Coefficients Change from Simple to Multiple Regression
This is the key insight of the chapter. When you go from simple to multiple regression, coefficients can:
- Shrink — This is the most common case. The simple regression coefficient was "bloated" because it included the effects of correlated omitted variables. Adding those variables deflates the original coefficient to its "true" partial effect. (This happened with Maya's poverty rate: 11.4 → 5.81.)
- Flip sign — In extreme cases, the coefficient can reverse direction entirely. A variable that appeared to increase $y$ in simple regression might decrease $y$ once confounders are controlled. This is Simpson's Paradox in regression form.
- Stay roughly the same — If the predictors are uncorrelated with each other (rare in observational data), adding new variables doesn't change existing coefficients much.
- Grow — Occasionally, a coefficient increases when you add other variables. This happens when the added variable acts as a "suppressor" — it removes irrelevant variance from the original predictor, making its signal clearer.
💡 Key Insight
The change from simple to multiple regression coefficients is not a bug — it's the entire point. Simple regression asks: "What is the total association between $x$ and $y$?" Multiple regression asks: "What is the unique association between $x$ and $y$, after removing the shared associations with other variables?" These are different questions with potentially very different answers.
23.5 Assessing Overall Model Fit
$R^2$ and Adjusted $R^2$
In Chapter 22, you learned that $R^2$ measures the proportion of variability in $y$ explained by the model:
$$R^2 = 1 - \frac{SS_{\text{Residual}}}{SS_{\text{Total}}}$$
For Maya's multiple regression, $R^2 = 0.979$. That means 97.9% of the variability in ER visit rates across communities is explained by poverty rate, uninsured percentage, and AQI together.
But there's a catch — and it's an important one.
$R^2$ can never decrease when you add another predictor. Even if you add a completely useless variable — the community's average shoe size, the mayor's birthday, the number of vowels in the city name — $R^2$ will stay the same or inch upward. This is because each additional variable gives the least squares algorithm one more parameter it can use to fit noise in the data.
This means $R^2$ is a biased measure of model quality in multiple regression. It rewards complexity regardless of whether that complexity is useful.
Enter adjusted $R^2$:
$$R^2_{\text{adj}} = 1 - \frac{SS_{\text{Res}} / (n - k - 1)}{SS_{\text{Total}} / (n - 1)}$$
or equivalently:
$$R^2_{\text{adj}} = 1 - \left(\frac{n - 1}{n - k - 1}\right)(1 - R^2)$$
where $n$ is the sample size and $k$ is the number of predictors.
What adjusted $R^2$ does: It penalizes the model for adding predictors that don't sufficiently improve fit. If a new variable explains enough variability to justify the lost degree of freedom, adjusted $R^2$ goes up. If it doesn't, adjusted $R^2$ goes down — even though $R^2$ went up.
| Metric | Maya's Simple Model ($k = 1$) | Maya's Multiple Model ($k = 3$) |
|---|---|---|
| $R^2$ | 0.962 | 0.979 |
| Adjusted $R^2$ | 0.960 | 0.976 |
Both $R^2$ and adjusted $R^2$ increased here, confirming that the additional variables genuinely improve the model. But the gap between them ($0.979 - 0.976 = 0.003$) represents the penalty for the extra complexity.
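You can check these adjusted values yourself from the formula. A quick sketch:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

n = 25  # Maya's 25 communities

print(round(adjusted_r2(0.962, n, k=1), 3))  # 0.960 (simple model)
print(round(adjusted_r2(0.979, n, k=3), 3))  # 0.976 (multiple model)
```

Notice that the penalty grows with $k$: the multiple model pays a larger complexity tax, but its fit improvement more than covers it here.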
Rule of Thumb: Use $R^2$ for simple regression (where there's only one predictor). Use adjusted $R^2$ for multiple regression (where you're comparing models with different numbers of predictors). Report both.
# Compare R² and adjusted R² across models
print(f"Simple model: R² = {simple_model.rsquared:.4f}, "
f"Adj. R² = {simple_model.rsquared_adj:.4f}")
print(f"Multiple model: R² = {multi_model.rsquared:.4f}, "
f"Adj. R² = {multi_model.rsquared_adj:.4f}")
The F-Test for Overall Model Significance
While $R^2$ tells you how much variability is explained, the F-test tells you whether that explanation is statistically significant. It answers the question:
"Is at least one predictor in this model useful?"
The hypotheses:

- $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_k = 0$ (no predictor has any relationship with $y$)
- $H_a$: At least one $\beta_i \neq 0$ (at least one predictor matters)
The F-statistic:
$$F = \frac{SS_{\text{Regression}} / k}{SS_{\text{Residual}} / (n - k - 1)} = \frac{MS_{\text{Regression}}}{MS_{\text{Residual}}}$$
🔄 Spaced Review 3 (from Ch.22): Decomposing Variability
In Chapter 22 (building on Chapter 20's ANOVA), you learned that total variability decomposes into explained and unexplained components:
$$SS_{\text{Total}} = SS_{\text{Regression}} + SS_{\text{Residual}}$$
This is the same decomposition. The F-test simply asks: is the explained portion ($SS_{\text{Regression}}$) large enough relative to the unexplained portion ($SS_{\text{Residual}}$) that we'd be surprised to see it by chance?
This is also conceptually identical to ANOVA's F-statistic (Chapter 20), which compared $MS_{\text{Between}}$ to $MS_{\text{Within}}$. ANOVA asks "do group means differ?" while the regression F-test asks "do any predictors matter?" — but the mathematics is the same signal-to-noise ratio.
For Maya's model: $F = 326.4$ with $p < 0.001$. We can confidently reject $H_0$. At least one of the three predictors has a statistically significant relationship with ER visit rates.
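A useful cross-check: the F-statistic can also be computed directly from $R^2$, since $F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}$. For Maya's model:

```python
r2, n, k = 0.979, 25, 3

# Explained variability per predictor, over unexplained per residual df
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(f_stat, 1))  # 326.3 (the summary's 326.4 uses the unrounded R²)
```

The tiny discrepancy with the printed 326.4 comes from rounding $R^2$ to three decimals before the calculation.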
But the F-test doesn't tell you which predictors are significant. For that, you need individual t-tests.
23.6 Individual Predictor Significance
Each row in the regression output includes a t-test for the individual predictor:
- $H_0$: $\beta_i = 0$ (this predictor has no relationship with $y$, after accounting for all other predictors)
- $H_a$: $\beta_i \neq 0$
The test statistic:
$$t = \frac{b_i}{SE(b_i)}$$
with $df = n - k - 1$.
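You can reproduce the t column of Maya's output from the coefficients and standard errors (small last-digit differences arise because the displayed values are themselves rounded):

```python
# Coefficients and standard errors copied from the regression output
coef = {'poverty_rate': 5.8143, 'uninsured_pct': 3.2567, 'aqi': 1.1204}
se   = {'poverty_rate': 1.842,  'uninsured_pct': 1.614,  'aqi': 0.409}

# t = coefficient / standard error, with df = n - k - 1 = 21
for name in coef:
    print(name, round(coef[name] / se[name], 3))
```

Each t value measures how many standard errors the estimate sits away from zero, the same signal-to-noise logic as every t-test you've seen.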
Reading Maya's Output
| Predictor | Coefficient | Std Error | t | p-value | Significant at $\alpha = 0.05$? |
|---|---|---|---|---|---|
| poverty_rate | 5.8143 | 1.842 | 3.156 | 0.005 | Yes |
| uninsured_pct | 3.2567 | 1.614 | 2.018 | 0.057 | No (borderline) |
| aqi | 1.1204 | 0.409 | 2.738 | 0.013 | Yes |
Interpretation:

- Poverty rate and AQI are statistically significant predictors of ER visit rates after controlling for the other variables.
- The uninsured percentage has a p-value of 0.057 — just barely above the $\alpha = 0.05$ threshold. This is a borderline result. The coefficient suggests a real effect, but we don't have enough evidence to be confident (with this sample size of 25 communities).
⚠️ Common Mistake: Don't confuse the overall F-test with the individual t-tests. A significant F-test means at least one predictor matters, but individual predictors can still be non-significant. Conversely, it's mathematically possible (though unusual) for the F-test to be significant while no individual predictor reaches significance — this happens when predictors are highly correlated with each other (multicollinearity, which we'll discuss shortly).
Confidence Intervals for Coefficients
🔄 Spaced Review 4 (from Ch.22): Confidence Intervals for Regression Coefficients
In Chapter 22, you learned that the confidence interval for the slope has the same interpretation as any other CI: "If we repeated this study many times, approximately 95% of the resulting confidence intervals would contain the true population parameter $\beta_1$."
The same logic applies to each partial coefficient in multiple regression. The 95% CI for poverty rate is (1.984, 9.645), meaning we're 95% confident that the true effect of a one-percentage-point increase in poverty rate on ER visits (holding other variables constant) is between 1.98 and 9.65 visits per 1,000 residents.
Notice that the CI for uninsured_pct includes zero: ($-0.098$, $6.612$). This is consistent with its non-significant p-value — if the CI includes zero, we cannot rule out the possibility that the true effect is zero.
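Here is a sketch of how the poverty-rate interval is built from the coefficient, its standard error, and the t critical value with $n - k - 1 = 21$ degrees of freedom:

```python
from scipy import stats

b, se, df = 5.8143, 1.842, 21  # poverty coefficient, its SE, residual df

# 95% CI: b ± t* × SE, where t* cuts off the upper 2.5% of t(21)
t_star = stats.t.ppf(0.975, df)
lower, upper = b - t_star * se, b + t_star * se
print(round(lower, 3), round(upper, 3))  # 1.984 9.645
```

This matches the `[0.025  0.975]` columns in the statsmodels summary.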
23.7 Multicollinearity: When Predictors Are Correlated
Here's a question you should be asking right now: if poverty rate, uninsured percentage, and air quality are all correlated with each other, how does the model separate their effects?
The answer: with difficulty.
Multicollinearity occurs when predictor variables are highly correlated with each other. It doesn't violate any regression assumptions — the model is still unbiased — but it creates practical problems:
- Inflated standard errors: When predictors overlap, it becomes hard to determine which one is responsible for the effect. The standard errors for the coefficients increase, making individual predictors less likely to be statistically significant.
- Unstable coefficients: Small changes in the data can cause coefficients to change dramatically — even flip signs.
- Difficult interpretation: If poverty rate and uninsured percentage move together, what does it mean to "hold one constant while the other changes"? In the real world, you might not be able to separate them.
Detecting Multicollinearity: The VIF
The Variance Inflation Factor (VIF) measures how much the variance of a coefficient is inflated due to multicollinearity.
$$\text{VIF}_i = \frac{1}{1 - R_i^2}$$
where $R_i^2$ is the $R^2$ from regressing predictor $x_i$ on all the other predictors.
| VIF Value | Interpretation |
|---|---|
| 1 | No multicollinearity (predictors are uncorrelated) |
| 1–5 | Moderate multicollinearity (generally acceptable) |
| 5–10 | High multicollinearity (investigate further) |
| > 10 | Severe multicollinearity (take action) |
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Calculate VIF for each predictor
X = communities[['poverty_rate', 'uninsured_pct', 'aqi']]
vif_data = pd.DataFrame()
vif_data['Variable'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i)
for i in range(X.shape[1])]
print(vif_data)
Variable VIF
0 poverty_rate 8.74
1 uninsured_pct 7.21
2 aqi 4.35
The VIF values tell us:

- Poverty rate (VIF = 8.74): high multicollinearity — its effect is hard to separate from the other predictors
- Uninsured percentage (VIF = 7.21): high multicollinearity
- AQI (VIF = 4.35): moderate multicollinearity
This explains why uninsured_pct wasn't statistically significant even though its coefficient was reasonably large ($b_2 = 3.26$). The high VIF means its standard error is inflated by a factor of $\sqrt{7.21} \approx 2.7$, making it harder to detect a real effect.
What to Do About Multicollinearity
| Situation | Action |
|---|---|
| VIF < 5 for all predictors | No action needed |
| Some VIF values 5–10 | Be cautious in interpretation; acknowledge multicollinearity in your report |
| VIF > 10 | Consider removing one of the correlated predictors, or combining them into a composite variable |
| Perfect multicollinearity (e.g., including both Fahrenheit and Celsius temperature) | The model won't run — Python/Excel will drop one variable or give an error |
💡 Key Insight
Multicollinearity affects individual coefficients but NOT the overall model predictions. If you're using the model for prediction (as opposed to understanding individual effects), multicollinearity is less of a concern. If you're trying to interpret what each variable does separately, it's a major concern.
23.8 Indicator (Dummy) Variables: Categorical Predictors in Regression
So far, all our predictors have been numerical: poverty rate, uninsured percentage, AQI. But what if you want to include a categorical variable — like region (urban vs. rural) or insurance type (Medicare, Medicaid, private)?
Regression requires numerical inputs. To include categorical variables, we create indicator variables (also called dummy variables) — variables that equal 1 if an observation belongs to a category and 0 otherwise.
Two Categories: Simple Case
Suppose Maya wants to add an urban/rural indicator to her model:
$$\text{urban} = \begin{cases} 1 & \text{if community is urban} \\ 0 & \text{if community is rural} \end{cases}$$
The model becomes:
$$\widehat{\text{ER rate}} = b_0 + b_1 \times \text{poverty rate} + b_2 \times \text{uninsured pct} + b_3 \times \text{AQI} + b_4 \times \text{urban}$$
Interpretation of $b_4$: "Compared to rural communities, urban communities have a predicted ER visit rate that is $b_4$ units higher (or lower if $b_4 < 0$), holding poverty rate, uninsured percentage, and AQI constant."
The indicator variable effectively shifts the regression plane up or down for urban communities. The slopes stay the same, but the intercept changes.
More Than Two Categories: The Dummy Variable Trap
What if the variable has more than two categories? Suppose Maya classifies communities as "urban," "suburban," or "rural."
You might think: create three indicators — $\text{urban}$, $\text{suburban}$, $\text{rural}$. But this creates a problem called perfect multicollinearity: urban + suburban + rural = 1 for every observation. The model can't separate the effects.
The solution: use $k - 1$ indicator variables for a variable with $k$ categories. With three categories, you use two indicators:
$$\text{urban} = \begin{cases} 1 & \text{if urban} \\ 0 & \text{otherwise} \end{cases} \qquad \text{suburban} = \begin{cases} 1 & \text{if suburban} \\ 0 & \text{otherwise} \end{cases}$$
Rural becomes the reference category — the group against which others are compared. When both urban = 0 and suburban = 0, the community must be rural.
| Community Type | urban | suburban | Interpretation of coefficients |
|---|---|---|---|
| Rural | 0 | 0 | Baseline (captured in intercept) |
| Urban | 1 | 0 | $b_{\text{urban}}$ = difference from rural |
| Suburban | 0 | 1 | $b_{\text{suburban}}$ = difference from rural |
# Adding categorical variable to Maya's model
communities['type'] = ['urban', 'suburban', 'rural', 'rural', 'urban',
'suburban', 'rural', 'urban', 'suburban', 'urban',
'rural', 'suburban', 'urban', 'rural', 'suburban',
'urban', 'rural', 'urban', 'suburban', 'urban',
'rural', 'suburban', 'urban', 'rural', 'suburban']
# statsmodels automatically creates dummy variables with C()
model_with_type = smf.ols(
'er_rate ~ poverty_rate + uninsured_pct + aqi + C(type)',
data=communities
).fit()
print(model_with_type.summary())
The output will show coefficients like C(type)[T.urban] and C(type)[T.suburban], each representing the difference from the reference category (rural).
💡 Key Insight
Python's `statsmodels` (with the formula API) and R handle dummy variable creation automatically using the `C()` function. In Excel's Data Analysis ToolPak, you need to create the 0/1 columns yourself. Always remember: for $k$ categories, use $k - 1$ dummy variables.
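If you ever need to build the 0/1 columns yourself, as the Excel workflow requires, pandas can generate them. A minimal sketch on a hypothetical five-row table (not Maya's full dataset), using `get_dummies` with `drop_first=True` to keep the $k - 1$ indicator columns:

```python
import pandas as pd

# Hypothetical community types (small illustration only)
df = pd.DataFrame({'type': ['urban', 'suburban', 'rural', 'rural', 'urban']})

# drop_first=True keeps k - 1 = 2 columns; 'rural' (alphabetically first)
# becomes the reference category absorbed into the intercept
dummies = pd.get_dummies(df['type'], prefix='type', drop_first=True).astype(int)
print(dummies.columns.tolist())
```

Each row then has at most one 1; a row of all zeros means the reference category (rural).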
23.9 Residual Diagnostics in Multiple Regression
In Chapter 22, you learned the LINE conditions for simple regression inference. The same conditions apply in multiple regression, and they're even more important to check because the model is more complex.
The LINE Conditions (Revisited)
| Condition | What to Check | How to Check |
|---|---|---|
| Linearity | Relationship between each predictor and $y$ is linear | Residuals vs. predicted values plot: look for no pattern |
| Independence | Observations are independent of each other | Study design: no time series, no clustering, no repeated measures |
| Normality | Residuals are approximately normally distributed | Histogram and QQ-plot of residuals |
| Equal variance | Residuals have constant spread across all predicted values (homoscedasticity) | Residuals vs. predicted values: look for constant width |
Residual Plots in Python
# ---- RESIDUAL DIAGNOSTICS ----
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# 1. Residuals vs. Predicted (check linearity and equal variance)
axes[0, 0].scatter(multi_model.fittedvalues, multi_model.resid,
color='steelblue', edgecolors='navy', s=60, alpha=0.7)
axes[0, 0].axhline(y=0, color='red', linestyle='--', linewidth=1.5)
axes[0, 0].set_xlabel('Predicted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs. Predicted')
# 2. QQ-plot (check normality)
from scipy import stats
(theoretical, sample), (slope, intercept, r) = stats.probplot(
multi_model.resid, dist="norm"
)
axes[0, 1].scatter(theoretical, sample, color='steelblue',
edgecolors='navy', s=60, alpha=0.7)
axes[0, 1].plot(theoretical, slope * np.array(theoretical) + intercept,
'r--', linewidth=1.5)
axes[0, 1].set_xlabel('Theoretical Quantiles')
axes[0, 1].set_ylabel('Sample Quantiles')
axes[0, 1].set_title('Normal QQ-Plot of Residuals')
# 3. Histogram of residuals
axes[1, 0].hist(multi_model.resid, bins=8, color='steelblue',
edgecolor='navy', alpha=0.7)
axes[1, 0].set_xlabel('Residual')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Residuals')
# 4. Residuals vs. each predictor (check linearity per variable)
axes[1, 1].scatter(communities['poverty_rate'], multi_model.resid,
color='steelblue', edgecolors='navy', s=60, alpha=0.7)
axes[1, 1].axhline(y=0, color='red', linestyle='--', linewidth=1.5)
axes[1, 1].set_xlabel('Poverty Rate')
axes[1, 1].set_ylabel('Residuals')
axes[1, 1].set_title('Residuals vs. Poverty Rate')
plt.suptitle('Residual Diagnostics for Multiple Regression',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
Reading Residual Plots
| Pattern | Diagnosis | Remedy |
|---|---|---|
| Random scatter around zero | Good — assumptions satisfied | None needed |
| Curved pattern | Non-linearity — need to transform variables or add polynomial terms | Try $\log(x)$, $\sqrt{x}$, or add $x^2$ |
| Fan/funnel shape (wider on one side) | Heteroscedasticity — unequal variance | Try $\log(y)$ transformation or use weighted least squares |
| Trend or cycles | Non-independence — often time-series data | Use time series methods (beyond this course) |
| A few extreme points | Potential outliers/influential observations | Investigate; report results with and without |
23.10 Interaction Terms: When the Effect Depends on Another Variable
So far, our model assumes that the effect of each predictor on $y$ is the same regardless of the values of the other predictors. The slope for poverty rate is 5.81 whether AQI is high or low.
But what if that's not true? What if poverty has a bigger effect on ER visits in areas with poor air quality than in areas with clean air? This is called an interaction — the effect of one variable depends on the level of another.
To test for an interaction, you add a product term to the model:
$$\widehat{\text{ER rate}} = b_0 + b_1 \times \text{poverty} + b_2 \times \text{AQI} + b_3 \times (\text{poverty} \times \text{AQI})$$
The coefficient $b_3$ tells you: for each one-unit increase in AQI, how much does the effect of poverty change?
If $b_3$ is statistically significant, you have evidence of a genuine interaction — the variables don't just add up; the effect of one scales with the level of the other.
# Test for interaction between poverty and AQI
interaction_model = smf.ols(
'er_rate ~ poverty_rate * aqi',
data=communities
).fit()
print(interaction_model.summary())
# Note: poverty_rate * aqi in the formula automatically includes
# poverty_rate, aqi, AND their product poverty_rate:aqi
When to Include Interactions
- Theory-driven: You have a reason to believe the effect of one variable depends on another.
- Exploratory: You're testing whether effects differ across subgroups (e.g., does the effect of a treatment differ by gender?).
- Caution: Interactions increase model complexity and reduce interpretability. Don't add them blindly. With $k$ predictors, there are $k(k-1)/2$ possible pairwise interactions. Testing all of them is a p-hacking risk.
💡 Key Insight
When an interaction is present, you cannot interpret the individual ("main") effects in isolation. If $b_3$ (poverty $\times$ AQI) is significant, there is no single "effect of poverty" — it depends on the value of AQI. You must describe the effect of poverty at different levels of AQI. This makes the model more realistic but harder to communicate.
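To make this concrete, here is a sketch of reporting the poverty slope at two AQI levels after fitting an interaction model. The data below are simulated for illustration (not Maya's actual dataset); the slope of poverty at a given AQI is $b_1 + b_3 \times \text{AQI}$:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 150
df = pd.DataFrame({
    'poverty_rate': rng.uniform(5, 30, n),
    'aqi': rng.uniform(20, 120, n),
})
# Simulated outcome with a built-in interaction: poverty matters more
# when air quality is worse (higher AQI)
df['er_rate'] = (50 + 2.0 * df['poverty_rate'] + 0.5 * df['aqi']
                 + 0.05 * df['poverty_rate'] * df['aqi']
                 + rng.normal(0, 5, n))

m = smf.ols('er_rate ~ poverty_rate * aqi', data=df).fit()
b1 = m.params['poverty_rate']          # poverty slope when AQI = 0
b3 = m.params['poverty_rate:aqi']      # change in that slope per AQI unit

# Poverty slope in a clean-air vs. a poor-air community
for aqi_val in (30, 110):
    print(f"AQI={aqi_val}: poverty slope = {b1 + b3 * aqi_val:.2f}")
```

Reporting slopes at a few meaningful values of the moderator is usually clearer for an audience than quoting $b_3$ alone.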
23.11 Model Building Strategies
With many potential predictors, how do you decide which to include? There are three broad strategies:
1. Substantive (Theory-Driven) Model Building
Start with theory and domain knowledge. Include variables that you have reason to believe are related to the outcome. This is the best approach because:
- It's guided by understanding, not data mining
- It reduces the risk of overfitting and p-hacking
- The model is interpretable and defensible
Maya's approach: She included poverty rate, uninsured percentage, and AQI because public health research has established that all three affect healthcare utilization. She didn't go fishing through dozens of variables hoping to find significant ones.
2. Forward Selection
Start with no predictors. Add the one with the smallest p-value. Then add the next best predictor (given the first is already in). Continue until no remaining predictor has $p < \alpha$.
3. Backward Elimination
Start with all potential predictors. Remove the one with the largest p-value (as long as it's above $\alpha$). Refit the model. Continue until all remaining predictors are significant.
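The backward-elimination loop just described can be sketched in a few lines. The data are simulated so that only two of four candidate predictors truly matter (all names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 120
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=['x1', 'x2', 'x3', 'x4'])
# Only x1 and x2 truly drive y; x3 and x4 are pure noise
df['y'] = 3 + 2.0 * df['x1'] - 1.5 * df['x2'] + rng.normal(0, 1, n)

alpha = 0.05
predictors = ['x1', 'x2', 'x3', 'x4']
while predictors:
    model = smf.ols('y ~ ' + ' + '.join(predictors), data=df).fit()
    pvals = model.pvalues.drop('Intercept')    # p-values for slopes only
    worst = pvals.idxmax()
    if pvals[worst] <= alpha:                  # everything significant: stop
        break
    predictors.remove(worst)                   # drop the weakest predictor

print("Retained predictors:", sorted(predictors))
```

The genuine predictors survive; whether a noise variable sneaks through depends on chance, which is exactly the p-hacking risk the text warns about.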
Which Strategy to Use?
| Strategy | Pros | Cons |
|---|---|---|
| Substantive | Theoretically grounded, interpretable, less overfitting | Requires domain knowledge; may miss unexpected predictors |
| Forward | Easy to automate; starts simple and adds complexity | Order-dependent; may miss combinations of predictors |
| Backward | Considers all predictors jointly from the start; won't drop a predictor that matters only in combination with others | Requires more data (at least 10-15 observations per predictor); order-dependent |
Best Practice: Use the substantive approach as your primary strategy. Use forward or backward selection as a secondary check, and be transparent about which method you used. Never present a data-mined model as if it were theory-driven.
A Rule of Thumb for Sample Size
A widely-cited guideline: you need at least 10-15 observations per predictor in your model. With 25 communities, Maya can reasonably include 2-3 predictors. With 5 predictors, she'd be pushing the limits.
This connects to the bias-variance tradeoff: more predictors can explain more variability (lower bias) but with fewer observations per parameter, the estimates become unstable (higher variance).
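A small simulation illustrates the instability: hold $n$ fixed, add pure-noise predictors, and the standard error of a genuine predictor's coefficient typically grows (a sketch; all names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 30                                   # small sample, like Maya's 25 communities
x1 = rng.normal(size=n)
y = 2 * x1 + rng.normal(0, 1, n)         # only x1 truly matters

ses = {}
for k_noise in (0, 5, 15):
    df = pd.DataFrame({'y': y, 'x1': x1})
    for j in range(k_noise):             # pad the model with noise predictors
        df[f'z{j}'] = rng.normal(size=n)
    formula = 'y ~ x1' + ''.join(f' + z{j}' for j in range(k_noise))
    ses[k_noise] = smf.ols(formula, data=df).fit().bse['x1']
    print(f"{1 + k_noise} predictors: SE(b_x1) = {ses[k_noise]:.3f}")
```

With 16 predictors and only 30 observations, the error degrees of freedom shrink to 13 and chance correlations with the noise columns eat into the information available for estimating the real slope.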
23.12 Alex's Application: Predicting Watch Time
Let's see multiple regression in a different context. Alex Rivera at StreamVibe wants to predict weekly watch time from three variables: content diversity score, recommendation accuracy, and subscription tier.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
np.random.seed(42)
n = 200
# Alex's StreamVibe user data
users = pd.DataFrame({
'content_diversity': np.random.uniform(30, 95, n),
'rec_accuracy': np.random.uniform(50, 98, n),
'tier': np.random.choice(['Free', 'Basic', 'Premium'], n,
p=[0.4, 0.35, 0.25])
})
# Generate watch time with known relationships
users['watch_hours'] = (
2.0 +
0.15 * users['content_diversity'] +
0.20 * users['rec_accuracy'] +
3.5 * (users['tier'] == 'Basic').astype(int) +
8.0 * (users['tier'] == 'Premium').astype(int) +
np.random.normal(0, 3, n)
)
# Fit the model
alex_model = smf.ols(
'watch_hours ~ content_diversity + rec_accuracy + C(tier)',
data=users
).fit()
print(alex_model.summary())
Interpreting Alex's Output
Suppose the output shows:
coef std err t P>|t|
-------------------------------------------------------------
Intercept 1.8234 1.512 1.206 0.229
content_diversity 0.1523 0.015 10.153 0.000
rec_accuracy 0.1987 0.016 12.418 0.000
C(tier)[T.Basic] 3.4129 0.478 7.140 0.000
C(tier)[T.Premium] 7.8541 0.535 14.681 0.000
Content diversity ($b_1 = 0.15$): "For each one-point increase in content diversity score, predicted weekly watch time increases by 0.15 hours (about 9 minutes), holding recommendation accuracy and subscription tier constant."
Recommendation accuracy ($b_2 = 0.20$): "For each one-point increase in recommendation accuracy, predicted weekly watch time increases by 0.20 hours (about 12 minutes), holding content diversity and tier constant."
Basic tier ($b_3 = 3.41$): "Compared to Free tier users, Basic tier users watch an average of 3.41 more hours per week, holding content diversity and recommendation accuracy constant."
Premium tier ($b_4 = 7.85$): "Compared to Free tier users, Premium tier users watch an average of 7.85 more hours per week, holding content diversity and recommendation accuracy constant."
Notice: Free is the reference category. The coefficients for Basic and Premium represent the difference from Free, not absolute values.
Alex's Insight for the StreamVibe Team: "Three things drive watch time. Content diversity and recommendation accuracy both matter — improving recommendation accuracy by 10 points adds about 2 hours of weekly watch time. But the biggest factor is subscription tier: Premium users watch nearly 8 hours more per week than Free users, even after controlling for content diversity and recommendation quality. This suggests the Premium content library itself — not just better recommendations — is what keeps subscribers engaged."
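Once a model like Alex's is fit, `predict()` turns it into a what-if tool. The sketch below regenerates data with the same recipe as above (using a different random generator, so estimates will differ slightly from the printed table) and compares two hypothetical users who differ only in tier:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 200
users = pd.DataFrame({
    'content_diversity': rng.uniform(30, 95, n),
    'rec_accuracy': rng.uniform(50, 98, n),
    'tier': rng.choice(['Free', 'Basic', 'Premium'], n, p=[0.4, 0.35, 0.25]),
})
users['watch_hours'] = (2.0 + 0.15 * users['content_diversity']
                        + 0.20 * users['rec_accuracy']
                        + 3.5 * (users['tier'] == 'Basic')
                        + 8.0 * (users['tier'] == 'Premium')
                        + rng.normal(0, 3, n))

model = smf.ols('watch_hours ~ content_diversity + rec_accuracy + C(tier)',
                data=users).fit()

# Two hypothetical users, identical except for subscription tier
new = pd.DataFrame({'content_diversity': [70, 70],
                    'rec_accuracy': [85, 85],
                    'tier': ['Free', 'Premium']})
pred = model.predict(new)
print(f"Premium minus Free: {pred.iloc[1] - pred.iloc[0]:.2f} hours/week")
```

Because the two rows match on everything else, the predicted gap is exactly the estimated Premium coefficient, which is the "holding other variables constant" idea in action.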
23.13 James's Application: Controlling for Criminal History
This is the most important application in the chapter, and it connects directly to the threshold concept.
Professor James Washington has been studying racial disparities in the criminal justice system. In Chapter 22, he found a correlation between a defendant's race and their algorithm-assigned risk score. But critics argue: "Of course risk scores differ by race — they're based on criminal history, and criminal history differs by race. Once you control for criminal history, the race effect disappears."
Multiple regression lets James test this claim directly.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
np.random.seed(2026)
n = 500
# James's defendant data
defendants = pd.DataFrame({
'race': np.random.choice(['White', 'Black'], n, p=[0.55, 0.45]),
'prior_convictions': np.random.poisson(1.5, n),
'age': np.random.normal(32, 8, n).clip(18, 65).astype(int)
})
# Create a prior conviction disparity (Black defendants have
# slightly more priors on average, reflecting systemic factors).
# This must happen BEFORE the scores are generated, so that
# criminal history carries part of the racial gap.
defendants.loc[defendants['race'] == 'Black', 'prior_convictions'] += \
    np.random.binomial(1, 0.3, (defendants['race'] == 'Black').sum())
# Risk score with known structure:
# - Criminal history matters (as it should)
# - Race has a SMALL but real direct effect (which it shouldn't)
defendants['risk_score'] = (
    2.0 +
    1.8 * defendants['prior_convictions'] +
    -0.05 * defendants['age'] +
    0.8 * (defendants['race'] == 'Black').astype(int) +
    np.random.normal(0, 1.2, n)
).clip(1, 10)
# Step 1: Simple regression — race only
simple = smf.ols(
'risk_score ~ C(race, Treatment(reference="White"))',
data=defendants
).fit()
print("=" * 60)
print("MODEL 1: Risk Score ~ Race (simple)")
print("=" * 60)
print(f"Race coefficient: {simple.params.iloc[1]:.3f}")
print(f"p-value: {simple.pvalues.iloc[1]:.4f}")
print(f"R² = {simple.rsquared:.4f}")
print()
# Step 2: Multiple regression — race + criminal history + age
multiple = smf.ols(
'risk_score ~ C(race, Treatment(reference="White")) + '
'prior_convictions + age',
data=defendants
).fit()
print("=" * 60)
print("MODEL 2: Risk Score ~ Race + Priors + Age (multiple)")
print("=" * 60)
print(multiple.summary().tables[1])
print(f"\nR² = {multiple.rsquared:.4f}")
print(f"Adj. R² = {multiple.rsquared_adj:.4f}")
What James Finds
| Model | Race coefficient (Black vs. White) | p-value | R² |
|---|---|---|---|
| Simple (race only) | 1.35 | < 0.001 | 0.062 |
| Multiple (race + priors + age) | 0.78 | < 0.001 | 0.614 |
The race effect shrank — from 1.35 to 0.78 — but it didn't disappear.
In the simple model, Black defendants scored 1.35 points higher than White defendants on average. This includes the direct effect of race plus the indirect effect through criminal history (since Black defendants in this data have slightly more prior convictions, partly reflecting systemic factors).
In the multiple regression, after controlling for prior convictions and age, Black defendants still score 0.78 points higher. This is the estimated direct effect of race on the risk score, holding criminal history and age constant.
James's Conclusion: "The critics were half right. Some of the racial disparity in risk scores is explained by differences in criminal history. But roughly 58% of the disparity (0.78/1.35) persists even after controlling for prior convictions and age. The algorithm appears to penalize Black defendants beyond what their criminal history alone would predict."
⚠️ Important Caveat
Multiple regression can control for measured confounders, but it cannot control for unmeasured ones. James's model controls for prior convictions and age, but there may be other factors — type of offense, employment status, family support — that differ by race and also affect risk scores. This is why multiple regression provides partial control for confounders, not complete control. Only randomized experiments (where race can't be randomized) achieve that. This is a fundamental limitation that every researcher must acknowledge.
23.14 Excel: Multiple Regression with the Data Analysis ToolPak
Excel can perform multiple regression using the Data Analysis ToolPak (the same tool you used for ANOVA in Chapter 20 and simple regression in Chapter 22).
Step-by-Step
1. Enter your data in columns. Each column is a variable, each row is an observation.
2. Create dummy variables manually. If you have a categorical variable with 3 categories, create 2 columns of 0s and 1s. (Excel doesn't do this automatically.)
3. Go to Data → Data Analysis → Regression.
4. Set the Input Y Range to your response variable column.
5. Set the Input X Range to your predictor variable columns (select all of them at once).
6. Check "Labels" if your first row contains column names.
7. Check "Residuals" and "Residual Plots" to get diagnostic information.
8. Click OK.
Reading the Excel Output
Excel's regression output has three sections:
Regression Statistics:

| Metric | Value |
|---|---|
| Multiple R | $\sqrt{R^2}$ (the multiple correlation coefficient) |
| R Square | $R^2$ |
| Adjusted R Square | Adjusted $R^2$ |
| Standard Error | $\sqrt{MS_{\text{Residual}}}$ (the standard deviation of residuals) |
| Observations | $n$ |
ANOVA Table: This is the F-test for overall model significance. Look at the "Significance F" value — if it's below 0.05, at least one predictor is significant.
Coefficients Table: Each row shows one predictor's coefficient, standard error, t-statistic, p-value, and 95% CI. Interpret exactly as in the Python output.
Excel Limitation: Excel doesn't calculate VIF or create QQ-plots automatically. For VIF, you'd need to run separate regressions of each predictor on the others. For QQ-plots, you'd need to sort the residuals and compare them to theoretical normal quantiles manually — or use a statistical add-in. This is one area where Python has a clear advantage.
23.15 The Connection to AI and Machine Learning
Here's where Theme 3 — AI models are regression on steroids — comes into focus.
Everything you've learned in this chapter is the foundation of machine learning:
| Concept in This Chapter | ML Equivalent |
|---|---|
| Multiple regression | Linear models (the simplest ML algorithm) |
| $k$ predictors | "Features" (ML datasets often have hundreds or thousands) |
| $R^2$ and adjusted $R^2$ | Cross-validation score |
| Residual diagnostics | Model evaluation metrics |
| Multicollinearity | Feature selection / dimensionality reduction |
| Dummy variables | One-hot encoding |
| Interaction terms | Feature engineering / polynomial features |
| Forward/backward selection | Automated feature selection |
| Overfitting (too many predictors) | The central problem of ML |
The difference between a multiple regression model with 5 predictors and a neural network with 50,000 parameters is one of scale and flexibility, not of kind. The neural network can capture complex nonlinear relationships that regression can't, but it's still doing the same fundamental thing: learning the relationship between input variables and an output variable from data.
And here's the crucial part: the neural network has all the same problems, amplified:
- Multicollinearity → feature redundancy
- Overfitting → even more severe with millions of parameters
- Interpretation → far harder; "black box" models
- Confounding → still present; ML doesn't solve causation
💡 Key Insight
Understanding multiple regression doesn't just prepare you for the next statistics chapter — it prepares you for the entire field of data science. If you understand why coefficients change when you add variables, why $R^2$ always increases with more predictors, why VIF matters, and why controlling for confounders isn't the same as establishing causation, you have the conceptual foundation for understanding any predictive model, no matter how complex.
23.16 Mathematical Formulations
For students who want to see the mathematics behind the computation, here are the key formulations.
The Normal Equations
In matrix notation, the multiple regression model is:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
where $\mathbf{y}$ is the $n \times 1$ vector of responses, $\mathbf{X}$ is the $n \times (k+1)$ design matrix (with a column of 1s for the intercept), $\boldsymbol{\beta}$ is the $(k+1) \times 1$ parameter vector, and $\boldsymbol{\varepsilon}$ is the error vector.
The least squares estimates satisfy:
$$\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
You don't need to compute this by hand — Python and Excel handle it internally — but knowing this formula explains why multicollinearity is a problem: when predictors are highly correlated, $\mathbf{X}^T\mathbf{X}$ is nearly singular (hard to invert), making the estimates unstable.
The F-Statistic
$$F = \frac{(SS_{\text{Total}} - SS_{\text{Residual}}) / k}{SS_{\text{Residual}} / (n - k - 1)} = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)}$$
Under $H_0$ (all $\beta_i = 0$), $F \sim F(k, n-k-1)$.
The t-Statistic for Individual Predictors
$$t_i = \frac{b_i}{SE(b_i)} = \frac{b_i}{\sqrt{s^2 \cdot (\mathbf{X}^T\mathbf{X})^{-1}_{ii}}}$$
where $s^2 = SS_{\text{Res}} / (n - k - 1)$ is the mean squared error.
Under $H_0$ ($\beta_i = 0$), $t_i \sim t(n - k - 1)$.
Adjusted $R^2$
$$R^2_{\text{adj}} = 1 - \frac{SS_{\text{Res}} / (n - k - 1)}{SS_{\text{Total}} / (n - 1)} = 1 - \left(\frac{n - 1}{n - k - 1}\right)(1 - R^2)$$
VIF
$$\text{VIF}_i = \frac{1}{1 - R_i^2}$$
where $R_i^2$ is from regressing $x_i$ on all other predictors.
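The F and adjusted $R^2$ identities above can be checked numerically against a fitted model (simulated data, hypothetical names):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n, k = 80, 3
df = pd.DataFrame(rng.normal(size=(n, k)), columns=['x1', 'x2', 'x3'])
df['y'] = 1 + df['x1'] + 0.5 * df['x2'] + rng.normal(0, 1, n)

m = smf.ols('y ~ x1 + x2 + x3', data=df).fit()
r2 = m.rsquared

# F-statistic computed from R^2 alone
f_from_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))
# Adjusted R^2 computed from R^2 alone
adj_from_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(np.isclose(f_from_r2, m.fvalue),
      np.isclose(adj_from_r2, m.rsquared_adj))
```

Both quantities agree with statsmodels' `fvalue` and `rsquared_adj` attributes, confirming they are algebraic rearrangements of the same sums of squares.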
23.17 Common Mistakes in Multiple Regression
| Mistake | Why It's Wrong | What to Do Instead |
|---|---|---|
| "The coefficient for poverty is 5.81, which is bigger than 1.12 for AQI, so poverty matters more" | Coefficients depend on the units of measurement; comparing them directly is meaningless | Use standardized coefficients (beta weights) or compare p-values for significance; compare the practical impact (e.g., "a 10-point increase in AQI is realistic, but a 10-percentage-point increase in poverty is huge") |
| Interpreting coefficients without "holding other variables constant" | Omitting the conditional phrase changes the meaning entirely | Always include it in every interpretation |
| Adding every available variable to maximize $R^2$ | $R^2$ always increases with more variables, even irrelevant ones; this overfits the model | Use adjusted $R^2$; use substantive reasoning for variable selection; follow the 10-15 observations per predictor rule |
| Ignoring multicollinearity | High VIF inflates standard errors, making coefficients unreliable and interpretation misleading | Calculate VIF; consider removing or combining highly correlated predictors |
| Using $k$ dummy variables for $k$ categories | Creates perfect multicollinearity; the model won't run or will give meaningless results | Use $k - 1$ dummy variables; choose a meaningful reference category |
| Claiming causation from multiple regression | Even controlling for confounders doesn't establish causation in observational data | State "associated with" rather than "causes"; acknowledge unmeasured confounders |
| Not checking residual plots | The model might fit well by $R^2$ but violate assumptions badly | Always create residuals vs. predicted, QQ-plot, and residuals vs. each predictor |
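The first row of the table mentions standardized coefficients (beta weights). A sketch of computing them by z-scoring every variable before fitting, on simulated data where two predictors have equal real influence but very different units:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 100
df = pd.DataFrame({
    'x_small_units': rng.normal(0, 1, n),    # e.g., a rate measured in proportions
    'x_big_units': rng.normal(0, 100, n),    # e.g., a count in the hundreds
})
df['y'] = (2.0 * df['x_small_units'] + 0.02 * df['x_big_units']
           + rng.normal(0, 1, n))

# Raw coefficients: 2 vs. 0.02 — misleadingly different because of units
raw = smf.ols('y ~ x_small_units + x_big_units', data=df).fit().params

# Standardize everything to mean 0, sd 1, then refit: now comparable
z = (df - df.mean()) / df.std()
beta = smf.ols('y ~ x_small_units + x_big_units', data=z).fit().params
print(beta.drop('Intercept').round(2))
```

After standardizing, the two beta weights come out nearly equal, which matches the fact that both predictors contribute the same amount of variability to $y$ by construction.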
23.18 Putting It All Together: A Complete Analysis Checklist
When you perform a multiple regression analysis, follow these steps:
Pre-Analysis
- [ ] Define your research question. What outcome are you predicting? Why?
- [ ] Choose predictors based on theory. What variables do you expect to be related to the outcome?
- [ ] Check sample size. Do you have at least 10-15 observations per predictor?
Exploratory Analysis
- [ ] Create scatterplots of $y$ vs. each $x$.
- [ ] Create a correlation matrix to check for multicollinearity among predictors.
- [ ] Check for outliers using box plots or scatterplots.
Model Fitting
- [ ] Fit the model using `smf.ols()` in Python or the Data Analysis ToolPak in Excel.
- [ ] Check the overall F-test. Is the model as a whole significant?
- [ ] Check individual t-tests. Which specific predictors are significant?
- [ ] Report $R^2$ and adjusted $R^2$.
Diagnostics
- [ ] Residuals vs. predicted values: Look for patterns (linearity, equal variance).
- [ ] QQ-plot of residuals: Check normality.
- [ ] Calculate VIF: Check for multicollinearity.
- [ ] Check for influential observations (if appropriate).
Interpretation
- [ ] Interpret each coefficient using the "holding other variables constant" template.
- [ ] Discuss causation carefully. Is this observational or experimental data?
- [ ] Acknowledge limitations: unmeasured confounders, sample size, generalizability.
# Complete analysis template
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
from scipy import stats
# 1. Load and explore data
df = pd.read_csv('your_data.csv')
print(df.describe())
print(df.corr())
# 2. Fit the model
model = smf.ols('y ~ x1 + x2 + x3', data=df).fit()
print(model.summary())
# 3. Check VIF (include a constant column so VIFs match the usual definition)
import statsmodels.api as sm
X = sm.add_constant(df[['x1', 'x2', 'x3']])
for i, col in enumerate(X.columns[1:], start=1):   # skip the constant
    vif = variance_inflation_factor(X.values, i)
    print(f"VIF for {col}: {vif:.2f}")
# 4. Residual diagnostics
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Residuals vs. predicted
axes[0].scatter(model.fittedvalues, model.resid, alpha=0.6)
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs. Predicted')
# QQ-plot
stats.probplot(model.resid, plot=axes[1])
axes[1].set_title('Normal QQ-Plot')
# Histogram
axes[2].hist(model.resid, bins=15, edgecolor='navy', alpha=0.7)
axes[2].set_xlabel('Residual')
axes[2].set_title('Distribution of Residuals')
plt.tight_layout()
plt.show()
23.19 Progressive Project: Build a Multiple Regression Model
It's time to extend your Data Detective Portfolio to multiple regression.
Your Task
1. Choose a response variable and at least 2-3 predictor variables from your dataset. At least one predictor should be numerical and at least one should be categorical (if available).
2. Exploratory analysis:
   - Create scatterplots of $y$ vs. each numerical predictor.
   - Compute a correlation matrix among numerical predictors.
   - Create box plots of $y$ by categories of your categorical predictor.
3. Fit a simple regression model with your strongest predictor. Record the slope and $R^2$.
4. Fit a multiple regression model with all your predictors. Use `statsmodels` with the formula API.
5. Compare the simple and multiple models:
   - How did the coefficient for your first predictor change?
   - Did it shrink, stay the same, or grow? What does this tell you about confounding?
   - Compare $R^2$ and adjusted $R^2$.
6. Check multicollinearity: Calculate VIF for each numerical predictor. Are any values above 5?
7. Create residual diagnostic plots: Residuals vs. predicted, QQ-plot, and histogram of residuals. Do the residuals look approximately normal with constant variance?
8. Interpret your results: Write a paragraph interpreting each coefficient using the "holding other variables constant" template. Discuss what your model explains and what it doesn't.
Starter Code
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Load your dataset
df = pd.read_csv('your_dataset.csv')
# --- Step 1: Exploratory Analysis ---
# Correlation matrix for numerical variables
numerical_vars = ['your_response', 'predictor1', 'predictor2']
corr_matrix = df[numerical_vars].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
fmt='.3f', square=True)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()
# Scatterplot matrix
sns.pairplot(df[numerical_vars])
plt.suptitle('Scatterplot Matrix', y=1.02)
plt.show()
# --- Step 2: Simple vs. Multiple Regression ---
# Simple model
simple = smf.ols('your_response ~ predictor1', data=df).fit()
print("SIMPLE MODEL")
print(f"Slope: {simple.params['predictor1']:.4f}")
print(f"R²: {simple.rsquared:.4f}")
# Multiple model
multiple = smf.ols(
'your_response ~ predictor1 + predictor2 + C(categorical_var)',
data=df
).fit()
print("\nMULTIPLE MODEL")
print(multiple.summary())
# --- Step 3: VIF (include a constant so VIFs match the standard definition) ---
import statsmodels.api as sm
X = sm.add_constant(df[['predictor1', 'predictor2']].dropna())
for i, col in enumerate(X.columns[1:], start=1):   # skip the constant column
    vif = variance_inflation_factor(X.values, i)
    print(f"VIF for {col}: {vif:.2f}")
# --- Step 4: Residual Diagnostics ---
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].scatter(multiple.fittedvalues, multiple.resid, alpha=0.6)
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_xlabel('Predicted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs. Predicted')
stats.probplot(multiple.resid, plot=axes[1])
axes[1].set_title('Normal QQ-Plot')
axes[2].hist(multiple.resid, bins=15, edgecolor='navy', alpha=0.7)
axes[2].set_xlabel('Residuals')
axes[2].set_title('Histogram of Residuals')
plt.tight_layout()
plt.show()
# --- Step 5: Interpretation (write in markdown cell) ---
# For each coefficient, write:
# "For each one-unit increase in [predictor], the predicted
# [response] changes by [b] units, holding all other
# variables constant."
Add this analysis to your Jupyter notebook under a new heading: "Multiple Regression: Controlling for Confounders."
23.20 Chapter Summary
What We Learned
This chapter extended simple regression to the real world, where outcomes depend on many variables simultaneously.
- Why multiple regression? Simpson's Paradox demonstrated that analyzing one variable at a time can give the wrong answer. Multiple regression lets you examine the effect of each variable while controlling for the others.
- The multiple regression equation is $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$. Same structure as simple regression, more terms.
- The threshold concept: "holding other variables constant." Each partial regression coefficient $b_i$ estimates the change in $y$ for a one-unit increase in $x_i$, holding all other predictors fixed. This is how regression partially controls for confounders.
- Coefficients change from simple to multiple regression — typically shrinking — because the multiple regression isolates each variable's unique effect rather than its total association. This is the whole point.
- Adjusted $R^2$ penalizes complexity. Unlike $R^2$, which always increases when you add variables, adjusted $R^2$ only increases if the new variable genuinely improves the model.
- The F-test asks "is at least one predictor useful?" while individual t-tests ask "is this specific predictor useful, given the others?"
- Multicollinearity — correlation among predictors — inflates standard errors and makes individual coefficients hard to interpret. Detect it with VIF; address it by removing or combining predictors.
- Indicator (dummy) variables let you include categorical predictors. Use $k - 1$ indicators for $k$ categories.
- Residual diagnostics (linearity, independence, normality, equal variance) are even more important with multiple predictors.
- Interaction terms capture situations where the effect of one variable depends on the level of another.
- Model building should be theory-driven first, data-driven second. Use at least 10-15 observations per predictor.
- Multiple regression is the foundation of machine learning. Neural networks, random forests, and gradient boosting are all extensions of the same basic idea: learn the relationship between inputs and outputs from data.
Key Formulas
| Formula | Description |
|---|---|
| $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$ | Multiple regression equation |
| $R^2_{\text{adj}} = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$ | Adjusted R-squared |
| $F = \frac{R^2/k}{(1-R^2)/(n-k-1)}$ | F-statistic for overall model |
| $t_i = \frac{b_i}{SE(b_i)}$ | t-statistic for individual predictor |
| $\text{VIF}_i = \frac{1}{1 - R_i^2}$ | Variance Inflation Factor |
| $\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ | Least squares solution (matrix form) |
Connections to What's Next
In Chapter 24, you'll learn what happens when the response variable is categorical — yes/no, pass/fail, survived/died. Linear regression can't handle binary outcomes properly (it can predict probabilities below 0 or above 1). Logistic regression solves this problem by modeling the log-odds of success, using a function that always produces values between 0 and 1. The threshold concept shifts from "holding other variables constant" to "thinking in odds."
What's Next: Chapter 24 tackles logistic regression — the tool for predicting categorical outcomes. When the question is "will this customer churn?" or "will this defendant reoffend?" or "will this patient develop the disease?", you need a model that predicts probabilities, not quantities. Logistic regression provides exactly that, building directly on the multiple regression framework you've just mastered.
"The best models of the world are not the ones that capture every detail, but the ones that capture the right details." — Adapted from George E. P. Box