Learning Objectives
- Calculate and interpret the Pearson correlation coefficient
- Distinguish between correlation and causation with specific examples
- Fit a simple linear regression model and interpret slope and intercept
- Assess model fit using r-squared and residual plots
- Make predictions and understand extrapolation risk
In This Chapter
- Chapter Overview
- 22.1 A Puzzle Before We Start (Productive Struggle)
- 22.2 Scatterplots: The Starting Point
- 22.3 The Pearson Correlation Coefficient
- 22.4 Worked Example: Calculating $r$ by Hand
- 22.5 Correlation Does Not Imply Causation
- 22.6 Simple Linear Regression: The Model
- 22.7 Worked Example: Fitting a Regression Line
- 22.8 Assessing Model Fit: $R^2$ and Residual Plots
- 22.9 Regression to the Mean
- 22.10 The Danger of Extrapolation
- 22.11 Regression in Python: The Full Toolkit
- 22.12 Maya's Analysis: Poverty Rate and ER Visits
- 22.13 Alex's Analysis: User Engagement and Content Diversity
- 22.14 Sam's Analysis: Practice Hours and Shooting Percentage
- 22.15 James's Analysis: Algorithm Risk Score and Actual Recidivism
- 22.16 Conditions for Regression Inference
- 22.17 Spaced Review: What 95% Confidence Really Means (from Ch.12)
- 22.18 Spaced Review: Statistical vs. Practical Significance (from Ch.17)
- 22.19 Progressive Project: Scatterplots and Regression in Your Dataset
- 22.20 Chapter Summary
Chapter 22: Correlation and Simple Linear Regression
"All models are wrong, but some are useful." — George E. P. Box
Chapter Overview
Everything in this course has been building to this chapter.
Not literally everything, of course. Confidence intervals are important. Hypothesis testing is essential. ANOVA is powerful. But if you ask working data scientists, business analysts, epidemiologists, economists, or sports statisticians what technique they use most often in their daily work, the overwhelming answer is regression.
Regression is the Swiss Army knife of statistics. It lets you answer questions like:
- How does study time relate to exam scores? (And by how much?)
- Does advertising spending predict sales? (And can we put a number on it?)
- Is there a relationship between a city's poverty rate and its ER visit rate? (And is something else driving both?)
- Can we predict a basketball player's shooting percentage from practice hours? (And should we trust that prediction?)
These aren't just abstract questions. Maya Chen needs to understand why some communities have higher ER visit rates. Alex Rivera wants to predict user engagement from content diversity scores. Sam Okafor is trying to connect practice hours to on-court performance. Professor James Washington is examining whether an algorithm's risk scores actually predict real-world outcomes.
But before we can predict with regression, we need to describe relationships. And that starts with two tools you've been waiting to use together: scatterplots and correlation.
This chapter is also the flagship chapter for Theme 5: correlation vs. causation. You've encountered this idea in Chapter 1, revisited it in Chapter 4 (confounding variables), and seen it surface in Chapter 16 (observational vs. experimental comparisons). Now we give it the full treatment. By the end of this chapter, you'll be able to spot the difference between a genuine causal relationship and a spurious correlation — and more importantly, you'll understand why the distinction matters.
In this chapter, you will learn to:

- Calculate and interpret the Pearson correlation coefficient
- Distinguish between correlation and causation with specific examples
- Fit a simple linear regression model and interpret slope and intercept
- Assess model fit using r-squared and residual plots
- Make predictions and understand extrapolation risk
Fast Track: If you've encountered correlation and regression before, skim Sections 22.1–22.4, then jump to Section 22.9 (regression to the mean). Complete quiz questions 1, 10, and 18 to verify.
Deep Dive: After this chapter, read Case Study 1 (Maya's poverty-rate and ER-visit analysis with lurking variable investigation) for a complete worked application, then Case Study 2 (James's algorithm risk scores vs. actual recidivism) for a critical evaluation of predictive models. Both include full Python code.
22.1 A Puzzle Before We Start (Productive Struggle)
Before we define anything, look at these four datasets.
Anscombe's Quartet
Each of the four datasets below has the same mean of $x$, the same mean of $y$, the same standard deviation of $x$, the same standard deviation of $y$, the same correlation coefficient ($r \approx 0.816$), the same regression line ($\hat{y} = 3.00 + 0.50x$), and the same $R^2$ ($\approx 0.67$).
| Dataset | $x$ values | $y$ values |
|---|---|---|
| I | 10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5 | 8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68 |
| II | 10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5 | 9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74 |
| III | 10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5 | 7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73 |
| IV | 8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8 | 6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89 |

(a) Without plotting, what would you conclude about these four datasets from the summary statistics alone?
(b) Now imagine plotting each dataset as a scatterplot. Dataset I shows a clear linear pattern. What might the other three look like?
(c) What does this tell you about relying solely on numerical summaries?
Take 3 minutes. This puzzle contains the single most important lesson in all of regression.
Here's the answer, and it's one of the most famous results in all of statistics.
When you plot these four datasets, they look completely different:
- Dataset I: A nice linear relationship with random scatter. This is what you hope your data looks like.
- Dataset II: A clear curved (quadratic) relationship. A straight line is the wrong model entirely.
- Dataset III: A perfect linear relationship except for one outlier that pulls the regression line.
- Dataset IV: One extreme point creates the entire appearance of a relationship. Remove it, and there's nothing.
All four datasets produce identical summary statistics — same $r$, same regression line, same $R^2$. But they describe fundamentally different phenomena.
This is Anscombe's Quartet, published by statistician Francis Anscombe in 1973 specifically to demonstrate a point that cannot be overstated:
Always plot your data first.
No amount of correlation coefficients, regression equations, or $R^2$ values can substitute for looking at the data. Numerical summaries can deceive you. Scatterplots cannot — or at least, they're much harder to misread.
```python
import numpy as np
import matplotlib.pyplot as plt

# Anscombe's Quartet
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
x2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
datasets = [(x1, y1), (x2, y2), (x3, y3), (x4, y4)]

for i, (ax, (x, y)) in enumerate(zip(axes.flat, datasets)):
    ax.scatter(x, y, color='steelblue', edgecolors='navy', s=60)
    # Fit and plot regression line
    m, b = np.polyfit(x, y, 1)
    x_line = np.linspace(2, 20, 100)
    ax.plot(x_line, m * x_line + b, 'r--', linewidth=1.5)
    ax.set_title(f'Dataset {["I", "II", "III", "IV"][i]}',
                 fontsize=12, fontweight='bold')
    ax.set_xlim(2, 20)
    ax.set_ylim(2, 14)
    ax.set_xlabel('x')
    ax.set_ylabel('y')

plt.suptitle("Anscombe's Quartet: Same Statistics, Different Stories",
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Verify identical statistics
for i, (x, y) in enumerate(datasets):
    x, y = np.array(x), np.array(y)
    print(f"Dataset {['I','II','III','IV'][i]}: "
          f"mean_x={np.mean(x):.1f}, mean_y={np.mean(y):.2f}, "
          f"r={np.corrcoef(x, y)[0,1]:.3f}")
```
You were introduced to scatterplots in Chapter 5. Now it's time to quantify what they show.
22.2 Scatterplots: The Starting Point
🔄 Spaced Review 1 (from Ch.5): Scatterplots
In Chapter 5, you learned that scatterplots display the relationship between two numerical variables. Each point represents one observation, with the explanatory variable ($x$) on the horizontal axis and the response variable ($y$) on the vertical axis. You learned to describe scatterplot patterns using three features:
- Direction: positive (uphill), negative (downhill), or no direction
- Form: linear, curved, or no pattern
- Strength: how tightly the points cluster around the form
Back then, you described these patterns in words. Now we're going to put numbers on them. The number that captures direction and strength for linear relationships is the correlation coefficient.
When you create a scatterplot, you're answering the most fundamental question in data analysis: do these two variables move together?
Here's the rule that should be tattooed on every data analyst's forearm:
Before computing any correlation or fitting any regression line, ALWAYS make a scatterplot.
Always. Every single time. No exceptions.
Why? Because a scatterplot tells you three things no number can:
1. Is the relationship actually linear? Correlation and regression assume linearity. If the relationship is curved, your numbers will be misleading (remember Dataset II from Anscombe's Quartet).
2. Are there outliers? A single extreme point can dramatically change a correlation coefficient or tilt a regression line (Datasets III and IV).
3. Are there clusters or subgroups? Sometimes what looks like one relationship is actually two or more separate relationships combined.
Creating Scatterplots in Python
```python
import matplotlib.pyplot as plt

# Example: study hours vs. exam scores
study_hours = [2, 3, 5, 1, 6, 4, 7, 3, 5, 8, 2, 4, 6, 1, 9]
exam_scores = [65, 70, 82, 55, 88, 75, 92, 68, 80, 95, 60, 78, 85, 50, 98]

plt.figure(figsize=(8, 6))
plt.scatter(study_hours, exam_scores, color='steelblue',
            edgecolors='navy', s=80, alpha=0.7)
plt.xlabel('Study Hours per Week', fontsize=12)
plt.ylabel('Exam Score', fontsize=12)
plt.title('Study Hours vs. Exam Scores', fontsize=14)
plt.grid(True, alpha=0.3)
plt.show()
```
Creating Scatterplots in Excel
- Select both columns of data (explanatory and response variables)
- Insert tab → Charts → Scatter (X, Y)
- Choose Scatter with Only Markers (the first option)
- Right-click chart → Add Chart Element → Axis Titles
- Label your axes clearly
Key Insight: Notice that I said explanatory variable on the x-axis and response variable on the y-axis. In regression, this distinction matters. The explanatory variable (also called the independent variable or predictor) is the one you think might influence the other. The response variable (also called the dependent variable or outcome) is the one you're trying to explain or predict. Study hours explain exam scores — not the other way around.
22.3 The Pearson Correlation Coefficient
Now let's quantify what a scatterplot shows. The Pearson correlation coefficient, denoted $r$, is a single number that measures the strength and direction of the linear relationship between two numerical variables.
The Formula
Here's the formula:
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
🔄 Spaced Review 2 (from Ch.6): Standard Deviation
In Chapter 6, you learned that the standard deviation measures the "typical distance" of observations from the mean:
$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$$
Look at the correlation formula above. The terms $\frac{x_i - \bar{x}}{s_x}$ and $\frac{y_i - \bar{y}}{s_y}$ are just z-scores — each observation's distance from the mean, measured in standard deviations. So the correlation coefficient is the average product of paired z-scores.
That's all it is. The standard deviation from Chapter 6 is doing the heavy lifting inside the correlation formula.
Let me break this formula apart in plain English.
Step 1: For each observation, convert both $x$ and $y$ to z-scores. This standardizes both variables to the same scale — now a value of +1 means "one standard deviation above the mean" regardless of whether you're measuring inches, dollars, or exam points.
Step 2: Multiply each pair of z-scores together.

- If both z-scores are positive (above average on both variables), the product is positive.
- If both are negative (below average on both), the product is also positive.
- If one is positive and the other negative (above average on one, below on the other), the product is negative.
Step 3: Average those products. If points consistently fall in the "both above average" or "both below average" quadrants, the average is positive — a positive correlation. If they consistently fall in the "one above, one below" quadrants, the average is negative — a negative correlation.
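These three steps translate directly into a few lines of NumPy. Here is a minimal check, reusing the 15-student study-hours data from Section 22.2, that averaging the paired z-score products reproduces what `np.corrcoef` computes:

```python
import numpy as np

# Study-hours data from Section 22.2
x = np.array([2, 3, 5, 1, 6, 4, 7, 3, 5, 8, 2, 4, 6, 1, 9], dtype=float)
y = np.array([65, 70, 82, 55, 88, 75, 92, 68, 80, 95, 60, 78, 85, 50, 98], dtype=float)
n = len(x)

# Step 1: convert both variables to z-scores (sample SD, ddof=1)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# Steps 2-3: multiply paired z-scores, then average with the n - 1 divisor
r = np.sum(zx * zy) / (n - 1)

print(f"r from z-score products: {r:.4f}")
print(f"r from np.corrcoef:      {np.corrcoef(x, y)[0, 1]:.4f}")
```

The two printed values agree to every displayed digit, because they are the same formula computed two ways.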
An Equivalent Formula
You'll sometimes see the correlation formula written as:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
This is mathematically equivalent. The numerator is the sum of cross-products of deviations (how $x$ and $y$ vary together), and the denominator scales it by the total variability in both variables.
Properties of $r$
The Pearson correlation coefficient has several important properties:
| Property | Details |
|---|---|
| Range | $-1 \leq r \leq +1$ |
| Sign | Positive $r$ = positive association (as $x$ increases, $y$ tends to increase). Negative $r$ = negative association (as $x$ increases, $y$ tends to decrease). |
| Magnitude | $\|r\|$ close to 1 = strong linear relationship. $\|r\|$ close to 0 = weak or no linear relationship. |
| Perfect correlation | $r = +1$: all points fall exactly on an upward-sloping line. $r = -1$: all points fall exactly on a downward-sloping line. |
| No units | $r$ has no units. It doesn't matter whether $x$ is in pounds and $y$ is in inches, or $x$ is in meters and $y$ is in dollars. |
| Symmetric | The correlation between $x$ and $y$ equals the correlation between $y$ and $x$. |
| Only measures LINEAR relationships | A perfect U-shaped curve can have $r = 0$. |
Interpreting the Strength of $r$
Here's a rough guide — but remember, context matters enormously:
| $\|r\|$ | Interpretation |
|---|---|
| 0.00 – 0.19 | Very weak |
| 0.20 – 0.39 | Weak |
| 0.40 – 0.59 | Moderate |
| 0.60 – 0.79 | Strong |
| 0.80 – 1.00 | Very strong |
Common Pitfall: These benchmarks are guidelines, not laws. In physics, a correlation of 0.90 might be disappointing (you expected near-perfect). In psychology, a correlation of 0.40 might be exciting (human behavior is messy). In economics, 0.60 could be groundbreaking. Always interpret $r$ in the context of your field.
Computing $r$ in Python
```python
import numpy as np
from scipy import stats

study_hours = [2, 3, 5, 1, 6, 4, 7, 3, 5, 8, 2, 4, 6, 1, 9]
exam_scores = [65, 70, 82, 55, 88, 75, 92, 68, 80, 95, 60, 78, 85, 50, 98]

# Method 1: numpy (correlation matrix)
r_matrix = np.corrcoef(study_hours, exam_scores)
print(f"Correlation (numpy): r = {r_matrix[0, 1]:.4f}")

# Method 2: scipy (with p-value for the test H0: rho = 0)
r, p_value = stats.pearsonr(study_hours, exam_scores)
print(f"Correlation (scipy): r = {r:.4f}")
print(f"p-value: {p_value:.6f}")
```
`np.corrcoef()` returns the full 2×2 correlation matrix; the off-diagonal elements contain $r$. `scipy.stats.pearsonr()` returns both $r$ and the p-value for the hypothesis test $H_0: \rho = 0$ (where $\rho$ is the population correlation).
Computing $r$ in Excel
Use the CORREL function:
=CORREL(A2:A16, B2:B16)
Where column A contains the $x$ values and column B contains the $y$ values.
22.4 Worked Example: Calculating $r$ by Hand
Let's calculate the correlation between study hours and exam scores for a small dataset — five students — so you can see exactly what happens inside the formula.
| Student | Hours ($x$) | Score ($y$) |
|---|---|---|
| 1 | 2 | 65 |
| 2 | 4 | 75 |
| 3 | 6 | 80 |
| 4 | 3 | 70 |
| 5 | 5 | 85 |
Step 1: Calculate means.
$$\bar{x} = \frac{2 + 4 + 6 + 3 + 5}{5} = \frac{20}{5} = 4.0$$
$$\bar{y} = \frac{65 + 75 + 80 + 70 + 85}{5} = \frac{375}{5} = 75.0$$
Step 2: Calculate deviations, products of deviations, and squared deviations.
| Student | $x_i - \bar{x}$ | $y_i - \bar{y}$ | $(x_i - \bar{x})(y_i - \bar{y})$ | $(x_i - \bar{x})^2$ | $(y_i - \bar{y})^2$ |
|---|---|---|---|---|---|
| 1 | $-2$ | $-10$ | $20$ | $4$ | $100$ |
| 2 | $0$ | $0$ | $0$ | $0$ | $0$ |
| 3 | $2$ | $5$ | $10$ | $4$ | $25$ |
| 4 | $-1$ | $-5$ | $5$ | $1$ | $25$ |
| 5 | $1$ | $10$ | $10$ | $1$ | $100$ |
| Sum | $0$ | $0$ | $45$ | $10$ | $250$ |
Step 3: Plug into the formula.
$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \cdot \sum(y_i - \bar{y})^2}} = \frac{45}{\sqrt{10 \times 250}} = \frac{45}{\sqrt{2500}} = \frac{45}{50} = 0.90$$
Interpretation: There's a very strong positive linear relationship between study hours and exam scores. Students who study more tend to score higher.
Let's verify:
```python
import numpy as np

hours = [2, 4, 6, 3, 5]
scores = [65, 75, 80, 70, 85]
r = np.corrcoef(hours, scores)[0, 1]
print(f"r = {r:.4f}")  # r = 0.9000
```
22.5 Correlation Does Not Imply Causation
This is it. The most important section in this entire chapter. Possibly the most important section in this entire textbook.
🔄 Spaced Review 3 (from Ch.4): Confounding Variables
In Chapter 4, you learned that a confounding variable is a variable associated with both the explanatory and response variables, distorting their apparent relationship. The classic example: ice cream sales and drowning deaths are positively correlated — but eating ice cream doesn't cause drowning. The lurking variable is temperature: hot weather increases both ice cream sales and swimming (which increases drowning risk).
In Chapter 16, you saw this principle in action: Alex's A/B test used random assignment to eliminate confounders, allowing a causal conclusion. Maya's community comparison was observational, so the difference in respiratory illness rates could only be described as an association.
Now we're going to give this idea its full treatment. Regression is a powerful tool for describing relationships — but describing a relationship is not the same as explaining why it exists.
Correlation tells you two things: the direction and strength of a linear association. It does NOT tell you that one variable causes the other.
Three Vivid Examples
Example 1: Ice Cream and Drowning
There's a strong positive correlation between ice cream sales and drowning deaths across months of the year. But banning ice cream would not prevent drowning. The lurking variable (also called a confounding variable) is temperature. Hot months increase both ice cream consumption and swimming.
```
        Temperature (lurking variable)
          ↗                      ↘
  Ice cream sales          Drowning deaths
          ↔ (correlated, but not causally connected)
```
Example 2: Shoe Size and Reading Ability
Among children in elementary school, there's a strong positive correlation between shoe size and reading ability. Do big feet make you a better reader? Of course not. The lurking variable is age. Older children have larger feet and better reading skills.
Example 3: Per Capita Cheese Consumption and Civil Engineering Doctorates
Between 2000 and 2009, per capita cheese consumption in the United States was almost perfectly correlated ($r = 0.958$) with the number of civil engineering doctorates awarded. This is a beautiful example of a spurious correlation — two variables that track each other over time purely by coincidence. Both happened to increase over the same period. The correlation is real (the numbers genuinely co-vary), but the relationship is meaningless.
The website "Spurious Correlations" (tylervigen.com) has collected hundreds of these examples. The divorce rate in Maine correlates with per capita consumption of margarine. U.S. spending on science correlates with suicides by hanging. Nicolas Cage film appearances correlate with drowning in swimming pools. All of these have $r > 0.90$. None of them are causal.
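You can manufacture a spurious correlation yourself. The sketch below uses simulated data (not the real cheese or doctorate figures): two series with independent noise that both happen to drift upward over the same decade correlate strongly, even though neither influences the other.

```python
import numpy as np

rng = np.random.default_rng(42)
years = np.arange(2000, 2010)

# Two simulated series with independent noise but a shared upward drift,
# standing in for any two unrelated quantities that both rose over a decade
series_a = 30 + 0.8 * (years - 2000) + rng.normal(0, 0.5, len(years))
series_b = 400 + 12 * (years - 2000) + rng.normal(0, 8, len(years))

r = np.corrcoef(series_a, series_b)[0, 1]
print(f"r = {r:.3f}")  # strongly positive, yet neither series causes the other
```

The shared time trend is the lurking variable here: remove the drift from each series (or correlate the year-to-year changes instead of the levels) and the correlation largely disappears.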
Why Does This Happen?
There are four main explanations for an observed correlation:
| Explanation | Description | Example |
|---|---|---|
| Direct causation | $x$ actually causes $y$ | Smoking causes lung cancer |
| Reverse causation | $y$ actually causes $x$ | Cities with more police have more crime — but crime causes cities to hire police, not the other way around |
| Common cause (confounding) | A lurking variable $z$ causes both $x$ and $y$ | Temperature drives both ice cream sales and drowning |
| Coincidence | No meaningful connection; the correlation is spurious | Cheese consumption and engineering doctorates |
The Lurking Variable
A lurking variable (or confounding variable) is a variable that is not included in the analysis but affects both the explanatory and response variables, creating the appearance of a direct relationship between them.
Key Insight: You cannot determine causation from correlation alone. To establish causation, you need either:
- A randomized controlled experiment (like Alex's A/B test in Chapters 4 and 16), or
- A very careful observational argument using multiple lines of evidence (temporal precedence, dose-response, biological plausibility, ruling out confounders — as in the landmark studies linking smoking to lung cancer)
Regression with observational data can describe associations. It can control for some confounders (as you'll see in Chapter 23). But it cannot, on its own, prove causation.
Theme 5 Deep Dive: Correlation vs. Causation as a Lifelong Skill
🎯 Theme 5: Correlation vs. Causation — The Flagship Chapter
This theme has run through the entire textbook. In Chapter 1, Maya's correlation between pollution levels and asthma rates raised the question: does pollution cause asthma, or could something else explain the pattern? In Chapter 4, you learned that only randomized experiments can establish causation. In Chapter 13, a statistically significant result told you the effect was real but not whether it was causal. In Chapter 16, the distinction between Alex's experimental A/B test and Maya's observational comparison determined what kind of conclusions each could draw.
Now, in this chapter, you can see exactly why the distinction matters. A correlation coefficient of $r = 0.85$ between poverty and ER visits is impressive — but it doesn't tell you whether reducing poverty would reduce ER visits. A lurking variable (like access to primary care, which is associated with both poverty and ER reliance) might be doing all the work.
The ability to distinguish correlation from causation is not just a statistical skill. It's a critical thinking skill that will serve you in every conversation about data, every news article about a scientific study, and every business decision based on analytics. This is why we call it a lifelong habit.
22.6 Simple Linear Regression: The Model
A scatterplot shows a relationship. A correlation coefficient quantifies its strength. But neither one lets you predict one variable from another. That's what regression does.
Simple linear regression fits a straight line through the data that best describes the linear relationship between $x$ and $y$. The equation of that line is:
$$\hat{y} = b_0 + b_1 x$$
Where:

- $\hat{y}$ (read "y-hat") is the predicted value of $y$ for a given $x$
- $b_0$ is the intercept — the predicted value of $y$ when $x = 0$
- $b_1$ is the slope — the predicted change in $y$ for each one-unit increase in $x$
- $x$ is the explanatory (predictor) variable
The hat on $\hat{y}$ is important. It reminds you that $\hat{y}$ is a prediction, not an actual observed value. The actual values rarely fall exactly on the line — there's always some scatter.
What "Best" Means: Least Squares
How do we find the "best" line? There are infinitely many lines we could draw through a scatterplot. We need a criterion for choosing the best one.
The criterion is called least squares: find the line that minimizes the sum of the squared vertical distances between the observed points and the line.
For each observation $i$, the residual is the difference between the observed value and the predicted value:
$$e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)$$
The residual is how much the line "misses" for that observation. Positive residuals mean the point is above the line; negative residuals mean it's below.
The least squares criterion finds $b_0$ and $b_1$ that minimize the sum of squared residuals:
$$\text{minimize} \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$
Why squared? For the same reason we square deviations in the variance formula (Chapter 6): squaring prevents positive and negative residuals from canceling out, and it penalizes large misses more than small ones.
The Least Squares Formulas
Using calculus (which we won't derive here, but you can find in any mathematical statistics textbook), the values of $b_0$ and $b_1$ that minimize the sum of squared residuals are:
$$b_1 = r \cdot \frac{s_y}{s_x}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$
Look at the slope formula. It's beautifully intuitive:
- The slope depends on the correlation ($r$): stronger correlations produce steeper slopes.
- It's scaled by the ratio of standard deviations ($s_y / s_x$): this adjusts for the fact that $x$ and $y$ may be measured in different units with different spreads.
And the intercept formula guarantees that the regression line passes through the point $(\bar{x}, \bar{y})$ — the center of the data. The line always goes through the "center of gravity" of the scatterplot.
An Alternative Formula for $b_1$
You'll also see the slope written as:
$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
This is the ratio of the "cross-products of deviations" (how $x$ and $y$ vary together) to the "sum of squared deviations of $x$" (the total variability in $x$). Compare this to the correlation formula — the numerator is the same, but the denominator only uses $x$'s variability.
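Both versions of the slope formula, and the center-of-gravity property of the intercept, can be checked numerically. This sketch reuses the 15-student study-hours data from Section 22.2:

```python
import numpy as np

# Study-hours data from Section 22.2
x = np.array([2, 3, 5, 1, 6, 4, 7, 3, 5, 8, 2, 4, 6, 1, 9], dtype=float)
y = np.array([65, 70, 82, 55, 88, 75, 92, 68, 80, 95, 60, 78, 85, 50, 98], dtype=float)

r = np.corrcoef(x, y)[0, 1]

# Slope, version 1: b1 = r * (s_y / s_x)
b1_v1 = r * y.std(ddof=1) / x.std(ddof=1)

# Slope, version 2: cross-products of deviations over squared x-deviations
b1_v2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Intercept: forces the line through (x-bar, y-bar)
b0 = y.mean() - b1_v1 * x.mean()

print(f"b1 (two versions): {b1_v1:.4f} vs {b1_v2:.4f}")
print(f"line at x-bar: {b0 + b1_v1 * x.mean():.4f}, y-bar: {y.mean():.4f}")
```

The two slope values agree exactly, and the fitted line evaluated at $\bar{x}$ returns $\bar{y}$, confirming that the regression line always passes through the center of the data.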
22.7 Worked Example: Fitting a Regression Line
Let's return to our five-student dataset and fit the regression line.
| Student | Hours ($x$) | Score ($y$) |
|---|---|---|
| 1 | 2 | 65 |
| 2 | 4 | 75 |
| 3 | 6 | 80 |
| 4 | 3 | 70 |
| 5 | 5 | 85 |
From Section 22.4, we already know:

- $\bar{x} = 4.0$, $\bar{y} = 75.0$
- $\sum(x_i - \bar{x})(y_i - \bar{y}) = 45$
- $\sum(x_i - \bar{x})^2 = 10$
- $r = 0.90$
Step 1: Calculate the slope.
$$b_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = \frac{45}{10} = 4.5$$
Step 2: Calculate the intercept.
$$b_0 = \bar{y} - b_1 \bar{x} = 75.0 - 4.5 \times 4.0 = 75.0 - 18.0 = 57.0$$
Step 3: Write the equation.
$$\hat{y} = 57.0 + 4.5x$$
Interpreting the Slope
For each additional hour of study per week, the predicted exam score increases by 4.5 points, on average.
This is the template for interpreting any regression slope: "For each one-unit increase in $x$, the predicted $y$ changes by $b_1$ units, on average."
The "on average" part is crucial. Not every student who studies one more hour will gain exactly 4.5 points. The slope describes the average trend, not an individual guarantee.
Interpreting the Intercept
A student who studies zero hours per week is predicted to score 57.0 on the exam.
Is this interpretation meaningful? It depends. In this case, studying zero hours is plausible (though unwise), so the intercept has a real-world interpretation. But be careful — in many situations, $x = 0$ is outside the range of the data or doesn't make practical sense.
Common Pitfall: Intercepts That Don't Make Sense
Suppose you're regressing adult height (in inches) on shoe size. The intercept might be 48 inches — the predicted height for someone with shoe size 0. But nobody has shoe size 0, so the intercept is just a mathematical anchor for the line with no practical meaning. Report it, but don't overinterpret it.
Making Predictions
How would a student who studies 5 hours score?
$$\hat{y} = 57.0 + 4.5(5) = 57.0 + 22.5 = 79.5$$
Predicted score: 79.5 points.
Student 5 actually studied 5 hours and scored 85. The residual is:
$$e_5 = y_5 - \hat{y}_5 = 85 - 79.5 = 5.5$$
This student scored 5.5 points above the predicted value. Maybe they're a more efficient studier, or maybe random variation is at play.
Calculating Residuals
| Student | $x$ | $y$ | $\hat{y} = 57 + 4.5x$ | $e = y - \hat{y}$ |
|---|---|---|---|---|
| 1 | 2 | 65 | 66.0 | $-1.0$ |
| 2 | 4 | 75 | 75.0 | $0.0$ |
| 3 | 6 | 80 | 84.0 | $-4.0$ |
| 4 | 3 | 70 | 70.5 | $-0.5$ |
| 5 | 5 | 85 | 79.5 | $5.5$ |
Notice that the residuals sum to zero: $-1.0 + 0.0 + (-4.0) + (-0.5) + 5.5 = 0.0$. This is always true for least squares regression — the positive and negative residuals balance exactly.
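The residual table above takes only a few lines to reproduce. This sketch hard-codes the fitted equation $\hat{y} = 57 + 4.5x$ from the worked example:

```python
import numpy as np

# Five-student data from the worked example
x = np.array([2, 4, 6, 3, 5], dtype=float)
y = np.array([65, 75, 80, 70, 85], dtype=float)

y_hat = 57.0 + 4.5 * x       # fitted values from y-hat = 57 + 4.5x
residuals = y - y_hat

print("residuals:", residuals)               # -1.0, 0.0, -4.0, -0.5, 5.5
print("sum of residuals:", residuals.sum())  # 0.0
```

The residuals match the table, and their sum is exactly zero, as least squares guarantees.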
22.8 Assessing Model Fit: $R^2$ and Residual Plots
How well does your regression line fit the data? Two tools answer this question: $R^2$ (a number) and residual plots (a picture).
$R^2$: The Coefficient of Determination
The coefficient of determination, denoted $R^2$ (or $r^2$ in simple linear regression), is the proportion of the total variability in $y$ that is explained by the linear relationship with $x$.
$$R^2 = r^2 = 1 - \frac{SS_{\text{Residual}}}{SS_{\text{Total}}}$$
Where:

- $SS_{\text{Total}} = \sum(y_i - \bar{y})^2$ — total variability in $y$
- $SS_{\text{Residual}} = \sum(y_i - \hat{y}_i)^2 = \sum e_i^2$ — variability left unexplained by the model
🎯 Threshold Concept Review: Decomposing Variability (from Ch.20)
In Chapter 20, you learned the threshold concept that total variability can be decomposed:
$$SS_{\text{Total}} = SS_{\text{Between}} + SS_{\text{Within}}$$
The regression version is exactly the same idea:
$$SS_{\text{Total}} = SS_{\text{Regression}} + SS_{\text{Residual}}$$
Total variability = Explained variability + Unexplained variability.
In ANOVA, the "explained" part was $SS_{\text{Between}}$ (variability due to group membership), and the effect size was $\eta^2 = SS_B / SS_T$. In regression, the "explained" part is $SS_{\text{Regression}}$ (variability due to the predictor $x$), and the effect size is $R^2 = SS_{\text{Reg}} / SS_T$. Same concept, same formula, different context. This is the payoff of the threshold concept from Chapter 20.
For our study hours example: $r = 0.90$, so $R^2 = 0.90^2 = 0.81$.
Interpretation: 81% of the variability in exam scores is explained by the linear relationship with study hours. The remaining 19% is due to other factors (talent, sleep, test anxiety, etc.).
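That 81% figure can be confirmed directly from the sums of squares, reusing the five-student data and the fitted line $\hat{y} = 57 + 4.5x$:

```python
import numpy as np

# Five-student data from the worked examples
x = np.array([2, 4, 6, 3, 5], dtype=float)
y = np.array([65, 75, 80, 70, 85], dtype=float)

y_hat = 57.0 + 4.5 * x
ss_total = np.sum((y - y.mean()) ** 2)      # total variability in y
ss_residual = np.sum((y - y_hat) ** 2)      # variability the line leaves unexplained

r_squared = 1 - ss_residual / ss_total
print(f"SS_Total = {ss_total:.1f}, SS_Residual = {ss_residual:.2f}")  # 250.0, 47.50
print(f"R^2 = {r_squared:.2f}")  # 0.81, matching r^2 = 0.90^2
```

The two routes to $R^2$ (squaring $r$, or one minus the residual-to-total ratio) agree, which is exactly what the identity $R^2 = r^2$ promises in simple linear regression.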
Interpreting $R^2$
| $R^2$ | Interpretation |
|---|---|
| 0.00 | The model explains none of the variability. The regression line is useless. |
| 0.25 | The model explains 25% of the variability. Some signal, lots of noise. |
| 0.50 | The model explains half the variability. Moderate fit. |
| 0.75 | The model explains three-quarters of the variability. Good fit. |
| 1.00 | The model explains all the variability. Every point falls exactly on the line. (Almost never happens in practice.) |
Common Pitfall: A high $R^2$ does not mean the model is correct. Anscombe's Dataset II has $R^2 \approx 0.67$ — a seemingly reasonable fit — but the relationship is curved, so the linear model is wrong. $R^2$ tells you how much variability the model explains, but it doesn't tell you whether a linear model is appropriate. For that, you need residual plots.
Residual Plots: The Diagnostic Tool
A residual plot graphs the residuals ($e_i = y_i - \hat{y}_i$) on the y-axis against the predicted values ($\hat{y}_i$) or the $x$ values on the x-axis.
What you WANT to see: A random scatter of points around zero — no pattern, no fan shape, no curves. This means the linear model is capturing the relationship well, and the leftover noise is genuinely random.
What you DON'T want to see:
| Pattern | Meaning | Action |
|---|---|---|
| Curved pattern (U-shape or inverted U) | The relationship is nonlinear | Consider a transformation or polynomial term |
| Fan shape (residuals spread out as $\hat{y}$ increases) | Non-constant variance (heteroscedasticity) | Consider a log transformation of $y$ |
| Clusters | Subgroups in the data | Investigate separately |
| One extreme point | An outlier or influential observation | Investigate; consider its impact |
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
study_hours = np.array([2, 3, 5, 1, 6, 4, 7, 3, 5, 8, 2, 4, 6, 1, 9])
exam_scores = np.array([65, 70, 82, 55, 88, 75, 92, 68, 80, 95, 60, 78, 85, 50, 98])
# Fit regression
slope, intercept, r, p, se = stats.linregress(study_hours, exam_scores)
predicted = intercept + slope * study_hours
residuals = exam_scores - predicted
# Residual plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: Scatter with regression line
axes[0].scatter(study_hours, exam_scores, color='steelblue',
edgecolors='navy', s=80, zorder=5)
x_line = np.linspace(0, 10, 100)
axes[0].plot(x_line, intercept + slope * x_line, 'r-', linewidth=2)
axes[0].set_xlabel('Study Hours')
axes[0].set_ylabel('Exam Score')
axes[0].set_title(f'Regression Line (r = {r:.3f}, R² = {r**2:.3f})')
axes[0].grid(True, alpha=0.3)
# Right: Residual plot
axes[1].scatter(predicted, residuals, color='steelblue',
edgecolors='navy', s=80, zorder=5)
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=1.5)
axes[1].set_xlabel('Predicted Values (ŷ)')
axes[1].set_ylabel('Residuals (y - ŷ)')
axes[1].set_title('Residual Plot')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Excel: Checking Model Fit
- Trendline: Right-click on data points in your scatterplot → Add Trendline → Linear → Check "Display Equation on chart" and "Display R-squared value on chart"
- Functions:
  - `=RSQ(known_y's, known_x's)` — returns $R^2$
  - `=SLOPE(known_y's, known_x's)` — returns $b_1$
  - `=INTERCEPT(known_y's, known_x's)` — returns $b_0$
22.9 Regression to the Mean
Here's the threshold concept of this chapter, and it's one of the most subtle and powerful ideas in all of statistics.
🎯 Threshold Concept: Regression to the Mean
Extreme observations tend to be followed by less extreme observations. This is not because of any causal force pulling things toward the average. It's a mathematical consequence of imperfect correlation.
If the correlation between two measurements is less than perfect ($|r| < 1$), then predicted values will always be less extreme (closer to the mean) than the observed values they're based on. A student who scores in the 95th percentile on Exam 1 will tend to score lower on Exam 2 — not because they got worse, but because some of their extreme score was due to luck, and luck doesn't repeat perfectly.
Understanding regression to the mean prevents you from seeing patterns that aren't there and attributing changes to interventions that didn't work.
Galton's Original Insight
The term "regression" comes from Sir Francis Galton's 1886 study of the heights of parents and their adult children. Galton noticed something surprising: very tall parents tended to have children who were tall — but not as tall as their parents. Very short parents tended to have children who were short — but not as short.
The children's heights "regressed" toward the overall mean height. Galton called this "regression toward mediocrity."
This isn't because tall families are somehow shrinking over generations. It's because the correlation between parent and child height is strong ($r \approx 0.5$) but not perfect. Some of what makes a parent unusually tall is genetic (and gets passed on), but some is environmental luck (nutrition, health, random developmental variation). The environmental luck doesn't repeat.
In regression terms, if you predict a child's height from a parent's height, the predicted value will be less extreme than the parent's value — pulled toward the mean. That's what $r < 1$ does mathematically.
The Sports Illustrated Jinx
Here's the most famous modern example of regression to the mean.
Athletes who appear on the cover of Sports Illustrated often perform worse afterward. Fans call it the "Sports Illustrated Jinx." But there's no curse. An athlete who makes the cover has probably just had an exceptional performance — an outlier. Regression to the mean guarantees that their next performance will likely be less exceptional, simply because extreme performances don't sustain themselves perfectly.
Similarly:
- The NFL Madden Curse: players on the video game cover often have worse seasons afterward.
- Sophomore slump: a rookie with an amazing first year often has a less remarkable second year.
- Hot streaks in basketball: a player who makes 10 shots in a row is likely to miss some of the next 10 — not because they "cooled off" but because extreme runs regress toward their true shooting percentage.
Sam Okafor should know this. If Daria has an incredible game (scoring 35 points), Sam shouldn't expect 35 points next game. Daria's long-run average is around 22 points. The 35-point game included some luck — favorable matchups, shots falling that normally don't — and that luck won't all repeat.
The Medical Trap
Regression to the mean has serious implications for medicine.
Imagine you have high blood pressure one day — say, 150/95 mmHg. Your doctor prescribes medication. A week later, your blood pressure is 140/88. The medication worked, right?
Maybe. But your blood pressure naturally fluctuates. If you were measured at a peak (which is why you went to the doctor in the first place — you felt bad), the next measurement would likely be lower regardless of any treatment. This is regression to the mean.
This is exactly why randomized controlled trials (Chapter 4) are essential for evaluating treatments. The treatment group and the control group both experience regression to the mean, so any difference between them can be attributed to the treatment.
Maya Chen should be vigilant about this. If communities with the highest ER visit rates are given a new intervention, ER visits might decline simply because extreme rates tend to regress toward the mean. Without a control group, Maya can't tell whether the intervention worked or regression to the mean did the work.
The Mathematical Explanation
Why does regression to the mean happen? Look at the slope formula:
$$b_1 = r \cdot \frac{s_y}{s_x}$$
If we're predicting $y$ from $x$ where both variables have the same standard deviation (or are standardized), then $b_1 = r$. Since $|r| < 1$ (unless the correlation is perfect), the predicted $y$ is always less extreme than the observed $x$.
In z-score terms: if someone has a z-score of +2 on $x$ (two standard deviations above the mean), their predicted z-score on $y$ is $r \times 2$. If $r = 0.7$, that's $0.7 \times 2 = 1.4$ — still above the mean, but less extreme. The prediction has "regressed toward the mean."
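This can be checked with a small simulation. The sketch below (with assumed noise levels chosen to give a correlation near 0.7) generates two exam scores that share an underlying "ability" component, then looks at students who scored about two standard deviations above the mean on Exam 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two exams with true correlation ~0.7:
# each score = shared ability + independent luck (assumed noise levels)
n = 100_000
ability = rng.normal(0, 1, n)
exam1 = ability + rng.normal(0, 0.65, n)
exam2 = ability + rng.normal(0, 0.65, n)

r = np.corrcoef(exam1, exam2)[0, 1]

# Standardize both exams, then pick students near z = +2 on Exam 1
z1 = (exam1 - exam1.mean()) / exam1.std()
z2 = (exam2 - exam2.mean()) / exam2.std()
extreme = (z1 > 1.9) & (z1 < 2.1)

print(f"r between exams: {r:.2f}")
print(f"Mean Exam 1 z-score of extreme group: {z1[extreme].mean():.2f}")
print(f"Mean Exam 2 z-score of same group:    {z2[extreme].mean():.2f}")
print(f"Predicted by r × z:                   {r * z1[extreme].mean():.2f}")
```

The extreme group's average Exam 2 z-score lands close to $r \times 2$: still above the mean, but noticeably less extreme, with no causal force involved.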
22.10 The Danger of Extrapolation
Extrapolation is using your regression model to make predictions for $x$ values that are outside the range of the data you used to build the model.
This is dangerous. Very dangerous.
A Concrete Example
Suppose you have data on study hours from 1 to 9 hours per week, and the regression line is $\hat{y} = 45.6 + 5.8x$. Within the observed range, the model works well.
But what if you try to predict the exam score for a student who studies 30 hours per week?
$$\hat{y} = 45.6 + 5.8 \times 30 = 45.6 + 174 = 219.6$$
An exam score of 219.6? On a 100-point exam? That's absurd.
The problem is that the linear relationship between study hours and scores holds within a certain range, but it doesn't hold forever. At some point, additional study hours produce diminishing returns (you can't learn more than everything on the exam), and the relationship levels off or even reverses (sleep deprivation).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Data within observed range
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
scores = np.array([50, 58, 65, 73, 78, 84, 88, 93, 96])
slope, intercept, r, p, se = stats.linregress(hours, scores)
fig, ax = plt.subplots(figsize=(10, 6))
# Observed range
x_obs = np.linspace(0, 10, 100)
ax.plot(x_obs, intercept + slope * x_obs, 'b-', linewidth=2,
label='Within observed range')
# Extrapolation (dangerous!)
x_extrap = np.linspace(10, 25, 100)
ax.plot(x_extrap, intercept + slope * x_extrap, 'r--', linewidth=2,
label='Extrapolation (DANGEROUS)')
# Data points
ax.scatter(hours, scores, color='steelblue', s=100, zorder=5,
edgecolors='navy')
# What really happens (hypothetical curve)
x_real = np.linspace(0, 25, 100)
y_real = 100 * (1 - np.exp(-0.25 * x_real)) # Diminishing returns
ax.plot(x_real, y_real, 'g:', linewidth=2, alpha=0.6,
label='What probably really happens')
ax.axhline(y=100, color='gray', linestyle=':', alpha=0.5)
ax.text(20, 102, 'Maximum possible score', fontsize=9, color='gray')
ax.set_xlabel('Study Hours per Week', fontsize=12)
ax.set_ylabel('Exam Score', fontsize=12)
ax.set_title('The Danger of Extrapolation', fontsize=14)
ax.legend(fontsize=10)
ax.set_ylim(30, 180)
ax.grid(True, alpha=0.3)
# Shade the dangerous region
ax.axvspan(10, 25, alpha=0.1, color='red')
ax.text(15, 40, 'EXTRAPOLATION\nZONE', fontsize=14, color='red',
fontweight='bold', ha='center', alpha=0.4)
plt.tight_layout()
plt.show()
Key Insight: Your regression model is a local approximation. It describes the relationship between $x$ and $y$ within the range of your observed data. Outside that range, all bets are off. The further you extrapolate, the more likely you are to get nonsense.
When Extrapolation Goes Wrong in Practice
- Climate models in the 1970s predicted a new ice age by linearly extrapolating a cooling trend from 1945-1975. The trend reversed.
- Housing price models before 2008 extrapolated an upward trend and didn't consider that prices could fall. The result was the worst financial crisis in 80 years.
- Growth projections for tech companies that assume linear user growth continues indefinitely. At some point, you run out of humans on the planet.
22.11 Regression in Python: The Full Toolkit
Python offers multiple ways to perform regression, from simple to sophisticated. Here's the progression:
Method 1: scipy.stats.linregress (Quick and Simple)
from scipy import stats
import numpy as np
x = np.array([2, 3, 5, 1, 6, 4, 7, 3, 5, 8, 2, 4, 6, 1, 9])
y = np.array([65, 70, 82, 55, 88, 75, 92, 68, 80, 95, 60, 78, 85, 50, 98])
# Fit the regression
result = stats.linregress(x, y)
print(f"Slope (b₁): {result.slope:.4f}")
print(f"Intercept (b₀): {result.intercept:.4f}")
print(f"r: {result.rvalue:.4f}")
print(f"R²: {result.rvalue**2:.4f}")
print(f"p-value: {result.pvalue:.6f}")
print(f"SE of slope: {result.stderr:.4f}")
print(f"\nRegression equation: ŷ = {result.intercept:.2f} + "
f"{result.slope:.2f}x")
print(f"\nInterpretation: For each additional hour of study, "
f"the predicted exam score increases by {result.slope:.2f} points.")
Method 2: seaborn.regplot (Visualization with Regression)
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'study_hours': x, 'exam_score': y})
plt.figure(figsize=(8, 6))
sns.regplot(data=df, x='study_hours', y='exam_score',
ci=95, # 95% confidence band
scatter_kws={'s': 80, 'edgecolors': 'navy'},
line_kws={'color': 'red'})
plt.title('Study Hours vs. Exam Scores with Regression Line')
plt.xlabel('Study Hours per Week')
plt.ylabel('Exam Score')
plt.grid(True, alpha=0.3)
plt.show()
The shaded band around the line is the 95% confidence interval for the regression line — it shows the uncertainty in the line's position.
Method 3: statsmodels OLS (Full Statistical Output)
import statsmodels.api as sm
import numpy as np
x = np.array([2, 3, 5, 1, 6, 4, 7, 3, 5, 8, 2, 4, 6, 1, 9])
y = np.array([65, 70, 82, 55, 88, 75, 92, 68, 80, 95, 60, 78, 85, 50, 98])
# Add constant (intercept term)
X = sm.add_constant(x)
# Fit OLS (Ordinary Least Squares) model
model = sm.OLS(y, X).fit()
# Full summary
print(model.summary())
The statsmodels output gives you everything: coefficients, standard errors, t-statistics, p-values, confidence intervals, $R^2$, adjusted $R^2$, F-statistic, and diagnostic information. This is the output you'll see in published research.
AI Connection (Theme 3):
Every machine learning prediction model — from the simplest linear regression to the most complex neural network — is, at its core, an extension of what you're learning here. A linear regression minimizes the sum of squared residuals. A neural network minimizes a loss function (which is often... the sum of squared residuals, or a variant of it). The slope and intercept you're calculating are parameters — the same word used for the weights in a neural network.
When ChatGPT predicts the next word in a sentence, it's using a model with billions of parameters. But the idea — find parameters that minimize prediction error — is the same idea as least squares regression, invented by Adrien-Marie Legendre in 1805 and Carl Friedrich Gauss in 1809. The most cutting-edge AI rests on a 200-year-old foundation.
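The connection can be made concrete with a toy sketch: fitting the chapter's study-hours line by gradient descent on the squared-residual loss. The learning rate and iteration count here are illustrative choices, not a recipe, but the loop recovers the same coefficients as the closed-form least-squares solution:

```python
import numpy as np
from scipy import stats

x = np.array([2, 3, 5, 1, 6, 4, 7, 3, 5, 8, 2, 4, 6, 1, 9], dtype=float)
y = np.array([65, 70, 82, 55, 88, 75, 92, 68, 80, 95, 60, 78, 85, 50, 98], dtype=float)

# "Train" the two parameters by gradient descent on the squared-residual
# loss -- the same loss that least squares minimizes in closed form.
b0, b1 = 0.0, 0.0      # start from arbitrary parameters
lr = 0.01              # learning rate (illustrative choice)
for _ in range(50_000):
    resid = y - (b0 + b1 * x)
    b0 += lr * resid.mean()          # step down the loss gradient for b0
    b1 += lr * (resid * x).mean()    # step down the loss gradient for b1

closed_form = stats.linregress(x, y)
print(f"Gradient descent: b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"Least squares:    b0 = {closed_form.intercept:.4f}, "
      f"b1 = {closed_form.slope:.4f}")
```

A neural network does the same thing with billions of parameters and a more elaborate loss surface; the "find parameters that minimize prediction error" loop is identical in spirit.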
22.12 Maya's Analysis: Poverty Rate and ER Visits
Maya is investigating a question from the county health board: Is there a relationship between a community's poverty rate and its rate of emergency room visits?
She has data from 25 communities in her region.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
np.random.seed(2026)
# ============================================================
# MAYA'S ANALYSIS: POVERTY RATE vs. ER VISIT RATE
# ============================================================
# Data: 25 communities
# poverty_rate = % of households below poverty line
# er_rate = ER visits per 1,000 residents per year
poverty_rate = np.array([8.2, 12.5, 15.3, 5.1, 22.6, 18.4, 9.7, 25.1, 11.3,
19.8, 6.4, 14.7, 21.3, 7.8, 16.9, 28.4, 10.2, 23.7,
13.6, 20.5, 4.3, 17.2, 26.8, 8.9, 15.8])
er_rate = np.array([145, 198, 225, 110, 310, 265, 155, 345, 175,
280, 120, 215, 295, 138, 240, 380, 165, 320,
200, 290, 95, 250, 365, 148, 230])
print("=" * 65)
print("MAYA'S ANALYSIS: POVERTY RATE vs. ER VISIT RATE")
print("=" * 65)
# Step 1: ALWAYS plot first
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
axes[0].scatter(poverty_rate, er_rate, color='steelblue',
edgecolors='navy', s=80, zorder=5)
axes[0].set_xlabel('Poverty Rate (%)')
axes[0].set_ylabel('ER Visit Rate (per 1,000)')
axes[0].set_title("Step 1: Always Plot First")
axes[0].grid(True, alpha=0.3)
# Step 2: Correlation
r, p_value = stats.pearsonr(poverty_rate, er_rate)
print(f"\nCorrelation Analysis:")
print(f" Pearson's r = {r:.4f}")
print(f" p-value = {p_value:.8f}")
print(f" R² = {r**2:.4f}")
print(f"\n Interpretation: {r**2*100:.1f}% of the variability in ER visit")
print(f" rates is explained by the linear relationship with poverty rate.")
# Step 3: Regression
result = stats.linregress(poverty_rate, er_rate)
print(f"\nRegression Equation:")
print(f" ŷ = {result.intercept:.2f} + {result.slope:.2f}x")
print(f"\n Slope interpretation: For each 1 percentage-point")
print(f" increase in poverty rate, the predicted ER visit rate")
print(f" increases by {result.slope:.2f} visits per 1,000 residents.")
print(f"\n Intercept interpretation: A community with 0% poverty")
print(f" rate would have a predicted ER visit rate of")
print(f" {result.intercept:.1f} per 1,000 residents.")
# Plot with regression line
predicted = result.intercept + result.slope * poverty_rate
axes[0].plot(np.sort(poverty_rate),
result.intercept + result.slope * np.sort(poverty_rate),
'r-', linewidth=2)
axes[0].set_title(f"Regression: ŷ = {result.intercept:.1f} + "
f"{result.slope:.1f}x (R² = {r**2:.3f})")
# Step 4: Residual plot
residuals = er_rate - predicted
axes[1].scatter(predicted, residuals, color='steelblue',
edgecolors='navy', s=80, zorder=5)
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=1.5)
axes[1].set_xlabel('Predicted ER Visit Rate')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Step 5: Full statsmodels output
X = sm.add_constant(poverty_rate)
ols_model = sm.OLS(er_rate, X).fit()
print("\n" + "=" * 65)
print("FULL REGRESSION OUTPUT (statsmodels)")
print("=" * 65)
print(ols_model.summary())
Maya's Interpretation
"There's a strong positive correlation between poverty rate and ER visit rate ($r = 0.99$, $p < 0.001$). For each 1 percentage-point increase in a community's poverty rate, the predicted ER visit rate increases by approximately 11 visits per 1,000 residents. Poverty rate explains about 97% of the variability in ER visit rates across communities."
But Wait — Lurking Variables
Maya is careful not to conclude that poverty causes ER visits. She identifies several lurking variables that could be driving both:
- Access to primary care: Communities with higher poverty rates tend to have fewer primary care doctors. Residents use the ER for non-emergency care because they lack alternatives. The lurking variable (access to primary care) is associated with both poverty and ER usage.
- Insurance coverage: Higher poverty → lower insurance rates → residents avoid preventive care → conditions worsen → ER is the last resort.
- Environmental factors: Poorer communities may have worse air quality, older housing (lead paint, mold), and fewer healthy food options — all of which contribute to health problems that drive ER visits.
Maya's conclusion for the county board: "The data show a strong association between poverty and ER utilization, but this doesn't mean that reducing poverty alone would proportionally reduce ER visits. Multiple factors — access to primary care, insurance coverage, environmental quality — are intertwined with poverty and likely contribute independently to ER usage. A multivariate analysis (Chapter 23) is needed to disentangle these effects."
This is Theme 5 in action: the correlation is real and strong, but the causal story is complicated.
22.13 Alex's Analysis: User Engagement and Content Diversity
Alex Rivera wants to know whether the diversity of content on a user's homepage predicts engagement (measured as session duration in minutes).
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
np.random.seed(42)
# ============================================================
# ALEX'S ANALYSIS: CONTENT DIVERSITY vs. ENGAGEMENT
# ============================================================
# Content diversity score: 0-100 (higher = more diverse genres shown)
# Session duration: minutes per visit
diversity_score = np.array([
15, 22, 35, 48, 55, 62, 70, 78, 85, 92,
18, 28, 40, 52, 58, 65, 72, 80, 88, 95,
20, 30, 42, 50, 60, 68, 75, 82, 90, 45
])
session_duration = np.array([
12, 18, 25, 35, 40, 48, 52, 58, 55, 50,
14, 22, 30, 38, 42, 45, 50, 55, 52, 48,
16, 20, 28, 32, 44, 46, 54, 56, 53, 33
])
print("=" * 60)
print("ALEX'S ANALYSIS: CONTENT DIVERSITY vs. SESSION DURATION")
print("=" * 60)
# Correlation
r, p = stats.pearsonr(diversity_score, session_duration)
print(f"\nCorrelation: r = {r:.4f}, p = {p:.8f}")
print(f"R²: {r**2:.4f}")
# Regression
result = stats.linregress(diversity_score, session_duration)
print(f"\nRegression equation: ŷ = {result.intercept:.2f} + "
f"{result.slope:.4f}x")
print(f"Slope: For each 1-point increase in diversity score,")
print(f" predicted session duration increases by "
f"{result.slope:.3f} minutes.")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Scatter with regression
df = pd.DataFrame({
'diversity_score': diversity_score,
'session_duration': session_duration
})
sns.regplot(data=df, x='diversity_score', y='session_duration',
ax=axes[0], ci=95,
scatter_kws={'s': 60, 'edgecolors': 'navy'},
line_kws={'color': 'red'})
axes[0].set_title(f'Content Diversity vs. Session Duration\n'
f'(r = {r:.3f}, R² = {r**2:.3f})')
axes[0].set_xlabel('Content Diversity Score')
axes[0].set_ylabel('Session Duration (minutes)')
axes[0].grid(True, alpha=0.3)
# Residual plot
predicted = result.intercept + result.slope * diversity_score
residuals = session_duration - predicted
axes[1].scatter(predicted, residuals, color='steelblue',
edgecolors='navy', s=60, zorder=5)
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=1.5)
axes[1].set_xlabel('Predicted Session Duration')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Alex notices something interesting in the residual plot: the residuals show a slight curved pattern. At very high diversity scores (above 80), session duration seems to plateau or even decline slightly. Users who are shown too many different genres might feel overwhelmed rather than engaged.
This is a clue that the relationship may not be perfectly linear. Alex notes this for Chapter 23, where a quadratic term or other nonlinear approach might better capture the pattern. For now, the linear model with $R^2 \approx 0.90$ provides a useful first approximation.
Alex's recommendation to the product team: "Higher content diversity is associated with longer session durations ($r \approx 0.95$), but the relationship appears to plateau at very high diversity levels. I recommend testing a content diversity target of 60-80 rather than maximizing diversity, and conducting an A/B test to establish causation."
Note Alex's caution: the regression shows an association, but only an A/B test (Chapter 4/16) can establish that changing diversity scores causes longer sessions.
22.14 Sam's Analysis: Practice Hours and Shooting Percentage
Sam Okafor wants to know if there's a relationship between the Raptors' practice hours and their game-day shooting performance.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
np.random.seed(2026)
# ============================================================
# SAM'S ANALYSIS: PRACTICE HOURS vs. SHOOTING PERCENTAGE
# ============================================================
# Weekly practice hours (for each player, averaged over season)
practice_hours = np.array([8, 10, 12, 14, 16, 18, 20, 22, 6, 11,
13, 15, 17, 19, 9, 21, 7, 14, 16, 12])
# Game-day shooting percentage
shooting_pct = np.array([38, 42, 45, 48, 50, 52, 55, 54, 35, 43,
44, 49, 51, 53, 40, 56, 37, 47, 50, 46])
print("=" * 60)
print("SAM'S ANALYSIS: PRACTICE HOURS vs. SHOOTING PERCENTAGE")
print("=" * 60)
# Correlation
r, p = stats.pearsonr(practice_hours, shooting_pct)
print(f"\nCorrelation: r = {r:.4f}")
print(f"p-value: {p:.6f}")
print(f"R²: {r**2:.4f}")
# Regression
result = stats.linregress(practice_hours, shooting_pct)
print(f"\nRegression equation: ŷ = {result.intercept:.2f} + "
f"{result.slope:.2f}x")
print(f"\nSlope interpretation: For each additional hour of weekly")
print(f" practice, the predicted shooting percentage increases")
print(f" by {result.slope:.2f} percentage points.")
# Prediction: Daria practices 15 hours/week
daria_hours = 15
daria_predicted = result.intercept + result.slope * daria_hours
print(f"\nPrediction for Daria (15 hrs/week): {daria_predicted:.1f}%")
# Extrapolation warning
extreme_hours = 40
extreme_predicted = result.intercept + result.slope * extreme_hours
print(f"\nEXTRAPOLATION WARNING:")
print(f" Prediction for 40 hrs/week: {extreme_predicted:.1f}%")
print(f" (This is extrapolation! Our data only covers 6-22 hrs/week)")
print(f" (Also, {extreme_predicted:.1f}% might exceed physical limits)")
# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(practice_hours, shooting_pct, color='steelblue',
edgecolors='navy', s=80, zorder=5)
x_line = np.linspace(4, 24, 100)
ax.plot(x_line, result.intercept + result.slope * x_line,
'r-', linewidth=2, label=f'ŷ = {result.intercept:.1f} + '
f'{result.slope:.2f}x')
# Mark Daria's prediction
ax.scatter([15], [daria_predicted], color='gold', s=200,
zorder=6, edgecolors='darkorange', marker='*',
label=f'Daria prediction: {daria_predicted:.1f}%')
ax.set_xlabel('Weekly Practice Hours', fontsize=12)
ax.set_ylabel('Shooting Percentage (%)', fontsize=12)
ax.set_title(f'Practice Hours vs. Shooting Percentage '
f'(r = {r:.3f}, R² = {r**2:.3f})')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Sam's reflection:
"The data shows a strong positive correlation ($r \approx 0.98$) between practice hours and shooting percentage. But I need to be careful about two things:
- Correlation vs. causation: Maybe more talented players naturally practice more AND shoot better. The lurking variable could be innate ability or motivation. An experiment would require randomly assigning practice hours, which isn't practical.
- Regression to the mean: Daria's 35-point game last week was exceptional. I shouldn't expect that to continue — her true average is around 22 points. The regression model predicts her average shooting percentage, not her best game."
22.15 James's Analysis: Algorithm Risk Score and Actual Recidivism
Professor James Washington is examining a critical question: Does the algorithm's risk score actually predict recidivism?
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
np.random.seed(2026)
# ============================================================
# JAMES'S ANALYSIS: RISK SCORE vs. ACTUAL RECIDIVISM RATE
# ============================================================
# Risk score bins (1-10 scale, grouped by decile)
# For each risk score bin, the percentage who actually reoffended
risk_score_bin = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
actual_recidivism_pct = np.array([8, 12, 18, 24, 30, 38, 45, 55, 62, 72])
# Sample sizes per bin (to weight our interpretation)
n_per_bin = np.array([450, 380, 320, 290, 260, 220, 180, 150, 120, 80])
print("=" * 65)
print("JAMES'S ANALYSIS: ALGORITHM RISK SCORE vs. ACTUAL RECIDIVISM")
print("=" * 65)
# Correlation
r, p = stats.pearsonr(risk_score_bin, actual_recidivism_pct)
print(f"\nCorrelation: r = {r:.4f}")
print(f"p-value: {p:.8f}")
print(f"R²: {r**2:.4f}")
# Regression
result = stats.linregress(risk_score_bin, actual_recidivism_pct)
print(f"\nRegression equation: ŷ = {result.intercept:.2f} + "
f"{result.slope:.2f}x")
print(f"\nSlope: For each 1-point increase in risk score,")
print(f" the predicted actual recidivism rate increases by")
print(f" {result.slope:.2f} percentage points.")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Scatter with regression
axes[0].scatter(risk_score_bin, actual_recidivism_pct,
s=n_per_bin/3, color='steelblue',
edgecolors='navy', alpha=0.7, zorder=5)
x_line = np.linspace(0.5, 10.5, 100)
axes[0].plot(x_line, result.intercept + result.slope * x_line,
'r-', linewidth=2)
axes[0].set_xlabel('Algorithm Risk Score (1-10)')
axes[0].set_ylabel('Actual Recidivism Rate (%)')
axes[0].set_title(f'Risk Score vs. Actual Recidivism\n'
f'(r = {r:.3f}, R² = {r**2:.3f})')
axes[0].set_xlim(0, 11)
axes[0].grid(True, alpha=0.3)
axes[0].text(2, 65, 'Bubble size = sample size\nper risk score bin',
fontsize=9, fontstyle='italic')
# Residual plot
predicted = result.intercept + result.slope * risk_score_bin
residuals = actual_recidivism_pct - predicted
axes[1].scatter(predicted, residuals, color='steelblue',
edgecolors='navy', s=80, zorder=5)
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=1.5)
axes[1].set_xlabel('Predicted Recidivism Rate')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# The critical question: does the algorithm perform equally
# across racial groups?
print("\n" + "=" * 65)
print("CRITICAL QUESTION: DOES THE ALGORITHM PERFORM EQUALLY")
print("ACROSS RACIAL GROUPS?")
print("=" * 65)
# Simulate disparate calibration
# At the same risk score, are recidivism rates the same?
risk_bins_shared = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
recid_white = np.array([10, 15, 20, 27, 33, 40, 48, 57, 65, 75])
recid_black = np.array([6, 9, 15, 20, 26, 34, 41, 52, 58, 68])
r_w, _ = stats.pearsonr(risk_bins_shared, recid_white)
r_b, _ = stats.pearsonr(risk_bins_shared, recid_black)
result_w = stats.linregress(risk_bins_shared, recid_white)
result_b = stats.linregress(risk_bins_shared, recid_black)
print(f"\nWhite defendants:")
print(f" ŷ = {result_w.intercept:.2f} + {result_w.slope:.2f}x "
f" (r = {r_w:.3f})")
print(f"\nBlack defendants:")
print(f" ŷ = {result_b.intercept:.2f} + {result_b.slope:.2f}x "
f" (r = {r_b:.3f})")
print(f"\nAt risk score = 5:")
print(f" White predicted recidivism: "
f"{result_w.intercept + result_w.slope * 5:.1f}%")
print(f" Black predicted recidivism: "
f"{result_b.intercept + result_b.slope * 5:.1f}%")
print(f"\n At the SAME risk score, white defendants are MORE likely")
print(f" to actually reoffend than Black defendants.")
print(f" This means the algorithm OVER-predicts risk for Black")
print(f" defendants relative to their actual recidivism rates.")
James's analysis reveals a disturbing pattern: At the same risk score, Black defendants have lower actual recidivism rates than white defendants. This means the algorithm is effectively holding Black defendants to a stricter standard — assigning them the same risk score as white defendants who are actually more dangerous.
James's conclusion: "The overall correlation between risk scores and recidivism is strong ($r \approx 0.99$, $R^2 \approx 0.99$), which superficially suggests the algorithm 'works.' But when we disaggregate by race, we find differential calibration — the same risk score means different things for different racial groups. A risk score of 5 corresponds to approximately 33% recidivism for white defendants but only 26% for Black defendants. The algorithm's accuracy is not the issue; its fairness is."
22.16 Conditions for Regression Inference
When we use regression for inference (hypothesis tests on the slope, confidence intervals for predictions), we need to check four conditions, often remembered as LINE:
| Condition | Description | How to Check |
|---|---|---|
| Linearity | The relationship between $x$ and $y$ is linear | Scatterplot, residual plot (no curves) |
| Independence | The observations are independent of each other | Study design (random sampling, no repeated measures) |
| Normality | The residuals are approximately normally distributed | Histogram of residuals, QQ-plot of residuals |
| Equal variance | The spread of the residuals is roughly constant across all $x$ values (homoscedasticity) | Residual plot (no fan shape) |
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Checking all four conditions
x = np.array([2, 3, 5, 1, 6, 4, 7, 3, 5, 8, 2, 4, 6, 1, 9])
y = np.array([65, 70, 82, 55, 88, 75, 92, 68, 80, 95, 60, 78, 85, 50, 98])
result = stats.linregress(x, y)
predicted = result.intercept + result.slope * x
residuals = y - predicted

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Linearity: scatterplot with regression line
axes[0, 0].scatter(x, y, color='steelblue', edgecolors='navy', s=60)
x_line = np.linspace(0, 10, 100)
axes[0, 0].plot(x_line, result.intercept + result.slope * x_line,
                'r-', linewidth=2)
axes[0, 0].set_title('1. Linearity Check (Scatterplot)')
axes[0, 0].set_xlabel('x')
axes[0, 0].set_ylabel('y')
axes[0, 0].grid(True, alpha=0.3)

# 2. Residual plot (linearity + equal variance)
axes[0, 1].scatter(predicted, residuals, color='steelblue',
                   edgecolors='navy', s=60)
axes[0, 1].axhline(y=0, color='red', linestyle='--')
axes[0, 1].set_title('2. Residual Plot (Linearity + Equal Variance)')
axes[0, 1].set_xlabel('Predicted values')
axes[0, 1].set_ylabel('Residuals')
axes[0, 1].grid(True, alpha=0.3)

# 3. Normality of residuals
axes[1, 0].hist(residuals, bins=6, color='steelblue',
                edgecolor='navy', alpha=0.7)
axes[1, 0].set_title('3. Normality Check (Histogram of Residuals)')
axes[1, 0].set_xlabel('Residual')
axes[1, 0].set_ylabel('Frequency')

# 4. QQ-plot of residuals
stats.probplot(residuals, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('4. Normality Check (QQ-Plot of Residuals)')

plt.suptitle('Regression Diagnostics: Checking LINE Conditions',
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```
Practical Note: With small samples (like our 15-observation example), these diagnostic plots are hard to interpret. Patterns are difficult to distinguish from random noise. With larger samples (50+), the plots become much more informative.
22.17 Spaced Review: What 95% Confidence Really Means (from Ch.12)
🔄 Spaced Review 4 (from Ch.12): What 95% Confidence Really Means
In Chapter 12, you learned that a 95% confidence interval does NOT mean "there's a 95% chance the true value is in this interval." The true value is fixed — it's either in the interval or it's not. What 95% means is: if we repeated the entire sampling-and-analysis process many times, approximately 95% of the resulting intervals would contain the true value.
The same logic applies to regression. When `statsmodels` reports a 95% confidence interval for the slope — say, (4.1, 6.3) — it means: if we repeated the study many times with new samples, approximately 95% of the resulting slope confidence intervals would contain the true population slope $\beta_1$. This particular interval either contains $\beta_1$ or it doesn't, but the method captures the truth 95% of the time.
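This repeated-sampling meaning can be checked directly with a simulation. The sketch below is illustrative (the true slope $\beta_1 = 5$, the noise level, and the sample size are all assumptions for the demo, not values from the chapter): it builds a 95% t-interval for the slope from `scipy.stats.linregress`'s standard error on each of 2,000 simulated studies and counts how often the interval captures the truth.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
beta0, beta1 = 10.0, 5.0      # true parameters (assumed for this demo)
n, reps, covered = 30, 2000, 0

for _ in range(reps):
    # Draw a fresh sample from the same population each time
    x = rng.uniform(0, 10, size=n)
    y = beta0 + beta1 * x + rng.normal(scale=4, size=n)
    res = stats.linregress(x, y)
    # 95% t-interval for the slope: b1 ± t* · SE(b1), with n - 2 df
    t_crit = stats.t.ppf(0.975, df=n - 2)
    lo, hi = res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr
    covered += (lo <= beta1 <= hi)

print(f"Coverage over {reps} repetitions: {covered / reps:.3f}")  # ≈ 0.95
```

No single interval "knows" whether it caught $\beta_1$; the 95% describes the long-run behavior of the procedure.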
22.18 Spaced Review: Statistical vs. Practical Significance (from Ch.17)
🔄 Spaced Review 5 (from Ch.17): Statistical vs. Practical Significance
In Chapter 17, you learned the crucial distinction between statistical and practical significance. A statistically significant regression slope (low p-value) tells you the relationship is unlikely to be due to chance alone. But it doesn't tell you whether the relationship is big enough to matter.
In regression, $R^2$ is the key measure of practical significance. A slope of $b_1 = 0.003$ might be statistically significant with a large enough sample, but if $R^2 = 0.02$, the model explains only 2% of the variability — the relationship is real but essentially useless for prediction. Always report and interpret $R^2$ alongside the p-value.
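A quick simulation makes the distinction concrete. All the numbers below are illustrative assumptions, not data from the chapter: with ten thousand observations, a true slope of 0.003 buried in noise produces a vanishingly small p-value even though the model explains only about 2% of the variability.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 10_000
x = rng.normal(size=n)
# Tiny true slope plus lots of noise: a real but practically useless relationship
y = 0.003 * x + rng.normal(scale=0.021, size=n)

result = stats.linregress(x, y)
print(f"slope   = {result.slope:.5f}")
print(f"p-value = {result.pvalue:.2e}")   # statistically significant
print(f"R²      = {result.rvalue**2:.4f}")  # ...yet explains ~2% of variability
```

The p-value answers "is there any relationship at all?"; $R^2$ answers "is it big enough to be useful?" Large samples make the first question easy to answer yes while leaving the second answer a firm no.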
22.19 Progressive Project: Scatterplots and Regression in Your Dataset
It's time to explore relationships in your Data Detective Portfolio.
Your Task
1. Choose two numerical variables from your dataset that you think might be related. One should be the explanatory variable ($x$), the other the response variable ($y$).
2. Create a scatterplot. Describe the direction, form, and strength of the relationship.
3. Calculate the Pearson correlation coefficient using `scipy.stats.pearsonr()`.
4. Fit a simple linear regression model using `scipy.stats.linregress()` or `statsmodels.OLS()`.
5. Interpret the slope and intercept in context. Does the intercept have a meaningful interpretation?
6. Report $R^2$ and interpret it: what percentage of the variability in $y$ is explained by $x$?
7. Create a residual plot. Do the residuals show any patterns that suggest the linear model is inappropriate?
8. Discuss correlation vs. causation. Identify at least one potential lurking variable that might explain the association.
Starter Code
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

# Load your dataset
df = pd.read_csv('your_dataset.csv')

# Define your variables
x_var = 'your_explanatory_variable'
y_var = 'your_response_variable'

# Drop rows missing EITHER variable together, so x and y stay paired.
# (Calling .dropna() on each column separately can misalign the arrays.)
clean = df[[x_var, y_var]].dropna()

# Step 1: Scatterplot
plt.figure(figsize=(10, 6))
sns.regplot(data=clean, x=x_var, y=y_var, ci=95,
            scatter_kws={'s': 60, 'edgecolors': 'navy', 'alpha': 0.6},
            line_kws={'color': 'red'})
plt.title(f'{y_var} vs. {x_var}')
plt.grid(True, alpha=0.3)
plt.show()

# Step 2: Correlation
r, p = stats.pearsonr(clean[x_var], clean[y_var])
print(f"Pearson's r = {r:.4f}, p-value = {p:.6f}")

# Step 3: Regression
x = clean[x_var].values
y = clean[y_var].values
result = stats.linregress(x, y)
print(f"\nRegression equation: ŷ = {result.intercept:.4f} + "
      f"{result.slope:.4f}x")
print(f"R² = {result.rvalue**2:.4f}")

# Step 4: Residual plot
predicted = result.intercept + result.slope * x
residuals = y - predicted
plt.figure(figsize=(8, 5))
plt.scatter(predicted, residuals, color='steelblue',
            edgecolors='navy', s=60, alpha=0.6)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.grid(True, alpha=0.3)
plt.show()

# Step 5: Interpretation (write in a markdown cell)
```
# Step 5: Interpretation (write in a markdown cell)
Add this analysis to your Jupyter notebook under a new heading: "Relationships and Prediction: Correlation and Regression."
22.20 Chapter Summary
What We Learned
This chapter introduced the two most important tools for describing and predicting relationships between numerical variables. Here's the arc:
1. Always plot first. Anscombe's Quartet proved that summary statistics can be identical for fundamentally different relationships. Scatterplots reveal what numbers hide.
2. The Pearson correlation coefficient ($r$) measures the strength and direction of a linear relationship. It ranges from $-1$ to $+1$, has no units, and is symmetric. But it only measures linear relationships — a perfect curve can have $r = 0$.
3. Correlation does not imply causation. Lurking variables, reverse causation, and coincidence can all produce impressive correlations between unrelated variables. Only randomized experiments establish causation definitively.
4. Simple linear regression fits the line $\hat{y} = b_0 + b_1 x$ that minimizes the sum of squared residuals. The slope tells you "for each unit increase in $x$, $y$ changes by $b_1$ on average." The intercept anchors the line at $x = 0$.
5. $R^2$ measures model fit — the proportion of variability in $y$ explained by $x$. It's the regression analogue of $\eta^2$ from ANOVA (Chapter 20). Same concept, same formula, different context.
6. Residual plots check whether the linear model is appropriate. Look for randomness — patterns signal problems.
7. Regression to the mean is the threshold concept: extreme observations tend to be followed by less extreme ones, not because of any force, but because of imperfect correlation. It explains the Sports Illustrated Jinx, the sophomore slump, and why you shouldn't conclude a medical treatment works just because patients improve after visiting the doctor.
8. Extrapolation is dangerous. Never predict beyond the range of your observed data.
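The regression-to-the-mean point is easy to see in a short simulation. The "skill plus luck" model below is a hypothetical setup for illustration (none of the numbers come from the chapter): each score is stable skill plus independent luck, so the two rounds are imperfectly correlated, and the top performers from round 1 drift back toward the mean in round 2 with no jinx involved.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
skill = rng.normal(size=n)            # stable underlying ability
round1 = skill + rng.normal(size=n)   # observed score = skill + luck
round2 = skill + rng.normal(size=n)   # same skill, fresh luck

# Select the top 10% of performers in round 1...
top = round1 >= np.quantile(round1, 0.90)

# ...and compare their average scores across the two rounds
print(f"Round 1 mean of top group: {round1[top].mean():.2f}")
print(f"Round 2 mean of same group: {round2[top].mean():.2f}")  # closer to 0
```

The group's skill didn't change between rounds; only the luck that helped select them into the top 10% failed to repeat.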
Key Formulas
| Formula | Description |
|---|---|
| $r = \frac{1}{n-1}\sum\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$ | Pearson correlation coefficient |
| $b_1 = r \cdot \frac{s_y}{s_x}$ | Slope of regression line |
| $b_0 = \bar{y} - b_1\bar{x}$ | Intercept of regression line |
| $\hat{y} = b_0 + b_1 x$ | Regression equation |
| $e_i = y_i - \hat{y}_i$ | Residual |
| $R^2 = r^2 = 1 - \frac{SS_{\text{Res}}}{SS_{\text{Total}}}$ | Coefficient of determination |
| $b_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}$ | Alternative slope formula |
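A quick numeric sanity check (on small illustrative data, not a dataset from the chapter) confirms that the two slope formulas in the table agree with each other, and that the hand-computed coefficients match `scipy.stats.linregress`:

```python
import numpy as np
from scipy import stats

x = np.array([2.0, 3, 5, 1, 6, 4, 7, 3, 5, 8])
y = np.array([65.0, 70, 82, 55, 88, 75, 92, 68, 80, 95])

r = np.corrcoef(x, y)[0, 1]
b1_from_r = r * y.std(ddof=1) / x.std(ddof=1)        # b1 = r · sy/sx
b1_direct = (np.sum((x - x.mean()) * (y - y.mean()))
             / np.sum((x - x.mean()) ** 2))          # alternative slope formula
b0 = y.mean() - b1_from_r * x.mean()                 # b0 = ȳ - b1·x̄

result = stats.linregress(x, y)
print(np.isclose(b1_from_r, b1_direct))      # both slope formulas agree
print(np.isclose(b1_from_r, result.slope))   # ...and match scipy's slope
print(np.isclose(b0, result.intercept))      # ...and scipy's intercept
```

Note the `ddof=1` in the standard deviations: the correlation formula in the table uses sample standard deviations ($n - 1$ in the denominator), so the check must too.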
Connections to What's Next
In Chapter 23, you'll learn what happens when you add more predictors. Simple linear regression uses one $x$ variable. Multiple regression uses two, five, or even fifty. The key new idea is interpreting a coefficient "holding other variables constant" — which is how regression can partially control for confounders in observational data. The $R^2$ you learned here extends naturally, and the residual diagnostics become even more important.
What's Next: Chapter 23 tackles multiple regression — the tool for modeling the real world, where outcomes depend on many variables simultaneously. You'll learn to control for confounders, interpret "holding other variables constant," and build models that explain far more variability than any single predictor can. If simple linear regression is a flashlight, multiple regression is a searchlight.
"The best thing about being a statistician is that you get to play in everyone's backyard." — John Tukey