In This Chapter
- Chapter Overview
- 24.1 What Is Correlation? Measuring Relationships Between Variables
- 24.2 Correlation Matrices: Exploring Many Relationships at Once
- 24.3 The Big Idea: Why Correlation Does Not Equal Causation
- 24.4 Simpson's Paradox: When the Whole Contradicts the Parts
- 24.5 How to Think About Causation: A Framework
- 24.6 A Practical Checklist for Evaluating Causal Claims
- 24.7 Progressive Project: GDP, Healthcare Spending, and Vaccination Rates
- 24.8 A Gallery of Confounders: Real-World Examples
- 24.9 Connecting the Threads
- 24.10 Chapter Summary
Chapter 24: Correlation, Causation, and the Danger of Confusing the Two
"If you torture the data long enough, it will confess to anything." — Ronald Coase, Nobel laureate in Economics
Chapter Overview
Here is a true statement: countries that consume more chocolate per capita win more Nobel Prizes.
Here is another true statement: this does not mean that eating chocolate makes you smarter, that Nobel laureates eat a lot of chocolate, or that importing Toblerone is a path to scientific glory.
And yet, a real paper published in the New England Journal of Medicine in 2012 presented exactly this correlation, complete with a scatter plot and a Pearson correlation coefficient of r = 0.79. The paper was widely covered in the media with headlines like "Chocolate consumption linked to Nobel Prizes." Some readers took it as half-serious dietary advice. Others recognized it for what it was: a vivid demonstration that correlation — even strong, statistically significant correlation — does not mean one thing causes another.
This chapter is about that distinction. It's a distinction that sounds simple ("yeah, yeah, correlation isn't causation, I've heard that before") but is actually one of the hardest ideas in all of data science to fully internalize. Once you truly understand it — once it's not just a slogan but a reflex — you'll read every news headline, every research summary, and every data analysis differently. You'll start seeing confounders everywhere. You'll catch yourself (and others) jumping from "X and Y are related" to "X causes Y" and you'll know to stop and ask: "Wait — is there a third variable that explains this?"
That's the threshold concept for this chapter, and it's one of the most important ideas in this entire book.
In this chapter, you will learn to:
- Compute and interpret Pearson's r and Spearman's rho to measure the strength and direction of relationships (all paths)
- Create and interpret scatter plots and correlation matrices using Python (all paths)
- Explain why correlation does not imply causation and identify confounding variables (all paths)
- Recognize Simpson's paradox and explain how aggregation can reverse relationships (standard + deep dive paths)
- Distinguish between observational studies, natural experiments, and randomized controlled trials (all paths)
- Evaluate causal claims by identifying potential confounders and alternative explanations (all paths)
- Apply correlation analysis to investigate GDP, healthcare spending, and vaccination rates (all paths)
🚪 Threshold Concept Alert: This chapter contains a threshold concept — "correlation does not equal causation" — that genuinely changes how you see the world. You've probably heard the phrase before, but understanding it deeply is different from knowing the words. Once this clicks, you can't unsee it.
24.1 What Is Correlation? Measuring Relationships Between Variables
Let's start with the mechanics. You have two variables — say, a country's GDP per capita and its vaccination rate — and you want to know: are they related? When one goes up, does the other tend to go up too? Or go down? Or is there no pattern at all?
Correlation is a statistical measure of the strength and direction of the linear relationship between two variables.
Pearson's r: The Most Common Correlation Measure
The Pearson correlation coefficient (r) measures the strength of the linear relationship between two continuous variables. It ranges from -1 to +1:
- r = +1: Perfect positive linear relationship (as X goes up, Y goes up, perfectly in a line)
- r = -1: Perfect negative linear relationship (as X goes up, Y goes down, perfectly)
- r = 0: No linear relationship (knowing X tells you nothing about Y)
Values between these extremes indicate intermediate strengths:
| |r| Range | Interpretation |
|---|---|
| 0.00 - 0.09 | Negligible |
| 0.10 - 0.29 | Weak |
| 0.30 - 0.49 | Moderate |
| 0.50 - 0.69 | Strong |
| 0.70 - 0.89 | Very strong |
| 0.90 - 1.00 | Nearly perfect |
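Those cutoffs are easy to encode as a small helper for labeling a coefficient (the function name is ours; the thresholds follow the table above):

```python
def interpret_r(r):
    """Label the strength of a correlation, using the table above.

    Only the magnitude matters: r = -0.35 and r = +0.35 are equally strong.
    """
    strength = abs(r)
    if strength < 0.10:
        return "negligible"
    elif strength < 0.30:
        return "weak"
    elif strength < 0.50:
        return "moderate"
    elif strength < 0.70:
        return "strong"
    elif strength < 0.90:
        return "very strong"
    else:
        return "nearly perfect"

print(interpret_r(0.79))   # the chocolate-Nobel correlation from the overview
print(interpret_r(-0.35))  # direction doesn't matter, only magnitude
```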
Let's compute and visualize correlations:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
np.random.seed(42)
# Create datasets with different correlations
n = 100
# Strong positive (r ≈ 0.80)
x1 = np.random.normal(50, 10, n)
y1 = 0.8 * x1 + np.random.normal(0, 6, n)
# Moderate positive (r ≈ 0.50)
x2 = np.random.normal(50, 10, n)
y2 = 0.5 * x2 + np.random.normal(0, 10, n)
# Near zero (r ≈ 0)
x3 = np.random.normal(50, 10, n)
y3 = np.random.normal(50, 10, n)
# Strong negative (r ≈ -0.80)
x4 = np.random.normal(50, 10, n)
y4 = -0.8 * x4 + 90 + np.random.normal(0, 6, n)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
datasets = [(x1, y1, 'Strong Positive'),
            (x2, y2, 'Moderate Positive'),
            (x3, y3, 'Near Zero'),
            (x4, y4, 'Strong Negative')]
for ax, (x, y, title) in zip(axes.flatten(), datasets):
    r, p = stats.pearsonr(x, y)
    ax.scatter(x, y, alpha=0.5, color='steelblue', edgecolor='white')
    ax.set_title(f'{title}\nr = {r:.3f}, p = {p:.4f}', fontsize=12)
    ax.set_xlabel('Variable X')
    ax.set_ylabel('Variable Y')
    # Add regression line
    m, b = np.polyfit(x, y, 1)
    ax.plot(np.sort(x), m * np.sort(x) + b, color='red', linewidth=2)
plt.suptitle('Scatter Plots with Different Correlation Strengths', fontsize=14,
fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('correlation_examples.png', dpi=150, bbox_inches='tight')
plt.show()
Computing Pearson's r in Python
# Method 1: scipy.stats.pearsonr (returns r and p-value)
r, p_value = stats.pearsonr(x1, y1)
print(f"Pearson r: {r:.4f}")
print(f"P-value: {p_value:.6f}")
# Method 2: numpy.corrcoef (returns correlation matrix)
corr_matrix = np.corrcoef(x1, y1)
print(f"\nCorrelation matrix:\n{corr_matrix.round(4)}")
# Method 3: pandas .corr() (works on DataFrames)
df = pd.DataFrame({'x': x1, 'y': y1})
print(f"\npandas correlation: {df['x'].corr(df['y']):.4f}")
The Formula (For Understanding, Not Memorization)
Pearson's r is essentially the average of the product of the z-scores of x and y:
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
Intuitively: when x is above its mean AND y is above its mean (both z-scores positive), the product is positive. When both are below their means (both z-scores negative), the product is still positive. When they go in opposite directions, the product is negative. Averaging these products tells you whether x and y tend to move together (positive r), in opposite directions (negative r), or independently (r near zero).
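To check that intuition against the library, here is the formula translated line by line and compared with scipy's result (a quick sketch on synthetic data; `ddof=1` gives the sample standard deviation the formula uses):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)
y = 0.8 * x + rng.normal(0, 6, 200)

# z-score each variable using the sample standard deviation (ddof=1)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# Average the products of z-scores, dividing by n-1 as in the formula
r_manual = (zx * zy).sum() / (len(x) - 1)
r_scipy, _ = stats.pearsonr(x, y)

print(f"manual r: {r_manual:.6f}")
print(f"scipy r:  {r_scipy:.6f}")
```

The two numbers agree to floating-point precision, which is a good sanity check that the formula really is what `pearsonr` computes.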
Spearman's Rho: Correlation for Monotonic Relationships
Pearson's r measures linear relationships. But what if the relationship is monotonic (consistently increasing or decreasing) but not linear?
Spearman's rank correlation (ρ or "rho") measures monotonic relationships by first converting the data to ranks and then computing Pearson's r on the ranks.
# Nonlinear but monotonic relationship
x_exp = np.random.uniform(1, 100, 100)
y_exp = np.log(x_exp) * 10 + np.random.normal(0, 3, 100) # Logarithmic
pearson_r, _ = stats.pearsonr(x_exp, y_exp)
spearman_r, _ = stats.spearmanr(x_exp, y_exp)
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(x_exp, y_exp, alpha=0.6, color='steelblue', edgecolor='white')
ax.set_xlabel('X', fontsize=12)
ax.set_ylabel('Y', fontsize=12)
ax.set_title(f'Nonlinear Monotonic Relationship\n'
f'Pearson r = {pearson_r:.3f} | Spearman ρ = {spearman_r:.3f}',
fontsize=13)
plt.tight_layout()
plt.savefig('spearman_vs_pearson.png', dpi=150, bbox_inches='tight')
plt.show()
Spearman's ρ is higher because it captures the monotonic (always-increasing) pattern, while Pearson's r is pulled down by the curvature. Use Spearman when:
- The relationship might not be linear
- The data has outliers (ranks are resistant to outliers)
- The data is ordinal (ranked categories)
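You can also verify the definition directly: ranking both variables with `scipy.stats.rankdata` and feeding the ranks to `pearsonr` reproduces `spearmanr` exactly (a small check on data like the example above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.uniform(1, 100, 100)
y = np.log(x) * 10 + rng.normal(0, 3, 100)

# Spearman's rho, straight from scipy
rho, _ = stats.spearmanr(x, y)

# The definition: rank both variables, then compute Pearson's r on the ranks
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))

print(np.isclose(rho, r_on_ranks))  # True: same number
```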
When Correlation Misleads: Anscombe's Quartet
In 1973, statistician Francis Anscombe created four datasets that have nearly identical correlation coefficients (r ≈ 0.82) and regression lines — but look completely different when plotted:
# Anscombe's Quartet
anscombe = sns.load_dataset('anscombe')
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, (name, group) in zip(axes.flatten(), anscombe.groupby('dataset')):
    ax.scatter(group['x'], group['y'], color='steelblue', s=60, edgecolor='white')
    r, _ = stats.pearsonr(group['x'], group['y'])
    m, b = np.polyfit(group['x'], group['y'], 1)
    x_line = np.linspace(group['x'].min(), group['x'].max(), 100)
    ax.plot(x_line, m * x_line + b, color='red', linewidth=2)
    ax.set_title(f'Dataset {name}: r = {r:.3f}', fontsize=12)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
plt.suptitle("Anscombe's Quartet: Same r, Very Different Stories",
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('anscombes_quartet.png', dpi=150, bbox_inches='tight')
plt.show()
The lesson is clear: always plot your data. A correlation coefficient is a summary — and like all summaries, it can hide important details. Dataset II has a perfect curved relationship that Pearson's r describes poorly. Dataset III has a perfect linear relationship distorted by a single outlier. Dataset IV has no relationship at all except for one influential point.
24.2 Correlation Matrices: Exploring Many Relationships at Once
When you have multiple variables, computing correlations between all pairs gives you a correlation matrix — a table (and often a heatmap) that shows every pairwise relationship at a glance.
# Create a dataset with multiple health indicators
np.random.seed(42)
n_countries = 150
# Generate correlated variables
gdp = np.random.lognormal(mean=9, sigma=1.2, size=n_countries)
gdp = gdp / 1000 # Scale to thousands
# Variables correlated with GDP (with noise)
health_spending = 0.3 * gdp + np.random.exponential(2, n_countries)
physicians = 0.5 * np.log(gdp) + np.random.normal(0, 0.8, n_countries)
physicians = np.clip(physicians, 0.1, 8)
vax_rate = 40 + 10 * np.log(gdp + 1) + np.random.normal(0, 10, n_countries)
vax_rate = np.clip(vax_rate, 10, 99)
life_expectancy = 55 + 5 * np.log(gdp + 1) + np.random.normal(0, 4, n_countries)
life_expectancy = np.clip(life_expectancy, 40, 88)
infant_mortality = 80 - 8 * np.log(gdp + 1) + np.random.normal(0, 10, n_countries)
infant_mortality = np.clip(infant_mortality, 2, 120)
health_data = pd.DataFrame({
'GDP per capita ($K)': gdp,
'Health spending (% GDP)': health_spending,
'Physicians per 1000': physicians,
'Vaccination rate (%)': vax_rate,
'Life expectancy': life_expectancy,
'Infant mortality (per 1000)': infant_mortality
})
# Compute correlation matrix
corr = health_data.corr()
print("Correlation Matrix:")
print(corr.round(3))
# Visualize as heatmap
fig, ax = plt.subplots(figsize=(10, 8))
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdBu_r', center=0,
mask=mask, square=True, linewidths=1, ax=ax,
vmin=-1, vmax=1, cbar_kws={'label': 'Correlation (r)'})
ax.set_title('Correlation Matrix: Health Indicators', fontsize=14)
plt.tight_layout()
plt.savefig('correlation_matrix_health.png', dpi=150, bbox_inches='tight')
plt.show()
Reading the Correlation Matrix
The heatmap tells a rich story:
- GDP and vaccination rate are positively correlated — wealthier countries tend to have higher vaccination rates
- GDP and infant mortality are negatively correlated — wealthier countries tend to have lower infant mortality
- Vaccination rate and life expectancy are positively correlated — but does vaccination cause longer life, or do they both reflect underlying wealth and healthcare access?
That last question is the heart of this chapter.
24.3 The Big Idea: Why Correlation Does Not Equal Causation
Let's say you've computed a correlation between two variables and it's strong, positive, and statistically significant. You're tempted to conclude that one causes the other. Here's why you shouldn't.
The Three Possible Explanations for a Correlation
When X and Y are correlated, there are at least three possible explanations:
1. X causes Y Maybe GDP really does cause higher vaccination rates — richer countries can afford better healthcare infrastructure, more vaccine supplies, and wider distribution networks.
2. Y causes X (reverse causation) Maybe higher vaccination rates cause higher GDP — healthier populations are more productive, miss fewer work days, and generate more economic output.
3. Something else causes both (confounding) Maybe there's a third variable — say, institutional quality or governance — that independently causes both higher GDP AND higher vaccination rates. Countries with effective governments tend to have both stronger economies and better public health systems.
Explanation 1: X → Y (X causes Y)
Explanation 2: X ← Y (Y causes X)
Explanation 3: X ← Z → Y (Z confounds the relationship)
A correlation coefficient alone cannot tell you which explanation is correct. It just tells you the variables move together. The why requires something more than a number.
The Confounding Variable: The Hidden Third Wheel
A confounding variable (also called a lurking variable) is a variable that influences both X and Y, creating an apparent relationship between them even if neither causes the other.
The classic example:
Ice cream sales and drowning deaths are positively correlated.
Does ice cream cause drowning? Does drowning cause ice cream sales? Of course not. The confounding variable is temperature (or, more broadly, summer). Hot weather causes people to both buy ice cream AND swim more, independently. The correlation between ice cream and drowning is real but spurious — it reflects their shared cause, not a direct link between them.
# Simulating a confounding variable
np.random.seed(42)
# The REAL causal structure:
# Temperature → Ice cream sales
# Temperature → Swimming → Drowning risk
n_days = 365
temperature = 50 + 30 * np.sin(2 * np.pi * np.arange(n_days) / 365) + \
np.random.normal(0, 8, n_days)
ice_cream_sales = 2 * temperature + np.random.normal(0, 30, n_days) + 100
ice_cream_sales = np.clip(ice_cream_sales, 50, 500)
drowning_incidents = 0.05 * temperature + np.random.normal(0, 1.5, n_days) - 1
drowning_incidents = np.clip(drowning_incidents, 0, 15).astype(int)
# The spurious correlation
r_spurious, p_spurious = stats.pearsonr(ice_cream_sales, drowning_incidents)
fig, axes = plt.subplots(1, 3, figsize=(16, 4.5))
# Ice cream vs drowning (spurious!)
axes[0].scatter(ice_cream_sales, drowning_incidents, alpha=0.3,
color='steelblue', s=20)
axes[0].set_xlabel('Ice cream sales ($)')
axes[0].set_ylabel('Drowning incidents')
axes[0].set_title(f'Spurious Correlation\nr = {r_spurious:.3f} (p < 0.001)',
fontsize=11)
# Temperature vs ice cream (real cause)
r1, _ = stats.pearsonr(temperature, ice_cream_sales)
axes[1].scatter(temperature, ice_cream_sales, alpha=0.3, color='#e74c3c', s=20)
axes[1].set_xlabel('Temperature (°F)')
axes[1].set_ylabel('Ice cream sales ($)')
axes[1].set_title(f'Real Cause #1\nr = {r1:.3f}', fontsize=11)
# Temperature vs drowning (real cause)
r2, _ = stats.pearsonr(temperature, drowning_incidents)
axes[2].scatter(temperature, drowning_incidents, alpha=0.3, color='#2ecc71', s=20)
axes[2].set_xlabel('Temperature (°F)')
axes[2].set_ylabel('Drowning incidents')
axes[2].set_title(f'Real Cause #2\nr = {r2:.3f}', fontsize=11)
plt.suptitle('Confounding: Temperature Drives Both Variables', fontsize=14,
fontweight='bold', y=1.05)
plt.tight_layout()
plt.savefig('confounding_ice_cream.png', dpi=150, bbox_inches='tight')
plt.show()
Spurious Correlations: When the Pattern Is Pure Coincidence
Sometimes two variables are correlated for no reason at all — it's pure coincidence. With enough variables and enough time, you'll find all sorts of statistically significant correlations that are completely meaningless.
A well-known website (created by Tyler Vigen) catalogs these absurd correlations:
- The divorce rate in Maine correlates with the per-capita consumption of margarine (r = 0.99)
- The number of people who drowned by falling into a pool correlates with the number of films Nicolas Cage appeared in (r = 0.67)
- US spending on science, space, and technology correlates with suicides by hanging (r = 0.99)
These are real correlations computed from real data. None of them reflect a causal relationship. They are spurious correlations — statistical artifacts of looking at enough variables over enough time.
The lesson: statistical significance does not establish meaningfulness. A strong, significant correlation can be completely meaningless if there's no plausible mechanism connecting the variables.
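You can manufacture spurious correlations on demand. In this sketch, 30 variables of pure noise are tested pairwise; at p < 0.05, roughly 5% of the 435 tests come out "significant" by chance alone (the exact count depends on the seed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_vars, n_obs = 30, 50
noise = rng.normal(size=(n_vars, n_obs))  # 30 completely unrelated variables

significant = 0
total = 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        _, p = stats.pearsonr(noise[i], noise[j])
        total += 1
        if p < 0.05:
            significant += 1

print(f"{significant} of {total} pure-noise pairs are 'significant' at p < 0.05")
# With 435 independent-ish tests, we expect about 22 false positives by chance
```

This is exactly the mechanism behind the Vigen examples: test enough pairs and "significant" correlations fall out of thin air.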
24.4 Simpson's Paradox: When the Whole Contradicts the Parts
Simpson's paradox is one of the most mind-bending phenomena in statistics. It occurs when a trend that appears in several different groups reverses when the groups are combined.
The Classic Example: UC Berkeley Admissions
In 1973, the University of California, Berkeley was sued for gender bias in graduate admissions. The overall numbers seemed damning: 44% of male applicants were admitted, but only 35% of female applicants were admitted. The difference was statistically significant.
But when researchers examined admissions by individual department, a surprising pattern emerged: in most departments, female applicants were admitted at equal or higher rates than male applicants. The paradox was explained by the fact that women disproportionately applied to more competitive departments (like English) with lower overall admission rates, while men disproportionately applied to less competitive departments (like Engineering) with higher overall admission rates.
Let's simulate a simplified version:
# Simpson's Paradox: Vaccination rates and income
# (Exact counts rather than random draws, so the paradox is guaranteed)
data_simpsons = pd.DataFrame({
    'Region': ['Urban']*100 + ['Rural']*100,
    'Income': (['High']*20 + ['Low']*80 +   # Urban: mostly low-income
               ['High']*80 + ['Low']*20),   # Rural: mostly high-income
    'Vaccinated': (
        [1]*18 + [0]*2 +    # Urban, High income: 18/20 = 90% vaccinated
        [1]*68 + [0]*12 +   # Urban, Low income: 68/80 = 85% vaccinated
        [1]*56 + [0]*24 +   # Rural, High income: 56/80 = 70% vaccinated
        [1]*11 + [0]*9      # Rural, Low income: 11/20 = 55% vaccinated
    )
})
# Overall rates by income
overall = data_simpsons.groupby('Income')['Vaccinated'].mean()
print("=== OVERALL (combined) ===")
print(f"High income vaccination rate: {overall['High']:.1%}")
print(f"Low income vaccination rate: {overall['Low']:.1%}")
if overall['High'] < overall['Low']:
    print("→ Low income appears HIGHER overall!")
else:
    print("→ High income appears higher overall (as expected)")
# Rates by income WITHIN each region
print("\n=== BY REGION (disaggregated) ===")
by_region = data_simpsons.groupby(['Region', 'Income'])['Vaccinated'].mean()
for region in ['Urban', 'Rural']:
    high = by_region[(region, 'High')]
    low = by_region[(region, 'Low')]
    print(f"{region}: High income = {high:.1%}, Low income = {low:.1%} "
          f"→ High is {'higher' if high > low else 'lower'}")
# Counts to show the structural explanation
print("\n=== THE EXPLANATION: Where each group lives ===")
counts = data_simpsons.groupby(['Region', 'Income']).size()
print(counts)
print("\nLow-income people are concentrated in Urban areas (high vax rate).")
print("High-income people are concentrated in Rural areas (low vax rate).")
print("Region is the confounder!")
Within each region, the high-income group is more vaccinated: 90% vs 85% in urban areas, 70% vs 55% in rural areas. Yet overall, the low-income group looks more vaccinated: 79% versus 74%. The paradox arises because the confounding variable (region) has different distributions in the two income groups. In this hypothetical, low-income people are concentrated in urban areas (which have high vaccination rates for everyone), while high-income people are concentrated in rural areas (which have lower rates for everyone). The aggregated data mixes the within-group relationship with the between-group distribution difference — and the mixture reverses the ranking.
Visualizing Simpson's Paradox
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
# Overall view (misleading)
overall_rates = data_simpsons.groupby('Income')['Vaccinated'].mean()
axes[0].bar(['High Income', 'Low Income'],
            [overall_rates['High'], overall_rates['Low']],
            color=['#3498db', '#e74c3c'], alpha=0.7, edgecolor='white')
axes[0].set_ylabel('Vaccination Rate', fontsize=12)
axes[0].set_title('Overall: Low Income Higher\n(But this is misleading!)',
                  fontsize=12)
axes[0].set_ylim(0, 1)
for i, v in enumerate([overall_rates['High'], overall_rates['Low']]):
    axes[0].text(i, v + 0.02, f'{v:.0%}', ha='center', fontsize=14,
                 fontweight='bold')
# Disaggregated view (correct)
x = np.arange(2)
width = 0.35
for region, offset, color in [('Urban', -width/2, '#2ecc71'),
                              ('Rural', width/2, '#e67e22')]:
    rates = [by_region[(region, 'High')], by_region[(region, 'Low')]]
    axes[1].bar(x + offset, rates, width, label=region, color=color,
                alpha=0.7, edgecolor='white')
    for i, v in enumerate(rates):
        axes[1].text(i + offset, v + 0.02, f'{v:.0%}', ha='center',
                     fontsize=11, fontweight='bold')
axes[1].set_xticks(x)
axes[1].set_xticklabels(['High Income', 'Low Income'])
axes[1].set_ylabel('Vaccination Rate', fontsize=12)
axes[1].set_title('By Region: High Income Higher WITHIN Each Region',
                  fontsize=12)
axes[1].set_ylim(0, 1)
axes[1].legend(fontsize=11)
plt.suptitle("Simpson's Paradox in Action", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('simpsons_paradox.png', dpi=150, bbox_inches='tight')
plt.show()
When Does Simpson's Paradox Matter?
Simpson's paradox isn't just a statistical curiosity. It has real-world consequences:
- Medical treatment: A treatment can appear worse overall but be better for every subgroup if sicker patients are more likely to receive it
- Hiring discrimination: A company can appear to hire a lower proportion of minority candidates overall while hiring a higher proportion within every department
- Education policy: A school district can appear to have declining test scores while every school improves, if student populations shift between schools
The lesson: always consider whether your analysis should be disaggregated by a potential confounding variable. Aggregation can create, eliminate, or reverse relationships.
24.5 How to Think About Causation: A Framework
If correlation doesn't prove causation, what does? Let's develop a framework for evaluating causal claims.
The Hierarchy of Evidence
Not all study designs are equal when it comes to establishing causation. Here's a rough hierarchy, from weakest to strongest:
1. Anecdote / Case Report "My uncle drank green tea every day and lived to 95." → Tells you nothing about causation. Sample size of one, no comparison group.
2. Cross-Sectional Study (Correlation) "Countries with higher GDP have higher vaccination rates." → Shows association but can't establish direction or rule out confounders.
3. Longitudinal / Cohort Study "We followed 10,000 people for 20 years. Those who exercised regularly had lower heart disease rates." → Better because it tracks changes over time, but participants aren't randomly assigned to exercise, so confounders may still explain the difference.
4. Natural Experiment "A law change affected some states but not others. We compared outcomes." → The "treatment" (the law) wasn't randomly assigned, but it was plausibly independent of other factors, creating a natural comparison group.
5. Randomized Controlled Trial (RCT) "We randomly assigned 1,000 patients to the drug or a placebo." → The gold standard. Randomization ensures that all confounders — measured and unmeasured — are balanced between groups. Any difference in outcomes can be attributed to the treatment.
Why RCTs Are the Gold Standard
The power of randomization is that it handles all confounders — even ones you haven't thought of.
In an observational study, you can try to control for known confounders (age, income, education), but you can never be sure you've caught them all. There might be a confounding variable you haven't measured or haven't thought of.
In an RCT, randomization makes the treatment and control groups approximately equal on every characteristic — both the ones you know about and the ones you don't. This is why RCTs are required for drug approval and are considered the strongest form of evidence for causal claims.
# Demonstrating why randomization works
np.random.seed(42)
n = 500
# In an OBSERVATIONAL study, healthier people choose to exercise
# (Selection bias: health-conscious people both exercise AND eat well)
health_consciousness = np.random.normal(50, 15, n)
exercises = health_consciousness > 50 # More health-conscious people exercise
diet_quality = health_consciousness + np.random.normal(0, 10, n) # Also eat better
# Outcome depends on exercise AND diet (but we're measuring exercise)
health_outcome = (5 * exercises.astype(float) +
0.3 * diet_quality +
np.random.normal(0, 5, n))
# Observational analysis: overestimates the effect of exercise
exercisers = health_outcome[exercises]
non_exercisers = health_outcome[~exercises]
observed_diff = exercisers.mean() - non_exercisers.mean()
# In an RCT, exercise is RANDOMLY assigned
rct_exercises = np.random.choice([True, False], n, p=[0.5, 0.5])
rct_outcome = (5 * rct_exercises.astype(float) +
0.3 * diet_quality +
np.random.normal(0, 5, n))
rct_exercisers = rct_outcome[rct_exercises]
rct_non_exercisers = rct_outcome[~rct_exercises]
rct_diff = rct_exercisers.mean() - rct_non_exercisers.mean()
print(f"TRUE causal effect of exercise: 5.0 units")
print(f"Observational estimate: {observed_diff:.2f} units (biased upward)")
print(f"RCT estimate: {rct_diff:.2f} units (close to truth)")
print(f"\nThe observational study overestimates because health-conscious")
print(f"people BOTH exercise AND have better diets. The RCT eliminates")
print(f"this confound through random assignment.")
When RCTs Are Impossible
For many important questions, RCTs are impossible or unethical:
- You can't randomly assign countries to have different GDPs
- You can't randomly assign people to smoke or not smoke for 30 years
- You can't randomly assign children to grow up in poverty or wealth
For these questions, we rely on observational studies with careful statistical methods to control for confounders: regression, matching, instrumental variables, difference-in-differences, and regression discontinuity designs. These methods are powerful but require strong assumptions, and they can never completely eliminate the possibility of unmeasured confounders.
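To give one concrete taste of that toolbox, regression adjustment controls for a measured confounder by including it as a predictor. The sketch below (simulated data, fit with `np.linalg.lstsq`; all variable names and coefficients are illustrative) shows the naive comparison inflated by a wealth confounder, while the adjusted coefficient lands near the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulated observational data: wealth confounds both treatment and outcome
wealth = rng.normal(0, 1, n)
treated = (wealth + rng.normal(0, 1, n) > 0).astype(float)   # wealthier people get treated
outcome = 2.0 * treated + 3.0 * wealth + rng.normal(0, 1, n)  # true effect = 2.0

# Naive comparison: biased upward, because treated people are also wealthier
naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# Regression adjustment: intercept + treatment + wealth as predictors
X = np.column_stack([np.ones(n), treated, wealth])
coefs, *_ = np.linalg.lstsq(X, outcome, rcond=None)

print(f"Naive difference:     {naive:.2f}")
print(f"Adjusted coefficient: {coefs[1]:.2f}  (true effect: 2.00)")
```

The catch, as the paragraph above notes: this only works for confounders you measured. Randomization needs no such list.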
Directed Acyclic Graphs (DAGs): Drawing Causal Stories
A directed acyclic graph (DAG) is a visual tool for representing causal relationships. Arrows indicate the direction of causation. They help you identify confounders and think clearly about causal structure.
GDP ──────────────────────→ Vaccination Rate
 ↑                                ↑
 │                                │
 └────── Institutional ───────────┘
            Quality
         (CONFOUNDER)
In this DAG:
- GDP directly affects vaccination rate (more money for healthcare)
- Institutional quality affects both GDP and vaccination rate
- If you just correlate GDP and vaccination rate, you get the total association (direct + confounded)
- To isolate the causal effect of GDP on vaccination, you'd need to control for institutional quality
We won't go deep into DAGs in this course (that's a topic for advanced causal inference), but the key idea is: draw the causal story before you analyze the data. It forces you to think about what's causing what and where the confounders might be.
24.6 A Practical Checklist for Evaluating Causal Claims
When you encounter a claim that X causes Y — in a news article, a research paper, or your own analysis — run through this checklist:
1. Is there a correlation?
If X and Y aren't even correlated, there's no relationship to explain. (But absence of correlation doesn't prove absence of causation — the relationship might be nonlinear or confounded.)
2. Could the relationship be spurious?
Is it plausible that this is a coincidence? The more variables you test, the more likely you are to find spurious correlations.
3. Is there a plausible mechanism?
Can you tell a story for how X would cause Y? "GDP causes higher vaccination rates because wealthier countries can afford better healthcare infrastructure" is plausible. "Ice cream causes drowning" is not.
4. Could there be a confounding variable?
Is there a third variable that could explain the association? This is the most important question. Always ask: "What else could be causing both X and Y?"
5. Could the causation be reversed?
Could Y cause X instead of X causing Y? Higher vaccination rates might contribute to economic growth (healthier workforce), not just the other way around.
6. Is the evidence from an RCT or an observational study?
RCTs provide much stronger evidence for causation than observational studies. If the claim is based on an observational study, treat it as suggestive, not conclusive.
7. Has the finding been replicated?
A single study — even an RCT — could be a fluke. Replication by independent researchers strengthens the causal case enormously.
8. Is there a dose-response relationship?
If more X leads to more Y (more cigarettes → more cancer risk), that strengthens the causal argument compared to a simple "present vs. absent" comparison.
24.7 Progressive Project: GDP, Healthcare Spending, and Vaccination Rates
Let's apply everything to the progressive project. Our question: What is the relationship between GDP, healthcare spending, and vaccination rates? And can we disentangle correlation from causation?
np.random.seed(42)
# Create a realistic dataset
n = 150
# GDP per capita (log-normal distribution)
log_gdp = np.random.normal(9, 1.3, n)
gdp = np.exp(log_gdp) / 1000 # In thousands of dollars
# Institutional quality (correlated with GDP — it's a confounder)
institutional_quality = 0.3 * log_gdp + np.random.normal(0, 0.5, n)
# Healthcare spending (caused by GDP and institutional quality)
health_spending = (0.5 * np.log(gdp + 1) +
0.3 * institutional_quality +
np.random.normal(0, 1, n))
health_spending = np.clip(health_spending, 0.5, 15)
# Vaccination rate (caused by health spending, institutional quality, and GDP)
vax_rate = (30 +
8 * np.log(gdp + 1) +
5 * health_spending +
10 * institutional_quality +
np.random.normal(0, 8, n))
vax_rate = np.clip(vax_rate, 10, 99)
project_data = pd.DataFrame({
'GDP per capita ($K)': gdp,
'Health spending (% GDP)': health_spending,
'Institutional quality': institutional_quality,
'Vaccination rate (%)': vax_rate
})
# Correlation matrix
print("Correlation Matrix:")
print(project_data.corr().round(3))
Scatter Plots: Visual Exploration
import matplotlib.pyplot as plt
from scipy import stats

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# GDP vs Vaccination
r1, p1 = stats.pearsonr(project_data['GDP per capita ($K)'],
                        project_data['Vaccination rate (%)'])
axes[0].scatter(project_data['GDP per capita ($K)'],
                project_data['Vaccination rate (%)'],
                alpha=0.5, color='steelblue', edgecolor='white')
axes[0].set_xlabel('GDP per capita ($K)')
axes[0].set_ylabel('Vaccination rate (%)')
axes[0].set_title(f'GDP vs Vaccination\nr = {r1:.3f}')

# Health spending vs Vaccination
r2, p2 = stats.pearsonr(project_data['Health spending (% GDP)'],
                        project_data['Vaccination rate (%)'])
axes[1].scatter(project_data['Health spending (% GDP)'],
                project_data['Vaccination rate (%)'],
                alpha=0.5, color='#e74c3c', edgecolor='white')
axes[1].set_xlabel('Health spending (% GDP)')
axes[1].set_ylabel('Vaccination rate (%)')
axes[1].set_title(f'Health Spending vs Vaccination\nr = {r2:.3f}')

# GDP vs Health spending
r3, p3 = stats.pearsonr(project_data['GDP per capita ($K)'],
                        project_data['Health spending (% GDP)'])
axes[2].scatter(project_data['GDP per capita ($K)'],
                project_data['Health spending (% GDP)'],
                alpha=0.5, color='#2ecc71', edgecolor='white')
axes[2].set_xlabel('GDP per capita ($K)')
axes[2].set_ylabel('Health spending (% GDP)')
axes[2].set_title(f'GDP vs Health Spending\nr = {r3:.3f}')

plt.suptitle('Relationships Between Economic and Health Indicators',
             fontsize=14, fontweight='bold', y=1.03)
plt.tight_layout()
plt.savefig('project_correlations.png', dpi=150, bbox_inches='tight')
plt.show()
Identifying Confounders
All three variables are positively correlated with each other. But what's causing what?
# The causal diagram (as we designed the data):
#
#   Institutional Quality ──→ GDP
#   Institutional Quality ──→ Health Spending
#   Institutional Quality ──→ Vaccination Rate
#   GDP ──→ Health Spending ──→ Vaccination Rate
#
# So institutional quality confounds ALL the pairwise relationships.
# Test: what happens when we control for institutional quality?
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

# Partial correlation: GDP vs Vaccination, controlling for institutional quality
# (simplified approach using residuals)

# Regress each variable on institutional quality and keep the residuals
iq = project_data['Institutional quality'].values.reshape(-1, 1)
gdp_values = project_data['GDP per capita ($K)'].values
vax_values = project_data['Vaccination rate (%)'].values

gdp_residuals = gdp_values - LinearRegression().fit(iq, gdp_values).predict(iq)
vax_residuals = vax_values - LinearRegression().fit(iq, vax_values).predict(iq)

r_partial, p_partial = pearsonr(gdp_residuals, vax_residuals)

print(f"Raw correlation (GDP vs Vaccination): {r1:.3f}")
print(f"Partial correlation (controlling for institutional quality): "
      f"{r_partial:.3f}")
print("\nThe correlation drops when we control for the confounder!")
print("Some of the GDP-vaccination relationship was due to institutional")
print("quality driving both variables.")
Writing the Analysis
Here's how you'd write up this analysis honestly:
Results: GDP per capita and vaccination rates are strongly positively correlated (r = [value], p < 0.001). However, this correlation should not be interpreted as evidence that GDP directly causes higher vaccination rates. Several confounding variables — including institutional quality, governance, and healthcare infrastructure — are positively correlated with both GDP and vaccination rates. When we control for institutional quality using partial correlation, the GDP-vaccination relationship weakens (r_partial = [value]), suggesting that part of the observed association is explained by this confounder. Multiple causal pathways are plausible: GDP may affect vaccination rates directly (through funding for healthcare) and indirectly (through institutional development), and vaccination rates may also affect GDP (through workforce health). Establishing the causal direction would require longitudinal data or a natural experiment, not cross-sectional correlations.
This is the kind of careful, honest analysis that separates good data science from misleading data science.
24.8 A Gallery of Confounders: Real-World Examples
To train your confounder-detection instinct, let's walk through several real-world examples:
Example 1: Education and Income
Correlation: People with more education earn more money. Tempting causal claim: Education causes higher income. Potential confounders: Family wealth (rich families can afford education AND provide career connections), cognitive ability (may independently affect both educational attainment and earning potential), geographic location (urban areas offer both more education and higher-paying jobs). Is there a causal effect? Probably yes, but it's smaller than the raw correlation suggests. Studies using natural experiments (like changes in compulsory schooling laws) find a real but more modest causal effect.
Example 2: Hospital Quality
Correlation: Some hospitals have higher mortality rates than others. Tempting causal claim: Those hospitals provide worse care. Potential confounder: Patient severity. The best hospitals attract the sickest patients (who are referred from smaller hospitals). They may have higher mortality rates despite providing better care, simply because their patients are sicker to begin with. Lesson: Always ask who's in the sample.
Example 3: Organic Food and Health
Correlation: People who eat organic food have better health outcomes. Tempting causal claim: Organic food is healthier. Potential confounders: People who buy organic food tend to be wealthier, more health-conscious, more likely to exercise, less likely to smoke, and more likely to have access to healthcare. All of these independently improve health outcomes. Lesson: Lifestyle choices cluster together. You're not just comparing organic vs. conventional food — you're comparing entire lifestyles.
Example 4: The Firefighter Paradox
Correlation: More firefighters at a fire are associated with more damage. Tempting causal claim: Firefighters cause damage! Reality: Bigger fires cause both more damage AND more firefighters to be called. The confounding variable is fire size.
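The firefighter paradox is easy to simulate. In this sketch (all numbers invented), fire size causes both damage and firefighter count, so the two effects correlate strongly even though neither causes the other:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300

# Confounder: fire size (arbitrary severity scale)
fire_size = rng.uniform(1, 10, n)

# Fire size causes both variables; firefighters do NOT cause damage here
firefighters = 5 * fire_size + rng.normal(0, 3, n)
damage = 20 * fire_size + rng.normal(0, 15, n)

r = np.corrcoef(firefighters, damage)[0, 1]
print(f"Correlation (firefighters vs. damage): r = {r:.3f}")
# Strongly positive, even though by construction firefighters
# have no causal effect on damage at all.
```

This is the confounding pattern in its purest form: a strong correlation manufactured entirely by a common cause.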
24.9 Connecting the Threads
Let's see where this chapter fits in the bigger picture.
Looking back: - Chapter 19 gave us the tools to describe individual variables (means, standard deviations) - Chapters 22-23 gave us tools to estimate parameters and test hypotheses about one or two variables - This chapter extends the toolkit to relationships between variables and introduces the critical distinction between association and causation
Looking forward: - Chapter 25 introduces formal modeling — using one variable to predict another - Chapters 26-28 build regression and classification models that quantify relationships while controlling for confounders - Chapter 32 revisits the ethical implications of causal claims
The progressive project: You've now explored the correlations among GDP, healthcare spending, institutional quality, and vaccination rates. You've identified confounders and acknowledged the limits of correlational analysis. In Chapter 25, you'll start building predictive models that formalize these relationships — but you'll carry with you the lesson from this chapter: a model that predicts well does not necessarily describe a causal mechanism.
The threshold concept: "Correlation does not equal causation" is one of those ideas that sounds simple but takes years to fully internalize. You'll catch yourself — even after reading this chapter — wanting to make causal claims from correlational data. That's normal. The instinct to see causation in correlation is deeply human. The skill is learning to pause, ask "what else could explain this?", and be honest when you don't know.
Key Insight: Every time you find a correlation, ask three questions: (1) Could this be a coincidence? (2) Could something else be causing both variables? (3) Could the causation run in the opposite direction? If you can't rule out these alternatives, your correlation is interesting but not conclusive.
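Question (1), coincidence, deserves its own demonstration. If you check enough variable pairs, some will correlate strongly by pure chance. A minimal sketch (the counts of variables and observations are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)

# 50 completely independent random "variables", 20 observations each
data = rng.normal(size=(50, 20))

# Check every pair and count strong-looking correlations
# that arose by pure chance
corrs = np.corrcoef(data)
upper = corrs[np.triu_indices(50, k=1)]
print(f"Pairs checked: {len(upper)}")
print(f"Pairs with |r| > 0.5: {(np.abs(upper) > 0.5).sum()}")
```

With 1,225 pairs and only 20 observations each, a handful of "strong" correlations typically appear despite every variable being independent noise. This is why a surprising correlation found by scanning many variables should be treated with extra suspicion.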
24.10 Chapter Summary
This chapter explored the relationship between variables and the critical distinction between correlation and causation.
- Pearson's r measures the strength and direction of linear relationships. It ranges from -1 to +1 and is the most common correlation measure.
- Spearman's ρ measures monotonic (but not necessarily linear) relationships using ranks. It's more robust to outliers and nonlinearity.
- Correlation matrices and heatmaps let you explore many relationships at once. Always visualize with scatter plots — correlation coefficients can hide important patterns (Anscombe's quartet).
- Correlation does not equal causation. When X and Y are correlated, possible explanations include: X causes Y, Y causes X, or a confounding variable causes both.
- Confounding variables are the usual suspects. When you see a correlation, your first question should be: "What third variable might explain this?"
- Simpson's paradox shows that aggregation can reverse apparent relationships. Always consider whether your analysis should be disaggregated.
- Randomized controlled trials are the gold standard for establishing causation because randomization balances all confounders. When RCTs are impossible, observational studies with careful methodology can provide suggestive but not definitive evidence.
- Directed acyclic graphs help you think about causal structure by drawing the arrows between variables.
- The causal evaluation checklist: Ask about correlation, mechanism, confounders, reverse causation, study design, replication, and dose-response before accepting a causal claim.
Next up: Chapter 25, where we take the leap from describing relationships to building predictive models. You've been measuring correlations; now you'll learn to use those relationships to make predictions about new data.
After this chapter, you'll never read a headline the same way again. "Studies show X is linked to Y" will trigger an automatic response in your brain: "Linked, but caused? What's the confounder?" That reflex is one of the most valuable skills data science can teach you.