Key Takeaways: Correlation, Causation, and the Danger of Confusing the Two
This is your reference card for Chapter 24. It contains the threshold concept that changes how you read every study and headline for the rest of your career.
The Threshold Concept
Correlation does not equal causation.
This is more than a slogan. It means:
- Finding that X and Y move together tells you nothing about whether X causes Y, Y causes X, or something else causes both
- Strong correlations can be completely spurious (chocolate and Nobel Prizes, Nicolas Cage movies and drowning)
- Even "obvious" causal interpretations can be wrong (hospitals with higher mortality might actually be better)
- The instinct to see causation in correlation is deeply human — and must be actively resisted
Key Concepts
- Pearson's r measures linear relationships (-1 to +1). Use it for continuous variables with approximately linear relationships.
- Spearman's rho measures monotonic relationships. Use it when the relationship is nonlinear but consistently increasing/decreasing, or when the data has outliers.
- A confounding variable causes both X and Y, creating an apparent relationship between them even if neither directly causes the other. Confounders are the number-one threat to causal claims.
- Simpson's paradox occurs when a trend in disaggregated data reverses in aggregated data. Always consider whether results should be broken down by potential confounders.
- Spurious correlations are statistically significant associations with no causal mechanism. With enough variables, you'll find them everywhere.
- Randomized controlled trials are the gold standard for causation because randomization balances all confounders (known and unknown).
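The "you'll find them everywhere" point can be demonstrated directly. In this illustrative simulation (all data is random; nothing here is a real dataset), forty completely independent variables still produce dozens of "statistically significant" pairwise correlations by chance alone:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 40 completely independent random variables, 200 observations each
data = rng.normal(size=(200, 40))

# Test every pair. With no real relationships anywhere, roughly 5% of
# the p-values still fall below 0.05 purely by chance.
hits = 0
pairs = 0
for i in range(40):
    for j in range(i + 1, 40):
        _, p = stats.pearsonr(data[:, i], data[:, j])
        pairs += 1
        hits += p < 0.05

print(hits, "of", pairs, "pairs are 'significant' by chance alone")
```

With 780 pairs tested, you should expect on the order of 40 spurious hits. This is exactly the mechanism behind the chocolate-and-Nobel-Prizes genre of correlation.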
Four Explanations for Every Correlation
When X and Y are correlated, always consider:
1. X → Y: X causes Y
2. X ← Y: Y causes X (reverse causation)
3. X ← Z → Y: a confounder Z causes both (the most common non-causal explanation)
4. Coincidence: no real relationship at all (a spurious correlation from testing many variables)
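Explanation 3 is easy to simulate. In this sketch (entirely made-up data), a confounder Z drives both X and Y, neither of which influences the other; the correlation between X and Y is strong, yet vanishes once Z is accounted for:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical confounder Z (think: temperature) drives both
# X (ice cream sales) and Y (drownings); X and Y never touch.
z = rng.normal(size=2000)
x = 2 * z + rng.normal(size=2000)
y = 3 * z + rng.normal(size=2000)

# X and Y look strongly related...
r_xy, _ = stats.pearsonr(x, y)
print(round(r_xy, 2))

# ...but controlling for Z (correlating the residuals after
# regressing each variable on Z) removes the association.
x_resid = x - np.polyval(np.polyfit(z, x, 1), z)
y_resid = y - np.polyval(np.polyfit(z, y, 1), z)
r_partial, _ = stats.pearsonr(x_resid, y_resid)
print(round(r_partial, 2))
```

The residual trick used here is a simple form of partial correlation: whatever relationship survives after Z's influence is subtracted out is the part Z cannot explain.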
Correlation Coefficient Quick Reference
Pearson's r:
| |r| | Interpretation |
|---|---|
| 0.00 - 0.09 | Negligible |
| 0.10 - 0.29 | Weak |
| 0.30 - 0.49 | Moderate |
| 0.50 - 0.69 | Strong |
| 0.70 - 0.89 | Very strong |
| 0.90 - 1.00 | Nearly perfect |
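If you want the table as code, a small helper (illustrative; the name `interpret_r` and the cutoffs simply mirror the table above) maps any coefficient to its verbal label:

```python
def interpret_r(r):
    """Map |r| to the verbal labels from the interpretation table."""
    cutoffs = [
        (0.90, "Nearly perfect"),
        (0.70, "Very strong"),
        (0.50, "Strong"),
        (0.30, "Moderate"),
        (0.10, "Weak"),
        (0.00, "Negligible"),
    ]
    for lower_bound, label in cutoffs:
        if abs(r) >= lower_bound:
            return label

print(interpret_r(-0.62))  # Strong (sign does not affect strength)
```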
Python functions:
```python
from scipy import stats

# x, y: array-likes of equal length; df: a pandas DataFrame

# Pearson (linear)
r, p = stats.pearsonr(x, y)

# Spearman (monotonic)
rho, p = stats.spearmanr(x, y)

# Correlation matrix (pandas)
df.corr()                    # Pearson (default)
df.corr(method='spearman')   # Spearman
```
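A quick check of the difference between the two coefficients, using illustrative data where y grows monotonically but nonlinearly with x:

```python
import numpy as np
from scipy import stats

# Hypothetical monotonic-but-nonlinear relationship
x = np.linspace(0, 5, 100)
y = np.exp(x)

r, _ = stats.pearsonr(x, y)     # understates the strength: it assumes linearity
rho, _ = stats.spearmanr(x, y)  # 1.0: the relationship is perfectly monotonic
print(round(r, 2), round(rho, 2))
```

Spearman's rho sees a perfect relationship because the ranks of x and y agree exactly; Pearson's r is noticeably below 1 because the points do not fall on a straight line.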
When to Use Pearson vs. Spearman
| Situation | Use |
|---|---|
| Linear relationship, no outliers | Pearson |
| Nonlinear but monotonic relationship | Spearman |
| Data has extreme outliers | Spearman |
| Ordinal (ranked) data | Spearman |
| Not sure | Compute both; compare |
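The outlier row of this table is worth seeing in action. With illustrative data, a clean linear relationship plus a single wild point wrecks Pearson's r while Spearman's rho, which works on ranks, stays strongly positive:

```python
import numpy as np
from scipy import stats

# A perfect linear relationship on 20 points, plus one wild outlier
x = np.append(np.arange(20.0), 100.0)
y = np.append(2 * np.arange(20.0) + 1, 0.0)

r, _ = stats.pearsonr(x, y)      # dragged to near zero by the single outlier
rho, _ = stats.spearmanr(x, y)   # ranks cap the outlier's influence
print(round(r, 2), round(rho, 2))
```

One extreme point can dominate the sums of squares behind Pearson's r, but in rank space it is just one misplaced rank out of 21.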
The Hierarchy of Evidence for Causation
From weakest to strongest:
Anecdote / Case Report
↓
Cross-Sectional Observation (Correlation)
↓
Longitudinal / Cohort Study
↓
Natural Experiment
↓
Randomized Controlled Trial (RCT)
↓
Multiple Replicated RCTs
↓
Systematic Review / Meta-Analysis
Each level provides stronger evidence for causation than the level above it.
The Causal Evaluation Checklist
When someone claims X causes Y, ask:
- [ ] Is there a correlation? (If X and Y aren't even associated, the claim is unsupported)
- [ ] Could it be spurious? (How many variables were tested? Could this be a fluke?)
- [ ] Is there a plausible mechanism? (Can you explain HOW X would cause Y?)
- [ ] Could there be a confounder? (What third variable could explain both X and Y?)
- [ ] Could causation be reversed? (Could Y cause X instead?)
- [ ] What type of study is the evidence from? (RCT > observational > anecdote)
- [ ] Has the finding been replicated? (One study is not enough)
- [ ] Is there a dose-response relationship? (More X → more Y?)
The more boxes you check "yes" on, the stronger the causal case.
Classic Confounding Examples
| Correlation | Confounder | Explanation |
|---|---|---|
| Ice cream and drowning | Temperature | Hot weather causes both |
| Shoe size and reading ability | Age | Older children have bigger feet AND read better |
| Firefighters at a fire and fire damage | Fire severity | Bigger fires attract more fighters AND cause more damage |
| Organic food and health | Lifestyle/wealth | Health-conscious, wealthy people buy organic AND are healthier |
| Hospital ranking and mortality | Patient severity | Best hospitals take sickest patients |
Simpson's Paradox: A Visual Summary
WITHIN each group: Treatment A is better than Treatment B
(Group 1: A=90%, B=85%)
(Group 2: A=60%, B=55%)
OVERALL (combined): Treatment B appears better than Treatment A
(Because sicker patients received A more often)
→ The confounding variable is disease severity
→ The disaggregated data tells the true story
Rule: Always consider whether your analysis should be disaggregated by potential confounders. Aggregated results can be misleading.
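The scenario above can be reproduced in a few lines of pandas. The patient counts below are hypothetical, chosen to match the chapter's percentages; note how sicker (severe) patients received Treatment A far more often:

```python
import pandas as pd

# Hypothetical counts: within each group A beats B, yet B wins overall
data = pd.DataFrame({
    "group":     ["mild", "mild", "severe", "severe"],
    "treatment": ["A",    "B",    "A",      "B"],
    "patients":  [10,     100,    100,      20],
    "recovered": [9,      85,     60,       11],
})

# Disaggregated recovery rates: A wins in BOTH groups
by_group = data.set_index(["group", "treatment"])
rates = by_group["recovered"] / by_group["patients"]
print(rates)  # mild: A 0.90 > B 0.85; severe: A 0.60 > B 0.55

# Aggregated recovery rates: B appears better
overall = data.groupby("treatment")[["patients", "recovered"]].sum()
print(overall["recovered"] / overall["patients"])  # A 69/110, B 96/120
```

The reversal happens because the group sizes are lopsided: A's overall rate is dominated by the severe patients it disproportionately treated.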
What You Should Be Able to Do Now
- [ ] Compute Pearson's r and Spearman's ρ in Python
- [ ] Create scatter plots and correlation matrices with heatmaps
- [ ] Interpret correlation coefficients (strength, direction, limitations)
- [ ] Identify confounding variables in real-world correlations
- [ ] Recognize Simpson's paradox and explain when aggregation misleads
- [ ] Explain why RCTs are the gold standard for causation
- [ ] Evaluate causal claims using the 8-point checklist
- [ ] Distinguish between correlation, association, and causation in writing
- [ ] Catch yourself when tempted to make causal claims from correlational data
- [ ] Communicate the limitations of correlational analysis to non-technical audiences
If you checked every box, you've internalized one of the most important concepts in data science. You'll never read a headline the same way again.
The Sentence You Should Say Most Often
When someone points to a correlation and says "X causes Y," your response should be:
"That's an interesting association. But what else could explain it? Is there a confounder? Could the causation go the other way? And what kind of study is this based on?"
This isn't being pedantic. It's being scientific.
You're ready for Chapter 25, where we take the leap from describing relationships to predicting outcomes. You've been measuring correlations; now you'll learn to use those relationships to build models. But the lesson from this chapter will follow you: a model that predicts well is not necessarily a model that reveals causes.