Key Takeaways: Correlation, Causation, and the Danger of Confusing the Two
This is your reference card for Chapter 24. It contains the threshold concept that changes how you read every study and headline for the rest of your career.
The Threshold Concept
Correlation does not equal causation.
This is more than a slogan. It means:
- Finding that X and Y move together tells you nothing about whether X causes Y, Y causes X, or something else causes both
- Strong correlations can be completely spurious (chocolate and Nobel Prizes, Nicolas Cage movies and drowning)
- Even "obvious" causal interpretations can be wrong (hospitals with higher mortality might actually be better)
- The instinct to see causation in correlation is deeply human — and must be actively resisted
Key Concepts
- Pearson's r measures linear relationships (-1 to +1). Use it for continuous variables with approximately linear relationships.
- Spearman's rho measures monotonic relationships. Use it when the relationship is nonlinear but consistently increasing/decreasing, or when the data has outliers.
- A confounding variable causes both X and Y, creating an apparent relationship between them even if neither directly causes the other. Confounders are the number-one threat to causal claims.
- Simpson's paradox occurs when a trend in disaggregated data reverses in aggregated data. Always consider whether results should be broken down by potential confounders.
- Spurious correlations are statistically significant associations with no causal mechanism. With enough variables, you'll find them everywhere.
- Randomized controlled trials are the gold standard for causation because randomization balances all confounders (known and unknown).
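The "you'll find them everywhere" point can be demonstrated directly. In this illustrative simulation (all data is random; nothing here is a real dataset), forty completely independent variables still produce dozens of "statistically significant" pairwise correlations by chance alone:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 40 completely independent random variables, 200 observations each
data = rng.normal(size=(200, 40))

# Test every pair. With no real relationships anywhere, roughly 5% of
# the p-values still fall below 0.05 purely by chance.
hits = 0
pairs = 0
for i in range(40):
    for j in range(i + 1, 40):
        _, p = stats.pearsonr(data[:, i], data[:, j])
        pairs += 1
        hits += p < 0.05

print(hits, "of", pairs, "pairs are 'significant' by chance alone")
```

With 780 pairs tested, you should expect on the order of 40 spurious hits. This is exactly the mechanism behind the chocolate-and-Nobel-Prizes genre of correlation.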
Four Explanations for Every Correlation
When X and Y are correlated, always consider:
1. X → Y: X causes Y
2. X ← Y: Y causes X (reverse causation)
3. X ← Z → Y: a confounder Z causes both (the most common non-causal explanation)
4. Coincidence: no real relationship at all (a spurious correlation from testing many variables)
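Explanation 3 is easy to simulate. In this sketch (entirely made-up data), a confounder Z drives both X and Y, neither of which influences the other; the correlation between X and Y is strong, yet vanishes once Z is accounted for:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical confounder Z (think: temperature) drives both
# X (ice cream sales) and Y (drownings); X and Y never touch.
z = rng.normal(size=2000)
x = 2 * z + rng.normal(size=2000)
y = 3 * z + rng.normal(size=2000)

# X and Y look strongly related...
r_xy, _ = stats.pearsonr(x, y)
print(round(r_xy, 2))

# ...but controlling for Z (correlating the residuals after
# regressing each variable on Z) removes the association.
x_resid = x - np.polyval(np.polyfit(z, x, 1), z)
y_resid = y - np.polyval(np.polyfit(z, y, 1), z)
r_partial, _ = stats.pearsonr(x_resid, y_resid)
print(round(r_partial, 2))
```

The residual trick used here is a simple form of partial correlation: whatever relationship survives after Z's influence is subtracted out is the part Z cannot explain.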
Correlation Coefficient Quick Reference
Pearson's r:
| |r| | Interpretation |
|---|---|
| 0.00 - 0.09 | Negligible |
| 0.10 - 0.29 | Weak |
| 0.30 - 0.49 | Moderate |
| 0.50 - 0.69 | Strong |
| 0.70 - 0.89 | Very strong |
| 0.90 - 1.00 | Nearly perfect |
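If you want the table as code, a small helper (illustrative; the name `interpret_r` and the cutoffs simply mirror the table above) maps any coefficient to its verbal label:

```python
def interpret_r(r):
    """Map |r| to the verbal labels from the interpretation table."""
    cutoffs = [
        (0.90, "Nearly perfect"),
        (0.70, "Very strong"),
        (0.50, "Strong"),
        (0.30, "Moderate"),
        (0.10, "Weak"),
        (0.00, "Negligible"),
    ]
    for lower_bound, label in cutoffs:
        if abs(r) >= lower_bound:
            return label

print(interpret_r(-0.62))  # Strong (sign does not affect strength)
```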
Python functions:
```python
from scipy import stats

# x, y: array-likes of equal length; df: a pandas DataFrame

# Pearson (linear)
r, p = stats.pearsonr(x, y)

# Spearman (monotonic)
rho, p = stats.spearmanr(x, y)

# Correlation matrix (pandas)
df.corr()                    # Pearson (default)
df.corr(method='spearman')   # Spearman
```
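A quick check of the difference between the two coefficients, using illustrative data where y grows monotonically but nonlinearly with x:

```python
import numpy as np
from scipy import stats

# Hypothetical monotonic-but-nonlinear relationship
x = np.linspace(0, 5, 100)
y = np.exp(x)

r, _ = stats.pearsonr(x, y)     # understates the strength: it assumes linearity
rho, _ = stats.spearmanr(x, y)  # 1.0: the relationship is perfectly monotonic
print(round(r, 2), round(rho, 2))
```

Spearman's rho sees a perfect relationship because the ranks of x and y agree exactly; Pearson's r is noticeably below 1 because the points do not fall on a straight line.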
When to Use Pearson vs. Spearman
| Situation | Use |
|---|---|
| Linear relationship, no outliers | Pearson |
| Nonlinear but monotonic relationship | Spearman |
| Data has extreme outliers | Spearman |
| Ordinal (ranked) data | Spearman |
| Not sure | Compute both; compare |
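The outlier row of this table is worth seeing in action. With illustrative data, a clean linear relationship plus a single wild point wrecks Pearson's r while Spearman's rho, which works on ranks, stays strongly positive:

```python
import numpy as np
from scipy import stats

# A perfect linear relationship on 20 points, plus one wild outlier
x = np.append(np.arange(20.0), 100.0)
y = np.append(2 * np.arange(20.0) + 1, 0.0)

r, _ = stats.pearsonr(x, y)      # dragged to near zero by the single outlier
rho, _ = stats.spearmanr(x, y)   # ranks cap the outlier's influence
print(round(r, 2), round(rho, 2))
```

One extreme point can dominate the sums of squares behind Pearson's r, but in rank space it is just one misplaced rank out of 21.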
The Hierarchy of Evidence for Causation
From weakest to strongest:
Anecdote / Case Report
↓
Cross-Sectional Observation (Correlation)
↓
Longitudinal / Cohort Study
↓
Natural Experiment
↓
Randomized Controlled Trial (RCT)
↓
Multiple Replicated RCTs
↓
Systematic Review / Meta-Analysis
Each level provides stronger evidence for causation than the level above it.
The Causal Evaluation Checklist
When someone claims X causes Y, ask:
- [ ] Is there a correlation? (If X and Y aren't even associated, the claim is unsupported)
- [ ] Could it be spurious? (How many variables were tested? Could this be a fluke?)
- [ ] Is there a plausible mechanism? (Can you explain HOW X would cause Y?)
- [ ] Could there be a confounder? (What third variable could explain both X and Y?)
- [ ] Could causation be reversed? (Could Y cause X instead?)
- [ ] What type of study is the evidence from? (RCT > observational > anecdote)
- [ ] Has the finding been replicated? (One study is not enough)
- [ ] Is there a dose-response relationship? (More X → more Y?)
The more boxes you check "yes" on, the stronger the causal case.
Classic Confounding Examples
| Correlation | Confounder | Explanation |
|---|---|---|
| Ice cream and drowning | Temperature | Hot weather causes both |
| Shoe size and reading ability | Age | Older children have bigger feet AND read better |
| Firefighters at a fire and fire damage | Fire severity | Bigger fires attract more fighters AND cause more damage |
| Organic food and health | Lifestyle/wealth | Health-conscious, wealthy people buy organic AND are healthier |
| Hospital ranking and mortality | Patient severity | Best hospitals take sickest patients |
Simpson's Paradox: A Visual Summary
WITHIN each group: Treatment A is better than Treatment B
(Group 1: A=90%, B=85%)
(Group 2: A=60%, B=55%)
OVERALL (combined): Treatment B appears better than Treatment A
(Because sicker patients received A more often)
→ The confounding variable is disease severity
→ The disaggregated data tells the true story
Rule: Always consider whether your analysis should be disaggregated by potential confounders. Aggregated results can be misleading.
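The scenario above can be reproduced in a few lines of pandas. The patient counts below are hypothetical, chosen to match the chapter's percentages; note how sicker (severe) patients received Treatment A far more often:

```python
import pandas as pd

# Hypothetical counts: within each group A beats B, yet B wins overall
data = pd.DataFrame({
    "group":     ["mild", "mild", "severe", "severe"],
    "treatment": ["A",    "B",    "A",      "B"],
    "patients":  [10,     100,    100,      20],
    "recovered": [9,      85,     60,       11],
})

# Disaggregated recovery rates: A wins in BOTH groups
by_group = data.set_index(["group", "treatment"])
rates = by_group["recovered"] / by_group["patients"]
print(rates)  # mild: A 0.90 > B 0.85; severe: A 0.60 > B 0.55

# Aggregated recovery rates: B appears better
overall = data.groupby("treatment")[["patients", "recovered"]].sum()
print(overall["recovered"] / overall["patients"])  # A 69/110, B 96/120
```

The reversal happens because the group sizes are lopsided: A's overall rate is dominated by the severe patients it disproportionately treated.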
What You Should Be Able to Do Now
- [ ] Compute Pearson's r and Spearman's ρ in Python
- [ ] Create scatter plots and correlation matrices with heatmaps
- [ ] Interpret correlation coefficients (strength, direction, limitations)
- [ ] Identify confounding variables in real-world correlations
- [ ] Recognize Simpson's paradox and explain when aggregation misleads
- [ ] Explain why RCTs are the gold standard for causation
- [ ] Evaluate causal claims using the 8-point checklist
- [ ] Distinguish between correlation, association, and causation in writing
- [ ] Catch yourself when tempted to make causal claims from correlational data
- [ ] Communicate the limitations of correlational analysis to non-technical audiences
If you checked every box, you've internalized one of the most important concepts in data science. You'll never read a headline the same way again.
The Sentence You Should Say Most Often
When someone points to a correlation and says "X causes Y," your response should be:
"That's an interesting association. But what else could explain it? Is there a confounder? Could the causation go the other way? And what kind of study is this based on?"
This isn't being pedantic. It's being scientific.
You're ready for Chapter 25, where we take the leap from describing relationships to predicting outcomes. You've been measuring correlations; now you'll learn to use those relationships to build models. But the lesson from this chapter will follow you: a model that predicts well is not necessarily a model that reveals causes.