Chapter 24 Exercises: Correlation, Causation, and the Danger of Confusing the Two

Contributors to Introduction to Data Science

How to use these exercises: Part A emphasizes the conceptual distinction between correlation and causation — can you spot confounders and evaluate causal claims? Part B applies correlation computation and interpretation. Part C requires Python code. Part D pushes toward critical evaluation and deeper thinking about causality.

Difficulty key: ⭐ Foundational | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐⭐⭐ Extension

Part A: Conceptual Understanding ⭐

Exercise 24.1 — Interpreting correlation coefficients

For each Pearson r value below, describe the relationship in plain English and give a plausible real-world example:

r = +0.92
r = -0.75
r = +0.15
r = 0.00
r = -0.45

Guidance

1. **r = +0.92:** Very strong positive linear relationship. As one variable increases, the other almost always increases. Example: height and arm span in adults.
2. **r = -0.75:** Strong negative relationship. As one increases, the other tends to decrease. Example: outdoor temperature and home heating costs.
3. **r = +0.15:** Weak positive relationship. A slight tendency for both to increase together, but lots of scatter. Example: shoe size and vocabulary in children (both increase with age, the confounder).
4. **r = 0.00:** No linear relationship. The best-fit line is flat, so one variable gives no linear information about the other (a nonlinear relationship could still exist). Example: birth month and adult height.
5. **r = -0.45:** Moderate negative relationship. Example: hours of TV watched per day and physical fitness score.

Exercise 24.2 — Spotting confounders ⭐

For each correlation, identify at least one plausible confounding variable.

Cities with more police officers have higher crime rates.
Countries with more hospitals have higher death rates.
Students who eat breakfast score higher on exams.
People who own horses live longer.
Neighborhoods with more trees have lower crime rates.

Guidance

1. **Population size.** Larger cities have both more police and more crime. The confounder is city size. Reverse causation is also plausible: cities with more crime hire more police.
2. **Population size and age distribution.** Larger countries need more hospitals and naturally have more deaths. Also, countries with older populations have both more healthcare facilities and more deaths.
3. **Socioeconomic status.** Students from wealthier families are more likely to eat breakfast AND have resources (tutoring, stable home environment) that help them score higher. Family income is a confounder.
4. **Wealth.** Horses are expensive. Horse owners tend to be wealthy, and wealthy people have better access to healthcare, nutrition, and safer living conditions. Wealth is the confounder.
5. **Property values/income.** Wealthier neighborhoods have both more trees (better maintenance, more green space investment) and lower crime rates (more security, less economic desperation). Neighborhood wealth is the confounder.

Exercise 24.3 — Three explanations for every correlation ⭐

For the correlation "Countries with higher internet penetration have higher GDP," write out all three possible causal explanations:

Internet penetration causes higher GDP (and explain how)
Higher GDP causes higher internet penetration (and explain how)
A third variable causes both (identify the confounder and explain)

Then discuss: which explanation(s) do you think are most plausible, and why?

Guidance

1. **Internet → GDP:** Internet access enables e-commerce, remote work, access to information, and more efficient markets, all of which could boost economic productivity.
2. **GDP → Internet:** Wealthier countries can invest in telecommunications infrastructure, and their citizens can afford devices and subscriptions.
3. **Confounder → Both:** Institutional development, education levels, or industrialization might drive both. A country that industrialized early developed both economic strength and the infrastructure for internet deployment.

Most plausible: a combination of all three. There's likely a reinforcing cycle (GDP funds internet infrastructure, which boosts GDP) plus shared causes (institutional quality, education). This is a common pattern: real-world causation often involves multiple mechanisms and feedback loops, not simple one-way arrows.

Exercise 24.4 — Evaluating causal claims ⭐⭐

For each headline, identify the causal claim, assess whether the evidence likely supports causation (or just correlation), and explain your reasoning.

"Study finds that drinking coffee reduces risk of Alzheimer's disease"
"Married people earn 10% more than single people"
"New randomized trial shows meditation app reduces anxiety by 30%"
"States with higher minimum wages have lower poverty rates"
"Children who play video games have lower attention spans"

Guidance

1. **Likely observational.** Coffee drinkers and non-drinkers may differ in many ways (SES, education, lifestyle). The claim is probably based on an observational study, which shows correlation, not proven causation.
2. **Likely observational.** People who marry may already have characteristics (stability, higher income) that predict both marriage and higher earnings. Selection bias is a major concern.
3. **RCT, so stronger evidence.** "Randomized trial" means participants were randomly assigned to use the app or not. This is much stronger evidence for causation. But check: was there a placebo control? Was it blinded?
4. **Cross-sectional observation.** States differ in many ways besides minimum wage (cost of living, industry mix, demographics). The relationship could be confounded by urbanization, state wealth, or political factors.
5. **Likely observational and possibly reversed.** Children with lower attention spans might choose video games (reverse causation). Or both could be caused by other factors (parenting style, screen time more broadly). Without an RCT, causation isn't established.

Exercise 24.5 — Simpson's paradox identification ⭐⭐

A university reports the following acceptance rates:

Department	Men Applied	Men Acceptance Rate	Women Applied	Women Acceptance Rate
Department A	800	62%	100	82%
Department B	200	30%	900	35%
Overall	1000	?	1000	?

Calculate the overall acceptance rates for men and women.
Which gender has a higher acceptance rate within each department?
Which gender has a higher acceptance rate overall?
Explain the paradox. What is the confounding variable?

Guidance

1. Men: (800 × 0.62 + 200 × 0.30) / 1000 = (496 + 60) / 1000 = 55.6%. Women: (100 × 0.82 + 900 × 0.35) / 1000 = (82 + 315) / 1000 = 39.7%.
2. Women have a higher rate in BOTH departments (82% vs. 62% in A; 35% vs. 30% in B).
3. Men have a higher overall rate (55.6% vs. 39.7%).
4. The confounding variable is department choice. Men overwhelmingly apply to Department A (which has a high acceptance rate), while women overwhelmingly apply to Department B (which has a low acceptance rate). The overall rate mixes the within-department advantage for women with the between-department difference in where each group applies, and the second effect dominates.
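The arithmetic for this exercise can be checked with a short Python sketch. The counts and rates are copied from the table; the dictionary layout and the helper name `overall_rate` are illustrative choices, not part of the exercise.

```python
# Applicant counts and acceptance rates copied from the table in Exercise 24.5.
applicants = {
    "men": {"A": 800, "B": 200},
    "women": {"A": 100, "B": 900},
}
rates = {
    "men": {"A": 0.62, "B": 0.30},
    "women": {"A": 0.82, "B": 0.35},
}


def overall_rate(group):
    """Pooled acceptance rate: total accepted divided by total applied."""
    accepted = sum(applicants[group][d] * rates[group][d] for d in ("A", "B"))
    applied = sum(applicants[group].values())
    return accepted / applied


men_overall = overall_rate("men")
women_overall = overall_rate("women")
print(f"Men overall:   {men_overall:.1%}")
print(f"Women overall: {women_overall:.1%}")

# Women lead within each department, yet trail in the pooled rate.
women_lead_everywhere = all(rates["women"][d] > rates["men"][d] for d in ("A", "B"))
print("Women lead in every department:", women_lead_everywhere)
print("Men lead overall:", men_overall > women_overall)
```

The pooled rate is a weighted average of the department rates, with weights given by where each group applied, which is why it can reverse the within-department ordering.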

Exercise 24.6 — Pearson vs. Spearman ⭐⭐

When would you use Spearman's rank correlation instead of Pearson's r? Give a specific example for each scenario:

The relationship is monotonic but not linear
The data contains extreme outliers
One or both variables are ordinal (ranked categories)
You want to check if both measures agree

Guidance

1. **Nonlinear monotonic:** The relationship between experience (years) and salary follows a curve: salary increases with experience but at a decreasing rate (logarithmic). Spearman captures the consistent upward trend; Pearson underestimates the strength.
2. **Outliers:** A dataset of home prices vs. square footage where one mansion (10,000 sq ft, $20M) skews the Pearson r. Spearman, based on ranks, is barely affected.
3. **Ordinal data:** Correlation between education level (high school, bachelor's, master's, PhD) and job satisfaction rating (1-5 scale). These are ordered categories, not continuous measures.
4. **Agreement check:** If Pearson and Spearman give similar values, the relationship is likely linear. If Spearman is much higher, the relationship is monotonic but curved.
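The first scenario can be illustrated with a small sketch. The salary curve below is hypothetical (the coefficients are made up and there is no noise); it only serves to show how the two coefficients diverge on a monotonic but curved relationship.

```python
import numpy as np
from scipy import stats

# Hypothetical data: salary grows with experience at a decreasing rate.
years = np.arange(1, 41)                  # 1 to 40 years of experience
salary = 40_000 + 15_000 * np.log(years)  # illustrative curve, no noise

pearson_r, _ = stats.pearsonr(years, salary)
spearman_rho, _ = stats.spearmanr(years, salary)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Because the curve never decreases, the ranks of the two variables agree perfectly, so Spearman's ρ reaches its maximum while Pearson's r falls short of it.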

Exercise 24.7 — The hierarchy of evidence ⭐⭐

Rank the following study designs from weakest to strongest for establishing that "regular exercise reduces heart disease risk." Explain your ranking.

A randomized controlled trial assigning 1,000 people to exercise or no-exercise groups for 5 years
A cross-sectional survey finding that people who report exercising have lower rates of heart disease
A 20-year longitudinal study following 50,000 people and tracking exercise habits and heart disease
An anecdote: "My grandfather walked 3 miles every day and never had heart problems"
A natural experiment: a new subway line reduced commute walking, and researchers compared heart disease rates before and after

Guidance

From weakest to strongest: 4 (anecdote) → 2 (cross-sectional survey) → 3 (longitudinal study) → 5 (natural experiment) → 1 (RCT). The anecdote (4) is a single case with no comparison. The cross-sectional survey (2) shows association but can't establish temporal order or control confounders. The longitudinal study (3) tracks changes over time, reducing reverse causation concerns, but participants self-select into exercise. The natural experiment (5) provides a quasi-random "treatment" (reduced walking) that's plausibly independent of health-seeking behavior. The RCT (1) is strongest because randomization balances confounders, measured and unmeasured, across the groups on average, but it's expensive and raises practical challenges (compliance).

Exercise 24.8 — DAG reasoning ⭐⭐

Draw a directed acyclic graph (DAG) for each scenario. Identify the confounding variable(s) and explain whether the observed correlation between the two main variables likely reflects causation.

Sunscreen use and skin cancer incidence (surprisingly, sunscreen users have higher cancer rates in some studies)
Shoe size and reading ability in children
Smoking and lung cancer

Guidance

1. **Sunscreen and cancer:** UV exposure → both sunscreen use (people who get more sun buy more sunscreen) AND skin cancer. The confounder is sun exposure. People who use sunscreen may actually spend MORE time in the sun, not less. The correlation doesn't mean sunscreen causes cancer.
2. **Shoes and reading:** Age → larger shoe size AND better reading ability. Age is the confounder. A 10-year-old has bigger feet AND reads better than a 5-year-old, but shoe size doesn't cause reading ability.
3. **Smoking and cancer:** Here, the evidence strongly supports causation: dose-response (more smoking → more cancer), temporal order (smoking precedes cancer by decades), biological mechanism (carcinogens in smoke damage DNA), consistency across many studies, and RCT evidence from cessation (quitting reduces risk). The tobacco industry tried to argue confounding for decades, but the weight of evidence overwhelmingly supports causation.

Part B: Applied Problems ⭐⭐

Exercise 24.9 — Computing and interpreting correlations

Given the following data for 8 countries:

Country	GDP ($K)	Vaccination (%)	Life Expectancy
A	2.5	45	55
B	8.0	62	64
C	15.0	71	72
D	25.0	78	76
E	35.0	85	79
F	45.0	88	81
G	55.0	90	83
H	65.0	92	84

Compute the Pearson correlation between GDP and vaccination rate.
Compute the Pearson correlation between GDP and life expectancy.
Compute the Pearson correlation between vaccination rate and life expectancy.
Which pair has the strongest correlation? Does this mean one causes the other?
Draw a plausible DAG for these three variables. Where are the confounders?

Guidance

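Using the formula or Python: r(GDP, Vax) ≈ 0.91, r(GDP, Life) ≈ 0.91, r(Vax, Life) ≈ 1.00 (0.997 to three decimals). All three are very strongly correlated. The strongest is vaccination rate and life expectancy, but this doesn't prove causation. The two GDP correlations come out lower partly because both outcomes rise steeply at low GDP and then level off, a curved pattern that Pearson's r understates. A plausible DAG: GDP → Healthcare Spending → Vaccination Rate → (partially) Life Expectancy. GDP → Nutrition/Sanitation → Life Expectancy. In this DAG, GDP is a common cause of both vaccination rate and life expectancy, so it confounds the correlation between them. Multiple pathways exist, and the variables are all part of a system rather than a simple cause-effect chain.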
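For readers taking the Python route, a minimal sketch using NumPy is shown here. The arrays are copied from the table in the exercise; the variable names are illustrative.

```python
import numpy as np

# Values copied from the table in Exercise 24.9.
gdp = np.array([2.5, 8.0, 15.0, 25.0, 35.0, 45.0, 55.0, 65.0])
vax = np.array([45, 62, 71, 78, 85, 88, 90, 92])
life = np.array([55, 64, 72, 76, 79, 81, 83, 84])

# np.corrcoef treats each row as one variable and returns the Pearson matrix.
corr = np.corrcoef([gdp, vax, life])
r_gdp_vax = corr[0, 1]
r_gdp_life = corr[0, 2]
r_vax_life = corr[1, 2]

print(f"r(GDP, Vax)  = {r_gdp_vax:.3f}")
print(f"r(GDP, Life) = {r_gdp_life:.3f}")
print(f"r(Vax, Life) = {r_vax_life:.3f}")
```

With only 8 points, it is worth pairing these numbers with a scatter plot of each pair before interpreting them.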

Exercise 24.10 — Correlation matrix interpretation ⭐⭐

You have a correlation matrix for 5 variables:

             Math  Reading  Income  TV_Hours  Exercise
Math         1.00    0.65    0.40    -0.30      0.15
Reading      0.65    1.00    0.45    -0.35      0.10
Income       0.40    0.45    1.00    -0.20      0.25
TV_Hours    -0.30   -0.35   -0.20     1.00     -0.40
Exercise     0.15    0.10    0.25    -0.40      1.00

Which pair of variables has the strongest positive correlation? What might explain it?
Which pair has the strongest negative correlation? What might explain it?
A parent says "My child needs to stop watching TV because it's making them bad at reading (r = -0.35)." Evaluate this causal claim.
Are any of the correlations likely spurious? Which ones and why?

Guidance

1. Math and Reading (r = 0.65). General academic ability (or intelligence, or parental education) likely drives both.
2. Exercise and TV_Hours (r = -0.40). Time is finite: more time exercising means less time watching TV (and vice versa).
3. The parent is making a causal claim from a correlation. Confounders include: age, parental involvement, socioeconomic status, and personality traits. Children who watch more TV might have less parental supervision, fewer books at home, or less interest in academics, all of which independently affect reading scores.
4. The weakest correlations, Math-Exercise (r = 0.15) and Reading-Exercise (r = 0.10), are the most likely to be spurious. There's no obvious mechanism connecting academic ability to exercise, and in a small sample, values this close to zero could arise by chance.
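Scanning a correlation matrix by eye gets error-prone as the number of variables grows. The sketch below pulls out the strongest and weakest pairs from the matrix in this exercise; the tuple layout and variable names are illustrative choices.

```python
import numpy as np

names = ["Math", "Reading", "Income", "TV_Hours", "Exercise"]
# Matrix copied from Exercise 24.10.
corr = np.array([
    [ 1.00,  0.65,  0.40, -0.30,  0.15],
    [ 0.65,  1.00,  0.45, -0.35,  0.10],
    [ 0.40,  0.45,  1.00, -0.20,  0.25],
    [-0.30, -0.35, -0.20,  1.00, -0.40],
    [ 0.15,  0.10,  0.25, -0.40,  1.00],
])

# Look only above the diagonal so each pair appears once
# and the 1.00 self-correlations are skipped.
rows, cols = np.triu_indices(len(names), k=1)
pairs = [(names[i], names[j], corr[i, j]) for i, j in zip(rows, cols)]

strongest_pos = max(pairs, key=lambda p: p[2])
strongest_neg = min(pairs, key=lambda p: p[2])
weakest = min(pairs, key=lambda p: abs(p[2]))

print("Strongest positive:", strongest_pos)
print("Strongest negative:", strongest_neg)
print("Closest to zero:   ", weakest)
```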

Exercise 24.11 — Partial correlation ⭐⭐

The correlation between ice cream sales and drowning deaths is r = 0.85. The correlation between temperature and ice cream sales is r = 0.90, and between temperature and drowning is r = 0.80.

Why is the ice cream-drowning correlation spurious?
If you "controlled for" temperature (computed the partial correlation), would you expect the ice cream-drowning correlation to drop substantially? Why?
What would a partial correlation near zero tell you?

Guidance

1. Temperature causes both: hot weather drives ice cream purchases AND swimming (which increases drowning risk). There's no causal link between ice cream and drowning.
2. Yes, substantially. The raw correlation reflects their shared relationship with temperature, and removing the effect of temperature removes much of the apparent relationship. With the three values given, the first-order partial correlation formula yields roughly 0.50, down from 0.85: a large drop, though these illustrative numbers still leave some residual association.
3. A partial correlation near zero would indicate that the ice cream-drowning relationship is entirely explained by temperature. Once you account for the confounder, the two variables are essentially uncorrelated, which is exactly what we'd expect if there's no direct causal link.
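The partial correlation can be computed directly from the three pairwise values using the standard first-order formula, r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz²)(1 - r_yz²)). The sketch below applies it to the numbers in this exercise; the function name `partial_corr` is a hypothetical helper.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Pairwise correlations stated in Exercise 24.11.
r_ice_drown = 0.85
r_temp_ice = 0.90
r_temp_drown = 0.80

r_partial = partial_corr(r_ice_drown, r_temp_ice, r_temp_drown)
print(f"Raw r: {r_ice_drown:.2f}, partial r given temperature: {r_partial:.2f}")
```

Note that the partial correlation would be exactly zero only if the raw correlation equaled the product of the two temperature correlations (0.90 × 0.80 = 0.72).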

Exercise 24.12 — Correlation does not imply no confounding ⭐⭐⭐

A study finds r = 0.60 between a country's military spending (% of GDP) and infant mortality, and concludes "military spending harms children." Critique this conclusion by:

Identifying at least three confounding variables
Explaining how reverse causation might apply
Suggesting a better study design to test the causal claim
Writing a more accurate conclusion based on the data

Guidance

1. Confounders: (a) Political instability: countries in conflict zones spend more on military AND have worse health outcomes. (b) Overall development level: less developed countries may allocate more to military (as % of GDP) and have higher infant mortality. (c) Government effectiveness: poorly governed countries may both overspend on military and underinvest in healthcare.
2. Reverse causation is less plausible here (infant mortality probably doesn't cause military spending), but a feedback loop is possible: high infant mortality → political instability → more military spending.
3. A better design: (a) A panel study tracking changes in military spending and infant mortality over time within the same countries. (b) A natural experiment: compare countries that experienced an external shock to military spending (e.g., end of a neighboring conflict) to similar countries that didn't. (c) At minimum, regression controlling for GDP, conflict status, and governance quality.
4. Better conclusion: "Military spending and infant mortality are positively correlated (r = 0.60), but this association likely reflects shared underlying factors such as political instability and development level rather than a direct causal effect of military spending on child health."

Exercise 24.13 — When r = 0 doesn't mean no relationship ⭐⭐

Create a dataset where Pearson's r is approximately zero but there IS a clear relationship between x and y. (Hint: think about nonlinear relationships.)

Provide the dataset (or a formula to generate it)
Compute Pearson's r and Spearman's ρ
Plot the data and explain why r misses the relationship
What lesson does this teach about relying solely on correlation coefficients?

Guidance

Example: x = [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5], y = x². The relationship is perfectly quadratic (parabolic): knowing x tells you EXACTLY what y is. But Pearson's r = 0 because the parabola is symmetric about x = 0, so the downward trend on the left cancels the upward trend on the right. Spearman's ρ is also 0 here, because the relationship isn't monotonic (it decreases then increases). The lesson: ALWAYS plot your data. A correlation coefficient is a summary of linear relationships only. Nonlinear relationships require visualization and other methods.
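The example dataset can be checked in a few lines. This is a minimal sketch using SciPy; the plot is left out so the block stays short.

```python
import numpy as np
from scipy import stats

x = np.arange(-5, 6)  # -5, -4, ..., 5
y = x ** 2            # y is fully determined by x

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Shifting the x values so they are no longer centered on the parabola's minimum (for example, x from 0 to 10) would make both coefficients large, which shows how much the summary depends on the range sampled.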

Part C: Coding Exercises ⭐⭐–⭐⭐⭐

Exercise 24.14 — Correlation exploration ⭐⭐

Using Python, create a dataset with 6 variables for 100 countries (simulate realistic values):

- GDP per capita
- Population
- Education index
- Healthcare spending
- Vaccination rate
- Life expectancy

Compute the full correlation matrix
Create a heatmap with seaborn
Create a scatter plot matrix (pairplot) for the 4 most interesting variables
Identify the three strongest correlations and discuss possible confounders for each

Exercise 24.15 — Simulating confounding ⭐⭐

Write a simulation that demonstrates confounding:

Create a confounder Z (e.g., normal distribution, n=500)
Create X as a function of Z (with noise)
Create Y as a function of Z (with noise) — but NOT directly caused by X
Compute the correlation between X and Y
Show that the correlation is strong even though X does not cause Y
Compute the partial correlation (controlling for Z) and show it drops to near zero
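One possible starting point for these steps is sketched below. The coefficients (2.0 and 1.5), the seed, and the helper name `residuals` are arbitrary assumptions; the partial correlation is obtained by correlating the residuals left after regressing X and Y on Z, which is one of several equivalent ways to compute it.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
n = 500

z = rng.normal(size=n)            # confounder
x = 2.0 * z + rng.normal(size=n)  # X depends on Z plus noise
y = 1.5 * z + rng.normal(size=n)  # Y depends on Z plus noise, never on X


def residuals(target, control):
    """What is left of target after a least-squares line fit on control."""
    slope, intercept = np.polyfit(control, target, deg=1)
    return target - (slope * control + intercept)


r_raw = np.corrcoef(x, y)[0, 1]
r_partial = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print(f"Raw correlation of X and Y:       {r_raw:.3f}")
print(f"Partial correlation (given Z):    {r_partial:.3f}")
```

Try changing the noise scale or the coefficients on Z to see how the raw correlation strengthens or weakens while the partial correlation stays near zero.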

Exercise 24.16 — Simpson's paradox simulation ⭐⭐⭐

Create a complete simulation of Simpson's paradox:

Create a dataset with a treatment (A or B), an outcome, and a grouping variable
Show that Treatment A appears better overall
Show that Treatment B is better within EVERY subgroup
Create visualizations showing both the overall and disaggregated results
Explain the paradox in a paragraph suitable for a non-technical audience
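As a hedged starting point, the sketch below uses fixed, made-up counts rather than random draws, with case severity as the grouping variable. Treatment A is given mostly to mild cases, which is what produces the reversal. All numbers and column names are hypothetical.

```python
import pandas as pd

# Hypothetical counts: A is mostly given to mild cases, B to severe ones.
df = pd.DataFrame({
    "severity":  ["mild", "mild", "severe", "severe"],
    "treatment": ["A", "B", "A", "B"],
    "patients":  [800, 200, 200, 800],
    "recovered": [680, 180, 80, 400],
})

# Recovery rate within each severity group, one column per treatment.
by_group = df.assign(rate=df["recovered"] / df["patients"]).pivot(
    index="severity", columns="treatment", values="rate"
)

# Recovery rate after pooling the severity groups together.
totals = df.groupby("treatment")[["patients", "recovered"]].sum()
overall = totals["recovered"] / totals["patients"]

print(by_group)
print(overall)
```

A full answer would still need the visualizations and the plain-language explanation; a grouped bar chart of `by_group` next to a bar chart of `overall` is one way to show both views.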

Exercise 24.17 — Anscombe's quartet extended ⭐⭐

Create your own version of Anscombe's quartet:

Create 4 datasets that all have the same Pearson r (approximately 0.70) but look very different:
- Dataset 1: A clean linear relationship
- Dataset 2: A curved (quadratic) relationship
- Dataset 3: A perfect line with one outlier
- Dataset 4: No relationship except one extreme point
Plot all 4 in a 2x2 grid
Compute Pearson r and Spearman ρ for each
Write a caption explaining why visualization is essential

Exercise 24.18 — Correlation significance testing ⭐⭐

Write code that:

Generates two independent random variables (n = 30, no true relationship)
Computes Pearson's r and its p-value
Repeats 10,000 times
Plots the distribution of r values and the distribution of p-values
Shows that about 5% of the r values are "significant" at α = 0.05
Discusses what this means for data dredging (testing many correlations)
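The simulation part of this exercise might be sketched as follows, leaving out the plots and the discussion. Instead of calling `scipy.stats.pearsonr` 10,000 times in a loop, this version computes all r values at once with NumPy and converts them to two-sided p-values through the t distribution with n - 2 degrees of freedom; a loop over `pearsonr` should give matching results. The seed and variable names are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n, reps, alpha = 30, 10_000, 0.05

# Each row is one experiment: two independent samples of size n.
x = rng.normal(size=(reps, n))
y = rng.normal(size=(reps, n))

# Pearson's r for every row at once.
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
r_values = (xc * yc).sum(axis=1) / np.sqrt(
    (xc**2).sum(axis=1) * (yc**2).sum(axis=1)
)

# Two-sided p-value from the t statistic with n - 2 degrees of freedom.
t_stats = r_values * np.sqrt((n - 2) / (1 - r_values**2))
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - 2)

false_positive_rate = np.mean(p_values < alpha)
print(f"Share of 'significant' correlations: {false_positive_rate:.3f}")
```

Histograms of `r_values` and `p_values` (for example with `matplotlib.pyplot.hist`) complete the plotting step.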

Exercise 24.19 — Project: Full correlation analysis ⭐⭐⭐

Perform a comprehensive correlation analysis on the vaccination project data:

Load or create a dataset with GDP, healthcare spending, education index, urbanization rate, and vaccination rate for 150+ countries
Compute pairwise correlations (both Pearson and Spearman)
Create a correlation heatmap
For the top 3 correlations, create scatter plots with regression lines
For each correlation, identify at least one confounding variable
Compute partial correlations controlling for GDP
Write a "Results" section discussing the correlations, their possible confounders, and the limitations of correlational analysis

Part D: Synthesis and Critical Thinking ⭐⭐⭐–⭐⭐⭐⭐

Exercise 24.20 — News article analysis ⭐⭐⭐

Find (or imagine) a news article with a headline like "Study finds that [X] is linked to [Y]." Then:

Identify the implied causal claim
What type of study is it likely based on (RCT, observational, survey)?
List at least 3 confounding variables that could explain the correlation
Is reverse causation plausible?
Rewrite the headline to be more accurate
Write the paragraph that SHOULD appear in the article but probably doesn't — the one about limitations and alternative explanations

Exercise 24.21 — Designing an RCT ⭐⭐⭐

You want to determine whether a vaccination education program causes higher vaccination rates. Design a randomized controlled trial:

Define the population, the treatment, and the control
How will you randomize? At the individual level or the community level?
What is your primary outcome measure?
What confounders are you worried about, and how does randomization address them?
What are the ethical considerations? (Can you ethically deny an education program to the control group?)
What sample size would you need? (Use the principles from Chapter 23.)

Exercise 24.22 — The chocolate-Nobel Prize paper ⭐⭐⭐

The 2012 New England Journal of Medicine paper by Franz Messerli found r = 0.79 between per-capita chocolate consumption and Nobel Prize winners per capita, across countries.

Is this likely a causal relationship? Why or why not?
Identify at least 4 confounding variables
The paper was widely covered with headlines suggesting chocolate makes you smarter. Write a response explaining why the headlines are misleading.
Why might the author have published this — was it intended as a serious finding or a commentary on correlation?
What would it take to actually prove that chocolate consumption causes cognitive improvement?

Guidance

Confounders include: national wealth (rich countries consume more chocolate AND invest more in education/research), climate (colder countries consume more chocolate and have different educational traditions), European cultural influence (both chocolate consumption and Nobel Prizes are concentrated in European countries), and population size (small wealthy European countries have high per-capita rates of both). The paper was widely understood as tongue-in-cheek — a demonstration that ridiculous correlations can look compelling with the right data. Proving a causal effect would require randomized trials of chocolate consumption with cognitive outcomes, controlling for diet, genetics, and lifestyle.

Exercise 24.23 — Ethical implications of causal claims ⭐⭐⭐⭐

Consider this scenario: A data analysis shows a strong correlation between a community's vaccination rate and its average income level. A policymaker uses this finding to argue: "We should focus vaccination campaigns on low-income communities because poverty causes low vaccination rates."

Is the causal claim justified by the correlation?
What alternative explanations exist for the correlation?
Even if the causal direction is uncertain, is the policy recommendation reasonable? Why or why not?
How might the causal framing ("poverty causes low vaccination") affect the policy design compared to a non-causal framing ("poverty and low vaccination co-occur")?
Write a paragraph advising the policymaker on how to use this correlation responsibly.

Guidance

The causal claim isn't strictly justified by correlation alone. Alternative explanations: healthcare access (not income per se), geographic barriers, education, language barriers, trust in institutions. However, the policy recommendation might be reasonable regardless of the causal mechanism — low-income communities DO have lower vaccination rates for whatever reason, and targeting resources there addresses a real gap. The causal framing matters: if "poverty causes low vaccination" leads to income-transfer programs, it might miss the mark. If "various barriers co-occur with poverty" leads to multi-faceted interventions (mobile clinics, community outreach, translated materials), it might be more effective. The advisor should say: "The correlation is real, the need is real, and targeted resources are appropriate. But don't assume income is the ONLY lever — investigate the specific barriers in each community."

Exercise 24.24 — Correlation in your daily life ⭐⭐⭐

Over the next week, keep a log of causal claims you encounter in:

- News articles ("Study shows X linked to Y")
- Advertisements ("Using our product leads to Z")
- Conversations ("I started doing X and then Y happened")
- Social media ("If you do X, you'll get Y")

For each claim:

1. Identify the implied causation
2. Assess whether the evidence likely supports causation or just correlation
3. Identify at least one confounding variable

Write up 3 of your best examples with your analysis.

Exercise 24.25 — The limits of observational data ⭐⭐⭐⭐

Write an essay (500-700 words) on this question: "If we can never fully rule out confounders in observational data, should we stop making causal claims from observational studies? Or is 'imperfect causal evidence' better than 'no causal evidence at all'?"

Consider:

- Situations where RCTs are impossible or unethical
- The accumulated weight of multiple observational studies
- The precautionary principle (acting on uncertain evidence when the stakes are high)
- Historical examples where observational evidence was later confirmed by RCTs (e.g., smoking and cancer)
- Historical examples where observational evidence was later contradicted (e.g., hormone replacement therapy)

Guidance

A balanced essay would acknowledge that observational studies are imperfect but often the best evidence available. The smoking-cancer link was established primarily through observational evidence (decades before RCTs of smoking cessation were feasible). Waiting for "perfect" evidence would have delayed public health action by decades. However, the HRT example shows that observational evidence can mislead: observational studies suggested HRT prevented heart disease, but RCTs showed it actually increased risk. The earlier studies were confounded by health consciousness, because healthier women were more likely to choose HRT. The key is to be transparent about the limitations, seek multiple lines of evidence, and be willing to update conclusions when better evidence arrives.