Chapter 24 Quiz: Correlation, Causation, and the Danger of Confusing the Two
Instructions: This quiz tests your understanding of Chapter 24. Answer all questions before checking the solutions. For multiple choice, select the best answer. For short-answer questions, aim for 2-4 clear sentences. Total points: 94.
Section 1: Multiple Choice (10 questions, 4 points each)
Question 1. A Pearson correlation coefficient of r = -0.85 between two variables indicates:
- (A) A weak negative relationship
- (B) A strong negative relationship — as one variable increases, the other tends to decrease
- (C) That one variable causes the other to decrease
- (D) An error in the data — correlation cannot be negative
Answer
**Correct: (B)** r = -0.85 indicates a strong negative linear relationship. The negative sign means the variables move in opposite directions. The magnitude (0.85) indicates the relationship is strong. (A) is wrong because 0.85 is strong, not weak. (C) is wrong because correlation doesn't prove causation. (D) is wrong — negative correlations are common and valid.

Question 2. Which of the following is a confounding variable in the relationship between ice cream sales and drowning deaths?
- (A) The price of ice cream
- (B) The number of lifeguards
- (C) Temperature / season
- (D) The color of swimsuits
Answer
**Correct: (C)** Temperature is the confounding variable — it causes both increased ice cream consumption (people want cold treats in hot weather) and increased drowning risk (more people swim in hot weather). A confounding variable must be causally related to BOTH variables in the correlation. (A) affects ice cream sales but not drowning. (B) might affect drowning but doesn't cause ice cream sales. (D) is unrelated to both.

Question 3. Spearman's rank correlation (ρ) is preferred over Pearson's r when:
- (A) The sample size is very large
- (B) Both variables are normally distributed
- (C) The relationship is monotonic but not necessarily linear
- (D) You want to prove causation
Answer
**Correct: (C)** Spearman's ρ measures monotonic relationships (consistently increasing or decreasing) regardless of whether the relationship is linear. It converts data to ranks first, making it robust to nonlinearity and outliers. (A) is irrelevant — sample size doesn't determine the choice. (B) describes conditions where Pearson works well, not where Spearman is needed. (D) — neither correlation measure proves causation.

Question 4. A study finds that countries with more doctors per capita have higher obesity rates. What is the most likely explanation?
- (A) Doctors cause obesity
- (B) Obesity causes countries to train more doctors
- (C) Wealthier countries have both more doctors and higher obesity rates (confounding by national wealth)
- (D) The correlation is a mathematical error
Answer
**Correct: (C)** This is a classic confounding scenario. National wealth drives both: wealthier countries can afford more doctors AND have more processed food, sedentary lifestyles, and other factors that increase obesity. Neither (A) "doctors cause obesity" nor (B) "obesity causes more doctors" is a plausible direct causal mechanism. (D) is unlikely — the correlation is probably real, just misleading.

Question 5. Simpson's paradox occurs when:
- (A) A correlation coefficient is exactly zero
- (B) A trend that appears in disaggregated data reverses when the data is combined
- (C) Two variables are perfectly correlated
- (D) A study fails to replicate
Answer
**Correct: (B)** Simpson's paradox is the phenomenon where a relationship that holds within every subgroup reverses or disappears when the groups are combined. This happens because a confounding variable has different distributions across the groups being aggregated. The classic example is UC Berkeley admissions: women were accepted at higher rates in most departments, but at a lower rate overall because they applied to more competitive departments.

Question 6. Which study design provides the strongest evidence for a causal claim?
- (A) Cross-sectional survey
- (B) Longitudinal observational study
- (C) Randomized controlled trial
- (D) Case study
Answer
**Correct: (C)** A randomized controlled trial (RCT) is the gold standard for establishing causation because random assignment balances all confounders — both known and unknown — between the treatment and control groups. (B) is better than (A) because it tracks changes over time, but participants aren't randomly assigned. (A) only shows association at one point in time. (D) is a single case and provides the weakest evidence.

Question 7. A correlation matrix shows r = 0.95 between variables A and B, r = 0.93 between A and C, and r = 0.88 between B and C. What should you suspect?
- (A) A causes both B and C
- (B) All three variables are caused by one another in a circle
- (C) There may be an underlying variable driving all three correlations
- (D) The correlations are too high and must be errors
Answer
**Correct: (C)** When multiple variables are all highly correlated with each other, a common explanation is that they share an underlying common cause. For example, GDP, life expectancy, and education are all highly correlated because national development level drives all three. (A) might be true but isn't the only explanation. (B) circular causation is possible but unusual. (D) high correlations are not errors — they're common among variables that share underlying causes.

Question 8. A researcher finds r = 0.02 between hours of sleep and exam scores in a sample of 500 students. The p-value is 0.65. What can we conclude?
- (A) Sleep has no effect on exam scores
- (B) There is no linear relationship between sleep and exam scores in this sample
- (C) There is no statistically significant linear correlation, but nonlinear relationships or confounders might exist
- (D) The sample is too small to detect a relationship
Answer
**Correct: (C)** r = 0.02 with p = 0.65 means we fail to find a significant linear correlation. However, (A) is too strong — "no effect" is a causal claim we can't make from correlation. (B) is close but too absolute — r = 0.02 is very small but not exactly zero. (C) is best because it acknowledges the limitations: the relationship might be nonlinear (e.g., both too little AND too much sleep hurt performance), or confounders might mask a real relationship. (D) is wrong — 500 is a large sample; if the true correlation were moderate, we'd detect it.

Question 9. Which of the following strengthens the case that a correlation reflects a causal relationship?
- (A) The correlation is computed from a very large sample
- (B) There is a plausible biological or logical mechanism connecting the variables
- (C) The correlation coefficient is exactly 1.0
- (D) The researcher expected the result before conducting the study
Answer
**Correct: (B)** A plausible mechanism (a believable story for HOW X could cause Y) is one of the key criteria for inferring causation from correlation. Other supportive evidence includes temporal order (X precedes Y), dose-response (more X → more Y), consistency across studies, and experimental evidence. (A) large samples increase statistical significance but don't address confounding. (C) a perfect correlation could still be spurious. (D) prior expectations can help, but they also introduce confirmation bias.

Question 10. You compute a correlation matrix for 10 variables (45 unique pairs). Using α = 0.05 for each test, approximately how many spurious "significant" correlations would you expect to find even if no true relationships exist?
- (A) 0
- (B) About 2-3
- (C) About 10
- (D) About 22-23
Answer
**Correct: (B)** With 45 pairwise tests at α = 0.05, you'd expect about 45 × 0.05 = 2.25 false positives by chance alone. This is the multiple testing problem applied to correlation matrices. When exploring many correlations, some will appear significant purely by chance. This is why you should correct for multiple comparisons or treat exploratory correlations as hypotheses to be confirmed on new data.

Section 2: True or False (4 questions, 4 points each)
Question 11. True or False: If Pearson's r = 0 between two variables, there is definitely no relationship between them.
Answer
**False.** Pearson's r only measures *linear* relationships. Two variables can have a strong nonlinear relationship (e.g., a U-shaped or quadratic pattern) and still have r = 0. For example, y = x² has a perfect relationship where knowing x determines y exactly, but r ≈ 0 if x values are symmetric around zero. Always visualize your data — don't rely on correlation coefficients alone.

Question 12. True or False: A randomized controlled trial eliminates the problem of confounding variables.
Answer
**True** (approximately). Random assignment ensures that, on average, all confounding variables — both known and unknown — are balanced between treatment and control groups. This is why RCTs are the gold standard for causal inference. The qualification "approximately" is because randomization works in expectation; in any particular trial (especially small ones), imbalances can occur by chance. But with adequate sample sizes, randomization effectively eliminates confounding.

Question 13. True or False: A correlation of r = 0.99 between two variables proves that one causes the other.
Answer
**False.** Even an extremely strong correlation does not prove causation. The chocolate-Nobel Prize correlation (r = 0.79) and the Nicolas Cage movies-drowning correlation (r = 0.67) are examples of strong correlations with no causal relationship. Confounding variables, reverse causation, and coincidence can all produce strong correlations without causation. The strength of the correlation tells you about the tightness of the relationship, not about its causal nature.

Question 14. True or False: Spearman's rank correlation can detect nonlinear relationships that Pearson's r misses.
Answer
**True** (for monotonic nonlinear relationships). Spearman's ρ measures *monotonic* relationships — where Y consistently increases (or decreases) as X increases, even if the relationship is curved (like logarithmic or exponential). It converts data to ranks, so the specific shape of the curve doesn't matter as long as the direction is consistent. However, Spearman CANNOT detect non-monotonic relationships (like U-shaped or sinusoidal patterns).

Section 3: Short Answer (3 questions, 6 points each)
Question 15. Explain what a confounding variable is in 2-3 sentences. Give an original example (not from the chapter) of a correlation that is likely explained by a confounder.
Answer
A **confounding variable** is a third variable that is causally related to both the independent and dependent variables, creating a spurious (misleading) association between them. It makes it appear as though X and Y are directly related when, in fact, their relationship is partly or entirely driven by the confounder. **Example:** There is a positive correlation between the number of fire trucks responding to a fire and the amount of fire damage. The confounder is fire severity — larger fires cause both more damage AND more fire trucks to be dispatched. Sending fewer trucks wouldn't reduce damage (it would make it worse).

Question 16. In 2-3 sentences, explain why randomized controlled trials are considered the gold standard for establishing causation. What advantage does randomization provide that observational studies lack?
Answer
Randomized controlled trials are the gold standard because **random assignment** balances all characteristics — both measured and unmeasured — between the treatment and control groups. This means any difference in outcomes can be attributed to the treatment rather than to pre-existing differences between groups. Observational studies cannot achieve this because participants self-select into groups (e.g., people who choose to exercise are already different from those who don't), leaving open the possibility that unmeasured confounders explain the observed differences.

Question 17. Explain Simpson's paradox in 2-3 sentences. Why does it matter for data analysis?
Answer
**Simpson's paradox** occurs when a trend or relationship that appears in disaggregated subgroups reverses or disappears when the data is combined into a single group. This happens when a confounding variable has different distributions across the subgroups being compared, and aggregation mixes the within-group relationship with the between-group distribution difference. It matters because analyzing only the aggregated data can lead to the exact opposite conclusion from the correct one. Data scientists must always consider whether their results should be broken down by potential confounding variables — what looks true overall may be false within every subgroup (or vice versa).

Section 4: Applied Scenarios (2 questions, 7 points each)
Question 18. A health policy researcher presents the following data:
| Country Group | Avg Health Spending (% GDP) | Avg Vaccination Rate |
|---|---|---|
| Low spending | 3.2% | 58% |
| Medium spending | 6.5% | 74% |
| High spending | 10.1% | 86% |
The researcher concludes: "Increasing health spending causes higher vaccination rates. Governments should increase health spending to improve vaccination coverage."
(a) Identify the causal claim. (b) List three confounding variables that could explain the correlation without requiring a direct causal link. (c) Is the policy recommendation reasonable even if the causal claim isn't proven? Why or why not? (d) What study design would you recommend to test the causal claim more rigorously?
Answer
**(a)** The causal claim is that health spending *causes* higher vaccination rates — increasing spending will lead to improved coverage. **(b)** Three confounders: (1) **GDP per capita** — wealthier countries can afford both higher health spending and better vaccination infrastructure. (2) **Government effectiveness** — well-functioning governments both allocate more to health and implement programs effectively. (3) **Education levels** — more educated populations both demand more health spending and are more likely to accept vaccines. **(c)** The recommendation may be reasonable as a practical matter — more health spending probably does contribute to better vaccination (it funds clinics, cold chains, personnel). But the SIZE of the effect may be smaller than the correlation suggests, because confounders inflate the apparent relationship. More spending alone, without addressing delivery infrastructure, education, and trust, may not produce the expected gains. **(d)** A quasi-experimental study comparing countries that experienced an exogenous change in health spending (e.g., a new international funding program) to similar countries that didn't. Or a within-country study examining vaccination rate changes following budget increases, controlling for time trends and other policy changes.

Question 19. You're analyzing a dataset and find the following correlations with student exam scores:
- Hours of sleep: r = 0.32 (p = 0.001)
- Parental income: r = 0.45 (p < 0.001)
- Hours of study: r = 0.55 (p < 0.001)
- Social media hours: r = -0.38 (p < 0.001)
- Distance from school: r = -0.12 (p = 0.08)
(a) Which correlations are statistically significant at α = 0.05? (b) A school administrator says: "We need to ban social media because it's hurting grades." Evaluate this causal claim. (c) A parent says: "If I just make my kid study more, their grades will go up by the amount predicted by the correlation." Why might this not work as expected? (d) You tested 5 correlations. Should you apply a multiple testing correction? What would happen if you did?
Answer
**(a)** All except distance from school (p = 0.08) are significant at α = 0.05. **(b)** The social media-grades correlation (r = -0.38) does not prove causation. Confounders: students who spend more time on social media may have less parental supervision, less interest in academics, or more stress — all of which independently affect grades. Reverse causation is also possible: students who struggle academically may turn to social media as a distraction. A ban might not improve grades if the underlying causes (motivation, home environment, learning difficulties) aren't addressed. **(c)** Correlation doesn't predict the effect of an intervention. The r = 0.55 correlation reflects a mix of causal effect AND confounding (students who study more may be more motivated, have better study skills, or have more parental support). Forcing study hours up without addressing motivation might produce little improvement — and might even backfire if forced study is unproductive. **(d)** With 5 tests, a Bonferroni correction would use α = 0.05/5 = 0.01. All four significant correlations would survive: the closest call is hours of sleep (p = 0.001, still below 0.01), and the others are p < 0.001. Distance from school (p = 0.08) was already non-significant. For only 5 tests the correction is minor, but it's good practice to note that multiple correlations were tested.

Section 5: Code Analysis (1 question, 6 points)
Question 20. Read the following code and answer the questions:
```python
import numpy as np
from scipy import stats

np.random.seed(42)

# The hidden causal structure
wealth = np.random.normal(50, 15, 200)
education = 0.6 * wealth + np.random.normal(0, 10, 200)
health = 0.5 * wealth + 0.3 * education + np.random.normal(0, 8, 200)

r_ed_health, p1 = stats.pearsonr(education, health)
r_wealth_health, p2 = stats.pearsonr(wealth, health)
r_ed_wealth, p3 = stats.pearsonr(education, wealth)

print(f"Education-Health: r={r_ed_health:.3f}, p={p1:.4f}")
print(f"Wealth-Health: r={r_wealth_health:.3f}, p={p2:.4f}")
print(f"Education-Wealth: r={r_ed_wealth:.3f}, p={p3:.4f}")
```
(a) Based on the code, draw the causal DAG. Which variable(s) directly cause health? Which variable acts as a confounder? (b) The education-health correlation will be positive. Is all of this correlation due to education directly causing health? (c) If you wanted to estimate the direct effect of education on health (controlling for wealth), what would you need to do? (d) Would the education-health correlation be smaller or larger after controlling for wealth? Why?
Answer
**(a)** The DAG is:

Wealth → Education
Wealth → Health
Education → Health
Both wealth AND education directly cause health (they both appear in the formula for health). Wealth is a confounder of the education-health relationship because it causes both education AND health independently.
**(b)** No. Part of the education-health correlation is due to the *direct* effect of education on health (the `0.3 * education` term). But part is due to confounding through wealth: wealth causes both higher education AND better health, inflating the apparent relationship between education and health.
**(c)** You would compute the partial correlation between education and health, controlling for wealth. This can be done by regressing both education and health on wealth, taking the residuals, and computing the correlation between the residuals.
**(d)** **Smaller.** Controlling for wealth removes the confounding component of the correlation. The remaining correlation would reflect only the direct effect of education on health (from the `0.3 * education` term), which is a smaller portion of the total association.
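A minimal sketch of the residualization procedure from (c), reusing the Question 20 simulation. It uses `scipy.stats.linregress` to regress each variable on wealth (one of several ways to residualize; a regression library would work equally well), then correlates the residuals:

```python
import numpy as np
from scipy import stats

np.random.seed(42)

# Recreate the hidden causal structure from Question 20
wealth = np.random.normal(50, 15, 200)
education = 0.6 * wealth + np.random.normal(0, 10, 200)
health = 0.5 * wealth + 0.3 * education + np.random.normal(0, 8, 200)

# Raw correlation, inflated by confounding through wealth
r_raw, _ = stats.pearsonr(education, health)

# Regress education and health on wealth, keep only the residuals
fit_e = stats.linregress(wealth, education)
fit_h = stats.linregress(wealth, health)
resid_e = education - (fit_e.intercept + fit_e.slope * wealth)
resid_h = health - (fit_h.intercept + fit_h.slope * wealth)

# Partial correlation: what education and health share beyond wealth
r_partial, _ = stats.pearsonr(resid_e, resid_h)

print(f"raw r(education, health)      = {r_raw:.3f}")
print(f"partial r (wealth controlled) = {r_partial:.3f}")
```

With this seed the partial correlation comes out well below the raw correlation, consistent with answer (d): removing the confounding path through wealth leaves only the direct `0.3 * education` effect.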