Further Reading: Correlation, Causation, and the Danger of Confusing the Two

This chapter covers one of the most important ideas in data science — and one that rewards deeper study. Whether you want to understand confounding more rigorously, learn about modern causal inference methods, or simply enjoy a good story about data gone wrong, these resources will serve you well.

Tier 1: Verified Sources

Judea Pearl and Dana Mackenzie, The Book of Why: The New Science of Cause and Effect (Basic Books, 2018). If this chapter fascinated you, this is the book to read next. Pearl is a pioneer of modern causal inference who developed directed acyclic graphs (DAGs) into formal tools for causal reasoning. This book, written for a general audience, explains why traditional statistics struggles with causation and how a newer mathematical framework (do-calculus) can help. It's intellectually thrilling and deeply relevant to data science.

David Spiegelhalter, The Art of Statistics: How to Learn from Data (Basic Books, 2019). Spiegelhalter's chapters on correlation, confounding, and causation are some of the clearest treatments available. He's particularly good at explaining Simpson's paradox and the hierarchy of evidence with real-world medical examples. A great complement to our chapter.

Joshua D. Angrist and Jörn-Steffen Pischke, Mostly Harmless Econometrics: An Empiricist's Companion (Princeton University Press, 2009). If you want the rigorous technical treatment of causal inference from observational data, this is the book that economists use. It covers instrumental variables, difference-in-differences, regression discontinuity, and other methods for extracting causal conclusions from non-experimental data. More technical than the other recommendations, but foundational if you're heading toward research or policy analysis.

Joshua D. Angrist and Jörn-Steffen Pischke, Mastering 'Metrics: The Path from Cause to Effect (Princeton University Press, 2015). A more accessible companion to Mostly Harmless Econometrics, designed as an undergraduate textbook. If Mostly Harmless Econometrics feels too technical, start here. It uses engaging examples (the effect of insurance on health, the returns to education) to teach causal reasoning methods.

Tyler Vigen, Spurious Correlations (Hachette Books, 2015). Based on the popular website, this book catalogs ridiculous but statistically real correlations (US spending on science, space, and technology vs. suicides by hanging, strangulation, and suffocation; Nicolas Cage films vs. swimming pool drownings). It's entertaining and drives home the point that correlation without mechanism is meaningless. Good for a laugh and a learning moment.

Edward Tufte, The Visual Display of Quantitative Information (Graphics Press, 2nd edition, 2001). While not specifically about correlation and causation, Tufte's masterwork on data visualization is essential for anyone who creates scatter plots, correlation matrices, and other tools used in this chapter. His principles of graphical integrity will help you present correlational data honestly.

Tier 2: Attributed Resources

Spurious Correlations website (Tyler Vigen). The website that inspired the book. It automatically generates correlations between randomly selected time series, producing absurd but statistically significant relationships. A fun and memorable demonstration of why correlation without mechanism is meaningless. Search "Tyler Vigen spurious correlations."

Hans Rosling's TED talks and Gapminder. Rosling's presentations of the GDP-health relationship (the subject of Case Study 2) are among the most compelling data visualizations ever created. His Gapminder tool lets you explore the relationship between wealth and health interactively across countries and time. Search "Hans Rosling TED" or visit the Gapminder website.

xkcd comic #552, "Correlation." The title text of Randall Munroe's famous strip reads: "Correlation doesn't imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing 'look over there.'" A perfect one-sentence summary of the tension between what correlation shows and what it means.

Samuel Preston, "The Changing Relation between Mortality and Level of Economic Development," Population Studies (1975). The original paper documenting the Preston Curve — the logarithmic relationship between national income and life expectancy discussed in Case Study 2. A landmark in both demography and health economics.

Judea Pearl's "Causal Diagrams for Empirical Research" (1995). Published in Biometrika, this paper introduced DAGs as a formal tool for analyzing causal relationships in observational data. Technical but historically important — it launched a revolution in how statisticians think about causation.

Andrew Gelman and Jennifer Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models (Cambridge University Press, 2006). A practical guide to using regression for causal inference, with careful discussion of when regression can and cannot support causal claims. More advanced than our treatment, but excellent for the deep-dive reader.
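
If you'd like to see for yourself why a search through many unrelated time series (the idea behind the Spurious Correlations website listed above) reliably turns up strong correlations, the following minimal sketch may help. It is not how the website works; it simply generates made-up random walks that share no mechanism at all and reports the best-matching pair. The function names, the number of series, and the series length are all illustrative assumptions.

```python
import random
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def random_walk(rng, length):
    """A series built from independent +1/-1 steps (hypothetical data)."""
    position, walk = 0, []
    for _ in range(length):
        position += rng.choice((-1, 1))
        walk.append(position)
    return walk


def strongest_pair(n_series=40, length=15, seed=0):
    """Generate unrelated series; return (|r|, i, j) for the best-matching pair."""
    rng = random.Random(seed)
    series = [random_walk(rng, length) for _ in range(n_series)]
    return max(
        (abs(pearson_r(series[i], series[j])), i, j)
        for i, j in combinations(range(n_series), 2)
    )


if __name__ == "__main__":
    n_series = 40
    r, i, j = strongest_pair(n_series=n_series)
    n_pairs = n_series * (n_series - 1) // 2
    print(f"Strongest of {n_pairs} pairings: series {i} vs. {j}, |r| = {r:.2f}")
```

Every series here is pure noise, so any strong correlation the search finds is a product of the search itself: compare enough pairs and some will line up by chance. Try changing the seed or the number of series and watch how the strongest |r| behaves.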

Recommended Next Steps

If the correlation-causation distinction fascinated you: Read Pearl's The Book of Why. It will fundamentally change how you think about data and causation. Then come back to our chapters on regression (26-28) with fresh eyes — you'll see them differently.
If you want to practice identifying confounders: Browse the Spurious Correlations website and, for each one, try to explain why the correlation is meaningless. Then find real-world examples in the news and apply the causal evaluation checklist from the chapter.
If you're interested in causal inference methods: Start with Angrist and Pischke's Mastering 'Metrics for an accessible introduction, then move to Mostly Harmless Econometrics for the full treatment. These are the methods used by economists, epidemiologists, and policy researchers to draw causal conclusions from observational data.
If Simpson's paradox blew your mind: Spiegelhalter's The Art of Statistics has several excellent examples. For a more technical treatment, look up the Wikipedia article on Simpson's paradox, which catalogs real-world instances from medicine, education, and sports.
If you want to see the GDP-health relationship explored in depth: Watch Hans Rosling's TED talks and play with the Gapminder tools. Then read Angus Deaton's The Great Escape: Health, Wealth, and the Origins of Inequality (Princeton University Press, 2013) for a comprehensive treatment of how economic development and health co-evolve.
If you want to go deep on DAGs: Pearl's technical work (accessible through his website) provides the mathematical foundations. For a more applied treatment, look up Miguel Hernán and James Robins, Causal Inference: What If (Chapman & Hall/CRC, 2020), which is freely available online and covers DAGs with medical examples.
If you're ready to move on: Chapter 25 introduces predictive modeling — using the relationships you've learned to measure in this chapter to make predictions about new data. Remember the lesson from this chapter: a model that predicts well does not necessarily describe a causal mechanism.
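
The lesson that a good predictor need not be a cause can be sketched with a small simulation. This is an illustrative toy under stated assumptions, not a method from any of the books above: a hidden variable z drives both x and y, x has no effect on y at all, and the coefficients and noise levels are invented for the example. The function names are hypothetical.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate(n=5000, intervene=False, seed=0):
    """Return corr(x, y) in a toy world where z causes both x and y.

    With intervene=False, x is observed as it naturally occurs (driven by z).
    With intervene=True, x is assigned at random, as in an experiment.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                # hidden confounder
        if intervene:
            x = rng.gauss(0, 1)            # x set by us, ignoring z
        else:
            x = z + rng.gauss(0, 0.5)      # x driven by z
        y = 2 * z + rng.gauss(0, 0.5)      # y depends on z only, never on x
        xs.append(x)
        ys.append(y)
    return pearson_r(xs, ys)


if __name__ == "__main__":
    print(f"Observed:   r = {simulate(intervene=False):.2f}")
    print(f"Randomized: r = {simulate(intervene=True):.2f}")
```

In the observed data, x predicts y well (with these made-up parameters the theoretical correlation is about 0.87), yet once x is assigned at random the correlation falls to roughly zero. A model trained on the observed data would predict accurately and still tell you nothing about what happens when you change x.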

A Final Thought

Correlation and causation are often presented as a binary: either something IS causal or it ISN'T. In practice, the situation is usually somewhere in between. Most real-world correlations reflect a mix of direct causal effects, reverse causation, and confounding — all intertwined and often impossible to fully separate with observational data.

The right response isn't paralysis ("we can never know anything about causation!") or naivety ("the correlation is strong, so it must be causal!"). It's honest, careful reasoning: "Here's what we observe. Here are the possible explanations. Here's what we'd need to determine which explanation is correct. And here's how confident we should be in the meantime."

That's the skill this chapter is designed to build. It takes practice, and you'll get better at it with every dataset you analyze.