Chapter 2: Further Reading

The Data Science Mindset and Process

Davenport, T.H., & Harris, J.G. (2007). Competing on Analytics: The New Science of Winning. Harvard Business School Press. The foundational business text on building analytically competitive organizations. Particularly strong on the organizational and cultural dimensions of analytics adoption — the people and process challenges that this chapter identifies as more important than technology.

Provost, F., & Fawcett, T. (2013). Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking. O'Reilly Media. One of the best bridges between technical data science and business application. Covers CRISP-DM, analytical thinking, and model evaluation in a way that's accessible to business professionals without sacrificing rigor. Excellent companion reading for this chapter and the chapters that follow.

Chapman, P., et al. (2000). CRISP-DM 1.0: Step-by-Step Data Mining Guide. SPSS Inc. The original CRISP-DM documentation. Freely available online. More detailed than the overview in this chapter, with specific guidance on deliverables and activities for each phase. Remains the authoritative reference despite its age.

Hubbard, D.W. (2014). How to Measure Anything: Finding the Value of Intangibles in Business. 3rd edition. Wiley. An excellent resource on the business understanding phase of analytics work. Hubbard's central argument — that anything that matters can be measured, and that the real obstacle is usually conceptual, not technical — is a powerful complement to the data science mindset described in this chapter.

Correlation, Causation, and Statistical Thinking

Pearl, J., & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books. Judea Pearl is one of the most important thinkers in causal inference. This book, written for a general audience, explains why correlation doesn't imply causation and what we can do about it. Covers causal diagrams, confounders, and the logic of intervention in an accessible and often entertaining way.

Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux. The definitive work on cognitive biases that affect analytical judgment — including confirmation bias, anchoring, regression to the mean, and the overconfidence that leads managers to act on spurious correlations. Essential reading for anyone who makes decisions under uncertainty, which is everyone.

Wheelan, C. (2013). Naked Statistics: Stripping the Dread from the Data. W.W. Norton. An engaging, jargon-free introduction to statistical thinking. Covers distributions, sampling, significance, regression, and many of the concepts from Section 2.8 in greater depth. An excellent choice for readers who found the statistical thinking section of this chapter useful and want to go deeper without diving into formulas.

Vigen, T. (2015). Spurious Correlations. Hachette Books. The book based on the website referenced in this chapter. Hundreds of examples of correlations that are statistically real but causally meaningless. Funny, memorable, and a powerful inoculation against the temptation to treat correlation as causation. The website (tylervigen.com) is freely available and equally instructive.

Silver, N. (2012). The Signal and the Noise: Why So Many Predictions Fail — but Some Don't. Penguin Press. A sweeping examination of prediction across domains — baseball, elections, weather, earthquakes, poker, and the financial crisis. Strong on the themes of uncertainty, overconfidence, and the difference between signal and noise. The chapter on climate models and Chapter 1 on the financial crisis are particularly relevant to business analytics.

Hypothesis-Driven Analysis and Scientific Thinking

Ritchie, S. (2020). Science Fictions: How Fraud, Bias, Negligence, and Hype Undermine the Search for Truth. Metropolitan Books. A sobering examination of how the scientific enterprise goes wrong — through p-hacking, publication bias, and the perverse incentives that reward novel findings over rigorous ones. Directly relevant to the hypothesis-driven analysis section of this chapter and to anyone building an analytical culture in an organization.

Leamer, E.E. (1983). "Let's Take the Con Out of Econometrics." American Economic Review, 73(1), 31–43. A classic paper (accessible despite its age and venue) arguing that empirical researchers' choices about model specification, variable selection, and data handling are rarely transparent — and that different reasonable choices can lead to different conclusions. A foundational text on the importance of analytical transparency and reproducibility.

McRaney, D. (2011). You Are Not So Smart: Why You Have Too Many Friends on Facebook, Why Your Memory Is Mostly Fiction, and 46 Other Ways You're Deluding Yourself. Gotham Books. A lively, accessible tour of cognitive biases and logical fallacies. Lighter than Kahneman but covers many of the same biases that affect business analytical thinking — confirmation bias, survivorship bias, the narrative fallacy, and the Dunning-Kruger effect.

Data Strategy and the Data Pipeline

Redman, T.C. (2008). Data Driven: Profiting from Your Most Important Business Asset. Harvard Business Press. Focused specifically on data quality — the topic that Section 2.10 identifies as the weakest link in most data pipelines. Redman makes a compelling business case for investing in data quality and provides practical frameworks for assessing and improving it.

Patil, D.J., & Mason, H. (2015). Data Driven: Creating a Data Culture. O'Reilly Media. A short, practical guide to building a data-driven organizational culture. Addresses many of the "last mile" challenges discussed in Section 2.7 — how to embed analytics in decision processes, how to communicate findings, and how to build organizational trust in data.

Reis, J., & Housley, M. (2022). Fundamentals of Data Engineering. O'Reilly Media. A comprehensive introduction to modern data engineering — the infrastructure that makes data science possible. Covers the data pipeline in far more depth than this chapter, including ingestion patterns, storage architectures (warehouses, lakes, lakehouses), transformation frameworks, and serving layers. Recommended for readers who want to understand the technical foundation their analytical work depends on.

The Business of Analytics

Lewis, M. (2003). Moneyball: The Art of Winning an Unfair Game. W.W. Norton. The full Moneyball story, treated as a case study in this chapter. Lewis's narrative brings to life the organizational resistance to data-driven decision-making in a way that no business school case study can match. Required reading for understanding the human dimension of analytical transformation.

Duhigg, C. (2012). The Power of Habit: Why We Do What We Do in Life and Business. Random House. Contains the detailed Target pregnancy prediction story that forms Case Study 1 of this chapter. More broadly, explores the science of habit formation and organizational culture change — relevant to the challenge of building new analytical habits in established organizations.

O'Neil, C. (2016). Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown. A critical examination of how data science and machine learning can cause harm when deployed without adequate safeguards. Covers algorithmic bias, feedback loops, and the opacity of models used in hiring, lending, policing, and education. Essential reading for the ethical dimensions that the Target case study raises.

Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt. An accessible introduction to how large-scale data analysis is changing business, science, and society. Covers the shift from sampling to census data, from causation to correlation, and from clean data to messy data. Useful for the broader context of why data science thinking matters.

Academic References

Ioannidis, J.P.A. (2005). "Why Most Published Research Findings Are False." PLoS Medicine, 2(8), e124. One of the most cited papers on scientific methodology. Ioannidis shows how the combination of small sample sizes, small effect sizes, flexibility in research design, and publication bias virtually guarantees that a large proportion of published findings are false. The statistical reasoning underpinning this paper connects directly to the multiple comparisons and p-hacking concerns in Section 2.4.

Wirth, R., & Hipp, J. (2000). "CRISP-DM: Towards a Standard Process Model for Data Mining." Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, 29–39. The academic paper describing the CRISP-DM process model. More concise than the full CRISP-DM documentation and useful for understanding the framework's intellectual foundations.

Stevens, S.S. (1946). "On the Theory of Scales of Measurement." Science, 103(2684), 677–680. The original paper defining the four measurement scales (nominal, ordinal, interval, ratio) discussed in Section 2.9. Brief, readable, and foundational. Understanding why Stevens made these distinctions helps clarify why they matter for modern data analysis and machine learning.

Hand, D.J. (2020). Dark Data: Why What You Don't Know Matters. Princeton University Press. A systematic examination of the many ways that missing, hidden, or overlooked data can mislead analysis. Covers fifteen types of "dark data" — from data we choose not to collect, to data distorted by measurement, to data that doesn't exist yet. Directly relevant to the data quality and data pipeline discussions in this chapter.