Further Reading: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
This chapter introduced the foundational ideas behind modeling — ideas that entire books are written about. Whether you want to deepen your understanding of the bias-variance tradeoff, explore the philosophy of modeling, or dive into the ethical dimensions of predictive systems, these resources will serve you well.
Tier 1: Verified Sources
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning (Springer, 2nd edition, 2021). This is arguably the best introduction to statistical learning for non-specialists. Chapter 2 covers the bias-variance tradeoff with excellent visualizations, and the rest of the book builds directly on the concepts from this chapter. A free PDF is available from the authors' website. If you read one additional book from this list, make it this one.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning (Springer, 2nd edition, 2009). The more advanced companion to the book above. If An Introduction to Statistical Learning is the undergraduate text, this is the graduate text. Chapter 7 contains a rigorous treatment of the bias-variance tradeoff. Free PDF available from the authors. More mathematical, but definitive.
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World (Basic Books, 2015). A readable, non-technical tour of the five main schools of machine learning. Domingos explains how different communities think about models — from symbolic reasoning to neural networks — and argues that the bias-variance tradeoff is one of the central tensions in all of machine learning.
Cathy O'Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (Crown, 2016). The essential book on the ethical dangers of predictive models. O'Neil, a mathematician turned data scientist, catalogs cases where models have caused real harm — in education, criminal justice, lending, and hiring. Directly relevant to Case Study 2 and the ethical concerns discussed in this chapter.
George E. P. Box, "Science and Statistics," Journal of the American Statistical Association 71, no. 356 (1976): 791-799. The paper where Box first articulated the famous "all models are wrong" philosophy. Short, readable, and foundational. Box argues that the goal of modeling is not truth but utility — a philosophy that underpins our entire approach.
Leo Breiman, "Statistical Modeling: The Two Cultures," Statistical Science 16, no. 3 (2001): 199-231. A landmark paper about the tension between prediction and explanation (what Breiman calls the "data modeling culture" vs. the "algorithmic modeling culture"). Provocative and still debated. Directly relevant to Section 25.2 of this chapter.
Tier 2: Attributed Resources
StatQuest with Josh Starmer (YouTube channel). Josh Starmer's videos on the bias-variance tradeoff, overfitting, and train-test splits are some of the clearest explanations available anywhere. He uses simple visualizations and a conversational style that makes complex ideas accessible. Search "StatQuest bias variance tradeoff."
3Blue1Brown, "But what is a neural network?" (YouTube). While neural networks are beyond the scope of this book, Grant Sanderson's visual explanations of how models learn from data are illuminating. The first video in his neural network series beautifully illustrates what it means for a model to "fit" data.
Galit Shmueli, "To Explain or to Predict?" Statistical Science 25, no. 3 (2010): 289-310. A careful analysis of the distinction between prediction and explanation in statistical modeling, with implications for how models should be evaluated. More technical than our treatment but excellent for the deep-dive reader.
Jake VanderPlas, Python Data Science Handbook (O'Reilly, 2016). Chapter 5 covers machine learning with scikit-learn, including a clear treatment of overfitting, underfitting, and model validation. The code examples complement our approach and use the same libraries.
scikit-learn documentation (scikit-learn.org). The official documentation includes excellent user guides on concepts like cross-validation, model selection, and the bias-variance tradeoff. The "Tutorial" and "User Guide" sections are well-written and include runnable code examples.
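The fit-predict-score pattern that VanderPlas and the scikit-learn documentation both teach can be sketched in miniature. The class below is hypothetical illustration code, not taken from scikit-learn, but it follows the same interface conventions: `fit` learns from data and returns `self`, learned attributes get a trailing underscore, and `score` returns a goodness-of-fit number (R² for regressors).

```python
# A minimal sketch of the scikit-learn fit/predict/score interface,
# illustrated with a baseline "predict the mean" model. Hypothetical
# code that only mirrors the library's conventions.

class MeanRegressor:
    """Predicts the mean of the training targets, whatever the input."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)  # trailing underscore: learned attribute
        return self                   # fit returns self, as in scikit-learn

    def predict(self, X):
        return [self.mean_ for _ in X]

    def score(self, X, y):
        # R^2, scikit-learn's default regression score
        y_pred = self.predict(X)
        y_mean = sum(y) / len(y)
        ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
        ss_tot = sum((yi - y_mean) ** 2 for yi in y)
        return 1 - ss_res / ss_tot

model = MeanRegressor().fit([[1], [2], [3]], [10, 20, 30])
print(model.predict([[4]]))  # -> [20.0]
```

Note that this baseline scores R² = 0 on its own training data by construction, which is exactly why it makes a useful floor to compare real models against.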
Recommended Next Steps
- If the bias-variance tradeoff fascinates you: Start with James et al.'s An Introduction to Statistical Learning, Chapter 2. The visualizations of bias and variance as functions of model complexity are outstanding and will deepen the intuition from this chapter.
- If the ethical concerns worry you: Read O'Neil's Weapons of Math Destruction. Every data scientist should read this book. Then explore the Algorithmic Justice League (founded by Joy Buolamwini) and the ACM Conference on Fairness, Accountability, and Transparency (FAccT) for current research.
- If you want to understand the prediction vs. explanation debate: Read Breiman's "Two Cultures" paper and Shmueli's "To Explain or to Predict?" Together, they frame a debate that continues to shape the field of data science.
- If you learn best from video: Watch StatQuest's series on machine learning fundamentals. Start with "Bias and Variance" and work through "Cross Validation" and "The Main Ideas of Fitting a Line to Data."
- If you want more hands-on practice: Work through the scikit-learn tutorials at scikit-learn.org. The "Getting Started" tutorial walks you through the fit-predict-score workflow with real datasets.
- If you're interested in the history of modeling: Box's 1976 paper is short and accessible. For a broader history, try Nate Silver's The Signal and the Noise: Why So Many Predictions Fail — but Some Don't (Penguin, 2012), which traces the history of prediction from weather forecasting to baseball to elections.
- If you're ready to move on: Chapter 26 introduces linear regression — your first predictive model. You'll use the train-test splitting, baseline comparison, and overfitting awareness from this chapter with actual scikit-learn code. The concepts become concrete.
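The train-test-split-plus-baseline workflow mentioned above can be previewed in plain Python before you meet the scikit-learn version. This is a hedged sketch with made-up data: the dataset, the 80/20 split ratio, and the mean baseline are all illustrative choices, not the chapter's own code.

```python
import random

# Hedged preview of the train-test workflow: split, fit a baseline on
# the training portion only, evaluate on held-out data. Made-up data.
random.seed(0)
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(20)]  # y ~ 2x + noise

random.shuffle(data)                       # shuffle before splitting
split = int(0.8 * len(data))
train, test = data[:split], data[split:]   # 80/20 train-test split

# Baseline: always predict the mean of the *training* targets.
baseline = sum(y for _, y in train) / len(train)

def mse(pairs, predict):
    """Mean squared error of a predictor over (x, y) pairs."""
    return sum((y - predict(x)) ** 2 for x, y in pairs) / len(pairs)

baseline_mse = mse(test, lambda x: baseline)
print(f"baseline test MSE: {baseline_mse:.2f}")
```

The point of the baseline is the comparison it enables: any model you build in Chapter 26 should beat this number on the test set, or it has learned nothing useful.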
A Final Thought
This chapter asked you to think about what a model is before building one. That might have seemed like unnecessary philosophy — why not just start coding? But the students who struggle most with machine learning are not the ones who can't write the code. They're the ones who don't understand what the code is doing or why.
Overfitting is not a software bug. It's a consequence of a deep mathematical tradeoff between capturing patterns and capturing noise. Understanding that tradeoff — really understanding it, not just knowing the vocabulary — will make you a better modeler than any amount of code memorization.
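That tradeoff can be felt directly in a few lines of code. The toy example below (illustrative only, not from the chapter) compares a "memorizer" that stores every training point exactly against a simple line: the memorizer achieves zero training error by capturing the noise along with the pattern, which is precisely what overfitting means.

```python
import random

# Toy illustration of overfitting: a memorizer has zero training error
# because it stores every training point, noise included. Data and
# models here are made up for demonstration.
random.seed(1)
train = [(x, x + random.gauss(0, 0.5)) for x in range(10)]
test = [(x + 0.5, x + 0.5 + random.gauss(0, 0.5)) for x in range(10)]

# Memorizer: look up the stored y for the nearest training x.
table = dict(train)
def memorizer(x):
    nearest = min(table, key=lambda t: abs(t - x))
    return table[nearest]

# Smoother alternative: the line y = x (some bias, low variance).
def line(x):
    return x

def mse(pairs, predict):
    return sum((y - predict(x)) ** 2 for x, y in pairs) / len(pairs)

print("memorizer train MSE:", mse(train, memorizer))  # exactly 0.0
print("memorizer test  MSE:", mse(test, memorizer))
print("line      test  MSE:", mse(test, line))
```

The memorizer's training error is exactly zero by construction, yet its test error reflects both the training noise it memorized and the new points' noise; the comparison against the line is the bias-variance tradeoff in miniature.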
You now have the conceptual foundation. Let's build on it.