Further Reading: Chapter 1

From Analysis to Prediction


Foundational Papers and Books

1. "Statistical Modeling: The Two Cultures" --- Leo Breiman (2001) Statistical Science, Vol. 16, No. 3, pp. 199-231. The paper that crystallized the divide between statistical modeling (inference) and algorithmic modeling (prediction). Breiman argues that statisticians' focus on data models has prevented them from using more flexible, predictive approaches. Essential reading for understanding why ML thinks differently from statistics. Available freely online.

2. An Introduction to Statistical Learning (ISLR) --- James, Witten, Hastie, Tibshirani (2nd edition, 2021) The best bridge between statistical thinking and machine learning. Chapters 1-2 cover the bias-variance tradeoff, model assessment, and the distinction between inference and prediction at exactly the right level of mathematical rigor. The Python edition (ISLP, 2023) includes code labs. Free PDF available at statlearning.com.

3. The Elements of Statistical Learning (ESL) --- Hastie, Tibshirani, Friedman (2nd edition, 2009) The graduate-level companion to ISLR. Chapter 7 (Model Assessment and Selection) provides the definitive treatment of bias-variance decomposition, cross-validation, and the dangers of in-sample evaluation. More mathematical than ISLR, but Chapter 7 alone is worth the effort. Free PDF available at the authors' website.


Practical Guides

4. "Rules of Machine Learning: Best Practices for ML Engineering" --- Martin Zinkevich (Google) A concise document of 43 rules for applied ML, organized by project phase. Rule 1: "Don't be afraid to launch a product without machine learning." Rule 4: "Keep the first model simple and get the infrastructure right." These two rules capture the practitioner mindset better than any textbook. Freely available from Google's ML guides.

5. Designing Machine Learning Systems --- Chip Huyen (O'Reilly, 2022) Chapter 2 ("Introduction to Machine Learning Systems Design") covers problem framing, objective functions, and the gap between ML in research and ML in production. Huyen's treatment of how to translate business requirements into ML specifications is the best practical guide available. Required reading for anyone building models that will actually be deployed.


Blog Posts and Articles

6. "The Problem with Accuracy" --- Various sources Multiple excellent blog posts exist on why accuracy is misleading for imbalanced classification. Start with Google's Machine Learning Crash Course section on classification metrics, which walks through accuracy, precision, recall, and AUC with interactive examples. The key insight: when 92% of your data belongs to one class, a model that always predicts that class scores 92% accuracy without ever identifying a single minority-class case.

7. "How to Frame a Machine Learning Problem" --- Google Cloud Architecture Center A structured guide to the six decisions that define an ML problem: what to predict, the observation unit, the prediction window, feature availability, evaluation criteria, and how the prediction will be consumed. Aligns closely with the framing checklist introduced in this chapter. Freely available in Google's Cloud architecture documentation.
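
The accuracy problem described in item 6 can be reproduced in a few lines. The sketch below is illustrative only: the labels are made up (92 negatives and 8 positives, matching the 92% figure above), and the "model" is a stand-in that always predicts the majority class.

```python
# Hypothetical labels: 92 negatives and 8 positives (illustrative data only).
y_true = [0] * 92 + [1] * 8

# A stand-in "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: fraction of actual positives the model managed to find.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.92
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The accuracy looks respectable, yet recall on the minority class is zero. That gap is why the resources above recommend precision, recall, and AUC alongside accuracy.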


Video and Multimedia

8. "Machine Learning for Everyone" --- Visually Explained (YouTube Series) A series of short, visually rich explanations of core ML concepts. The episodes on bias-variance tradeoff and overfitting use animations that make the concepts click in a way that text alone cannot. Particularly recommended for visual learners who want a complement to the mathematical treatment.

9. Andrew Ng --- Stanford CS229 Lecture 1 (YouTube) The opening lecture of Stanford's machine learning course. Ng's explanation of supervised vs. unsupervised learning and his framing of "when should you use ML?" remains one of the clearest introductions to the field. The first 30 minutes cover the territory of this chapter at a slightly more technical level.


Documentation and Reference

10. scikit-learn User Guide --- "An introduction to machine learning with scikit-learn" The official scikit-learn tutorial walks through a complete classification example: loading data, splitting into train/test, fitting a model, and evaluating. It is sparse but precise, and it establishes the API patterns you will use for the rest of this book. Start here if you want to get your hands on code immediately. Available at scikit-learn.org.

11. scikit-learn --- model_selection.train_test_split Documentation The reference page for train_test_split, the function you will use hundreds of times. Pay attention to the stratify parameter (essential for imbalanced classification) and random_state (essential for reproducibility). Understanding this one function deeply is a better investment than skimming ten tutorials.
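
As a minimal sketch of the two parameters highlighted in item 11, the example below splits a small synthetic dataset. The data (1,000 rows, 8% positive) and the variable names are assumptions made for illustration; only train_test_split and its test_size, stratify, and random_state parameters come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 1,000 rows, 8% positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 80 + [0] * 920)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    stratify=y,       # preserve the class proportions in both splits
    random_state=42,  # make the split reproducible across runs
)

# With stratify=y, both splits should keep roughly the original 8% positive rate.
print("train positive rate:", y_train.mean())
print("test positive rate: ", y_test.mean())
```

Without stratify, a random split of a small imbalanced dataset can leave the test set with too few positives to evaluate on. Without random_state, the split changes on every run, so results cannot be compared.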


Domain-Specific Reading

12. "Hospital Readmissions Reduction Program (HRRP)" --- CMS.gov The official CMS documentation on the Hospital Readmissions Reduction Program. Understanding the regulatory context behind Metro General's readmission prediction problem makes the case study far more concrete. The penalty structure, the target conditions, and the measurement methodology are all specified here. Useful for understanding how real-world constraints shape ML problem framing.


How to Use This List

If you read nothing else, read Breiman (item 1) and ISLR Chapters 1-2 (item 2). Breiman sets the philosophical stage; ISLR provides the mathematical foundation. Together they take about 3 hours and will fundamentally reshape how you think about modeling.

If you are more practice-oriented, start with Zinkevich (item 4) and the scikit-learn tutorial (item 10). Zinkevich tells you what experienced ML engineers wish they had known on day one; scikit-learn gets you writing code.

If you want to go deep on problem framing specifically, Huyen Chapter 2 (item 5) and the Google framing guide (item 7) are the strongest resources available.


This reading list supports Chapter 1: From Analysis to Prediction. Return to the chapter to review concepts before diving in.