Further Reading: Evaluating Models
Model evaluation is where data science stops being an academic exercise and starts being a professional discipline. These resources will deepen your understanding of the metrics, trade-offs, and decision-making that separate good analysis from publishable analysis.
Tier 1: Verified Sources
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning with Applications in Python (Springer, 2nd edition, 2023). Chapter 5 covers cross-validation and the bootstrap with exceptional clarity. Chapter 2 discusses the bias-variance trade-off in detail. The mathematical precision is balanced with intuitive explanations, and the Python edition is freely available online. If you read one additional source on model evaluation, this is the one.
Sebastian Raschka and Vahid Mirjalili, Machine Learning with PyTorch and Scikit-Learn (Packt, 2022). Chapter 6 covers model evaluation and hyperparameter tuning in depth, with practical scikit-learn code throughout. Raschka's treatment of cross-validation, learning curves, and evaluation metrics is among the best in any applied ML textbook. His discussion of nested cross-validation (an advanced topic we only hinted at) is particularly clear.
Andreas Mueller and Sarah Guido, Introduction to Machine Learning with Python (O'Reilly, 2nd edition, 2024). Chapter 5 covers evaluation metrics, cross-validation, and grid search. The examples are practical and the explanations are accessible. Mueller is a core contributor to scikit-learn, so his treatment of the library's evaluation tools is authoritative.
Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (O'Reilly, 3rd edition, 2022). Chapter 3 focuses on classification metrics including precision, recall, ROC curves, and the precision-recall trade-off. Géron's MNIST digit classification example is one of the best extended demonstrations of how these metrics work in practice.
David Spiegelhalter, The Art of Statistics: How to Learn from Data (Basic Books, 2019). While not a machine learning book, Spiegelhalter's discussion of screening tests, sensitivity/specificity, and Bayesian reasoning provides essential context for understanding why precision and recall matter. His medical screening examples will deepen your intuition for the accuracy paradox.
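Spiegelhalter's screening logic can be worked through in a few lines of arithmetic. The numbers below are hypothetical (a 99%-sensitive, 95%-specific test for a condition with 1% prevalence), but they show why an apparently accurate test can still produce mostly false alarms:

```python
# Hypothetical screening test, chosen for illustration:
sensitivity = 0.99   # P(test+ | disease)
specificity = 0.95   # P(test- | no disease)
prevalence = 0.01    # P(disease) in the screened population

# Bayes' rule: P(disease | test+), i.e. the precision of a positive result
true_positives = sensitivity * prevalence
false_positives = (1 - specificity) * (1 - prevalence)
ppv = true_positives / (true_positives + false_positives)

print(f"Precision (PPV) of a positive result: {ppv:.1%}")  # about 17%
```

Despite the impressive-sounding sensitivity and specificity, five out of six positive results are false alarms, because the condition is rare. This is the same base-rate effect behind the accuracy paradox.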
Tier 2: Attributed Resources
Scikit-learn documentation: Model evaluation (User Guide, Section 3.3). The official scikit-learn user guide on model evaluation is comprehensive and well-organized. It covers every scoring metric available in the library, explains when to use each one, and includes code examples. Search for "scikit-learn model evaluation user guide."
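The user guide's catalogue of metrics is exposed through the `scoring` parameter. As a minimal sketch (the synthetic dataset and logistic regression model here are placeholders, not anything from the guide itself):

```python
# Evaluate one model under several of scikit-learn's built-in scorers.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)
model = LogisticRegression(max_iter=1000)

for metric in ["accuracy", "precision", "recall", "f1", "roc_auc"]:
    scores = cross_val_score(model, X, y, cv=5, scoring=metric)
    print(f"{metric:>9}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Swapping a single string swaps the evaluation criterion, which is exactly why the guide's advice on *which* string to choose matters.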
Jesse Davis and Mark Goadrich, "The Relationship between Precision-Recall and ROC Curves" (Proceedings of the 23rd International Conference on Machine Learning, 2006). This paper formally establishes the relationship between ROC curves and precision-recall curves, showing that a curve dominates in ROC space if and only if it dominates in precision-recall space, while optimizing area under the ROC curve does not guarantee optimizing area under the precision-recall curve. If you want to understand the mathematical relationship between these two evaluation frameworks, this is the definitive reference.

Tom Fawcett, "An Introduction to ROC Analysis" (Pattern Recognition Letters, 2006). A clear, accessible tutorial on ROC curves and AUC. Fawcett walks through the construction and interpretation of ROC curves step by step, covering both the theory and practical considerations. This paper has been cited thousands of times and remains the standard reference for understanding ROC analysis.
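The construction Fawcett describes, ranking examples by score, sweeping the decision threshold, and tracing out (FPR, TPR) pairs, is a few lines in scikit-learn. The toy labels and scores below are made up for illustration:

```python
# Build an ROC curve from scored examples and compute its AUC.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.3f}")  # -> 0.875
```

The AUC here equals the probability that a randomly chosen positive outranks a randomly chosen negative (14 of the 16 positive-negative pairs are correctly ordered), one of the interpretations Fawcett's tutorial develops.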
Rachel Thomas and Jeremy Howard (fast.ai), lectures on practical model evaluation. The fast.ai course materials include excellent practical advice on choosing evaluation metrics, handling class imbalance, and avoiding common pitfalls. Their emphasis on using domain-appropriate metrics (not just accuracy) aligns closely with this chapter's themes. Search for "fast.ai practical deep learning" or their written materials on evaluation.
Christoph Molnar, Interpretable Machine Learning (self-published, 2nd edition, 2022). While primarily about interpretability, Molnar's discussion of performance metrics in the context of real-world deployment is valuable. He discusses how metric choice shapes model selection and, ultimately, the people affected by model decisions.
Recommended Next Steps
- If you want deeper understanding of cross-validation: Read Chapter 5 of An Introduction to Statistical Learning. It covers leave-one-out cross-validation, bootstrap methods, and the theoretical justification for why k-fold works. Understanding the math behind cross-validation will make you a better practitioner.
- If you want more practice with metrics: Work through Géron's Chapter 3 (MNIST classification). It provides an extended hands-on exercise in computing and interpreting every metric we covered, with a rich enough dataset to see the metrics in action.
- If the class imbalance problem resonated: Research SMOTE (Synthetic Minority Oversampling Technique) and other resampling methods. The imbalanced-learn library (which integrates with scikit-learn) provides implementations. Dealing with imbalanced classes is one of the most common challenges in applied ML.
- If you're interested in fairness and evaluation by subgroup: Look into the fairlearn library for Python, which provides tools for assessing model fairness across demographic groups. Elena's vaccination model and Jordan's grading model both raise fairness questions that standard metrics can't answer.
- If you're ready to move on: Chapter 30 ties everything together with scikit-learn pipelines. Everything you learned about evaluation in this chapter becomes more powerful when embedded in a proper ML workflow.
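The two cross-validation schemes ISL's Chapter 5 analyzes can be compared directly in scikit-learn. This sketch uses the iris dataset and logistic regression purely as placeholders; the point is that leave-one-out fits one model per sample:

```python
# Compare 5-fold cross-validation with leave-one-out (LOO).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X, y, cv=kfold)   # 5 fits
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())  # one fit per sample

print(f"5-fold mean accuracy: {kfold_scores.mean():.3f}")
print(f"LOO mean accuracy:    {loo_scores.mean():.3f}  ({len(loo_scores)} fits)")
```

LOO has lower bias but higher variance and a much higher computational cost, which is exactly the trade-off ISL's Chapter 5 works through.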
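SMOTE itself lives in the imbalanced-learn library; as a hedged sketch of the simpler idea it builds on (rebalancing by resampling the minority class), here is plain random oversampling using only scikit-learn. The dataset and its 95/5 class split are made up for illustration:

```python
# Randomly oversample the minority class until the classes are balanced.
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

X_maj, X_min = X[y == 0], X[y == 1]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([np.zeros(len(X_maj)), np.ones(len(X_min_up))])
print("after: ", Counter(y_bal))
```

SMOTE improves on this by interpolating *new* synthetic minority examples between neighbors rather than duplicating existing ones, which reduces overfitting to repeated points.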
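fairlearn provides tools (such as MetricFrame) that automate evaluation by subgroup; the underlying idea can be sketched with plain scikit-learn. The labels, predictions, and group assignments below are entirely made up to show a disparity that overall metrics would hide:

```python
# Compute recall separately for each demographic group.
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 0, 1, 0])
group = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

for g in np.unique(group):
    mask = group == g
    r = recall_score(y_true[mask], y_pred[mask])
    print(f"group {g}: recall = {r:.2f}")
```

Here the model finds two of three positives in group A but only one of three in group B, a gap that a single aggregate recall score would average away.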