Further Reading: Chapter 16
Model Evaluation Deep Dive
Foundational Papers
1. "Cross-Validatory Choice and Assessment of Statistical Predictions" --- M. Stone (1974) The paper that introduced cross-validation as a formal method for model assessment. Stone showed that leave-one-out cross-validation provides an approximately unbiased estimate of prediction error. While the computational aspects have evolved (K-fold is now preferred to LOO), the theoretical foundation laid here remains the basis for all modern cross-validation practice. Published in the Journal of the Royal Statistical Society, Series B.
2. "No Unbiased Estimator of the Variance of K-Fold Cross-Validation" --- Yoshua Bengio and Yves Grandvalet (2004) A crucial paper for anyone who reports cross-validation standard deviations. Bengio and Grandvalet proved that there is no unbiased way to estimate the variance of K-fold cross-validation, because the training sets overlap across folds. This means the confidence intervals you compute from CV folds are approximate at best. The paper also introduces the corrected resampled t-test for comparing classifiers. Published in the Journal of Machine Learning Research, Vol. 5.
3. "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms" --- Thomas Dietterich (1998) The definitive reference on statistical tests for comparing classifiers. Dietterich evaluates five methods (paired t-test, cross-validated t-test, McNemar's test, 5x2 cross-validation, and the difference of proportions test) on both Type I error rate and statistical power. His recommendation: the 5x2 cross-validated paired t-test is the best general-purpose test. Published in Neural Computation, Vol. 10, No. 7.
Cross-Validation Practice
4. An Introduction to Statistical Learning (ISLR) --- James, Witten, Hastie, Tibshirani (2nd edition, 2021) Chapter 5 is among the most accessible textbook treatments of cross-validation: the bias-variance tradeoff of validation set size, K-fold vs. leave-one-out, and stratification. The Python edition (ISLP, 2023) includes updated lab exercises. Free at statlearning.com.
5. scikit-learn User Guide --- "Cross-validation: evaluating estimator performance" The official scikit-learn documentation on cross-validation strategies is comprehensive and practical. It covers KFold, StratifiedKFold, GroupKFold, StratifiedGroupKFold, TimeSeriesSplit, RepeatedKFold, and custom iterators, with code examples for each. The section on using Pipelines to prevent data leakage during cross-validation is essential (see the sketch at the end of this subsection). Available at scikit-learn.org.
6. "A Survey of Cross-Validation Procedures for Model Selection" --- Sylvain Arlot and Alain Celisse (2010) A comprehensive survey covering the theoretical properties of different cross-validation methods. Especially useful for understanding when leave-one-out is preferable to K-fold (small datasets, linear models) and when repeated K-fold provides better estimates than single K-fold. Published in Statistics Surveys, Vol. 4.
Data Leakage
7. "Leakage in Data Mining: Formulation, Detection, and Avoidance" --- Kaufman et al. (2012) The first systematic treatment of data leakage in machine learning. The authors categorize leakage into training examples that do not meet the independence assumption, features that encode the target, and temporal violations. Published in ACM Transactions on Knowledge Discovery from Data, Vol. 6, No. 4. The examples are drawn from real Kaggle competitions where leakage determined the winning solutions.
8. "Data Leakage in Machine Learning" --- Nisbet, Elder, and Miner (in Handbook of Statistical Analysis and Data Mining Applications, 2009) A practitioner-oriented chapter on detecting and preventing leakage, with a taxonomy of leakage types and a checklist for auditing pipelines. The case studies are drawn from real consulting engagements where leakage inflated model performance. Less theoretical than Kaufman et al. but more practically actionable.
9. Kaggle Data Leakage Case Studies Several high-profile Kaggle competitions have been won (or invalidated) by exploiting data leakage. The "Don't Overfit!" competition, the "BNP Paribas Cardif" competition, and the "Two Sigma Connect" competition all had leakage issues documented in post-competition discussions. Search the Kaggle forums for "data leakage" to find these write-ups. They are the most vivid illustrations of why leakage matters; the sketch below shows the most common form of the bug in code.
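To make the failure mode behind items 7-9 concrete, here is a minimal illustrative sketch of the single most common leakage bug: fitting a preprocessing step on the full dataset before splitting. The dataset is an assumption for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# LEAKY: the scaler's mean/std are computed from rows that later land in
# the test set, so test-set statistics contaminate training.
X_all_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_all_scaled, y, random_state=0)

# SAFE: split first, fit the scaler on the training rows only, then apply
# the frozen transform to the test rows.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
```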
Metrics for Imbalanced Classification
10. "The Relationship Between Precision-Recall and ROC Curves" --- Jesse Davis and Mark Goadrich (2006) The paper that formally demonstrated why AUC-PR is more informative than AUC-ROC for imbalanced datasets. Davis and Goadrich proved that a curve dominates in ROC space if and only if it dominates in PR space, but PR curves can reveal differences between algorithms that ROC curves obscure. Published in ICML 2006.
11. "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets" --- Saito and Rehmsmeier (2015) A practical follow-up to Davis and Goadrich that provides guidelines for when to use AUC-PR over AUC-ROC. The authors show that for positive rates below 10%, AUC-ROC can be misleadingly optimistic. Published in PLoS ONE, Vol. 10, No. 3.
12. Imbalanced Learning: Foundations, Algorithms, and Applications --- He and Ma, eds. (2013) A comprehensive book covering both the evaluation challenges and the algorithmic solutions for imbalanced classification. Chapter 2 on evaluation metrics is particularly relevant to this chapter, covering precision, recall, G-mean, Matthews Correlation Coefficient, and cost-sensitive evaluation.
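A quick way to see the point of items 10 and 11 is to score the same model both ways on a rare-positive dataset. The synthetic data and model below are illustrative assumptions; the contrast between the two metrics is the payload.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# ~5% positives: the regime where Saito and Rehmsmeier warn that AUC-ROC
# can look comfortable while precision is actually poor.
X, y = make_classification(n_samples=20000, weights=[0.95], flip_y=0.02,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

print("AUC-ROC:", round(roc_auc_score(y_te, proba), 3))            # typically high
print("AUC-PR: ", round(average_precision_score(y_te, proba), 3))  # typically much lower
```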
Calibration
13. "Predicting Good Probabilities with Supervised Learning" --- Niculescu-Mizil and Caruana (2005) The landmark study on model calibration that showed Random Forests, SVMs, and boosted trees produce poorly calibrated probabilities, while logistic regression is naturally well-calibrated. The authors compare Platt scaling and isotonic regression as post-hoc calibration methods. Published in ICML 2005. This paper is the foundation for all modern calibration practice.
14. "Obtaining Well Calibrated Probabilities Using Bayesian Binning into Quantiles" --- Naeini, Cooper, and Hauskrecht (2015) Introduces Bayesian Binning into Quantiles (BBQ), a calibration method that extends isotonic regression by using a Bayesian framework to select the optimal binning. Published in AAAI 2015. Read this if isotonic regression produces ragged calibration curves on your data.
15. "On Calibration of Modern Neural Networks" --- Guo et al. (2017) While focused on neural networks, this paper introduces the Expected Calibration Error (ECE) metric and temperature scaling, both of which are now widely used for evaluating and improving calibration in any probabilistic model. Published in ICML 2017.
Statistical Comparison of Classifiers
16. "Statistical Comparisons of Classifiers over Multiple Data Sets" --- Janez Demsar (2006) The definitive guide to comparing multiple classifiers across multiple datasets. Demsar introduces the Friedman test (non-parametric alternative to repeated-measures ANOVA) with Nemenyi post-hoc test for pairwise comparisons, and provides critical difference diagrams for visualizing the results. Published in the Journal of Machine Learning Research, Vol. 7. This paper should be required reading for anyone who compares more than two models.
17. "Inference for the Generalization Error" --- Claude Nadeau and Yoshua Bengio (2003) Introduces the corrected resampled t-test that accounts for the dependence between cross-validation folds. The standard paired t-test on CV scores underestimates variance because folds share training data. The Nadeau-Bengio correction adjusts for this, producing more conservative (and more honest) p-values. Published in Machine Learning, Vol. 52, No. 3.
Practical Guides
18. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow --- Aurelien Geron (3rd edition, 2022) Chapter 2 covers evaluation methodology in a practical, code-first style: train-test split, cross-validation, stratification, and the importance of choosing the right metric. Chapter 3 adds precision-recall tradeoffs, ROC curves, and multi-class metrics. Geron's treatment of the confusion matrix as a storytelling tool is one of the clearest available.
19. Applied Predictive Modeling --- Kuhn and Johnson (2013) Chapters 4 (Over-Fitting and Model Tuning), 11 (Measuring Performance in Classification Models), and 16 (Remedies for Severe Class Imbalance) together form one of the most thorough practical treatments of model evaluation. Kuhn and Johnson provide detailed guidance on resampling strategies, metric selection for imbalanced problems, and cost-sensitive evaluation. The examples use R and the caret package rather than scikit-learn, but the principles are language-agnostic.
20. The Elements of Statistical Learning --- Hastie, Tibshirani, Friedman (2nd edition, 2009) Chapter 7 ("Model Assessment and Selection") is the rigorous mathematical treatment of cross-validation, bootstrap estimation of prediction error, and the bias-variance-complexity tradeoff. Dense but essential if you want to understand why cross-validation works, not just how. Free PDF at the authors' website.
Healthcare-Specific Evaluation
21. "Prediction Models for Diagnosis and Prognosis of Medical Outcomes" --- Steyerberg (2009) A comprehensive guide to evaluating clinical prediction models, covering discrimination (AUC), calibration, decision curve analysis, and external validation. Written for clinical researchers but essential for any data scientist working with health data. The discussion of net benefit and decision curves is particularly relevant to Case Study 2 of this chapter.
22. "Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research" --- Luo et al. (2016) Published in the Journal of Medical Internet Research, this paper provides a checklist for reporting ML model performance in clinical settings. It emphasizes calibration, temporal validation, and the distinction between discrimination and clinical utility --- all themes from this chapter applied to the healthcare domain.
Video Resources
23. StatQuest with Josh Starmer --- "ROC and AUC, Clearly Explained" Starmer's visual explanation of the ROC curve and AUC is the best entry point for learners who find the concept abstract. He builds the curve point by point, showing exactly what each threshold contributes. 12 minutes on YouTube.
24. StatQuest with Josh Starmer --- "Cross Validation" A visual, step-by-step walkthrough of K-fold cross-validation, including why leave-one-out is usually worse than 5-fold or 10-fold. Covers both the intuition and the common mistakes. 8 minutes on YouTube.
How to Use This List
If you read nothing else, read Davis and Goadrich (item 10) on AUC-PR vs. AUC-ROC and Dietterich (item 3) on statistical comparison of classifiers. Together they take about 3 hours and will fundamentally change how you evaluate models.
If you work with imbalanced data (and you probably do), read Saito and Rehmsmeier (item 11) for practical guidance on when AUC-ROC misleads.
If you have been bitten by data leakage (or want to avoid it), read Kaufman et al. (item 7) and search Kaggle forums for real-world leakage case studies.
If you work with health data, Steyerberg (item 21) is essential. Decision curve analysis --- which extends the cost-benefit threshold optimization from Case Study 2 --- is a tool every clinical data scientist should know.
If you want to go deep on theory, Chapter 7 of ESL (item 20) and Bengio/Grandvalet (item 2) on CV variance are the places to start.
This reading list supports Chapter 16: Model Evaluation Deep Dive. Return to the chapter to review concepts before diving in.