Chapter 11 Further Reading: Model Evaluation and Selection


Evaluation Metrics and Methodology

1. Powers, D. M. W. (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation." Journal of Machine Learning Technologies, 2(1), 37-63. A comprehensive survey of classification evaluation metrics, including several that are less commonly discussed but can be useful in specific contexts (e.g., Matthews Correlation Coefficient, informedness). Powers provides clear mathematical definitions and guidance on when each metric is appropriate. An excellent reference for anyone who wants to go deeper than the precision-recall-F1 triad.

2. Davis, J., & Goadrich, M. (2006). "The Relationship Between Precision-Recall and ROC Curves." Proceedings of the 23rd International Conference on Machine Learning, 233-240. The definitive paper on when and why precision-recall curves are more informative than ROC curves, particularly for imbalanced datasets. Davis and Goadrich prove that a curve dominates in ROC space if and only if it dominates in PR space, but the visual impression of the two curves can differ dramatically for imbalanced data. Required reading for anyone evaluating models on rare-event prediction problems.
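The paper's central caution is easy to reproduce. The sketch below (assuming scikit-learn is available; the data are synthetic) trains a single classifier on a problem with roughly 1% positives and compares ROC-AUC against average precision, the usual single-number summary of the PR curve:

```python
# Illustrative sketch: one classifier on an imbalanced problem. ROC-AUC can
# look strong while average precision (the PR-curve summary) reveals how hard
# the rare positive class really is.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=20_000, n_features=20, weights=[0.99, 0.01],  # ~1% positives
    flip_y=0.0, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

roc = roc_auc_score(y_te, scores)
ap = average_precision_score(y_te, scores)
print(f"ROC-AUC: {roc:.3f}   average precision: {ap:.3f}")
```

On data this imbalanced, ROC-AUC typically comes out well above average precision, because the ROC false positive rate is diluted by the huge negative class.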

3. Provost, F., & Fawcett, T. (2013). Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. O'Reilly Media. One of the best introductions to the business logic of model evaluation. Chapters 7 and 8 cover ROC curves, cost-sensitive evaluation, and expected profit calculations in a style that is rigorous but accessible to business readers. The expected value framework presented here directly influenced the cost-sensitive evaluation approach in this chapter. Highly recommended for MBA students who want a more detailed treatment.

4. Flach, P. (2012). Machine Learning: The Art and Science of Algorithms That Make Sense of Data. Cambridge University Press. Flach's textbook provides the most thorough treatment of model evaluation in any introductory ML textbook. His "coverage plot" framework unifies ROC analysis, cost curves, and calibration analysis into a single visual framework. More technical than Provost and Fawcett, but rewarding for readers who want to understand the mathematical foundations of evaluation metrics.


Cost-Sensitive Learning and Evaluation

5. Elkan, C. (2001). "The Foundations of Cost-Sensitive Learning." Proceedings of the 17th International Joint Conference on Artificial Intelligence, 973-978. A foundational paper that formally establishes when and how cost-sensitive evaluation should be applied. Elkan demonstrates that the optimal decision threshold depends on the cost ratio of false positives to false negatives and the class distribution — a result that directly underpins the profit curve analysis in this chapter. Short, elegant, and highly influential.
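Elkan's threshold result takes only a few lines to work through. In the sketch below the cost figures are hypothetical, and correct decisions are assumed to cost nothing:

```python
# Worked example of Elkan's optimal-threshold result (hypothetical costs).
# With zero cost for correct decisions, predicting positive is optimal when
# the expected cost of a false positive, (1 - p) * cost_fp, is at most the
# expected cost of a false negative, p * cost_fn, i.e. when
#     p >= cost_fp / (cost_fp + cost_fn)
def optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability threshold that minimizes expected cost."""
    return cost_fp / (cost_fp + cost_fn)

# Example: a missed churner (false negative) costs 10x a wasted retention
# offer (false positive), so we should intervene at quite low probabilities:
t = optimal_threshold(cost_fp=1.0, cost_fn=10.0)
print(f"Classify as positive when p >= {t:.3f}")  # 1 / 11, about 0.091
```

Note how far the optimal threshold moves from the default 0.5 once costs are asymmetric, which is exactly why accuracy at the default threshold can be a misleading summary of business value.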

6. Domingos, P. (1999). "MetaCost: A General Method for Making Classifiers Cost-Sensitive." Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 155-164. Introduces a practical method for incorporating business costs directly into the model training process, rather than only at the evaluation stage. MetaCost works by relabeling the training data based on the cost matrix and retraining the model, effectively teaching the model to optimize for business value rather than accuracy. Relevant for readers who want to go beyond cost-sensitive evaluation to cost-sensitive training.


Cross-Validation and Hyperparameter Tuning

7. Bergstra, J., & Bengio, Y. (2012). "Random Search for Hyper-Parameter Optimization." Journal of Machine Learning Research, 13, 281-305. The paper that established random search as a serious alternative to grid search for hyperparameter tuning. Bergstra and Bengio show that random search outperforms grid search on most hyperparameter optimization problems because those problems typically have low effective dimensionality: with the same trial budget, grid search revisits the same few values of each individual hyperparameter, while random search tries a fresh value of every hyperparameter on every trial. A landmark paper that changed standard practice across the ML community.
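The distinct-values argument can be illustrated with nothing but the standard library: give both strategies the same nine-trial budget over two hyperparameters and count how many different values of one hyperparameter each strategy actually tries.

```python
# Sketch of Bergstra & Bengio's core argument with a 9-trial budget over two
# hyperparameters. Grid search tries only 3 distinct values of each one;
# random search tries 9. If only one hyperparameter really matters, random
# search probes its axis three times more finely for the same cost.
import itertools
import random

rng = random.Random(0)
grid_axis = [0.1, 0.5, 0.9]
grid_trials = list(itertools.product(grid_axis, grid_axis))       # 9 trials
random_trials = [(rng.random(), rng.random()) for _ in range(9)]  # 9 trials

distinct_grid = len({first for first, _ in grid_trials})
distinct_random = len({first for first, _ in random_trials})
print(f"distinct values of the first hyperparameter tried: "
      f"grid={distinct_grid}, random={distinct_random}")
```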

8. Arlot, S., & Celisse, A. (2010). "A Survey of Cross-Validation Procedures for Model Selection." Statistics Surveys, 4, 40-79. A rigorous survey of cross-validation methods, including leave-one-out, K-fold, stratified, and more exotic variants. The paper analyzes the bias-variance tradeoff of different cross-validation strategies and provides guidance on choosing the right approach for different dataset sizes and structures. More mathematical than most sources listed here, but an essential reference for practitioners who need to justify their cross-validation choices.
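One of the survey's practical recommendations, stratification for imbalanced data, is easy to see directly (assuming scikit-learn is available; the data below are synthetic):

```python
# On a 10%-positive dataset, plain K-fold can leave some test folds with no
# positives at all, while stratified K-fold keeps the class ratio stable.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([1] * 10 + [0] * 90)   # 10% positives, listed first
X = np.zeros((100, 1))              # feature values don't affect splitting

counts = {}
for name, splitter in [("KFold", KFold(n_splits=5)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    counts[name] = [int(y[test].sum()) for _, test in splitter.split(X, y)]
    print(f"{name:>15}: positives per test fold = {counts[name]}")
# Plain KFold on this ordered data: [10, 0, 0, 0, 0]; stratified: [2, 2, 2, 2, 2]
```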

9. Hutter, F., Kotthoff, L., & Vanschoren, J. (Eds.) (2019). Automated Machine Learning: Methods, Systems, Challenges. Springer. A comprehensive overview of AutoML, including extensive coverage of hyperparameter optimization methods (Bayesian optimization, multi-fidelity optimization, neural architecture search). Chapter 1 provides an accessible introduction to the field; later chapters cover state-of-the-art methods. Relevant for readers interested in the practical tools (Optuna, Hyperopt, Auto-sklearn) that implement the Bayesian optimization concepts discussed in this chapter.


A/B Testing and Online Evaluation

10. Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. The definitive guide to A/B testing, written by authors who built and led large-scale experimentation platforms at Microsoft, Google, and LinkedIn (Microsoft's alone runs thousands of concurrent experiments). Covers experimental design, statistical analysis, practical pitfalls (novelty effects, interference, Simpson's paradox), and organizational challenges. If you read one book on A/B testing, this should be it. Directly relevant to the online evaluation section of this chapter.

11. Johari, R., Koomen, P., Pekelis, L., & Walsh, D. (2022). "Always Valid Inference: Continuous Monitoring of A/B Tests." Operations Research, 70(3), 1806-1821. Traditional A/B testing requires committing to a fixed sample size in advance — but in practice, teams often want to peek at results during the experiment. This paper introduces "always valid" inference methods that allow continuous monitoring without inflating false positive rates. A practical advancement that is increasingly adopted by experimentation platforms at companies like Optimizely, Netflix, and Spotify.
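The inflation from peeking is straightforward to simulate with the standard library. The sketch below runs many A/A experiments (no true effect) and compares a single final z-test at |z| > 1.96 against a rule that declares a winner if any of four interim looks crosses the same bar:

```python
# A/A simulation of the "peeking" problem. Each experiment observes 2,000
# standard-normal outcomes in four blocks of 500; we check significance at
# each interim look and once at the end. A block's contribution to the
# running sum is drawn in one step as sqrt(block) * N(0, 1).
import math
import random

rng = random.Random(42)
sims, looks, block = 4000, 4, 500
fp_final = fp_peeking = 0

for _ in range(sims):
    total, n, peeked = 0.0, 0, False
    for _ in range(looks):
        total += rng.gauss(0, 1) * math.sqrt(block)
        n += block
        if abs(total / math.sqrt(n)) > 1.96:   # interim significance check
            peeked = True
    fp_final += abs(total / math.sqrt(n)) > 1.96  # one test at the end
    fp_peeking += peeked

rate_final, rate_peeking = fp_final / sims, fp_peeking / sims
print(f"false positive rate, single final test: {rate_final:.3f}")
print(f"false positive rate, 4 interim peeks:  {rate_peeking:.3f}")
```

With four equally spaced looks, the peeking rule's false positive rate is known to land around 12% rather than the nominal 5%, which is precisely the inflation always-valid inference eliminates.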


Model Selection and Deployment

12. Sculley, D., Holt, G., Golovin, D., et al. (2015). "Hidden Technical Debt in Machine Learning Systems." Advances in Neural Information Processing Systems, 28, 2503-2511. A seminal Google paper arguing that the ML model itself is a small fraction of a real-world ML system, surrounded by infrastructure for data collection, feature engineering, monitoring, and maintenance. The paper introduces the concept of "technical debt" in ML systems — including the risk of dead code and unused models — that is directly relevant to the Knight Capital case study and to the deployment challenges discussed in this chapter and Chapter 12.

13. Paleyes, A., Urma, R. G., & Lawrence, N. D. (2022). "Challenges in Deploying Machine Learning: A Survey of Case Studies." ACM Computing Surveys, 55(6), 1-29. A systematic review of real-world ML deployment failures, categorized by failure mode (data issues, model degradation, infrastructure problems, organizational challenges). The paper provides concrete examples of the gap between offline evaluation and online performance — the central concern of this chapter's A/B testing section. Valuable for building a taxonomy of "what can go wrong."


Calibration and Probability Estimation

14. McMahan, H. B., Holt, G., Sculley, D., et al. (2013). "Ad Click Prediction: A View from the Trenches." Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1222-1230. Google's landmark paper on ad click prediction, which provides rare insight into how one of the world's largest ML systems is evaluated and deployed. The paper's emphasis on calibration, log-loss, and the gap between offline and online metrics directly informs Case Study 1 in this chapter. One of the most cited papers in applied ML, and essential reading for anyone interested in large-scale prediction systems.

15. Niculescu-Mizil, A., & Caruana, R. (2005). "Predicting Good Probabilities with Supervised Learning." Proceedings of the 22nd International Conference on Machine Learning, 625-632. Demonstrates that many classification algorithms (including SVMs, boosted decision trees, and random forests) produce poorly calibrated probability estimates, even when they achieve high AUC. The paper evaluates calibration methods (Platt scaling, isotonic regression) that can improve probability estimates post-hoc. Relevant for any application where calibrated probabilities matter — pricing, risk scoring, clinical decision support.
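The post-hoc fixes the paper evaluates are available directly in scikit-learn. A minimal sketch on synthetic data (naive Bayes is used here as the base model because it is typically overconfident):

```python
# Post-hoc calibration sketch: wrap a classifier in CalibratedClassifierCV
# with isotonic regression and compare Brier scores (lower is better).
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_tr, y_tr)

b_raw = brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1])
b_cal = brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1])
print(f"Brier score raw: {b_raw:.4f}   calibrated: {b_cal:.4f}")
```

Swapping `method="isotonic"` for `method="sigmoid"` gives Platt scaling, the other post-hoc method the paper evaluates; isotonic is more flexible but needs more data to avoid overfitting.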


Business Context and Case Studies

16. SEC (Securities and Exchange Commission). (2013). "In the Matter of Knight Capital Americas LLC: Administrative Proceeding File No. 3-15570." Washington, DC. The official SEC enforcement action against Knight Capital, documenting the regulatory findings and the specific failures in risk management and deployment procedures. A primary source for Case Study 2 that provides details not available in secondary accounts. Freely available on the SEC website.

17. Patterson, S. (2012). "Knight Capital's $440 Million Loss Is Traced to Old Software, Routed to Wrong Venue." The Wall Street Journal, August 14, 2012. A well-reported journalistic account of the Knight Capital disaster that provides context for the SEC's findings. Useful for understanding the human and organizational dimensions of the failure, beyond the technical details.

18. Lewis, M. (2014). Flash Boys: A Wall Street Revolt. W. W. Norton. While not directly about Knight Capital, Lewis's investigation into high-frequency trading provides essential context for understanding the environment in which the disaster occurred — a financial system where algorithmic speed, complexity, and inadequate oversight create systemic risks. Engaging narrative nonfiction that makes the abstract risks of algorithmic decision-making vivid and concrete.


Fairness and Responsible Evaluation

19. Mitchell, M., Wu, S., Zaldivar, A., et al. (2019). "Model Cards for Model Reporting." Proceedings of the Conference on Fairness, Accountability, and Transparency, 220-229. Introduces "model cards" — standardized documentation that reports a model's performance across different demographic groups, use cases, and evaluation conditions. Model cards formalize the idea that evaluation must disaggregate performance by subgroup, not just report overall metrics. This paper directly connects model evaluation (Chapter 11) to fairness (Chapter 25) and is increasingly adopted as an industry standard.

20. Barocas, S., Hardt, M., & Narayanan, A. (2023). Fairness and Machine Learning: Limitations and Opportunities. MIT Press. The most rigorous and comprehensive treatment of fairness in ML, covering definitions, impossibility results, and practical approaches. Chapter 2's discussion of fairness metrics connects directly to the model selection dimension of "fairness and compliance" presented in this chapter. Available as a free online textbook at fairmlbook.org. We will return to this source extensively in Chapter 25.


Practical Tools and Implementation

21. scikit-learn Documentation: "Model Evaluation: Quantifying the Quality of Predictions." Available at scikit-learn.org. The official documentation for scikit-learn's evaluation module, including detailed explanations and code examples for every metric discussed in this chapter. Includes practical guidance on choosing metrics, interpreting results, and implementing custom scoring functions. The most useful day-to-day reference for practitioners implementing the concepts in this chapter.
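As a taste of the module, the sketch below computes the standard metrics on hypothetical predictions and wraps a business-cost function as a custom scorer (the $10/$100 cost figures are made up for illustration):

```python
from sklearn.metrics import (confusion_matrix, f1_score, make_scorer,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]   # hypothetical labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]   # hypothetical predictions

print("precision:", precision_score(y_true, y_pred))   # 0.75
print("recall:   ", recall_score(y_true, y_pred))      # 0.75
print("F1:       ", f1_score(y_true, y_pred))          # 0.75

# A custom business-cost metric: each false positive costs $10,
# each false negative $100 (illustrative figures).
def total_cost(y_true, y_pred, fp_cost=10, fn_cost=100):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp * fp_cost + fn * fn_cost

print("total cost:", total_cost(y_true, y_pred))       # 110
# make_scorer turns the function into a scorer usable with cross_val_score
# or GridSearchCV; greater_is_better=False flips the sign for maximization.
cost_scorer = make_scorer(total_cost, greater_is_better=False)
```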

22. Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). "Optuna: A Next-Generation Hyperparameter Optimization Framework." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2623-2631. Introduces Optuna, an open-source hyperparameter optimization framework that implements Bayesian optimization with pruning — automatically stopping unpromising trials early to save compute. Optuna has become one of the most popular tools for hyperparameter tuning in practice. Relevant for readers who want to implement the Bayesian optimization concepts discussed in this chapter.

23. Bouthillier, X., Delaunay, P., Bronzi, M., et al. (2021). "Accounting for Variance in Machine Learning Benchmarks." Proceedings of Machine Learning and Systems, 3, 747-769. Demonstrates that much of the claimed improvement in ML research papers is within the range of variance caused by random seeds, data splits, and hardware differences. The paper argues for more rigorous reporting of variance — exactly the practice that cross-validation enables. A sobering read that reinforces why single-number performance claims should always be accompanied by confidence intervals.

24. Raschka, S. (2018). "Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning." arXiv preprint arXiv:1811.12808. A comprehensive tutorial covering the statistical foundations of model evaluation and selection, including bootstrap methods, nested cross-validation, and statistical tests for comparing models. More technical than the treatment in this chapter, but accessible to readers with basic statistics knowledge. Excellent for anyone who wants to understand why the evaluation practices in this chapter work, not just how to implement them.