Chapter 8: Further Reading
Foundational Texts
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapter 7 (Model Assessment and Selection) is the definitive reference for cross-validation, bootstrap estimation, and the bias-variance tradeoff. Freely available at https://hastie.su.domains/ElemStatLearn/.
- Raschka, S. (2018). "Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning." arXiv:1811.12808. A comprehensive and highly practical survey covering holdout methods, cross-validation, nested cross-validation, and statistical comparison tests. Essential reading for practitioners.
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 5 covers model selection and evaluation from a Bayesian perspective, including Bayesian model comparison and marginal likelihood. Freely available at https://probml.github.io/pml-book/.
- Japkowicz, N. and Shah, M. (2011). Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press. Dedicated entirely to evaluation methodology, covering metrics, statistical tests, and domain-specific evaluation.
Key Papers
Cross-Validation
- Stone, M. (1974). "Cross-Validatory Choice and Assessment of Statistical Predictions." Journal of the Royal Statistical Society, Series B, 36(2), 111--147. The foundational paper on cross-validation.
- Dietterich, T. G. (1998). "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms." Neural Computation, 10(7), 1895--1923. The seminal paper on statistical comparison of classifiers, introducing the 5x2 cross-validation paired t-test.
- Varma, S. and Simon, R. (2006). "Bias in Error Estimation When Using Cross-Validation for Model Selection." BMC Bioinformatics, 7, 91. Demonstrates the optimistic bias when cross-validation is used simultaneously for model selection and performance estimation, motivating nested cross-validation.
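The nested cross-validation scheme that Varma and Simon advocate can be set up directly in scikit-learn by cross-validating a tuned estimator. The sketch below is illustrative only: the breast-cancer dataset, the SVM, and the parameter grid are placeholders, and only the nesting pattern matters.

```python
# Nested cross-validation: the inner loop (GridSearchCV) picks hyperparameters,
# the outer loop estimates generalization error, so tuning cannot leak into
# the reported score.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2]}   # placeholder grid
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# The tuned estimator is itself cross-validated by the outer loop.
tuned_svm = GridSearchCV(SVC(), param_grid, cv=inner_cv)
nested_scores = cross_val_score(tuned_svm, X, y, cv=outer_cv)
print(f"nested CV accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```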
Metrics and Evaluation
- Davis, J. and Goadrich, M. (2006). "The Relationship Between Precision-Recall and ROC Curves." Proceedings of the 23rd International Conference on Machine Learning. Shows when PR curves are more informative than ROC curves, particularly for imbalanced datasets.
- Saito, T. and Rehmsmeier, M. (2015). "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets." PLoS ONE, 10(3). Provides empirical evidence for preferring PR curves over ROC curves for skewed class distributions (a short sketch comparing the two follows this list).
- Ferri, C., Hernandez-Orallo, J., and Modroiu, R. (2009). "An Experimental Comparison of Performance Measures for Classification." Pattern Recognition Letters, 30(1), 27--38. A thorough comparison of over 20 classification metrics, including recommendations for different scenarios.
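As a quick illustration of the point made by Davis and Goadrich and by Saito and Rehmsmeier, the sketch below compares ROC-AUC with average precision (the area under the PR curve) on a heavily imbalanced synthetic dataset. The data generator and the logistic regression classifier are arbitrary choices.

```python
# ROC-AUC vs. average precision (area under the PR curve) on skewed classes.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Roughly 1% positives; generator settings are arbitrary.
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# ROC-AUC can look comfortable on imbalanced data, while average precision
# exposes how hard the minority class actually is.
print(f"ROC-AUC:           {roc_auc_score(y_te, scores):.3f}")
print(f"average precision: {average_precision_score(y_te, scores):.3f}")
```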
Hyperparameter Optimization
- Bergstra, J. and Bengio, Y. (2012). "Random Search for Hyper-Parameter Optimization." Journal of Machine Learning Research, 13, 281--305. Demonstrates that random search is more efficient than grid search, a result that changed common practice (see the sketch following this list).
- Snoek, J., Larochelle, H., and Adams, R. P. (2012). "Practical Bayesian Optimization of Machine Learning Algorithms." Advances in Neural Information Processing Systems, 25. Introduces Bayesian optimization with Gaussian processes for hyperparameter tuning.
- Falkner, S., Klein, A., and Hutter, F. (2018). "BOHB: Robust and Efficient Hyperparameter Optimization at Scale." Proceedings of the 35th International Conference on Machine Learning. Combines Bayesian optimization with bandit-based methods for scalable hyperparameter search.
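Random search in the sense of Bergstra and Bengio is available out of the box as scikit-learn's RandomizedSearchCV. The sketch below is minimal and illustrative; the digits dataset, the SVM, and the sampling distributions are placeholders.

```python
# Random search over continuous hyperparameter distributions, in the spirit
# of Bergstra and Bengio (2012).
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

param_distributions = {
    "C": loguniform(1e-2, 1e3),        # sample C on a log scale
    "gamma": loguniform(1e-4, 1e-1),   # likewise for the RBF width
}

search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=30, cv=5, random_state=0, n_jobs=-1
)
search.fit(X, y)
print(search.best_params_, f"best CV accuracy: {search.best_score_:.3f}")
```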
Bias-Variance and Double Descent
- Geman, S., Bienenstock, E., and Doursat, R. (1992). "Neural Networks and the Bias/Variance Dilemma." Neural Computation, 4(1), 1--58. The classic formalization of the bias-variance decomposition (estimated by simulation in the sketch following this list).
- Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). "Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off." Proceedings of the National Academy of Sciences, 116(32), 15849--15854. Introduces the double descent curve.
- Nakkiran, P., Kaplun, G., Bansal, Y., et al. (2021). "Deep Double Descent: Where Bigger Models and More Data Hurt." Journal of Statistical Mechanics. Extends double descent observations to deep networks and varying dataset sizes.
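The decomposition formalized by Geman et al. can be estimated by Monte Carlo: repeatedly resample a training set, refit the model, and measure how far the average prediction is from the truth (squared bias) and how much individual predictions scatter around that average (variance). The sketch below does this for polynomial regression; the target function, noise level, and degrees are illustrative choices.

```python
# Monte Carlo estimate of the bias-variance decomposition (Geman et al., 1992):
# expected squared error = bias^2 + variance + irreducible noise.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
sigma = 0.2                                   # noise standard deviation (assumed)
x_test = np.linspace(0, 1, 200)[:, None]

def f(x):
    """True regression function (illustrative choice)."""
    return np.sin(2 * np.pi * x)

for degree in (1, 4, 15):                     # underfit / reasonable / overfit
    preds = []
    for _ in range(200):                      # many independent training sets
        x_tr = rng.uniform(0, 1, size=(30, 1))
        y_tr = f(x_tr).ravel() + rng.normal(0, sigma, size=30)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds.append(model.fit(x_tr, y_tr).predict(x_test))
    preds = np.asarray(preds)
    bias_sq = np.mean((preds.mean(axis=0) - f(x_test).ravel()) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree:2d}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```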
Calibration and Fairness
- Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). "On Calibration of Modern Neural Networks." Proceedings of the 34th International Conference on Machine Learning. Shows that modern neural networks are poorly calibrated and that temperature scaling is a simple, effective fix (sketched after this list).
- Chouldechova, A. (2017). "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments." Big Data, 5(2), 153--163. Proves impossibility results for simultaneously satisfying multiple fairness criteria.
- Kleinberg, J., Mullainathan, S., and Raghavan, M. (2017). "Inherent Trade-Offs in the Fair Determination of Risk Scores." Proceedings of Innovations in Theoretical Computer Science. Foundational impossibility results for algorithmic fairness.
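Temperature scaling as described by Guo et al. fits a single scalar T > 0 on held-out validation logits by minimizing the negative log-likelihood of softmax(logits / T). The sketch below is a minimal NumPy/SciPy version, with synthetic logits standing in for the outputs of a trained classifier.

```python
# Temperature scaling (Guo et al., 2017): a minimal NumPy/SciPy sketch.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax

def nll(temperature, logits, labels):
    """Negative log-likelihood of labels under softmax(logits / temperature)."""
    log_probs = log_softmax(logits / temperature, axis=1)
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels):
    """Fit the single scalar T > 0 that minimizes validation NLL."""
    result = minimize_scalar(
        nll, bounds=(0.05, 20.0), method="bounded", args=(val_logits, val_labels)
    )
    return result.x

# Synthetic, deliberately overconfident logits stand in for a real model's
# validation outputs (illustration only).
rng = np.random.default_rng(0)
val_labels = rng.integers(0, 5, size=1000)
val_logits = rng.normal(size=(1000, 5)) * 5.0
val_logits[np.arange(1000), val_labels] += 2.0   # nudge the true class upward
T = fit_temperature(val_logits, val_labels)
print(f"fitted temperature: {T:.2f}")            # T > 1 indicates overconfidence
```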
Online Resources and Tutorials
- scikit-learn Model Evaluation Guide: https://scikit-learn.org/stable/modules/model_evaluation.html --- Comprehensive documentation covering all evaluation metrics, cross-validation strategies, and scoring functions.
- scikit-learn Hyperparameter Tuning Guide: https://scikit-learn.org/stable/modules/grid_search.html --- Practical guide for GridSearchCV, RandomizedSearchCV, and related tools.
- Google's ML Crash Course - Classification: https://developers.google.com/machine-learning/crash-course/classification --- Interactive lessons on precision, recall, ROC, and AUC with visualizations.
- Kaggle "Evaluation Metrics" Notebooks: Kaggle hosts many community notebooks demonstrating evaluation strategies in competitive settings.
Software Libraries
- scikit-learn (sklearn): Provides cross_val_score, GridSearchCV, RandomizedSearchCV, StratifiedKFold, TimeSeriesSplit, and all metrics used in this chapter.
- scikit-optimize (skopt): Bayesian optimization with scikit-learn integration. Install with pip install scikit-optimize. Provides BayesSearchCV.
- Optuna (optuna): A modern hyperparameter optimization framework with pruning, distributed optimization, and dashboard visualization. Install with pip install optuna (a minimal example follows this list).
- Hyperopt (hyperopt): Tree-structured Parzen Estimator (TPE) based hyperparameter optimization. Install with pip install hyperopt.
- FLAML (flaml): Microsoft's fast and lightweight AutoML library that performs efficient hyperparameter tuning. Install with pip install flaml.
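As a taste of the Optuna API listed above, the sketch below tunes two SVM hyperparameters with cross-validated accuracy as the objective; the dataset, model, and search ranges are placeholders.

```python
# Minimal Optuna example: maximize cross-validated accuracy over two SVM
# hyperparameters. Dataset, model, and search ranges are placeholders.
import optuna
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

def objective(trial):
    # Sample hyperparameters on log scales, then score with 3-fold CV.
    c = trial.suggest_float("C", 1e-2, 1e3, log=True)
    gamma = trial.suggest_float("gamma", 1e-4, 1e-1, log=True)
    return cross_val_score(SVC(C=c, gamma=gamma), X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=25)
print(study.best_params, f"best CV accuracy: {study.best_value:.3f}")
```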
Advanced Topics for Further Study
- Conformal Prediction: Provides distribution-free prediction intervals with guaranteed coverage. See Vovk, Gammerman, and Shafer (2005), Algorithmic Learning in a Random World. Implementation: the mapie Python package (a from-scratch sketch follows this list).
- Multi-Objective Optimization: When you need to optimize multiple metrics simultaneously (e.g., accuracy and latency). Optuna supports multi-objective optimization natively.
- Online Evaluation (A/B Testing): See Kohavi, Tang, and Xu (2020), Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.
- Concept Drift Detection: Monitoring model performance in production for distribution shift. See the river Python library for online learning with drift detection.
- Uncertainty Quantification: Beyond point estimates --- Bayesian methods, ensemble uncertainty, and conformal prediction for quantifying prediction confidence. Covered in Chapter 10 (Bayesian methods) and further in Part III (deep learning).
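To make the conformal prediction entry concrete, the sketch below implements split conformal regression from scratch rather than through the mapie package: hold out a calibration set, compute the (1 - alpha) quantile of its absolute residuals, and use that as a symmetric interval width around new predictions. The dataset and base model are arbitrary.

```python
# Split conformal prediction for regression, written from scratch to show
# the mechanism (the mapie package wraps the same idea).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=3000, noise=10.0, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

alpha = 0.1                                    # target 90% coverage
residuals = np.abs(y_cal - model.predict(X_cal))
n = len(residuals)
# Finite-sample-corrected quantile of the calibration residuals.
q = np.quantile(residuals, np.ceil((n + 1) * (1 - alpha)) / n)

y_pred = model.predict(X_test)
lower, upper = y_pred - q, y_pred + q
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"empirical coverage at alpha={alpha}: {coverage:.3f}")
```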