Chapter 8: Further Reading
Foundational Texts
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapter 7 (Model Assessment and Selection) is the definitive reference for cross-validation, bootstrap estimation, and the bias-variance tradeoff. Freely available at https://hastie.su.domains/ElemStatLearn/.
- Raschka, S. (2018). "Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning." arXiv:1811.12808. A comprehensive and highly practical survey covering holdout methods, cross-validation, nested cross-validation, and statistical comparison tests. Essential reading for practitioners.
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 5 covers model selection and evaluation from a Bayesian perspective, including Bayesian model comparison and marginal likelihood. Freely available at https://probml.github.io/pml-book/.
- Japkowicz, N. and Shah, M. (2011). Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press. Dedicated entirely to evaluation methodology, covering metrics, statistical tests, and domain-specific evaluation.
Key Papers
Cross-Validation
- Stone, M. (1974). "Cross-Validatory Choice and Assessment of Statistical Predictions." Journal of the Royal Statistical Society, Series B, 36(2), 111--147. The foundational paper on cross-validation.
- Dietterich, T. G. (1998). "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms." Neural Computation, 10(7), 1895--1923. The seminal paper on statistical comparison of classifiers, introducing the 5x2 cross-validation paired t-test.
- Varma, S. and Simon, R. (2006). "Bias in Error Estimation When Using Cross-Validation for Model Selection." BMC Bioinformatics, 7, 91. Demonstrates the optimistic bias when cross-validation is used simultaneously for model selection and performance estimation, motivating nested cross-validation.
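The nested cross-validation scheme that Varma and Simon advocate can be set up directly in scikit-learn by cross-validating a tuned estimator. The sketch below is illustrative only: the breast-cancer dataset, the SVM, and the parameter grid are placeholders, and only the nesting pattern matters.

```python
# Nested cross-validation: the inner loop (GridSearchCV) picks hyperparameters,
# the outer loop estimates generalization error, so tuning cannot leak into
# the reported score.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2]}   # placeholder grid
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# The tuned estimator is itself cross-validated by the outer loop.
tuned_svm = GridSearchCV(SVC(), param_grid, cv=inner_cv)
nested_scores = cross_val_score(tuned_svm, X, y, cv=outer_cv)
print(f"nested CV accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```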
Metrics and Evaluation
- Davis, J. and Goadrich, M. (2006). "The Relationship Between Precision-Recall and ROC Curves." Proceedings of the 23rd International Conference on Machine Learning. Shows when PR curves are more informative than ROC curves, particularly for imbalanced datasets.
- Saito, T. and Rehmsmeier, M. (2015). "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets." PLoS ONE, 10(3). Provides empirical evidence for preferring PR curves over ROC curves for skewed class distributions (a short sketch comparing the two follows this list).
- Ferri, C., Hernandez-Orallo, J., and Modroiu, R. (2009). "An Experimental Comparison of Performance Measures for Classification." Pattern Recognition Letters, 30(1), 27--38. A thorough comparison of over 20 classification metrics, including recommendations for different scenarios.
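As a quick illustration of the point made by Davis and Goadrich and by Saito and Rehmsmeier, the sketch below compares ROC-AUC with average precision (the area under the PR curve) on a heavily imbalanced synthetic dataset. The data generator and the logistic regression classifier are arbitrary choices.

```python
# ROC-AUC vs. average precision (area under the PR curve) on skewed classes.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Roughly 1% positives; generator settings are arbitrary.
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# ROC-AUC can look comfortable on imbalanced data, while average precision
# exposes how hard the minority class actually is.
print(f"ROC-AUC:           {roc_auc_score(y_te, scores):.3f}")
print(f"average precision: {average_precision_score(y_te, scores):.3f}")
```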
Hyperparameter Optimization
- Bergstra, J. and Bengio, Y. (2012). "Random Search for Hyper-Parameter Optimization." Journal of Machine Learning Research, 13, 281--305. Demonstrates that random search is more efficient than grid search, a result that changed common practice (see the sketch following this list).
- Snoek, J., Larochelle, H., and Adams, R. P. (2012). "Practical Bayesian Optimization of Machine Learning Algorithms." Advances in Neural Information Processing Systems, 25. Introduces Bayesian optimization with Gaussian processes for hyperparameter tuning.
- Falkner, S., Klein, A., and Hutter, F. (2018). "BOHB: Robust and Efficient Hyperparameter Optimization at Scale." Proceedings of the 35th International Conference on Machine Learning. Combines Bayesian optimization with bandit-based methods for scalable hyperparameter search.
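Random search in the sense of Bergstra and Bengio is available out of the box as scikit-learn's RandomizedSearchCV. The sketch below is minimal and illustrative; the digits dataset, the SVM, and the sampling distributions are placeholders.

```python
# Random search over continuous hyperparameter distributions, in the spirit
# of Bergstra and Bengio (2012).
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

param_distributions = {
    "C": loguniform(1e-2, 1e3),        # sample C on a log scale
    "gamma": loguniform(1e-4, 1e-1),   # likewise for the RBF width
}

search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=30, cv=5, random_state=0, n_jobs=-1
)
search.fit(X, y)
print(search.best_params_, f"best CV accuracy: {search.best_score_:.3f}")
```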
Bias-Variance and Double Descent
- Geman, S., Bienenstock, E., and Doursat, R. (1992). "Neural Networks and the Bias/Variance Dilemma." Neural Computation, 4(1), 1--58. The classic formalization of the bias-variance decomposition (estimated by simulation in the sketch following this list).
- Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). "Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off." Proceedings of the National Academy of Sciences, 116(32), 15849--15854. Introduces the double descent curve.
- Nakkiran, P., Kaplun, G., Bansal, Y., et al. (2021). "Deep Double Descent: Where Bigger Models and More Data Hurt." Journal of Statistical Mechanics. Extends double descent observations to deep networks and varying dataset sizes.
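The decomposition formalized by Geman et al. can be estimated by Monte Carlo: repeatedly resample a training set, refit the model, and measure how far the average prediction is from the truth (squared bias) and how much individual predictions scatter around that average (variance). The sketch below does this for polynomial regression; the target function, noise level, and degrees are illustrative choices.

```python
# Monte Carlo estimate of the bias-variance decomposition (Geman et al., 1992):
# expected squared error = bias^2 + variance + irreducible noise.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
sigma = 0.2                                   # noise standard deviation (assumed)
x_test = np.linspace(0, 1, 200)[:, None]

def f(x):
    """True regression function (illustrative choice)."""
    return np.sin(2 * np.pi * x)

for degree in (1, 4, 15):                     # underfit / reasonable / overfit
    preds = []
    for _ in range(200):                      # many independent training sets
        x_tr = rng.uniform(0, 1, size=(30, 1))
        y_tr = f(x_tr).ravel() + rng.normal(0, sigma, size=30)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds.append(model.fit(x_tr, y_tr).predict(x_test))
    preds = np.asarray(preds)
    bias_sq = np.mean((preds.mean(axis=0) - f(x_test).ravel()) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree:2d}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```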
Calibration and Fairness
- Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). "On Calibration of Modern Neural Networks." Proceedings of the 34th International Conference on Machine Learning. Shows that modern neural networks are poorly calibrated and that temperature scaling is a simple, effective fix (sketched after this list).
- Chouldechova, A. (2017). "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments." Big Data, 5(2), 153--163. Proves impossibility results for simultaneously satisfying multiple fairness criteria.
- Kleinberg, J., Mullainathan, S., and Raghavan, M. (2017). "Inherent Trade-Offs in the Fair Determination of Risk Scores." Proceedings of Innovations in Theoretical Computer Science. Foundational impossibility results for algorithmic fairness.
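Temperature scaling as described by Guo et al. fits a single scalar T > 0 on held-out validation logits by minimizing the negative log-likelihood of softmax(logits / T). The sketch below is a minimal NumPy/SciPy version, with synthetic logits standing in for the outputs of a trained classifier.

```python
# Temperature scaling (Guo et al., 2017): a minimal NumPy/SciPy sketch.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax

def nll(temperature, logits, labels):
    """Negative log-likelihood of labels under softmax(logits / temperature)."""
    log_probs = log_softmax(logits / temperature, axis=1)
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels):
    """Fit the single scalar T > 0 that minimizes validation NLL."""
    result = minimize_scalar(
        nll, bounds=(0.05, 20.0), method="bounded", args=(val_logits, val_labels)
    )
    return result.x

# Synthetic, deliberately overconfident logits stand in for a real model's
# validation outputs (illustration only).
rng = np.random.default_rng(0)
val_labels = rng.integers(0, 5, size=1000)
val_logits = rng.normal(size=(1000, 5)) * 5.0
val_logits[np.arange(1000), val_labels] += 2.0   # nudge the true class upward
T = fit_temperature(val_logits, val_labels)
print(f"fitted temperature: {T:.2f}")            # T > 1 indicates overconfidence
```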
Online Resources and Tutorials
- scikit-learn Model Evaluation Guide: https://scikit-learn.org/stable/modules/model_evaluation.html --- Comprehensive documentation covering all evaluation metrics, cross-validation strategies, and scoring functions.
- scikit-learn Hyperparameter Tuning Guide: https://scikit-learn.org/stable/modules/grid_search.html --- Practical guide for GridSearchCV, RandomizedSearchCV, and related tools.
- Google's ML Crash Course - Classification: https://developers.google.com/machine-learning/crash-course/classification --- Interactive lessons on precision, recall, ROC, and AUC with visualizations.
- Kaggle "Evaluation Metrics" Notebooks: Kaggle hosts many community notebooks demonstrating evaluation strategies in competitive settings.
Software Libraries
- scikit-learn (sklearn): Provides cross_val_score, GridSearchCV, RandomizedSearchCV, StratifiedKFold, TimeSeriesSplit, and all metrics used in this chapter.
- scikit-optimize (skopt): Bayesian optimization with scikit-learn integration. Install with pip install scikit-optimize. Provides BayesSearchCV.
- Optuna (optuna): A modern hyperparameter optimization framework with pruning, distributed optimization, and dashboard visualization. Install with pip install optuna (a minimal example follows this list).
- Hyperopt (hyperopt): Tree-structured Parzen Estimator (TPE) based hyperparameter optimization. Install with pip install hyperopt.
- FLAML (flaml): Microsoft's fast and lightweight AutoML library that performs efficient hyperparameter tuning. Install with pip install flaml.
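As a taste of the Optuna API listed above, the sketch below tunes two SVM hyperparameters with cross-validated accuracy as the objective; the dataset, model, and search ranges are placeholders.

```python
# Minimal Optuna example: maximize cross-validated accuracy over two SVM
# hyperparameters. Dataset, model, and search ranges are placeholders.
import optuna
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

def objective(trial):
    # Sample hyperparameters on log scales, then score with 3-fold CV.
    c = trial.suggest_float("C", 1e-2, 1e3, log=True)
    gamma = trial.suggest_float("gamma", 1e-4, 1e-1, log=True)
    return cross_val_score(SVC(C=c, gamma=gamma), X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=25)
print(study.best_params, f"best CV accuracy: {study.best_value:.3f}")
```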
Advanced Topics for Further Study
- Conformal Prediction: Provides distribution-free prediction intervals with guaranteed coverage. See Vovk, Gammerman, and Shafer (2005), Algorithmic Learning in a Random World. Implementation: the mapie Python package (a from-scratch sketch follows this list).
- Multi-Objective Optimization: When you need to optimize multiple metrics simultaneously (e.g., accuracy and latency). Optuna supports multi-objective optimization natively.
- Online Evaluation (A/B Testing): See Kohavi, Tang, and Xu (2020), Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.
- Concept Drift Detection: Monitoring model performance in production for distribution shift. See the river Python library for online learning with drift detection.
- Uncertainty Quantification: Beyond point estimates --- Bayesian methods, ensemble uncertainty, and conformal prediction for quantifying prediction confidence. Covered in Chapter 10 (Bayesian methods) and further in Part III (deep learning).
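To make the conformal prediction entry concrete, the sketch below implements split conformal regression from scratch rather than through the mapie package: hold out a calibration set, compute the (1 - alpha) quantile of its absolute residuals, and use that as a symmetric interval width around new predictions. The dataset and base model are arbitrary.

```python
# Split conformal prediction for regression, written from scratch to show
# the mechanism (the mapie package wraps the same idea).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=3000, noise=10.0, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

alpha = 0.1                                    # target 90% coverage
residuals = np.abs(y_cal - model.predict(X_cal))
n = len(residuals)
# Finite-sample-corrected quantile of the calibration residuals.
q = np.quantile(residuals, np.ceil((n + 1) * (1 - alpha)) / n)

y_pred = model.predict(X_test)
lower, upper = y_pred - q, y_pred + q
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"empirical coverage at alpha={alpha}: {coverage:.3f}")
```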