Chapter 30 Further Reading: Model Evaluation and Selection

The following annotated bibliography provides resources for deeper exploration of the model evaluation and selection concepts introduced in Chapter 30. Entries are organized by category and chosen for their relevance to evaluating probabilistic prediction models in sports betting contexts.


Books and Foundational Papers: Forecasting and Evaluation

1. Gneiting, Tilmann and Raftery, Adrian E. "Strictly Proper Scoring Rules, Prediction, and Estimation." Journal of the American Statistical Association, 102(477), 2007, pp. 359-378. The definitive theoretical treatment of proper scoring rules. Provides a unified framework for understanding the Brier score, log loss, and the continuous ranked probability score (CRPS) as members of a single family. Essential reading for understanding why proper scoring rules are necessary and how they relate to each other. The proofs of properness for common scoring rules are clearly presented.
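
As a concrete reference point, the two scoring rules used most often in Chapter 30 can be computed in a few lines; the sketch below is illustrative only, assuming binary outcomes coded as 0/1 and toy forecast values.

    import numpy as np

    def brier_score(p, y):
        """Mean squared difference between forecast probabilities and binary outcomes (lower is better)."""
        p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
        return float(np.mean((p - y) ** 2))

    def log_loss(p, y, eps=1e-15):
        """Negative mean log-likelihood of the observed outcomes, clipped for numerical stability."""
        p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
        y = np.asarray(y, dtype=float)
        return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

    forecasts = np.array([0.65, 0.30, 0.80, 0.55])   # illustrative home-win probabilities
    outcomes = np.array([1, 0, 0, 1])                # observed results
    print(brier_score(forecasts, outcomes), log_loss(forecasts, outcomes))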

2. Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. The Elements of Statistical Learning. 2nd ed., Springer, 2009. The comprehensive reference for model selection theory. Chapter 7 (Model Assessment and Selection) provides the mathematical foundations for cross-validation, bootstrap estimation, AIC, BIC, and the bias-variance tradeoff. Section 7.10 on cross-validation specifically addresses common pitfalls, such as performing feature selection on the full dataset before splitting. The treatment of effective degrees of freedom is particularly relevant for comparing neural networks to simpler models.

3. Hyndman, Rob J. and Athanasopoulos, George. Forecasting: Principles and Practice. 3rd ed., OTexts, 2021. A practical introduction to forecasting with strong coverage of evaluation metrics and cross-validation for time series. The sections of Chapter 5 on evaluating forecast accuracy and on time series cross-validation directly inform the walk-forward validation framework in this chapter. Freely available at otexts.com/fpp3. Recommended for readers who want additional perspective on temporal validation beyond the sports prediction context.
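
One way to produce the expanding-window splits that walk-forward validation relies on is scikit-learn's TimeSeriesSplit; in the sketch below the feature matrix is a random placeholder standing in for chronologically sorted game data.

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    # Placeholder feature matrix and labels, one row per game, already sorted by date.
    rng = np.random.default_rng(0)
    X = rng.random((500, 8))
    y = rng.integers(0, 2, size=500)

    # Expanding-window walk-forward folds: each fold trains only on earlier games.
    tscv = TimeSeriesSplit(n_splits=5, test_size=50)
    for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
        print(f"fold {fold}: train games 0-{train_idx[-1]}, test games {test_idx[0]}-{test_idx[-1]}")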

4. Brier, Glenn W. "Verification of Forecasts Expressed in Terms of Probability." Monthly Weather Review, 78(1), 1950, pp. 1-3. The original paper introducing the Brier score for evaluating probabilistic weather forecasts. At only three pages, it is one of the most influential short papers in the history of forecasting. The decomposition into calibration and resolution components (developed later by Murphy) was implicit in Brier's original framework.


Academic Papers: Scoring Rules and Calibration

5. Murphy, Allan H. "A New Vector Partition of the Probability Score." Journal of Applied Meteorology and Climatology, 12(4), 1973, pp. 595-600. Introduced the three-component decomposition of the Brier score into reliability, resolution, and uncertainty. This decomposition, used extensively in Chapter 30, provides diagnostic insight that the raw Brier score alone cannot offer. Murphy's interpretation of each component remains the standard reference.
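
A minimal binned sketch of the three-component decomposition is below; with a finite number of bins the identity Brier = reliability - resolution + uncertainty holds only approximately (it is exact when grouping by distinct forecast values).

    import numpy as np

    def brier_decomposition(p, y, n_bins=10):
        """Binned Murphy decomposition of the Brier score into (reliability, resolution, uncertainty)."""
        p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
        n, base_rate = len(y), y.mean()
        bin_ids = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
        rel = res = 0.0
        for b in range(n_bins):
            mask = bin_ids == b
            if not mask.any():
                continue
            f_b, o_b = p[mask].mean(), y[mask].mean()   # mean forecast and observed frequency in bin
            rel += mask.sum() * (f_b - o_b) ** 2
            res += mask.sum() * (o_b - base_rate) ** 2
        unc = base_rate * (1 - base_rate)
        return rel / n, res / n, unc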

6. Niculescu-Mizil, Alexandru and Caruana, Rich. "Predicting Good Probabilities with Supervised Learning." Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 625-632. The landmark paper on calibration of machine learning classifiers. Demonstrates that many commonly used classifiers (SVMs, boosted trees, random forests) produce poorly calibrated probabilities and that post-hoc calibration (Platt scaling, isotonic regression) can significantly improve probability quality. The finding that boosting distorts probabilities by pushing predictions away from 0 and 1, producing a characteristic sigmoid-shaped reliability curve, is directly relevant to the XGBoost calibration analysis in this chapter.
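
In its simplest form, Platt scaling fits a one-feature logistic regression that maps held-out model scores to calibrated probabilities. The sketch below uses scikit-learn's LogisticRegression on synthetic placeholder scores; Platt's original procedure uses a slightly different regularized fit, so treat this as an approximation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    scores_cal = rng.normal(size=1000)                             # held-out raw scores from some base model
    y_cal = (scores_cal + rng.normal(size=1000) > 0).astype(int)   # placeholder outcomes

    # Platt scaling: a logistic regression with the raw score as its single feature.
    platt = LogisticRegression()
    platt.fit(scores_cal.reshape(-1, 1), y_cal)

    new_scores = np.array([-1.2, 0.0, 2.5])
    calibrated_probs = platt.predict_proba(new_scores.reshape(-1, 1))[:, 1]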

7. Guo, Chuan, Pleiss, Geoff, Sun, Yu, and Weinberger, Kilian Q. "On Calibration of Modern Neural Networks." Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 1321-1330. Demonstrates that modern deep neural networks are significantly less calibrated than earlier architectures, and that temperature scaling (a simpler variant of Platt scaling using a single scalar parameter) is often sufficient for recalibration. The finding that model capacity, batch normalization, and weight decay all affect calibration is relevant for the neural network evaluation in this chapter.
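
For a binary classifier, temperature scaling can be sketched in a few lines: divide the validation logits by a scalar T chosen to minimize negative log-likelihood, then reuse that T at prediction time. The logits and labels below are random placeholders standing in for a trained network's validation outputs.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_temperature(val_logits, val_labels):
        """Choose the scalar T > 0 that minimizes validation NLL of sigmoid(logits / T)."""
        val_logits = np.asarray(val_logits, dtype=float)
        val_labels = np.asarray(val_labels, dtype=float)

        def nll(t):
            p = np.clip(sigmoid(val_logits / t), 1e-15, 1 - 1e-15)
            return -np.mean(val_labels * np.log(p) + (1 - val_labels) * np.log(1 - p))

        return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

    rng = np.random.default_rng(0)
    T = fit_temperature(rng.normal(scale=3.0, size=2000), rng.integers(0, 2, size=2000))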

8. Kull, Meelis, Silva Filho, Telmo, and Flach, Peter. "Beta Calibration: A Well-Founded and Easily Implemented Improvement on Logistic Calibration for Binary Classifiers." Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017, pp. 623-631. Introduces beta calibration as a theoretically grounded alternative to Platt scaling. Beta calibration fits a three-parameter mapping (compared to Platt's two parameters) that can handle a wider range of miscalibration patterns while remaining more constrained than isotonic regression. Recommended for practitioners seeking a middle ground between Platt scaling and isotonic regression.
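
The beta calibration family maps an uncalibrated probability through a three-parameter function; the sketch below shows one common statement of that map (with a = b = 1 and c = 0 it is the identity). Consult the paper for the exact parameterization and the fitting procedure, which it shows can be reduced to a logistic regression on transformed probabilities.

    import numpy as np

    def beta_calibration_map(p, a, b, c):
        """Map an uncalibrated probability p through the three-parameter beta calibration family."""
        p = np.clip(np.asarray(p, dtype=float), 1e-15, 1 - 1e-15)
        return 1.0 / (1.0 + 1.0 / (np.exp(c) * p ** a / (1.0 - p) ** b))

    # With a = b = 1 and c = 0 the map is the identity; a, b, c are fit on held-out data.
    print(beta_calibration_map(np.array([0.1, 0.5, 0.9]), a=1.2, b=0.8, c=0.1))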


Academic Papers: Model Comparison and Selection

9. Diebold, Francis X. and Mariano, Roberto S. "Comparing Predictive Accuracy." Journal of Business and Economic Statistics, 13(3), 1995, pp. 253-263. The original paper introducing the Diebold-Mariano test for comparing forecast accuracy. Provides the asymptotic theory, the Newey-West variance estimator, and practical guidance on lag selection. This paper is the foundation for the diebold_mariano_test function in Chapter 30. The authors' later commentary (2002) addresses finite-sample corrections.
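
A minimal sketch of a DM-type statistic with a Bartlett-kernel (Newey-West) variance estimate is shown below; this is not the chapter's diebold_mariano_test implementation, and it omits the finite-sample corrections discussed in the later literature.

    import numpy as np
    from scipy import stats

    def dm_test(loss_a, loss_b, max_lag=1):
        """DM-type statistic on a loss differential, with a Bartlett-kernel long-run variance."""
        d = np.asarray(loss_a, dtype=float) - np.asarray(loss_b, dtype=float)
        n, d_bar = len(d), d.mean()
        d_c = d - d_bar
        lrv = np.sum(d_c ** 2) / n                              # lag-0 autocovariance
        for k in range(1, max_lag + 1):
            gamma_k = np.sum(d_c[k:] * d_c[:-k]) / n            # lag-k autocovariance
            lrv += 2.0 * (1.0 - k / (max_lag + 1.0)) * gamma_k  # Bartlett kernel weight
        dm_stat = d_bar / np.sqrt(lrv / n)
        p_value = 2.0 * (1.0 - stats.norm.cdf(abs(dm_stat)))
        return dm_stat, p_value

    # loss_a and loss_b would be per-game Brier or log losses from two competing models.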

10. Hansen, Peter R., Lunde, Asger, and Nason, James M. "The Model Confidence Set." Econometrica, 79(2), 2011, pp. 453-497. Introduces the Model Confidence Set (MCS) procedure, which identifies the set of models that includes the best model with a given confidence level. Unlike pairwise DM tests, MCS properly handles multiple comparisons and produces a set of "statistically indistinguishable" best models. Ideal for the scenario in Exercise 26 where many models are compared simultaneously.

11. Burnham, Kenneth P. and Anderson, David R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. 2nd ed., Springer, 2002. The comprehensive treatment of AIC, BIC, and information-theoretic model selection. Chapter 2 provides intuitive explanations of Kullback-Leibler distance and its relationship to AIC. Chapter 4 addresses the multi-model inference approach (model averaging weighted by AIC weights), which is directly applicable to the ensemble predictions discussed in Section 30.5.
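
The Akaike weights used for multimodel inference follow directly from the AIC differences; a short sketch with illustrative AIC values:

    import numpy as np

    def akaike_weights(aic_values):
        """Turn AIC values into model-averaging weights: exp(-delta/2), normalized to sum to 1."""
        aic = np.asarray(aic_values, dtype=float)
        delta = aic - aic.min()          # differences relative to the best (lowest-AIC) model
        w = np.exp(-0.5 * delta)
        return w / w.sum()

    print(akaike_weights([1012.4, 1010.1, 1017.8]))   # illustrative AIC values for three candidate models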

12. Newey, Whitney K. and West, Kenneth D. "A Simple, Positive Semi-definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix." Econometrica, 55(3), 1987, pp. 703-708. The original paper on the Newey-West heteroskedasticity and autocorrelation consistent (HAC) variance estimator used in the Diebold-Mariano test implementation. Provides the theoretical justification for the Bartlett kernel weights and guidance on truncation lag selection. Essential for understanding why the standard variance estimator is inadequate for serially correlated loss differentials.


Books and Papers: Backtesting and Sports Betting

13. Lopez de Prado, Marcos. Advances in Financial Machine Learning. Wiley, 2018. While focused on financial markets, this book's treatment of backtesting pitfalls (Chapter 11), cross-validation for financial data (Chapter 7), and the purged cross-validation method (Section 7.4) is directly applicable to sports betting. The concept of "backtest overfitting" --- running many backtests and selecting the best result --- is a critical warning for sports bettors. The combinatorial purged cross-validation method provides a more rigorous alternative to simple walk-forward validation.
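
A greatly simplified sketch of the purging-plus-embargo idea for a single chronological fold is below; Lopez de Prado's combinatorial purged cross-validation is considerably more involved, and the label_horizon and embargo conventions here are illustrative assumptions.

    import numpy as np

    def purged_train_indices(n_samples, test_start, test_end, label_horizon=0, embargo=0):
        """Training indices for one chronological fold, with purging and an embargo.

        Purging drops training games whose label window could overlap the test block;
        the embargo drops a further block immediately after the test window.
        """
        idx = np.arange(n_samples)
        keep = (idx < test_start - label_horizon) | (idx >= test_end + embargo)
        return idx[keep]

    train_idx = purged_train_indices(1000, test_start=800, test_end=900,
                                     label_horizon=5, embargo=20)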

14. Bailey, David H. and Lopez de Prado, Marcos. "The Probability of Backtest Overfitting." Journal of Computational Finance, 17(1), 2014, pp. 39-70. Provides a framework for estimating the probability that a backtested strategy's performance is due to overfitting rather than genuine skill. The related Minimum Backtest Length result gives the shortest backtest, as a function of the number of configurations tried, needed to avoid selecting a strategy whose apparent edge is pure luck. Directly relevant to the question of how many seasons of data are needed for a reliable backtest.

15. Hubacek, Ondrej, Sourek, Gustav, and Zelezny, Filip. "Exploiting Sports-betting Market Using Machine Learning." International Journal of Forecasting, 35(2), 2019, pp. 783-796. One of the most rigorous evaluations of ML models for sports betting, including walk-forward validation, proper scoring rules, and realistic backtesting with vigorish. The paper's evaluation methodology --- which includes Brier score, calibration analysis, and profitability simulation --- is a model for the evaluation pipeline described in Chapter 30.
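
In the same spirit, a toy flat-stake profitability simulation is sketched below; the decimal odds are assumed to already contain the bookmaker's vigorish, and all inputs are placeholders rather than the paper's data or methodology.

    import numpy as np

    def flat_stake_backtest(model_probs, decimal_odds, outcomes, edge_threshold=0.0):
        """Bet one unit whenever model probability times decimal odds exceeds 1 + threshold."""
        p = np.asarray(model_probs, dtype=float)
        odds = np.asarray(decimal_odds, dtype=float)    # bookmaker odds, vig already included
        won = np.asarray(outcomes, dtype=float)         # 1 if the wagered side won, else 0

        bets = p * odds - 1.0 > edge_threshold          # positive expected value under the model
        profit = np.where(won[bets] == 1, odds[bets] - 1.0, -1.0)
        roi = profit.sum() / max(bets.sum(), 1)
        return int(bets.sum()), float(profit.sum()), float(roi)

    n_bets, total_profit, roi = flat_stake_backtest(
        model_probs=[0.55, 0.40, 0.62], decimal_odds=[1.95, 2.60, 1.70], outcomes=[1, 0, 1])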


Technical Resources and Tools

16. scikit-learn Calibration Module (scikit-learn.org/stable/modules/calibration.html) The official scikit-learn documentation for calibration methods, including CalibratedClassifierCV, Platt scaling (method='sigmoid'), and isotonic regression (method='isotonic'). The calibration curve utilities (calibration_curve and CalibrationDisplay) provide the foundation for the reliability diagram code in this chapter. The documentation includes guidance on when to use each method and sample size requirements.
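
A minimal sketch of the documented workflow follows: wrap a base estimator in CalibratedClassifierCV and inspect the result with calibration_curve. The synthetic dataset and the choice of base model are placeholders.

    from sklearn.calibration import CalibratedClassifierCV, calibration_curve
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a real feature matrix of games.
    X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # Platt scaling via method="sigmoid"; method="isotonic" generally needs more calibration data.
    model = CalibratedClassifierCV(GradientBoostingClassifier(), method="sigmoid", cv=5)
    model.fit(X_train, y_train)

    probs = model.predict_proba(X_test)[:, 1]
    prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=10)   # reliability-diagram points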

17. Optuna Documentation: Multi-Objective Optimization (optuna.readthedocs.io) Optuna's multi-objective optimization capability is relevant when you want to simultaneously optimize multiple evaluation metrics (e.g., minimize Brier score and ECE). The Pareto front visualization shows the tradeoffs between metrics and can inform model selection when no single model dominates on all criteria. Pairs well with the hyperparameter tuning concepts from Chapter 29.
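
A sketch of the multi-objective API is below; train_and_eval is a stub standing in for whatever training and evaluation routine produces the two metrics on a validation set.

    import optuna

    def train_and_eval(params):
        """Stub for a real training + evaluation routine returning (Brier score, ECE)."""
        depth_penalty = params["max_depth"] / 8.0
        return 0.25 - 0.01 * depth_penalty, 0.05 + 0.01 * depth_penalty   # placeholder values

    def objective(trial):
        params = {
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "max_depth": trial.suggest_int("max_depth", 2, 8),
        }
        return train_and_eval(params)          # two objectives returned together

    study = optuna.create_study(directions=["minimize", "minimize"])
    study.optimize(objective, n_trials=50)
    pareto_front = study.best_trials           # Pareto-optimal trials across Brier score and ECE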

18. Weights and Biases: Model Registry and Evaluation (wandb.ai) W&B provides tools for tracking model evaluation metrics across experiments, including calibration curves, scoring rule comparisons, and backtest result visualization. For teams evaluating many model configurations, W&B's experiment comparison features enable systematic model selection. The free tier supports the full evaluation pipeline for individual practitioners.
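
A minimal logging sketch follows; the project name, config fields, and metric values are placeholders, and a W&B account is assumed.

    import wandb

    run = wandb.init(project="model-evaluation",                      # hypothetical project name
                     config={"model": "xgboost", "calibration": "isotonic"})

    for season, brier, ece in [("2022-23", 0.221, 0.021), ("2023-24", 0.217, 0.017)]:  # placeholder metrics
        wandb.log({"season": season, "brier": brier, "ece": ece})

    run.finish()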


How to Use This Reading List

For readers working through this textbook sequentially, the following prioritization is suggested:

  • Start with: Gneiting and Raftery (entry 1) for scoring rule theory, and Murphy (entry 5) for the Brier decomposition. These provide the theoretical foundations for all evaluation in Chapter 30.
  • Go deeper on calibration: Niculescu-Mizil and Caruana (entry 6) for practical calibration of ML classifiers, and Guo et al. (entry 7) for neural network-specific calibration issues.
  • Go deeper on model comparison: Diebold and Mariano (entry 9) for the DM test, and Hansen et al. (entry 10) for the Model Confidence Set when comparing many models.
  • Go deeper on backtesting: Lopez de Prado (entry 13) for backtesting pitfalls and purged cross-validation, and Bailey and Lopez de Prado (entry 14) for backtest overfitting probability.
  • For production systems: scikit-learn calibration module (entry 16) and W&B (entry 18) for implementing the evaluation pipeline in production.

These resources will be referenced again in later chapters as evaluation concepts are applied to ensemble methods, reinforcement learning for bet sizing, and live deployment.