Chapter 27 Further Reading: Advanced Regression and Classification
Gradient Boosting and XGBoost
- Chen, T. and Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794. The original XGBoost paper. Describes the algorithmic innovations (second-order gradients, regularized objective, column subsampling, sparsity-aware splits) that made XGBoost the dominant algorithm for tabular prediction. Essential reading for understanding why XGBoost works; a minimal training sketch follows this list.
- Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." Annals of Statistics, 29(5), 1189-1232. The foundational paper on gradient boosting. Friedman introduces the idea of fitting each successive tree to the negative gradient of the loss function, establishing the theoretical framework that XGBoost, LightGBM, and CatBoost all build upon.
- Ke, G. et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." Advances in Neural Information Processing Systems, 30. LightGBM introduces histogram-based splitting and leaf-wise tree growth, which make training substantially faster while remaining competitive with XGBoost in accuracy. Relevant for sports bettors processing large datasets or needing rapid retraining.
- Prokhorenkova, L. et al. (2018). "CatBoost: Unbiased Boosting with Categorical Features." Advances in Neural Information Processing Systems, 31. CatBoost addresses target leakage in categorical feature handling and introduces ordered boosting. Useful for sports prediction models with categorical features like team names, venues, and division indicators.
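A minimal sketch of the kind of regularized boosted-tree classifier the Chen and Guestrin entry describes, assuming the Python `xgboost` package's scikit-learn wrapper. The feature matrix, labels, and every hyperparameter value below are placeholders for illustration, not recommended settings.

```python
# Sketch: fit a regularized XGBoost classifier on a binary outcome
# (e.g., home win). Data and hyperparameters are illustrative only.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))                           # placeholder features
y = (X[:, 0] + rng.normal(size=2000) > 0).astype(int)     # placeholder labels

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=400,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,          # row subsampling
    colsample_bytree=0.8,   # column subsampling, as in the paper
    reg_lambda=1.0,         # L2 penalty on leaf weights (regularized objective)
)
model.fit(X_tr, y_tr)
p_val = model.predict_proba(X_val)[:, 1]   # predicted win probabilities
```

On real match data, the hyperparameters above would be tuned with temporally ordered validation rather than a random split.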
Random Forests and Ensemble Methods
- Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5-32. Leo Breiman's influential paper introducing random forests. Establishes the mathematical foundation (variance reduction through decorrelation) and demonstrates the method's robustness across diverse prediction tasks.
- Wolpert, D. H. (1992). "Stacked Generalization." Neural Networks, 5(2), 241-259. The original stacking paper. Introduces the concept of training a meta-learner on out-of-fold predictions from base models. The theoretical foundation for the model combination strategies used throughout Chapter 27; see the sketch after this list.
- van der Laan, M. J., Polley, E. C., and Hubbard, A. E. (2007). "Super Learner." Statistical Applications in Genetics and Molecular Biology, 6(1). Extends stacking with formal optimality results, showing that the super learner performs asymptotically as well as the best possible combination of base learners. Provides theoretical justification for ensemble approaches in sports prediction.
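A sketch of Wolpert-style stacking using scikit-learn's `StackingClassifier`, which fits the meta-learner on out-of-fold base-model predictions. The choice of base models, meta-learner, and the synthetic data are illustrative assumptions, not a prescription.

```python
# Sketch: stacked generalization with a logistic-regression meta-learner
# trained on out-of-fold predictions from two base models.
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))                                  # placeholder features
y = (X[:, 0] - X[:, 1] + rng.normal(size=1000) > 0).astype(int) # placeholder labels

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
        ("gbm", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,                           # out-of-fold predictions feed the meta-learner
    stack_method="predict_proba",   # stack on probabilities, not hard labels
)
stack.fit(X, y)
blended = stack.predict_proba(X)[:, 1]
```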
Probability Calibration
- Platt, J. C. (1999). "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods." Advances in Large-Margin Classifiers, 61-74. The original Platt scaling paper. Introduces the sigmoid calibration method $P(y|f) = 1/(1 + \exp(Af + B))$. Although developed for SVMs, the method applies to any model producing uncalibrated scores; a calibration sketch follows this list.
- Niculescu-Mizil, A. and Caruana, R. (2005). "Predicting Good Probabilities with Supervised Learning." Proceedings of the 22nd International Conference on Machine Learning, 625-632. Comprehensive empirical study of calibration across different learning algorithms. Shows that boosted trees tend to produce distorted, overconfident probability estimates, and that post-hoc calibration largely corrects this. Motivates the use of post-hoc calibration for XGBoost.
- Guo, C. et al. (2017). "On Calibration of Modern Neural Networks." Proceedings of the 34th International Conference on Machine Learning. Demonstrates that modern deep learning models are often poorly calibrated and that simple post-hoc methods (temperature scaling, a generalization of Platt scaling) are surprisingly effective. The findings extend to gradient-boosted tree ensembles.
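A sketch of post-hoc sigmoid (Platt-style) calibration using scikit-learn's `CalibratedClassifierCV`, which fits the $A$ and $B$ of the sigmoid on held-out folds. The boosted-tree base model and synthetic data are stand-ins for illustration.

```python
# Sketch: compare a raw boosted-tree model with a Platt-calibrated version
# using the Brier score on a held-out set.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(3000, 6))                                        # placeholder features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=3000) > 0).astype(int) # placeholder labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

raw = GradientBoostingClassifier(random_state=0)
calibrated = CalibratedClassifierCV(raw, method="sigmoid", cv=5)  # Platt scaling on CV folds
calibrated.fit(X_tr, y_tr)
raw.fit(X_tr, y_tr)

print("raw Brier score:       ", brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1]))
print("calibrated Brier score:", brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1]))
```

Swapping `method="sigmoid"` for `method="isotonic"` gives a nonparametric alternative when enough validation data is available.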
SHAP and Model Interpretability
- Lundberg, S. M. and Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." Advances in Neural Information Processing Systems, 30. The foundational SHAP paper. Unifies several existing explanation methods (LIME, DeepLIFT, Shapley regression) under a single theoretical framework based on Shapley values, and shows that this is the only additive attribution method satisfying local accuracy, missingness, and consistency.
- Lundberg, S. M., Erion, G. G., and Lee, S.-I. (2018). "Consistent Individualized Feature Attribution for Tree Ensembles." arXiv:1802.03888. Introduces TreeSHAP, the polynomial-time exact algorithm for computing SHAP values in tree-based models. This paper makes SHAP practical for the large XGBoost models used in sports prediction; a TreeExplainer sketch follows this list.
- Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "'Why Should I Trust You?': Explaining the Predictions of Any Classifier." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135-1144. The LIME paper. Introduces local interpretable model-agnostic explanations through perturbation-based local approximation. While SHAP is generally preferred for tree models, LIME remains useful for model-agnostic explanation.
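A sketch of TreeSHAP applied to a boosted-tree model, assuming the `shap` and `xgboost` Python packages; the model and data are placeholders.

```python
# Sketch: exact SHAP values for a tree ensemble via TreeExplainer.
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(3)
X = rng.normal(size=(1500, 5))                                  # placeholder features
y = (X[:, 0] - X[:, 2] + rng.normal(size=1500) > 0).astype(int) # placeholder labels

model = xgb.XGBClassifier(n_estimators=200, max_depth=3)
model.fit(X, y)

explainer = shap.TreeExplainer(model)    # polynomial-time exact algorithm for trees
shap_values = explainer.shap_values(X)   # one attribution per feature per row
                                         # (log-odds units for a binary XGBoost model)

# Global importance: mean absolute SHAP value per feature.
print(np.abs(shap_values).mean(axis=0))
```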
Class Imbalance
- Chawla, N. V. et al. (2002). "SMOTE: Synthetic Minority Over-Sampling Technique." Journal of Artificial Intelligence Research, 16, 321-357. The original SMOTE paper. Introduces the idea of generating synthetic minority examples through interpolation between nearest neighbors. The most widely used oversampling method for imbalanced classification; a sketch comparing SMOTE with class weighting follows this list.
- He, H. and Garcia, E. A. (2009). "Learning from Imbalanced Data." IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284. Comprehensive survey of techniques for handling class imbalance, covering sampling methods (SMOTE, undersampling), cost-sensitive learning (class weights), and ensemble methods (EasyEnsemble, BalancedBagging). Useful reference for sports prediction of rare events.
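A sketch contrasting SMOTE oversampling with simple class weighting, the two families the He and Garcia survey covers. It assumes the `imbalanced-learn` and `scikit-learn` packages; the synthetic data and the choice of a random forest are illustrative.

```python
# Sketch: two ways to handle a rare positive class (e.g., an upset or
# a rare in-game event) -- synthetic oversampling vs. reweighting the loss.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 6))                               # placeholder features
y = (X[:, 0] + rng.normal(size=2000) > 2.5).astype(int)      # rare positive class

# Option 1: synthesize minority examples by interpolating between neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
clf_smote = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_res, y_res)

# Option 2: leave the data alone and reweight the classes instead.
clf_weighted = RandomForestClassifier(
    n_estimators=300, class_weight="balanced", random_state=0
).fit(X, y)
```

Either way, evaluation should use proper scoring rules or precision-recall metrics rather than accuracy, which is uninformative under heavy imbalance.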
Sports Prediction Applications
- Hubacek, O., Sourek, G., and Zelezny, F. (2019). "Exploiting Sports-Betting Market Using Machine Learning." International Journal of Forecasting, 35(2), 783-796. Demonstrates profitable betting using gradient-boosted trees with careful feature engineering and temporal validation. One of the few peer-reviewed papers showing machine learning beating sports betting markets.
- Constantinou, A. C., Fenton, N. E., and Neil, M. (2012). "pi-football: A Bayesian Network Model for Forecasting Association Football Match Outcomes." Knowledge-Based Systems, 36, 322-339. Applies Bayesian networks to soccer match prediction. Relevant for comparison with the machine learning approaches in Chapter 27 and for understanding how probabilistic graphical models differ from tree-based ensembles.
- Sauer, R. D. (1998). "The Economics of Wagering Markets." Journal of Economic Literature, 36(4), 2021-2064. Comprehensive review of the economics of sports betting markets, including market efficiency, the role of informed bettors, and the relationship between model predictions and market prices. Provides the economic context for why probability calibration matters.
Practical Implementation
- Boehmke, B. and Greenwell, B. (2019). Hands-On Machine Learning with R. CRC Press. (Chapters on gradient boosting and interpretability.) While R-focused, the conceptual explanations of gradient boosting tuning, feature importance, and partial dependence plots are excellent and language-agnostic.
- Molnar, C. (2022). Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. (Available free online.) Comprehensive guide to model interpretability covering SHAP, LIME, partial dependence, accumulated local effects, and more. Highly recommended for anyone building interpretable sports prediction models. The online edition is regularly updated.