Chapter 19: Further Reading

Textbooks and General References

Machine Learning Foundations

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer. --- The authoritative reference for the statistical foundations of machine learning. Chapters on tree-based methods, boosting, and model selection are particularly relevant.

  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in Python (2nd ed.). Springer. --- A more accessible introduction than ESL, with practical Python examples. Freely available online.

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. --- Comprehensive treatment of probabilistic machine learning, including Gaussian Mixture Models and the EM algorithm.

  • Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. --- A modern, thorough treatment of ML from a probabilistic perspective. Freely available online.

Applied Machine Learning

  • Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.). O'Reilly Media. --- Excellent practical guide covering scikit-learn pipelines, ensemble methods, and deployment.

  • Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer. --- Strong coverage of feature engineering, model tuning, and evaluation strategies with practical advice.

Soccer Analytics Literature

Expected Goals and Shot Models

  • Rathke, A. (2017). "An examination of expected goals and shot efficiency in soccer." Journal of Human Sport and Exercise, 12(2proc), S514--S529. --- One of the early academic treatments of xG methodology.

  • Anzer, G., & Bauer, P. (2021). "A Goal Scoring Probability Model for Shots Based on Synchronized Positional and Event Data in Football (Soccer)." Frontiers in Sports and Active Living, 3, 624475. --- Demonstrates the value of incorporating tracking data (defender and goalkeeper positions) into xG models.

  • Robberechts, P., Davis, J., & Van Haaren, J. (2021). "A Bayesian Approach to In-Game Win Probability in Soccer." Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. --- Extends xG concepts to full-match win probability modeling.
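
All of the shot models above are, at bottom, probabilistic classifiers over shot features. The core idea can be sketched in plain Python with a single synthetic distance feature and a hand-rolled logistic fit; every number below is illustrative, not taken from any of the cited models:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.01, epochs=1000):
    """Fit weights and bias by batch gradient descent on log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        gw = [0.0] * len(w)
        gb = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

# Synthetic shots: one feature (distance to goal in metres), binary outcome.
random.seed(42)
X, y = [], []
for _ in range(500):
    dist = random.uniform(5.0, 30.0)
    p_true = sigmoid(2.0 - 0.25 * dist)   # closer shots score more often
    X.append([dist])
    y.append(1 if random.random() < p_true else 0)

w, b = fit_logistic(X, y)

def xg(dist):
    """Estimated scoring probability for a shot from `dist` metres."""
    return sigmoid(w[0] * dist + b)
```

In the cited work, distance would be joined by shot angle, body part, and game context, and Anzer & Bauer add tracking-derived features such as defender and goalkeeper positions.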

Player Valuation and Action Value

  • Decroos, T., Bransen, L., Van Haaren, J., & Davis, J. (2019). "Actions Speak Louder than Goals: Valuing Player Actions in Soccer." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1851--1861. --- Introduces VAEP (Valuing Actions by Estimating Probabilities), a framework for valuing all on-the-ball actions.

  • Fernández, J., Bornn, L., & Cervone, D. (2021). "A Framework for the Fine-Grained Evaluation of the Instantaneous Expected Value of Soccer Possessions." Machine Learning, 110(6), 1389--1427. --- Expected Possession Value (EPV) model using tracking data.

  • Singh, K. (2019). "Introducing Expected Threat (xT)." karun.in/blog. --- The original blog post introducing Expected Threat as a Markov-chain valuation of pitch zones.
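
Singh's xT formulation is compact enough to sketch directly: the value of a zone is the probability of scoring by shooting now, plus the expected value of moving the ball onward, solved by value iteration over a Markov chain of zones. A toy version on a one-dimensional, five-zone pitch (all probabilities invented for illustration; the real model uses a 2-D grid fitted from event data):

```python
# One-dimensional pitch with five zones; zone 4 is nearest the opponent's goal.
N = 5
shoot_prob = [0.00, 0.00, 0.10, 0.30, 0.60]   # P(next action is a shot | zone)
goal_prob  = [0.00, 0.00, 0.02, 0.08, 0.25]   # P(goal | shot from zone)

# Move transition matrix T[z][zp]: where the ball goes when it is moved.
T = []
for z in range(N):
    row = [0.0] * N
    row[min(z + 1, N - 1)] += 0.6   # forward
    row[z] += 0.3                    # sideways / retained
    row[max(z - 1, 0)] += 0.1        # backward
    T.append(row)

# Value iteration: V[z] = shoot-and-score now, or move and collect V later.
V = [0.0] * N
for _ in range(100):
    V = [shoot_prob[z] * goal_prob[z]
         + (1 - shoot_prob[z]) * sum(T[z][zp] * V[zp] for zp in range(N))
         for z in range(N)]
```

A pass or carry from zone z to zone zp is then credited with V[zp] - V[z], which is how xT turns zone values into action values.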

Clustering and Player Roles

  • Decroos, T., & Davis, J. (2020). "Player Vectors: Characterizing Soccer Players' Playing Style from Match Event Streams." Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2019), pp. 569--584. --- Uses action sequences to create dense vector representations of player styles.

  • Aalbers, B., & Van Haaren, J. (2023). "Distinguishing Between Player Roles in Football Using Clustering." Journal of Sports Sciences. --- Systematic comparison of clustering methods for player role identification.

Technical Topics

Ensemble Methods

  • Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785--794. --- The foundational XGBoost paper, essential for understanding modern gradient boosting.

  • Ke, G., Meng, Q., Finley, T., et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." Advances in Neural Information Processing Systems, 30, 3146--3154. --- Introduces histogram-based splitting and Gradient-based One-Side Sampling (GOSS).

  • Wolpert, D. H. (1992). "Stacked Generalization." Neural Networks, 5(2), 241--259. --- The original paper on stacking (stacked generalization).
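
Wolpert's stacking scheme is easy to misread as simple model averaging; the essential ingredient is that the level-1 (meta) learner is trained on out-of-fold predictions, so no base model is ever evaluated on data it was fitted to. A self-contained sketch with two toy base learners and a least-squares meta-learner (all learners, data, and hyperparameters here are invented for illustration):

```python
import random

def kfold_oof(models, X, y, k=5):
    """Level-0: out-of-fold predictions, one column per base model."""
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]
    Z = [[0.0] * len(models) for _ in range(n)]
    for fold in folds:
        hold = set(fold)
        tr = [i for i in range(n) if i not in hold]
        for m, learn in enumerate(models):
            fitted = learn([X[i] for i in tr], [y[i] for i in tr])
            for i in fold:
                Z[i][m] = fitted(X[i])
    return Z

# Two toy base "learners": each takes (X, y) and returns a predict function.
def mean_learner(X, y):
    mu = sum(y) / len(y)
    return lambda x: mu

def nearest_learner(X, y):
    data = list(zip(X, y))
    return lambda x: min(data, key=lambda d: abs(d[0] - x))[1]

def fit_meta(Z, y):
    """Level-1: least-squares weights over the OOF columns (2x2 normal equations)."""
    a11 = sum(z[0] * z[0] for z in Z)
    a12 = sum(z[0] * z[1] for z in Z)
    a22 = sum(z[1] * z[1] for z in Z)
    b1 = sum(z[0] * yi for z, yi in zip(Z, y))
    b2 = sum(z[1] * yi for z, yi in zip(Z, y))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (b2 * a11 - b1 * a12) / det

random.seed(0)
X = [random.uniform(0.0, 10.0) for _ in range(100)]
y = [0.5 * x + random.gauss(0.0, 0.1) for x in X]

models = [mean_learner, nearest_learner]
w0, w1 = fit_meta(kfold_oof(models, X, y), y)

# For prediction, base models are refitted on all the data.
final = [learn(X, y) for learn in models]

def stacked(x):
    return w0 * final[0](x) + w1 * final[1](x)
```

Here the meta-learner correctly gives most of the weight to the stronger base model, which it can only discover because the OOF predictions are honest estimates of generalization.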

Feature Selection and Engineering

  • Guyon, I., & Elisseeff, A. (2003). "An Introduction to Variable and Feature Selection." Journal of Machine Learning Research, 3, 1157--1182. --- Comprehensive overview of feature selection methods.

Model Calibration

  • Niculescu-Mizil, A., & Caruana, R. (2005). "Predicting Good Probabilities with Supervised Learning." Proceedings of the 22nd International Conference on Machine Learning, pp. 625--632. --- Demonstrates that boosted trees tend to push predictions away from 0 and 1, motivating post-hoc calibration.
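
The practical upshot of this line of work is post-hoc calibration: learn a monotone map from held-out scores to probabilities. Platt scaling, one of the two remedies the paper studies (the other is isotonic regression), fits a sigmoid to classifier margins. A plain-Python sketch on synthetic held-out data; the scores, rates, and hyperparameters are illustrative, and the original method additionally regularizes the targets slightly, which is omitted here:

```python
import math

def platt_scale(scores, labels, lr=0.5, epochs=2000):
    """Fit p = sigmoid(a*s + b) to held-out (margin, label) pairs by batch
    gradient descent on log-loss; returns the calibration map."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        ga = gb = 0.0
        for s, yv in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - yv) * s
            gb += (p - yv)
        a -= lr * ga / n
        b -= lr * gb / n
    return lambda s: 1.0 / (1.0 + math.exp(-(a * s + b)))

# Synthetic held-out set: two groups of margins whose true positive
# rates (0.10 and 0.80) differ from what the raw scores suggest.
scores = [-1.0] * 100 + [1.0] * 100
labels = [0] * 90 + [1] * 10 + [0] * 20 + [1] * 80
calibrate = platt_scale(scores, labels)
```

After fitting, `calibrate` maps each margin to (approximately) the empirical positive rate of its group while staying monotone, which is exactly what a calibration curve demands.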

Model Interpretability

  • Lundberg, S. M., & Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." Advances in Neural Information Processing Systems, 30. --- Introduces SHAP (SHapley Additive exPlanations) for model interpretability.

  • Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "'Why Should I Trust You?' Explaining the Predictions of Any Classifier." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. --- Introduces LIME (Local Interpretable Model-agnostic Explanations).
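
SHAP's target quantity, the Shapley value of each feature, can be computed exactly by brute force for toy models, which makes the definition concrete: average a feature's marginal contribution over all orderings, holding absent features at a baseline. A sketch (the model `f` and baseline are invented; real SHAP implementations use sampling or model-specific algorithms to avoid the factorial cost):

```python
from itertools import permutations

def exact_shapley(f, x, baseline):
    """Exact Shapley values for the features of x w.r.t. model f, averaging
    marginal contributions over all feature orderings. Features not yet
    'added' stay at their baseline value. Exponential cost: toy sizes only."""
    n = len(x)
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for order in perms:
        z = list(baseline)
        prev = f(z)
        for j in order:
            z[j] = x[j]
            cur = f(z)
            phi[j] += cur - prev
            prev = cur
    return [p / len(perms) for p in phi]

# Toy model with an interaction between features 0 and 2.
def f(z):
    return 2 * z[0] + z[1] + 0.5 * z[0] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = exact_shapley(f, x, baseline)
```

The attributions satisfy the efficiency property (they sum to f(x) minus the baseline prediction), and the interaction term is split evenly between the two features involved; these are the axioms Lundberg & Lee build SHAP on.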

Model Deployment and MLOps

  • Sculley, D., Holt, G., Golovin, D., et al. (2015). "Hidden Technical Debt in Machine Learning Systems." Advances in Neural Information Processing Systems, 28. --- Seminal paper on the challenges of maintaining ML systems in production.

Online Resources

Data Sources

  • StatsBomb Open Data: Free event data for select competitions. GitHub: statsbomb/open-data.
  • Wyscout Public Datasets: Publicly released match event data used in academic research.
  • FBref.com: Comprehensive aggregated statistics for players and teams across top leagues.

Software Documentation

  • scikit-learn User Guide: https://scikit-learn.org/stable/user_guide.html --- Comprehensive documentation with theoretical explanations for all algorithms used in this chapter.
  • SHAP Documentation: https://shap.readthedocs.io --- Tutorial and API reference for SHAP interpretability.
  • MLflow Documentation: https://mlflow.org/docs/latest/index.html --- Open-source platform for model tracking, versioning, and deployment.

Community and Blogs

  • Friends of Tracking: YouTube channel with video tutorials on soccer analytics methods, including ML-based models.
  • Soccermatics by David Sumpter: Blog and book bridging mathematics and soccer analytics.
  • The Athletic / StatsBomb blog: Regular articles applying ML concepts to real-world soccer analysis.

Suggested Reading Path

For readers new to machine learning in soccer, we recommend the following sequence:

  1. Start with James et al. (ISLP) for ML fundamentals.
  2. Apply with Géron's Hands-On ML for practical scikit-learn skills.
  3. Specialize with the Decroos et al. VAEP paper and the Anzer & Bauer xG paper for soccer-specific methods.
  4. Deepen with Chen & Guestrin (XGBoost) and Lundberg & Lee (SHAP) for advanced techniques.
  5. Productionize with Sculley et al. on ML technical debt and the MLflow documentation for deployment.