Chapter 19: Further Reading
Textbooks and General References
Machine Learning Foundations
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer. --- The authoritative reference for the statistical foundations of machine learning. Chapters on tree-based methods, boosting, and model selection are particularly relevant.
- James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning with Applications in Python. Springer. --- A more accessible introduction than ESL, with practical Python examples. Freely available online.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. --- Comprehensive treatment of probabilistic machine learning, including Gaussian Mixture Models and the EM algorithm.
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. --- A modern, thorough treatment of ML from a probabilistic perspective. Freely available online.
Applied Machine Learning
- Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.). O'Reilly Media. --- Excellent practical guide covering scikit-learn pipelines, ensemble methods, and deployment.
- Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer. --- Strong coverage of feature engineering, model tuning, and evaluation strategies with practical advice.
Soccer Analytics Literature
Expected Goals and Shot Models
- Rathke, A. (2017). "An examination of expected goals and shot efficiency in soccer." Journal of Human Sport and Exercise, 12(2proc), S514--S529. --- One of the early academic treatments of xG methodology.
- Anzer, G., & Bauer, P. (2021). "A Goal Scoring Probability Model for Shots Based on Synchronized Positional and Event Data in Football (Soccer)." Frontiers in Sports and Active Living, 3, 624475. --- Demonstrates the value of incorporating tracking data (defender and goalkeeper positions) into xG models.
- Robberechts, P., Davis, J., & Van Haaren, J. (2021). "A Bayesian Approach to In-Game Win Probability in Soccer." Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. --- Extends xG concepts to full-match win probability modeling.
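The models above differ in features and data sources, but all estimate the same quantity: P(goal | shot). The core idea can be sketched with a logistic regression on two shot features; the coefficients and data below are synthetic and purely illustrative, not taken from any of the cited papers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic shots: distance to goal (m) and opening angle (rad).  The
# "true" probabilities below are invented so that close, central shots
# score more often, mimicking the shape of real xG models.
n = 2000
distance = rng.uniform(5, 35, n)
angle = rng.uniform(0.1, 1.4, n)
p_goal = 1 / (1 + np.exp(-(2.0 - 0.15 * distance + 1.0 * angle)))
goal = rng.binomial(1, p_goal)

X = np.column_stack([distance, angle])
model = LogisticRegression().fit(X, goal)

# xG for a close shot with a wide angle vs. a distant, narrow-angle one.
xg = model.predict_proba([[11.0, 1.2], [30.0, 0.3]])[:, 1]
print(np.round(xg, 3))
```

Real models replace the two synthetic features with dozens of event- and tracking-derived ones (body part, defender positions, goalkeeper location) and often swap the linear model for gradient boosting.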
Player Valuation and Action Value
- Decroos, T., Bransen, L., Van Haaren, J., & Davis, J. (2019). "Actions Speak Louder than Goals: Valuing Player Actions in Soccer." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1851--1861. --- Introduces VAEP (Valuing Actions by Estimating Probabilities), a framework for valuing all on-the-ball actions.
- Fernandez, J., Bornn, L., & Cervone, D. (2021). "A Framework for the Fine-Grained Evaluation of the Instantaneous Expected Value of Soccer Possessions." Machine Learning, 110(6), 1389--1427. --- Expected Possession Value (EPV) model using tracking data.
- Singh, K. (2019). "Introducing Expected Threat (xT)." karun.in/blog. --- The original blog post introducing Expected Threat as a Markov-chain valuation of pitch zones.
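Singh's Markov-chain formulation is compact enough to sketch in a few lines. The toy below uses a one-dimensional strip of four zones with invented probabilities (Singh's actual model uses a 16x12 pitch grid with probabilities estimated from event data); a zone's threat is its immediate scoring value plus the expected threat of wherever a successful move takes the ball.

```python
import numpy as np

# Four pitch zones, zone 3 closest to goal.  All numbers are illustrative.
shoot_p = np.array([0.01, 0.05, 0.15, 0.40])  # P(shoot | ball in zone)
score_p = np.array([0.01, 0.02, 0.10, 0.30])  # P(goal | shot from zone)
lose_p = 0.20                                 # P(possession lost)
move_p = 1 - shoot_p - lose_p                 # P(successful move)
T = np.array([                                # P(next zone | move from zone)
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.4, 0.4, 0.0],
    [0.0, 0.3, 0.4, 0.3],
    [0.0, 0.0, 0.5, 0.5],
])

# Value iteration on the Markov chain until the threat values converge.
xT = np.zeros(4)
for _ in range(50):
    xT = shoot_p * score_p + move_p * (T @ xT)

print(np.round(xT, 3))  # threat rises toward the goal
```

An action's value is then the difference in xT between its end and start zones, which is what lets xT credit progressive passes and carries, not just shots.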
Clustering and Player Roles
- Decroos, T., & Davis, J. (2020). "Player Vectors: Characterizing Soccer Players' Playing Style from Match Event Streams." Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2019), pp. 569--584. --- Uses action sequences to create dense vector representations of player styles.
- Aalbers, B., & Van Haaren, J. (2023). "Distinguishing Between Player Roles in Football Using Clustering." Journal of Sports Sciences. --- Systematic comparison of clustering methods for player role identification.
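Whatever representation these papers build (action vectors, per-90 statistics), the clustering step itself is standard. A minimal sketch with k-means on made-up per-90 feature vectors; the three "roles", their feature means, and the feature set are all invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical per-90 features: tackles, progressive passes, shots.
# Three stylistic groups of 30 players each, with invented profiles.
defenders = rng.normal([4.0, 3.0, 0.3], 0.5, size=(30, 3))
midfielders = rng.normal([2.0, 7.0, 1.0], 0.5, size=(30, 3))
forwards = rng.normal([0.5, 2.0, 3.5], 0.5, size=(30, 3))
X = StandardScaler().fit_transform(
    np.vstack([defenders, midfielders, forwards])
)

# Standardizing first matters: k-means is distance-based, so unscaled
# features with large ranges would dominate the clustering.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:5], labels[30:35], labels[60:65])
```

On real data the number of roles is not known in advance; the cited comparison papers lean on silhouette scores and qualitative inspection to choose it.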
Technical Topics
Ensemble Methods
- Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785--794. --- The foundational XGBoost paper, essential for understanding modern gradient boosting.
- Ke, G., Meng, Q., Finley, T., et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." Advances in Neural Information Processing Systems, 30, 3146--3154. --- Introduces histogram-based splitting and gradient-based one-side sampling.
- Wolpert, D. H. (1992). "Stacked Generalization." Neural Networks, 5(2), 241--259. --- The original paper on stacking (stacked generalization).
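Wolpert's stacked generalization is available off the shelf in scikit-learn: level-0 learners produce out-of-fold predictions, and a level-1 model learns how to combine them. A minimal sketch on synthetic data (the choice of base learners and dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# cv=5: each base learner's training-set predictions come from held-out
# folds, so the final logistic regression never sees leaked fits.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
print(f"stacked accuracy: {stack.score(X_te, y_te):.3f}")
```

Stacking pays off most when the base learners make different kinds of errors; stacking two near-identical models buys little.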
Feature Selection and Engineering
- Guyon, I., & Elisseeff, A. (2003). "An Introduction to Variable and Feature Selection." Journal of Machine Learning Research, 3, 1157--1182. --- Comprehensive overview of feature selection methods.
Model Calibration
- Niculescu-Mizil, A., & Caruana, R. (2005). "Predicting Good Probabilities with Supervised Learning." Proceedings of the 22nd International Conference on Machine Learning, pp. 625--632. --- Demonstrates that boosted trees tend to push predictions away from 0 and 1, motivating post-hoc calibration.
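The post-hoc fix that paper motivates is wrapped up in scikit-learn's `CalibratedClassifierCV`. A minimal sketch comparing Brier scores before and after isotonic calibration on synthetic data (dataset and model choices are illustrative):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An uncalibrated boosted model versus the same model wrapped in isotonic
# calibration, which remaps raw scores to probabilities on held-out folds.
raw = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
cal = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=0), method="isotonic", cv=5
).fit(X_tr, y_tr)

brier_raw = brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1])
brier_cal = brier_score_loss(y_te, cal.predict_proba(X_te)[:, 1])
print(f"Brier score raw: {brier_raw:.4f}  calibrated: {brier_cal:.4f}")
```

Calibration matters whenever the probabilities themselves are the product, as with xG and win-probability models, rather than just the predicted class.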
Model Interpretability
- Lundberg, S. M., & Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." Advances in Neural Information Processing Systems, 30. --- Introduces SHAP (SHapley Additive exPlanations) for model interpretability.
- Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "'Why Should I Trust You?' Explaining the Predictions of Any Classifier." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. --- Introduces LIME (Local Interpretable Model-agnostic Explanations).
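SHAP and LIME are full frameworks with their own libraries; the model-agnostic idea they share can be previewed with scikit-learn's much simpler permutation importance, shown here as a lightweight stand-in (not a substitute for per-prediction SHAP values) on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Six features, only three of which carry signal.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    random_state=0,
)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature in turn and measure the drop in score: features
# whose permutation hurts the model most matter most to its predictions.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("features by importance:", ranking)
```

Where permutation importance gives one global number per feature, SHAP decomposes each individual prediction into per-feature contributions, which is what makes it useful for explaining why a single shot received a high xG.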
Model Deployment and MLOps
- Sculley, D., Holt, G., Golovin, D., et al. (2015). "Hidden Technical Debt in Machine Learning Systems." Advances in Neural Information Processing Systems, 28. --- Seminal paper on the challenges of maintaining ML systems in production.
Online Resources
Data Sources
- StatsBomb Open Data: Free event data for select competitions. GitHub: statsbomb/open-data.
- Wyscout Public Datasets: Publicly released match event data used in academic research.
- FBref.com: Comprehensive aggregated statistics for players and teams across top leagues.
Software Documentation
- scikit-learn User Guide: https://scikit-learn.org/stable/user_guide.html --- Comprehensive documentation with theoretical explanations for all algorithms used in this chapter.
- SHAP Documentation: https://shap.readthedocs.io --- Tutorial and API reference for SHAP interpretability.
- MLflow Documentation: https://mlflow.org/docs/latest/index.html --- Open-source platform for model tracking, versioning, and deployment.
Community and Blogs
- Friends of Tracking: YouTube channel with video tutorials on soccer analytics methods, including ML-based models.
- Soccermatics by David Sumpter: Blog and book bridging mathematics and soccer analytics.
- The Athletic / StatsBomb blog: Regular articles applying ML concepts to real-world soccer analysis.
Recommended Learning Path
For readers new to machine learning in soccer, we recommend the following sequence:
- Start with James et al. (ISLP) for ML fundamentals.
- Apply with Géron's Hands-On ML for practical scikit-learn skills.
- Specialize with the Decroos et al. VAEP paper and the Anzer & Bauer xG paper for soccer-specific methods.
- Deepen with Chen & Guestrin (XGBoost) and Lundberg & Lee (SHAP) for advanced techniques.
- Productionize with Sculley et al. on ML technical debt and the MLflow documentation for deployment.