Chapter 19: Further Reading

Textbooks and General References

Machine Learning Foundations

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer. --- The authoritative reference for the statistical foundations of machine learning. Chapters on tree-based methods, boosting, and model selection are particularly relevant.

  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in Python (2nd ed.). Springer. --- A more accessible introduction than ESL, with practical Python examples. Freely available online.

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. --- Comprehensive treatment of probabilistic machine learning, including Gaussian Mixture Models and the EM algorithm.

  • Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. --- A modern, thorough treatment of ML from a probabilistic perspective. Freely available online.

Applied Machine Learning

  • Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.). O'Reilly Media. --- Excellent practical guide covering scikit-learn pipelines, ensemble methods, and deployment.

  • Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer. --- Strong coverage of feature engineering, model tuning, and evaluation strategies with practical advice.

Soccer Analytics Literature

Expected Goals and Shot Models

  • Rathke, A. (2017). "An examination of expected goals and shot efficiency in soccer." Journal of Human Sport and Exercise, 12(2proc), S514--S529. --- One of the early academic treatments of xG methodology.

  • Anzer, G., & Bauer, P. (2021). "A Goal Scoring Probability Model for Shots Based on Synchronized Positional and Event Data in Football (Soccer)." Frontiers in Sports and Active Living, 3, 624475. --- Demonstrates the value of incorporating tracking data (defender and goalkeeper positions) into xG models.

  • Robberechts, P., Davis, J., & Van Haaren, J. (2021). "A Bayesian Approach to In-Game Win Probability in Soccer." Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. --- Extends xG concepts to full-match win probability modeling.
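
All of the shot models above are, at bottom, probabilistic classifiers over shot features. The core idea can be sketched in plain Python with a single synthetic distance feature and a hand-rolled logistic fit; every number below is illustrative, not taken from any of the cited models:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.01, epochs=1000):
    """Fit weights and bias by batch gradient descent on log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        gw = [0.0] * len(w)
        gb = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

# Synthetic shots: one feature (distance to goal in metres), binary outcome.
random.seed(42)
X, y = [], []
for _ in range(500):
    dist = random.uniform(5.0, 30.0)
    p_true = sigmoid(2.0 - 0.25 * dist)   # closer shots score more often
    X.append([dist])
    y.append(1 if random.random() < p_true else 0)

w, b = fit_logistic(X, y)

def xg(dist):
    """Estimated scoring probability for a shot from `dist` metres."""
    return sigmoid(w[0] * dist + b)
```

In the cited work, distance would be joined by shot angle, body part, and game context, and Anzer & Bauer add tracking-derived features such as defender and goalkeeper positions.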

Player Valuation and Action Value

  • Decroos, T., Bransen, L., Van Haaren, J., & Davis, J. (2019). "Actions Speak Louder than Goals: Valuing Player Actions in Soccer." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1851--1861. --- Introduces VAEP (Valuing Actions by Estimating Probabilities), a framework for valuing all on-the-ball actions.

  • Fernández, J., Bornn, L., & Cervone, D. (2021). "A Framework for the Fine-Grained Evaluation of the Instantaneous Expected Value of Soccer Possessions." Machine Learning, 110(6), 1389--1427. --- Expected Possession Value (EPV) model using tracking data.

  • Singh, K. (2019). "Introducing Expected Threat (xT)." karun.in/blog. --- The original blog post introducing Expected Threat as a Markov-chain valuation of pitch zones.
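
Singh's xT formulation is compact enough to sketch directly: the value of a zone is the probability of scoring by shooting now, plus the expected value of moving the ball onward, solved by value iteration over a Markov chain of zones. A toy version on a one-dimensional, five-zone pitch (all probabilities invented for illustration; the real model uses a 2-D grid fitted from event data):

```python
# One-dimensional pitch with five zones; zone 4 is nearest the opponent's goal.
N = 5
shoot_prob = [0.00, 0.00, 0.10, 0.30, 0.60]   # P(next action is a shot | zone)
goal_prob  = [0.00, 0.00, 0.02, 0.08, 0.25]   # P(goal | shot from zone)

# Move transition matrix T[z][zp]: where the ball goes when it is moved.
T = []
for z in range(N):
    row = [0.0] * N
    row[min(z + 1, N - 1)] += 0.6   # forward
    row[z] += 0.3                    # sideways / retained
    row[max(z - 1, 0)] += 0.1        # backward
    T.append(row)

# Value iteration: V[z] = shoot-and-score now, or move and collect V later.
V = [0.0] * N
for _ in range(100):
    V = [shoot_prob[z] * goal_prob[z]
         + (1 - shoot_prob[z]) * sum(T[z][zp] * V[zp] for zp in range(N))
         for z in range(N)]
```

A pass or carry from zone z to zone zp is then credited with V[zp] - V[z], which is how xT turns zone values into action values.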

Clustering and Player Roles

  • Decroos, T., & Davis, J. (2020). "Player Vectors: Characterizing Soccer Players' Playing Style from Match Event Streams." Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2019), pp. 569--584. --- Uses action sequences to create dense vector representations of player styles.

  • Aalbers, B., & Van Haaren, J. (2023). "Distinguishing Between Player Roles in Football Using Clustering." Journal of Sports Sciences. --- Systematic comparison of clustering methods for player role identification.

Technical Topics

Ensemble Methods

  • Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785--794. --- The foundational XGBoost paper, essential for understanding modern gradient boosting.

  • Ke, G., Meng, Q., Finley, T., et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." Advances in Neural Information Processing Systems, 30, 3146--3154. --- Introduces histogram-based splitting and Gradient-based One-Side Sampling (GOSS).

  • Wolpert, D. H. (1992). "Stacked Generalization." Neural Networks, 5(2), 241--259. --- The original paper on stacking (stacked generalization).
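
Wolpert's stacking scheme is easy to misread as simple model averaging; the essential ingredient is that the level-1 (meta) learner is trained on out-of-fold predictions, so no base model is ever evaluated on data it was fitted to. A self-contained sketch with two toy base learners and a least-squares meta-learner (all learners, data, and hyperparameters here are invented for illustration):

```python
import random

def kfold_oof(models, X, y, k=5):
    """Level-0: out-of-fold predictions, one column per base model."""
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]
    Z = [[0.0] * len(models) for _ in range(n)]
    for fold in folds:
        hold = set(fold)
        tr = [i for i in range(n) if i not in hold]
        for m, learn in enumerate(models):
            fitted = learn([X[i] for i in tr], [y[i] for i in tr])
            for i in fold:
                Z[i][m] = fitted(X[i])
    return Z

# Two toy base "learners": each takes (X, y) and returns a predict function.
def mean_learner(X, y):
    mu = sum(y) / len(y)
    return lambda x: mu

def nearest_learner(X, y):
    data = list(zip(X, y))
    return lambda x: min(data, key=lambda d: abs(d[0] - x))[1]

def fit_meta(Z, y):
    """Level-1: least-squares weights over the OOF columns (2x2 normal equations)."""
    a11 = sum(z[0] * z[0] for z in Z)
    a12 = sum(z[0] * z[1] for z in Z)
    a22 = sum(z[1] * z[1] for z in Z)
    b1 = sum(z[0] * yi for z, yi in zip(Z, y))
    b2 = sum(z[1] * yi for z, yi in zip(Z, y))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (b2 * a11 - b1 * a12) / det

random.seed(0)
X = [random.uniform(0.0, 10.0) for _ in range(100)]
y = [0.5 * x + random.gauss(0.0, 0.1) for x in X]

models = [mean_learner, nearest_learner]
w0, w1 = fit_meta(kfold_oof(models, X, y), y)

# For prediction, base models are refitted on all the data.
final = [learn(X, y) for learn in models]

def stacked(x):
    return w0 * final[0](x) + w1 * final[1](x)
```

Here the meta-learner correctly gives most of the weight to the stronger base model, which it can only discover because the OOF predictions are honest estimates of generalization.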

Feature Selection and Engineering

  • Guyon, I., & Elisseeff, A. (2003). "An Introduction to Variable and Feature Selection." Journal of Machine Learning Research, 3, 1157--1182. --- Comprehensive overview of feature selection methods.

Model Calibration

  • Niculescu-Mizil, A., & Caruana, R. (2005). "Predicting Good Probabilities with Supervised Learning." Proceedings of the 22nd International Conference on Machine Learning, pp. 625--632. --- Demonstrates that boosted trees tend to push predictions away from 0 and 1, motivating post-hoc calibration.
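
The practical upshot of this line of work is post-hoc calibration: learn a monotone map from held-out scores to probabilities. Platt scaling, one of the two remedies the paper studies (the other is isotonic regression), fits a sigmoid to classifier margins. A plain-Python sketch on synthetic held-out data; the scores, rates, and hyperparameters are illustrative, and the original method additionally regularizes the targets slightly, which is omitted here:

```python
import math

def platt_scale(scores, labels, lr=0.5, epochs=2000):
    """Fit p = sigmoid(a*s + b) to held-out (margin, label) pairs by batch
    gradient descent on log-loss; returns the calibration map."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        ga = gb = 0.0
        for s, yv in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - yv) * s
            gb += (p - yv)
        a -= lr * ga / n
        b -= lr * gb / n
    return lambda s: 1.0 / (1.0 + math.exp(-(a * s + b)))

# Synthetic held-out set: two groups of margins whose true positive
# rates (0.10 and 0.80) differ from what the raw scores suggest.
scores = [-1.0] * 100 + [1.0] * 100
labels = [0] * 90 + [1] * 10 + [0] * 20 + [1] * 80
calibrate = platt_scale(scores, labels)
```

After fitting, `calibrate` maps each margin to (approximately) the empirical positive rate of its group while staying monotone, which is exactly what a calibration curve demands.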

Model Interpretability

  • Lundberg, S. M., & Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." Advances in Neural Information Processing Systems, 30. --- Introduces SHAP (SHapley Additive exPlanations) for model interpretability.

  • Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "'Why Should I Trust You?' Explaining the Predictions of Any Classifier." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. --- Introduces LIME (Local Interpretable Model-agnostic Explanations).
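
SHAP's target quantity, the Shapley value of each feature, can be computed exactly by brute force for toy models, which makes the definition concrete: average a feature's marginal contribution over all orderings, holding absent features at a baseline. A sketch (the model `f` and baseline are invented; real SHAP implementations use sampling or model-specific algorithms to avoid the factorial cost):

```python
from itertools import permutations

def exact_shapley(f, x, baseline):
    """Exact Shapley values for the features of x w.r.t. model f, averaging
    marginal contributions over all feature orderings. Features not yet
    'added' stay at their baseline value. Exponential cost: toy sizes only."""
    n = len(x)
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for order in perms:
        z = list(baseline)
        prev = f(z)
        for j in order:
            z[j] = x[j]
            cur = f(z)
            phi[j] += cur - prev
            prev = cur
    return [p / len(perms) for p in phi]

# Toy model with an interaction between features 0 and 2.
def f(z):
    return 2 * z[0] + z[1] + 0.5 * z[0] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = exact_shapley(f, x, baseline)
```

The attributions satisfy the efficiency property (they sum to f(x) minus the baseline prediction), and the interaction term is split evenly between the two features involved; these are the axioms Lundberg & Lee build SHAP on.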

Model Deployment and MLOps

  • Sculley, D., Holt, G., Golovin, D., et al. (2015). "Hidden Technical Debt in Machine Learning Systems." Advances in Neural Information Processing Systems, 28. --- Seminal paper on the challenges of maintaining ML systems in production.

Online Resources

Data Sources

  • StatsBomb Open Data: Free event data for select competitions. GitHub: statsbomb/open-data.
  • Wyscout Public Datasets: Publicly released match event data used in academic research.
  • FBref.com: Comprehensive aggregated statistics for players and teams across top leagues.

Software Documentation

  • scikit-learn User Guide: https://scikit-learn.org/stable/user_guide.html --- Comprehensive documentation with theoretical explanations for all algorithms used in this chapter.
  • SHAP Documentation: https://shap.readthedocs.io --- Tutorial and API reference for SHAP interpretability.
  • MLflow Documentation: https://mlflow.org/docs/latest/index.html --- Open-source platform for model tracking, versioning, and deployment.

Community and Blogs

  • Friends of Tracking: YouTube channel with video tutorials on soccer analytics methods, including ML-based models.
  • Soccermatics by David Sumpter: Blog and book bridging mathematics and soccer analytics.
  • The Athletic / StatsBomb blog: Regular articles applying ML concepts to real-world soccer analysis.

Suggested Reading Path

For readers new to machine learning in soccer, we recommend the following sequence:

  1. Start with James et al. (ISLP) for ML fundamentals.
  2. Apply with Géron's Hands-On ML for practical scikit-learn skills.
  3. Specialize with the Decroos et al. VAEP paper and the Anzer & Bauer xG paper for soccer-specific methods.
  4. Deepen with Chen & Guestrin (XGBoost) and Lundberg & Lee (SHAP) for advanced techniques.
  5. Productionize with Sculley et al. on ML technical debt and the MLflow documentation for deployment.