Chapter 19: Further Reading - Modeling Soccer

Academic Papers and Research

  1. Dixon, Mark J., and Stuart G. Coles. "Modelling Association Football Scores and Inefficiencies in the Football Betting Market." Journal of the Royal Statistical Society, Series C (1997). The foundational paper that extended the independent Poisson model with a dependence adjustment for low-scoring results and a time-weighted likelihood that down-weights older matches. Essential reading for anyone implementing a soccer prediction model. The paper's approach to parameter estimation via maximum likelihood and its identification of market inefficiencies remain directly relevant today.

  2. Maher, Michael J. "Modelling Association Football Scores." Statistica Neerlandica (1982). The precursor to Dixon-Coles that established the independent Poisson framework for soccer scoring. Maher's parameterization of attack and defense strengths laid the groundwork for most subsequent Poisson-based soccer models.

  3. Karlis, Dimitris, and Ioannis Ntzoufras. "Analysis of Sports Data by Using Bivariate Poisson Models." Journal of the Royal Statistical Society, Series D (2003). Extends the bivariate Poisson approach beyond Dixon-Coles, exploring alternative correlation structures and applying them to multiple European leagues. Provides a rigorous statistical framework for comparing different Poisson-based models.

  4. Eggels, Harm, Ruud van Elk, and Mykola Pechenizkiy. "Expected Goals in Soccer: Explaining Match Results Using Predictive Analytics." Machine Learning and Data Mining for Sports Analytics Workshop (2016). An early systematic treatment of expected goals modeling using machine learning techniques. Compares logistic regression with gradient-boosted trees and random forests for shot-level prediction.

  5. Rathke, Alex. "An Examination of Expected Goals and Shot Efficiency in Soccer." Journal of Human Sport and Exercise (2017). Provides a thorough statistical analysis of xG model accuracy and the relationship between xG and actual goals at both match and season level. Quantifies the regression effect that underpins the xG betting trade.

  6. Koopman, Siem Jan, and Rutger Lit. "A Dynamic Bivariate Poisson Model for Analysing and Forecasting Match Results in the English Premier League." Journal of the Royal Statistical Society, Series A (2015). Implements a state-space version of the Poisson model where team strengths evolve dynamically over time. This approach is more sophisticated than the Dixon-Coles time-decay weighting, but it is computationally more demanding.

  7. Hvattum, Lars Magnus, and Halvard Arntzen. "Using ELO Ratings for Match Result Prediction in Association Football." International Journal of Forecasting (2010). Demonstrates that a well-calibrated Elo system can compete with more complex models for soccer prediction. Provides a useful baseline against which to evaluate Dixon-Coles and xG-based approaches.
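
As a rough illustration of the Poisson framework that runs through the papers above, the following minimal Python sketch builds a scoreline probability grid from two independent Poisson distributions and applies the low-score dependence adjustment described by Dixon and Coles. The scoring rates and the dependence parameter are hypothetical values chosen for illustration, not estimates from any paper; in practice they would come from maximum likelihood fitting of attack, defense, and home-advantage parameters. The function names are likewise illustrative.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def dc_tau(x, y, lam, mu, rho):
    """Dixon-Coles adjustment factor for home goals x and away goals y.

    lam and mu are the home and away scoring rates. Only the four
    lowest scorelines are adjusted; every other scoreline keeps factor 1.
    """
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def score_matrix(lam, mu, rho, max_goals=10):
    """Grid of scoreline probabilities; rows are home goals, columns away goals."""
    return [
        [
            dc_tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
            for y in range(max_goals + 1)
        ]
        for x in range(max_goals + 1)
    ]


def outcome_probs(matrix):
    """Collapse the scoreline grid into (home win, draw, away win)."""
    home = draw = away = 0.0
    for x, row in enumerate(matrix):
        for y, p in enumerate(row):
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
    return home, draw, away


# Hypothetical inputs (assumptions for illustration only).
LAM, MU, RHO = 1.5, 1.1, -0.1

matrix = score_matrix(LAM, MU, RHO)
home, draw, away = outcome_probs(matrix)
print(f"home={home:.3f} draw={draw:.3f} away={away:.3f}")
```

Setting rho to 0 recovers the independent Poisson model. A negative rho raises the probability of 0-0 and 1-1 and lowers that of 1-0 and 0-1. Truncating the grid at ten goals per side discards a negligible amount of probability mass for typical scoring rates.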

Books

  1. Anderson, Chris, and David Sally. "The Numbers Game: Why Everything You Know About Football Is Wrong." Penguin Books (2013). An accessible introduction to soccer analytics that covers many of the themes in this chapter, including the randomness of goals, the importance of defense, and the disconnect between performance and results.

  2. Sumpter, David. "Soccermatics: Mathematical Adventures in the Beautiful Game." Bloomsbury Sigma (2016). A mathematician's exploration of soccer analytics, covering Poisson models, network analysis, and spatial modeling. Provides excellent intuition for the statistical foundations of soccer modeling without requiring advanced mathematical background.

  3. Wilson, Jonathan. "Inverting the Pyramid: The History of Soccer Tactics." Nation Books (2013). While not a quantitative book, understanding tactical evolution is crucial for anyone building league-specific models. Wilson's comprehensive history explains why different leagues developed different playing styles, which directly affects scoring rates and model calibration.

Data Sources

  1. FBref (Football Reference). Comprehensive statistical database for soccer leagues worldwide. Available at https://fbref.com/. Provides season-level and match-level statistics including xG, xG against, progressive passes, and pressing metrics. The advanced data was originally supplied by StatsBomb and the provider has since changed, so check the site for current coverage before relying on a particular metric. The primary free data source for serious soccer modeling.

  2. Understat. Expected goals data for the top five European leagues and the Russian Premier League, with shot-level xG values and shot maps. Available at https://understat.com/. Particularly useful for building and validating xG models, as it provides individual shot coordinates and xG values.

  3. Football-Data.co.uk. Historical match results and betting odds for dozens of leagues worldwide. Available at https://www.football-data.co.uk/. Includes 1X2 odds, Asian handicap lines, and over/under lines from multiple bookmakers. Essential for backtesting betting strategies.

  4. Transfermarkt. Squad market values and transfer data for professional soccer worldwide. Available at https://www.transfermarkt.com/. Market values serve as a useful proxy for squad quality, particularly for initializing models for newly promoted teams or leagues with limited statistical data.
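
When backtesting against bookmaker odds such as those in the Football-Data.co.uk files, a common first step is converting decimal 1X2 odds into implied probabilities. The sketch below uses simple proportional normalization to remove the bookmaker margin, which is one of several possible methods and an assumption here. The odds are hypothetical values, and the function name is illustrative rather than part of any library.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to margin-free probabilities.

    Returns (probabilities, overround). The overround is the amount by
    which the raw inverse odds exceed 1, i.e. the bookmaker margin.
    Proportional normalization is assumed; other methods exist.
    """
    if any(o <= 1.0 for o in decimal_odds):
        raise ValueError("decimal odds must be greater than 1.0")
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)
    return [r / total for r in raw], total - 1.0


# Hypothetical home/draw/away odds for a single match.
odds = [2.10, 3.40, 3.60]
probs, overround = implied_probabilities(odds)
print([round(p, 4) for p in probs], round(overround, 4))
```

Comparing a model's probabilities against these margin-free values, rather than against raw inverse odds, avoids mistaking the bookmaker's margin for an edge.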

Online Resources and Communities

  1. StatsBomb Open Data. Free event-level soccer data for select competitions, including shot coordinates, pass sequences, and defensive actions. Available on GitHub at https://github.com/statsbomb/open-data. The richest freely available event data for building xG models and possession-based analytics.

  2. The Expected Value Blog. Technical writing on soccer analytics and betting, with a focus on model building and market analysis. Covers topics from basic Poisson modeling through advanced machine learning approaches, with code examples in Python.

  3. Pinnacle Sports Betting Resources. Pinnacle's editorial content on soccer betting covers Asian handicap mechanics, market efficiency, and value identification strategies. As the sharpest major bookmaker for soccer, Pinnacle's closing lines serve as the benchmark for model evaluation.

  4. American Soccer Analysis. Data-driven analysis of MLS and US soccer, including expected goals models adapted for the American league context. Available at https://www.americansocceranalysis.com/. Useful for understanding how xG models need to be adapted for different league environments.

Methodological References

  1. Baio, Gianluca, and Marta A. Blangiardo. "Bayesian Hierarchical Model for the Prediction of Football Results." Journal of Applied Statistics (2010). Implements a full Bayesian version of the Poisson model with team-specific random effects. The hierarchical structure naturally regularizes parameter estimates and provides uncertainty quantification, making it particularly useful for leagues with many teams and limited data.

  2. Boshnakov, Georgi, Tarak Kharrat, and Ian G. McHale. "A Bivariate Weibull Count Model for Forecasting Association Football Scores." International Journal of Forecasting (2017). Proposes an alternative to the Poisson distribution: a count model derived from a renewal process with Weibull inter-arrival times, which allows the scoring hazard to change with the time elapsed since the previous goal rather than staying constant. Relevant for in-play modeling and for understanding the limitations of the static Poisson assumption.