Decroos et al. (2019) - KU Leuven Research - VAEP (Valuing Actions by Estimating Probabilities) - Academic foundation for action-based valuation → Chapter 7: Further Reading
"An Introduction to Statistical Learning"
James et al. (2021) - Essential statistical learning concepts - Regression and classification methods - Free PDF available (statlearning.com) → Chapter 8: Further Reading
"Beyond Expected Goals"
William Spearman (2018) - MIT Sloan Sports Analytics Conference - Introduces Expected Possession Value and advanced frameworks - Pioneering work on pitch control models - Available at: sloansportsconference.com → Chapter 7: Further Reading
"Calibration of Probabilities"
Niculescu-Mizil & Caruana (2005) - ICML Conference Paper - Definitive guide to Platt scaling and isotonic regression - Essential reading for probability model deployment → Chapter 7: Further Reading
"Chemistry in Football"
Various research - Passer-shooter combinations - Network analysis applications - Team cohesion metrics → Chapter 8: Further Reading
Focus: Delivery patterns and outcome modeling - Key Contribution: Framework for corner classification → Chapter 14: Further Reading
"Cross Claiming and Aerial Dominance Models"
Focus: Claiming probability models - Key Contribution: Expected claim success framework → Chapter 13: Further Reading
"Cruyff's Football Total Football"
Focus: Foundational pressing concepts - Relevance: Theoretical basis for modern pressing → Chapter 12: Further Reading
"Decomposing the Immeasurable Sport"
Fernández, Bornn & Cervone (2019) - Expected Possession Value framework - Advanced tracking data methodology → Chapter 7: Further Reading
"Deep Soccer Analytics"
Liu, Luo, Schulte & Kharrat (2020) - Neural network approaches to action valuation - Comparison with tree-based methods → Chapter 7: Further Reading
"Defensive Set Piece Organization"
Focus: Zonal vs. man-marking effectiveness - Key Contribution: Comparative analysis of defensive systems → Chapter 14: Further Reading
"Distribution Networks: Goalkeepers as Playmakers"
Focus: Passing network analysis including goalkeepers - Key Contribution: Quantifying goalkeeper distribution value → Chapter 13: Further Reading
"Expected Goals and Support Vector Machines"
Martin Eastwood (2014) - Explores machine learning approaches to xG - Early application of SVM to shot classification - https://pena.lt/y/ → Chapter 7: Further Reading
"Expected Goals"
Tippett, J. (2023) - Bloomsbury Sport - The first book-length treatment dedicated to xG - Covers history, methodology, and applications - ISBN: 978-1399401845 → Chapter 7: Further Reading
"Expected Threat"
Karun Singh (2019) - Original xT blog post and implementation - Extends xG concept to all ball actions - https://karun.in/blog/expected-threat.html → Chapter 7: Further Reading
"Famous Set Piece Goals in History"
Focus: Historical analysis of memorable set pieces - Relevance: Understanding tactical evolution → Chapter 14: Further Reading
"Football Hackers"
Christoph Biermann (2019) - History of analytics adoption in soccer - Profiles of analysts and their methods - Industry context and evolution → Chapter 8: Further Reading
"Free Kick Trajectory Modeling"
Focus: Physics of free kick ball flight - Key Contribution: Technical analysis of shooting technique → Chapter 14: Further Reading
"Friends of Tracking" YouTube Channel
Free lectures by academics and practitioners - Covers xG, tracking data, and advanced metrics - https://www.youtube.com/friendsoftracking → Chapter 7: Further Reading
"From Key Passes to xA"
Industry evolution - Metric development timeline - Provider adoption - Current standards → Chapter 8: Further Reading
"Game Theory and Penalty Kicks"
Authors: Various economics researchers - Focus: Mixed strategy equilibrium in penalty situations - Key Contribution: Mathematical framework for penalty analysis → Chapter 14: Further Reading
Focus: Training approaches with data integration - Relevance: Connecting analysis to development → Chapter 13: Further Reading
"Hands-On Machine Learning"
Aurélien Géron (2022, 3rd Ed.) - Scikit-learn and TensorFlow applications - Model building and evaluation - Applicable to xA model development → Chapter 8: Further Reading
"Introducing Expected Threat (xT)"
Karun Singh (2019) - Original blog post introducing the xT concept - Clear explanation with visualizations - Implementation details - Link: karun.in/blog/expected-threat.html → Chapter 9: Further Reading
"Machine Learning for Defensive Player Evaluation"
Focus: ML approaches to defender assessment - Key Contribution: Feature engineering for defensive metrics → Chapter 12: Further Reading
"Measuring Ball Progression in Football"
Various analysts - Multiple blog posts on progression metrics - Different definitions and implementations - FBref, StatsBomb, and Wyscout methodologies → Chapter 9: Further Reading
"Moneyball: The Art of Winning an Unfair Game"
Author: Michael Lewis - Relevance: Framework for valuing underrated contributions (parallel to defense) → Chapter 12: Further Reading
"Net Gains"
Ryan O'Hanlon (2022) - Modern soccer analytics landscape - Player evaluation methods - Industry applications and case studies → Chapter 8: Further Reading
Peña & Touchette (2012) - Network analysis of passing patterns - Understanding team passing structure - Chaos, Complexity, and Entropy conference → Chapter 8: Further Reading
"Pattern Recognition and Machine Learning"
Bishop, C. M. (2006) - Springer - Rigorous treatment of logistic regression and probability calibration - ISBN: 978-0387310732 → Chapter 7: Further Reading
"The Elements of Statistical Learning"
Hastie, T., Tibshirani, R., & Friedman, J. (2009) - Stanford University (free online) - Theoretical foundation for gradient boosting and model evaluation - https://web.stanford.edu/~hastie/ElemStatLearn/ → Chapter 7: Further Reading
"The Expected Goals Philosophy"
James Tippett (2019) - Accessible introduction to xG concepts - Includes xA discussion and applications - Good for conceptual understanding → Chapter 8: Further Reading
"The Numbers Game"
Anderson & Sally (2013) - Statistical analysis of soccer - Foundation for advanced metrics - Accessible statistical thinking → Chapter 8: Further Reading
"The Problem with Expected Goals"
Spearman (2017) - Critical analysis of xG limitations - Implications for xA interpretation - OptaPro Analytics Forum → Chapter 8: Further Reading
"The Science of Set Piece Goals"
Focus: Quantifying set piece contribution to goal scoring - Key Contribution: Comprehensive breakdown of set piece types and values → Chapter 14: Further Reading
Decroos et al. (2019) - Introduces VAEP framework for valuing all on-ball actions including passes - Provides mathematical foundation for action valuation - Link: KU Leuven DTAI Research Group publications → Chapter 8: Further Reading
"Wide Open Spaces"
Fernández & Bornn (2018) - Space creation quantification - Off-ball contribution measurement - Tracking data requirements → Chapter 9: Further Reading
Z-score standardization converts metrics measured on different scales (e.g., passes per 90 vs. xG per 90) to a common scale with mean 0 and standard deviation 1. → Chapter 21: Quiz — Player Recruitment and Scouting
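As a minimal sketch of the standardization described above, the snippet below puts two metrics on a common scale. The per-90 figures are made-up illustrative values, and using the population standard deviation is an assumption; a sample estimate may suit small squads better.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD (assumption); stdev() gives a sample estimate
    if sigma == 0:
        raise ValueError("cannot standardize a metric with no variation")
    return [(v - mu) / sigma for v in values]


# Hypothetical per-90 figures for five players (illustrative only).
passes_per_90 = [45.0, 60.0, 52.0, 71.0, 38.0]
xg_per_90 = [0.10, 0.45, 0.22, 0.05, 0.31]

passes_z = z_scores(passes_per_90)
xg_z = z_scores(xg_per_90)

# Both metrics now share a scale, so they can be compared or combined.
for p, x in zip(passes_z, xg_z):
    print(f"passes z = {p:+.2f}, xG z = {x:+.2f}")
```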
Arsenal's acquisition of StatDNA in 2012 brought sophisticated analytics in-house. StatDNA, founded by Jaeson Rosenfeld, had built a comprehensive system for evaluating player performance using data, and Arsenal saw enough value in it to acquire the entire company rather than just licensing the prod → Chapter 1: Introduction to Soccer Analytics
Overpaying relative to market value - Long contracts for aging players - High wage demands relative to squad structure - Sell-on value uncertainty → Chapter 21: Player Recruitment and Scouting
A
A four-pillar ethical framework
transparency, consent and agency, proportionality, and fairness --- should guide all data collection and use in soccer analytics. → Chapter 30: Key Takeaways
a) Hypotheses:
H₀: p₁ = p₂ (home win rates are equal in both periods) - H₁: p₁ ≠ p₂ (home win rates differ between periods) - OR specifically: H₁: p₁ > p₂ (home advantage has declined) → Quiz: Statistical Foundations for Soccer Analysis
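One way to test these hypotheses is a pooled two-proportion z-test. The sketch below assumes that approach and uses invented win counts (175 and 152 home wins out of 380 matches in each period); it is not taken from the quiz solution.

```python
import math


def two_proportion_z_test(wins1, n1, wins2, n2):
    """Pooled two-proportion z-test for H0: p1 = p2."""
    p1, p2 = wins1 / n1, wins2 / n2
    pooled = (wins1 + wins2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF
    p_one_sided = 1 - cdf                 # H1: p1 > p2
    p_two_sided = 2 * min(cdf, 1 - cdf)   # H1: p1 != p2
    return z, p_one_sided, p_two_sided


# Hypothetical counts: period 1 vs. period 2 home wins (illustrative only).
z, p_one, p_two = two_proportion_z_test(175, 380, 152, 380)
print(f"z = {z:.3f}, one-sided p = {p_one:.4f}, two-sided p = {p_two:.4f}")
```

With these made-up counts the one-sided and two-sided tests fall on opposite sides of the conventional 0.05 threshold, which is why the choice of H₁ should be fixed before looking at the data.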
Academic Journals:
*Journal of Sports Analytics* (IOS Press) - *Journal of Quantitative Analysis in Sports* (De Gruyter) - *International Journal of Performance Analysis in Sport* (Taylor & Francis) - *Scientific Reports* and *PLOS ONE* for interdisciplinary sports science research → Chapter 30: The Future of Soccer Analytics
Academic route:
Degree in statistics, computer science, sports science, mathematics, physics, or related field - Research projects or thesis focused on soccer analytics - Publications or public portfolio demonstrating capability - Masters or PhD programs with sports analytics specializations (e.g., the University o → Chapter 1: Introduction to Soccer Analytics
Accelerometer
A sensor that measures the rate of change of velocity, commonly embedded in GPS vests to track player movements and impacts during training and matches. (Chapter 18) → Appendix E: Glossary of Soccer Analytics Terms
Accuracy
In classification, the proportion of correct predictions out of total predictions. In soccer analytics, often applied to pass completion or shot-on-target rates. (Chapter 3) → Appendix E: Glossary of Soccer Analytics Terms
Acquisition:
Source: Independiente del Valle (Ecuador) - Age at signing: 19 - Fee: Approximately GBP 4.5 million - Identified through: Event data analytics flagging elite pressing metrics in the Ecuadorian top flight and Copa Libertadores → Case Study 2: Brighton's Value-Based Recruitment
Action Valuation
A framework for assigning a numerical value to every on-ball action (pass, carry, shot, tackle) based on its contribution to the probability of scoring or conceding. VAEP and xT are prominent action valuation models. (Chapter 9) → Appendix E: Glossary of Soccer Analytics Terms
Acute:Chronic Workload Ratio (ACWR)
The ratio of a player's recent training load (typically 7-day rolling average) to their longer-term baseline (typically 28-day exponentially weighted moving average). Values between 0.8 and 1.3 are generally considered the safe zone. (Chapter 26) → Appendix E: Glossary of Soccer Analytics Terms
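The ratio above can be sketched as follows. The smoothing factor `2 / (span + 1)`, the choice to seed the moving average with the first day's load, and the daily load values are all assumptions made for illustration.

```python
def ewma(loads, span):
    """Exponentially weighted moving average, seeded with the first value."""
    alpha = 2 / (span + 1)  # common smoothing convention (assumption)
    value = loads[0]
    for load in loads[1:]:
        value = alpha * load + (1 - alpha) * value
    return value


def acwr(daily_loads, acute_days=7, chronic_span=28):
    """Acute (7-day rolling mean) over chronic (28-day EWMA) load."""
    if len(daily_loads) < acute_days:
        raise ValueError("need at least one full acute window of data")
    acute = sum(daily_loads[-acute_days:]) / acute_days
    chronic = ewma(daily_loads, chronic_span)
    return acute / chronic


def in_safe_zone(ratio, low=0.8, high=1.3):
    return low <= ratio <= high


# Hypothetical daily loads in arbitrary units (illustrative only).
steady = [400.0] * 35
spike = [400.0] * 28 + [700.0] * 7  # sudden jump in the final week

print(round(acwr(steady), 2), in_safe_zone(acwr(steady)))
print(round(acwr(spike), 2), in_safe_zone(acwr(spike)))
```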
Focus on Part III (Chapters 15-21) and selected Part IV chapters - Emphasis on projects and case studies → How to Use This Book
Advantages of continuous models:
No arbitrary boundary effects - Can capture fine-grained spatial variation - Naturally integrate with tracking data - Can be conditioned on game context (score state, time remaining) → Chapter 17: Spatial Analysis and Pitch Control
Advantages of zone-based models:
Computationally simple and fast - Easy to interpret and communicate to coaches - Work with event data (no tracking data required) - Aggregate naturally across matches and seasons → Chapter 17: Spatial Analysis and Pitch Control
Advantages over xT:
Accounts for action-specific features (pass type, body part, pressure) - Includes defensive value (preventing conceding) - Can handle complex sequences - Values the action itself, not just the positional change → Chapter 9: Expected Threat (xT) and Ball Progression
Advantages:
Much smaller data volume than full tracking (thousands of frames per match versus millions of data points) - Enables spatial analysis of key moments without the infrastructure requirements of full tracking - Often included with event data subscriptions at no additional cost - Sufficient for many ana → Chapter 2: Data Sources and Collection in Soccer
Aerial Duel
A contest between two players for a ball in the air, typically from a long pass, cross, or goal kick. Win rate is expressed as a percentage. (Chapter 12) → Appendix E: Glossary of Soccer Analytics Terms
Aerial Stopper Requirements:
Aerial win rate > 70% - Clearances > 7.0 per 90 - Heading accuracy in both boxes → Chapter 12: Key Takeaways
[ ] Can explain what soccer analytics is and why it matters - [ ] Can load, clean, and explore soccer datasets - [ ] Can perform basic statistical analysis on soccer data - [ ] Can create pitch visualizations - [ ] Can calculate and interpret common soccer metrics → How to Use This Book
After Part II (Core Analytics):
[ ] Can build and evaluate an xG model - [ ] Can construct and analyze passing networks - [ ] Can implement possession value frameworks - [ ] Can evaluate players using multiple metrics - [ ] Can analyze team performance and style → How to Use This Book
After Part III (Advanced Analytics):
[ ] Can work with spatial and tracking data - [ ] Can apply machine learning to soccer problems - [ ] Can build predictive models for outcomes - [ ] Can support scouting with data analysis - [ ] Can perform tactical analysis → How to Use This Book
After Parts IV-V:
[ ] Can apply deep learning techniques - [ ] Can conduct economic analysis - [ ] Can integrate analytics into organizational workflows - [ ] Can complete end-to-end analysis projects → How to Use This Book
Varane (25): Prime years ahead, highest long-term value - Umtiti (24): Prime years ahead, injury concerns - Stones (24): Development trajectory positive - Piqué (31): Experience but limited future value - Alderweireld (29): Near-prime, 3-4 year window → Case Study 2: Identifying Ball-Progressing Players for Recruitment
Age Curve
A statistical model describing how player performance changes as a function of age, typically showing peak performance between ages 25 and 29 for most outfield positions. (Chapter 21) → Appendix E: Glossary of Soccer Analytics Terms
Algorithmic Bias
Systematic errors in model predictions that create unfair outcomes for particular groups, such as undervaluing players from certain leagues or backgrounds. (Chapter 30) → Appendix E: Glossary of Soccer Analytics Terms
Which delivery type generates the highest xG per corner? - Are teams over- or under-performing their corner xG? - What is the total xG from corners vs. open play? → Chapter 14: Exercises
Pressures per 90: 26.3 (98th percentile for Ecuadorian league midfielders) - Tackles + interceptions per 90: 7.8 (95th percentile) - Progressive passes per 90: 4.2 (adjusted: 3.3 PL equivalent) - Ball recoveries in middle third per 90: 9.1 → Case Study 2: Brighton's Value-Based Recruitment
Proprietary player valuation models - Recruitment screening tools with customizable filters - Performance benchmarking dashboards - Tactical analysis tools integrated with video - Injury risk and workload monitoring → Case Study 2: Scaling Analytics at Manchester City Football Group
Analytics Philosophy:
"Moneyball" approach: identifying market inefficiencies - Statistical models for player valuation and recruitment - Willingness to sell high-performing players at peak value - Reinvestment of transfer profits into analytics and recruitment infrastructure → Chapter 28: Building an Analytics Department
Analytics Translators:
Bridge technical and non-technical stakeholders - Communicate insights to coaches and executives - Ensure analytics is actionable and understood - Build relationships across departments - Skills: Communication, domain expertise, relationship building, data visualization → Chapter 1: Introduction to Soccer Analytics
Stanford CS231n. The foundational deep learning for vision course. While not soccer-specific, it provides the theoretical background for all CNN-based approaches discussed in this chapter. → Chapter 23: Further Reading
Angle to Goal
The angle subtended by the goal posts from the location of a shot, calculated using trigonometry. A key feature in expected goals models. (Chapter 7) → Appendix E: Glossary of Soccer Analytics Terms
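A minimal sketch of the trigonometry, assuming a 105 x 68 metre pitch with the attacked goal centred at (105, 34). The coordinate system is an assumption here; data providers each use their own.

```python
import math

GOAL_WIDTH = 7.32  # metres between the posts


def goal_angle(x, y, goal_x=105.0, goal_y=34.0):
    """Angle in radians subtended by the two posts from shot location (x, y)."""
    # Vectors from the shot location to each post.
    post_a = (goal_x - x, goal_y - GOAL_WIDTH / 2 - y)
    post_b = (goal_x - x, goal_y + GOAL_WIDTH / 2 - y)
    cross = post_a[0] * post_b[1] - post_a[1] * post_b[0]
    dot = post_a[0] * post_b[0] + post_a[1] * post_b[1]
    return abs(math.atan2(cross, dot))


penalty_spot = goal_angle(94.0, 34.0)  # 11 m out, central
wide_shot = goal_angle(94.0, 10.0)     # same depth, far out wide

print(f"penalty spot: {math.degrees(penalty_spot):.1f} degrees")
print(f"wide shot:    {math.degrees(wide_shot):.1f} degrees")
```

Using `atan2` on the cross and dot products avoids special-casing shots taken from the goal line or from directly in front of a post.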
Anti-Patterns to Avoid:
Running cells out of order (creates hidden state bugs) - Putting all code in one massive notebook - Leaving commented-out experimental code everywhere - Defining the same function in multiple notebooks → Chapter 4: Python Tools for Soccer Analytics
Applications:
Training load management and periodization - Injury risk prediction (players whose acute load significantly exceeds their chronic load are at elevated injury risk—the "acute:chronic workload ratio") - Return-to-play monitoring (comparing current physical output to pre-injury baselines) - Performance → Chapter 2: Data Sources and Collection in Soccer
Assist
The final pass or action leading directly to a goal. Expected assists (xA) models assign a probability that a given pass will result in a goal. (Chapter 8) → Appendix E: Glossary of Soccer Analytics Terms
FIFA World Cup 2018 (Men's) - complete event data for all 64 matches - FIFA World Cup 2022 (Men's) - complete event data - UEFA Euro 2020 (2021) - complete event data - UEFA Euro 2024 - complete event data - FA Women's Super League (multiple seasons) - NWSL (multiple seasons) - La Liga (selected sea → Chapter 2: Data Sources and Collection in Soccer
Available Data:
Event data (passes, shots, carries, pressures, duels, etc.) - Lineup data - Match metadata - 360 freeze-frame data (for select matches) → Appendix D: Data Sources and Tools
B
Backpass
A pass directed away from the opponent's goal, often used to retain possession and reset the attacking build-up. Excessive backpass frequency may indicate a team under pressing pressure. (Chapter 22) → Appendix E: Glossary of Soccer Analytics Terms
Backup target: Player Beta
Slightly lower current level but strong upside - Lower fee provides better financial value - Scout wants further evaluation -- monitor for the remainder of the season - If Alpha is unavailable or price escalates, Beta becomes the primary target → Case Study 21.2: Scouting a Replacement — Finding the Next N'Golo Kanté
Pass completion > 88% - Progressive passes > 3.5 per 90 - Comfortable under pressure → Chapter 12: Key Takeaways
Bayesian Inference
A statistical framework that updates beliefs (prior probabilities) with observed data (likelihood) to produce updated beliefs (posterior probabilities). Particularly useful in soccer analytics due to small sample sizes. (Chapter 19) → Appendix E: Glossary of Soccer Analytics Terms
A step-by-step tutorial for implementing the Dixon-Coles model from scratch, including parameter estimation and prediction. → Chapter 20: Further Reading
Best Practices:
Write functions with clear parameters, type hints, and documentation - Use classes for complex, stateful analysis - Handle errors gracefully with logging and input validation - Optimize for large datasets with dtype downcasting and Parquet files → Chapter 4: Python Tools for Soccer Analytics
Between the Posts
xG shot maps and analysis - Match-level xA summaries - European league coverage - Link: betweentheposts.net → Chapter 8: Further Reading
Betweenness Centrality
A network metric measuring how often a node (player) lies on the shortest path between other nodes. High betweenness centrality in a passing network indicates a player who is critical to the team's ball circulation. (Chapter 10) → Appendix E: Glossary of Soccer Analytics Terms
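The shortest-path counting behind this metric can be sketched with Brandes' algorithm. The version below is a simplification that treats the passing network as undirected and unweighted, whereas real passing networks usually carry direction and pass counts. The player labels and links are hypothetical.

```python
from collections import deque


def betweenness(graph):
    """Unnormalized betweenness for an undirected, unweighted graph.

    graph: dict mapping each node to a list of its neighbours.
    """
    score = {v: 0.0 for v in graph}
    for s in graph:
        order = []                          # nodes in order of discovery
        preds = {v: [] for v in graph}      # predecessors on shortest paths
        sigma = dict.fromkeys(graph, 0)     # number of shortest paths from s
        dist = dict.fromkeys(graph, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                        # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):           # accumulate dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each pair was counted from both endpoints in an undirected graph.
    return {v: c / 2 for v, c in score.items()}


# Hypothetical passing links (illustrative only).
links = {
    "CB1": ["CB2", "DM"],
    "CB2": ["CB1", "DM"],
    "DM": ["CB1", "CB2", "AM"],
    "AM": ["DM", "ST"],
    "ST": ["AM"],
}
centrality = betweenness(links)
print(centrality)
```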
Big Chance
A shot opportunity where the shooter would reasonably be expected to score, typically defined as situations with xG above 0.35. Used by some data providers as a categorical metric alongside xG. (Chapter 7) → Appendix E: Glossary of Soccer Analytics Terms
Biochemical Recovery:
Creatine kinase (CK): A marker of muscle damage, typically peaks 24-48 hours post-match and returns to baseline within 72-96 hours. - Cortisol and testosterone: Indicators of the stress-recovery balance. → Chapter 26: Injury Prevention and Load Management
Books and Textbooks:
*Soccermatics* by David Sumpter --- accessible introduction to mathematical modeling in soccer - *The Expected Goals Philosophy* by James Tippett --- deep dive into xG theory and practice - *Football Hackers* by Christoph Biermann --- history and culture of data-driven football - *The Numbers Game* → Chapter 30: The Future of Soccer Analytics
Books:
*Python Crash Course* by Eric Matthes - *Automate the Boring Stuff with Python* by Al Sweigart (free online) - *Learning Python* by Mark Lutz (comprehensive reference) → Prerequisites
Bootstrap
A resampling technique that creates multiple samples by drawing with replacement from the original dataset. Used to estimate confidence intervals for metrics like xG model coefficients. (Chapter 3) → Appendix E: Glossary of Soccer Analytics Terms
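A minimal sketch of a percentile bootstrap. It estimates an interval for a simple shot conversion rate rather than for model coefficients, and the data (12 goals from 100 shots) are invented for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(data, stat=mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Draw n observations with replacement, n_resamples times.
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical outcomes: 1 = goal, 0 = no goal (illustrative only).
shots = [1] * 12 + [0] * 88
low, high = bootstrap_ci(shots)
print(f"conversion rate 0.12, 95% interval roughly ({low:.2f}, {high:.2f})")
```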
Progressive carries and progressive passes per 90 - Tackles + interceptions per 90 - Goal-creating actions per 90 - Distance covered and high-intensity sprints → Chapter 15: Player Performance Metrics
Branching strategies for analytics teams:
**Feature branches**: One branch per analysis task (e.g., `feature/corner-kick-analysis`, `feature/player-recruitment-report`). Merge into `main` when complete and reviewed. - **Experimentation branches**: Use branches to test alternative modeling approaches without committing unfinished work to the → Chapter 4: Python Tools for Soccer Analytics
Brier Score
A scoring rule that measures the accuracy of probabilistic predictions by computing the mean squared difference between predicted probabilities and actual binary outcomes. Lower is better. (Chapter 3) → Appendix E: Glossary of Soccer Analytics Terms
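The definition above translates almost directly into code. This is a small sketch with invented predictions and outcomes.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical xG values and whether each shot was scored (illustrative only).
xg = [0.10, 0.40, 0.80, 0.05]
goals = [0, 1, 1, 0]

score = brier_score(xg, goals)
print(f"Brier score: {score:.6f}")
```

A model that always predicts 0.5 scores 0.25, which gives a rough reference point when reading the number.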
Computer vision applied to standard broadcast television footage - No stadium hardware required—works with any televised match - Provider: SkillCorner (the leading provider in this category) - Lower accuracy than dedicated optical systems (estimated 1-2 meters) - Coverage is dramatically broader—any → Chapter 2: Data Sources and Collection in Soccer
Broadcasting revenue growth
new TV deals flood clubs with cash 2. **Competitive pressure** --- clubs spend to avoid relegation or achieve qualification 3. **Agent fees and intermediary costs** --- increasing intermediary involvement 4. **Market psychology** --- anchor effects from record-breaking transfers → Chapter 25: Economic Analysis and Player Valuation
Build custom models when:
You need consistent methodology across different data sources - You're developing predictive systems - You want to incorporate proprietary features → Chapter 7: Expected Goals (xG) Models
Build-Up Play
The phase of play where a team progresses the ball from their own defensive third toward the opponent's goal. Characterized by passing sequences, ball carries, and positional movements. (Chapter 22) → Appendix E: Glossary of Soccer Analytics Terms
C
Calibration
The property that a model's predicted probabilities match observed frequencies. A well-calibrated xG model assigning 0.20 to shots means approximately 20% of those shots result in goals. (Chapter 7) → Appendix E: Glossary of Soccer Analytics Terms
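One common way to inspect calibration is to bin predictions and compare the mean predicted probability with the observed goal rate in each bin. The sketch below assumes equal-width bins and uses invented shots.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Return (bin index, count, mean predicted, observed rate) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, o))
    rows = []
    for idx, members in enumerate(bins):
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(o for _, o in members) / len(members)
        rows.append((idx, len(members), mean_pred, observed))
    return rows


# Hypothetical shots: ten at xG 0.20 (2 scored), ten at xG 0.70 (7 scored).
xg = [0.2] * 10 + [0.7] * 10
goals = [1, 1] + [0] * 8 + [1] * 7 + [0] * 3

table = calibration_table(xg, goals)
for idx, count, mean_pred, observed in table:
    print(f"bin {idx}: n={count}, predicted={mean_pred:.2f}, observed={observed:.2f}")
```

In this toy data the predicted and observed values match in every bin, which is what a well-calibrated model looks like. Real data needs far more shots per bin before the comparison is meaningful.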
Camera Setup:
12-20+ cameras providing overlapping coverage of the entire pitch - Positioned high in stands for optimal viewing angles (typically at the top of the stands or on the stadium roof) - Calibrated to stadium dimensions using known reference points (pitch markings, stadium features) - Cameras operate at → Chapter 2: Data Sources and Collection in Soccer
Cardiac Autonomic Recovery:
Heart rate variability (HRV), particularly the natural log of the root mean square of successive differences (LnrMSSD). - Reduced HRV suggests incomplete autonomic recovery and may indicate elevated injury risk. → Chapter 26: Injury Prevention and Load Management
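As a rough sketch, LnrMSSD can be computed from a series of beat-to-beat (RR) intervals. The intervals below are invented, and a real pipeline would first remove ectopic beats and artefacts.

```python
import math


def ln_rmssd(rr_intervals_ms):
    """Natural log of the root mean square of successive RR differences."""
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    if not diffs:
        raise ValueError("need at least two RR intervals")
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    if rmssd == 0:
        raise ValueError("no beat-to-beat variation; log is undefined")
    return math.log(rmssd)


# Hypothetical RR intervals in milliseconds (illustrative only).
rr = [800, 810, 790, 805, 795]
value = ln_rmssd(rr)
print(f"LnrMSSD = {value:.3f}")
```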
Carry
An on-ball event where a player moves with the ball at their feet. Progressive carries advance the ball at least 10 meters toward the opponent's goal. (Chapter 5) → Appendix E: Glossary of Soccer Analytics Terms
Catapult / STATSports:
GPS/wearable solutions for teams - Physical performance focus with detailed biomechanical metrics - Used in both training and matches - Team must own and deploy equipment (cost: approximately 200,000-500,000 USD for a full team setup) - Data is proprietary to the team—not shared with leagues or othe → Chapter 2: Data Sources and Collection in Soccer
Categories:
Direct cross into box - Short combination play - Lay-off for edge of box shot - Driven near post delivery → Chapter 14: Exercises
Central hub responsibilities:
Maintaining the group-wide data platform - Developing and maintaining core models (player valuation, recruitment screening, performance benchmarking) - Setting methodological standards - Conducting research and development - Supporting Manchester City's first-team analytics - Training and upskilling → Case Study 2: Scaling Analytics at Manchester City Football Group
Centre-back metrics:
**Aerial duel win rate:** $\frac{\text{Aerial Duels Won}}{\text{Aerial Duels Contested}}$ - **Tackles + Interceptions per 90:** Combined ball-winning volume - **Clearances per 90:** How often the defender resolves danger - **Progressive passes per 90:** Passes that move the ball at least 10 yards to → Chapter 15: Player Performance Metrics
Challenges:
Promotion/relegation creates existential stakes - Wage structures escalate with each promotion - Competing clubs can outspend on talent - Traditional scouting networks favor larger clubs → Case Study: Brentford's Moneyball Approach
Chapter 7: Expected Goals (xG) Models
foundational understanding of the xG metric used extensively in recruitment evaluation. - **Chapter 9: Bayesian Methods in Soccer Analytics** -- details on the Bayesian shrinkage techniques used for small-sample adjustment in player evaluation. - **Chapter 14: Player Valuation and Market Analysis** → Chapter 21: Further Reading
Check error messages carefully
they often identify the problem 2. **Print intermediate results** to trace execution 3. **Consult Appendix C** for common patterns 4. **Search Stack Overflow** for similar issues 5. **Review the complete solution** in `exercise-solutions.py` → How to Use This Book
Checking Assumptions:
Plot residuals vs. fitted values (should show no pattern---a funnel shape suggests heteroscedasticity) - Check residual distribution with a histogram or Q-Q plot (should be approximately normal) - Calculate VIF for each predictor (should be below 5, ideally below 3) → Chapter 3: Statistical Foundations for Soccer Analysis
ChyronHego (TRACAB):
One of the earliest optical tracking providers, with systems installed in many major European stadiums - The official tracking data provider for the Bundesliga (since 2011) and Serie A, among others - Long-established technology with extensive validation - Hardware installed in stadiums using a comb → Chapter 2: Data Sources and Collection in Soccer
Cloud data warehouses (BigQuery, Snowflake, Redshift) are increasingly used by clubs and data providers for large-scale analytics - Object storage (S3, GCS) for raw data archival - Managed databases for production applications → Chapter 2: Data Sources and Collection in Soccer
Clustering
An unsupervised machine learning technique that groups similar observations together. In scouting, used to identify player archetypes. K-means and hierarchical clustering are common methods. (Chapter 21) → Appendix E: Glossary of Soccer Analytics Terms
Coach Positioning Analysis Courses
Platform: Various coaching education - Focus: Defensive positioning principles - Level: Intermediate → Chapter 12: Further Reading
Code:
Clean, documented code on GitHub demonstrating engineering practices (version control, documentation, testing) - Contributions to open-source packages used by the community - Tutorial notebooks that teach methods to others → Chapter 1: Introduction to Soccer Analytics
Coefficient of Variation
The ratio of the standard deviation to the mean, used to measure relative variability. In analytics, useful for assessing consistency of performance metrics. (Chapter 3) → Appendix E: Glossary of Soccer Analytics Terms
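A short sketch of how this ratio can separate a consistent player from a streaky one with the same average output. The match-by-match xG values are invented.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("undefined when the mean is zero")
    return stdev(values) / m


# Hypothetical match-by-match xG for two players, both averaging 0.30.
consistent = [0.30, 0.32, 0.28, 0.31, 0.29]
streaky = [0.05, 0.80, 0.10, 0.45, 0.10]

cv_consistent = coefficient_of_variation(consistent)
cv_streaky = coefficient_of_variation(streaky)
print(f"consistent: {cv_consistent:.2f}, streaky: {cv_streaky:.2f}")
```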
combination play
quick passing sequences involving three or more players---is a strong indicator of team chemistry. Key metrics include: → Chapter 16: Team Performance Analysis
Presenting findings clearly and compellingly - Adapting message to audience - Visualizing data effectively - Choosing the right medium: a slide deck for a presentation, a dashboard for ongoing monitoring, a written report for detailed documentation → Chapter 1: Introduction to Soccer Analytics
Communities:
Soccer Analytics Handbook (online resource) - Analytics FC community - Various Twitter/X analytics communities organized by region and topic - Reddit r/socceranalytics and related subreddits → Chapter 30: The Future of Soccer Analytics
Compactness
A measure of how tightly grouped a team's players are on the pitch, typically computed as the area of the convex hull of player positions. A smaller hull area indicates a more compact defensive shape. (Chapter 17) → Appendix E: Glossary of Soccer Analytics Terms
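The convex-hull area can be sketched with Andrew's monotone chain algorithm and the shoelace formula. Player coordinates here are invented and assumed to be in metres.

```python
def convex_hull(points):
    """Convex hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_area(points):
    """Area enclosed by the convex hull of the points (shoelace formula)."""
    hull = convex_hull(points)
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2


# Hypothetical outfield positions in metres (illustrative only).
deep_block = [(20, 24), (20, 44), (35, 24), (35, 44), (28, 34)]
stretched = [(15, 5), (15, 63), (60, 5), (60, 63), (40, 34)]

print(hull_area(deep_block), hull_area(stretched))
```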
Competitions Included (selection):
FIFA World Cup (Men's and Women's, multiple editions) - UEFA Euro (multiple editions) - FA Women's Super League (multiple seasons) - La Liga (select seasons) - UEFA Champions League (select seasons) - National Women's Soccer League (NWSL) - Indian Super League - Various international tournaments → Appendix D: Data Sources and Tools
Complete all exercises
learning requires practice 3. **Run every code example** and experiment with modifications 4. **Attempt case studies** before reading the solutions 5. **Take each quiz** and aim for 70%+ before proceeding → Professional Soccer Analytics and Visualization
Computer Vision Source:
Optical tracking systems derive positions from video feeds - Emerging AI systems automate event detection from video (companies like Metrica Sports and others are developing these capabilities) - Pose estimation technology extracts body positioning and movement patterns - Action recognition systems → Chapter 2: Data Sources and Collection in Soccer
League quality gap (e.g., Eredivisie to Premier League is a bigger jump than La Liga to Premier League) - Playing style compatibility (does the player's style match the new team's system?) - Competitive context (is the player joining a title challenger or a mid-table team?) - Language and cultural p → Chapter 20: Predictive Modeling
Context Validation:
Verify questionable data points by watching the original video - Understand context behind unusual statistics (why did a player have zero passes in a ten-minute period? The video might reveal they were receiving treatment for an injury) - Communicate findings to non-technical stakeholders who unders → Chapter 2: Data Sources and Collection in Soccer
Contextual Data:
Weather conditions (temperature, precipitation, wind speed, humidity) - Injuries and suspensions (who was available for selection) - Transfer history and valuations - Managerial changes (new manager appointments often cause short-term performance fluctuations) - Historical results - Stadium dimensio → Chapter 2: Data Sources and Collection in Soccer
Contextualization:
How does this compare to benchmarks? Is a player's 0.35 xG per 90 minutes good or bad? It depends on the player's position, league, role, and the context in which they play. - What factors might explain the patterns? If a team's pressing intensity has declined, is it fatigue, tactical adjustment, or → Chapter 1: Introduction to Soccer Analytics
Related-party sponsorship deals at above-market rates - Player swap deals at inflated valuations - Creative accounting around amortization periods - Loan armies that circumvent squad cost calculations → Chapter 25: Economic Analysis and Player Valuation
confounding, reverse causation, and coincidence are common - R² indicates proportion of variance explained, not prediction accuracy - Multiple regression allows controlling for confounders but cannot prove causation → Key Takeaways: Statistical Foundations for Soccer Analysis
Cosine Similarity
A measure of similarity between two vectors, computed as the cosine of the angle between them. Used in scouting to find players with similar statistical profiles. (Chapter 21) → Appendix E: Glossary of Soccer Analytics Terms
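A minimal sketch of the calculation. The player profiles are invented; in practice the metrics would usually be standardized first so that no single large-valued metric dominates the angle.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical standardized profiles: [pressures, tackles, progressive passes].
target = [1.8, 1.5, 0.4]
candidate_a = [1.6, 1.7, 0.2]   # similar ball-winning profile
candidate_b = [-0.5, -0.8, 1.9] # creative profile, little ball-winning

sim_a = cosine_similarity(target, candidate_a)
sim_b = cosine_similarity(target, candidate_b)
print(f"candidate A: {sim_a:.3f}, candidate B: {sim_b:.3f}")
```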
Counter-Attack
A rapid attacking transition following a turnover, designed to exploit the opponent's disorganized defensive shape. Characterized by directness and speed. (Chapter 22) → Appendix E: Glossary of Soccer Analytics Terms
Coursera - Statistics for Sports Analysis
Focus: Statistical methods for sports data - Relevance: Foundational statistics for defensive metrics → Chapter 12: Further Reading
Coursera: Applied Data Science with Python
University of Michigan specialization - Pandas, visualization, machine learning - Applicable skills for soccer analysis → Chapter 8: Further Reading
Coverage
does the provider cover the leagues and matches you need? (2) **Accuracy** --- what positional accuracy is required for your analysis? Sprint detection requires higher accuracy than formation analysis. (3) **Latency** --- do you need real-time data during matches, or is post-match delivery sufficien → Chapter 18: Tracking Data Analytics
Coverage:
Top 5 European leagues (England, Spain, Germany, Italy, France) - Major secondary leagues (Netherlands, Portugal, Belgium, etc.) - UEFA Champions League and Europa League - Major international tournaments - Historical data going back several decades (with varying detail) → Appendix D: Data Sources and Tools
critical connectors
if removed, passing routes between other players become longer or impossible. In soccer terms, betweenness centrality often identifies the "metronome" of the team -- the player through whom the ball must flow for the team to transition between phases of play. Classic examples include Sergio Busquets → Chapter 10: Passing Networks and Analysis
Cross-Validation
A model evaluation technique that partitions data into complementary subsets for training and testing. Stratified k-fold cross-validation is standard for imbalanced classification problems like xG. (Chapter 19) → Appendix E: Glossary of Soccer Analytics Terms
Cutback
A pass or cross delivered backward from the byline toward the edge of the penalty area, creating high-quality shooting opportunities. Cutback assists often lead to high-xG shots. (Chapter 22) → Appendix E: Glossary of Soccer Analytics Terms
D
Dangerous Attack
A possession sequence that enters the final third and results in a shot or penalty area entry. The ratio of dangerous attacks to total attacks measures a team's attacking quality. (Chapter 22) → Appendix E: Glossary of Soccer Analytics Terms
The practice of collecting, cleaning, transforming, and organizing data for analysis. Typically consumes 50-70% of effort in analytics projects. (Chapter 2) → Appendix E: Glossary of Soccer Analytics Terms
Data Engineers:
Build and maintain data infrastructure (databases, pipelines, APIs) - Ensure data quality and accessibility - Integrate multiple data sources into coherent systems - Manage cloud infrastructure and computing resources - Skills: SQL, Python, cloud platforms (AWS, GCP, Azure), databases, ETL tools → Chapter 1: Introduction to Soccer Analytics
Data Ingestion Layer:
Automated feeds from multiple event data providers (adapted per league) - Tracking data integration where available - GPS and physical performance data from each club's systems - Video feeds and tagging data - Internal scouting reports and evaluations - Financial and contract data → Case Study 2: Scaling Analytics at Manchester City Football Group
Data Output:
Raw positional data at 25 fps with positional accuracy of approximately 10-30 centimeters - Derived velocities and accelerations (calculated from position changes between frames) - Synchronized with match clock and typically with event data feeds - Delivered as structured data files (CSV, JSON, or p → Chapter 2: Data Sources and Collection in Soccer
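The "derived velocities" item can be sketched with simple finite differences between consecutive frames. This is a minimal illustration that assumes positions in metres and a fixed 25 fps frame rate; the function name is hypothetical and no provider's actual processing is implied.

```python
import math

FPS = 25  # frames per second, as stated for the raw positional feed

def speeds_from_positions(xs, ys, fps=FPS):
    """Frame-to-frame speed in m/s from positions given in metres."""
    dt = 1.0 / fps
    speeds = []
    for i in range(1, len(xs)):
        dx = xs[i] - xs[i - 1]
        dy = ys[i] - ys[i - 1]
        speeds.append(math.hypot(dx, dy) / dt)
    return speeds

# A player moving 0.2 m per frame along the x-axis
xs = [0.0, 0.2, 0.4, 0.6]
ys = [34.0, 34.0, 34.0, 34.0]
speeds = speeds_from_positions(xs, ys)
```

Because one frame lasts 0.04 s, a 0.1 m positional error in a single frame becomes a 2.5 m/s speed error, so raw finite differences are normally smoothed before use.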
Data Processing:
Position smoothing and correction (raw GPS data is noisy and requires filtering) - Physical metric calculation (distances, speeds, accelerations derived from raw sensor data) - Team-level synchronization (ensuring all devices are reporting on the same time base) - Integration with video systems for → Chapter 2: Data Sources and Collection in Soccer
Data Products:
**Opta F24:** The core event data feed, providing detailed event records for each match. The F24 format includes events with coordinates, qualifiers, timestamps, and player/team identifiers. - **Opta F1/F9:** Match and season-level aggregated statistics - **Opta F40:** Advanced possession and passin → Chapter 2: Data Sources and Collection in Soccer
Data Scientists:
Build statistical models and machine learning systems - Develop new metrics and frameworks - Conduct research into new methodologies - Evaluate and improve existing models - Skills: Statistics, ML, Python/R, research methods, mathematical modeling → Chapter 1: Introduction to Soccer Analytics
**Competitions:** Metadata about available competitions and seasons - **Matches:** Match-level information including teams, scores, managers, stadium, and referee - **Events:** Detailed event records with all qualifiers, coordinates, and metadata - **Lineups:** Player information and tactical positi → Chapter 2: Data Sources and Collection in Soccer
Data Warehouse:
Centralized storage with club-specific access controls - Standardized data models that enable cross-club comparison - Historical data spanning multiple seasons and leagues - Player tracking across the group (monitoring loaned and former players) → Case Study 2: Scaling Analytics at Manchester City Football Group
DataCamp - Sports Analytics Track
Focus: General sports analytics with soccer applications - Level: Beginner to Intermediate → Chapter 12: Further Reading
DataCamp Soccer Analytics Courses
"Introduction to Soccer Analytics" - Practical Python implementations - https://www.datacamp.com/ → Chapter 7: Further Reading
Interactive Python learning - Statistics and visualization - Machine learning fundamentals → Chapter 8: Further Reading
David Sumpter, "Is Maths Killing Football?"
TEDx talk. An accessible overview of how mathematical models are used in football, including match prediction and player evaluation. → Chapter 20: Further Reading
Day 1 (Match Day -4): Initial Data Pull
Aggregate event and tracking data from the opponent's last 5-10 matches - Generate automated tactical fingerprint - Run formation detection algorithms - Produce initial statistical summary → Chapter 22: Match Strategy and Tactics
Day 2 (Match Day -3): Deep Analysis
Detailed build-up play analysis with passing network graphs - Defensive vulnerability mapping (spatial, temporal, personnel) - Counter-attack and transition profiling - Set-piece cataloging and classification → Chapter 22: Match Strategy and Tactics
Day 4 (Match Day -1): Presentation and Refinement
Present to coaching staff - Incorporate coach feedback and adjust recommendations - Prepare player-facing materials (simplified messaging, key video clips) - Finalize set-piece plans → Chapter 22: Match Strategy and Tactics
An umbrella term for tackles, interceptions, clearances, blocks, and pressures performed to regain possession or prevent the opponent from progressing. (Chapter 12) → Appendix E: Glossary of Soccer Analytics Terms
Team ranking by corner effectiveness - Scatter plot: corners taken vs. goals from corners - Statistical analysis of corner efficiency → Chapter 14: Exercises
We process at 5 fps rather than the full 25 fps to reduce computational load. For basic tracking, 5 fps is sufficient; faster motion requires higher rates. - Resizing to 960 pixels wide reduces processing time by approximately 75% compared to full HD, with acceptable accuracy loss for detection. → Case Study 2: Building a Simple Player Tracker from Broadcast Video
Device Components:
GPS receiver: determines position at approximately 10-18 Hz (depending on the device) - Accelerometer: measures acceleration forces at 100-1000 Hz - Gyroscope: measures rotational movement - Magnetometer: provides heading/orientation - Heart rate monitor (in some devices) - Wireless transmission cap → Chapter 2: Data Sources and Collection in Soccer
Different methods have different efficiency
through balls generate higher xG but are harder to execute 3. **Set pieces matter significantly** - contributing roughly a quarter of all xA 4. **Success comes through multiple routes** - the finalists used different tactical approaches 5. **Context matters for scouting** - a player's xA profile sho → Case Study 2: Analyzing Team Chance Creation Patterns
Difficulty Levels:
⭐ Foundational (5-10 min each) - ⭐⭐ Intermediate (10-20 min each) - ⭐⭐⭐ Challenging (20-40 min each) - ⭐⭐⭐⭐ Advanced/Project (40+ min each) → Exercises: Python Tools for Soccer Analytics
Digital Twin
A continuously updated computational model of a player that integrates physical, tactical, technical, and psychological data to predict performance under different conditions. (Chapter 30) → Appendix E: Glossary of Soccer Analytics Terms
directed
a pass from Player A to Player B is distinct from a pass from Player B to Player A. However, for some analyses (particularly visualization), we may aggregate into undirected networks. → Chapter 10: Passing Networks and Analysis
**Spain's** approach under De Gea emphasized short passing, aligning with their possession philosophy despite early elimination. - **Germany's** Manuel Neuer maintained his sweeper-keeper style but Germany's poor tournament masked his distribution quality. - **England and France** both used direct d → Case Study 1: Goalkeeper Performance at the 2018 World Cup
All code should be version-controlled and documented - Analytical findings should be stored in a searchable repository - Methodological decisions should be recorded with rationale - Post-project retrospectives should be conducted and archived → Chapter 28: Building an Analytics Department
measures how much space that player "owns" at a given instant. Large cells for defenders may indicate a stretched back line; large cells for attackers may suggest isolation. → Chapter 17: Spatial Analysis and Pitch Control
Processing data near the point of collection (e.g., at the stadium) rather than in a remote cloud, enabling real-time analytics with low latency. (Chapter 30) → Appendix E: Glossary of Soccer Analytics Terms
Eigenvector Centrality
A network metric that measures a node's influence based on the influence of its neighbors. In passing networks, identifies players connected to other highly connected players. (Chapter 10) → Appendix E: Glossary of Soccer Analytics Terms
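One way to make the definition concrete is power iteration on a pass-count matrix. The sketch below is an assumption-laden illustration: the 4-player matrix is invented, and the choice to score a player by the team-mates who pass *to* them is one of several reasonable conventions.

```python
def eigenvector_centrality(weights, iterations=200, tol=1e-10):
    """Power iteration on a square, non-negative pass-count matrix.

    weights[i][j] is the number of passes from player i to player j.
    A player's score is driven by the scores of the team-mates who pass to them.
    Scores are normalised to sum to 1.
    """
    n = len(weights)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # New score of j = sum over i of (passes i -> j) * (current score of i)
        new = [sum(weights[i][j] * scores[i] for i in range(n)) for j in range(n)]
        total = sum(new)
        if total == 0:
            raise ValueError("matrix contains no passes")
        new = [v / total for v in new]
        if max(abs(a - b) for a, b in zip(new, scores)) < tol:
            return new
        scores = new
    return scores

# Hypothetical pass counts between four players (row = passer, column = receiver)
passes = [
    [0, 12, 3, 1],
    [10, 0, 9, 4],
    [2, 11, 0, 6],
    [1, 5, 7, 0],
]
centrality = eigenvector_centrality(passes)
most_central = max(range(len(centrality)), key=centrality.__getitem__)
```

In this invented example the player at index 1 receives heavily from the two other well-connected players and ends up with the highest score.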
Respect the website's terms of service. Some sites explicitly prohibit scraping; others tolerate it within limits. - Do not overload servers with excessive requests. Insert delays between requests and scrape during off-peak hours. - Attribute the data source in any publication or presentation. - Do → Chapter 2: Data Sources and Collection in Soccer
Event Data
A structured record of every on-ball action in a match, including passes, shots, tackles, and carries, with spatial coordinates and outcome labels. Provided by companies such as StatsBomb, Opta, and Wyscout. (Chapter 2) → Appendix E: Glossary of Soccer Analytics Terms
Example:
2.5 tackles per 90, team possession 65% (opponent possession 35%) - PADA = 2.5 / 0.35 = **7.14 tackles per opponent possession** → Chapter 12: Key Takeaways
Examples:
"Player X has scored in 8 of their last 11 away games against top-6 opposition on Saturdays" - Only showing seasons where a pattern holds - Trying many different model specifications and reporting only the one that "works" → Chapter 3: Statistical Foundations for Soccer Analysis
Exercise 17.1
Explain in your own words why possession percentage is a poor proxy for territorial control. Give a concrete tactical example involving two teams with identical possession but different spatial profiles. → Chapter 17 Exercises
Exercise 17.10
Implement a **weighted Voronoi diagram** where the distance from a point $x$ to player $i$ is scaled by the inverse of the player's current speed: $d_i(x) = \|x - p_i\| / (1 + \alpha \|v_i\|)$. Plot the result for $\alpha = 0.3$ and compare with the unweighted version. → Chapter 17 Exercises
Exercise 17.11
Derive the influence function $I_i(x)$ for a player at position $(30, 40)$ with velocity $(3, 1)$ m/s. Use $\Delta t = 0.7$ s, $\sigma_{\parallel} = 10$ m, and $\sigma_{\perp} = 5$ m. Compute $I_i(x)$ at the point $(38, 43)$. → Chapter 17 Exercises
Exercise 17.12
Explain the role of the look-ahead time $\Delta t$ in the Fernandez--Bornn model. What happens if $\Delta t$ is set to (a) 0 and (b) a very large value like 5 s? → Chapter 17 Exercises
Exercise 17.13
Implement the Fernandez--Bornn pitch control model for a simplified scenario with 4 players (2 per team). Evaluate the pitch control surface on a $53 \times 34$ grid and produce a heat-map visualisation. → Chapter 17 Exercises
Exercise 17.14
Compare the pitch control surfaces generated by the Fernandez--Bornn (Gaussian influence) model and the Spearman (time-to-intercept) model for the same 4-player scenario from Exercise 17.13. Discuss the visual and quantitative differences. → Chapter 17 Exercises
Exercise 17.15
A striker is running at 10.5 m/s toward goal. A centre-back 12 m away is stationary. Using the time-to-intercept formula from Section 17.3.3, compute which player arrives first at a point 6 m ahead of the centre-back, directly between the two players. Assume $a_{\max} = 4.0$ m/s$^2$ for both players → Chapter 17 Exercises
Exercise 17.16
The computational cost of evaluating pitch control on an $M \times N$ grid for $P$ players is $O(MNP)$ per frame. For a 90-minute match at 25 Hz with 22 players and a $105 \times 68$ grid, calculate the total number of influence evaluations. Propose two strategies for reducing this cost without sign → Chapter 17 Exercises
Exercise 17.17
Implement a pitch control model that accepts a ball position and adjusts each player's influence based on distance from the ball. Players closer to the ball should have slightly expanded influence (reflecting urgency to contest) while distant players should have standard influence. Define a reasonab → Chapter 17 Exercises
Exercise 17.18
Define "space creation" in your own words. Explain why a player can create space for a team-mate without ever touching the ball. → Chapter 17 Exercises
Exercise 17.19
A centre-forward drops from position $(85, 34)$ to $(70, 34)$ over 3 seconds, pulling a centre-back with them. The centre-back moves from $(82, 34)$ to $(72, 34)$. Meanwhile, a winger's Voronoi cell area increases from 120 m$^2$ to 195 m$^2$. Compute the space created by the centre-forward for the w → Chapter 17 Exercises
Exercise 17.2
A Voronoi diagram partitions the pitch into dominant regions based on Euclidean distance. List three factors present in real soccer that violate the Euclidean-distance assumption, and for each, explain whether it would cause the Voronoi model to *overestimate* or *underestimate* a player's true domi → Chapter 17 Exercises
Exercise 17.20
Implement the counterfactual space-creation method described in Section 17.4.1. Given tracking data for 11 attacking players across 50 frames, freeze one designated player at their initial position and recompute Voronoi areas for the remaining team-mates. Report $\Delta A$ for each team-mate. → Chapter 17 Exercises
Exercise 17.21
Design a metric called **Space Exploitation Efficiency** (SEE) that rewards players who receive the ball in space that was recently created by a team-mate. Define the metric formally, state your assumptions, and suggest a reasonable time window. → Chapter 17 Exercises
Exercise 17.22
Using synthetic data, simulate a team with five attackers and five defenders. One attacker makes a diagonal run into the channel. Compute and visualise (a) the Voronoi diagram before and after the run, and (b) the $\Delta A$ for each team-mate. → Chapter 17 Exercises
Exercise 17.23
Classify the following movements using the taxonomy from Section 17.5.2: (a) A striker runs from the centre circle toward the corner flag. (b) A central midfielder drops 15 m toward their own goal to receive. (c) A left-winger cuts inside while the left-back overlaps. (d) A striker makes a run in be → Chapter 17 Exercises
Exercise 17.24
Implement the `detect_penetrating_runs` function from Section 17.5.3. Generate synthetic tracking data for a single attacker making three runs in behind over a 60-second period. Verify that your implementation detects all three runs. → Chapter 17 Exercises
Exercise 17.25
Extend the run-detection algorithm to also classify **lateral runs** (movement perpendicular to the goal direction). Define an appropriate velocity threshold and a minimum duration. Test on synthetic data. → Chapter 17 Exercises
Exercise 17.26
Compute the **run quality score** $Q_{\text{run}}$ for the following run: - Depth gained: 18 m - Space created for team-mates: 85 m$^2$ - Defenders engaged: 2 - Pass received: No - $\Delta xT$: 0.04 → Chapter 17 Exercises
Exercise 17.27
Compute the Dangerous Space Matrix (DSM) for a simplified scenario. Place 4 defenders at known positions and define an xT grid (you may use a simple distance-based approximation). Produce a heat-map of the DSM and identify the most dangerous undefended zone. → Chapter 17 Exercises
Exercise 17.28
A team's spatial entropy in the final third is $H = 1.6$ nats (using $K = 6$ zones). Compute the maximum possible entropy for $K = 6$ zones. What fraction of maximum entropy does this team achieve? Interpret the result. → Chapter 17 Exercises
Exercise 17.29
Design a **pressing trigger detector** using pitch control. Define a pressing trigger as a moment when the opponent's pitch control in their own defensive third drops below a threshold $\tau$. Implement this detector on synthetic tracking data and evaluate how the threshold $\tau$ affects the number → Chapter 17 Exercises
Exercise 17.3
The Delaunay triangulation is the dual of the Voronoi diagram. Explain what a Delaunay edge between two players represents in tactical terms. Why might an analyst use Delaunay edges rather than the full set of pairwise connections? → Chapter 17 Exercises
Exercise 17.30
A club's analytics department has computed that their team creates an average of 35 m$^2$ of dangerous space in the final third per possession, while the league average is 28 m$^2$. Their dangerous-space exploitation rate is 12%, while the league average is 18%. Write a briefing memo (200--300 wor → Chapter 17 Exercises
Exercise 17.31
Implement the **spatial value added** (SVA) metric described in Section 17.7.5. For a synthetic pass from $(50, 34)$ to $(75, 45)$, compute the SVA by differencing the pitch-control-weighted xT before and after the pass. Visualise both pitch control surfaces and annotate the SVA value. → Chapter 17 Exercises
Exercise 17.32
Using the concepts from this chapter, design a complete analytical pipeline for evaluating a team's performance in a single match. Your pipeline should include: (1) Voronoi-based compactness over time, (2) pitch control surfaces at key moments, (3) space creation in the final third, (4) dangerous sp → Chapter 17 Exercises
Exercise 17.33
Compare and contrast the Voronoi-based and pitch-control-based approaches to spatial analysis. Under what circumstances would you recommend each? → Chapter 17 Exercises
Exercise 17.34
A recruitment analyst wants to identify centre-forwards who create significant space for team-mates despite modest goal-scoring records. Design a screening methodology using the spatial metrics from this chapter. Which metrics would you prioritise, and what thresholds would you set? → Chapter 17 Exercises
Exercise 17.35
Discuss two ethical or practical limitations of using tracking-data-based spatial models in professional soccer. For each, propose a mitigation strategy. → Chapter 17 Exercises
Exercise 17.4
A team's average Voronoi cell area for its four defenders is 320 m$^2$, while the opponent's four defenders average 180 m$^2$. Interpret this difference in the context of pressing intensity and defensive compactness. → Chapter 17 Exercises
Exercise 17.5
Calculate the total number of data points produced by a 90-minute match with 25 Hz tracking data for 22 players (each with $x$ and $y$ coordinates) plus the ball (with $x$, $y$, and $z$ coordinates). Show your working. → Chapter 17 Exercises
Exercise 17.6
Using SciPy, compute the Voronoi diagram for the following set of 6 player positions (in metres): → Chapter 17 Exercises
Exercise 17.7
For the six players in Exercise 17.6, compute the area of each player's Voronoi cell (after clipping). Which player controls the most space? Which controls the least? → Chapter 17 Exercises
Exercise 17.8
Add two "ghost" players at $(0, 34)$ and $(105, 34)$ to represent the goalkeepers. How does this change the Voronoi diagram and the dominant-region areas of players near the goal lines? → Chapter 17 Exercises
Exercise 17.9
Write a function that takes a set of player positions and returns the **team compactness index**, defined as the standard deviation of the Voronoi cell areas within the team. Test it with a compact formation (e.g., 4-4-2 in a low block) and a stretched formation (e.g., 3-5-2 in transition). → Chapter 17 Exercises
Expected Assists (xA)
The probability that a given pass will result in a goal, aggregated over all passes to produce a player's or team's expected assist total. (Chapter 8) → Appendix E: Glossary of Soccer Analytics Terms
Expected Calibration Error (ECE)
The weighted average absolute difference between predicted probabilities and observed frequencies across probability bins. Measures calibration quality. (Chapter 7) → Appendix E: Glossary of Soccer Analytics Terms
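The definition above translates fairly directly into code. The sketch below assumes equal-width probability bins and binary outcomes (1 = goal, 0 = no goal); the bin count of 10 is a common default rather than something the glossary specifies.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean of |observed frequency - mean predicted probability| per bin."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Equal-width bins on [0, 1]; a prediction of exactly 1.0 goes in the last bin
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(freq - mean_p)
    return ece

# Ten shots rated 0.1 xG with one goal: predictions match the observed rate
well_calibrated = expected_calibration_error([0.1] * 10, [1] + [0] * 9)
# Five shots rated 0.8 xG with one goal: the model is badly overconfident
overconfident = expected_calibration_error([0.8] * 5, [1, 0, 0, 0, 0])
```

The first call returns a value that is zero up to floating-point error, while the second reflects the gap between a predicted 0.8 and an observed 0.2.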
Expected Goals (xG)
The probability that a shot will result in a goal, based on factors including distance, angle, body part, and play pattern. The foundational metric of modern soccer analytics. (Chapter 7) → Appendix E: Glossary of Soccer Analytics Terms
Which zones have highest conversion? - What is the miss rate by zone? - Where do goalkeepers save most often? → Chapter 14: Exercises
Expected Metrics:
Attacking team first contact win rate - Average location of first contact (x, y coordinates) - Distribution of contact points (heatmap data) → Chapter 14: Exercises
Expected Output:
Number of corners per team - Shots generated from corners - Conversion rate → Chapter 14: Exercises
Expected Points (xPts)
The expected number of league points from a match, derived from xG and xGA using a Poisson model to simulate scoreline probabilities. (Chapter 20) → Appendix E: Glossary of Soccer Analytics Terms
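A minimal sketch of the idea, under two labelled assumptions: each side's goals follow an independent Poisson distribution with mean equal to its match xG, and scorelines are summed exactly up to a cap of 10 goals per team instead of being simulated by random draws.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def expected_points(xg_for, xg_against, max_goals=10):
    """Expected league points (3 for a win, 1 for a draw) from two xG totals."""
    p_win = 0.0
    p_draw = 0.0
    for gf in range(max_goals + 1):
        for ga in range(max_goals + 1):
            # Independence assumption: joint probability is the product
            p = poisson_pmf(gf, xg_for) * poisson_pmf(ga, xg_against)
            if gf > ga:
                p_win += p
            elif gf == ga:
                p_draw += p
    return 3 * p_win + 1 * p_draw

xpts_dominant = expected_points(2.0, 0.5)
xpts_even = expected_points(1.2, 1.2)
```

Summing over the scoreline grid avoids simulation noise; the independence assumption ignores effects such as game state, which refinements like Dixon-Coles try to address.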
Expected Threat (xT)
A framework that assigns a goal probability to every location on the pitch based on historical data, and values actions by the change in threat they produce. (Chapter 9) → Appendix E: Glossary of Soccer Analytics Terms
What patterns exist in the data? - What surprises emerge? - What hypotheses are suggested? - Are there data quality issues that need to be addressed before proceeding? → Chapter 1: Introduction to Soccer Analytics
Extensions for soccer:
**Goal difference adjustment:** Scale the $K$-factor by the goal difference to reward dominant victories more than narrow ones. A common formula: $K_{\text{adj}} = K \cdot \log(1 + |GD|)$. - **Home advantage:** Add a fixed offset (typically 60--100 Elo points) to the home team's rating before comput → Chapter 20: Predictive Modeling
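The first two extensions can be combined in a single rating update, sketched below. The base `k = 20`, the 80-point home offset (inside the 60--100 range quoted above), the natural logarithm, and the function name are all assumptions made for illustration.

```python
import math

def elo_update(r_home, r_away, home_goals, away_goals, k=20.0, home_adv=80.0):
    """Return updated (home, away) ratings after one match."""
    # Home advantage: add a fixed offset to the home rating before the expectation
    exp_home = 1.0 / (1.0 + 10 ** ((r_away - (r_home + home_adv)) / 400.0))
    if home_goals > away_goals:
        actual_home = 1.0
    elif home_goals == away_goals:
        actual_home = 0.5
    else:
        actual_home = 0.0
    # Goal difference adjustment: K_adj = K * log(1 + |GD|), natural log assumed
    k_adj = k * math.log(1 + abs(home_goals - away_goals))
    delta = k_adj * (actual_home - exp_home)
    # Zero-sum update: what one side gains the other loses
    return r_home + delta, r_away - delta

narrow = elo_update(1500.0, 1500.0, 1, 0)
heavy = elo_update(1500.0, 1500.0, 4, 0)
draw = elo_update(1500.0, 1500.0, 1, 1)
```

Taken literally, the formula gives $K_{\text{adj}} = 0$ for a draw, so ratings do not move after drawn matches; implementations that want draws to count typically apply a floor to the multiplier.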
Extensions:
Analyze pass *destination* heat maps to identify the most targeted zones. - Weight passes by progressive distance to separate meaningful progression from possession recycling. - Track the evolution of Barcelona's spatial profile across seasons to identify tactical shifts under different managers. → Case Study 6.1: Mapping Barcelona's Tiki-Taka Through Spatial Analysis
A tactical role where the center forward drops deep into midfield to create space and receive the ball, disrupting the opponent's defensive structure. Analyzing false nines requires tracking data to capture positional flexibility. (Chapter 22) → Appendix E: Glossary of Soccer Analytics Terms
Fast.ai Practical Deep Learning
Modern machine learning approaches - Tabular data applications - Free online course → Chapter 8: Further Reading
FBref
Free aggregated statistics for major leagues. Good for learning to work with player/team statistics. Requires web scraping but is well-structured. → Quiz: Data Sources and Collection in Soccer
FBref Expected Stats Guide
Comprehensive metric explanations - xA and SCA definitions - League-wide data access - Link: fbref.com/en/expected-goals-model-explained → Chapter 8: Further Reading
https://fcpython.com/ - Comprehensive soccer analytics tutorials - xG model building walkthrough - Visualization techniques → Chapter 7: Further Reading
FC Python Tutorials
Soccer analytics in Python - Data collection and analysis - Beginner-friendly approach - Link: fcpython.com → Chapter 8: Further Reading
Feature Engineering
The process of creating informative input variables for machine learning models from raw data. In xG models, includes computing distance, angle, and zone indicators. (Chapter 19) → Appendix E: Glossary of Soccer Analytics Terms
Feature engineering for match prediction:
Rolling averages of xG, xGA over the last $n$ matches (typically $n = 5$ or $n = 10$). - Elo or Pi-rating differentials. - Home advantage adjustment. - Days since last match (fatigue proxy). - Head-to-head historical record. → Chapter 19: Machine Learning for Soccer
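The rolling-average feature is where data leakage most easily creeps in, so a short sketch may help. It assumes a team's per-match xG values are listed in chronological order; the helper name is hypothetical.

```python
def rolling_mean_before(values, n=5):
    """Mean of up to n previous values for each match, excluding the match itself.

    Returns None where no earlier match exists, so the current match's own
    value never leaks into its feature.
    """
    features = []
    for i in range(len(values)):
        window = values[max(0, i - n):i]  # strictly earlier matches only
        features.append(sum(window) / len(window) if window else None)
    return features

# Hypothetical xG totals for one team's first six matches
team_xg = [1.2, 0.8, 2.1, 1.5, 0.9, 1.7]
xg_form = rolling_mean_before(team_xg, n=5)
```

The same helper can be applied to xGA, and the difference between two teams' values gives a simple form differential for the prediction model.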
Feature Importance
A measure of how much each input variable contributes to a model's predictions. Computed via coefficient magnitude (logistic regression) or permutation importance (tree-based models). (Chapter 19) → Appendix E: Glossary of Soccer Analytics Terms
Pitch dimensions: actual metres (105 x 68 for standard pitches) - Origin: centre of the pitch (0, 0) - $x$-axis: -52.5 to +52.5 - $y$-axis: -34.0 to +34.0 - Includes a $z$-axis for ball height → Chapter 6: The Soccer Pitch as a Coordinate System
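When combining this centre-origin system with data in other conventions, a small conversion helper is usually enough. The sketch below assumes the 105 x 68 m pitch stated above; the two target systems (corner-origin metres and a 0-100 percentage scale of the kind some event-data feeds use) and the function names are illustrative choices.

```python
PITCH_LENGTH = 105.0  # metres, x-axis spans -52.5 to +52.5
PITCH_WIDTH = 68.0    # metres, y-axis spans -34.0 to +34.0

def centre_to_corner_metres(x, y):
    """Shift the origin from the centre spot to the corner at (-52.5, -34.0)."""
    return x + PITCH_LENGTH / 2, y + PITCH_WIDTH / 2

def centre_to_percent(x, y):
    """Rescale centre-origin metres to a 0-100 scale on both axes."""
    cx, cy = centre_to_corner_metres(x, y)
    return 100.0 * cx / PITCH_LENGTH, 100.0 * cy / PITCH_WIDTH

centre_spot = centre_to_corner_metres(0.0, 0.0)
far_corner = centre_to_percent(52.5, 34.0)
```

Note that percentage scales distort distances because the two axes are stretched by different factors, so distance and angle features are best computed in metres.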
Financial outcomes:
Commercial revenue growth: PSG's commercial revenue increased significantly post-Neymar - Shirt sales: Record-breaking in the initial period - Global brand awareness: Measurable increases in social media engagement and global fan base → Case Study 1: The Neymar Effect — How One Transfer Reshaped the Market
FiveThirtyEight Soccer SPI Model
fivethirtyeight.com/methodology/how-our-club-soccer-predictions-work - Documentation of FiveThirtyEight's public Soccer Power Index model, which combines Poisson regression with Elo-style ratings. A useful benchmark for custom match prediction models. → Chapter 20: Further Reading
Flat Files (Parquet, CSV, JSON):
Best for individual analysis projects and data sharing - No infrastructure overhead - Easy to version control and share - Limited querying capability compared to databases - Fine for most individual and small-team analytics work → Chapter 2: Data Sources and Collection in Soccer
Historical match results and betting odds for dozens of leagues - Covers many leagues back to the 1990s - CSV downloads available with consistent formatting - Particularly useful for match prediction and betting market analysis - Updated regularly during the season → Chapter 2: Data Sources and Collection in Soccer
footballmodelling.net
A community resource for football prediction models, including implementations of Dixon-Coles and related methods. → Chapter 20: Further Reading
For borrowing clubs:
Access talent they cannot afford to purchase - Fill short-term squad gaps without long-term commitment - "Try before you buy" with optional purchase clauses → Chapter 25: Economic Analysis and Player Valuation
The spatial arrangement of players on the pitch, described in shorthand (e.g., 4-3-3, 3-5-2). Modern analysis recognizes that formations are fluid and phase-dependent. (Chapter 22) → Appendix E: Glossary of Soccer Analytics Terms
Academic tutorials - Code implementations - Research presentations - Link: youtube.com/friendsoftracking → Chapter 8: Further Reading
Friends of Tracking - Defensive Metrics Tutorial
Platform: YouTube - Focus: Python implementation of defensive metrics - Level: Intermediate → Chapter 12: Further Reading
Friends of Tracking - Set Piece Analysis Tutorial
Platform: YouTube - Focus: Python implementation of set piece analysis - Level: Intermediate → Chapter 14: Further Reading
Friends of Tracking Data Science
github.com/Friends-of-Tracking-Data-Science - Open-source implementations of various soccer analytics models, including match prediction and player evaluation tools. → Chapter 20: Further Reading
Friends of Tracking GitHub
github.com/Friends-of-Tracking-Data-Science - Open-source implementations of pitch control models, Voronoi diagrams, and other spatial analytics tools. Includes Jupyter notebooks with step-by-step explanations. → Chapter 17: Further Reading
Friends of Tracking Lecture Series
YouTube playlist. A multi-part lecture series covering tracking data analysis, including Voronoi diagrams, pitch control, and expected possession value. Presented by Laurie Shaw, David Sumpter, and others. → Chapter 17: Further Reading
Friends of Tracking Tutorials
https://github.com/Friends-of-Tracking-Data-FoTD - Code from video tutorials - xG model implementations → Chapter 7: Further Reading
Friends of Tracking YouTube Channel
Video tutorials on soccer analytics - xG and pass analysis implementations - Python code examples - Link: youtube.com/friendsoftracking → Chapter 8: Further Reading
Friends of Tracking: Computer Vision Series
YouTube playlist. Practical tutorials on applying CV techniques to soccer, including player detection, tracking, and pitch calibration. → Chapter 23: Further Reading
From the club's perspective:
Longer contracts protect the transfer value asset - Longer contracts lock in current wages (beneficial if the player improves) - But longer contracts also lock in wages for underperforming players - Amortization is spread over more years, reducing annual FFP impact → Chapter 25: Economic Analysis and Player Valuation
From the player's perspective:
Longer contracts provide income security - But limit future earning potential if performance improves - Shorter contracts allow more frequent renegotiation → Chapter 25: Economic Analysis and Player Valuation
Full-back / wing-back metrics add:
**Crosses per 90 and cross accuracy** - **Carries into the final third per 90** - **Assists and xA per 90** (especially for attacking full-backs) → Chapter 15: Player Performance Metrics
G
Game State
The current score differential during a match, which influences team behavior (e.g., teams trailing take more risks). An important contextual variable in analytics models. (Chapter 7) → Appendix E: Glossary of Soccer Analytics Terms
Game Theory for Sports Analytics
Platform: Various - Focus: Applying game theory to penalty analysis - Level: Intermediate → Chapter 14: Further Reading
Generative Model
A model that learns the underlying distribution of data and can generate new synthetic examples. Applied in soccer for tactical simulation and data augmentation. (Chapter 30) → Appendix E: Glossary of Soccer Analytics Terms
Non-penalty expected goals (npxG) per 90 - Shot volume and shot quality - npxG per shot (shot selection quality) - Goals minus xG (finishing skill, though high variance) → Chapter 21: Player Recruitment and Scouting
Goal-Creating Action (GCA)
The two offensive actions (such as passes, dribbles, or shots) directly leading to a goal. A broader measure of goal involvement than assists alone. (Chapter 15) → Appendix E: Glossary of Soccer Analytics Terms
Use clear, descriptive cell headers with Markdown - Keep cells focused on single tasks - Move reusable code to `.py` modules as soon as it stabilizes - Restart kernel and run all before sharing - Clear output before committing to version control → Chapter 4: Python Tools for Soccer Analytics
A wearable device embedded with Global Positioning System receivers, accelerometers, and gyroscopes, worn by players during training and matches to collect positional and movement data. (Chapter 18) → Appendix E: Glossary of Soccer Analytics Terms
GPS/Wearable Tracking:
Players wear vests with GPS and accelerometer sensors - Provides position plus physical metrics (acceleration, impacts, heart rate) - Providers: Catapult, STATSports, Playertek - Lower positional accuracy than optical systems (approximately 1-5 meters for GPS) - Richer physical data including accele → Chapter 2: Data Sources and Collection in Soccer
Gradient Boosting
An ensemble machine learning method that builds sequential decision trees, each correcting the errors of its predecessors. A common choice for xG models due to its strong predictive performance. (Chapter 19) → Appendix E: Glossary of Soccer Analytics Terms
Graph Neural Network (GNN)
A neural network architecture designed to operate on graph-structured data. Well-suited to modeling player interactions and passing networks. (Chapter 24) → Appendix E: Glossary of Soccer Analytics Terms
H
Half-Space
The tactical zones between the central channel and the flanks, approximately between the penalty area width and the center of the pitch. Controlling the half-spaces is a key principle of positional play. (Chapter 22) → Appendix E: Glossary of Soccer Analytics Terms
half-spaces
the narrow corridors between the centre of the pitch and the wide areas, have received increasing attention in modern tactical analysis. These zones are particularly dangerous because they force defenders into difficult decisions: if a centre-back steps out to engage a player in the half-space, the → Chapter 17: Spatial Analysis and Pitch Control
Academic papers (peer-reviewed) - StatsBomb methodology documentation - Opta official documentation → Chapter 8: Further Reading
High tackles can indicate:
Aggressive defensive style (positive) - Compensation for poor positioning (negative) - High involvement due to team style (contextual) → Chapter 12: Key Takeaways
Similarity scores to historical transfers that succeeded/failed - Percentile rankings relative to players who previously made similar moves → Chapter 20: Predictive Modeling
Holding / defensive midfielder:
Tackles and interceptions per 90 - Pass completion % (short and medium) - Pressure success rate - Ball recoveries in the defensive and middle thirds → Chapter 15: Player Performance Metrics
How to Access:
Web interface at fbref.com with comprehensive tables and filters - No official API, but the site is structured to facilitate data extraction - Download tables directly from pages using the "Share & Export" option - Web scraping using Python libraries (tolerated within reasonable limits) → Chapter 2: Data Sources and Collection in Soccer
Hudl Sportscode
hudl.com/products/sportscode - Industry-standard video analysis platform used for tactical tagging and match preparation. → Chapter 22: Further Reading
Does conversion rate decline as shootout progresses? - Do teams shooting first have an advantage? - How does match context affect conversion? → Chapter 14: Exercises
Hypotheses:
H₀: The team's expected points per match at home equals their expected points per match away ($\mu_H = \mu_A$) - H₁: The team's expected points per match at home exceeds their expected points per match away ($\mu_H > \mu_A$) → Chapter 3: Statistical Foundations for Soccer Analysis
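One way to test hypotheses of this form without distributional assumptions is a one-sided permutation test. The sketch below is an illustration only, not necessarily the chapter's method: the function name `one_sided_permutation_test` and the points lists are hypothetical, and the null it tests (home and away points are exchangeable) is somewhat stronger than $\mu_H = \mu_A$ alone.

```python
import random

def one_sided_permutation_test(home, away, n_permutations=5000, seed=0):
    """Estimate the p-value for H1: mean(home) > mean(away).

    Under the null, home/away labels are exchangeable, so we reshuffle
    them and count how often the shuffled home total is at least as
    large as the observed one.
    """
    rng = random.Random(seed)
    pooled = list(home) + list(away)
    n_home = len(home)
    observed_total = sum(home)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # With group sizes fixed, a larger home total means a larger mean difference.
        if sum(pooled[:n_home]) >= observed_total:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical points per match (3 = win, 1 = draw, 0 = loss).
home_points = [3, 3, 1, 3, 0, 3, 3, 1, 3, 3]
away_points = [0, 1, 3, 0, 1, 0, 3, 0, 1, 1]
p_value = one_sided_permutation_test(home_points, away_points)
print(f"one-sided p-value: {p_value:.4f}")
```

A small p-value would count as evidence against H₀ in favour of the one-sided alternative; with only ten matches per group, any conclusion should be treated cautiously.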
I
Identifying data needs:
What variables are required? - What time period is relevant? - What competitions or teams should be included? - What level of granularity is necessary—match-level aggregates, event-level detail, or tracking data? → Chapter 1: Introduction to Soccer Analytics
Image Processing:
Real-time video processing at 25-50 fps using GPU-accelerated computing - Player detection using machine learning models trained on millions of labeled examples - Jersey number recognition for identification (challenging when jerseys are obscured by sweat, rain, or physical contact) - Ball tracking → Chapter 2: Data Sources and Collection in Soccer
Implications for soccer:
Larger samples lead to more precise estimates - Even non-normal data (like goals, which follow a Poisson distribution) produces approximately normal means - The rate of precision improvement decreases: going from 10 to 20 matches halves the variance, but going from 100 to 110 barely changes it → Chapter 3: Statistical Foundations for Soccer Analysis
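The diminishing return in precision follows from the variance of a sample mean being $\sigma^2 / n$ for independent observations. A minimal sketch, assuming a hypothetical per-match standard deviation `sigma` (the value cancels out of the ratio anyway):

```python
def variance_of_mean(sigma, n):
    """Variance of the sample mean of n independent observations."""
    return sigma ** 2 / n

sigma = 1.3  # hypothetical per-match standard deviation of goals
for n_before, n_after in [(10, 20), (100, 110)]:
    ratio = variance_of_mean(sigma, n_after) / variance_of_mean(sigma, n_before)
    print(f"{n_before} -> {n_after} matches: variance is {ratio:.2f}x its previous value")
```

The ratio depends only on `n_before / n_after`, which is why ten extra matches matter far more at the start of a season than near the end.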
Define soccer analytics and explain why clubs invest millions in it - Trace the evolution from simple statistics to sophisticated machine learning - Understand who uses analytics and what they need from it - Follow the journey from raw data to actionable insights - Explore career opportunities in th → Chapter 1: Introduction to Soccer Analytics
Experience in betting industry or related quantitative field (finance, consulting, technology) - Transferable skills in data science or engineering - Domain expertise from playing or coaching at any level - Many successful analysts entered from non-sports backgrounds, bringing analytical methods fro → Chapter 1: Introduction to Soccer Analytics
Initial Focus:
Building data infrastructure from scratch - Developing proprietary metrics and models - Creating player valuation frameworks - Establishing processes for integrating analytics into decisions → Case Study: The Liverpool Analytics Revolution
Injury impact:
Neymar missed approximately 40% of possible matches through injury during his PSG tenure - This significantly reduced the sporting ROI and raises questions about whether injury risk was adequately priced into the transfer fee → Case Study 1: The Neymar Effect — How One Transfer Reshaped the Market
When analysts leave, their knowledge should not leave with them - Documentation of all active models, dashboards, and data pipelines - Succession planning for critical functions - Onboarding materials for new hires → Chapter 28: Building an Analytics Department
Moving from another role within a club (e.g., scout, video analyst, sports scientist) - Adding analytical skills to existing football expertise - This route offers the advantage of existing relationships and football credibility → Chapter 1: Introduction to Soccer Analytics
Interpretation cautions:
**Sample size matters enormously.** Even 50 shots provide too little data to reliably identify finishing skill. Most "elite finishers" regress toward the mean over time. - **Shot selection is conflated with finishing.** A player who only shoots from high-xG positions will appear to be a good finishe → Chapter 7: Expected Goals (xG) Models
**R-squared:** Proportion of variance explained (e.g., 0.92 = 92%). An R-squared of 0.92 means 92% of the variation in actual goals is explained by xG, with 8% attributable to finishing skill, luck, and other factors. - **Coefficients:** - Intercept (const): Expected goals when xG = 0. This is usual → Chapter 3: Statistical Foundations for Soccer Analysis
Isotonic Regression
A non-parametric calibration method that fits a non-decreasing function to transform model outputs into well-calibrated probabilities. Often used for post-hoc xG calibration. (Chapter 19) → Appendix E: Glossary of Soccer Analytics Terms
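The non-decreasing fit is usually computed with the pool adjacent violators algorithm. The following is a minimal sketch under stated assumptions: the function name `pool_adjacent_violators` is hypothetical, all observations carry equal weight, and the inputs are already sorted by the raw model score.

```python
def pool_adjacent_violators(values):
    """Return the non-decreasing sequence closest (in least squares) to `values`."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the latest block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

# Hypothetical shot outcomes (1 = goal), ordered by increasing raw xG score.
outcomes_sorted_by_score = [0, 1, 0, 0, 1, 1]
calibrated = pool_adjacent_violators(outcomes_sorted_by_score)
print(calibrated)
```

Each fitted value is the goal rate within a pooled block of shots, so the output can be read as a calibrated probability for the corresponding score range.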
J
Jan Vecer, "Mathematical Analysis of Soccer"
Columbia University course. Covers Poisson regression, Elo ratings, and betting market efficiency. → Chapter 20: Further Reading
Javier Fernandez, "Wide Open Spaces"
MIT SSAC 2018. The original conference presentation of the Gaussian influence model. Available on YouTube via the MIT Sloan Sports Analytics Conference channel. → Chapter 17: Further Reading
Various soccer datasets uploaded by community members - Quality and documentation vary significantly - Good for specific projects or competitions - Notable datasets include historical match results, player attributes from FIFA video games, and European match data → Chapter 2: Data Sources and Collection in Soccer
Kaplan-Meier Estimator
A non-parametric statistic used to estimate the survival function from censored data. In soccer, applied to return-to-play modeling after injuries. (Chapter 26) → Appendix E: Glossary of Soccer Analytics Terms
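As a rough illustration, the estimator multiplies, at each observed event time, the fraction of at-risk cases that did not experience the event. The sketch below assumes hypothetical return-to-play data; the function name `kaplan_meier` and the sample values are illustrative only.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each observed event time.

    durations: days until return to play or until censoring
    observed:  1 if the return was observed, 0 if the record is censored
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Players still "at risk" have a duration of at least t.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical injuries: days out, and whether the return was observed.
days_out = [5, 8, 8, 12]
returned = [1, 1, 0, 1]
for day, prob in kaplan_meier(days_out, returned):
    print(f"day {day}: P(still out) = {prob:.2f}")
```

Censored records (for example, a player transferred before returning) still count in the at-risk set up to their censoring time, which is what distinguishes this from a naive proportion.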
Karun Singh's Blog
Original xT creator - Technical implementations - Model refinements - Link: karun.in/blog → Chapter 9: Further Reading
Karun Singh, "Introducing Expected Threat (xT)"
karun.in. The original blog post introducing the xT framework, with code and visualisations. → Chapter 17: Further Reading
Kernel Density Estimation (KDE)
A non-parametric method for estimating the probability density function of a variable. Used to create smooth heatmaps of player actions or shot locations on the pitch. (Chapter 17) → Appendix E: Glossary of Soccer Analytics Terms
Key Achievements:
Promotion to the Premier League after 74 years away - Consistent overperformance relative to budget - Highly profitable transfer model (buy low, sell high) - Innovative approach to goalkeeper recruitment and set-piece design → Chapter 28: Building an Analytics Department
Key Characteristics:
Analytics integrated at every level of decision-making - Set-piece specialization driven by statistical analysis - Recruitment model focused on undervalued players with high statistical profiles - Willingness to challenge conventional football wisdom with data → Chapter 28: Building an Analytics Department
Key concepts:
**Population:** The complete group we're interested in (all Premier League matches ever) - **Sample:** The subset we actually observe (this season's matches) - **Parameter:** A value describing the population (true mean goals per match) - **Statistic:** A value calculated from the sample (sample mea → Chapter 3: Statistical Foundations for Soccer Analysis
Key decisions in Phase 1:
Hired a small analytics team (2-3 people) with direct access to ownership - Established data subscriptions with Opta and other providers - Began building basic models for player valuation and match prediction - Identified set-piece performance as an area of significant market inefficiency → Case Study 1: From Zero to World Class --- How FC Midtjylland Built a Data-First Culture
Key Differentiators:
**360 Freeze Frames:** StatsBomb includes positional data for all visible players at the moment of key events. This enables spatial analysis of shots, passes, and other actions without requiring full tracking data. - **Pressure events:** StatsBomb codes "pressure" events—moments when a player closes → Chapter 2: Data Sources and Collection in Soccer
Key findings:
1,079 total shots in the tournament - ~11% conversion rate (typical for professional soccer) - Right foot shots dominate (58%), followed by headers (24%) - Most shots are open play; penalties and free kicks are distinct categories → Case Study 1: Building a Production-Ready xG Model
**Total distance:** The total distance covered by a player during a match or training session. A typical outfield player covers 10-13 km per match, with midfielders generally covering the most distance. - **High-speed running distance:** Distance covered above a threshold speed (typically 5.5 m/s or → Chapter 2: Data Sources and Collection in Soccer
Key Providers:
**Catapult:** The largest provider of GPS/wearable tracking in professional sports. Catapult devices are used by over 3,000 teams worldwide across multiple sports. Their platform provides both raw data and derived metrics for physical performance monitoring. - **STATSports:** An Irish company whose → Chapter 2: Data Sources and Collection in Soccer
Key risk factors include:
Acute:Chronic Workload Ratio (ACWR) - Total distance covered in training (meters) - High-speed running distance (> 7.5 m/s) - Number of accelerations and decelerations - Days since last rest day - Previous injury history (binary: injured in prior 6 months) - Age - Match congestion (matches in last 1 → Chapter 29: Comprehensive Case Studies
Key season-level insights:
**xG is predictive of future goals.** A team that underperforms xG by 10 goals in the first half is likely to score closer to xG in the second half. - **Large xG over/underperformance is unsustainable.** Teams rarely deviate from xG by more than plus or minus 10% over a full season. - **xG tables of → Chapter 7: Expected Goals (xG) Models
kloppy
Multi-provider data loading - Standardized data format - Event and tracking data - Link: github.com/PySport/kloppy → Chapter 8: Further Reading
Knowledge Sharing Practices:
Regular internal seminars or "lunch and learn" sessions - Pair programming or analysis for skill transfer - Standard templates for common analyses - A shared library of reusable code and visualizations → Chapter 28: Building an Analytics Department
L
Last Row View (Tracking Data Analytics)
Access: Research samples - Type: Open tracking data - Defensive Use: Movement and spacing analysis → Chapter 12: Further Reading
Laurie Shaw's Pitch Control Tutorial
github.com/Friends-of-Tracking-Data-Science/LaurieOnTracking - A detailed Python implementation of Spearman's pitch control model with synthetic tracking data. → Chapter 17: Further Reading
Layer 1: Data Sources
Event data providers (Opta/Stats Perform, StatsBomb, Wyscout) - Tracking data (Second Spectrum, SkillCorner, Signality, Hawk-Eye) - Physical performance data (GPS/accelerometer: Catapult, STATSports, Polar) - Video feeds (broadcast, tactical camera, drone) - Internal scouting reports and coach evalu → Chapter 28: Building an Analytics Department
Layer 2: Data Infrastructure
Cloud platform (AWS, GCP, Azure) - Data warehouse / data lake (Snowflake, BigQuery, Redshift, S3) - ETL/ELT pipelines (Airflow, dbt, custom Python scripts) - API layer for data access - Version control (Git) → Chapter 28: Building an Analytics Department
Layer 3: Analytics Tools
Statistical computing (Python, R) - Machine learning frameworks (scikit-learn, PyTorch, TensorFlow) - Visualization (matplotlib, Plotly, D3.js, Tableau) - Geospatial analysis tools (for pitch-based visualization) - Video analysis platforms (Hudl/SBG, Catapult/ProZone) → Chapter 28: Building an Analytics Department
Layer 4: Delivery and Communication
Dashboards (Tableau, Power BI, Streamlit, custom web apps) - Reporting tools (Jupyter notebooks, automated PDF reports) - Presentation platforms (for matchday and scouting presentations) - Mobile interfaces (for coaching staff on the training ground) - Slack/Teams integration for alerts and notifica → Chapter 28: Building an Analytics Department
Learning Objectives:
Prepare raw event data for xG modeling - Engineer meaningful features from shot locations and context - Train and evaluate multiple model architectures - Understand calibration and its importance for probability models - Create a reusable xG calculation pipeline → Case Study 1: Building a Production-Ready xG Model
Legal Considerations:
The legal status of web scraping varies by jurisdiction. In the United States, the Computer Fraud and Abuse Act (CFAA) has been interpreted in various ways regarding scraping. In the European Union, the Database Directive provides legal protection for databases. - Scraping publicly available data fo → Chapter 2: Data Sources and Collection in Soccer
Legitimate strategies:
Revenue growth through commercial development and stadium expansion - Academy investment to develop players with zero transfer amortization - Structured transfer payments (installments, contingent fees) - Strategic timing of player sales to generate accounting profits → Chapter 25: Economic Analysis and Player Valuation
Limitations of Event Data:
Only records discrete actions, missing continuous play - Doesn't capture off-ball movement—where the 20 players without the ball are positioned - Quality varies between providers and between leagues within the same provider - Some subjective classification (what counts as a "key pass"? When does a " → Chapter 2: Data Sources and Collection in Soccer
Limitations of Tracking Data:
Expensive and not universally available - Large data volumes require significant computational infrastructure - Processing and analysis more complex than event data - Still being explored—best practices and standard methods are evolving - Synchronization between tracking data and event data can be i → Chapter 2: Data Sources and Collection in Soccer
Limitations:
Only captures specific moments, not continuous play - Missing movement trajectories—you see where players are but not where they are going - Limited coverage (not all providers offer this) - Selection of which events include freeze frames varies → Chapter 2: Data Sources and Collection in Soccer
Line-Breaking Pass
A pass that travels through a line of opposition players, progressing the ball past defensive or midfield structures. Valued highly in modern tactical analysis. (Chapter 10) → Appendix E: Glossary of Soccer Analytics Terms
Log Loss
A loss function for probabilistic classification models, penalizing confident incorrect predictions more heavily. The standard evaluation metric for xG models. (Chapter 3) → Appendix E: Glossary of Soccer Analytics Terms
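The penalty on confident mistakes can be seen by writing the binary form out directly. A minimal sketch, assuming 0/1 shot outcomes and predicted goal probabilities; the clipping constant `eps` is an illustrative choice to avoid taking the logarithm of zero.

```python
import math

def log_loss(outcomes, probabilities, eps=1e-15):
    """Mean binary log loss for 0/1 outcomes and predicted probabilities."""
    total = 0.0
    for y, p in zip(outcomes, probabilities):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(outcomes)

# A missed shot (outcome 0) scored by a cautious and by an overconfident model.
print(log_loss([0], [0.6]))
print(log_loss([0], [0.9]))
```

The second value is larger: assigning 0.9 to a shot that was missed costs more than assigning 0.6, even though both predictions were on the wrong side of 0.5.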
Logistic Regression
A statistical model for binary classification that estimates probabilities using a logistic (sigmoid) function. Often used as a baseline xG model due to its interpretability and inherent calibration. (Chapter 19) → Appendix E: Glossary of Soccer Analytics Terms
Low tackles can indicate:
Excellent positioning deterring challenges (positive) - Playing for dominant possession team (contextual) - Avoidance of engagement (negative) → Chapter 12: Key Takeaways
**TRACAB (ChyronHego):** One of the longest-established tracking systems, installed in stadiums across Europe including all Bundesliga venues. TRACAB uses a system of cameras mounted at the top of the main stand to provide overlapping coverage of the entire pitch. → Chapter 2: Data Sources and Collection in Soccer
Man-Marking
A defensive system where each defender is assigned to track a specific opponent. At set pieces, it is contrasted with zonal marking schemes. Analytics can evaluate which system is more effective against specific opponents. (Chapter 14) → Appendix E: Glossary of Soccer Analytics Terms
Manual Analysis:
Coaches and analysts watch video to understand tactics, player behaviors, and game situations - Tagging systems (Hudl, Wyscout, InStat) allow annotation of video with custom labels - Essential for qualitative understanding that data alone cannot provide - Pre-match opposition analysis typically invo → Chapter 2: Data Sources and Collection in Soccer
Market factors:
Selling club's league and reputation - Current market conditions (inflation, pandemic effects) - Transfer window timing (January premium vs. summer) - Number of interested buyers (competition effect) → Chapter 25: Economic Analysis and Player Valuation
Markov chain
a stochastic model where the probability of future states depends only on the current state, not on the sequence of events that preceded it. In the xT framework: → Chapter 9: Expected Threat (xT) and Ball Progression
Match Day: Execution and Monitoring
Provide real-time tactical monitoring (Section 22.4.3) - Prepare half-time data package - Support in-game decision-making → Chapter 22: Match Strategy and Tactics
Matchday revenue
ticket sales, hospitality, and stadium-related income 2. **Broadcasting revenue:** domestic and international television rights 3. **Commercial revenue:** sponsorship, merchandising, and licensing → Chapter 25: Economic Analysis and Player Valuation
Mathematical Modelling of Football
Uppsala University - Free course on soccer analytics - Includes expected metrics coverage - Available on Uppsala University website → Chapter 8: Further Reading
MathSport International
Focus: Mathematical approaches to sports including soccer defense - Key Content: Statistical modeling papers → Chapter 12: Further Reading
Taker choices: Left, Center, Right - Goalkeeper choices: Dive Left, Stay, Dive Right - Outcomes: Goal probability for each combination → Chapter 14: Exercises
xG and xA methodology development - Influential early analysis - Historical xG model documentation → Chapter 8: Further Reading
MinMaxScaler
A preprocessing technique that scales features to a fixed range, typically [0, 1]. Essential for fair comparison in multi-criteria scouting scores and distance-based algorithms. (Chapter 21) → Appendix E: Glossary of Soccer Analytics Terms
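The scaling itself is a one-line formula, $(x - x_{min}) / (x_{max} - x_{min})$. The sketch below is a plain-Python illustration rather than the scikit-learn class; the helper name `min_max_scale`, the sample values, and the handling of constant columns are assumptions made for this example.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale values linearly so the minimum maps to low and the maximum to high."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant column carries no ranking information; map it to the lower bound.
        return [low for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) / span * (high - low) for v in values]

# Hypothetical progressive passes per 90 for three scouting targets.
print(min_max_scale([2, 4, 6]))
```

Because the result depends on the observed minimum and maximum, a single outlier can compress every other player's score, which is worth checking before combining scaled metrics into a composite.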
Recorded presentations from the leading sports analytics conference. Search for soccer/football talks on metric validation, expected goals, and recruitment analytics. → Further Reading: Introduction to Soccer Metrics
MMPose
github.com/open-mmlab/mmpose - A comprehensive pose estimation library supporting multiple architectures (HRNet, ViTPose) with pre-trained sports models. → Chapter 23: Further Reading
Monte Carlo Simulation
A computational technique using random sampling to estimate probability distributions. In soccer, used to simulate match outcomes, season results, and tournament brackets. (Chapter 20) → Appendix E: Glossary of Soccer Analytics Terms
mplsoccer
Providing beautiful pitch visualizations - **socceraction** — Implementing action valuation frameworks - **statsbombpy** — Enabling access to open event data - **kloppy** — Standardizing data formats across providers - **pandas, numpy, scikit-learn** — The foundational tools of data science → Acknowledgments
mplsoccer (Python)
GitHub: andrewRowlinson/mplsoccer - Use: Pitch visualization for defensive action maps → Chapter 12: Further Reading
mplsoccer Documentation
Python visualization library - Pitch plots and heatmaps - xT grid visualization - Link: mplsoccer.readthedocs.io → Chapter 9: Further Reading
Network Analysis
The study of relationships between entities (players) using graph theory. Passing networks represent players as nodes and passes as weighted edges. (Chapter 10) → Appendix E: Glossary of Soccer Analytics Terms
Countermovement jump (CMJ) height and related metrics (flight time, rate of force development). - Isometric mid-thigh pull or adductor squeeze. - Typical recovery timeline: CMJ returns to baseline within 48-72 hours after a match. → Chapter 26: Injury Prevention and Load Management
Norfair
github.com/tryolabs/norfair - A lightweight, customisable multi-object tracking library in Python. Designed for real-time applications and easy integration with detection models. → Chapter 23: Further Reading
Normal (Gaussian) distribution
**Binomial and Poisson distributions** (basic awareness) - **Understanding of probability density functions** → Prerequisites
Normalization
Scaling data to a standard range. Per-90 normalization divides raw counts by minutes played and multiplies by 90 to enable fair comparison between players with different playing times. (Chapter 5) → Appendix E: Glossary of Soccer Analytics Terms
NoSQL Databases (MongoDB, DynamoDB):
Best for semi-structured or schema-flexible data (event data with varying qualifier structures, tracking data) - MongoDB's document model maps naturally to the nested JSON structure of event data - Better horizontal scalability for very large datasets - Less suitable for complex joins across multipl → Chapter 2: Data Sources and Collection in Soccer
NumPy
Matrix operations for transition matrices - Value iteration implementation - Efficient numerical computing → Chapter 9: Further Reading
NumPy Fundamentals:
Vectorized operations dramatically outperform Python loops - Statistical functions enable quick exploratory analysis - Spatial calculations (distance, angle) support position-based analytics - Random number generation powers Monte Carlo simulations → Chapter 4: Python Tools for Soccer Analytics
O
Observations:
France's possession decreased through the tournament as opponents strengthened - Their efficiency (xG per sequence) remained consistently high - In knockout rounds, they averaged just 39% possession but won all four matches → Case Study 1: Possession Efficiency in the 2018 World Cup
Off-Ball Movement
Player movement when not in possession of the ball, including runs to create space, pressing movements, and defensive positioning. Requires tracking data to analyze. (Chapter 18) → Appendix E: Glossary of Soccer Analytics Terms
Weeks 1-2: Chapters 1-2 (Foundations) - Weeks 3-4: Chapters 3-4 (Statistics and Python) - Weeks 5-6: Chapters 5-6 (Metrics and Coordinates) - Weeks 7-8: Chapters 7-8 (xG and Passing) - Weeks 9-10: Chapters 9-10 (Possession and Defense) - Weeks 11-12: Chapters 11-12 (GK and Set Pieces) - Weeks 13-14: → How to Use This Book
Online Courses:
Khan Academy Statistics and Probability (free) - Coursera: Statistics with Python Specialization - edX: Introduction to Probability and Statistics → Prerequisites
Online Learning:
Friends of Tracking (YouTube): free video tutorials on tracking data analysis - DataCamp and Coursera sports analytics courses - StatsBomb IQ and other commercial educational platforms - University MOOCs in sports analytics and data science → Chapter 30: The Future of Soccer Analytics
Open Football Data:
A community-maintained repository of open soccer data on GitHub - Includes fixtures, results, and league tables in structured formats - Good for basic historical analysis → Chapter 2: Data Sources and Collection in Soccer
Opportunities:
Lower leagues have less data coverage (less competition for undervalued players) - Young players in lower leagues are often mispriced - Analytical approaches can identify hidden value - Championship (second tier) has significant financial upside for promotion → Case Study: Brentford's Moneyball Approach
Opta
Pitch dimensions: 100 x 100 (percentage-based) - Origin: bottom-left, with each team treated as attacking from left to right - $x$-axis: 0 to 100 (own goal line to opposition goal line) - $y$-axis: 0 to 100 (right touchline to left touchline from the perspective of the attacking team, since $y$ increases upward from the bottom-left origin) → Chapter 6: The Soccer Pitch as a Coordinate System
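Percentage-based coordinates usually need converting to meters before distances or angles are computed. A minimal sketch, assuming a 105 x 68 meter pitch (actual dimensions vary by stadium); the helper name `opta_to_meters` is hypothetical.

```python
PITCH_LENGTH_M = 105.0  # assumed pitch length in meters
PITCH_WIDTH_M = 68.0    # assumed pitch width in meters

def opta_to_meters(x_pct, y_pct, length=PITCH_LENGTH_M, width=PITCH_WIDTH_M):
    """Convert 0-100 percentage coordinates to meters from the same origin."""
    return x_pct / 100 * length, y_pct / 100 * width

# The centre spot in percentage coordinates.
print(opta_to_meters(50, 50))
```

Scaling the two axes separately matters because one percentage unit covers more ground along the length of the pitch than across its width.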
Opta (Stats Perform)
https://www.statsperform.com/ - Comprehensive event data coverage - Multiple xG model versions → Chapter 7: Further Reading
Opta Analytics Blog
Provider methodology insights - Event data explanations - Industry applications - Link: optasports.com/services/analytics → Chapter 8: Further Reading
Opta Pro Forum Presentations
Focus: Industry research on defensive metrics - Key Presentations: Annual analytics forum content → Chapter 12: Further Reading
Optical Tracking:
Multiple cameras positioned around the stadium (typically 12-20+) - Computer vision algorithms identify and track players and the ball - Providers: Second Spectrum, TRACAB (ChyronHego), Hawk-Eye - Most accurate method, with positional precision of approximately 10-30 centimeters - Requires permanent → Chapter 2: Data Sources and Collection in Soccer
Organizational Approach:
Small but influential analytics team (5-8 people at peak) - Direct reporting to ownership, bypassing traditional football hierarchies - Strong alignment between ownership vision and analytical methodology - Culture of experimentation and tolerance for failure → Chapter 28: Building an Analytics Department
Outcome KPIs:
Adoption rate of analytics recommendations - Success rate of analytically-supported signings - Accuracy of predictive models (calibration, discrimination) - Cost savings attributable to analytics → Chapter 28: Building an Analytics Department
Number of pre-match reports delivered on time - Number of player profiles generated for recruitment - Number of ad-hoc analysis requests fulfilled - Number of models deployed and maintained → Chapter 28: Building an Analytics Department
overfitting to metrics
identifying a player who excels on the specific metrics used in the shortlisting model but who lacks qualities that the model does not capture. → Chapter 21: Player Recruitment and Scouting
Overperformance
When a team's actual results (goals, points) exceed what their underlying metrics (xG, xPts) would predict. May indicate genuine skill or favorable variance. (Chapter 16) → Appendix E: Glossary of Soccer Analytics Terms
pandas
Data manipulation essential - Groupby operations for aggregation - Time series handling - Link: pandas.pydata.org → Chapter 8: Further Reading
pandas Essentials:
DataFrames efficiently store and manipulate tabular soccer data - Boolean indexing and query() filter data precisely - Groupby operations aggregate statistics at any level (player, team, match) - Merges combine multiple data sources (events, matches, player bio) - Time series operations support roll → Chapter 4: Python Tools for Soccer Analytics
Passes Per Defensive Action (PPDA)
A pressing intensity metric computed as opponent passes allowed divided by a team's defensive actions in the opponent's half. Lower PPDA indicates more intense pressing. (Chapter 12) → Appendix E: Glossary of Soccer Analytics Terms
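The ratio can be sketched in a few lines. The counts are assumed to be pre-filtered to the pitch zone used for the metric; the function name `ppda` and the choice to return infinity when no defensive actions are recorded are illustrative assumptions.

```python
def ppda(opponent_passes, defensive_actions):
    """Opponent passes allowed per defensive action in the pressing zone."""
    if defensive_actions == 0:
        # No defensive actions recorded: treat pressing intensity as minimal.
        return float("inf")
    return opponent_passes / defensive_actions

# Hypothetical match totals for two teams facing the same volume of opponent passes.
print(ppda(300, 30))  # more defensive actions, lower PPDA
print(ppda(300, 15))  # fewer defensive actions, higher PPDA
```

Which defensive actions count (tackles, interceptions, fouls, challenges) differs between providers, so values are only comparable when computed from the same definitions.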
Team retention rate - Skill development and certification completion - Internal stakeholder satisfaction scores - Cross-functional collaboration frequency → Chapter 28: Building an Analytics Department
Per-90 Metrics
Statistics normalized to a 90-minute match equivalent by dividing by minutes played and multiplying by 90. Enables comparison across players with different playing time. (Chapter 5) → Appendix E: Glossary of Soccer Analytics Terms
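A minimal sketch of the calculation; the function name `per_90` and the sample figures are hypothetical.

```python
def per_90(raw_count, minutes_played):
    """Scale a raw count to a 90-minute equivalent."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return raw_count * 90 / minutes_played

# Hypothetical example: 5 goals in 450 minutes versus 6 goals in 900 minutes.
print(per_90(5, 450))
print(per_90(6, 900))
```

Rates computed from very few minutes are unstable, so analysts commonly apply a minimum-minutes filter before ranking players on per-90 values.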
Percentiles and quartiles
**Basic data visualization:** histograms, scatter plots, box plots → Prerequisites
Perceptual Recovery:
Subjective wellness questionnaires (fatigue, muscle soreness, mood, sleep quality, stress). - Typically scored on Likert scales and tracked as rolling z-scores. → Chapter 26: Injury Prevention and Load Management
Performance Analysis:
Advanced metrics beyond basic statistics - Focus on repeatable, skill-based performance - Identification of players overperforming or underperforming expectations → Case Study: Brentford's Moneyball Approach
Performance Analysts:
Prepare opposition analysis and reports - Support coaches with tactical insights - Create video compilations linked to data - Deliver pre-match and post-match presentations - Skills: Video analysis software (Hudl, Wyscout), communication, tactical understanding, presentation skills → Chapter 1: Introduction to Soccer Analytics
Pitch Control
A model that computes the probability that each team controls each point on the pitch at a given moment, based on player positions and velocities. (Chapter 17) → Appendix E: Glossary of Soccer Analytics Terms
Platt Scaling
A calibration method that fits a logistic regression to transform model outputs into calibrated probabilities. Also called sigmoid calibration. (Chapter 19) → Appendix E: Glossary of Soccer Analytics Terms
Injury risk: LOW (no significant injuries) - Performance sustainability: MEDIUM (only 1.5 seasons of high-level data) - Adaptation risk: MEDIUM (Bundesliga to Premier League) - Character: LOW risk - Financial: MEDIUM (estimated fee 15-20M EUR, good development potential) - **Composite risk: 3.2/10** → Case Study 21.2: Scouting a Replacement — Finding the Next N'Golo Kanté
Player Epsilon (Age 26, Premier League):
Injury risk: MEDIUM (recurring ankle issues, 45 days missed) - Performance sustainability: LOW (consistent performer for 3 seasons) - Adaptation risk: LOW (already in Premier League) - Character: LOW risk (known in the league, good reputation) - Financial: HIGH (estimated fee 30-35M EUR, higher wage → Case Study 21.2: Scouting a Replacement — Finding the Next N'Golo Kanté
Statistical models estimate "true" player ability separate from market price - Models account for age, development trajectory, and positional scarcity - Emphasis on metrics that stabilize quickly (underlying performance over results) → Case Study: Brentford's Moneyball Approach
Player-level features:
Per-90 performance metrics (xG, xA, progressive actions, defensive actions) - Physical profile (height, speed, endurance metrics) - Age at transfer - Contract situation (years remaining, release clause) - Previous transfer history (number of clubs, adaptation record) → Chapter 20: Predictive Modeling
plotly
Interactive visualizations - Web-based dashboards - Animation support - Link: plotly.com/python → Chapter 8: Further Reading
Poisson Distribution
A discrete probability distribution modeling the number of events in a fixed interval. Goals per match are approximately Poisson-distributed, making it fundamental to expected points calculations. (Chapter 3) → Appendix E: Glossary of Soccer Analytics Terms
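The link to expected points can be sketched by summing scoreline probabilities. This is a simplified illustration, assuming the two teams' goal counts are independent Poisson variables and truncating the scoreline grid at `max_goals`; the function names and the goal rates are hypothetical.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k events when the expected count is `rate`."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def expected_points(home_rate, away_rate, max_goals=10):
    """Expected league points (3 for a win, 1 for a draw) for each side."""
    p_home = p_draw = p_away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                p_home += p
            elif h == a:
                p_draw += p
            else:
                p_away += p
    return 3 * p_home + p_draw, 3 * p_away + p_draw

# Hypothetical goal expectations for a single match.
home_xpts, away_xpts = expected_points(1.8, 1.1)
print(f"home: {home_xpts:.2f} pts, away: {away_xpts:.2f} pts")
```

The independence assumption is a simplification; in practice it tends to understate the frequency of draws, which is one reason more elaborate models exist.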
Pose Estimation
A computer vision technique that detects and tracks the positions of body joints (skeleton) from video. Future applications include biomechanical analysis and technique assessment. (Chapter 30) → Appendix E: Glossary of Soccer Analytics Terms
Position Calculation:
Triangulation from multiple camera views to determine 3D positions - Kalman filtering for smooth trajectories that handle measurement noise - Handling occlusions (when players block each other from camera view, the system must predict positions based on recent trajectory) - Identity management (main → Chapter 2: Data Sources and Collection in Soccer
Possession & Creativity:
Progressive passes per 90 minutes - Expected assists (xA) per 90 - Through balls completed per 90 - Carries into the final third per 90 - Pass completion percentage (adjusted by length and direction) → Chapter 21: Player Recruitment and Scouting
**Match preparation:** Analysts can compute the opponent's typical defensive shape to identify areas of weakness (e.g., if the CV of Voronoi areas is high, there are exploitable gaps). - **In-game adjustments:** Real-time tracking data enables live monitoring of defensive compactness, triggering ale → Case Study 6.2: Building a Defensive Shape Analyzer Using Coordinate Data
Practical Approaches:
Use domain knowledge to evaluate mechanisms - Control for confounders in regression - Use natural experiments when available (e.g., rule changes, managerial sackings) - Use causal inference frameworks (difference-in-differences, instrumental variables) - Be humble about causal claims → Chapter 3: Statistical Foundations for Soccer Analysis
Pressing
A defensive tactic where players actively close down opponents to force turnovers. Pressing intensity, triggers, and effectiveness are key analytical topics. (Chapter 22) → Appendix E: Glossary of Soccer Analytics Terms
Best statistical fit for the departed player's role - Strong scout endorsement - Reasonable transfer fee with development upside - Manageable risk profile - Age 24 means 4-5 years of peak performance and potential resale value → Case Study 21.2: Scouting a Replacement — Finding the Next N'Golo Kanté
Prior belief (before streak):
Thompson's true scoring rate: approximately 0.31 goals per 90 - Standard deviation of true ability: approximately 0.08 (based on player population) → Case Study 1: Evaluating a Hot Streak
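One way such a prior can be combined with a streak is a normal-normal update, sketched below. The streak itself (6 goals in six full 90s) is a hypothetical input chosen for illustration, and the sampling variance uses a rough Poisson-style approximation; neither comes from the case study.

```python
def normal_update(prior_mean, prior_sd, obs_rate, obs_var):
    """Precision-weighted combination of a normal prior and a noisy observation."""
    prior_prec = 1.0 / prior_sd ** 2
    obs_prec = 1.0 / obs_var
    post_var = 1.0 / (prior_prec + obs_prec)
    post_mean = post_var * (prior_prec * prior_mean + obs_prec * obs_rate)
    return post_mean, post_var ** 0.5


PRIOR_MEAN = 0.31   # goals per 90, from the prior above
PRIOR_SD = 0.08     # spread of true ability, from the prior above

# Hypothetical streak (assumption): 6 goals in 6.0 full 90s
streak_goals, streak_nineties = 6, 6.0
obs_rate = streak_goals / streak_nineties

# Rough approximation (assumption): a Poisson rate estimated over n 90s has
# variance of about rate / n; the prior mean stands in for the unknown rate.
obs_var = PRIOR_MEAN / streak_nineties

post_mean, post_sd = normal_update(PRIOR_MEAN, PRIOR_SD, obs_rate, obs_var)
print(f"posterior: {post_mean:.3f} goals per 90 (sd {post_sd:.3f})")
```

Because the prior is tight relative to the noise in a six-match sample, the posterior moves only part of the way from 0.31 toward the streak rate.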
Problem 1:
a) Mean = (2+1+0+3+2+1+4+2+1+2)/10 = 18/10 = **1.8 goals** - b) Sorted: 0,1,1,1,2,2,2,2,3,4. Median = (2+2)/2 = **2 goals** - c) Population variance = Σ(x-μ)²/n = 11.6/10 = 1.16, Standard deviation = √1.16 ≈ **1.08 goals** → Prerequisites
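The hand calculation can be checked with Python's standard `statistics` module. This is only a verification sketch; note that it uses the population formulas (`pvariance`, `pstdev`), which divide by n as in part (c).

```python
import statistics

goals = [2, 1, 0, 3, 2, 1, 4, 2, 1, 2]

mean_goals = statistics.mean(goals)        # part (a)
median_goals = statistics.median(goals)    # part (b)
variance = statistics.pvariance(goals)     # part (c), divides by n
std_dev = statistics.pstdev(goals)         # square root of the population variance

print(mean_goals, median_goals, round(variance, 2), round(std_dev, 2))
```

Using `statistics.variance` instead would divide by n - 1 and give a slightly larger sample variance.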
Problem formulation
Define the target variable and the decision the model will inform. 2. **Data collection** --- Aggregate event data (e.g., StatsBomb, Opta), tracking data (e.g., Second Spectrum, SkillCorner), or both. 3. **Feature engineering** --- Transform raw events into informative predictor variables. 4. **Trai → Chapter 19: Machine Learning for Soccer
Process KPIs:
Average turnaround time for analysis requests - Percentage of routine reporting that is automated - System uptime and data pipeline reliability - Code quality metrics (test coverage, documentation) → Chapter 28: Building an Analytics Department
Products:
**Event Data (F24/F9):** Detailed event-level data with roughly 2,000 or more events per match (passes, shots, tackles, fouls, etc.), each with x,y coordinates. - **Match Data:** Pre-match, live, and post-match statistics. - **Player Data:** Seasonal aggregates, biographical information. - **Advanced Metrics:** Expecte → Appendix D: Data Sources and Tools
Profile Elements:
Total penalties taken - Conversion rate - Preferred placement zones - Technique patterns → Chapter 14: Exercises
Comfortable writing Python functions and classes - Experience with basic data structures (lists, dictionaries) - Familiarity with reading documentation and debugging - Some exposure to pandas or similar data manipulation libraries → Preface
Progressive Pass
A pass that moves the ball at least 10 meters closer to the opponent's goal, measured along the x-axis. A key ball progression metric. (Chapter 10) → Appendix E: Glossary of Soccer Analytics Terms
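The definition translates into a one-line check. The sketch below assumes coordinates are already in meters and that the team attacks toward increasing x; the function name and the sample passes are hypothetical.

```python
def is_progressive(start_x: float, end_x: float, min_gain_m: float = 10.0) -> bool:
    """True if the pass gains at least min_gain_m along the x-axis.

    Assumes x is measured in meters and the attacking direction is +x.
    """
    return (end_x - start_x) >= min_gain_m


# Hypothetical passes as (start_x, end_x) pairs in meters
passes = [(40.0, 52.0), (40.0, 45.0), (60.0, 50.0), (30.0, 40.0)]
progressive_count = sum(is_progressive(sx, ex) for sx, ex in passes)
```

Providers and analysts use several competing definitions, so the 10-meter threshold should be treated as a parameter, not a standard.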
Properties of Expectation:
$\mathbb{E}[aX + b] = a\mathbb{E}[X] + b$ (linearity) - $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$ (always, even if dependent) - $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$ if $X$ and $Y$ are independent (independence is sufficient but not necessary; the identity holds exactly when $X$ and $Y$ are uncorrelated) → Appendix A: Mathematical Foundations
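These properties can be checked exactly on small discrete distributions. The three toy joint distributions below are assumptions made for illustration; exact fractions avoid any floating-point noise.

```python
from fractions import Fraction
from itertools import product


def expectation(pmf, f):
    """Exact expectation of f(x, y) under a joint pmf {(x, y): probability}."""
    return sum(p * f(x, y) for (x, y), p in pmf.items())


def summarize(pmf):
    e_x = expectation(pmf, lambda x, y: x)
    e_y = expectation(pmf, lambda x, y: y)
    e_sum = expectation(pmf, lambda x, y: x + y)
    e_prod = expectation(pmf, lambda x, y: x * y)
    return e_x, e_y, e_sum, e_prod


half, third = Fraction(1, 2), Fraction(1, 3)

# Independent: fair coin X in {0, 1}, uniform Y in {1, 2, 3}
independent = {(x, y): half * third for x, y in product((0, 1), (1, 2, 3))}
# Dependent and correlated: Y is a copy of a fair coin X
copied = {(0, 0): half, (1, 1): half}
# Dependent but uncorrelated: X uniform on {-1, 0, 1}, Y = X^2
squared = {(x, x * x): third for x in (-1, 0, 1)}

for name, pmf in [("independent", independent), ("copied", copied), ("squared", squared)]:
    e_x, e_y, e_sum, e_prod = summarize(pmf)
    print(name, "sum rule:", e_sum == e_x + e_y, "product rule:", e_prod == e_x * e_y)
```

The sum rule holds in all three cases. The product rule holds for the independent pair, fails for the copied pair, and holds for the squared pair even though Y is fully determined by X.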
StatsBomb includes freeze frames with their event data, making them the most widely accessible source of freeze frame data - Some providers offer "enhanced" event data with positional context for certain event types → Chapter 2: Data Sources and Collection in Soccer
Building a public profile through blog posts, Twitter threads, or other platforms - Contributing to open-source projects like mplsoccer (Python soccer visualization library) or socceraction - Participating in competitions (e.g., the Friends of Tracking challenge, Kaggle competitions) - Attending and → Chapter 1: Introduction to Soccer Analytics
Community discussion of soccer metrics, tools, and methodologies. Good for staying current with new developments and debating metric design choices. → Further Reading: Introduction to Soccer Metrics
Radar Chart (Spider Chart)
A visualization that displays multiple variables on axes radiating from a center point, commonly used to show player profiles across multiple metrics. (Chapter 15) → Appendix E: Glossary of Soccer Analytics Terms
Random Forest
An ensemble learning method that builds multiple decision trees and averages their predictions. Provides feature importance rankings useful for understanding model drivers. (Chapter 19) → Appendix E: Glossary of Soccer Analytics Terms
Random Forests and Gradient Boosting:
Handle nonlinear relationships and interactions naturally. - Feature importance scores provide interpretability. - Require careful hyperparameter tuning to avoid overfitting on small datasets. - Gradient boosted models (XGBoost, LightGBM) have shown promise in injury prediction research. → Chapter 26: Injury Prevention and Load Management
Recommendation:
Converting insights into specific recommendations - Acknowledging uncertainty appropriately - Providing options when appropriate - Making the decision easy for the stakeholder—not "here is some data" but "I recommend we do X because the data shows Y" → Chapter 1: Introduction to Soccer Analytics
clubs are buying future performance, not past statistics. Projection models are therefore essential. - **Age curves** describe the typical relationship between age and performance. Most outfield players peak between ages 24-29, with physical attributes declining before technical ones. - **The delta → Chapter 21: Key Takeaways
Recruitment Process:
Data screening of large player populations - Statistical shortlisting to identify candidates - Video analysis of shortlisted players - Traditional scouting for final validation - Structured interviews assessing psychological factors → Case Study: Brentford's Moneyball Approach
Recurrent Neural Networks:
Can model sequential load data (daily load time series) directly. - LSTM architectures can capture long-range dependencies in training history. - Require substantially more data than traditional approaches and are prone to overfitting in typical soccer club datasets. → Chapter 26: Injury Prevention and Load Management
Reference Data:
Player profiles: age, height, weight, preferred foot, position(s), nationality, date of birth - Team rosters and formations - Match metadata: date, venue, competition, round, referee, attendance - League tables and standings - Competition structures (group stages, knockout rounds, promotion/relegati → Chapter 2: Data Sources and Collection in Soccer
Regression to the Mean
The statistical tendency for extreme observations to be followed by more moderate ones. Critical for interpreting over- and under-performance in soccer metrics. (Chapter 3) → Appendix E: Glossary of Soccer Analytics Terms
Relational Databases (PostgreSQL, MySQL, SQLite):
Best for structured data with well-defined schemas (event data, player reference data, match metadata) - SQL queries enable complex filtering, joining, and aggregation - Referential integrity ensures data consistency - PostgreSQL with the PostGIS extension is particularly useful for spatial queries → Chapter 2: Data Sources and Collection in Soccer
Model must work with freely available event data - Performance should approach commercial alternatives - Predictions must be well-calibrated (accurate probabilities) - Code should be maintainable and documented → Case Study 1: Building a Production-Ready xG Model
Research Analysts:
Focus on specific long-term research projects - Develop club's analytical methodology - Often specialized (e.g., set piece analyst, tracking data analyst) - Produce internal papers and methodological guides - Skills: Deep expertise in specific area, research methodology, academic writing → Chapter 1: Introduction to Soccer Analytics
Results:
Multiple Danish Superliga titles - Consistent overperformance relative to wage bill - Successful player development and profitable transfer activity - Established template for data-driven club management → Chapter 28: Building an Analytics Department
Reward controllable outcomes
individual statistics the player can influence 2. **Incentivize team success** --- bonuses tied to collective achievements 3. **Manage downside risk** --- cap total compensation to protect the club's budget 4. **Avoid moral hazard** --- prevent perverse incentives (e.g., a striker avoiding defensive → Chapter 25: Economic Analysis and Player Valuation
Roboflow Sports Datasets
roboflow.com/sports Curated datasets for sports object detection, including annotated soccer frames for player, ball, and referee detection. → Chapter 23: Further Reading
Roboflow, "How to Detect Soccer Players"
blog.roboflow.com. A step-by-step tutorial on training a custom YOLOv8 model for soccer player detection, including data annotation, training, and inference. → Chapter 23: Further Reading
ROC-AUC
The area under the Receiver Operating Characteristic curve, measuring a classifier's ability to distinguish between positive and negative cases. Used to evaluate xG model discrimination. (Chapter 19) → Appendix E: Glossary of Soccer Analytics Terms
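ROC-AUC has a direct probabilistic reading: it is the chance that a randomly chosen goal received a higher predicted probability than a randomly chosen non-goal, with ties counted as half. The sketch below computes it by comparing all pairs; the five shots and their xG values are made up for illustration.

```python
def roc_auc(y_true, y_score):
    """Pairwise ROC-AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking. Quadratic in the number of shots,
    so suitable for illustration rather than large datasets.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical shots: 1 = goal, 0 = no goal, with made-up xG values
outcomes = [1, 0, 1, 0, 0]
xg_values = [0.40, 0.10, 0.35, 0.38, 0.05]
auc = roc_auc(outcomes, xg_values)   # 5 of 6 pairs ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. AUC says nothing about calibration, which is why it is usually reported alongside a calibration measure.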
Delivering local analytics services to their club's coaching and recruitment staff - Providing local context for centrally developed models - Feeding local data and intelligence back to the hub - Adapting group-wide tools to local needs → Case Study 2: Scaling Analytics at Manchester City Football Group
Scalars, Vectors, and Matrices
Scalars are denoted by lowercase italic letters: $x$, $y$, $\theta$. - Vectors are denoted by lowercase bold letters: $\mathbf{x}$, $\mathbf{w}$, $\mathbf{v}$. - Matrices are denoted by uppercase bold letters: $\mathbf{A}$, $\mathbf{X}$, $\mathbf{\Sigma}$. - The $i$-th element of vector $\mathbf{x}$ → Appendix A: Mathematical Foundations
scikit-learn
`pip install scikit-learn` - Core ML library for xG models - Logistic regression, gradient boosting, calibration → Chapter 7: Further Reading
Optical tracking systems installed in MLS (all stadiums), La Liga (all stadiums), and the Premier League - Advanced analytics layer on top of tracking data, including proprietary metrics and tactical classification - Powers broadcast graphics (the "augmented reality" overlays showing tactical patter → Chapter 2: Data Sources and Collection in Soccer
measured as meters progressed per second -- distinguishes fast transitions from patient build-up. Counter-pressing teams like Liverpool under Klopp showed the highest sequence speeds in the Premier League, reflecting their philosophy of attacking quickly after winning the ball. → Chapter 11: Possession and Territorial Control
Set Piece
A restart of play from a dead-ball situation: corners, free kicks, throw-ins, goal kicks, and penalties. Set pieces account for approximately 25-30% of all goals. (Chapter 14) → Appendix E: Glossary of Soccer Analytics Terms
Set Piece Analysis Series
StatsBomb - Corner and free kick analysis - xA from set pieces - Design and execution → Chapter 8: Further Reading
a significant portion of scoring 2. **Throw-ins are most frequent** but corners/free kicks have higher individual value 3. **Penalty conversion is ~76%** - highly reliable opportunity 4. **Shot placement is key for penalties** - top corners most effective 5. **Mixed strategies are optimal** in game- → Chapter 14: Quiz
Sets and Indices
Sets are denoted by calligraphic uppercase letters: $\mathcal{S}$, $\mathcal{T}$, $\mathcal{P}$. - The set of real numbers is $\mathbb{R}$; the set of positive integers is $\mathbb{Z}^+$. - We use $i \in \{1, 2, \ldots, n\}$ to index observations and $j \in \{1, 2, \ldots, p\}$ to index features. → Appendix A: Mathematical Foundations
Broadcast tracking data - Physical metrics - Off-ball movement data - Link: skillcorner.com → Chapter 8: Further Reading
SkillCorner Open Data
Access: Research partnerships - Type: Broadcast tracking - Use: Movement and positioning → Chapter 13: Further Reading
SkillCorner Open Research
Focus: Publicly available tracking research - Key Content: Defensive intensity and pressing data → Chapter 12: Further Reading
SkillCorner:
Tracking data derived from broadcast video using computer vision, requiring no stadium hardware - Founded in 2017, SkillCorner has rapidly grown to become a significant player in the tracking data market - Growing coverage: any televised match can potentially be processed - Democratizing tracking da → Chapter 2: Data Sources and Collection in Soccer
Skills Applied:
Understanding analytics organizational structures - Recognizing the role of analytics in decision-making - Evaluating the value of data-driven approaches → Case Study: The Liverpool Analytics Revolution
Soccer Example:
Player A: 20% conversion vs weak teams, 15% vs strong teams - Player B: 18% conversion vs weak teams, 13% vs strong teams - Player A is better against both types of opposition - But Player B can still have the higher OVERALL conversion if B plays mostly weak teams while A plays mostly strong teams! → Chapter 3: Statistical Foundations for Soccer Analysis
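The reversal is easy to reproduce numerically. The sketch below reuses the four conversion rates from the example; the shot counts are hypothetical, and the context labels refer only to whichever opposition type yields the higher or lower conversion rate.

```python
def overall_rate(shots_by_context, rate_by_context):
    """Shot-weighted overall conversion rate across contexts."""
    total_shots = sum(shots_by_context.values())
    total_goals = sum(shots_by_context[c] * rate_by_context[c] for c in shots_by_context)
    return total_goals / total_shots


# Conversion rates from the example, keyed by context
rates_a = {"higher_conversion": 0.20, "lower_conversion": 0.15}
rates_b = {"higher_conversion": 0.18, "lower_conversion": 0.13}

# Hypothetical shot mixes: A shoots mostly in the harder context, B in the easier one
shots_a = {"higher_conversion": 20, "lower_conversion": 80}
shots_b = {"higher_conversion": 80, "lower_conversion": 20}

overall_a = overall_rate(shots_a, rates_a)   # 16 goals from 100 shots
overall_b = overall_rate(shots_b, rates_b)   # 17 goals from 100 shots
```

Player A leads in each context, yet Player B leads overall, purely because of the difference in shot mix. Comparing players within context, or adjusting for opposition strength, removes the distortion.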
Soccer-Specific Conventions
Pitch coordinates: origin at the bottom-left corner of the pitch, $x$-axis running along the length (0 to 120 yards or 0 to 105 meters), $y$-axis running along the width (0 to 80 yards or 0 to 68 meters). - Time: match time $t$ measured in minutes from kickoff, with $t \in [0, 90]$ for regulation ti → Appendix A: Mathematical Foundations
Soccer:
Understanding of basic soccer rules and gameplay - Familiarity with common tactical concepts (formations, positions) - General awareness of major leagues and competitions → Preface
socceraction
`pip install socceraction` - SPADL data format and VAEP implementation - Academic research tools - https://github.com/ML-KULeuven/socceraction → Chapter 7: Further Reading
CVPR/ECCV annual workshops. Presentations from leading research groups on the latest advances in soccer video understanding. Available on the SoccerNet YouTube channel. → Chapter 23: Further Reading
SoccerNet GitHub
github.com/SoccerNet Open-source tools, benchmarks, and pre-trained models for soccer video understanding. Includes tracking, action spotting, and camera calibration tasks. → Chapter 23: Further Reading
Sources:
Transfermarkt: The largest publicly accessible source of transfer data, player valuations, injury histories, and squad information. Despite being a website rather than a formal data provider, Transfermarkt's community-maintained data is widely used in professional and academic analytics. - National → Chapter 2: Data Sources and Collection in Soccer
Sourcing data:
Is the data available internally? - Do we need to purchase from a provider? - Can we collect it ourselves? - Are free alternatives adequate for this purpose? → Chapter 1: Introduction to Soccer Analytics
League titles: Neymar contributed to multiple Ligue 1 championships, though PSG was already dominant domestically - Champions League: PSG reached the final in 2020 and semi-final in 2021, their best-ever results - Individual awards: Neymar did not win the Ballon d'Or, which was part of the implicit → Case Study 1: The Neymar Effect — How One Transfer Reshaped the Market
Stage 1: Ad Hoc (1-2 people)
Single analyst or small team - Reactive work driven by coaching requests - Basic tools (Excel, basic video analysis) - Limited data infrastructure - Typical budget: $50,000 -- $150,000 → Chapter 28: Building an Analytics Department
Stage 1: Data Screening
Player databases covering 50+ leagues worldwide were screened using proprietary statistical models - Models evaluated players on output metrics (xG, xA, progressive actions) rather than traditional statistics (goals, assists) - Age, contract status, and estimated market value were used as additional → Case Study 21.1: Brentford's Moneyball — How Data-Driven Recruitment Built a Premier League Team
Stage 1: Data-Led Discovery
Analysts define search parameters based on tactical needs - Automated screening produces a long list of candidates - Initial statistical profiles and percentile rankings are generated → Chapter 21: Player Recruitment and Scouting
Stage 2: Foundational (3-5 people)
Dedicated roles for match analysis and recruitment - Established data pipelines from providers (Opta, StatsBomb, etc.) - Basic dashboards and reporting - Beginning to influence some decisions - Typical budget: $200,000 -- $500,000 → Chapter 28: Building an Analytics Department
Stage 2: Scout-Led Evaluation
Scouts review video of data-identified candidates - Live scouting assignments are prioritized based on data rankings - Scouts provide structured reports addressing specific questions raised by the data → Chapter 21: Player Recruitment and Scouting
Joint meetings between analysts and scouts to discuss candidates - Data provides context for scout observations ("you noted he doesn't press well -- his pressing numbers confirm this") - Scouts provide context for data anomalies ("his passing numbers are low because his team plays long ball") → Chapter 21: Player Recruitment and Scouting
Stage 3: Established (6-12 people)
Specialized roles including data scientists and engineers - Custom models and tools - Proactive analysis alongside reactive support - Regular integration into decision-making processes - Typical budget: $500,000 -- $1,500,000 → Chapter 28: Building an Analytics Department
Stage 3: Video and Live Scouting
Scouts received targeted shortlists with specific questions to address (e.g., "Confirm or deny the data suggestion that this player's pressing is elite") - Structured scouting reports were submitted that mapped onto statistical categories - Multiple scouts evaluated each candidate to reduce individu → Case Study 21.1: Brentford's Moneyball — How Data-Driven Recruitment Built a Premier League Team
Stage 4: Advanced (13-25+ people)
Full-stack analytics operation - Proprietary data collection and tracking systems - Research and development function - Analytics embedded in organizational culture - Typical budget: $1,500,000 -- $5,000,000+ → Chapter 28: Building an Analytics Department
Combined data-scout reports for the sporting director / decision-maker - Clear presentation of both quantitative evidence and qualitative assessment - Explicit articulation of risks and uncertainties → Chapter 21: Player Recruitment and Scouting
Understanding of descriptive statistics (mean, median, standard deviation) - Familiarity with probability concepts (probability distributions, conditional probability) - Basic exposure to hypothesis testing and confidence intervals - Awareness of regression analysis concepts → Preface
Pitch dimensions: 120 x 80 (arbitrary units, not metres) - Origin: top-left corner of the pitch when the attacking team attacks left-to-right - $x$-axis: runs left to right (0 to 120) - $y$-axis: runs top to bottom (0 to 80) - The team always attacks toward $x = 120$ in the first half → Chapter 6: The Soccer Pitch as a Coordinate System
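Converting these event coordinates to metres with a bottom-left origin takes a rescale plus a flip of the y-axis. The sketch below assumes a 105 x 68 metre pitch; real pitch sizes vary, so both dimensions are parameters, and the function name is hypothetical.

```python
def event_to_metres(x: float, y: float,
                    pitch_length: float = 105.0,
                    pitch_width: float = 68.0):
    """Map 120 x 80 top-left-origin coordinates to metres, bottom-left origin.

    The 105 x 68 default is an assumed pitch size, not a property of the data.
    """
    x_m = x / 120.0 * pitch_length
    # y runs top to bottom in the source system, so flip it before scaling
    y_m = (80.0 - y) / 80.0 * pitch_width
    return x_m, y_m


centre_spot = event_to_metres(60.0, 40.0)   # (52.5, 34.0)
```

Because the x and y scale factors differ (0.875 and 0.85 here), distances and angles should be computed after conversion, not in the raw units.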
StatsBomb (statsbomb.com)
Industry-leading analytics company whose blog and public research set the standard for football analytics communication. Their data specification documents are invaluable for understanding data architecture. → Chapter 28 Further Reading: Building an Analytics Department
StatsBomb - Set Piece Analysis Series
URL: statsbomb.com/articles - Key Articles: "The Art of the Set Piece", "Corner Kick Efficiency" → Chapter 14: Further Reading
StatsBomb 360
Full event and tracking data - Freeze frame information - Industry standard for clubs - Link: statsbomb.com → Chapter 8: Further Reading
StatsBomb Articles
Industry-leading analysis - xA methodology explanations - Player evaluation examples - Link: statsbomb.com/articles → Chapter 8: Further Reading
Type: Commercial platform - Features: PSxG, distribution analysis, comprehensive profiles → Chapter 13: Further Reading
StatsBomb IQ - Set Piece Module
Type: Commercial platform - Features: Comprehensive set piece tracking and visualization → Chapter 14: Further Reading
StatsBomb IQ Articles on Defensive Metrics
URL: statsbomb.com/articles - Focus: PPDA, pressing metrics, defensive analysis - Key Articles: "What is PPDA and is it useful?", "Evaluating Defenders" → Chapter 12: Further Reading
StatsBomb Open Data
Highest quality free data, includes World Cup matches with freeze frames. Available via Python API or direct download. Best for learning event data analysis. → Quiz: Data Sources and Collection in Soccer
StatsBomb Open Data - Corner Analysis
Platform: GitHub/Documentation - Focus: Working with corner kick data - Level: Beginner to Intermediate → Chapter 14: Further Reading
StatsBomb Open Data Repository
https://github.com/statsbomb/open-data - Sample code for data access - Specification documents → Chapter 7: Further Reading
StatsBomb Open Data Tutorials
Platform: GitHub/YouTube - Focus: Working with event data for defensive analysis - Level: Beginner to Intermediate → Chapter 12: Further Reading
they play similarly regardless of the score. Teams with large $\|\Delta \mathbf{v}_s\|$ are **strategically adaptive**. Neither is inherently better; the question is whether the adaptation is effective. → Chapter 22: Match Strategy and Tactics
Stratified K-Fold
A cross-validation variant that preserves the class distribution (e.g., goal/no-goal ratio) in each fold, important for imbalanced datasets. (Chapter 19) → Appendix E: Glossary of Soccer Analytics Terms
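The idea can be sketched in plain Python by dealing each class's indices round-robin across folds. This is an illustrative implementation with a hypothetical function name; in practice scikit-learn's `StratifiedKFold` covers the same need. The 10% goal rate in the toy data is an assumption.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds that each keep roughly the same class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's indices round-robin so every fold gets its share
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Toy shot data (assumed): 10 goals among 100 shots
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, k=5)
goals_per_fold = [sum(labels[i] for i in fold) for fold in folds]   # 2 goals in each fold
```

With plain random splitting, a fold of 20 shots could easily contain no goals at all, which makes metrics such as log loss and ROC-AUC unstable or undefined on that fold.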
Strengths of Event Data:
Widely available for most professional matches across dozens of leagues - Standardized formats allow cross-competition analysis - Captures the "story" of a match in structured form - Relatively affordable compared to tracking data - Sufficient for many common analytical tasks (xG, passing analysis, → Chapter 2: Data Sources and Collection in Soccer
Strengths of Tracking Data:
Captures everything that happens on the pitch, including off-ball movement - Enables analysis of pressing patterns, space creation, defensive shape, and other continuous phenomena - Allows sophisticated spatial models (pitch control, pressing intensity, expected threat) - Physical metrics support lo → Chapter 2: Data Sources and Collection in Soccer
Strengths:
Extensive historical archive enabling longitudinal analysis - Consistent definitions across seasons (though definitions do evolve, changes are documented) - Well-documented data dictionary with detailed qualifier definitions - Industry standard for many metrics—when media refer to "official" stats, → Chapter 2: Data Sources and Collection in Soccer
Striker metrics:
**Non-penalty goals per 90** (npG/90): Removes penalty distortion - **Non-penalty xG per 90** (npxG/90): Shot quality regardless of finishing - **xG outperformance** (npG - npxG): Finishing skill or luck (controversial---see Chapter 10) - **Shot volume per 90** - **Aerial duels won per 90** (for tar → Chapter 15: Player Performance Metrics
Strong Positive Correlation:
xG and actual goals (r ≈ 0.85): This validates xG as a predictive metric - Shots and xG (r ≈ 0.75): Teams that shoot more generate more xG - Points and goal difference (r ≈ 0.95): Nearly perfectly correlated in league play → Chapter 3: Statistical Foundations for Soccer Analysis
Sufficient sample size
Minimum ~50,000 actions for stable estimates - One full league season typically provides adequate data - More data (multiple seasons) produces smoother estimates → Chapter 9: Expected Threat (xT) and Ball Progression
Survival Analysis
Statistical methods for analyzing time-to-event data, accounting for censoring. Applied in soccer for injury duration modeling and return-to-play estimation. (Chapter 26) → Appendix E: Glossary of Soccer Analytics Terms
Top bar: Score, match time, momentum indicator (colored bar) - Main panel: Pitch map with player dots, formation lines, and Voronoi space control - Bottom bar: Three alert slots (latest alerts in color-coded boxes) → Case Study 2: Building a Live Match Dashboard for Coaching Staff
Channels: Tifo, The Coaches' Voice, Pep Confident - Focus: Visual tactical defensive analysis - Level: All levels → Chapter 12: Further Reading
Team Possession %
Use PADA formula 2. **Opposition Strength** - Weight by opponent xG created 3. **Game State** - Segment by leading/level/trailing 4. **Position** - Compare within position groups → Chapter 12: Key Takeaways
Technical Considerations:
Web scraping is inherently fragile: changes to a website's HTML structure can break your scraper without warning. Build scrapers that fail gracefully and log errors clearly. - Use libraries like `BeautifulSoup` (for HTML parsing) and `requests` (for HTTP requests) in Python, or `Selenium` for JavaSc → Chapter 2: Data Sources and Collection in Soccer
Technical projects:
xG model built from scratch, with documentation explaining methodology, validation, and limitations - Passing network analysis revealing tactical patterns in specific teams or matches - Player similarity tool that identifies comparable players across leagues - Match prediction model with calibration → Chapter 1: Introduction to Soccer Analytics
Optical flow (pixel-level motion between frames) - Trajectory patterns (player and ball movement over time windows) - Temporal convolutions over spatial features → Chapter 23: Video Analysis and Computer Vision
Terms of Use:
Free for educational, personal, and non-commercial use - Attribution to StatsBomb is required in any publication or presentation - Not for commercial products without a separate commercial license - Data should not be redistributed outside the terms of use → Chapter 2: Data Sources and Collection in Soccer
*OpenIntro Statistics* (free online) — Comprehensive introduction - *Statistics* by Freedman, Pisani, and Purves — Classic introduction - *Naked Statistics* by Charles Wheelan — Accessible, non-technical → Prerequisites
The Analyst (Opta)
Official Opta content - Data visualizations - Industry insights - Link: youtube.com/theanalyst → Chapter 8: Further Reading
The Analyst Podcast
Regular discussions on soccer analytics trends and methods. - **Tifo Football** --- Accessible tactical and analytical content on YouTube. - **Zonal Marking** (Michael Cox) --- Tactical analysis blog and book (*Zonal Marking: The Making of Modern European Football*). - **Between the Posts** --- Anal → Chapter 30: Further Reading
The Athletic (Soccer Analytics Coverage)
Long-form analytics journalism that demonstrates effective metric communication to a general audience. Good examples of the storytelling principles from Section 5.6. → Further Reading: Introduction to Soccer Metrics
25 frames/second x 90 minutes x 60 seconds/minute = 135,000 frames per match - 22 players x 2 coordinates = 44 positional values per frame - Plus ball position (3 coordinates including height), velocities, accelerations - **Total: approximately 6 million positional values per match (135,000 x 47 ≈ 6.3 million), before velocities and accelerations are added** → Chapter 2: Data Sources and Collection in Soccer
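The arithmetic can be reproduced in a few lines. The sketch counts raw positional values only and ignores stoppage time, substitutions, and derived quantities such as velocity and acceleration.

```python
FPS = 25                  # frames per second
MATCH_MINUTES = 90        # regulation time only
PLAYERS = 22
COORDS_PER_PLAYER = 2     # x and y
BALL_COORDS = 3           # x, y and height

frames = FPS * MATCH_MINUTES * 60                     # 135,000 frames
player_values_per_frame = PLAYERS * COORDS_PER_PLAYER # 44 values
values_per_frame = player_values_per_frame + BALL_COORDS
raw_values = frames * values_per_frame                # 6,345,000 positional values

print(f"{frames:,} frames, {raw_values:,} raw positional values")
```

Storing derived velocities and accelerations alongside positions multiplies this figure, which is why tracking data is usually kept in columnar or binary formats instead of CSV.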
embedded (within football ops), centralized (independent unit), and multi-club (group level) --- and the optimal choice depends on club size, ownership philosophy, and organizational context. → Chapter 28 Key Takeaways: Building an Analytics Department
**Distance covered in speed zones** (walking, jogging, running, high-speed running, sprinting) - **Number and distance of sprints** - **Peak speed** - **Number of accelerations and decelerations above thresholds** - **Metabolic power and energy expenditure** → Chapter 18: Tracking Data Analytics
Total Football Analysis
Tactical and data analysis - Player profiles with xA - European league coverage - Link: totalfootballanalysis.com → Chapter 8: Further Reading
Total Football Analysis Podcast
Focus: In-depth tactical breakdowns - Defensive Content: Team defensive system analysis → Chapter 12: Further Reading
Tracking Data
Positional coordinates for all 22 players and the ball at high frequency (typically 25 Hz), captured by camera systems or GPS. Enables spatial analysis of off-ball movement and team shape. (Chapter 18) → Appendix E: Glossary of Soccer Analytics Terms
An estimated monetary value of a player, influenced by performance, age, contract length, and market conditions. Analytics models attempt to identify under- and over-valued players. (Chapter 25) → Appendix E: Glossary of Soccer Analytics Terms
Transfermarkt
Player and match data - Market values and transfers - Extensive historical records - Link: transfermarkt.com → Chapter 8: Further Reading
Transfermarkt:
Market values and transfer history for players worldwide - Squad information including contract details, agent information, and historical clubs - Injury records (dates, types, duration) - Web scraping or unofficial APIs (the `transfermarkt-api` package provides structured access) - Extremely compre → Chapter 2: Data Sources and Collection in Soccer
transient fatigue
temporary performance decrements following intense passages of play. After a period of sustained high-intensity effort (e.g., a prolonged pressing sequence), players may show reduced output for the subsequent 2--5 minutes. This transient effect is superimposed on the broader match-long fatigue trend → Chapter 18: Tracking Data Analytics
https://www.twenty3.sport/ - Industry insights and visualization examples - xG communication best practices → Chapter 7: Further Reading
Types of drift:
**Data drift** (covariate shift): The distribution of input features changes. Example: a new league season features more shots from outside the box due to tactical trends. - **Concept drift**: The relationship between features and the target changes. Example: VAR overturning goals changes the effect → Chapter 19: Machine Learning for Soccer
Typical values:
Naive baseline (always predicting the base rate): ~0.35 - Simple distance model: ~0.30 - Good xG model: ~0.26-0.28 - Excellent model with tracking data: ~0.24-0.26 → Chapter 7: Expected Goals (xG) Models
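The excerpt does not name the metric behind these values. On the assumption that they are log-loss figures, the base-rate value can be reproduced as below; the 11% goal rate is also an assumption, chosen as a plausible share of shots that become goals.

```python
from math import log


def log_loss(y_true, y_prob):
    """Mean negative log-likelihood of binary outcomes under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -log(p) if y == 1 else -log(1.0 - p)
    return total / len(y_true)


BASE_RATE = 0.11                      # assumed share of shots that are goals
shots = [1] * 11 + [0] * 89           # 100 shots matching that rate

baseline = log_loss(shots, [BASE_RATE] * len(shots))   # every shot gets the base rate
coin_flip = log_loss(shots, [0.5] * len(shots))        # uninformed 50/50 guess

print(f"base-rate log loss: {baseline:.3f}, coin-flip log loss: {coin_flip:.3f}")
```

Under these assumptions the base-rate predictor scores about 0.35, which is the bar any shot-level feature has to clear.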
github.com/ultralytics/ultralytics The most widely used object detection library for sports applications. Pre-trained models can be fine-tuned on soccer-specific data with minimal code. → Chapter 23: Further Reading
Pressure events tracked - Goalkeeper positioning data - Carry events (not just dribbles) - Shot freeze-frame data (positions of all nearby players) → Appendix D: Data Sources and Tools
Social media analysis (verify methods) - Older blog posts (methods may be outdated) - Non-attributed analysis → Chapter 8: Further Reading
VAEP (Valuing Actions by Estimating Probabilities)
An action valuation framework that assigns value to every on-ball action based on its impact on the probability of scoring and conceding in subsequent actions. (Chapter 9) → Appendix E: Glossary of Soccer Analytics Terms
Validating data quality:
Is the data complete? - Are there obvious errors? - Is it consistent across sources? - Are there known limitations we need to account for? → Chapter 1: Introduction to Soccer Analytics
Validation:
Do results make sense? - Are they robust to different methodological choices? - Do they replicate on holdout data? - Would a domain expert find the conclusions reasonable? → Chapter 1: Introduction to Soccer Analytics
A metric comparing player contribution (on-pitch value) to their cost (transfer fee amortization plus wages), used for transfer audit and recruitment efficiency analysis. (Chapter 29) → Appendix E: Glossary of Soccer Analytics Terms
Variants:
**Momentum:** $\boldsymbol{v}^{(t+1)} = \gamma \boldsymbol{v}^{(t)} + \eta \nabla f$; $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \boldsymbol{v}^{(t+1)}$ - **Adam:** Adaptive learning rates using first and second moment estimates. The default optimizer for most deep learning models in → Appendix A: Mathematical Foundations
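The momentum update can be traced on a small quadratic. In the sketch below the objective, learning rate $\eta = 0.05$, momentum $\gamma = 0.9$, and starting point are all assumptions chosen so the iteration is stable; only the two update lines mirror the formula.

```python
def grad_f(theta):
    """Gradient of the assumed test objective f = 0.5 * (t0^2 + 10 * t1^2)."""
    return [theta[0], 10.0 * theta[1]]


def momentum_descent(theta, eta=0.05, gamma=0.9, steps=300):
    """Gradient descent with momentum, following the update rule above."""
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        grad = grad_f(theta)
        # v(t+1) = gamma * v(t) + eta * grad f
        velocity = [gamma * v + eta * g for v, g in zip(velocity, grad)]
        # theta(t+1) = theta(t) - v(t+1)
        theta = [t - v for t, v in zip(theta, velocity)]
    return theta


theta_final = momentum_descent([5.0, -3.0])   # converges toward the minimum at (0, 0)
```

The velocity term accumulates past gradients, which damps oscillation along the steep direction (the second coordinate here) while keeping progress along the shallow one.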
Video Analysis Platforms:
**Hudl (formerly Wyscout Professional):** Leading platform for team video analysis, offering tagging, clipping, sharing, and presentation tools. Used by thousands of professional and amateur clubs worldwide. Hudl acquired Wyscout in 2019, consolidating the two largest video platforms under one compa → Chapter 2: Data Sources and Collection in Soccer
CNN features from individual frames - 3D CNN features from frame sequences (e.g., I3D, SlowFast networks) - Transformer-based video representations → Chapter 23: Video Analysis and Computer Vision
Visualization:
Clear, effective data visualizations that communicate findings without requiring extensive explanation - Interactive dashboards using tools like Streamlit, Tableau, or Observable - Novel visual formats that present familiar data in new, illuminating ways → Chapter 1: Introduction to Soccer Analytics
Aggregated player ratings and statistics powered by Opta data - Requires web scraping for data extraction - Less detailed than FBref for advanced metrics - Player ratings (on a 1-10 scale) are widely referenced in media and fan discussion → Chapter 2: Data Sources and Collection in Soccer
William Spearman, "Beyond Expected Goals"
MIT SSAC 2018. Spearman's presentation of the physics-based pitch control model. Available on YouTube. → Chapter 17: Further Reading
Winger metrics add:
**Successful dribbles per 90 and dribble success rate** - **Crosses and cross accuracy** - **Touches in the penalty area per 90** - **xA per 90** → Chapter 15: Player Performance Metrics
Workload Monitoring
The systematic tracking of physical demands placed on players during training and matches, using metrics derived from GPS, accelerometer, and heart rate data. (Chapter 26) → Appendix E: Glossary of Soccer Analytics Terms
Blog posts explaining methods in accessible language - Deep dives on specific questions (e.g., "How does Manchester City's pressing structure change when trailing?") - Analysis of current events demonstrating ability to produce timely, relevant work → Chapter 1: Introduction to Soccer Analytics
Wyscout
Pitch dimensions: 100 x 100 (percentage-based) - Origin: top-left (similar to screen coordinates) - $x$-axis: 0 to 100 (left to right, own goal to opposition goal) - $y$-axis: 0 to 100 (top to bottom) → Chapter 6: The Soccer Pitch as a Coordinate System
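As a rough sketch, the percentage-based coordinates described above can be converted to metres as follows. The 105 x 68 m pitch size is an assumption (actual pitch dimensions vary), and `wyscout_to_metres` is a hypothetical helper name, not part of any Wyscout library.

```python
def wyscout_to_metres(x, y, length=105.0, width=68.0, flip_y=True):
    """Convert Wyscout percentage coordinates (0-100) to metres.

    x runs from own goal (0) to opposition goal (100).
    y runs from top (0) to bottom (100) because the origin is top-left.
    With flip_y=True the result uses a bottom-left origin instead, which
    matches the usual mathematical plotting convention.
    """
    if not (0 <= x <= 100 and 0 <= y <= 100):
        raise ValueError("Wyscout coordinates must lie in [0, 100]")
    x_m = x / 100 * length
    y_pct = 100 - y if flip_y else y
    y_m = y_pct / 100 * width
    return x_m, y_m


centre_spot = wyscout_to_metres(50, 50)
print(centre_spot)
```

Flipping the $y$-axis is optional. It matters mainly when Wyscout events are combined with data from a provider that uses a bottom-left origin.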
Wyscout Academy
Professional scouting integration - Video and data combination - Industry applications - Link: wyscout.com → Chapter 9: Further Reading
a cumulative plot of each team's xG throughout the match. These visualizations have become ubiquitous in post-match analysis and tell compelling stories about how matches unfolded. → Chapter 7: Expected Goals (xG) Models
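The data behind such a cumulative xG plot is a running total per team. A minimal sketch, assuming shots arrive as `(minute, team, xg)` tuples; the tuple layout, the function name, and the sample values are invented for illustration.

```python
from collections import defaultdict


def cumulative_xg(shots):
    """Return {team: [(minute, running_xg_total), ...]} in chronological order."""
    timelines = defaultdict(list)
    totals = defaultdict(float)
    for minute, team, xg in sorted(shots, key=lambda shot: shot[0]):
        totals[team] += xg
        timelines[team].append((minute, totals[team]))
    return dict(timelines)


# Made-up shots for illustration only.
shots = [
    (12, "Home", 0.08),
    (55, "Home", 0.35),
    (30, "Away", 0.12),
    (78, "Away", 0.05),
]
timelines = cumulative_xg(shots)
for team, points in timelines.items():
    print(team, points)
```

Each timeline can then be drawn as a step chart, for example with matplotlib's `Axes.step` using `where="post"`, so that the line jumps at the minute of each shot.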
Tifo Football — Tactical explanations - The Coaches' Voice — Professional insights - HITC Sevens — Tactical analysis → Prerequisites
Z
Zonal Marking
A defensive system where players are responsible for areas of the pitch rather than specific opponents. At set pieces, contrasted with man-marking schemes. (Chapter 14) → Appendix E: Glossary of Soccer Analytics Terms
Zone Definitions:
Prime: 18-22m, central - Good: 22-28m, central - Moderate: 18-25m, wide - Marginal: 28-35m, any angle → Chapter 14: Exercises
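The zone definitions above can be expressed as a small lookup function. This is a sketch under stated assumptions: the text does not say how "central" and "wide" are separated, so the caller supplies that as a boolean, and the handling of shared boundaries (for example 22 m counting as Prime) is a choice made here, not something the definitions specify. Positions the definitions do not cover return `None`.

```python
def classify_zone(distance_m, is_central):
    """Map a distance from goal (metres) and a central/wide flag to a zone name."""
    if is_central:
        if 18 <= distance_m <= 22:
            return "Prime"
        if 22 < distance_m <= 28:
            return "Good"
    elif 18 <= distance_m <= 25:
        return "Moderate"
    # Marginal applies at any angle between 28 and 35 metres.
    if 28 <= distance_m <= 35:
        return "Marginal"
    return None  # not covered by the listed definitions


examples = [(20, True), (25, True), (20, False), (30, False), (26, False)]
labels = [classify_zone(d, central) for d, central in examples]
print(labels)
```

Note that wide positions between 25 m and 28 m fall outside every listed zone, which the function surfaces as `None` rather than guessing a label.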