Appendix E: Glossary of Soccer Analytics Terms
This glossary provides definitions for key terms used throughout the textbook. Terms are organized alphabetically. The chapter reference in parentheses indicates where the term is first introduced or most thoroughly discussed.
A
Accelerometer --- A sensor that measures the rate of change of velocity, commonly embedded in GPS vests to track player movements and impacts during training and matches. (Chapter 18)
Accuracy --- In classification, the proportion of correct predictions out of total predictions. In soccer analytics, often applied to pass completion or shot-on-target rates. (Chapter 3)
Action Valuation --- A framework for assigning a numerical value to every on-ball action (pass, carry, shot, tackle) based on its contribution to the probability of scoring or conceding. VAEP and xT are prominent action valuation models. (Chapter 9)
Acute:Chronic Workload Ratio (ACWR) --- The ratio of a player's recent training load (typically a 7-day rolling average) to their longer-term baseline (typically a 28-day rolling or exponentially weighted moving average). Values between 0.8 and 1.3 are generally considered the safe zone. (Chapter 26)
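As an illustration of the ACWR entry above, the following is a minimal sketch that uses simple rolling means for both windows. This is only one variant: an exponentially weighted baseline would replace the plain 28-day mean. The function name `acwr` and the sample loads are hypothetical.

```python
def acwr(daily_loads, acute_days=7, chronic_days=28):
    """Acute:chronic workload ratio from a list of daily loads (oldest first).

    Assumption: both windows are plain rolling means ending on the last day.
    """
    if len(daily_loads) < chronic_days:
        raise ValueError("need at least `chronic_days` days of load data")
    acute = sum(daily_loads[-acute_days:]) / acute_days
    chronic = sum(daily_loads[-chronic_days:]) / chronic_days
    if chronic == 0:
        raise ValueError("chronic load is zero; ratio is undefined")
    return acute / chronic


# Hypothetical example: three steady weeks followed by one heavier week.
loads = [400.0] * 21 + [500.0] * 7
ratio = acwr(loads)
in_safe_zone = 0.8 <= ratio <= 1.3
```

Here the acute mean is 500 and the chronic mean is 425, so the ratio is about 1.18, inside the 0.8 to 1.3 band.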
Aerial Duel --- A contest between two players for a ball in the air, typically from a long pass, cross, or goal kick. Win rate is expressed as a percentage. (Chapter 12)
Age Curve --- A statistical model describing how player performance changes as a function of age, typically showing peak performance between ages 25 and 29 for most outfield positions. (Chapter 21)
Algorithmic Bias --- Systematic errors in model predictions that create unfair outcomes for particular groups, such as undervaluing players from certain leagues or backgrounds. (Chapter 30)
Angle to Goal --- The angle subtended by the goal posts from the location of a shot, calculated using trigonometry. A key feature in expected goals models. (Chapter 7)
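The trigonometry behind the angle can be sketched as follows, together with the Euclidean distance to the goal center, another common xG feature. The coordinate convention is an assumption made for this example: `x` is the perpendicular distance from the goal line in meters, `y` is the lateral offset from the goal center, and the goal is 7.32 m wide. Data providers use their own coordinate systems, so a conversion step would normally come first.

```python
import math

GOAL_WIDTH_M = 7.32  # distance between the two posts


def shot_angle(x, y):
    """Angle in radians subtended by the goal posts from the point (x, y).

    Assumed convention: x > 0 is the distance from the goal line,
    y is the lateral offset from the center of the goal.
    """
    half = GOAL_WIDTH_M / 2
    # atan2(cross, dot) of the two vectors from the shot to each post
    return math.atan2(GOAL_WIDTH_M * x, x * x + y * y - half * half)


def shot_distance(x, y):
    """Euclidean distance in meters from (x, y) to the center of the goal."""
    return math.hypot(x, y)


# A shot from the penalty spot, 11 m out and central.
penalty_angle = shot_angle(11.0, 0.0)
penalty_distance = shot_distance(11.0, 0.0)
```

For a central shot the result reduces to `2 * atan(3.66 / x)`, which gives a quick sanity check.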
Assist --- The final pass or action leading directly to a goal. Expected assists (xA) models assign a probability that a given pass will result in a goal. (Chapter 8)
B
Backpass --- A pass directed away from the opponent's goal, often used to retain possession and reset the attacking build-up. Excessive backpass frequency may indicate a team under pressing pressure. (Chapter 22)
Bayesian Inference --- A statistical framework that updates beliefs (prior probabilities) with observed data (likelihood) to produce updated beliefs (posterior probabilities). Particularly useful in soccer analytics due to small sample sizes. (Chapter 19)
Betweenness Centrality --- A network metric measuring how often a node (player) lies on the shortest path between other nodes. High betweenness centrality in a passing network indicates a player who is critical to the team's ball circulation. (Chapter 10)
Big Chance --- A shot opportunity where the shooter would reasonably be expected to score, typically defined as situations with xG above 0.35. Used by some data providers as a categorical metric alongside xG. (Chapter 7)
Bootstrap --- A resampling technique that creates multiple samples by drawing with replacement from the original dataset. Used to estimate confidence intervals for metrics like xG model coefficients. (Chapter 3)
Brier Score --- A scoring rule that measures the accuracy of probabilistic predictions by computing the mean squared difference between predicted probabilities and actual binary outcomes. Lower is better. (Chapter 3)
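The Brier score definition translates directly into a few lines of code. The sketch below uses only the standard library; the function name and the sample predictions are illustrative.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Three hypothetical shots: predicted xG and whether each was scored.
score = brier_score([0.1, 0.8, 0.3], [0, 1, 0])
```

The squared errors are 0.01, 0.04, and 0.09, so the score is 0.14 / 3, roughly 0.047.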
Build-Up Play --- The phase of play where a team progresses the ball from their own defensive third toward the opponent's goal. Characterized by passing sequences, ball carries, and positional movements. (Chapter 22)
C
Calibration --- The property that a model's predicted probabilities match observed frequencies. If a well-calibrated xG model assigns 0.20 to a group of shots, approximately 20% of those shots result in goals. (Chapter 7)
Carry --- An on-ball event where a player moves with the ball at their feet. Progressive carries advance the ball at least 10 meters toward the opponent's goal. (Chapter 5)
Clustering --- An unsupervised machine learning technique that groups similar observations together. In scouting, used to identify player archetypes. K-means and hierarchical clustering are common methods. (Chapter 21)
Coefficient of Variation (CV) --- The ratio of the standard deviation to the mean, used to measure relative variability. In analytics, useful for assessing consistency of performance metrics. (Chapter 3)
Compactness --- A measure of how tightly grouped a team's players are on the pitch, typically computed as the area of the convex hull of player positions. A smaller hull area indicates a more compact defensive shape. (Chapter 17)
Cosine Similarity --- A measure of similarity between two vectors, computed as the cosine of the angle between them. Used in scouting to find players with similar statistical profiles. (Chapter 21)
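A minimal sketch of cosine similarity for two player profiles is given below. The profile vectors are hypothetical. In practice the metrics would usually be scaled first so that no single statistic dominates.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical profiles; player_b has exactly double player_a's output.
player_a = [1.0, 2.0, 3.0]
player_b = [2.0, 4.0, 6.0]
same_shape = cosine_similarity(player_a, player_b)
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the measure depends only on direction, the two proportional profiles score 1.0 despite the difference in volume.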
Counter-Attack --- A rapid attacking transition following a turnover, designed to exploit the opponent's disorganized defensive shape. Characterized by directness and speed. (Chapter 22)
Cross-Validation --- A model evaluation technique that partitions data into complementary subsets for training and testing. Stratified k-fold cross-validation is standard for imbalanced classification problems like xG. (Chapter 19)
Cutback --- A pass or cross delivered backward from the byline toward the edge of the penalty area, creating high-quality shooting opportunities. Cutback assists often lead to high-xG shots. (Chapter 22)
D
Dangerous Attack --- A possession sequence that enters the final third and results in a shot or penalty area entry. The ratio of dangerous attacks to total attacks measures a team's attacking quality. (Chapter 22)
Data Engineering --- The practice of collecting, cleaning, transforming, and organizing data for analysis. Typically consumes 50-70% of effort in analytics projects. (Chapter 2)
Defensive Action --- An umbrella term for tackles, interceptions, clearances, blocks, and pressures performed to regain possession or prevent the opponent from progressing. (Chapter 12)
Defensive Line Height --- The average x-coordinate of a team's defensive line, indicating how high or deep they defend. Measured from tracking data. (Chapter 18)
Demographic Parity --- A fairness criterion requiring that a model's positive prediction rate be equal across different demographic groups. (Chapter 30)
Digital Twin --- A continuously updated computational model of a player that integrates physical, tactical, technical, and psychological data to predict performance under different conditions. (Chapter 30)
Distance to Goal --- The Euclidean distance from the shot location to the center of the goal. One of the most predictive features in xG models. (Chapter 7)
E
Edge Computing --- Processing data near the point of collection (e.g., at the stadium) rather than in a remote cloud, enabling real-time analytics with low latency. (Chapter 30)
Eigenvector Centrality --- A network metric that measures a node's influence based on the influence of its neighbors. In passing networks, identifies players connected to other highly connected players. (Chapter 10)
Equalized Odds --- A fairness criterion requiring equal true positive rates and false positive rates across demographic groups. (Chapter 30)
Event Data --- A structured record of every on-ball action in a match, including passes, shots, tackles, and carries, with spatial coordinates and outcome labels. Provided by companies such as StatsBomb, Opta, and Wyscout. (Chapter 2)
Expected Assists (xA) --- The probability that a given pass will result in a goal, aggregated over all passes to produce a player's or team's expected assist total. (Chapter 8)
Expected Calibration Error (ECE) --- The weighted average absolute difference between predicted probabilities and observed frequencies across probability bins. Measures calibration quality. (Chapter 7)
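The binning procedure behind ECE can be sketched as follows. Equal-width bins and the default of 10 bins are assumptions made for this example. Other binning schemes, such as equal-frequency bins, are also used.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean of |mean predicted probability - observed frequency| per bin."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Equal-width bins on [0, 1]; p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_prob - observed)
    return ece


# Five shots rated 0.2, one of which was scored: perfectly calibrated bin.
calibrated = expected_calibration_error([0.2] * 5, [1, 0, 0, 0, 0])
# Two shots rated 0.9, neither scored: badly overconfident.
overconfident = expected_calibration_error([0.9, 0.9], [0, 0])
```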
Expected Goals (xG) --- The probability that a shot will result in a goal, based on factors including distance, angle, body part, and play pattern. The foundational metric of modern soccer analytics. (Chapter 7)
Expected Goals Against (xGA) --- The total xG of shots conceded by a team, measuring the quality of chances they allow. (Chapter 7)
Expected Points (xPts) --- The expected number of league points from a match, derived from xG and xGA using a Poisson model to simulate scoreline probabilities. (Chapter 20)
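One way to sketch the xPts calculation is shown below. It assumes that each team's goals follow independent Poisson distributions with means equal to xG and xGA, and it sums the scoreline probabilities analytically rather than by random simulation. The truncation at 10 goals per team and the function names are choices made for this example.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def expected_points(xg_for, xg_against, max_goals=10):
    """Return (xPts, (p_win, p_draw, p_loss)) under independent Poisson scoring."""
    p_win = p_draw = p_loss = 0.0
    for goals_for in range(max_goals + 1):
        for goals_against in range(max_goals + 1):
            p = poisson_pmf(goals_for, xg_for) * poisson_pmf(goals_against, xg_against)
            if goals_for > goals_against:
                p_win += p
            elif goals_for == goals_against:
                p_draw += p
            else:
                p_loss += p
    # Three points for a win, one for a draw.
    return 3 * p_win + p_draw, (p_win, p_draw, p_loss)


# Hypothetical match: 1.8 xG created, 0.9 xG conceded.
xpts, (p_win, p_draw, p_loss) = expected_points(1.8, 0.9)
even_xpts, (even_win, even_draw, even_loss) = expected_points(1.0, 1.0)
```

The independence assumption is a simplification. Some models add a correction for low-scoring draws.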
Expected Threat (xT) --- A framework that assigns a goal probability to every location on the pitch based on historical data, and values actions by the change in threat they produce. (Chapter 9)
F
False Nine --- A tactical role where the center forward drops deep into midfield to create space and receive the ball, disrupting the opponent's defensive structure. Analyzing false nines requires tracking data to capture positional flexibility. (Chapter 22)
Feature Engineering --- The process of creating informative input variables for machine learning models from raw data. In xG models, includes computing distance, angle, and zone indicators. (Chapter 19)
Feature Importance --- A measure of how much each input variable contributes to a model's predictions. Computed via coefficient magnitude (logistic regression) or permutation importance (tree-based models). (Chapter 19)
Field Tilt --- A team's share of all final-third passes or touches in a match: its own passes or touches in the opponent's defensive third divided by the combined final-third total of both teams. Indicates territorial dominance. (Chapter 11)
Formation --- The spatial arrangement of players on the pitch, described in shorthand (e.g., 4-3-3, 3-5-2). Modern analysis recognizes that formations are fluid and phase-dependent. (Chapter 22)
Freeze Frame --- A snapshot of all player positions at the moment of a key event (e.g., a shot), providing spatial context for action valuation. (Chapter 2)
G
Game State --- The current score differential during a match, which influences team behavior (e.g., teams trailing take more risks). An important contextual variable in analytics models. (Chapter 7)
Generative Model --- A model that learns the underlying distribution of data and can generate new synthetic examples. Applied in soccer for tactical simulation and data augmentation. (Chapter 30)
Goal-Creating Action (GCA) --- The two offensive actions (such as passes, dribbles, or shots) directly leading to a goal. A broader measure of goal involvement than assists alone. (Chapter 15)
GPS Vest --- A wearable device embedded with Global Positioning System receivers, accelerometers, and gyroscopes, worn by players during training and matches to collect positional and movement data. (Chapter 18)
Gradient Boosting --- An ensemble machine learning method that builds sequential decision trees, each correcting the errors of its predecessors. A common choice for xG models due to its strong predictive performance. (Chapter 19)
Graph Neural Network (GNN) --- A neural network architecture designed to operate on graph-structured data. Well-suited to modeling player interactions and passing networks. (Chapter 24)
H
Half-Space --- The tactical zones between the central channel and the flanks, approximately between the penalty area width and the center of the pitch. Controlling the half-spaces is a key principle of positional play. (Chapter 22)
Hazard Function --- In survival analysis, the instantaneous rate of an event (e.g., returning from injury) at time t, given survival to that point. (Chapter 26)
High-Intensity Running --- Running at speeds above a defined threshold (commonly around 5.5 m/s, or 19.8 km/h, though thresholds vary by provider and study). Tracked via GPS and used in workload monitoring. (Chapter 18)
I
Interception --- A defensive action where a player reads and cuts out an opponent's pass before it reaches its intended target. (Chapter 12)
Isotonic Regression --- A non-parametric calibration method that fits a non-decreasing function to transform model outputs into well-calibrated probabilities. Often used for post-hoc xG calibration. (Chapter 19)
J
JSON (JavaScript Object Notation) --- A lightweight data interchange format used by most soccer data providers (including StatsBomb) to structure event and match data. (Chapter 2)
K
Kaplan-Meier Estimator --- A non-parametric statistic used to estimate the survival function from censored data. In soccer, applied to return-to-play modeling after injuries. (Chapter 26)
Kernel Density Estimation (KDE) --- A non-parametric method for estimating the probability density function of a variable. Used to create smooth heatmaps of player actions or shot locations on the pitch. (Chapter 17)
Key Pass --- A pass that leads directly to a shot attempt, regardless of whether the shot results in a goal. (Chapter 8)
L
Line-Breaking Pass --- A pass that travels through a line of opposition players, progressing the ball past defensive or midfield structures. Valued highly in modern tactical analysis. (Chapter 10)
Log-Loss (Binary Cross-Entropy) --- A loss function for probabilistic classification models, penalizing confident incorrect predictions more heavily. The standard evaluation metric for xG models. (Chapter 3)
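The penalty on confident mistakes is easy to see in code. The sketch below is a standard-library version of binary log-loss. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def log_loss(probs, outcomes, eps=1e-15):
    """Mean binary cross-entropy between predicted probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # keep log() finite at 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


coin_flip = log_loss([0.5, 0.5], [1, 0])   # uninformative predictions
mild_miss = log_loss([0.6], [0])           # moderately wrong
confident_miss = log_loss([0.99], [0])     # confidently wrong
```

A prediction of 0.99 for a shot that misses costs far more than a prediction of 0.6 for the same miss.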
Logistic Regression --- A statistical model for binary classification that estimates probabilities using a logistic (sigmoid) function. Often used as a baseline xG model due to its interpretability and inherent calibration. (Chapter 19)
M
Man-Marking --- A defensive system where each defender is assigned to track a specific opponent. At set pieces, contrasted with zonal marking schemes. Analytics can evaluate which system is more effective against specific opponents. (Chapter 14)
MinMaxScaler --- A preprocessing technique that scales features to a fixed range, typically [0, 1]. Essential for fair comparison in multi-criteria scouting scores and distance-based algorithms. (Chapter 21)
Monte Carlo Simulation --- A computational technique using random sampling to estimate probability distributions. In soccer, used to simulate match outcomes, season results, and tournament brackets. (Chapter 20)
N
Network Analysis --- The study of relationships between entities (players) using graph theory. Passing networks represent players as nodes and passes as weighted edges. (Chapter 10)
Normalization --- Scaling data to a standard range. Per-90 normalization divides raw counts by minutes played and multiplies by 90 to enable fair comparison between players with different playing times. (Chapter 5)
O
Off-Ball Movement --- Player movement when not in possession of the ball, including runs to create space, pressing movements, and defensive positioning. Requires tracking data to analyze. (Chapter 18)
Overperformance --- When a team's actual results (goals, points) exceed what their underlying metrics (xG, xPts) would predict. May indicate genuine skill or favorable variance. (Chapter 16)
P
Passes Per Defensive Action (PPDA) --- A pressing intensity metric computed as opponent passes allowed divided by a team's defensive actions in the opponent's half. Lower PPDA indicates more intense pressing. (Chapter 12)
Penalty Area (Box) --- The rectangular area in front of each goal, extending 18 yards (16.5 m) from the goal line. Shots from inside the box are significantly more likely to result in goals than shots from outside it. (Chapter 7)
Per-90 Metrics --- Statistics normalized to a 90-minute match equivalent by dividing by minutes played and multiplying by 90. Enables comparison across players with different playing time. (Chapter 5)
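The per-90 calculation is a single line of arithmetic, sketched here with a guard against zero minutes. The function name and sample figures are illustrative.

```python
def per_90(count, minutes_played):
    """Scale a raw count to a per-90-minute rate."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return count / minutes_played * 90


# Hypothetical player: 5 goals in 1,350 minutes (15 full matches' worth).
goals_per_90 = per_90(5, 1350)
```

Per-90 rates from very small minute totals are noisy, so a minimum-minutes filter is often applied before comparing players.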
Pitch Control --- A model that computes the probability that each team controls each point on the pitch at a given moment, based on player positions and velocities. (Chapter 17)
Platt Scaling --- A calibration method that fits a logistic regression to transform model outputs into calibrated probabilities. Also called sigmoid calibration. (Chapter 19)
Poisson Distribution --- A discrete probability distribution modeling the number of events in a fixed interval. Goals per match are approximately Poisson-distributed, making it fundamental to expected points calculations. (Chapter 3)
Pose Estimation --- A computer vision technique that detects and tracks the positions of body joints (skeleton) from video. Future applications include biomechanical analysis and technique assessment. (Chapter 30)
Pressing --- A defensive tactic where players actively close down opponents to force turnovers. Pressing intensity, triggers, and effectiveness are key analytical topics. (Chapter 22)
Progressive Pass --- A pass that moves the ball at least 10 meters closer to the opponent's goal, measured along the x-axis. A key ball progression metric. (Chapter 10)
R
Radar Chart (Spider Chart) --- A visualization that displays multiple variables on axes radiating from a center point, commonly used to show player profiles across multiple metrics. (Chapter 15)
Random Forest --- An ensemble learning method that builds multiple decision trees and averages their predictions. Provides feature importance rankings useful for understanding model drivers. (Chapter 19)
Regression to the Mean --- The statistical tendency for extreme observations to be followed by more moderate ones. Critical for interpreting over- and under-performance in soccer metrics. (Chapter 3)
ROC-AUC --- The area under the Receiver Operating Characteristic curve, measuring a classifier's ability to distinguish between positive and negative cases. Used to evaluate xG model discrimination. (Chapter 19)
S
Set Piece --- A restart of play from a dead-ball situation: corners, free kicks, throw-ins, goal kicks, and penalties. Set pieces account for approximately 25-30% of all goals. (Chapter 14)
Shot on Target --- A shot that either results in a goal or would have entered the goal had it not been saved by the goalkeeper. Not all shots with high xG are on target. (Chapter 7)
Stratified K-Fold --- A cross-validation variant that preserves the class distribution (e.g., goal/no-goal ratio) in each fold, important for imbalanced datasets. (Chapter 19)
Survival Analysis --- Statistical methods for analyzing time-to-event data, accounting for censoring. Applied in soccer for injury duration modeling and return-to-play estimation. (Chapter 26)
T
Tracking Data --- Positional coordinates for all 22 players and the ball at high frequency (typically 25 Hz), captured by camera systems or GPS. Enables spatial analysis of off-ball movement and team shape. (Chapter 18)
Transfer Market Value --- An estimated monetary value of a player, influenced by performance, age, contract length, and market conditions. Analytics models attempt to identify under- and over-valued players. (Chapter 25)
V
VAEP (Valuing Actions by Estimating Probabilities) --- An action valuation framework that assigns value to every on-ball action based on its impact on the probability of scoring and conceding in subsequent actions. (Chapter 9)
Value Ratio --- A metric comparing player contribution (on-pitch value) to their cost (transfer fee amortization plus wages), used for transfer audit and recruitment efficiency analysis. (Chapter 29)
W
Workload Monitoring --- The systematic tracking of physical demands placed on players during training and matches, using metrics derived from GPS, accelerometer, and heart rate data. (Chapter 26)
X
xA --- See Expected Assists.
xG --- See Expected Goals.
xGA --- See Expected Goals Against.
xG Chain --- The total xG of all possessions in which a player was involved, measuring their contribution to shot-creating sequences. (Chapter 7)
xPts --- See Expected Points.
xT --- See Expected Threat.
Z
Zonal Marking --- A defensive system where players are responsible for areas of the pitch rather than specific opponents. At set pieces, contrasted with man-marking schemes. (Chapter 14)