> "Not everything that counts can be counted, and not everything that can be counted counts."
Learning Objectives
- Identify the most common traditional soccer statistics and articulate their strengths and limitations
- Explain the core principles behind designing effective performance metrics
- Distinguish between rate statistics and counting statistics and know when each is appropriate
- Apply context and adjustment factors to raw statistics to produce fairer comparisons
- Construct a metric validation framework using stability, discrimination, and predictive power
- Use percentile rankings and Z-scores to compare players across positions and leagues
- Understand composite metrics and player rating systems
- Communicate quantitative findings to non-technical stakeholders using best practices in data storytelling
- Describe the full lifecycle of a metric from creation through validation, adoption, and iteration
- Recognize and avoid common fallacies when interpreting soccer metrics
In This Chapter
- Chapter Overview
- 5.1 Traditional Statistics and Their Limitations
- 5.2 The Philosophy of Metric Design
- 5.3 Rate Statistics vs. Counting Statistics
- 5.4 Context and Adjustment Factors
- 5.5 Metric Validation Frameworks
- 5.6 Percentile Rankings and Z-Scores for Player Comparison
- 5.7 Composite Metrics and Player Ratings
- 5.8 Communicating Metrics to Stakeholders
- 5.9 The Metrics Lifecycle: Creation, Validation, Adoption, Iteration
- 5.10 Common Fallacies When Interpreting Soccer Metrics
- Summary
- Chapter 5 Notation Reference
Chapter 5: Introduction to Soccer Metrics
"Not everything that counts can be counted, and not everything that can be counted counts." — William Bruce Cameron (often attributed to Albert Einstein)
Chapter Overview
Soccer analytics rests on a deceptively simple premise: measure what players and teams do on the pitch, then use those measurements to make better decisions. In practice, this premise hides a rich set of challenges. What should we measure? How do we account for the fact that a midfielder playing against Manchester City faces a fundamentally different task than one playing against a newly promoted side? When is a number stable enough to trust, and when is it just noise dressed up as insight?
This chapter lays the conceptual groundwork for every metric we will encounter in the remainder of the textbook. We begin with the traditional statistics that have appeared on the back of football sticker cards for decades --- goals, assists, pass completion rates --- and carefully examine why they fall short as tools for serious analysis. We then step back and think about what makes a good metric: what design principles separate a genuinely informative measure from a misleading one. From there we explore the crucial distinction between counting statistics and rate statistics, learn how context adjustments can level the playing field, and build a systematic framework for validating any metric we create.
We continue by introducing percentile rankings and Z-scores as tools for comparing players, explore the construction of composite metrics and player rating systems, and address the human side of analytics: how to present numbers to coaches, scouts, directors of football, and fans in ways that inspire trust and drive action. We close the chapter with the full lifecycle of a metric --- from initial creation through validation, adoption, and iteration --- and catalogue the most common fallacies that lead analysts astray when interpreting soccer data.
By the end of this chapter, you will have a mental toolkit for evaluating any soccer metric you encounter --- whether it was invented last week in a blog post or has been used by clubs for years.
5.1 Traditional Statistics and Their Limitations
5.1.1 The Box-Score Era
For most of soccer's history, the statistical record of a match was remarkably thin. The official match report would note goals, assists, yellow and red cards, and substitutions. Television broadcasts added shots, corners, and possession percentages. Newspapers published league tables with points, goals scored, and goals conceded.
These box-score statistics remain the lingua franca of soccer discourse. When a pundit says "he's scored 15 goals this season," everyone understands the claim. That universality is a genuine strength --- but it comes at a cost.
Let us catalogue the most common traditional statistics and examine each one's blind spots.
5.1.2 Goals: The Noisiest Signal
Goals are the most important event in a match, yet as a measure of a striker's quality, raw goal tallies are surprisingly noisy. A striker who scores 20 goals from 14.0 xG worth of chances is having an unsustainable hot streak; one who scores 10 from 12.0 xG is being unlucky. Without understanding the quality of chances, we cannot separate skill from fortune.
Consider two Premier League strikers across a single season. Striker A scores 18 goals from shots totaling 14.5 xG --- an overperformance of 3.5 goals. Striker B scores 12 goals from shots totaling 15.8 xG --- an underperformance of 3.8 goals. The naive observer concludes that Striker A is the better finisher. But the data suggests that Striker B is actually getting into better positions and creating higher-quality chances; he has simply been unlucky with finishing variance. The following season, regression to the mean is likely to bring their totals closer together --- or even reverse the ranking.
This is not a hypothetical scenario. Research by analysts at clubs and in academia has consistently shown that finishing skill (the ability to beat xG consistently) is one of the least stable individual metrics in soccer. Only a handful of elite finishers --- players like Lionel Messi or Robert Lewandowski --- demonstrate persistent overperformance across multiple seasons, and even their margins are modest compared to the random variation that affects most strikers.
Common Pitfall: Raw goal tallies conflate two distinct skills: the ability to get into scoring positions (shot volume and shot quality) and the ability to convert those chances (finishing). Separating these skills requires expected goals models, which we introduce in Chapter 11.
5.1.3 Assists: Arbitrary Credit Assignment
An assist is awarded to the player who provides the final pass before a goal. This definition is arbitrary in several ways: a brilliant through ball that requires the scorer to beat three defenders counts the same as a simple square pass tapped in from two yards. Moreover, the assister gets no credit if the chance is missed.
Consider a playmaker who delivers ten exquisite through balls into the penalty area during a match. Nine of them are squandered by teammates who miss the target or are saved by the goalkeeper. One is converted. The playmaker receives one assist. A different playmaker delivers ten routine square passes across the box. Nine of them result in blocked shots. One is tapped in. That playmaker also receives one assist. By the raw assist count, these two performances are identical --- yet they represent vastly different levels of creative contribution.
The problem deepens when we consider that assists are entirely dependent on a teammate scoring. A midfielder who creates 0.35 xA (expected assists) per 90 minutes but plays alongside poor finishers may record fewer assists than one who creates 0.15 xA per 90 but plays alongside clinical strikers. The raw assist tally punishes the better creator.
Intuition: Assists measure the last step in the creative chain and only when that chain ends in a goal. They ignore the quality of the chance created and are hostage to the finishing ability of teammates. Expected assists (xA) address this by crediting the passer with the probability that their pass leads to a goal, regardless of whether the shot is actually converted.
5.1.4 Pass Completion Rate: Punishing Ambition
Pass completion rate is perhaps the most misleading traditional statistic. A centre-back who plays short, safe passes to fellow defenders will record a completion rate above 90%. An attacking midfielder who attempts incisive through balls may complete only 70% of passes --- yet the latter's passing is far more valuable. Without accounting for pass difficulty, completion rates punish ambition.
This problem is not merely theoretical. In the 2018-19 Premier League season, the players with the highest pass completion rates were overwhelmingly centre-backs and defensive midfielders who rarely attempted forward passes of more than 10 metres. Meanwhile, players like Kevin De Bruyne, who consistently attempted some of the most difficult passes in the league, recorded completion rates that looked mediocre by comparison. Any scouting model that ranked passers by completion rate would systematically undervalue creative players and overvalue conservative ones.
The solution is to compute expected pass completion (xPass) --- a model that estimates the probability of a pass being completed given its length, direction, the positions of nearby opponents, and other contextual features. The difference between actual completion and expected completion tells us whether a passer is beating the difficulty curve. We explore this concept further in Chapter 12.
5.1.5 Shots and Shots on Target
These counting statistics ignore the most important question: where was the shot taken from? A shot from 35 yards that sails over the bar and a shot from six yards that the goalkeeper saves both count as one shot, yet they represent entirely different levels of threat.
Shots on target is only slightly better. It tells us whether the shot was heading toward the goal frame, but a weak shot straight at the goalkeeper from 25 yards and a powerful drive to the top corner from 18 yards are both "on target." The metric provides no information about save difficulty.
Furthermore, shot volume is highly influenced by tactical context. A team that is trailing will naturally take more shots, including lower-quality attempts from distance, which inflates their shot count without reflecting genuine attacking quality. Teams that dominate possession may take fewer shots overall but generate higher-quality chances, making their raw shot totals misleadingly low.
5.1.6 Clean Sheets: A Team Outcome Attributed to Individuals
Clean sheets are attributed to goalkeepers and defences, but they are heavily influenced by sample size, opponent quality, and team tactics. A goalkeeper behind a dominant defence may accumulate clean sheets while rarely being tested. Conversely, an outstanding goalkeeper behind a weak defence may make extraordinary saves match after match yet rarely keep a clean sheet.
Consider two goalkeepers: Keeper A plays behind a defence that concedes an average of 0.8 xG per match. Keeper A records 15 clean sheets in 38 matches. Keeper B plays behind a defence that concedes 1.6 xG per match. Keeper B records 6 clean sheets. The raw clean sheet count suggests Keeper A is far superior, but the underlying data shows that Keeper B's defence allowed twice as much threat per match. A better metric would measure how many goals the goalkeeper prevented relative to what was expected --- the post-shot xG model we cover in Chapter 11.
Real-World Application: When Arsenal signed goalkeeper Aaron Ramsdale in 2021, some analysts noted that his clean sheet record at Sheffield United was poor. However, Sheffield United's defence had been catastrophically bad, conceding enormous volumes of high-quality chances. Ramsdale's post-shot expected goals numbers told a different story: he was saving shots that most goalkeepers would not, masked by the sheer volume of chances his defence allowed.
5.1.7 Why Traditional Statistics Mislead: Common Themes
The limitations above share common themes that will recur throughout this chapter:
- Lack of context. Raw counts ignore the difficulty of the task. Scoring against a low block is harder than scoring against a team pushing forward; completing a pass under pressure is harder than completing one in open space.
- Counting without weighting. Not all events are created equal. A tackle that wins the ball in the opponent's box is more valuable than one on the halfway line, yet both count as "one tackle."
- Credit assignment problems. Assists credit only the final passer. Goals credit only the scorer. The build-up play that created the opportunity goes unmeasured.
- Small sample sizes. A striker might take only 80--100 shots per season. At that volume, random variation can easily swing a goal tally by five or more, making season-to-season comparisons unreliable.
- Selection bias. Players who attempt difficult actions (long-range shots, risky dribbles) are penalized in percentage-based statistics, while conservative players are rewarded.
Let us quantify the noise problem with a short simulation.
import numpy as np
# Simulate a striker with true finishing skill = 0.15 goals per shot
np.random.seed(42)
true_conversion = 0.15
shots_per_season = 100
n_simulations = 10_000
goals = np.random.binomial(n=shots_per_season, p=true_conversion, size=n_simulations)
print(f"True conversion rate: {true_conversion:.1%}")
print(f"Simulated goal range (95% interval): {np.percentile(goals, 2.5):.0f} - {np.percentile(goals, 97.5):.0f}")
print(f"That's a conversion rate range of "
f"{np.percentile(goals, 2.5)/shots_per_season:.1%} - "
f"{np.percentile(goals, 97.5)/shots_per_season:.1%}")
Even for a player whose true conversion rate is exactly 15%, random variation alone produces seasons ranging from roughly 8 to 23 goals. This is the fundamental challenge: traditional statistics mix signal (the player's actual ability) with noise (luck, matchday circumstances, sample size).
5.1.8 The Data Revolution
The arrival of event-level data from providers such as Opta (now Stats Perform), StatsBomb, and Wyscout in the 2010s transformed what was possible. Instead of knowing merely that a shot occurred, analysts could now see the shot's location in $(x, y)$ coordinates, the body part used, whether it followed a cross or a dribble, and whether the shooter was under pressure.
This granularity made it possible to move beyond box-score statistics and build metrics that account for context, weight events by their value, and distribute credit more fairly. The rest of this chapter shows you how.
5.2 The Philosophy of Metric Design
5.2.1 What Makes a Good Metric?
Before constructing any metric, we should ask: what properties should it have? Drawing on measurement theory from psychometrics and sports science, we can identify five desirable properties.
- Validity. The metric measures what it claims to measure. If we say a metric captures "defensive contribution," it should correlate with observable defensive outcomes, not merely with the number of minutes played.
- Reliability. The metric produces consistent results under similar conditions. If a player's rating fluctuates wildly from match to match despite consistent performances, the metric is unreliable.
- Discrimination. The metric separates players or teams who are genuinely different. If every central midfielder scores between 0.45 and 0.55 on a metric, it provides little useful information.
- Interpretability. Stakeholders can understand what the number means. A metric expressed in "goals added per 90 minutes" is more interpretable than one expressed in abstract units.
- Actionability. The metric points toward decisions. Knowing that a player creates 0.3 expected assists per 90 is actionable if it helps you decide whether to sign them. A metric that is interesting but not tied to any decision is a curiosity, not a tool.
Intuition: Think of a metric as a lens. A good lens brings the scene into sharp focus; a bad lens introduces distortion. The five properties above are criteria for lens quality.
5.2.2 The Signal-to-Noise Framework
Every observed statistic can be decomposed conceptually:
$$\text{Observed Value} = \text{True Talent} + \text{Context Effects} + \text{Random Noise}$$
The goal of metric design is to maximize the signal (true talent) relative to the noise (randomness and context). We can do this in three ways:
- Increase sample size to average out noise.
- Apply context adjustments to remove systematic biases.
- Choose the right unit of measurement (per 90, per possession, per touch) to normalize for opportunity.
We will address each of these strategies in the sections that follow.
5.2.3 Descriptive vs. Predictive vs. Prescriptive Metrics
It is useful to classify metrics by their purpose:
- Descriptive metrics summarize what happened. Total goals, pass maps, and shot charts are descriptive. They answer: what occurred?
- Predictive metrics forecast what will happen. Expected goals (xG), for instance, estimates the probability that a shot will be scored based on historical data. Predictive metrics answer: what is likely to happen if conditions remain similar?
- Prescriptive metrics recommend actions. A recruitment model that outputs "sign this player because they will add X goals to your attack" is prescriptive. These metrics answer: what should we do?
Most of the metrics in this textbook are descriptive or predictive. Prescriptive metrics require combining multiple predictive metrics with a decision-making framework --- a topic we address in Part IV.
Best Practice: Always clarify the purpose of a metric before using it. A descriptive metric (like total distance run) should not be used as a predictive metric (to forecast future performance) without validating that relationship first.
5.2.4 Levels of Aggregation
Soccer metrics can be computed at many levels:
| Level | Example | Use Case |
|---|---|---|
| Event | xG of a single shot | Shot quality assessment |
| Sequence | xG chain value of a possession | Build-up evaluation |
| Match | Team xG difference | Post-match analysis |
| Multi-match | Rolling 10-game xG | Form tracking |
| Season | Per-90 stats over a full season | Player evaluation |
| Career | Career xG over-/under-performance | Talent identification |
The appropriate level depends on the question being asked. Scouting decisions typically require season-level data; in-game tactical adjustments need event or sequence-level data.
5.2.5 The Importance of a Clear Unit of Measurement
A metric's unit of measurement determines how it is interpreted and compared. Units should be natural and familiar whenever possible. "Goals added" is more intuitive than "standard deviations above the mean of weighted expected threat contribution." Even when a metric's internal computation is complex, the output should be translated into units that stakeholders understand.
Consider the difference between these two statements:
- "This player's xT contribution is 0.47 sigma above the positional mean."
- "This player's ball progression adds approximately 2.1 goals of value per season compared to an average midfielder."
The second statement is harder to compute but vastly easier for a coach or director to interpret and act upon. The translation from abstract units to goal-equivalents is a critical step in metric communication.
Best Practice: When designing a new metric, always ask: "Can I express this in terms of goals, points, or wins?" If the answer is yes, do so. If the answer is no, provide a clear reference group (league average, positional percentile) so that the number has context.
5.3 Rate Statistics vs. Counting Statistics
5.3.1 Definitions
A counting statistic accumulates over time: goals, assists, tackles, interceptions. A player who plays more minutes accumulates more.
A rate statistic normalizes a count by some denominator: goals per 90 minutes, tackles per possession, pass completion percentage. Rate statistics attempt to measure intensity or efficiency rather than volume.
Both types have legitimate uses, but confusing them leads to analytical errors.
5.3.2 The Per-90 Convention
The most common normalization in soccer analytics is per 90 minutes. If a player scores 10 goals in 2,000 minutes, their per-90 rate is:
$$\text{Goals per 90} = \frac{10}{2000} \times 90 = 0.45$$
Per-90 rates allow us to compare a starter who plays 3,000 minutes with a substitute who plays 900 minutes. Without normalization, the starter's raw totals would dominate any comparison.
However, per-90 rates introduce their own issues:
import pandas as pd
# Example: comparing two strikers
data = {
"Player": ["Striker A", "Striker B"],
"Goals": [12, 3],
"Minutes": [2700, 450],
}
df = pd.DataFrame(data)
df["Goals_per_90"] = (df["Goals"] / df["Minutes"]) * 90
print(df.to_string(index=False))
# Striker B has a higher per-90 rate (0.60 vs 0.40),
# but from a tiny sample of 450 minutes (5 full matches).
Common Pitfall: Per-90 rates for players with fewer than ~900 minutes (roughly 10 full matches) are extremely unreliable. Always report the sample size alongside any rate statistic. Many analysts use a minimum threshold of 500--1,000 minutes before computing per-90 rates.
5.3.3 Sample Size and the Small-Sample Trap
The per-90 convention creates a particularly dangerous trap when comparing players with vastly different playing time. A substitute who has played 200 minutes and scored 3 goals has a goals-per-90 rate of 1.35 --- a number that would be extraordinary if sustained over a full season. But the 95% confidence interval around that rate is enormous. Using the Poisson assumption:
$$\text{95\% CI} = \frac{90}{200} \times \left[\tfrac{1}{2}\chi^2_{0.025}(2k),\; \tfrac{1}{2}\chi^2_{0.975}(2k + 2)\right] = \frac{90}{200} \times [0.62,\; 8.77] \approx [0.28,\; 3.95]$$
where $k = 3$ is the observed goal count and the bounds come from the exact (Garwood) Poisson interval.
This interval spans from "below average" to "greatest striker in history." Reporting the point estimate of 1.35 without the confidence interval is misleading at best and irresponsible at worst.
A useful rule of thumb: the coefficient of variation of a Poisson count $k$ is approximately $1/\sqrt{k}$. For a player with 3 goals, the coefficient of variation is $1/\sqrt{3} \approx 58\%$. For a player with 20 goals, it drops to $1/\sqrt{20} \approx 22\%$. You need a substantial number of events before per-90 rates become trustworthy.
import numpy as np
from scipy.stats import chi2
def per90_confidence_interval(
    events: int,
    minutes: int,
    confidence: float = 0.95,
) -> tuple[float, float, float]:
    """Compute per-90 rate with an exact (Garwood) Poisson confidence interval.
    Args:
        events: Number of events observed.
        minutes: Total minutes played.
        confidence: Confidence level (default 0.95).
    Returns:
        Tuple of (lower_bound, point_estimate, upper_bound) per 90.
    """
    alpha = 1 - confidence
    # Exact Poisson interval for the underlying count, via the chi-square relationship
    lower_count = 0.5 * chi2.ppf(alpha / 2, 2 * events) if events > 0 else 0.0
    upper_count = 0.5 * chi2.ppf(1 - alpha / 2, 2 * (events + 1))
    rate = events / minutes * 90
    lower = lower_count / minutes * 90
    upper = upper_count / minutes * 90
    return lower, rate, upper
# Example
lo, mid, hi = per90_confidence_interval(events=3, minutes=200)
print(f"Per-90 rate: {mid:.2f} (95% CI: [{lo:.2f}, {hi:.2f}])")
Intuition: Would you bet your career on the result of flipping a coin three times? That is essentially what you are doing when you draw conclusions from a player's per-90 rate based on 200 minutes of data. The more events you observe, the more confident you can be that the rate reflects genuine ability rather than luck.
5.3.4 Alternative Denominators
Per-90 is not always the best denominator. Consider these alternatives:
| Denominator | When to Use | Example |
|---|---|---|
| Per 90 minutes | General player comparison | Goals per 90 |
| Per touch | Evaluating involvement efficiency | Tackles per touch |
| Per possession | Measuring team actions in context | Shots per team possession |
| Per pass received | Assessing a player's on-ball impact | Progressive carries per pass received |
| Per match | When minutes are roughly equal | Clean sheets per match |
The choice of denominator should reflect the opportunity a player has to perform the action. A wide midfielder who rarely tackles because their team dominates possession should not be penalized in a tackles-per-90 metric; a tackles-per-defensive-action metric would be fairer.
For example, when evaluating a centre-back's aerial ability, headers won per 90 is a poor denominator because it does not account for how many aerial duels the player was involved in. Headers won as a percentage of aerial duels is more informative, but even that requires a minimum sample of aerial duels (typically 30-50) before the percentage becomes reliable.
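To make the last point concrete, the short sketch below computes aerial win percentage only where the sample of duels clears a minimum threshold; the column names and values are invented for illustration.
import pandas as pd
# Invented aerial-duel data for three centre-backs
df = pd.DataFrame({
    "player": ["CB One", "CB Two", "CB Three"],
    "aerials_won": [42, 9, 61],
    "aerials_contested": [70, 12, 98],
})
MIN_DUELS = 30  # below this, win percentages are too noisy to trust
df["aerial_win_pct"] = (df["aerials_won"] / df["aerials_contested"] * 100).round(1)
df["sufficient_sample"] = df["aerials_contested"] >= MIN_DUELS
print(df.to_string(index=False))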
5.3.5 When Counting Statistics Are Preferable
Rate statistics are not always superior. Consider these scenarios:
- Team-level outcomes. A team's total points, total goals scored, and total goals conceded over a season determine league position. Here, volume matters.
- Workload and durability. Knowing that a player covered 340 km over 38 matches is more relevant for fitness management than knowing their per-90 distance.
- Threshold effects. The Golden Boot goes to the player who scores the most goals, not the one with the best conversion rate. In contexts where absolute volume determines the reward, counting statistics are the right tool.
- Squad contribution. A substitute who contributes 0.8 goals per 90 in 500 minutes adds fewer actual goals than a starter with 0.4 goals per 90 over 3,000 minutes. When evaluating total squad value, counting statistics capture real output.
- Durability as a skill. Availability is one of the most underrated attributes in professional soccer. A player who maintains high performance across 3,000 minutes contributes more to a team than one who is brilliant for 800 minutes and injured for the rest. Counting statistics implicitly reward durability.
Intuition: Counting statistics answer "how much did this player produce?" Rate statistics answer "how efficiently did this player produce?" Both questions matter --- the key is knowing which one you are trying to answer.
5.3.6 Combining Rate and Volume
A useful visualization is the rate-volume scatter plot, which places a counting statistic on one axis and the corresponding rate on the other. Players in the top-right quadrant are both prolific and efficient --- the ideal combination.
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(7)
n = 50
minutes = np.random.uniform(800, 3400, n)
true_rate = np.random.normal(0.35, 0.12, n).clip(0.05, 0.80)
goals = np.random.poisson(true_rate * minutes / 90)
per90 = goals / minutes * 90
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(goals, per90, alpha=0.6, edgecolors="k", linewidths=0.5)
ax.axhline(np.median(per90), color="gray", linestyle="--", linewidth=0.8)
ax.axvline(np.median(goals), color="gray", linestyle="--", linewidth=0.8)
ax.set_xlabel("Total Goals (Counting)", fontsize=12)
ax.set_ylabel("Goals per 90 (Rate)", fontsize=12)
ax.set_title("Rate vs. Volume Scatter Plot", fontsize=14)
ax.annotate("High volume,\nhigh rate", xy=(0.85, 0.85), xycoords="axes fraction",
fontsize=10, ha="center", color="green")
ax.annotate("Low volume,\nlow rate", xy=(0.15, 0.15), xycoords="axes fraction",
fontsize=10, ha="center", color="red")
plt.tight_layout()
plt.savefig("rate_vs_volume.png", dpi=150)
plt.show()
This four-quadrant framework is one of the simplest yet most effective tools in the soccer analyst's visualization arsenal. When presenting to scouts or coaches, the rate-volume scatter immediately highlights four player archetypes: the prolific elite (high volume, high rate), the efficient specialist (low volume, high rate), the high-volume grinder (high volume, low rate), and the underperformer (low volume, low rate). Each archetype has different recruitment implications.
5.4 Context and Adjustment Factors
5.4.1 Why Context Matters
Two players with identical per-90 statistics may be performing at very different levels if their contexts differ. A defender who records 3.0 tackles per 90 in a team that faces 40 attacks per match is less remarkable than one who records 3.0 tackles per 90 in a team that faces only 25 attacks. The first has more opportunities to tackle.
Context adjustments attempt to remove these environmental effects so that comparisons are fairer. The most common adjustments are:
- Opponent adjustment
- Game-state adjustment
- Possession adjustment
- Venue adjustment (home/away)
- League and competition adjustment
5.4.2 Opponent Adjustment
Not all opponents are equal. To compare a striker who plays in the Premier League with one who plays in the Eredivisie, we need to account for the difference in defensive quality.
A simple opponent adjustment works as follows. Suppose a team scores 2 goals against an opponent that typically concedes 1.2 goals per match. The league average is 1.35 goals conceded per match. The opponent-adjusted goals scored can be estimated as:
$$\text{Adjusted Goals} = \text{Raw Goals} \times \frac{\text{League Average Conceded}}{\text{Opponent Average Conceded}}$$
$$\text{Adjusted Goals} = 2 \times \frac{1.35}{1.20} = 2.25$$
This upward adjustment reflects the fact that scoring 2 against a strong defence is more impressive than average. Conversely, scoring 2 against a team that concedes 2.0 per match would be adjusted downward:
$$\text{Adjusted Goals} = 2 \times \frac{1.35}{2.00} = 1.35$$
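As a minimal sketch, the helper below implements this multiplicative adjustment directly; the function name and arguments are our own, not taken from any library.
def opponent_adjusted_goals(
    raw_goals: float,
    league_avg_conceded: float,
    opponent_avg_conceded: float,
) -> float:
    """Scale raw goals by opponent defensive strength relative to the league."""
    return raw_goals * (league_avg_conceded / opponent_avg_conceded)
# Worked examples from the text
print(f"{opponent_adjusted_goals(2, 1.35, 1.20):.2f}")  # 2.25 against a strong defence
print(f"{opponent_adjusted_goals(2, 1.35, 2.00):.2f}")  # 1.35 against a leaky defence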
Advanced: More sophisticated opponent adjustments use ridge regression or hierarchical Bayesian models to estimate opponent strength, accounting for the fact that an opponent's goals-conceded record is itself influenced by the quality of teams they have faced. This avoids circular reasoning.
5.4.3 Game-State Adjustment
A team that is winning behaves differently from one that is losing. When ahead, teams often sit deeper and absorb pressure, leading to fewer shots but more defensive actions. When behind, they push forward, generating more chances but leaving space at the back.
Game state is typically defined by the score differential at the time of each event:
| Game State | Description |
|---|---|
| GS = +2 or more | Winning comfortably |
| GS = +1 | Winning by one goal |
| GS = 0 | Level |
| GS = -1 | Losing by one goal |
| GS = -2 or less | Losing badly |
A player's statistics can be split by game state and then re-weighted to a neutral baseline. For example, if a team spends 60% of its time winning (because it is a strong team), its defenders will accumulate fewer tackles than they would in a neutral game state. Adjusting for game state gives a fairer picture of defensive workload.
import pandas as pd
import numpy as np
# Simulated game-state data for a defender
np.random.seed(21)
game_states = [-2, -1, 0, 1, 2]
minutes_in_state = [100, 350, 600, 700, 250] # Strong team: more time ahead
tackles_in_state = [8, 25, 35, 30, 8]
df_gs = pd.DataFrame({
"game_state": game_states,
"minutes": minutes_in_state,
"tackles": tackles_in_state,
})
df_gs["tackles_per_90"] = df_gs["tackles"] / df_gs["minutes"] * 90
# League average distribution of game-state minutes (hypothetical)
league_avg_pct = [0.10, 0.20, 0.35, 0.20, 0.15]
total_minutes = sum(minutes_in_state)
# Re-weight tackles to neutral game-state distribution
df_gs["league_avg_minutes"] = [p * total_minutes for p in league_avg_pct]
df_gs["adjusted_tackles"] = (
df_gs["tackles_per_90"] / 90 * df_gs["league_avg_minutes"]
)
raw_total = df_gs["tackles"].sum()
adj_total = df_gs["adjusted_tackles"].sum()
print(f"Raw tackles: {raw_total}")
print(f"Game-state adjusted tackles: {adj_total:.1f}")
print(f"Adjustment factor: {adj_total / raw_total:.2f}")
The game-state adjustment is particularly important for evaluating players on dominant teams. A centre-back at Manchester City or Bayern Munich spends a disproportionate amount of time in positive game states, which reduces defensive workload. Without adjustment, such a player may appear to make fewer tackles and interceptions than an equivalent player at a mid-table club, when in fact the difference is entirely driven by context.
5.4.4 Possession Adjustment
A team that holds 65% of possession will naturally accumulate more passes, touches, and progressive carries than a team with 35% possession. Conversely, the low-possession team will have more opportunities for tackles, interceptions, and blocks.
Possession adjustment re-scales statistics to a hypothetical 50-50 possession split. For offensive actions:
$$\text{Possession-Adjusted} = \text{Raw Value} \times \frac{50\%}{\text{Team Possession \%}}$$
For defensive actions, the adjustment uses the opponent's possession:
$$\text{Possession-Adjusted (Defensive)} = \text{Raw Value} \times \frac{50\%}{1 - \text{Team Possession \%}}$$
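A small sketch covering both cases; the function and its arguments are illustrative, and possession is expressed as a proportion (0.65 rather than 65%).
def possession_adjusted(
    raw_value: float,
    team_possession: float,
    defensive: bool = False,
) -> float:
    """Re-scale a statistic to a hypothetical 50-50 possession split.
    Offensive actions are scaled by the team's own possession share;
    defensive actions by the opponent's share (1 - team_possession).
    """
    share = (1 - team_possession) if defensive else team_possession
    return raw_value * (0.5 / share)
# A defender making 3.0 tackles per 90 in a 65%-possession team
print(f"{possession_adjusted(3.0, 0.65, defensive=True):.2f} possession-adjusted tackles per 90")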
Real-World Application: Possession adjustment is especially important when comparing players across leagues. The average possession split in La Liga is historically more skewed (because of Barcelona and Real Madrid) than in the Premier League, where possession tends to be more evenly distributed.
5.4.5 Venue Adjustment
Home advantage is well-documented in soccer. Teams score more goals, win more matches, and commit fewer fouls at home. While the magnitude of home advantage has decreased in recent years (and dropped sharply during the COVID-19 behind-closed-doors period), it remains a relevant factor.
A simple venue adjustment applies a multiplicative factor based on historical home/away scoring rates:
$$\text{Home Adjustment Factor} = \frac{\text{League Goals per Team per Match}}{\text{Home Goals per Team per Match}} \approx 0.87$$
$$\text{Away Adjustment Factor} = \frac{\text{League Goals per Team per Match}}{\text{Away Goals per Team per Match}} \approx 1.15$$
These factors vary by league and era. Analysts should compute them from the specific dataset they are working with.
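The sketch below estimates both factors from a match-level table; the column names home_goals and away_goals are assumptions about how the data is laid out, and the six-match sample is purely illustrative.
import pandas as pd
def venue_adjustment_factors(matches: pd.DataFrame) -> tuple[float, float]:
    """Estimate home and away adjustment factors from match results.
    Expects one row per match with 'home_goals' and 'away_goals' columns.
    """
    home_rate = matches["home_goals"].mean()    # goals per team per match at home
    away_rate = matches["away_goals"].mean()    # goals per team per match away
    league_rate = (home_rate + away_rate) / 2   # overall per-team scoring rate
    return league_rate / home_rate, league_rate / away_rate
matches = pd.DataFrame({
    "home_goals": [2, 1, 3, 0, 2, 1],
    "away_goals": [1, 1, 0, 2, 1, 0],
})
home_factor, away_factor = venue_adjustment_factors(matches)
print(f"Home factor: {home_factor:.2f}, Away factor: {away_factor:.2f}")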
5.4.6 League and Competition Adjustment
Comparing a player's output in the Dutch Eredivisie to a player's output in the English Premier League requires an adjustment for league quality. A forward who scores 0.6 goals per 90 in the Eredivisie will not necessarily maintain that rate in a stronger league.
League adjustment factors can be estimated by studying players who transfer between leagues. If, on average, a striker moving from the Eredivisie to the Premier League sees their goals-per-90 rate decline by 30%, then a rough league adjustment factor is 0.70. More sophisticated approaches use transfer-based models, international performance comparisons, or Elo-style rating systems for leagues.
Common Pitfall: League adjustment is one of the hardest adjustments to get right because the sample of cross-league transfers is limited, and players who transfer are not representative of the average player in their origin league (they tend to be the best). Selection bias makes naive transfer-based adjustments unreliable without careful modeling.
5.4.7 Combining Multiple Adjustments
In practice, multiple adjustments are applied simultaneously. Care must be taken to avoid over-adjustment --- removing so much context that meaningful signal is lost --- or double-counting --- adjusting for the same factor twice through different mechanisms.
A practical approach is to build a regression model with the raw statistic as the dependent variable and context factors as independent variables:
$$Y_i = \beta_0 + \beta_1 \cdot \text{Opponent}_i + \beta_2 \cdot \text{GameState}_i + \beta_3 \cdot \text{Possession}_i + \beta_4 \cdot \text{Venue}_i + \epsilon_i$$
The residuals $\epsilon_i$ represent the context-free component of the player's performance. Alternatively, one can compute expected values under average context and compare to observed values.
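Here is a compact sketch of the residual approach on simulated data, using ordinary least squares from NumPy; the features, coefficients, and noise levels are invented for illustration, and a real implementation would fit the model to the club's own event data.
import numpy as np
np.random.seed(3)
n = 500  # simulated player-match observations
# Invented context features and a raw per-match statistic
opponent_strength = np.random.normal(0, 1, n)
game_state = np.random.choice([-2, -1, 0, 1, 2], size=n).astype(float)
possession = np.random.uniform(0.35, 0.65, n)
venue = np.random.choice([0.0, 1.0], size=n)   # 1 = home
skill = np.random.normal(0, 0.5, n)            # the signal we want to recover
raw_stat = (
    2.0 + 0.4 * opponent_strength - 0.3 * game_state
    - 2.0 * (possession - 0.5) + 0.2 * venue + skill
)
# Ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), opponent_strength, game_state, possession, venue])
beta, *_ = np.linalg.lstsq(X, raw_stat, rcond=None)
# Residuals approximate the context-free component of performance
residuals = raw_stat - X @ beta
print(f"Correlation of residuals with simulated skill: {np.corrcoef(residuals, skill)[0, 1]:.2f}")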
Best Practice: When applying context adjustments, always report both the raw and adjusted values. This transparency allows stakeholders to understand the magnitude of the adjustment and build trust in the methodology.
5.5 Metric Validation Frameworks
5.5.1 The Three Pillars of Validation
How do we know if a metric is any good? Intuition and face validity are starting points, but rigorous validation requires systematic testing. We propose a three-pillar framework:
- Stability (Reliability): Does the metric produce consistent values for the same player or team over time?
- Discrimination: Does the metric separate genuinely different players or teams?
- Predictive Power: Does the metric predict future outcomes better than simpler alternatives?
Each pillar is necessary but not sufficient. A metric can be stable but not predictive (e.g., height), discriminating but not stable (e.g., goals in a 5-match window), or predictive but not discriminating (e.g., a constant that predicts the league average).
5.5.2 Stability Analysis
The most common stability test is split-half reliability. We split each player's data into two halves --- for example, odd-numbered matches and even-numbered matches --- and compute the metric for each half separately. If the metric is stable, the two half-season values should correlate strongly across players.
$$r_{\text{split-half}} = \text{Corr}(X_{\text{odd}}, X_{\text{even}})$$
We can also use the Spearman-Brown prophecy formula to estimate the reliability of the full-length metric from the split-half correlation:
$$r_{\text{full}} = \frac{2 \cdot r_{\text{split-half}}}{1 + r_{\text{split-half}}}$$
Typical stability benchmarks for soccer metrics:
| Stability Level | Correlation | Interpretation |
|---|---|---|
| High | $r > 0.7$ | Metric reflects a persistent skill |
| Moderate | $0.4 < r \leq 0.7$ | Mix of skill and randomness |
| Low | $r \leq 0.4$ | Dominated by randomness |
import numpy as np
from scipy.stats import pearsonr
def split_half_reliability(
    match_values: np.ndarray,
) -> tuple[float, float]:
    """Compute split-half reliability of a per-match metric.
    Args:
        match_values: Array of shape (n_players, n_matches) holding each
            player's per-match metric values.
    Returns:
        Tuple of (split-half r, Spearman-Brown full r).
    """
    odd_half = match_values[:, ::2].mean(axis=1)    # each player's odd-numbered matches
    even_half = match_values[:, 1::2].mean(axis=1)  # each player's even-numbered matches
    r_half, _ = pearsonr(odd_half, even_half)
    r_full = (2 * r_half) / (1 + r_half)
    return r_half, r_full
# Example: pass completion % has moderate-to-high stability
np.random.seed(12)
n_players, n_matches = 100, 38
true_skill = np.random.normal(0.82, 0.04, size=(n_players, 1))  # persistent skill differences
match_pass_pct = np.random.normal(true_skill, 0.05, size=(n_players, n_matches))
r_half, r_full = split_half_reliability(match_pass_pct)
print(f"Split-half r: {r_half:.3f}")
print(f"Spearman-Brown r: {r_full:.3f}")
5.5.3 Discrimination Analysis
A metric with high discrimination means there is substantial between-player variance relative to within-player variance. The intraclass correlation coefficient (ICC) formalizes this idea:
$$\text{ICC} = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}}$$
An ICC near 1.0 means most variance is between players (good discrimination). An ICC near 0 means most variance is within players (poor discrimination --- the metric fluctuates randomly match to match, and players are not meaningfully separated).
We can also visualize discrimination with a violin plot that shows the distribution of per-match values for each player. If the violins overlap heavily, the metric does not discriminate well.
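A minimal sketch of a one-way ICC computed from a players-by-matches array of simulated values follows; dedicated mixed-model packages are more robust in practice, but the ANOVA-style variance decomposition below conveys the idea.
import numpy as np
def intraclass_correlation(values: np.ndarray) -> float:
    """One-way ICC from an array of shape (n_players, n_matches)."""
    n_players, n_matches = values.shape
    player_means = values.mean(axis=1)
    grand_mean = values.mean()
    # Mean squares between and within players (one-way ANOVA decomposition)
    ms_between = n_matches * np.sum((player_means - grand_mean) ** 2) / (n_players - 1)
    ms_within = np.sum((values - player_means[:, None]) ** 2) / (n_players * (n_matches - 1))
    return (ms_between - ms_within) / (ms_between + (n_matches - 1) * ms_within)
# Simulated example: persistent skill plus match-to-match noise
np.random.seed(5)
true_skill = np.random.normal(0.82, 0.04, size=(100, 1))
per_match = np.random.normal(true_skill, 0.05, size=(100, 38))
print(f"ICC: {intraclass_correlation(per_match):.2f}")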
5.5.4 Predictive Validity
The ultimate test of a metric is whether it predicts outcomes we care about. For team metrics, the standard test is whether a metric computed over the first half of the season predicts results in the second half. For player metrics, we ask whether this season's metric predicts next season's goals, assists, or other outcomes.
A simple protocol:
- Compute the metric for each team/player over games 1--19.
- Compute the outcome variable over games 20--38.
- Measure the correlation or $R^2$ between the metric and the outcome.
Compare the predictive metric against a naive baseline (e.g., raw goal difference in the first half predicting second-half results). A good metric should outperform the baseline.
$$R^2_{\text{metric}} > R^2_{\text{baseline}}$$
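The following sketch runs this comparison on simulated seasons: a latent team strength drives results in both halves, the candidate metric observes that strength with less noise than the raw-outcome baseline, and we check which first-half number better predicts the second half. All values and noise levels are invented for illustration.
import numpy as np
np.random.seed(8)
n_seasons, n_teams = 1_000, 20
r2_metric, r2_baseline = [], []
for _ in range(n_seasons):
    strength = np.random.normal(0, 1, n_teams)                   # latent team quality
    metric_h1 = strength + np.random.normal(0, 0.4, n_teams)     # e.g., first-half xG difference
    baseline_h1 = strength + np.random.normal(0, 1.0, n_teams)   # e.g., first-half goal difference
    outcome_h2 = strength + np.random.normal(0, 0.8, n_teams)    # e.g., second-half points
    r2_metric.append(np.corrcoef(metric_h1, outcome_h2)[0, 1] ** 2)
    r2_baseline.append(np.corrcoef(baseline_h1, outcome_h2)[0, 1] ** 2)
print(f"Mean R^2, metric:   {np.mean(r2_metric):.2f}")
print(f"Mean R^2, baseline: {np.mean(r2_baseline):.2f}")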
Real-World Application: When StatsBomb and other data providers introduced expected goals (xG), its acceptance was accelerated by validation studies showing that first-half xG difference predicted second-half points better than actual goal difference. This demonstrated that xG stripped away noise and captured underlying quality more effectively.
5.5.5 Sample Size and Stabilization Points
A critical practical question is: how many minutes (or events) does a player need before we can trust a metric? The stabilization point is the sample size at which the metric's reliability reaches 0.5 --- the point where signal and noise contribute equally.
The stabilization point $n^*$ can be estimated using the formula:
$$n^* = \frac{1 - \text{ICC}}{\text{ICC}}$$
where ICC is the intraclass correlation coefficient from Section 5.5.3, computed here on single-season values (so $n^*$ is measured in seasons). If ICC = 0.25, then $n^* = 3$, meaning the metric stabilizes after about 3 seasons. If ICC = 0.70, then $n^* \approx 0.43$ seasons --- it stabilizes within a single season.
Common stabilization points for soccer metrics:
| Metric | Approximate Stabilization (90-min matches) |
|---|---|
| Pass completion % | 6--8 matches |
| Shot volume (shots per 90) | 10--12 matches |
| Tackle rate | 8--10 matches |
| Goal conversion rate | 35--40+ matches |
| xG per shot | 15--20 matches |
| Save percentage | 30--40+ matches |
Notice that goal-related statistics take much longer to stabilize than process statistics like passing and tackling. This is because goals are rare events with high variance.
Intuition: Metrics based on frequent events (passes, touches) stabilize quickly because there are many data points per match. Metrics based on rare events (goals, assists) stabilize slowly because each match provides little information.
5.5.6 Putting It All Together: A Validation Checklist
When evaluating any soccer metric, work through this checklist:
- [ ] Face validity: Does the metric make intuitive sense?
- [ ] Split-half reliability: Is $r > 0.5$?
- [ ] ICC / discrimination: Is ICC $> 0.3$?
- [ ] Predictive validity: Does it predict future outcomes better than a baseline?
- [ ] Sample size: Have you enforced a minimum minutes/events threshold?
- [ ] Context sensitivity: Have you tested whether the metric is confounded by opponent quality, game state, or possession?
- [ ] Comparability: Can the metric be compared across leagues and seasons?
A metric need not score perfectly on every criterion to be useful, but weaknesses should be acknowledged and communicated to stakeholders.
5.6 Percentile Rankings and Z-Scores for Player Comparison
5.6.1 Why Raw Numbers Are Not Enough
When a scout reports that a midfielder completes 5.2 progressive passes per 90, the natural follow-up question is: "Is that good?" Without a reference frame, the number is meaningless. Percentile rankings and Z-scores provide that reference frame by placing a player's output in the context of their peers.
5.6.2 Percentile Rankings
A percentile rank tells you the percentage of players in a reference group who score at or below a given value. If a player's progressive passes per 90 are at the 85th percentile among Premier League central midfielders, it means 85% of central midfielders in the league have a lower or equal rate.
Computing percentile ranks is straightforward:
import numpy as np
from scipy.stats import percentileofscore
def compute_percentile(
player_value: float,
reference_values: np.ndarray,
kind: str = "rank",
) -> float:
"""Compute the percentile rank of a player's value.
Args:
player_value: The player's metric value.
reference_values: Array of metric values from the reference group.
kind: Method for handling ties ('rank', 'weak', 'strict', 'mean').
Returns:
Percentile rank (0-100).
"""
return percentileofscore(reference_values, player_value, kind=kind)
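As a quick usage sketch, with a simulated reference group standing in for real positional peers:
np.random.seed(1)
reference_midfielders = np.random.normal(4.0, 1.2, size=300)  # simulated progressive passes per 90
print(f"Percentile rank: {compute_percentile(5.2, reference_midfielders):.0f}th")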
Best Practice: Always define the reference group carefully. Comparing a centre-back's progressive passing to all players is misleading because forwards naturally complete fewer progressive passes from deep positions. Position-specific reference groups (e.g., "Premier League centre-backs with 900+ minutes") produce more meaningful percentiles.
5.6.3 Choosing the Reference Group
The choice of reference group has a dramatic impact on percentile rankings. Consider a midfielder who averages 2.5 tackles per 90:
| Reference Group | Percentile |
|---|---|
| All outfield players, all leagues | 62nd |
| All midfielders, top 5 leagues | 55th |
| Central midfielders, Premier League | 71st |
| Defensive midfielders, Premier League | 38th |
The same raw number yields percentiles ranging from 38th to 71st depending on the comparison group. This is not a flaw --- it reflects the fact that "good" is context-dependent. The analyst's job is to choose the reference group that matches the question being asked.
5.6.4 Z-Scores
A Z-score standardizes a value by expressing it in terms of standard deviations from the mean:
$$Z = \frac{x - \mu}{\sigma}$$
where $x$ is the player's value, $\mu$ is the reference group mean, and $\sigma$ is the reference group standard deviation. A Z-score of +1.5 means the player is 1.5 standard deviations above the reference group average.
Z-scores are useful because they put different metrics on a common scale. Goals per 90, tackles per 90, and pass completion percentage all have different units and ranges, but their Z-scores are all in "standard deviation" units, making them directly comparable.
import numpy as np
def compute_zscore(
player_value: float,
reference_values: np.ndarray,
) -> float:
"""Compute the Z-score of a player's value relative to a reference group."""
mu = np.mean(reference_values)
sigma = np.std(reference_values, ddof=1)
return (player_value - mu) / sigma
Common Pitfall: Z-scores assume a roughly normal (bell-shaped) distribution. Many soccer metrics are skewed --- for example, goals per 90 has a long right tail because most players score rarely while a few score frequently. For skewed distributions, percentile ranks are more robust than Z-scores. Alternatively, you can log-transform the metric before computing Z-scores.
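One option for handling such skew, sketched below under the assumption that the metric is strictly non-negative, is to standardize on the log scale; the function name and the small constant eps are our own choices, not an established convention.
import numpy as np
def log_zscore(
    player_value: float,
    reference_values: np.ndarray,
    eps: float = 1e-3,
) -> float:
    """Z-score computed on log-transformed values, for right-skewed metrics."""
    log_ref = np.log(reference_values + eps)
    return (np.log(player_value + eps) - np.mean(log_ref)) / np.std(log_ref, ddof=1)
# Skewed example: goals per 90 for a simulated peer group
np.random.seed(4)
peers = np.random.lognormal(mean=-1.5, sigma=0.6, size=300)
raw_z = (0.55 - peers.mean()) / peers.std(ddof=1)
print(f"Raw Z: {raw_z:.2f}  |  Log-scale Z: {log_zscore(0.55, peers):.2f}")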
5.6.5 Practical Applications of Percentile Profiles
Percentile profiles (also called "pizza plots" or "radar charts") are one of the most popular tools in modern soccer analytics. They display a player's percentile rank on multiple metrics simultaneously, creating a visual fingerprint of their playing style and quality.
A well-designed percentile profile groups metrics into categories (e.g., attacking, passing, defending, physical) and uses color coding to distinguish them. The reference group and minimum minutes threshold should always be stated clearly on the visualization.
import matplotlib.pyplot as plt
import numpy as np
def plot_percentile_profile(
metrics: list[str],
percentiles: list[float],
player_name: str,
reference_group: str,
colors: list[str] | None = None,
) -> None:
"""Plot a horizontal bar chart of percentile rankings."""
if colors is None:
colors = [
"#2ecc71" if p >= 75 else "#f39c12" if p >= 50 else "#e74c3c"
for p in percentiles
]
fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.barh(metrics, percentiles, color=colors, edgecolor="white", height=0.6)
ax.set_xlim(0, 100)
ax.set_xlabel("Percentile Rank", fontsize=11)
ax.set_title(f"{player_name} — Percentile Profile\n({reference_group})", fontsize=13)
for bar, pct in zip(bars, percentiles):
ax.text(bar.get_width() + 2, bar.get_y() + bar.get_height() / 2,
f"{pct:.0f}th", va="center", fontsize=10)
plt.tight_layout()
plt.show()
5.7 Composite Metrics and Player Ratings
5.7.1 Why Combine Metrics?
No single metric captures everything a player does. A striker might be elite at finishing but poor at pressing; a midfielder might be an outstanding passer but a liability in defensive transitions. Composite metrics attempt to combine multiple individual metrics into a single number that represents a player's overall contribution.
5.7.2 Weighted Sums
The simplest composite metric is a weighted sum of standardized component metrics:
$$\text{Rating} = w_1 Z_1 + w_2 Z_2 + \cdots + w_k Z_k$$
where $Z_i$ is the Z-score of the $i$-th metric and $w_i$ is the weight assigned to it. The weights reflect the relative importance of each component. For a striker, finishing metrics might receive higher weights; for a centre-back, defensive metrics might dominate.
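A minimal sketch of a weighted sum over Z-scored components is shown below; the player values, metric names, and weights are illustrative, not a recommended rating.
import numpy as np
import pandas as pd
# Illustrative per-90 values for three strikers (invented numbers)
df = pd.DataFrame({
    "player": ["Striker A", "Striker B", "Striker C"],
    "npxG_per90": [0.55, 0.40, 0.30],
    "xA_per90": [0.10, 0.25, 0.15],
    "pressures_per90": [12.0, 18.0, 22.0],
})
weights = {"npxG_per90": 0.6, "xA_per90": 0.3, "pressures_per90": 0.1}  # illustrative weights
rating = np.zeros(len(df))
for metric, weight in weights.items():
    z = (df[metric] - df[metric].mean()) / df[metric].std(ddof=1)  # Z-score each component
    rating += weight * z.to_numpy()
df["composite_rating"] = rating.round(2)
print(df[["player", "composite_rating"]].to_string(index=False))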
The challenge is choosing the weights. Common approaches include:
- Expert judgment. A coach or analyst assigns weights based on domain knowledge. This is subjective but can encode important tactical priorities.
- Regression weights. Use a regression model with a valued outcome (e.g., team points, match result) as the dependent variable and the component metrics as predictors. The regression coefficients become the weights.
- Principal Component Analysis (PCA). Let the data determine the weights. PCA finds the linear combination of metrics that captures the most variance. The first principal component often serves as a natural "overall quality" metric.
- Goal-equivalent weights. Express each component in terms of its marginal contribution to goals. For example, if one additional progressive pass per 90 is associated with 0.02 additional goals per 90 (estimated via regression), then the weight for progressive passes can be set to 0.02 "goals per progressive pass."
Real-World Application: The VAEP (Valuing Actions by Estimating Probabilities) framework, developed by Tom Decroos and colleagues at KU Leuven, avoids the weight-selection problem entirely. It values every on-ball action by estimating how much it changed the probability of scoring (or conceding) a goal. The result is a single "goals added" number per player that emerges directly from the model rather than from subjective weight choices.
5.7.3 Pitfalls of Composite Metrics
Composite metrics are powerful but dangerous:
- Loss of information. Reducing a player's multidimensional contribution to a single number necessarily loses nuance. Two players with the same composite rating may have very different profiles.
- Weight sensitivity. Small changes in weights can produce large changes in rankings, especially when players are tightly bunched.
- Correlation between components. If two components are highly correlated (e.g., shots and xG), including both effectively double-counts that dimension. Analysts should check for multicollinearity before combining metrics.
- Positional bias. A composite metric designed for general use may implicitly favor certain positions. Forwards tend to have higher values on goal-related metrics; midfielders on passing metrics. Without position-specific calibration, the composite may not compare players fairly across roles.
Common Pitfall: Never average Z-scores across metrics without considering whether the metrics measure independent skills. If three of your five metrics are variations of passing (pass completion, progressive passes, key passes), your composite is effectively 60% passing and 40% everything else, regardless of what your weights say.
5.7.4 Publicly Available Player Rating Systems
Several well-known player rating systems are used in soccer analytics:
| Rating System | Developer | Key Feature |
|---|---|---|
| VAEP | KU Leuven | Values each action by change in goal probability |
| xG Chain / xG Buildup | Various | Credits all players involved in goal-scoring possessions |
| Goals Added (g+) | American Soccer Analysis | Additive model across action types |
| Player Match Rating | WhoScored / SofaScore | Proprietary weighted sum of event-based features |
Each system has trade-offs. Academic systems like VAEP are transparent and reproducible but require event data access. Commercial systems like WhoScored ratings are widely available but opaque in their methodology.
5.8 Communicating Metrics to Stakeholders
5.8.1 Know Your Audience
The most technically rigorous metric is worthless if decision-makers do not understand or trust it. Soccer analytics operates in a multi-stakeholder environment:
| Stakeholder | Technical Level | Primary Concern |
|---|---|---|
| Head coach | Low to medium | Tactics, match preparation |
| Sporting director | Medium | Recruitment, squad planning |
| Scout | Low to medium | Player identification |
| Owner / board | Low | Financial return, competitiveness |
| Players | Low | Personal improvement |
| Media / fans | Variable | Narrative, entertainment |
Each audience requires a different communication strategy.
5.8.2 Principles of Metric Communication
1. Lead with the question, not the method.
Instead of: "I built a Bayesian hierarchical model that estimates opponent-adjusted expected threat per possession."
Try: "I found three midfielders who consistently create dangerous attacks, even against strong opponents. Here is the evidence."
2. Use natural units.
Express metrics in units people understand: goals, points, wins. "This signing would add approximately 3 goals per season to our attack" is more compelling than "this player has an xG per shot of 0.14, which is 0.03 above the positional average."
3. Provide comparisons.
Numbers in isolation are meaningless. Always compare to a reference group: league average, positional average, the player being replaced. "He creates 0.25 xA per 90, which ranks in the 88th percentile among Premier League wingers" gives the number meaning.
4. Visualize uncertainty.
Do not present point estimates as certainties. Use confidence intervals, ranges, or probabilistic language: "We are 80% confident that his true conversion rate is between 12% and 18%."
5. Tell a story.
Weave metrics into a narrative. Instead of presenting a table of numbers, explain why the numbers matter and what they imply for the decision at hand.
Best Practice: When presenting to coaches, use video clips alongside data. Show the specific passages of play that generated the numbers. This grounds the abstract metric in concrete, observable reality.
5.8.3 Visualization Best Practices
Good visualizations communicate metrics faster and more intuitively than tables.
Radar charts (pizza plots) are popular for player profiles because they show multiple metrics simultaneously. However, they can be misleading if axes are not scaled consistently or if too many variables are included.
Scatter plots are excellent for showing relationships between two metrics (e.g., xG vs. actual goals, or defensive actions vs. opponent quality).
Rolling averages show how a metric evolves over time, helping coaches track form.
Percentile bar charts show where a player ranks on each metric relative to their positional peers. These are intuitive for non-technical audiences.
import matplotlib.pyplot as plt
import numpy as np
# Simple percentile bar chart
metrics = ["Goals/90", "xA/90", "Progressive\nCarries/90",
"Pressures/90", "Aerial\nWin %"]
percentiles = [85, 72, 91, 45, 63]
colors = ["#2ecc71" if p >= 75 else "#f39c12" if p >= 50 else "#e74c3c"
for p in percentiles]
fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.barh(metrics, percentiles, color=colors, edgecolor="white", height=0.6)
ax.set_xlim(0, 100)
ax.set_xlabel("Percentile Rank (vs. Position Peers)", fontsize=11)
ax.set_title("Player Profile: Percentile Rankings", fontsize=13)
for bar, pct in zip(bars, percentiles):
ax.text(bar.get_width() + 2, bar.get_y() + bar.get_height()/2,
f"{pct}th", va="center", fontsize=10)
plt.tight_layout()
plt.savefig("percentile_chart.png", dpi=150)
plt.show()
5.8.4 Common Communication Failures
Over-precision. Reporting "0.3247 xG per 90" implies a level of precision that the data does not support. Round to two significant figures: "0.32 xG per 90."
Context-free rankings. Saying "he is ranked 7th in the league for tackles" without mentioning that the gap between ranks 3 and 15 is tiny can be misleading.
Confusing description with prescription. "His xG was 15 but he scored 20" describes what happened. It does not necessarily prescribe a conclusion ("he is overperforming and will regress") without further analysis of why the overperformance occurred.
Dashboard overload. Presenting 50 metrics on a single screen overwhelms the audience. Focus on the 3--5 metrics most relevant to the decision at hand.
Common Pitfall: Analysts often fall in love with methodological complexity. Remember that the goal is not to impress stakeholders with your model's sophistication but to improve their decisions. If a simpler metric communicates the same insight more clearly, use the simpler metric.
5.8.5 Building Trust Over Time
Trust in analytics is not built in a single presentation. It develops through:
- Transparency: Share your methodology. Explain what the metric captures and what it does not.
- Track record: Keep a record of predictions and recommendations, and honestly assess which ones were right and which were wrong.
- Humility: Acknowledge uncertainty. Say "the data suggests" rather than "the data proves."
- Responsiveness: Answer questions from coaches and scouts promptly and without condescension.
- Integration: Embed metrics into existing workflows rather than asking stakeholders to adopt entirely new processes.
Real-World Application: Liverpool FC's analytics department, led initially by Ian Graham, built trust with manager Jurgen Klopp by starting with small, low-stakes recommendations and gradually expanding the role of data in recruitment and tactics. This incremental approach is more effective than attempting a top-down analytics revolution.
5.8.6 The "So What?" Test
Every metric presentation should pass the "So What?" test. After presenting a finding, imagine a coach asking: "So what should I do about it?" If you cannot answer that question, the metric may be descriptively interesting but not practically useful.
For example:
- Fails the test: "Our xG against is 1.4 per match, which is above the league average." (So what?)
- Passes the test: "We are conceding 0.4 xG per match from crosses into the box, primarily from the left flank. Our left-sided centre-back is being pulled wide too often, leaving space at the back post. Here are three clips showing the pattern, and here is how we can adjust our defensive shape to address it."
The second version connects the metric to a specific tactical problem and a potential solution. That is the difference between data reporting and data-driven decision-making.
5.9 The Metrics Lifecycle: Creation, Validation, Adoption, Iteration
5.9.1 Overview
Metrics are not static objects. They are created, tested, refined, and sometimes abandoned. Understanding the lifecycle of a metric helps analysts avoid common mistakes and build metrics that endure.
5.9.2 Stage 1: Creation
A new metric begins with a question: "How can we measure X?" The question might come from a coach ("I want to know which of my midfielders is best at progressing the ball"), a scout ("I need to compare centre-backs across different leagues"), or an analyst's own curiosity.
The creation stage involves:
- Defining the concept being measured (e.g., "ball progression").
- Choosing the data inputs (e.g., progressive passes, progressive carries, progressive received passes).
- Designing the computation (e.g., weighted sum, model-based estimate, simple count); a minimal sketch follows this list.
- Selecting the unit of measurement (e.g., per 90, per possession, total).
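The snippet below sketches these creation-stage choices for a hypothetical weighted-sum ball-progression metric expressed per 90; the column names and weights are assumptions for illustration, not established definitions.

```python
import pandas as pd

# Hypothetical season totals for three players (all values and column names are
# assumptions for illustration, not a specific provider's definitions).
players = pd.DataFrame({
    "player": ["A", "B", "C"],
    "minutes": [2700, 1980, 2430],
    "progressive_passes": [180, 95, 210],
    "progressive_carries": [60, 140, 85],
    "progressive_receptions": [75, 160, 90],
})

weights = {"progressive_passes": 1.0,       # computation: a simple weighted sum
           "progressive_carries": 1.0,
           "progressive_receptions": 0.5}   # assumed weight, purely illustrative

raw_progression = sum(players[col] * w for col, w in weights.items())
players["ball_progression_per90"] = raw_progression / players["minutes"] * 90  # per-90 unit
print(players[["player", "ball_progression_per90"]].round(2))
```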
Best Practice: Before creating a new metric, survey existing ones. The soccer analytics community has produced hundreds of metrics, and there is a good chance that someone has already addressed your question. Reinventing the wheel wastes time and creates unnecessary confusion when communicating with stakeholders who may already be familiar with an established metric.
5.9.3 Stage 2: Validation
Once a metric is defined, it must be validated using the three-pillar framework described in Section 5.5:
- Stability: Does the metric produce consistent values across split halves or across seasons?
- Discrimination: Does the metric separate players or teams who are genuinely different?
- Predictive power: Does the metric predict future outcomes better than baselines?
Validation should be conducted on held-out data --- not the same data used to design the metric. If you used 2022-23 data to calibrate your metric's weights, validate on 2023-24 data. This prevents overfitting and gives a realistic estimate of the metric's usefulness.
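The sketch below illustrates this protocol on simulated data (all values are invented): the metric's stability and predictive power are measured against a held-out season rather than the season used for calibration.

```python
import numpy as np

# Toy held-out validation on simulated data: the metric was "calibrated" on 2022-23,
# so stability and predictive power are checked against the held-out 2023-24 season.
rng = np.random.default_rng(0)
true_skill = rng.normal(0.0, 1.0, 200)                       # latent player quality
metric_2223 = true_skill + rng.normal(0.0, 0.5, 200)         # metric in the design season
metric_2324 = true_skill + rng.normal(0.0, 0.5, 200)         # metric in the held-out season
output_2324 = 0.4 * true_skill + rng.normal(0.0, 0.8, 200)   # future outcome of interest

stability = np.corrcoef(metric_2223, metric_2324)[0, 1]      # stability pillar
predictive = np.corrcoef(metric_2223, output_2324)[0, 1]     # predictive-power pillar
print(f"Season-to-season stability:   r = {stability:.2f}")
print(f"Prediction of 2023-24 output: r = {predictive:.2f}")
```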
5.9.4 Stage 3: Adoption
A validated metric must be adopted by stakeholders before it can influence decisions. Adoption barriers include:
- Complexity: If the metric is hard to explain, it will be hard to trust.
- Inertia: Decision-makers are comfortable with existing metrics and reluctant to change.
- Fear of obsolescence: Scouts and coaches may worry that data-driven metrics will replace their expertise.
- Lack of integration: If the metric is not available in the tools that stakeholders already use (video platforms, scouting databases), it will be ignored.
Successful adoption requires the communication strategies described in Section 5.8: leading with questions, using natural units, providing comparisons, and building trust incrementally.
5.9.5 Stage 4: Iteration
Even successful metrics require ongoing maintenance:
- Data changes: When a data provider updates their event definitions (e.g., changing how "pressure" events are coded), metrics built on those events may need recalibration.
- Tactical evolution: As the game evolves, the relative importance of different skills changes. A metric designed to capture value in a possession-dominated era may be less relevant in an era of intense pressing and counter-attacking.
- New data sources: The arrival of optical tracking data, broadcast-derived tracking, and dedicated ball-tracking systems opens possibilities for more granular metrics that may supersede simpler event-based ones.
The best analytics departments treat metrics as living documents, subject to regular review and revision. They maintain version histories, document changes, and communicate updates to stakeholders.
Intuition: Think of a metric as software. Version 1.0 is rarely the final version. Bugs are found, requirements change, and new features become possible. The best metrics are maintained, documented, and iterated upon --- just like the best code.
5.10 Common Fallacies When Interpreting Soccer Metrics
5.10.1 The Regression Fallacy
One of the most common errors in soccer analytics is failing to account for regression to the mean. If a player has an unusually good season, their next season is likely to be worse --- not because they have declined, but because the unusually good season contained a component of luck that is unlikely to recur.
This fallacy leads to two common mistakes:
- Overpaying for hot streaks. A team signs a striker based on a 25-goal season, much of which was driven by unsustainable finishing luck. The player regresses to their true level (15 goals) and is deemed a failure.
- Dismissing regression too quickly. Not all overperformance is luck. Some players genuinely improve due to better coaching, tactical fit, or physical maturation. The analyst's job is to distinguish genuine improvement from noise.
Common Pitfall: The phrase "regression to the mean" does not mean "every player will become average." It means that extreme performances tend to move toward a player's true level. A player with a true talent level of 20 goals who scores 25 is expected to regress toward 20, not toward the league average of 10.
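A small simulation makes the point concrete; the talent distribution and thresholds below are assumptions chosen only to make the effect visible.

```python
import numpy as np

# Simulate 1,000 strikers whose observed goals are true talent plus Poisson luck.
rng = np.random.default_rng(42)
true_goals = np.clip(rng.normal(12, 3, 1000), 2, None)   # true per-season scoring talent
season1 = rng.poisson(true_goals)                         # observed goals, season 1
season2 = rng.poisson(true_goals)                         # observed goals, season 2

hot = season1 >= 20                                       # "breakout" scorers in season 1
print(f"Hot group, season 1 goals: {season1[hot].mean():.1f}")
print(f"Hot group, season 2 goals: {season2[hot].mean():.1f}")
print(f"Hot group, true talent:    {true_goals[hot].mean():.1f}")
# Season 2 falls back toward the group's true talent, not toward the league average.
```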
5.10.2 Simpson's Paradox
Simpson's paradox occurs when a trend that appears in subgroups reverses when the subgroups are combined. In soccer, this can happen when comparing players across different game states or opponent quality tiers.
Example: Player A has a higher pass completion rate than Player B in every individual match. Yet over the season, Player B has a higher overall completion rate. How? Player B plays more minutes against weaker opponents (where completion rates are naturally higher), which inflates their season-long average. Player A plays more minutes against strong opponents, dragging down their average despite being the better passer in every head-to-head comparison.
This paradox is not rare --- it can arise whenever the distribution of playing time across contexts differs between players. The solution is to always examine context-specific breakdowns, not just aggregate numbers.
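The toy numbers below, invented to produce the reversal, show how the aggregation step flips the comparison.

```python
# Hypothetical pass counts (completed, attempted) split by opponent tier.
data = {
    "Player A": {"strong": (780, 1000), "weak": (180, 200)},
    "Player B": {"strong": (150, 200),  "weak": (880, 1000)},
}

for player, tiers in data.items():
    for tier, (completed, attempted) in tiers.items():
        print(f"{player} vs {tier} opponents: {completed / attempted:.1%}")
    total_completed = sum(c for c, _ in tiers.values())
    total_attempted = sum(a for _, a in tiers.values())
    print(f"{player} overall: {total_completed / total_attempted:.1%}\n")

# Player A wins both tier-level comparisons (78% vs 75%, 90% vs 88%), yet Player B's
# overall rate is higher (85.8% vs 80.0%) because B's attempts are concentrated
# against weaker opponents.
```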
5.10.3 The Ecological Fallacy
The ecological fallacy occurs when conclusions about individuals are drawn from aggregate (group-level) data. In soccer, this often manifests as attributing team-level performance to individual players.
For example: "Team X has the best defence in the league, therefore centre-back Y must be an excellent defender." But Team X's defensive record might be driven by their midfield pressing, their goalkeeper's shot-stopping, or their tactical system. Centre-back Y might be average or even below average; his team's system is protecting him.
The reverse ecological fallacy is equally common: "Team Z has a poor attack, therefore forward W is not worth signing." Forward W might be an excellent player trapped in a dysfunctional system. Individual metrics (xG per 90, shot creation actions) can help disentangle individual contribution from team effects, but even these are not immune to systemic influences.
5.10.4 Survivorship Bias
Survivorship bias occurs when we analyze only the players or teams that "survived" some selection process, ignoring those that did not. In soccer analytics, this manifests in several ways:
- Transfer analysis. Studying only successful transfers to identify predictive metrics ignores the many transfers that failed despite similar metrics. Without examining failures, we cannot assess whether the metric genuinely predicts success.
- Career trajectory analysis. Studying the development curves of players who became elite ignores the many young players who had similar early-career metrics but never broke through. Survivorship bias inflates our confidence in early predictive metrics.
- Tactical analysis. Studying only the tactics of title-winning teams ignores teams that used similar tactics but finished mid-table. The tactics may not have been the differentiating factor.
Best Practice: Whenever you analyze a subset of outcomes (successful signings, title-winning seasons, breakout young players), always ask: "What about the cases that look similar but had different outcomes?" Including the full universe of cases is essential for valid conclusions.
5.10.5 Confusing Correlation with Causation
This classic statistical fallacy is particularly tempting in soccer analytics. "Teams with higher pressing intensity win more matches" does not mean that pressing harder causes more wins. It could be that better teams (with more skilled players) are able to press harder, and it is their overall quality, not the pressing specifically, that causes the wins.
Establishing causation in soccer is extremely difficult because teams cannot run controlled experiments (you cannot randomly assign pressing intensity to teams). The best we can usually do is control for confounding variables and look for quasi-experimental opportunities (e.g., managerial changes, injuries that force tactical shifts).
5.10.6 Base Rate Neglect
Base rate neglect occurs when analysts focus on the metric's value without considering how common that value is. For example, "This 18-year-old has 0.5 xG per 90 in his first 500 minutes" sounds impressive. But without knowing the base rate --- how many 18-year-olds with 0.5 xG per 90 in their first 500 minutes actually go on to sustain that level --- the number is not actionable.
Bayesian thinking provides the antidote: combine the observed metric with the prior probability of a player at that age and level reaching the implied performance level. This naturally tempers excitement about small-sample outliers.
Intuition: If a test for a rare disease has a 5% false positive rate, and the disease affects 1 in 1,000 people, then a positive test result is still more likely to be wrong than right. The same logic applies when scouting: if only 1 in 100 young players with impressive early metrics sustains them, you should be cautious about any individual case, no matter how eye-catching the numbers.
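Applying Bayes' rule to the scouting version of this argument, with assumed rates, shows how strongly the prior tempers the conclusion.

```python
# Bayes' rule applied to the scouting analogy above; all three rates are assumptions.
prior = 0.01                 # 1 in 100 prospects with eye-catching early numbers sustain them
p_flag_if_sustains = 0.90    # chance a future success shows impressive early metrics (assumed)
p_flag_if_not = 0.20         # chance a non-success also shows impressive early metrics (assumed)

p_flag = p_flag_if_sustains * prior + p_flag_if_not * (1 - prior)
p_sustains_given_flag = p_flag_if_sustains * prior / p_flag
print(f"P(sustains | impressive early metrics) = {p_sustains_given_flag:.1%}")
# Roughly 4% under these assumptions: the eye-catching sample should only modestly
# update our belief about the player's long-term level.
```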
Summary
This chapter established the conceptual foundation for soccer metrics. We began by cataloguing the traditional statistics that dominate public discourse --- goals, assists, pass completion rates, clean sheets --- and identified their systematic limitations: lack of context, failure to weight events by value, credit assignment problems, small sample sizes, and selection bias.
We then articulated what makes a good metric (validity, reliability, discrimination, interpretability, actionability) and introduced the signal-to-noise framework that guides metric design. The distinction between rate and counting statistics taught us to choose our denominator carefully and always consider sample size.
Context adjustments --- for opponent quality, game state, possession, venue, and league quality --- allow us to make fairer comparisons, but must be applied judiciously to avoid over-adjustment. Our three-pillar validation framework (stability, discrimination, predictive power) provides a systematic protocol for evaluating any metric, and stabilization-point analysis tells us how much data we need before trusting a number.
We introduced percentile rankings and Z-scores as essential tools for placing a player's output in context, and explored composite metrics and player rating systems that combine multiple individual measures into holistic assessments.
We addressed the human element: communicating metrics to coaches, directors, and fans requires leading with questions, using natural units, providing comparisons, visualizing uncertainty, and building trust over time. The "So What?" test ensures that every metric presentation connects to an actionable decision.
The metrics lifecycle --- creation, validation, adoption, and iteration --- frames metric development as an ongoing process rather than a one-time event. And the catalogue of common fallacies --- regression to the mean, Simpson's paradox, the ecological fallacy, survivorship bias, correlation versus causation, and base rate neglect --- arms you with the critical thinking tools needed to avoid the most dangerous interpretive errors.
In the next chapter, we will apply these principles to our first major metric: expected goals (xG).
Chapter 5 Notation Reference
| Symbol | Meaning |
|---|---|
| $x_i$ | Value of metric $x$ for observation $i$ |
| $\bar{x}$ | Mean of metric $x$ |
| $r$ | Pearson correlation coefficient |
| $R^2$ | Coefficient of determination |
| ICC | Intraclass correlation coefficient |
| $\sigma^2_{\text{between}}$ | Between-subject variance |
| $\sigma^2_{\text{within}}$ | Within-subject variance |
| $n^*$ | Stabilization point (number of observations) |
| GS | Game state (score differential) |
| xG | Expected goals |
| xA | Expected assists |
| Per 90 | Normalized to 90-minute equivalent |
| $Z$ | Z-score (standardized value) |
| $\mu$ | Population or reference group mean |
| $\sigma$ | Standard deviation |
| $w_i$ | Weight for the $i$-th component in a composite metric |