

Learning Objectives

  • Calculate and interpret descriptive statistics for soccer data
  • Apply probability concepts to match outcomes and player performance
  • Conduct hypothesis tests to evaluate claims about soccer performance
  • Apply Bayesian thinking to update beliefs and estimate player quality
  • Understand correlation and its limitations in soccer contexts
  • Interpret regression models and their coefficients
  • Assess practical significance using effect sizes
  • Apply bootstrap and resampling methods to small-sample soccer data
  • Recognize common statistical pitfalls in soccer analysis

Chapter 3: Statistical Foundations for Soccer Analysis

"Statistics are like a bikini. What they reveal is suggestive, but what they conceal is vital." — Aaron Levenstein

Chapter Overview

In the 2015-16 Premier League season, Leicester City won the title despite being 5000-1 outsiders at the start of the campaign. Pundits called it a miracle, a statistical anomaly that defied all logic. But was it really? And how do we use statistics to understand whether Leicester's triumph was skill, luck, or some combination of both?

This question---separating signal from noise, skill from luck---lies at the heart of soccer analytics. Statistics provides the tools to answer it. But applying statistics to soccer requires care: the sport's low-scoring nature, contextual complexity, and inherent randomness create unique challenges.

This chapter builds your statistical foundation. We'll cover the essential concepts you need to analyze soccer data rigorously, always grounding abstract ideas in concrete soccer examples. By the end, you'll be equipped to conduct defensible analyses and recognize when others' claims don't hold up to scrutiny.



3.1 Descriptive Statistics in Soccer

3.1.1 Why Descriptive Statistics Matter

Before building complex models, we need to understand our data. Descriptive statistics summarize datasets, revealing patterns, central tendencies, and variability. In soccer, they help us answer questions like:

  • How many goals does a typical Premier League team score per season?
  • How consistent is this player's passing accuracy?
  • Are shots from this position usually converted?
  • What does the distribution of expected goals (xG) per shot look like?
  • How spread out are progressive passing distances among midfielders?

Think of descriptive statistics as the foundation of any analysis. Just as an architect surveys the land before drawing blueprints, an analyst must thoroughly understand the data before building models. Skipping this step leads to models built on faulty assumptions---a problem that no amount of algorithmic sophistication can fix.

3.1.2 Measures of Central Tendency

The Mean (Average)

The arithmetic mean is the sum of values divided by the count:

$$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + ... + x_n}{n} $$

Example: Goals per Season. A striker's goals in five consecutive seasons: 15, 22, 18, 24, 21

$$ \bar{x} = \frac{15 + 22 + 18 + 24 + 21}{5} = \frac{100}{5} = 20 \text{ goals per season} $$

The mean is useful but can be misleading when data contains outliers. Consider a league where 19 teams average around 45 goals per season, but one team scores 95. The mean across all teams would be pulled upward, giving a distorted picture of the "typical" team.

Example: Mean Goals per Game in the Premier League. Across a typical 380-match Premier League season, the total number of goals is approximately 1,025. The mean goals per game is therefore:

$$ \bar{x} = \frac{1025}{380} \approx 2.70 \text{ goals per game} $$

This single number tells us something important: the average Premier League match produces roughly 2.7 goals, which is useful for calibrating expectations and modeling.

The Weighted Mean

Sometimes observations carry different importance. The weighted mean accounts for this:

$$ \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} $$

Example: When calculating a player's average xG per shot across a season, shots from open play, free kicks, and penalties represent fundamentally different situations. If we want to estimate a player's open-play finishing ability, we should either filter to open-play shots only or use the weighted mean with appropriate weights to account for the differing nature of each shot type.
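A weighted mean is a one-liner with NumPy. The sketch below uses illustrative numbers (the per-type xG values and shot counts are assumptions, not real data) to show how much the weighting matters when shot types are mixed:

```python
import numpy as np

# Hypothetical breakdown for one player's season (illustrative numbers):
# mean xG per shot within each type, and how many shots of that type were taken.
xg_by_type = np.array([0.08, 0.06, 0.76])  # open play, set piece, penalty
shot_counts = np.array([90, 25, 5])        # weights

weighted = np.average(xg_by_type, weights=shot_counts)
unweighted = xg_by_type.mean()

print(f"Weighted mean xG/shot:   {weighted:.3f}")
print(f"Unweighted mean xG/shot: {unweighted:.3f}")
```

The unweighted mean treats the rare, high-value penalties as if they were as common as open-play shots, tripling the apparent xG per shot.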

The Median

The median is the middle value when data is sorted. For odd n, it's the middle element; for even n, it's the average of the two middle elements.

Example: Premier League team goals in 2022-23 (sorted): 31, 34, 37, 38, 40, 42, 44, 45, 48, 51, 52, 55, 58, 59, 68, 72, 75, 83, 88, 94

With n=20, the median is the average of the 10th and 11th values: $$ \text{Median} = \frac{51 + 52}{2} = 51.5 \text{ goals} $$

The median is robust to outliers---extreme values don't affect it much. Notice that the mean of these same values is 55.7, pulled upward by high-scoring teams like Manchester City. The median gives a better sense of what a "typical" team scores.

Example: Median Passing Distance. Consider the distribution of passing distances in a match. Most passes are short (under 15 meters), but a few long diagonal switches can be 50+ meters. The median passing distance (perhaps 12 meters) more accurately represents the "typical" pass than the mean (perhaps 16 meters), which is inflated by those long balls.

The Mode

The mode is the most frequent value. It's most useful for categorical data.

Example: The most common scorelines in the Premier League are 1-0 and 1-1. In a typical season, 1-1 occurs around 44 times out of 380 matches (11.6%), making it the most frequent single result.

Intuition: Use the mean when data is roughly symmetric without extreme outliers. Use the median when data is skewed or contains outliers. The mode is best for categorical data or when you want the "most typical" discrete value. In soccer analytics, xG data tends to be right-skewed (many low-xG shots, few high-xG chances), so the median xG per shot is often more informative than the mean.

3.1.3 Measures of Spread

Central tendency alone doesn't tell the full story. We also need to understand how spread out the data is. Two players might average the same number of goals per season, but one could be remarkably consistent while the other oscillates wildly between brilliant and poor campaigns.

Range

The simplest measure: maximum minus minimum.

$$ \text{Range} = x_{max} - x_{min} $$

The range is easy to compute but highly sensitive to outliers. A single anomalous match can inflate the range dramatically.

Variance

Variance measures the average squared deviation from the mean:

$$ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 $$

To understand why we square the deviations, consider that raw deviations (positive and negative) would cancel out. Squaring ensures all deviations contribute positively. The derivation proceeds from the definition of "average squared distance from the mean":

$$ s^2 = \frac{1}{n-1}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right] $$

Common Pitfall: We divide by n-1 (not n) for sample variance. This "Bessel's correction" makes the sample variance an unbiased estimator of the population variance. The intuition is that the sample mean $\bar{x}$ is already "fit" to the data, so we lose one degree of freedom. When n is large, the difference between dividing by n and n-1 is negligible, but for the small samples common in soccer analytics (e.g., 10 matches), the correction matters.
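The difference is easy to check in code, using the striker's goal totals from the mean example above. One practical wrinkle worth knowing: NumPy's `var` divides by n by default (`ddof=0`), while pandas' `.var()` divides by n-1 by default.

```python
import numpy as np

goals = np.array([15, 22, 18, 24, 21])  # striker's goals per season, from above

# ddof=0 divides by n (population variance);
# ddof=1 divides by n-1 (sample variance with Bessel's correction).
var_n = goals.var(ddof=0)
var_n1 = goals.var(ddof=1)

print(f"Divide by n:   {var_n:.2f}")
print(f"Divide by n-1: {var_n1:.2f}")
```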

Standard Deviation

The standard deviation is the square root of variance, returning the measure to the original units:

$$ s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} $$

Example: Two midfielders' passing accuracy over 10 matches:

Match   Player A   Player B
1       85%        81%
2       83%        91%
3       86%        76%
4       84%        93%
5       85%        78%
6       82%        95%
7       87%        74%
8       84%        89%
9       85%        86%
10      84%        82%

Both players average 84.5% passing accuracy, but:

  • Player A: standard deviation ≈ 1.4%
  • Player B: standard deviation ≈ 7.4%

Player A is much more consistent, despite the same average. A manager deciding between these two players needs to know this. If tactical reliability is paramount, Player A is the clear choice. If you need a player capable of exceptional performances (and can tolerate poor ones), Player B's higher variance might be acceptable.

Example: Standard Deviation of xG. Suppose a team's per-match xG values over 10 matches are: 1.2, 2.5, 0.8, 3.1, 1.5, 1.9, 0.6, 2.2, 1.7, 2.0. The mean is 1.75 and the standard deviation is approximately 0.76. This tells us the team's chance creation fluctuates by about 0.76 xG around the mean on a typical matchday---a substantial amount given the mean is only 1.75.
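These figures can be verified directly; the `ddof=1` argument applies Bessel's correction from the variance discussion above:

```python
import numpy as np

# Per-match xG values from the example
xg = np.array([1.2, 2.5, 0.8, 3.1, 1.5, 1.9, 0.6, 2.2, 1.7, 2.0])

print(f"mean = {xg.mean():.2f}, std = {xg.std(ddof=1):.2f}")
```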

Coefficient of Variation

To compare variability across different scales, use the coefficient of variation:

$$ CV = \frac{s}{\bar{x}} \times 100\% $$

Example: Comparing variability in goals scored (mean = 50, std = 12, CV = 24%) versus goals conceded (mean = 40, std = 8, CV = 20%) across teams. Despite the lower absolute standard deviation in goals conceded, the relative variability (CV) tells us that offensive output is slightly more variable across the league than defensive output.

3.1.4 Percentiles and Quartiles

Percentiles divide data into 100 equal parts. The pth percentile is the value below which p% of observations fall.

Key percentiles:

  • 25th percentile (Q1): First quartile
  • 50th percentile (Q2): Median
  • 75th percentile (Q3): Third quartile

Interquartile Range (IQR): $$ IQR = Q3 - Q1 $$

The IQR contains the middle 50% of data and is robust to outliers.

Example: If a player's xG per 90 is at the 90th percentile among strikers, they create chances at a rate better than 90% of their peers. Percentile rankings are widely used in player scouting---a radar chart comparing a player's percentile ranks across multiple metrics gives an immediate visual summary of strengths and weaknesses.

Identifying Outliers with the IQR Method

A common rule for identifying outliers uses the IQR:

$$ \text{Lower fence} = Q1 - 1.5 \times IQR $$ $$ \text{Upper fence} = Q3 + 1.5 \times IQR $$

Any observation below the lower fence or above the upper fence is considered a potential outlier. In soccer, this might flag a goalkeeper with an unusually high save percentage, which could reflect genuine skill or simply a small sample of easy shots faced.
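The fence rule is straightforward to implement. The save percentages below are hypothetical, constructed to show one keeper standing out from the pack:

```python
import numpy as np

# Save percentages for 20 hypothetical goalkeepers (illustrative data)
saves = np.array([68, 70, 71, 72, 72, 73, 74, 74, 75, 75,
                  76, 76, 77, 77, 78, 79, 80, 81, 82, 93])

q1, q3 = np.percentile(saves, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag anything outside the fences as a potential outlier
outliers = saves[(saves < lower) | (saves > upper)]
print(f"Q1={q1}, Q3={q3}, fences=({lower}, {upper}), outliers={outliers}")
```

Here the keeper at 93% is flagged; as the text notes, whether that reflects skill or an easy shot diet requires further investigation.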

3.1.5 Skewness and Kurtosis

Skewness measures the asymmetry of a distribution:

$$ \text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3 $$

  • Skewness = 0: Symmetric distribution
  • Skewness > 0: Right-skewed (tail extends to the right)
  • Skewness < 0: Left-skewed (tail extends to the left)

Soccer Example: The distribution of xG values per shot is heavily right-skewed. Most shots have low xG (0.03 to 0.08), but penalties (xG around 0.76) and one-on-ones (xG around 0.35--0.45) create a long right tail. Reporting the mean xG per shot without acknowledging this skewness can be misleading.

Kurtosis measures the "tailedness" of a distribution---how much probability mass is in the tails versus the center:

$$ \text{Excess Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)} $$

High kurtosis means more extreme values than a normal distribution would predict. This matters in soccer when we want to understand the likelihood of extreme performances---a league with high kurtosis in team points produces more unexpected champions and surprising relegations.

3.1.6 Implementing Descriptive Statistics

import pandas as pd
import numpy as np

def descriptive_stats(data: pd.Series, name: str = "Variable") -> dict:
    """
    Calculate comprehensive descriptive statistics.

    Parameters
    ----------
    data : pd.Series
        Numeric data series
    name : str
        Name of the variable

    Returns
    -------
    dict
        Dictionary of descriptive statistics
    """
    stats = {
        'variable': name,
        'n': len(data),
        'mean': data.mean(),
        'median': data.median(),
        'std': data.std(),
        'variance': data.var(),
        'min': data.min(),
        'max': data.max(),
        'range': data.max() - data.min(),
        'q25': data.quantile(0.25),
        'q75': data.quantile(0.75),
        'iqr': data.quantile(0.75) - data.quantile(0.25),
        'skewness': data.skew(),
        'kurtosis': data.kurtosis()
    }

    return stats

# Example usage
goals = pd.Series([15, 22, 18, 24, 21, 19, 23, 17, 20, 25])
print(descriptive_stats(goals, "Goals per Season"))

With this foundation in summarizing data, we can now move to the language of uncertainty. Probability theory provides the formal framework for reasoning about chance events---and soccer, perhaps more than any other major sport, is a game profoundly shaped by chance.


3.2 Probability for Soccer

3.2.1 Basic Probability Concepts

Probability quantifies uncertainty. In soccer, we use probability to answer questions like:

  • What's the probability this team wins their next match?
  • What's the probability a shot from this location results in a goal?
  • What's the probability a player scores a hat trick?
  • Given it is 0-0 at halftime, what is the probability the home team wins?

Probability Definition:

For an event A, the probability P(A) is a number between 0 and 1:

  • P(A) = 0: Event is impossible
  • P(A) = 1: Event is certain
  • 0 < P(A) < 1: Event may or may not occur

The three axioms of probability, established by Andrey Kolmogorov, are:

  1. Non-negativity: $P(A) \geq 0$ for any event A
  2. Normalization: $P(\Omega) = 1$ where $\Omega$ is the entire sample space
  3. Additivity: For mutually exclusive events $A_1, A_2, \ldots$: $P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots$

Frequentist Interpretation:

$$ P(A) = \lim_{n \to \infty} \frac{\text{Number of times A occurs}}{n} $$

If we observe many similar situations, the probability is the long-run frequency. For example, if we observe thousands of shots from similar positions with similar defensive pressure, the fraction that result in goals converges to the "true" probability---this is essentially what xG models estimate.
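A quick simulation illustrates the frequentist idea. The 12% conversion rate is an assumed "true" probability; as the number of simulated shots grows, the observed frequency converges toward it:

```python
import numpy as np

rng = np.random.default_rng(42)
p_true = 0.12  # assumed "true" goal probability for this shot type

for n in [10, 100, 10_000]:
    shots = rng.random(n) < p_true  # each trial is a goal with probability p_true
    print(f"n={n:>6}: observed frequency = {shots.mean():.3f}")
```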

Subjective (Bayesian) Interpretation:

Probability can also represent a degree of belief. When a pundit says "I think there's a 60% chance Liverpool wins," they are expressing a subjective probability based on their assessment of the available information. The Bayesian framework, which we explore in depth in Section 3.5, formalizes how to update such beliefs with new evidence.

3.2.2 Probability Rules

Complement Rule: $$ P(\text{not } A) = 1 - P(A) $$

Example: If a team has 0.45 probability of winning, their probability of not winning is 0.55. This decomposition is useful: the probability of "not winning" includes both drawing and losing.

Addition Rule (for mutually exclusive events): $$ P(A \text{ or } B) = P(A) + P(B) $$

Example: If P(home win) = 0.45 and P(draw) = 0.25, and these are mutually exclusive, then P(home win or draw) = 0.70. Equivalently, P(away win) = 1 - 0.70 = 0.30.

General Addition Rule: $$ P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) $$

Example: Consider the events "Player scores" and "Player provides an assist" in a given match. These are not mutually exclusive---a player can both score and assist. If P(scores) = 0.30, P(assists) = 0.20, and P(scores and assists) = 0.05, then:

$$ P(\text{scores or assists}) = 0.30 + 0.20 - 0.05 = 0.45 $$

Multiplication Rule (for independent events): $$ P(A \text{ and } B) = P(A) \times P(B) $$

Example: If each penalty has 0.76 probability of being scored, and kicks are independent, the probability of scoring 5 consecutive penalties is: $$ P(\text{5 scored}) = 0.76^5 \approx 0.253 $$

Common Pitfall: The independence assumption is critical here. In a penalty shootout, psychological pressure mounts with each kick, so independence may not hold perfectly. A miss by a teammate may increase the pressure on the next kicker, changing the probability. Always question whether the independence assumption is reasonable in your specific soccer context.

3.2.3 Conditional Probability

The probability of A given that B has occurred:

$$ P(A|B) = \frac{P(A \text{ and } B)}{P(B)} $$

This formula can be rearranged to derive the multiplication rule for dependent events:

$$ P(A \text{ and } B) = P(A|B) \times P(B) $$

Example: Winning After Scoring First. What's the probability a team wins given they scored first?

From historical Premier League data: - P(win and score first) = 0.42 - P(score first) = 0.48

$$ P(\text{win}|\text{score first}) = \frac{0.42}{0.48} = 0.875 $$

Teams that score first win 87.5% of the time---much higher than the unconditional win probability.

Example: Goal Probability Given Shot Location. Conditional probability is the backbone of xG models. The key question is:

$$ P(\text{goal} | \text{shot from penalty area}) \approx 0.12 $$

Compare this to:

$$ P(\text{goal} | \text{shot from outside the box}) \approx 0.03 $$

By conditioning on additional variables---angle to goal, defensive pressure, body part---we build increasingly refined probability estimates. An xG model is essentially a machine that estimates:

$$ P(\text{goal} | \text{location}, \text{angle}, \text{body part}, \text{situation}, \ldots) $$

Example: Chain of Conditional Probabilities. What is the probability that a team both creates a big chance AND scores from it?

$$ P(\text{big chance and goal}) = P(\text{goal} | \text{big chance}) \times P(\text{big chance}) $$

If a team creates a big chance with probability 0.15 per possession, and scores from big chances 40% of the time:

$$ P(\text{big chance and goal}) = 0.40 \times 0.15 = 0.06 $$

So roughly 6% of possessions result in a goal from a big chance.

3.2.4 The Law of Total Probability

The law of total probability connects conditional and unconditional probabilities. If events $B_1, B_2, \ldots, B_k$ form a partition of the sample space (they are mutually exclusive and exhaustive), then:

$$ P(A) = \sum_{i=1}^{k} P(A|B_i) \times P(B_i) $$

Soccer Application: What is the overall probability of scoring a goal, given different shot types?

Let the shot types be: open play, set piece, and penalty. Then:

$$ P(\text{goal}) = P(\text{goal}|\text{open play}) \times P(\text{open play}) + P(\text{goal}|\text{set piece}) \times P(\text{set piece}) + P(\text{goal}|\text{penalty}) \times P(\text{penalty}) $$

With typical values:

$$ P(\text{goal}) = 0.08 \times 0.75 + 0.05 \times 0.20 + 0.76 \times 0.05 = 0.06 + 0.01 + 0.038 = 0.108 $$

So approximately 10.8% of all shots result in goals, which matches observed conversion rates in top leagues.

Intuition: The law of total probability is the mathematical justification for stratified analysis. Whenever you hear "it depends on the situation," you are implicitly invoking this law. A player's overall conversion rate is the weighted average of their rates in different situations, weighted by how often they find themselves in each situation.
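The stratified calculation above translates directly into code, using the same conditional rates and shot-type mix:

```python
# Shot-type mix and conditional scoring rates from the example above
p_goal_given = {"open play": 0.08, "set piece": 0.05, "penalty": 0.76}
p_type = {"open play": 0.75, "set piece": 0.20, "penalty": 0.05}

# Law of total probability: weighted average of conditional probabilities
p_goal = sum(p_goal_given[t] * p_type[t] for t in p_type)
print(f"P(goal) = {p_goal:.3f}")
```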

3.2.5 Bayes' Theorem

Bayes' theorem relates conditional probabilities:

$$ P(A|B) = \frac{P(B|A) \times P(A)}{P(B)} $$

This can be expanded using the law of total probability:

$$ P(A|B) = \frac{P(B|A) \times P(A)}{P(B|A) \times P(A) + P(B|\text{not } A) \times P(\text{not } A)} $$

Soccer Application: Updating beliefs about team quality after observing results.

Example: Is this team actually good, or just lucky?

Let G = team is genuinely good, W = team won their first 5 matches

  • Prior: P(G) = 0.20 (only 20% of teams are "genuinely good")
  • P(W|G) = 0.35 (good teams win first 5 about 35% of the time)
  • P(W|not G) = 0.05 (other teams win first 5 about 5% of the time)

First, find P(W) using the law of total probability: $$ P(W) = P(W|G)P(G) + P(W|\text{not } G)P(\text{not } G) $$ $$ P(W) = 0.35 \times 0.20 + 0.05 \times 0.80 = 0.07 + 0.04 = 0.11 $$

Now apply Bayes' theorem: $$ P(G|W) = \frac{P(W|G) \times P(G)}{P(W)} = \frac{0.35 \times 0.20}{0.11} \approx 0.636 $$

After winning 5 straight, our belief that the team is genuinely good increases from 20% to 64%. Note that this is still far from certainty---the data is consistent with luck. We will revisit Bayesian thinking in much greater depth in Section 3.5.
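The update can be wrapped in a small helper; this is a sketch for the binary-hypothesis case used in the example:

```python
def bayes_update(prior: float, p_data_given_h: float, p_data_given_not_h: float) -> float:
    """Posterior P(H | data) via Bayes' theorem for a binary hypothesis."""
    # Law of total probability gives the marginal P(data)
    p_data = p_data_given_h * prior + p_data_given_not_h * (1 - prior)
    return p_data_given_h * prior / p_data

posterior = bayes_update(prior=0.20, p_data_given_h=0.35, p_data_given_not_h=0.05)
print(f"P(genuinely good | won first 5) = {posterior:.3f}")
```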

3.2.6 Common Probability Distributions

Binomial Distribution

Models the number of successes in n independent trials, each with probability p.

$$ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} $$

where the binomial coefficient is:

$$ \binom{n}{k} = \frac{n!}{k!(n-k)!} $$

The mean and variance of the binomial distribution are:

$$ E[X] = np, \quad \text{Var}(X) = np(1-p) $$

Example: A striker has a 0.15 conversion rate (15% of shots become goals). In a match with 6 shots, what's the probability of scoring exactly 2 goals?

$$ P(X = 2) = \binom{6}{2} (0.15)^2 (0.85)^4 = 15 \times 0.0225 \times 0.522 \approx 0.176 $$

About 17.6% probability. What about the probability of scoring at least one goal?

$$ P(X \geq 1) = 1 - P(X = 0) = 1 - (0.85)^6 = 1 - 0.377 \approx 0.623 $$

So even with 6 shots at a 15% conversion rate, there is a 37.7% chance of not scoring at all. This illustrates why soccer is such a low-scoring, high-variance sport.
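Both calculations can be checked with `scipy.stats.binom`:

```python
from scipy import stats

n, p = 6, 0.15  # six shots, 15% conversion rate

p_two = stats.binom.pmf(2, n, p)       # P(exactly 2 goals)
p_at_least_one = 1 - stats.binom.pmf(0, n, p)  # complement of P(0 goals)

print(f"P(exactly 2 goals) = {p_two:.3f}")
print(f"P(at least 1 goal) = {p_at_least_one:.3f}")
# stats.binom.sf(0, n, p) gives P(X >= 1) directly
```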

Poisson Distribution

Models the number of events in a fixed interval when events occur independently at a constant rate $\lambda$.

$$ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} $$

The mean and variance of the Poisson distribution are both equal to $\lambda$:

$$ E[X] = \lambda, \quad \text{Var}(X) = \lambda $$

Soccer Application: Goals in a match are often modeled as Poisson distributed. This is one of the most important distributional assumptions in soccer analytics.

Example: A team's expected goals (xG) for a match is 1.8. Probability of scoring exactly 2 goals:

$$ P(X = 2) = \frac{1.8^2 e^{-1.8}}{2!} = \frac{3.24 \times 0.1653}{2} \approx 0.268 $$

About 26.8% probability. We can compute the full distribution:

$$ P(X = 0) = e^{-1.8} \approx 0.165 $$ $$ P(X = 1) = 1.8 \times e^{-1.8} \approx 0.298 $$ $$ P(X = 3) = \frac{1.8^3 \times e^{-1.8}}{6} \approx 0.161 $$

Advanced: The Poisson distribution arises naturally when many independent "trials" each have a small probability of success. In a match, a team might have hundreds of moments of possession, each with a tiny probability of producing a goal. This is exactly the setting where the Poisson model applies. However, the independence assumption can be violated---teams may change tactics after scoring, and fatigue or psychological effects can alter goal rates within a match. The Dixon-Coles model, discussed in later chapters, extends the basic Poisson to handle some of these complications.

Normal (Gaussian) Distribution

The bell curve---many natural phenomena approximate this distribution.

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

Parameters: $\mu$ (mean), $\sigma$ (standard deviation)

Properties (the 68-95-99.7 rule):

  • 68% of data within $\mu \pm \sigma$
  • 95% of data within $\mu \pm 2\sigma$
  • 99.7% of data within $\mu \pm 3\sigma$

Soccer Application: While individual match goals follow a Poisson distribution (discrete counts), many aggregated soccer statistics approximate a normal distribution. For instance, the distribution of team point totals across many seasons, or the distribution of a player's per-90 statistics across a large sample of matches, often appear roughly bell-shaped.

Example: Z-Scores for Player Comparison. A z-score standardizes a value by expressing how many standard deviations it is from the mean:

$$ z = \frac{x - \mu}{\sigma} $$

If the average xG per 90 for strikers is 0.42 with a standard deviation of 0.12, and a player has an xG/90 of 0.66:

$$ z = \frac{0.66 - 0.42}{0.12} = 2.0 $$

This player is 2 standard deviations above the mean---roughly in the top 2.5% of strikers. Z-scores allow us to compare players across different metrics on a common scale.

from scipy import stats
import numpy as np

# Poisson probability example
xg = 1.8  # Expected goals
for goals in range(6):
    prob = stats.poisson.pmf(goals, xg)
    print(f"P({goals} goals | xG={xg}) = {prob:.3f}")

# Output:
# P(0 goals | xG=1.8) = 0.165
# P(1 goals | xG=1.8) = 0.298
# P(2 goals | xG=1.8) = 0.268
# P(3 goals | xG=1.8) = 0.161
# P(4 goals | xG=1.8) = 0.072
# P(5 goals | xG=1.8) = 0.026

Best Practice: Match outcome prediction models often use independent Poisson distributions for each team's goals, then combine them to calculate win/draw/loss probabilities. To find the probability of a draw, for example, you sum $P(\text{Home} = k) \times P(\text{Away} = k)$ for all k from 0 to some reasonable maximum.
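This combination of two Poisson models can be sketched as follows. It is a minimal implementation assuming independent Poisson goals and truncating at 10 goals per team; the xG values 1.8 and 1.1 are illustrative:

```python
import numpy as np
from scipy import stats

def match_probabilities(home_xg: float, away_xg: float, max_goals: int = 10):
    """Win/draw/loss probabilities from two independent Poisson goal models."""
    goals = np.arange(max_goals + 1)
    p_home = stats.poisson.pmf(goals, home_xg)
    p_away = stats.poisson.pmf(goals, away_xg)
    joint = np.outer(p_home, p_away)     # joint[i, j] = P(home = i, away = j)
    home_win = np.tril(joint, -1).sum()  # cells where i > j
    draw = np.trace(joint)               # cells where i == j
    away_win = np.triu(joint, 1).sum()   # cells where i < j
    return home_win, draw, away_win

hw, d, aw = match_probabilities(1.8, 1.1)
print(f"Home win: {hw:.3f}, Draw: {d:.3f}, Away win: {aw:.3f}")
```

Truncating at 10 goals loses only a negligible amount of probability mass at typical xG values, so the three probabilities sum to essentially 1.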

Probability theory provides the language for reasoning about uncertain outcomes. But in practice, we rarely know the true probabilities---we must estimate them from data. This is where statistical inference enters the picture.


3.3 Statistical Inference

3.3.1 From Sample to Population

We rarely observe entire populations. Instead, we collect samples and use them to make inferences about populations.

Key concepts:

  • Population: The complete group we're interested in (all Premier League matches ever)
  • Sample: The subset we actually observe (this season's matches)
  • Parameter: A value describing the population (true mean goals per match)
  • Statistic: A value calculated from the sample (sample mean)

The fundamental challenge of inference is that sample statistics are imperfect estimates of population parameters. If we observed a different sample, we would get a different estimate. Inference quantifies this uncertainty.

3.3.2 Sampling Distributions and the Central Limit Theorem

If we repeatedly took samples and calculated a statistic (like the mean), the distribution of those statistics is the sampling distribution.

Central Limit Theorem (CLT):

For large enough samples, the sampling distribution of the mean is approximately normal, regardless of the population distribution:

$$ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) $$

More precisely, as $n \to \infty$:

$$ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1) $$

The standard error (SE) measures how much sample means typically vary:

$$ SE = \frac{s}{\sqrt{n}} $$

Derivation of the Standard Error. Why does the standard error decrease as $\sqrt{n}$? Consider the variance of the sample mean:

$$ \text{Var}(\bar{X}) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2} \sum_{i=1}^{n} \text{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} $$

Taking the square root gives $SE = \sigma / \sqrt{n}$.

Implications for soccer:

  • Larger samples lead to more precise estimates
  • Even non-normal data (like goals, which follow a Poisson distribution) produces approximately normal means
  • The rate of precision improvement decreases: going from 10 to 20 matches halves the variance of the mean, but going from 100 to 110 barely changes it

Example: If a player's xG per shot has a standard deviation of 0.10, the standard error of the mean xG per shot based on 100 shots is:

$$ SE = \frac{0.10}{\sqrt{100}} = 0.01 $$

But based on only 25 shots:

$$ SE = \frac{0.10}{\sqrt{25}} = 0.02 $$

The estimate from 25 shots is twice as uncertain as the estimate from 100 shots.
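A short simulation confirms the $\sqrt{n}$ behavior. The mean of 0.11 and standard deviation of 0.10 are illustrative values; for each sample size we draw many samples and compare the spread of their means with the theoretical standard error:

```python
import numpy as np

rng = np.random.default_rng(7)
sigma = 0.10  # assumed std of xG per shot

for n in [25, 100]:
    # 20,000 resamples of size n; the spread of their means estimates the SE
    sample_means = rng.normal(0.11, sigma, size=(20_000, n)).mean(axis=1)
    print(f"n={n:>3}: theoretical SE = {sigma / np.sqrt(n):.4f}, "
          f"observed spread of means = {sample_means.std():.4f}")
```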

3.3.3 Confidence Intervals

A confidence interval provides a range of plausible values for a parameter.

95% Confidence Interval for a Mean:

$$ \bar{x} \pm t_{\alpha/2, n-1} \times \frac{s}{\sqrt{n}} $$

where $t_{\alpha/2, n-1}$ is the critical value from the t-distribution with n-1 degrees of freedom.

Interpretation: If we repeated the sampling process many times, 95% of the resulting intervals would contain the true population mean.

Common Pitfall: A 95% CI does NOT mean there's a 95% probability the true value is in this specific interval. The true value either is or isn't in the interval---the 95% refers to the procedure, not this specific interval. This is a subtle but important distinction in the frequentist framework. If you find this unintuitive, you are not alone---this is one reason many analysts prefer Bayesian credible intervals (Section 3.5).

Example: A player's xG per 90 over 20 matches

Data: mean = 0.45, std = 0.15, n = 20

$$ SE = \frac{0.15}{\sqrt{20}} = 0.034 $$

Using t-value for 95% CI with df=19: t ≈ 2.093

$$ CI = 0.45 \pm 2.093 \times 0.034 = 0.45 \pm 0.071 = (0.379, 0.521) $$

We're 95% confident the player's true xG/90 is between 0.38 and 0.52. This is a wide interval---the true value could be 20% lower or 15% higher than our estimate. This illustrates the challenge of drawing conclusions from small soccer samples.
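The same interval can be computed with SciPy, which looks up the t critical value for us:

```python
import numpy as np
from scipy import stats

mean, sd, n = 0.45, 0.15, 20  # xG/90 summary from the example

se = sd / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value

lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```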

Confidence Interval for a Proportion:

For proportions (e.g., conversion rates), the confidence interval uses:

$$ \hat{p} \pm z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} $$

Example: A player converts 18 out of 120 shots ($\hat{p}$ = 0.15). The 95% CI is:

$$ 0.15 \pm 1.96 \times \sqrt{\frac{0.15 \times 0.85}{120}} = 0.15 \pm 1.96 \times 0.0326 = 0.15 \pm 0.064 = (0.086, 0.214) $$

The true conversion rate could plausibly be anywhere from 8.6% to 21.4%. This enormous range highlights how uncertain we should be about conversion rates estimated from only 120 shots.

3.3.4 Hypothesis Testing

Hypothesis testing formalizes the question: "Is this pattern real or just random noise?"

The Framework:

  1. Null Hypothesis (H₀): The default assumption (usually "no effect" or "no difference")
  2. Alternative Hypothesis (H₁): What we're trying to provide evidence for
  3. Test Statistic: A number summarizing the evidence against H₀
  4. P-value: Probability of observing our data (or more extreme) if H₀ is true
  5. Decision: Reject H₀ if p-value < α (typically 0.05)

Example: Is this player's conversion rate better than league average?

  • H₀: Player's conversion rate = 12% (league average)
  • H₁: Player's conversion rate ≠ 12%
  • Data: 18 goals from 120 shots = 15% conversion

Test statistic (using normal approximation to binomial):

$$ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} = \frac{0.15 - 0.12}{\sqrt{\frac{0.12 \times 0.88}{120}}} = \frac{0.03}{0.0297} \approx 1.01 $$

P-value for two-tailed test: 2 × P(Z > 1.01) ≈ 0.31

Conclusion: p-value = 0.31 > 0.05, so we fail to reject H₀. The observed difference could easily occur by chance---we don't have strong evidence the player is better than average.
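The test statistic and p-value can be reproduced in a few lines:

```python
import numpy as np
from scipy import stats

p_hat, p0, n = 18 / 120, 0.12, 120  # observed rate, null rate, shots

# Normal approximation to the binomial under H0
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
p_value = 2 * stats.norm.sf(abs(z))  # two-tailed p-value

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```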

3.3.5 Worked Example: Is a Team's Home Record Significantly Better?

Let us work through a complete hypothesis test with all steps shown. We want to determine whether a specific team has a significantly better home record than away record.

Setup: Over 38 matches (19 home, 19 away), a team earned the following points:

  • Home: 42 points from 19 matches (mean = 2.21 points per match, std = 1.08)
  • Away: 28 points from 19 matches (mean = 1.47 points per match, std = 1.22)

Hypotheses:

  • H₀: The team's expected points per match at home equals their expected points per match away ($\mu_H = \mu_A$)
  • H₁: The team's expected points per match at home exceeds their expected points per match away ($\mu_H > \mu_A$)

Test Statistic (Welch's t-test for unequal variances):

$$ t = \frac{\bar{x}_H - \bar{x}_A}{\sqrt{\frac{s_H^2}{n_H} + \frac{s_A^2}{n_A}}} = \frac{2.21 - 1.47}{\sqrt{\frac{1.08^2}{19} + \frac{1.22^2}{19}}} = \frac{0.74}{\sqrt{0.0614 + 0.0783}} = \frac{0.74}{0.374} \approx 1.98 $$

Degrees of freedom (Welch-Satterthwaite approximation):

$$ df = \frac{\left(\frac{s_H^2}{n_H} + \frac{s_A^2}{n_A}\right)^2}{\frac{(s_H^2/n_H)^2}{n_H - 1} + \frac{(s_A^2/n_A)^2}{n_A - 1}} \approx 35.5 $$

P-value for a one-sided test with t = 1.98 and df = 35.5: p ≈ 0.028.

Conclusion: Since p = 0.028 < 0.05, we reject H₀ at the 5% significance level. There is statistically significant evidence that this team performs better at home. However, is this surprising? Most teams show home advantage. The more interesting question might be whether this team's home advantage is larger than the league-wide average.
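The whole Welch test can be reproduced directly from the summary statistics with scipy (a sketch; because the observed difference is in the direction of H₁, halving the two-sided p-value gives the one-sided p):

```python
from scipy import stats

# Summary statistics from the worked example (points per match)
t_stat, p_two_sided = stats.ttest_ind_from_stats(
    mean1=2.21, std1=1.08, nobs1=19,   # home
    mean2=1.47, std2=1.22, nobs2=19,   # away
    equal_var=False,                   # Welch's test (unequal variances)
)
p_one_sided = p_two_sided / 2          # H1 is one-sided: home > away
print(f"t = {t_stat:.2f}, one-sided p = {p_one_sided:.3f}")  # t ≈ 1.98, p ≈ 0.028
```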

Best Practice: Always state your hypotheses before looking at the data. Post-hoc hypothesis formulation (coming up with the hypothesis after seeing the pattern) invalidates the p-value. This is especially tempting in soccer analytics where there are so many variables and potential patterns to discover.

3.3.6 Statistical Significance vs. Practical Significance

A result can be statistically significant but practically meaningless, or vice versa.

Example: With an extremely large sample (several hundred thousand shots), even a tiny difference (12.1% vs 12.0%) can be statistically significant, but it's meaningless in practice---the difference amounts to about 1 extra goal per 1,000 shots.

Conversely, with small samples, even large differences might not be statistically significant due to uncertainty. A player converting at 20% versus the league average of 12% might fail a significance test if based on only 50 shots.

Intuition: Statistical significance tells you whether an effect is likely real. Effect size tells you whether it matters. Always consider both.

3.3.7 Effect Sizes and Practical Significance

Cohen's d for comparing two means:

$$ d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}} $$

where the pooled standard deviation is:

$$ s_{pooled} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} $$

Guidelines for interpreting Cohen's d (Cohen's conventional labels):

  • |d| < 0.2: Negligible effect
  • 0.2 ≤ |d| < 0.5: Small effect
  • 0.5 ≤ |d| < 0.8: Medium effect
  • |d| ≥ 0.8: Large effect

Soccer Example: Comparing xG per 90 between two groups of strikers (those who score 15+ goals vs. those who score fewer):

  • Group 1 (15+ goals): mean xG/90 = 0.52, std = 0.11, n = 30
  • Group 2 (<15 goals): mean xG/90 = 0.35, std = 0.10, n = 80

$$ s_{pooled} = \sqrt{\frac{29 \times 0.0121 + 79 \times 0.01}{108}} = \sqrt{\frac{0.351 + 0.79}{108}} \approx 0.103 $$

$$ d = \frac{0.52 - 0.35}{0.103} \approx 1.65 $$

This is a very large effect size, confirming that the difference in chance creation between prolific and non-prolific scorers is not just statistically significant but practically meaningful.
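The effect-size calculation can be scripted in a few lines (a sketch mirroring the worked numbers; the helper name `cohens_d` is ours):

```python
from math import sqrt

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d using the pooled standard deviation."""
    s_pooled = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / s_pooled

# xG/90 for 15+ goal strikers vs. the rest
d = cohens_d(0.52, 0.11, 30, 0.35, 0.10, 80)
print(f"Cohen's d = {d:.2f}")  # ≈ 1.65
```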

Odds Ratio for comparing proportions:

$$ OR = \frac{p_1 / (1 - p_1)}{p_2 / (1 - p_2)} $$

Example: If the conversion rate from inside the box is 12% and from outside the box is 3%:

$$ OR = \frac{0.12 / 0.88}{0.03 / 0.97} = \frac{0.136}{0.031} \approx 4.4 $$

The odds of scoring are 4.4 times higher for shots inside the box. This quantifies the practical importance of shot location beyond a simple percentage comparison.
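The same comparison in code (a minimal sketch; the helper name `odds_ratio` is ours):

```python
def odds_ratio(p1: float, p2: float) -> float:
    """Odds ratio comparing two probabilities."""
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

# Conversion inside the box (12%) vs. outside (3%)
print(f"OR = {odds_ratio(0.12, 0.03):.1f}")  # ≈ 4.4
```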

With the framework of inference established---estimation, hypothesis testing, and effect sizes---we must now confront a challenge that plagues soccer analytics more than almost any other field: small samples.


3.4 Sample Size and Stabilization

3.4.1 The Small Sample Problem

Soccer's low-scoring nature creates chronic small sample problems. A player might only take 50 shots in a season---not enough to reliably estimate their true conversion rate.

Example: True conversion rate is 15%. In samples of n=50 shots:

| Sample | Goals | Observed Rate |
|--------|-------|---------------|
| 1 | 5 | 10% |
| 2 | 9 | 18% |
| 3 | 7 | 14% |
| 4 | 11 | 22% |
| 5 | 6 | 12% |

The observed rates vary wildly (10% to 22%) despite the same true rate!

We can quantify this. The standard error of a proportion with p = 0.15 and n = 50 is:

$$ SE = \sqrt{\frac{0.15 \times 0.85}{50}} = \sqrt{0.00255} \approx 0.050 $$

A 95% confidence interval around the true rate is approximately $0.15 \pm 0.10$, spanning from 5% to 25%. The sample size simply does not support precise estimation.
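A quick simulation makes the variability concrete (a sketch; the five "seasons" are randomly generated, so the exact rates will differ from the table above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Five seasons of 50 shots each, all from the same true 15% converter
true_rate, n_shots = 0.15, 50
goals = rng.binomial(n_shots, true_rate, size=5)
rates = goals / n_shots
print("Observed rates:", [f"{r:.0%}" for r in rates])

# Standard error of the observed proportion
se = np.sqrt(true_rate * (1 - true_rate) / n_shots)
print(f"SE = {se:.3f}")  # ≈ 0.050
```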

3.4.2 Stabilization Points

Different metrics stabilize (become reliable) at different sample sizes. A metric "stabilizes" when observations become more predictive of future observations than the league average. Formally, a metric is said to have stabilized when the split-half reliability (the correlation between the metric measured in two random halves of the sample) exceeds 0.70.

Approximate Stabilization Points:

| Metric | Stabilization Point |
|--------|---------------------|
| Shooting % | ~700 shots |
| Save % | ~1000 shots faced |
| Pass completion % | ~400 passes |
| xG per shot | ~200 shots |
| Shot volume | ~10 matches |
| Tackle success | ~100 tackles |
| Aerial duel win % | ~80 duels |
| Dribble success % | ~100 attempts |

Implications:

  • Don't overreact to small-sample performance. A player who converts 25% of shots through 10 matches is not necessarily elite---the sample is far too small.
  • Use rate-based metrics only with sufficient sample. Reporting a "100% aerial duel success rate" from 3 duels is meaningless.
  • Consider Bayesian approaches that incorporate prior information (Section 3.5).
  • Volume-based metrics (total passes, total shots) stabilize much faster than rate-based metrics (pass completion %, conversion rate).

Best Practice: When presenting a rate statistic, always report the denominator alongside it. "15% conversion rate (120 shots)" is far more informative than "15% conversion rate" alone. The denominator gives the reader essential context about reliability.
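The split-half reliability criterion can be illustrated by simulation (a sketch under assumed parameters: players' true conversion rates are drawn from a Beta(6, 44) population, which is our assumption, not a value from the table):

```python
import numpy as np

rng = np.random.default_rng(42)

def split_half_reliability(shots_per_half: int, n_players: int = 2000) -> float:
    """Correlation between conversion rates measured in two random halves."""
    # Assumed population: true conversion rates vary as Beta(6, 44), mean ~12%
    true_rates = rng.beta(6, 44, size=n_players)
    half1 = rng.binomial(shots_per_half, true_rates) / shots_per_half
    half2 = rng.binomial(shots_per_half, true_rates) / shots_per_half
    return np.corrcoef(half1, half2)[0, 1]

for n in [25, 100, 400]:
    print(f"{n} shots per half: split-half r = {split_half_reliability(n):.2f}")
```

Reliability rises with sample size: with only 25 shots per half the correlation is weak, while hundreds of shots per half push it well above the 0.70 threshold.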

3.4.3 Regression to the Mean

Extreme observations tend to be followed by less extreme ones---not because of any causal mechanism, but because extreme values are partly due to luck.

Example: A player with 25 goals from 15 xG (massively overperforming) is likely to score closer to their xG next season, not because they got worse, but because they were partly lucky.

Mathematical Basis:

If X and Y are imperfectly correlated (r < 1), then: $$ E[Y|X = x] = \bar{y} + r \times \frac{s_y}{s_x}(x - \bar{x}) $$

The prediction is "regressed" toward the mean by a factor related to the correlation. If r = 0.3 (typical for conversion rates year-to-year), then a player who is 10 percentage points above average this year is expected to be only $0.3 \times 10 = 3$ percentage points above average next year.

Why Regression to the Mean Occurs. Consider a measured performance as the sum of skill and luck:

$$ \text{Observed Performance} = \text{True Ability} + \text{Luck} $$

Extreme observed performances require either extreme ability, extreme luck, or both. Since extreme luck is unlikely to be repeated, the next observation will tend to be closer to the mean. The lower the signal-to-noise ratio (i.e., the more luck matters relative to skill), the more dramatic the regression.

```python
def expected_future_performance(
    current_value: float,
    sample_mean: float,
    reliability: float  # correlation with future performance
) -> float:
    """
    Estimate future performance accounting for regression to mean.

    Parameters
    ----------
    current_value : float
        Observed performance
    sample_mean : float
        Average performance in the population
    reliability : float
        How much current predicts future (0 to 1)

    Returns
    -------
    float
        Expected future performance
    """
    return sample_mean + reliability * (current_value - sample_mean)

# Example: Player with 20% conversion, league average 12%, reliability 0.3
expected = expected_future_performance(0.20, 0.12, 0.3)
print(f"Expected future conversion: {expected:.1%}")  # ~14.4%
```

Common Pitfall: Regression to the mean is often misinterpreted as "the player got worse" or "the team improved." It is neither---it is a statistical artifact of measurement noise. A player whose conversion rate drops from 25% to 14% may not have changed at all; they may simply have been lucky the first year and normally lucky the second year.


3.5 Bayesian Thinking for Soccer Analytics

Classical (frequentist) statistics dominates introductory courses, but many of the most powerful tools in modern soccer analytics are built on Bayesian reasoning. Bayesian methods are particularly well-suited to soccer because they handle small samples gracefully, allow us to incorporate prior knowledge, and provide intuitive probabilistic statements about unknown quantities.

This section introduces Bayesian thinking from the ground up, using soccer examples throughout. By the end, you will understand when and why to reach for Bayesian tools instead of---or alongside---frequentist ones.

3.5.1 Bayes' Theorem Revisited: A Soccer Framework

We introduced Bayes' theorem in Section 3.2.5. Let us now treat it not just as a formula but as a framework for learning from data.

$$ P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \times P(\theta)}{P(\text{data})} $$

In plain language:

| Term | Name | Soccer Meaning |
|------|------|----------------|
| $P(\theta)$ | Prior | What we believed about $\theta$ before seeing new data |
| $P(\text{data} \mid \theta)$ | Likelihood | How probable the observed data is, given $\theta$ |
| $P(\theta \mid \text{data})$ | Posterior | Our updated belief about $\theta$ after seeing data |
| $P(\text{data})$ | Evidence | Overall probability of the data (a normalizing constant) |

The evidence term can be computed as:

$$ P(\text{data}) = \int P(\text{data}|\theta) P(\theta) \, d\theta $$

This integral ensures the posterior is a proper probability distribution that integrates to 1.

Updating belief about team quality after each match

Suppose we want to estimate a newly promoted team's true goal-scoring rate $\lambda$ (goals per match). Before the season begins, historical data on promoted teams tells us that $\lambda$ is typically around 1.1 goals per match with moderate uncertainty. That is our prior.

After matchday 1 the team scores 3 goals. After matchday 2 they score 0. After matchday 3 they score 2. With each result, Bayes' theorem updates our estimate:

$$ P(\lambda \mid \text{3, 0, 2 goals}) \propto P(\text{3, 0, 2} \mid \lambda) \times P(\lambda) $$

The posterior shifts from the prior toward the observed data, but it does not abandon the prior entirely---especially when the sample is small. This is exactly the behavior we want when evaluating a team after only three matches.

Intuition: Think of Bayesian updating like a scout who arrives at a club with a reputation report (the prior). After watching a few matches (the data), the scout revises their assessment. The more matches they watch, the less the original report matters. But after only two or three games, the report still carries significant weight.
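The updating above can be made concrete with a conjugate model (our choice of a Gamma prior is an assumption for illustration; the text only specifies a prior centered near 1.1 goals per match):

```python
# Conjugate Gamma-Poisson updating for a goal-scoring rate.
# A Gamma(shape=a, rate=b) prior has mean a/b; Gamma(11, 10) encodes
# a prior belief of about 1.1 goals per match (assumed prior strength).
a, b = 11.0, 10.0

for goals in [3, 0, 2]:   # matchday results
    a += goals            # shape grows with goals observed
    b += 1                # rate grows with matches observed
    print(f"After scoring {goals}: posterior mean = {a / b:.2f} goals/match")
# posterior mean drifts from 1.10 toward the data: 1.27, 1.17, 1.23
```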

3.5.2 Prior and Posterior Distributions

In Bayesian statistics, unknown parameters are not fixed numbers---they are random variables described by probability distributions.

The Prior Distribution

The prior encodes what we know (or assume) before collecting new data. Priors can be:

  • Informative: Based on historical data or domain expertise. Example: "Premier League strikers convert roughly 12% of shots, with a standard deviation of about 3%."
  • Weakly informative: Constraining the parameter to reasonable ranges without being too specific. Example: "A conversion rate is between 0% and 40%."
  • Non-informative (flat): Expressing near-total ignorance. Example: "Any conversion rate between 0% and 100% is equally plausible." (This is rarely appropriate in soccer, because we almost always have some prior knowledge.)

Soccer analogy: An informative prior is like a comprehensive scouting dossier on an opponent---it shapes your game plan before kickoff. A non-informative prior is like playing an opponent you have never seen or heard of.

The Posterior Distribution

After observing data, the posterior distribution combines prior knowledge with the evidence:

$$ \text{Posterior} \propto \text{Likelihood} \times \text{Prior} $$

As more data accumulates, the posterior concentrates around the true value, and the influence of the prior fades. This is sometimes called washing out the prior.

Example with the Beta-Binomial model:

For estimating a probability (such as a conversion rate), a natural choice is:

  • Prior: $\theta \sim \text{Beta}(\alpha, \beta)$
  • Likelihood: $k$ goals in $n$ shots $\sim \text{Binomial}(n, \theta)$
  • Posterior: $\theta \mid k, n \sim \text{Beta}(\alpha + k, \beta + n - k)$

The Beta distribution is the conjugate prior for the binomial likelihood, meaning the posterior will also be a Beta distribution---this makes the math tractable. The mean of a Beta($\alpha$, $\beta$) distribution is:

$$ E[\theta] = \frac{\alpha}{\alpha + \beta} $$

And the variance is:

$$ \text{Var}(\theta) = \frac{\alpha \beta}{(\alpha + \beta)^2(\alpha + \beta + 1)} $$

If our prior for a player's conversion rate is $\text{Beta}(6, 44)$ (encoding a belief centered near 12%), and the player then scores 5 goals from 30 shots, the posterior is:

$$ \theta \mid \text{data} \sim \text{Beta}(6 + 5, 44 + 30 - 5) = \text{Beta}(11, 69) $$

The posterior mean is $\frac{11}{11 + 69} = 0.1375$, or about 13.75%---a compromise between the prior (12%) and the observed rate (16.7%).

```python
from scipy import stats
import numpy as np

# Prior: Beta(6, 44) — centered near 12%
alpha_prior, beta_prior = 6, 44

# Observed data: 5 goals from 30 shots
goals, shots = 5, 30

# Posterior: Beta(alpha_prior + goals, beta_prior + shots - goals)
alpha_post = alpha_prior + goals
beta_post = beta_prior + shots - goals

posterior = stats.beta(alpha_post, beta_post)

print(f"Prior mean:     {alpha_prior / (alpha_prior + beta_prior):.3f}")
print(f"Observed rate:  {goals / shots:.3f}")
print(f"Posterior mean: {posterior.mean():.3f}")
print(f"90% Credible interval: ({posterior.ppf(0.05):.3f}, {posterior.ppf(0.95):.3f})")
```

3.5.3 Sequential Bayesian Updating

One of the most powerful aspects of Bayesian inference is sequential updating. As new data arrives, the posterior from the previous analysis becomes the prior for the next update.

Example: Tracking a Striker Across a Season

| Period | New Goals | New Shots | Cumulative Posterior | Posterior Mean |
|--------|-----------|-----------|----------------------|----------------|
| Prior | -- | -- | Beta(6, 44) | 12.0% |
| After 30 shots | 5 | 30 | Beta(11, 69) | 13.8% |
| After 60 shots | 4 | 30 | Beta(15, 95) | 13.6% |
| After 90 shots | 6 | 30 | Beta(21, 119) | 15.0% |
| After 120 shots | 7 | 30 | Beta(28, 142) | 16.5% |

As more data accumulates, the posterior becomes increasingly driven by the data and less by the prior. The credible intervals also narrow. After 120 shots with 22 goals total (18.3% observed rate), the posterior mean of 16.5% is closer to the data than to the prior, reflecting the growing evidence that this player might genuinely be an above-average finisher.
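These sequential updates take only a short loop, since each posterior becomes the next period's prior:

```python
# Sequential Beta-Binomial updating across four 30-shot periods
alpha, beta_ = 6, 44  # Beta(6, 44) prior, mean 12%
for goals, shots in [(5, 30), (4, 30), (6, 30), (7, 30)]:
    alpha += goals             # add goals to alpha
    beta_ += shots - goals     # add misses to beta
    mean = alpha / (alpha + beta_)
    print(f"Beta({alpha}, {beta_}): posterior mean = {mean:.1%}")
```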

3.5.4 Bayesian Updating for xG Models

Expected Goals (xG) models, which we will build in Chapter 7, assign a probability of scoring to each shot based on features like distance, angle, body part, and game state. Bayesian thinking can enhance xG modeling in several ways.

Incorporating prior information into model parameters

When fitting a logistic regression xG model, a frequentist approach estimates coefficients purely from data. A Bayesian approach places priors on those coefficients. For example, we know that shot distance should have a negative effect on scoring probability---placing a prior that encodes this domain knowledge helps regularize the model and prevents overfitting on small datasets.

$$ \beta_{\text{distance}} \sim \mathcal{N}(-0.1, 0.05^2) $$

This prior says: "We expect the distance coefficient to be around $-0.1$, and we would be surprised if it were positive."

Updating team-level xG baselines

Different leagues and eras produce different baseline scoring rates. A Bayesian hierarchical model can share information across leagues while allowing each league its own parameters:

$$ \lambda_{\text{league}_j} \sim \mathcal{N}(\mu_{\text{global}}, \sigma^2_{\text{global}}) $$

Each league's goal rate is "shrunk" toward the global mean, with the degree of shrinkage depending on the amount of data from that league.

Best Practice: Bayesian xG models are especially valuable when working with sparse data---lower-league competitions, women's soccer where historical datasets are smaller, or when modeling rare event types like headers from set pieces.
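A minimal empirical-Bayes sketch of this shrinkage idea (the global baseline, prior strength, and league numbers below are all invented for illustration, not estimates from real data):

```python
# Shrink each league's observed goal rate toward the global mean.
# The weight on the observed data grows with the number of matches.
global_mean = 2.7        # assumed global goals-per-match baseline
prior_strength = 200     # pseudo-matches of prior information (assumption)

leagues = {"League A": (2.9, 1500), "League B": (3.2, 60)}  # (rate, matches)
for name, (rate, n) in leagues.items():
    w = n / (n + prior_strength)
    shrunk = w * rate + (1 - w) * global_mean
    print(f"{name}: observed {rate:.2f} -> shrunk {shrunk:.2f}")
```

The league with only 60 matches is pulled much harder toward the global mean than the league with 1,500 matches, which is exactly the behavior a hierarchical model produces.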

3.5.5 Credible Intervals vs. Confidence Intervals

Both credible intervals and confidence intervals express uncertainty, but they answer fundamentally different questions.

Frequentist 95% confidence interval:

"If we repeated this experiment many times, 95% of the intervals constructed this way would contain the true parameter value."

This does not say there is a 95% probability the parameter lies in this particular interval.

Bayesian 95% credible interval:

"Given the observed data and our prior, there is a 95% probability that the true parameter lies in this interval."

This is a direct probability statement about the parameter---exactly what most analysts intuitively want.

$$ P(a \leq \theta \leq b \mid \text{data}) = 0.95 $$

Soccer example: After 20 matches, a team has accumulated 1.85 xG per match on average.

  • Frequentist CI: (1.52, 2.18). "If we re-sampled 20-match stretches, 95% of CIs would contain the true rate."
  • Bayesian credible interval: (1.55, 2.14). "There is a 95% probability the team's true xG rate is between 1.55 and 2.14."

The Bayesian statement is more natural for decision-making. A sporting director can say: "I am 95% confident this team's underlying attacking output is between 1.55 and 2.14 xG per match."

Advanced: When sample sizes are large and priors are weak, credible intervals and confidence intervals often coincide numerically. The differences become most important with small samples---exactly the situation soccer analysts frequently face.
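A side-by-side comparison on shared data (a sketch; the shot counts are hypothetical, the frequentist interval is a simple Wald interval, and the weak Beta(1, 1) prior is our choice):

```python
from math import sqrt
from scipy import stats

k, n = 14, 100  # hypothetical: 14 goals from 100 shots
p_hat = k / n

# Frequentist 95% Wald confidence interval
se = sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Bayesian 95% credible interval under a weak Beta(1, 1) prior
posterior = stats.beta(1 + k, 1 + n - k)
cred = (posterior.ppf(0.025), posterior.ppf(0.975))

print(f"95% confidence interval: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% credible interval:   ({cred[0]:.3f}, {cred[1]:.3f})")
```

With 100 shots and a weak prior the two intervals nearly coincide; rerunning with much smaller n shows them diverging.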

3.5.6 Practical Example: Estimating a Player's True Scoring Rate

Scenario: A club is considering signing a striker who scored 8 goals from 35 shots (22.9% conversion) in the first half of the season in a mid-table Ligue 1 side. Is this player genuinely a 23% converter, or is this small-sample noise?

Step 1: Define the prior

From historical data, we know that the population of strikers in Europe's top five leagues has a conversion rate that follows approximately $\text{Beta}(5, 35)$, centered near 12.5%.

Step 2: Compute the posterior

$$ \text{Posterior} = \text{Beta}(5 + 8, 35 + 35 - 8) = \text{Beta}(13, 62) $$

Step 3: Interpret the results

$$ \text{Posterior mean} = \frac{13}{13 + 62} = 0.173 \approx 17.3\% $$

The Bayesian estimate (17.3%) is substantially lower than the raw observed rate (22.9%). The prior has "shrunk" the estimate toward the population average, reflecting our reasonable skepticism that the player is truly a 23% converter based on just 35 shots.

Step 4: Quantify uncertainty

```python
from scipy import stats

posterior = stats.beta(13, 62)

print(f"Posterior mean:   {posterior.mean():.1%}")
print(f"Posterior median: {posterior.median():.1%}")
print(f"90% Credible interval: ({posterior.ppf(0.05):.1%}, {posterior.ppf(0.95):.1%})")
print(f"P(rate > 0.20):  {1 - posterior.cdf(0.20):.1%}")
print(f"P(rate > 0.15):  {1 - posterior.cdf(0.15):.1%}")
```

Output:

```
Posterior mean:   17.3%
Posterior median: 16.9%
90% Credible interval: (10.8%, 24.8%)
P(rate > 0.20):  22.5%
P(rate > 0.15):  68.3%
```

Decision-making interpretation: There is roughly a 68% chance the player's true conversion rate exceeds 15% (above average), but only a 23% chance it exceeds 20% (elite). The wide credible interval (10.8% to 24.8%) tells us we still face substantial uncertainty---35 shots is simply not enough to pin down a conversion rate precisely.

Intuition: A Bayesian approach lets analysts communicate uncertainty honestly. Rather than telling the sporting director "this player converts at 23%," we can say: "Our best estimate of his true rate is around 17%, and we are 90% sure it falls between 11% and 25%. We need to see more data before committing."

3.5.7 Comparing Two Players with Bayesian Methods

Instead of asking "Is Player A's conversion rate statistically significantly different from Player B's?" (a frequentist question), we can ask "What is the probability that Player A's conversion rate exceeds Player B's?" (a Bayesian question).

If Player A has posterior Beta(25, 150) and Player B has posterior Beta(18, 130), we can estimate P(A > B) through simulation:

```python
import numpy as np

np.random.seed(42)
a_samples = np.random.beta(25, 150, 100000)
b_samples = np.random.beta(18, 130, 100000)
prob_a_better = (a_samples > b_samples).mean()
print(f"P(Player A better than Player B) = {prob_a_better:.3f}")
```

This gives a direct, interpretable answer---exactly what a sporting director wants when deciding between two signings.

3.5.8 Choosing Priors

The choice of prior is both the strength and the controversy of Bayesian methods. Here are practical guidelines for soccer analytics:

  1. Use league-average rates as prior means. For conversion rates, use the league's overall conversion rate. For goal-scoring rates, use the league average. This is defensible and data-driven.

  2. Set the prior strength to reflect your confidence. A Beta(6, 44) prior (equivalent to 50 prior "observations") says you have moderate confidence. A Beta(12, 88) (100 prior observations) says you have strong confidence. The prior strength should reflect how much historical data informs your baseline belief.

  3. Always perform sensitivity analysis. Check whether your conclusions change substantially under different reasonable priors. If they do, the data is insufficient to draw strong conclusions regardless of your prior.

  4. Be transparent about your priors. Report the prior distribution in your analysis so others can evaluate your assumptions.
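A sensitivity analysis for the striker of Section 3.5.6 takes only a few lines (a sketch; the three alternative prior strengths are our assumptions, all centered near the same 12.5% baseline):

```python
from scipy import stats

goals, shots = 8, 35  # the striker from Section 3.5.6

# Three reasonable priors with the same mean (12.5%) but different strength
priors = {"weak Beta(2, 14)": (2, 14),
          "moderate Beta(5, 35)": (5, 35),
          "strong Beta(12, 84)": (12, 84)}

for label, (a, b) in priors.items():
    post = stats.beta(a + goals, b + shots - goals)
    print(f"{label}: mean = {post.mean():.1%}, "
          f"P(rate > 0.15) = {1 - post.cdf(0.15):.0%}")
```

If the posterior mean swings widely across these priors (here roughly from 20% down to 15%), the 35-shot sample is not yet strong enough to dominate the prior.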

3.5.9 When to Use Bayesian vs. Frequentist Approaches

Neither paradigm is universally superior. The choice depends on the problem at hand.

Favor Bayesian methods when:

| Situation | Why Bayesian Helps |
|-----------|--------------------|
| Small sample sizes | Priors prevent overfitting and extreme estimates |
| Strong prior knowledge exists | Domain expertise can be formally incorporated |
| You need probability statements about parameters | Credible intervals are more intuitive than CIs |
| Sequential updating is natural | New data can be folded in incrementally |
| Hierarchical data structures | Bayesian hierarchical models share strength across groups |
| Decision-making under uncertainty | Posterior distributions feed naturally into cost-benefit analyses |

Favor frequentist methods when:

| Situation | Why Frequentist Works Well |
|-----------|----------------------------|
| Large sample sizes | Priors become irrelevant; methods converge |
| Objectivity is paramount | No subjective prior to defend |
| Simple, well-defined tests | z-tests, t-tests, chi-squared tests are fast and familiar |
| Regulatory or publication standards | Many journals and organizations expect frequentist results |
| Computational constraints | Point estimates and CIs are cheap to compute |

In soccer analytics practice:

Most professional analytics departments use a pragmatic mix. Frequentist methods work well for large-scale analyses with ample data (e.g., analyzing passing completion rates across an entire league season). Bayesian methods shine when evaluating individual players from limited data, building hierarchical models across competitions, or communicating uncertainty to non-technical stakeholders like coaches and directors.

Best Practice: The Bayesian framework is not about choosing a "better" philosophy---it is about having the right tool for the right problem. In soccer, where small samples, prior knowledge, and decision-making under uncertainty are the norm, Bayesian thinking should be a core part of every analyst's toolkit.


3.6 Correlation and Causation

3.6.1 Understanding Correlation

Correlation measures the linear relationship between two variables.

Pearson Correlation Coefficient:

$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} $$

This can also be written as:

$$ r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{[n\sum x_i^2 - (\sum x_i)^2][n\sum y_i^2 - (\sum y_i)^2]}} $$

  • r = 1: Perfect positive correlation
  • r = -1: Perfect negative correlation
  • r = 0: No linear correlation

Interpretation Guidelines:

  • |r| < 0.3: Weak
  • 0.3 ≤ |r| < 0.6: Moderate
  • |r| ≥ 0.6: Strong

The Coefficient of Determination ($R^2$) is the square of the correlation coefficient and represents the proportion of variance in one variable explained by the other:

$$ R^2 = r^2 $$

If r = 0.85 between xG and goals, then $R^2 = 0.72$, meaning 72% of the variation in goal totals can be explained by xG.
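Both quantities are a single call away in NumPy (a sketch; the team totals below are hypothetical):

```python
import numpy as np

# Hypothetical season totals for six teams
xg    = np.array([65.2, 48.3, 72.1, 55.8, 61.4, 43.2])
goals = np.array([68, 45, 75, 52, 65, 48])

r = np.corrcoef(xg, goals)[0, 1]   # Pearson correlation
print(f"r = {r:.2f}, R^2 = {r**2:.2f}")
```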

3.6.2 Correlation Examples in Soccer

Strong Positive Correlation:

  • xG and actual goals (r ≈ 0.85): This validates xG as a predictive metric
  • Shots and xG (r ≈ 0.75): Teams that shoot more generate more xG
  • Points and goal difference (r ≈ 0.95): Nearly perfectly correlated in league play

Moderate Correlation:

  • Possession and points (r ≈ 0.4): Possession helps, but is far from decisive
  • Passing accuracy and points (r ≈ 0.45): Better passing teams tend to perform better

Weak Correlation:

  • Corners and goals (r ≈ 0.15): Corners rarely lead to goals
  • Fouls committed and goals conceded (r ≈ 0.1): Almost no relationship

Intuition: The modest correlation between possession and points (r ≈ 0.4) is one of the most important findings in soccer analytics. It tells us that knowing a team's possession percentage explains only about 16% ($0.4^2$) of the variation in their point totals. The other 84% depends on other factors: quality of chances created, finishing, defending, set pieces, and luck.

3.6.3 Spearman's Rank Correlation

When relationships are monotonic but not necessarily linear, or when data contains outliers, Spearman's rank correlation is more appropriate:

$$ r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} $$

where $d_i$ is the difference between the ranks of corresponding values. Spearman's correlation is simply the Pearson correlation applied to the ranked data.

Soccer Example: If we rank teams by xG and by league position, Spearman's correlation tells us how well xG predicts finishing order, without assuming the relationship is linear. This is useful because the relationship between xG and points may not be strictly linear---teams at the extreme top may convert xG into points more efficiently than mid-table teams.
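The difference between the two coefficients is easy to see in code (a sketch with invented xG and points values chosen to be monotonic but not perfectly linear):

```python
import numpy as np
from scipy import stats

# A monotonic but non-linear relationship: xG vs. points (hypothetical)
xg     = np.array([30, 40, 50, 60, 70, 85])
points = np.array([28, 40, 50, 58, 70, 95])

pearson, _ = stats.pearsonr(xg, points)
spearman, _ = stats.spearmanr(xg, points)
# Spearman is exactly 1 for any strictly increasing relationship
print(f"Pearson r = {pearson:.3f}, Spearman r_s = {spearman:.3f}")
```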

3.6.4 The Causation Problem

Common Pitfall: Correlation does not imply causation!

Why correlations can be misleading:

  1. Reverse Causation: A causes B, but we think B causes A.
     Example: Winning teams have higher possession... but do they win because of possession, or do they have possession because they're winning (and opponents must attack)?

  2. Confounding Variables: C causes both A and B.
     Example: Teams with high pressing intensity also score more goals. But both might be caused by overall team quality (a confounder). Rich clubs buy better players who both press harder and finish better.

  3. Selection Bias: The sample isn't representative.
     Example: Studying only promoted teams might show correlations that don't generalize to the broader population.

  4. Coincidence: Spurious correlations happen by chance.
     Example: Nicolas Cage films correlate with swimming pool drownings (a real, obviously spurious finding).

3.6.5 Moving Beyond Correlation

To establish causation, we need:

  • Temporal precedence: Cause before effect
  • Plausible mechanism: A theory for how X affects Y
  • Ruling out alternatives: Other explanations eliminated
  • Ideally: Experimental evidence (hard in soccer)

Practical Approaches:

  • Use domain knowledge to evaluate mechanisms
  • Control for confounders in regression
  • Use natural experiments when available (e.g., rule changes, managerial sackings)
  • Use causal inference frameworks (difference-in-differences, instrumental variables)
  • Be humble about causal claims


3.7 Introduction to Regression

3.7.1 Simple Linear Regression

Regression models the relationship between variables, allowing prediction and understanding of associations.

The Model:

$$ Y = \beta_0 + \beta_1 X + \epsilon $$

Where:

  • Y: Dependent variable (what we're predicting)
  • X: Independent variable (predictor)
  • $\beta_0$: Intercept (Y when X = 0)
  • $\beta_1$: Slope (change in Y per unit change in X)
  • $\epsilon$: Error term (what the model doesn't explain), assumed $\epsilon \sim N(0, \sigma^2)$

Estimation by Ordinary Least Squares (OLS):

OLS finds the values of $\beta_0$ and $\beta_1$ that minimize the sum of squared residuals:

$$ \min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 $$

Taking partial derivatives and setting them to zero yields the "normal equations":

$$ \frac{\partial}{\partial \beta_0} \sum (y_i - \beta_0 - \beta_1 x_i)^2 = -2\sum(y_i - \beta_0 - \beta_1 x_i) = 0 $$

$$ \frac{\partial}{\partial \beta_1} \sum (y_i - \beta_0 - \beta_1 x_i)^2 = -2\sum x_i(y_i - \beta_0 - \beta_1 x_i) = 0 $$

Solving these gives:

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}} $$

$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$

Note the relationship between the slope and the correlation coefficient:

$$ \hat{\beta}_1 = r \times \frac{s_y}{s_x} $$

This shows that the regression slope is the correlation scaled by the ratio of standard deviations.
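The identity can be verified numerically on simulated data (a sketch; the xG-like numbers are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(55, 10, size=200)             # e.g. team xG totals
y = 1.0 * x + rng.normal(0, 8, size=200)     # goals with noise

slope = np.polyfit(x, y, 1)[0]               # OLS slope
r = np.corrcoef(x, y)[0, 1]
identity = r * y.std(ddof=1) / x.std(ddof=1)
print(f"OLS slope = {slope:.4f}, r*sy/sx = {identity:.4f}")  # the two agree
```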

3.7.2 Soccer Example: xG and Goals

Let's model the relationship between team xG and actual goals.

```python
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm

# Sample data: Team season xG and actual goals
data = pd.DataFrame({
    'team': ['Team A', 'Team B', 'Team C', 'Team D', 'Team E',
             'Team F', 'Team G', 'Team H', 'Team I', 'Team J'],
    'xG': [65.2, 48.3, 72.1, 55.8, 61.4, 43.2, 78.5, 52.1, 58.9, 67.3],
    'goals': [68, 45, 75, 52, 65, 48, 82, 50, 55, 70]
})

# Fit regression
X = sm.add_constant(data['xG'])  # Add intercept
model = sm.OLS(data['goals'], X).fit()

print(model.summary())
```

Interpreting Output:

  • R-squared: Proportion of variance explained (e.g., 0.92 = 92%). An R-squared of 0.92 means 92% of the variation in actual goals is explained by xG, with 8% attributable to finishing skill, luck, and other factors.
  • Coefficients:
    • Intercept (const): Expected goals when xG = 0. This is usually not meaningful but is required for the model.
    • xG coefficient: Change in goals per unit increase in xG. A coefficient of 1.05 means that for every additional xG, a team scores about 1.05 actual goals on average.
  • P-values: Statistical significance of each coefficient.
  • Confidence intervals: Range of plausible coefficient values.

3.7.3 Multiple Regression

With multiple predictors:

$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_k X_k + \epsilon $$

Application: A Simple xG Model Framework

A basic xG model might use multiple features to predict goal probability. While xG models typically use logistic regression (since the outcome is binary: goal or no goal), the multiple regression framework illustrates the principle of controlling for multiple factors simultaneously:

$$ \text{Goals} = \beta_0 + \beta_1 \times \text{xG} + \beta_2 \times \text{shots\_on\_target} + \beta_3 \times \text{big\_chances} + \epsilon $$

```python
# Multiple regression example: predicting team goals.
# Assumes `data` also contains 'shots_on_target' and 'big_chances' columns
# (not present in the earlier example DataFrame).
X = data[['xG', 'shots_on_target', 'big_chances']]
X = sm.add_constant(X)
model = sm.OLS(data['goals'], X).fit()
```

Interpretation: Each coefficient represents the effect of that variable holding others constant. For example, if the coefficient for "big_chances" is 0.45, it means that each additional big chance created is associated with 0.45 additional goals, controlling for xG and shots on target.

Advanced: In multiple regression, the interpretation "holding others constant" is crucial but can be misleading when predictors are correlated. If xG and big chances are highly correlated (which they are), the individual coefficients may be unstable even if the model as a whole fits well. This is the problem of multicollinearity.

3.7.4 Regression Assumptions and Diagnostics

Linear regression assumes:

  1. Linearity: Relationship between X and Y is linear. Check with residual plots.
  2. Independence: Observations are independent. Violated when the same team appears multiple times across seasons.
  3. Homoscedasticity: Constant variance of errors. Check by plotting residuals vs. fitted values.
  4. Normality: Errors are normally distributed (for inference). Check with Q-Q plots.
  5. No multicollinearity: Predictors aren't highly correlated with each other. Check with Variance Inflation Factors (VIF).

Checking Multicollinearity with VIF:

$$ VIF_j = \frac{1}{1 - R_j^2} $$

where $R_j^2$ is the R-squared from regressing predictor $X_j$ on all other predictors. A VIF above 10 suggests problematic multicollinearity.

Example: In a model with both "xG" and "shots on target" as predictors, the VIF might be 4.5, indicating moderate correlation. Adding "shots" as a third predictor might push VIFs above 10, because shots, shots on target, and xG are all measuring related aspects of attacking output.
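The VIF formula can be computed directly with numpy; below is a minimal sketch on synthetic correlated predictors (the data is illustrative, not real; statsmodels also provides `variance_inflation_factor` for the same job):

```python
import numpy as np

def vif(X, j):
    """VIF for column j of predictor matrix X (no constant column)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # intercept for the auxiliary regression
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return 1.0 / (1.0 - r2)

# Synthetic attacking metrics: shots on target and xG are both derived
# from shots, so all three predictors are mutually correlated
rng = np.random.default_rng(0)
shots = rng.poisson(12, size=200).astype(float)
shots_on_target = shots * rng.uniform(0.25, 0.45, size=200)
xg = shots_on_target * rng.uniform(0.25, 0.40, size=200)

X = np.column_stack([shots, shots_on_target, xg])
for j, name in enumerate(["shots", "shots_on_target", "xG"]):
    print(f"{name}: VIF = {vif(X, j):.2f}")
```

Running the same function on independent predictors returns VIFs near 1, which is the baseline for uncorrelated columns.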

Checking Assumptions:

  • Plot residuals vs. fitted values (should show no pattern---a funnel shape suggests heteroscedasticity)
  • Check the residual distribution with a histogram or Q-Q plot (should be approximately normal)
  • Calculate the VIF for each predictor (common guidance: below 10 is acceptable, below 5 is preferable)

3.7.5 Limitations in Soccer Context

Regression has specific limitations in soccer analysis:

  • Small samples: Season-level team data gives only 20 observations per league. This severely limits the number of predictors we can include (a common rule of thumb is at least 10-15 observations per predictor).
  • Non-independence: The same team appears multiple times across seasons. Repeated measurements from the same team violate the independence assumption. Mixed-effects models or clustered standard errors can address this.
  • Non-linearity: Some relationships aren't linear. For example, the relationship between possession and points might be non-linear---very low possession is bad, moderate is fine, and very high might show diminishing returns.
  • Missing variables: Important factors we can't measure (tactical instructions, player motivation, injury severity) create omitted variable bias.

Best Practice: Use regression as one tool among many. Validate findings with domain knowledge and alternative methods. When reporting regression results, always include diagnostic plots and discuss potential violations of assumptions.


3.8 Bootstrap and Resampling Methods

3.8.1 Why Resampling?

Classical statistical methods assume specific distributional forms (e.g., normality) for inference. When these assumptions are questionable, or when sample sizes are too small for asymptotic approximations to work well, resampling methods provide a powerful alternative.

In soccer analytics, we frequently encounter situations where:

  • The data is not normally distributed (e.g., goal counts, xG values)
  • Sample sizes are small (e.g., a player's 15 matches)
  • The statistic of interest has no simple formula for its standard error (e.g., the ratio of two metrics, the median, or a percentile)

3.8.2 The Bootstrap

The bootstrap, introduced by Bradley Efron in 1979, estimates the sampling distribution of a statistic by resampling from the observed data with replacement.

Algorithm:

  1. From the original sample of size n, draw a sample of size n with replacement
  2. Calculate the statistic of interest on the bootstrap sample
  3. Repeat steps 1-2 a large number of times (e.g., 10,000)
  4. The distribution of the bootstrap statistics approximates the sampling distribution

Bootstrap Confidence Interval (Percentile Method):

The $100(1-\alpha)\%$ bootstrap confidence interval is simply the $\alpha/2$ and $1-\alpha/2$ percentiles of the bootstrap distribution.

Example: Bootstrap CI for a Player's xG per 90

import numpy as np

def bootstrap_ci(data, stat_func=np.mean, n_boot=10000, ci=0.95):
    """
    Compute bootstrap confidence interval.

    Parameters
    ----------
    data : array-like
        Observed data
    stat_func : callable
        Statistic to compute (default: mean)
    n_boot : int
        Number of bootstrap samples
    ci : float
        Confidence level

    Returns
    -------
    tuple
        (lower, upper) bounds of confidence interval
    """
    rng = np.random.default_rng(42)
    boot_stats = np.array([
        stat_func(rng.choice(data, size=len(data), replace=True))
        for _ in range(n_boot)
    ])

    alpha = 1 - ci
    lower = np.percentile(boot_stats, 100 * alpha / 2)
    upper = np.percentile(boot_stats, 100 * (1 - alpha / 2))

    return lower, upper

# Player's xG per 90 across 15 matches
xg_per_90 = np.array([0.32, 0.55, 0.18, 0.72, 0.41, 0.65, 0.28,
                      0.49, 0.33, 0.58, 0.22, 0.67, 0.44, 0.51, 0.39])

lower, upper = bootstrap_ci(xg_per_90)
print(f"Bootstrap 95% CI: ({lower:.3f}, {upper:.3f})")
print(f"Sample mean: {np.mean(xg_per_90):.3f}")

Intuition: The bootstrap treats the observed sample as a "miniature population." By resampling from it thousands of times, we approximate what would happen if we could actually collect thousands of new samples from the real population. The beauty is that this works for almost any statistic---means, medians, ratios, percentiles, or custom metrics---without needing to derive mathematical formulas.

3.8.3 Bootstrap for Comparing Two Groups

The bootstrap is particularly useful for comparing groups when parametric assumptions may not hold.

Example: Comparing Home vs. Away xG

home_xg = np.array([1.8, 2.1, 1.5, 2.4, 1.9, 2.0, 1.6, 2.2, 1.7, 2.3])
away_xg = np.array([1.2, 1.5, 0.9, 1.8, 1.1, 1.4, 1.0, 1.6, 1.3, 1.0])

observed_diff = np.mean(home_xg) - np.mean(away_xg)

# Bootstrap the difference
rng = np.random.default_rng(42)
n_boot = 10000
boot_diffs = []
for _ in range(n_boot):
    boot_home = rng.choice(home_xg, size=len(home_xg), replace=True)
    boot_away = rng.choice(away_xg, size=len(away_xg), replace=True)
    boot_diffs.append(np.mean(boot_home) - np.mean(boot_away))

boot_diffs = np.array(boot_diffs)
ci_lower = np.percentile(boot_diffs, 2.5)
ci_upper = np.percentile(boot_diffs, 97.5)

print(f"Observed difference: {observed_diff:.3f}")
print(f"Bootstrap 95% CI: ({ci_lower:.3f}, {ci_upper:.3f})")
print(f"Proportion of bootstrap differences > 0: {(boot_diffs > 0).mean():.4f}")

3.8.4 Permutation Tests

Permutation tests assess statistical significance by randomly shuffling group labels. Under the null hypothesis (no difference between groups), the group labels are arbitrary.

Example: Is the difference in conversion rates between two players statistically significant?

Player A: 18 goals from 120 shots (15.0%)
Player B: 10 goals from 100 shots (10.0%)

Under H₀, the group labels (Player A vs. Player B) are irrelevant. We pool all 220 shots and randomly assign 120 to "group A" and 100 to "group B," then compute the difference in conversion rates. Repeating this many times produces the null distribution.

def permutation_test(successes_a, n_a, successes_b, n_b, n_perms=10000):
    """Permutation test for difference in proportions."""
    observed_diff = successes_a / n_a - successes_b / n_b

    # Pool all outcomes
    pooled = np.concatenate([
        np.ones(successes_a + successes_b),
        np.zeros((n_a - successes_a) + (n_b - successes_b))
    ])

    rng = np.random.default_rng(42)
    perm_diffs = []
    for _ in range(n_perms):
        rng.shuffle(pooled)
        perm_a = pooled[:n_a].mean()
        perm_b = pooled[n_a:n_a + n_b].mean()
        perm_diffs.append(perm_a - perm_b)

    perm_diffs = np.array(perm_diffs)
    p_value = (np.abs(perm_diffs) >= np.abs(observed_diff)).mean()

    return observed_diff, p_value

diff, p_val = permutation_test(18, 120, 10, 100)
print(f"Observed difference: {diff:.3f}")
print(f"Permutation p-value: {p_val:.4f}")

Best Practice: Resampling methods make fewer assumptions than classical tests and are often more appropriate for the messy, non-normal, small-sample data typical in soccer analytics. Their main drawback is computational cost, but with modern computers, running 10,000 bootstrap iterations on a typical dataset takes well under a second.


3.9 Common Statistical Pitfalls

3.9.1 Multiple Comparisons

When testing many hypotheses, some will be "significant" by chance.

Example: Testing whether 20 different metrics differ between winning and losing teams. At α = 0.05, we expect 1 false positive even with no real differences! The probability of at least one false positive is:

$$ P(\text{at least one false positive}) = 1 - (1 - 0.05)^{20} = 1 - 0.95^{20} \approx 0.64 $$

There is a 64% chance of finding at least one "significant" difference purely by chance.

Solutions:

Bonferroni Correction: Divide the significance level by the number of tests:

$$ \alpha_{\text{adjusted}} = \frac{\alpha}{m} $$

For 20 tests at α = 0.05: $\alpha_{\text{adjusted}} = 0.0025$. This is conservative---it controls the family-wise error rate but may miss real effects.

Benjamini-Hochberg Procedure: Controls the False Discovery Rate (FDR), which is less conservative than Bonferroni. Sort the p-values from smallest to largest: $p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(m)}$. Find the largest $i$ such that

$$ p_{(i)} \leq \frac{i}{m} \times \alpha $$

and reject the hypotheses corresponding to $p_{(1)}, \ldots, p_{(i)}$.
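A minimal numpy sketch of the procedure (the p-values are hypothetical):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask marking hypotheses rejected under BH at level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])   # largest rank i with p_(i) <= (i/m)*alpha
        reject[order[:k + 1]] = True     # reject everything up to and including rank i
    return reject

# Hypothetical p-values from screening 10 metrics
p_vals = [0.001, 0.008, 0.020, 0.041, 0.048, 0.120, 0.300, 0.450, 0.700, 0.950]
print(benjamini_hochberg(p_vals))
```

For comparison, Bonferroni at the same level uses the fixed threshold 0.05/10 = 0.005 and would reject only the first hypothesis, while BH rejects the first two.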

Soccer Application: When scouting databases are used to screen hundreds of players across dozens of metrics, multiple comparisons corrections are essential. Without them, many "standout" players will be false positives---players who appear exceptional on one metric by chance alone.

Best Practice: When analyzing many players or many metrics, always report how many tests were conducted, not just the significant ones. Use FDR correction when screening, and pre-register your hypotheses when possible.

3.9.2 Cherry-Picking

Selecting only the data or analyses that support your conclusion.

Examples:

  • "Player X has scored in 8 of their last 11 away games against top-6 opposition on Saturdays"
  • Only showing seasons where a pattern holds
  • Trying many different model specifications and reporting only the one that "works"

Prevention: Pre-register hypotheses, report negative findings, be skeptical of highly specific claims. The more specific and cherry-picked a statistic sounds, the less likely it is to be meaningful.

3.9.3 Simpson's Paradox

Aggregated data can show different patterns than disaggregated data.

Soccer Example:

  • Player A: 20% conversion vs strong teams, 15% vs weak teams
  • Player B: 18% conversion vs strong teams, 13% vs weak teams
  • Player A is better against both types of opposition
  • But Player B can still have a higher OVERALL conversion rate if they shoot mostly against strong teams (where their 18% exceeds the 15% Player A achieves against weak teams) while Player A shoots mostly against weak teams!

Numerically: If Player A takes 50 shots against strong teams and 50 against weak teams, they convert (0.20)(50) + (0.15)(50) = 17.5 of 100 shots (17.5%). If Player B takes 20 shots against strong teams and 80 against weak, they convert (0.18)(20) + (0.13)(80) = 3.6 + 10.4 = 14 of 100 (14%), so with this composition aggregation preserves the ranking. But shift the mix: if Player A takes 90 of their 100 shots against weak teams, their overall rate falls to (0.20)(10) + (0.15)(90) = 15.5%, while Player B taking 90 of 100 shots against strong teams converts (0.18)(90) + (0.13)(10) = 17.5%. Player B now ranks higher overall despite Player A being better in every subgroup.
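The arithmetic can be checked in a few lines of Python (the shot counts are hypothetical and chosen to produce the reversal):

```python
# Simpson's paradox: Player A converts better in BOTH subgroups,
# yet Player B's overall rate is higher because of who they face.
rate_a = {"strong": 0.20, "weak": 0.15}
rate_b = {"strong": 0.18, "weak": 0.13}

# Hypothetical shot counts: A shoots mostly against weak sides,
# B mostly against strong sides
shots_a = {"strong": 10, "weak": 90}
shots_b = {"strong": 90, "weak": 10}

overall_a = sum(rate_a[k] * shots_a[k] for k in rate_a) / sum(shots_a.values())
overall_b = sum(rate_b[k] * shots_b[k] for k in rate_b) / sum(shots_b.values())

print(f"Player A overall: {overall_a:.3f}")  # 0.155
print(f"Player B overall: {overall_b:.3f}")  # 0.175 -- higher, despite worse subgroup rates
```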

Solution: Consider stratification and confounders before aggregating. When comparing players, ensure you are comparing like with like.

3.9.4 Survivorship Bias

Only analyzing successes while ignoring failures.

Example: Studying what traits successful analytics departments have without studying failed ones leads to biased conclusions---maybe all departments have those traits. Similarly, analyzing players who lasted a full season ignores those who were dropped or injured, creating a biased sample of "survivors."

3.9.5 Ecological Fallacy

Assuming patterns at group level apply to individuals.

Example: If teams with high possession tend to win, it doesn't mean a specific team will win by increasing their possession---the relationship may not be causal, and it may not apply at the individual team level. High-possession teams may simply be better teams who happen to dominate the ball.

3.9.6 Base Rate Neglect

Ignoring the prior probability of an event when evaluating evidence.

Example: A "talent identification model" correctly identifies 90% of players who become elite (sensitivity) and correctly rejects 95% of players who don't become elite (specificity). Sounds excellent. But if only 1% of all players become elite (the base rate):

$$ P(\text{elite} | \text{positive test}) = \frac{0.90 \times 0.01}{0.90 \times 0.01 + 0.05 \times 0.99} = \frac{0.009}{0.009 + 0.0495} \approx 0.154 $$

Only 15.4% of players flagged as future elite by the model actually become elite. The low base rate overwhelms the model's accuracy. This is a direct application of Bayes' theorem and illustrates why scouting models must be evaluated in context.
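The calculation is a one-line application of Bayes' theorem:

```python
# Positive predictive value of the hypothetical talent-identification model
sensitivity = 0.90   # P(positive | elite)
specificity = 0.95   # P(negative | not elite)
base_rate   = 0.01   # P(elite)

false_positive_rate = 1 - specificity
ppv = (sensitivity * base_rate) / (
    sensitivity * base_rate + false_positive_rate * (1 - base_rate)
)
print(f"P(elite | positive) = {ppv:.3f}")  # 0.154
```

Varying `base_rate` in this snippet shows how sharply the positive predictive value collapses as the event being predicted becomes rarer.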

Common Pitfall: Base rate neglect is one of the most pervasive errors in sports analytics and talent identification. A model can be highly accurate and still produce mostly false positives if the event it is predicting is rare. Always ask: "What is the base rate?"


3.10 Chapter Summary

Key Concepts

  1. Descriptive statistics summarize data through measures of central tendency (mean, median) and spread (variance, standard deviation, IQR). Always examine the shape of distributions, including skewness and kurtosis.

  2. Probability quantifies uncertainty using rules for combining events, conditional probability, the law of total probability, and distributions (Binomial, Poisson, Normal). The Poisson distribution is particularly important for modeling soccer goals.

  3. Statistical inference allows us to draw conclusions about populations from samples using confidence intervals and hypothesis tests. Always report effect sizes alongside p-values.

  4. Sample size and stabilization are critical---many soccer metrics require large samples to become reliable. Regression to the mean is a pervasive phenomenon in soccer performance data.

  5. Bayesian thinking provides a principled framework for updating beliefs with new data, incorporating prior knowledge, and making probability statements about unknown parameters---especially valuable with soccer's small samples.

  6. Correlation measures association but does not imply causation. Multiple mechanisms can create correlations, and understanding the distinction is essential for responsible analysis.

  7. Regression models relationships between variables, enabling prediction and estimation of effects. Multiple regression extends this to multiple predictors, but assumptions and multicollinearity require careful attention.

  8. Bootstrap and resampling methods provide distribution-free alternatives to classical inference, making them particularly useful for the non-standard data common in soccer analytics.

  9. Statistical pitfalls including multiple comparisons, cherry-picking, Simpson's paradox, survivorship bias, and base rate neglect require constant vigilance.

Key Formulas

| Concept | Formula |
| --- | --- |
| Mean | $\bar{x} = \frac{1}{n}\sum x_i$ |
| Standard Deviation | $s = \sqrt{\frac{1}{n-1}\sum(x_i - \bar{x})^2}$ |
| Standard Error | $SE = \frac{s}{\sqrt{n}}$ |
| 95% Confidence Interval | $\bar{x} \pm t \times SE$ |
| Poisson Probability | $P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$ |
| Bayes' Theorem | $P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) P(\theta)}{P(\text{data})}$ |
| Beta-Binomial Posterior | $\text{Beta}(\alpha + k, \beta + n - k)$ |
| Correlation | $r = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2\sum(y_i-\bar{y})^2}}$ |
| Linear Regression | $Y = \beta_0 + \beta_1 X + \epsilon$ |
| Cohen's d | $d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}$ |

Decision Framework

When analyzing soccer data statistically:

├── Is sample size adequate?
│   ├── No → Use Bayesian methods with informative priors, or acknowledge uncertainty
│   └── Yes → Proceed with frequentist or Bayesian analysis
├── Are you making causal claims?
│   ├── Yes → What's the mechanism? What confounders exist?
│   └── No → Proceed with correlation/association language
├── Are you testing multiple hypotheses?
│   ├── Yes → Adjust for multiple comparisons (Bonferroni or FDR)
│   └── No → Standard testing
├── Are distributional assumptions met?
│   ├── No → Use bootstrap or permutation tests
│   └── Yes → Classical methods are appropriate
├── Do you have strong prior information?
│   ├── Yes → Consider Bayesian methods for more stable estimates
│   └── No → Frequentist methods with appropriate caution
└── Are results practically significant?
    ├── No → Don't overinterpret
    └── Yes → Report effect sizes alongside p-values

What's Next

In Chapter 4: Python Programming for Soccer Analytics, we'll put these statistical concepts into practice using Python. You'll learn to implement the calculations from this chapter efficiently using pandas, NumPy, and scipy.

Before moving on, complete the exercises and quiz to solidify your statistical foundations.


Chapter 3 Exercises → exercises.md

Chapter 3 Quiz → quiz.md

Case Study: Evaluating a "Hot Streak" → case-study-01.md

Case Study: Does Possession Cause Winning? → case-study-02.md


Chapter 3 Complete