Key Takeaways: Statistical Foundations for Soccer Analysis
Core Concepts
1. Descriptive Statistics
- Mean, median, and mode serve different purposes; median is preferred for skewed distributions (e.g., transfer fees, salaries)
- Standard deviation measures spread; lower SD indicates more consistent performance
- Z-scores allow comparison across different metrics by standardizing to a common scale
- Always report both central tendency AND spread when summarizing data
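The z-score idea above can be sketched as follows. The per-90 figures are hypothetical and the helper name `z_scores` is an assumption made for illustration; the point is only that two metrics on very different scales become directly comparable after standardizing.

```python
import numpy as np

# Hypothetical per-90 figures for five players (illustrative only, not real data)
goals_per_90 = np.array([0.20, 0.35, 0.50, 0.65, 0.80])
passes_per_90 = np.array([30.0, 40.0, 50.0, 60.0, 70.0])

def z_scores(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)

z_goals = z_scores(goals_per_90)
z_passes = z_scores(passes_per_90)

# Each player's standing is now expressed in SD units from the group mean,
# regardless of the metric's original scale.
for zg, zp in zip(z_goals, z_passes):
    print(f"goals z = {zg:+.2f}, passes z = {zp:+.2f}")
```

A z-score is only as meaningful as the reference group it is computed against, so the comparison population (league, position, season) should be chosen deliberately.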
2. Probability Foundations
- The Poisson distribution is a natural first model for goal counts: discrete events occurring independently at a roughly constant rate
- Binomial distribution applies to fixed-trial scenarios (e.g., penalty conversion over n attempts)
- The complement rule (P(not A) = 1 - P(A)) is useful for "at least one" probability questions
- Independence must be verified, not assumed—many soccer events are conditionally dependent
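As a small illustration of the complement rule and the binomial case above, the sketch below uses `scipy.stats`. The scoring rate of 1.5 goals per match and the 75% penalty conversion rate are assumed values chosen for the example, not estimates from data.

```python
from scipy.stats import poisson, binom

# Assumed scoring rate: 1.5 expected goals per match (illustrative)
lam = 1.5

# Complement rule: P(at least one goal) = 1 - P(zero goals)
p_at_least_one = 1 - poisson.pmf(0, mu=lam)  # about 0.777

# Binomial: 5 penalties, assumed 75% conversion, independent attempts
n_pens, p_pen = 5, 0.75

# P(X >= 4) is the survival function evaluated at 3, i.e. P(X > 3)
p_four_or_more = binom.sf(3, n_pens, p_pen)  # about 0.633

print(f"P(at least one goal)    = {p_at_least_one:.3f}")
print(f"P(4+ of 5 penalties in) = {p_four_or_more:.3f}")
```

Both calculations lean on the independence and constant-rate assumptions, which is exactly what the last bullet above warns should be checked rather than taken for granted.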
3. Statistical Inference
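- A 95% confidence interval describes the precision of the estimation procedure: across repeated samples, about 95% of such intervals would contain the true value. It is not a 95% probability that this particular interval contains it
- P-values give the probability of observing data at least this extreme if the null hypothesis were true; they are not the probability that the null hypothesis is true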
- Statistical significance ≠ practical importance—large samples can make tiny effects "significant"
- Always pair hypothesis tests with effect size estimates
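To make the gap between statistical and practical significance concrete, here is a minimal sketch with hypothetical numbers: a 50.6% success rate over 100,000 trials tested against a 50% baseline. The effect size used is Cohen's h, one common choice for proportions; it is a choice made for this example rather than something the summary prescribes.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical: 50,600 successes in 100,000 trials vs. a 50% baseline
successes, n, p0 = 50_600, 100_000, 0.50
p_hat = successes / n

# One-sample z-test for a proportion (standard error computed under the null)
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
p_value = 2 * norm.sf(abs(z))  # two-sided p-value

# Cohen's h: effect size for the difference between two proportions
h = 2 * np.arcsin(np.sqrt(p_hat)) - 2 * np.arcsin(np.sqrt(p0))

print(f"z = {z:.2f}, p-value = {p_value:.5f}, Cohen's h = {h:.3f}")
```

The p-value falls well below 0.05, yet the effect size is far under the conventional 0.2 threshold for even a "small" effect, so the result is significant without being obviously useful.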
4. Sample Size and Stabilization
| Metric | Stabilization Point | Key Insight |
|---|---|---|
| Save percentage | 1,000+ shots faced | Goalkeeper eval requires patience |
| Conversion rate | 700+ shots | Single-season data often insufficient |
| Pass completion | 300+ passes | Stabilizes relatively quickly |
| xG per shot | 200+ shots | Shot quality stabilizes faster than outcomes |
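- Larger samples give narrower confidence intervals and more reliable estimates; interval width shrinks in proportion to 1/√n, so halving the width takes roughly four times the data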
- Early-season statistics should be heavily regressed toward prior expectations
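The relationship between sample size and interval width can be sketched with the normal-approximation (Wald) half-width for a proportion. The 10% conversion rate and the shot counts are assumed values for illustration, and `wald_half_width` is a hypothetical helper name.

```python
import numpy as np

def wald_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation 95% CI for a proportion."""
    return z * np.sqrt(p * (1 - p) / n)

p = 0.10                       # assumed true conversion rate (illustrative)
sample_sizes = [50, 200, 800]  # each step is four times the previous one

half_widths = {n: wald_half_width(p, n) for n in sample_sizes}

for n, hw in half_widths.items():
    print(f"n = {n:4d} shots -> 95% CI is roughly {p:.0%} +/- {hw:.1%}")
```

Each fourfold increase in shots halves the half-width, which is why precise finishing estimates need so much more than a handful of matches.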
5. Regression to the Mean
- Extreme performances contain both skill AND luck; subsequent performances will likely be less extreme
- This is a mathematical phenomenon, not a force: the luck component of an extreme observation is unlikely to repeat, so the next observation tends to sit closer to the underlying average
- Formula for regression: Expected future = (sample mean × weight) + (population mean × (1-weight))
- Weight increases with sample size and decreases with measurement noise
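The regression formula above can be sketched as a small function. The summary does not say how the weight is computed, so the form `weight = n / (n + k)` is an assumption (a common shrinkage choice), and the constant `k` is hypothetical; here it borrows the 700-shot conversion-rate stabilization figure from the table purely for illustration.

```python
def regress_to_mean(sample_mean, population_mean, n, k):
    """Blend an observed mean with the population mean.

    weight = n / (n + k) is an ASSUMED form: it grows with sample size n
    and shrinks as k (a stand-in for measurement noise) grows.
    """
    weight = n / (n + k)
    return weight * sample_mean + (1 - weight) * population_mean

# Hypothetical player: 20% conversion on 50 shots, league average 10%
estimate_small = regress_to_mean(0.20, 0.10, n=50, k=700)

# Same observed rate, but on 700 shots: the weight is now exactly 0.5
estimate_large = regress_to_mean(0.20, 0.10, n=700, k=700)

print(f"After  50 shots: expected future conversion = {estimate_small:.3f}")
print(f"After 700 shots: expected future conversion = {estimate_large:.3f}")
```

With only 50 shots the estimate stays close to the league average; the same observed rate over 700 shots is trusted about half as much as the prior.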
6. Correlation and Regression
- Correlation measures linear association strength and direction (range: -1 to +1)
- Correlation ≠ causation—confounding, reverse causation, and coincidence are common
- R² indicates proportion of variance explained, not prediction accuracy
- Multiple regression allows controlling for confounders but cannot prove causation
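A simulated example can show how a confounder produces correlation without causation. Everything below is synthetic: a hypothetical "quality" variable drives both possession and points, and possession is given no direct effect on points by construction.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n_teams = 2000                  # deliberately large, hypothetical sample

# Assumed data-generating process: quality drives both variables,
# and possession has NO direct effect on points.
quality = rng.normal(size=n_teams)
possession = quality + rng.normal(size=n_teams)
points = quality + rng.normal(size=n_teams)

# Raw correlation looks substantial (0.5 in expectation for this setup)
r_raw = np.corrcoef(possession, points)[0, 1]

# Multiple regression: points ~ intercept + possession + quality
X = np.column_stack([np.ones(n_teams), possession, quality])
coefs, *_ = np.linalg.lstsq(X, points, rcond=None)
possession_effect = coefs[1]  # near zero once quality is controlled for

print(f"raw correlation = {r_raw:.2f}")
print(f"possession coefficient controlling for quality = {possession_effect:.2f}")
```

The adjustment works here only because the confounder is known and measured; with real data, unmeasured confounders can remain, which is why regression alone cannot establish causation.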
Critical Formulas
Standard Error
$$SE = \frac{s}{\sqrt{n}}$$
95% Confidence Interval
$$\bar{x} \pm 1.96 \times SE$$
Poisson Probability
$$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}$$
Sample Size for Desired Precision
$$n = \left(\frac{z \cdot \sigma}{E}\right)^2$$
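As a worked instance of the sample-size formula, the sketch below asks how many shots are needed to estimate a conversion rate to within ±2 percentage points at 95% confidence. The 10% conversion rate is an assumed planning value, and for a proportion the standard deviation is taken as σ = √(p(1 − p)).

```python
import math

def required_sample_size(sigma, margin, z=1.96):
    """n = (z * sigma / E)^2, rounded up to a whole observation."""
    return math.ceil((z * sigma / margin) ** 2)

p_assumed = 0.10                                # assumed conversion rate
sigma = math.sqrt(p_assumed * (1 - p_assumed))  # SD of a single 0/1 outcome
n_required = required_sample_size(sigma, margin=0.02)

print(f"Shots needed for +/- 2 points at 95% confidence: {n_required}")  # 865
```

Under these assumptions the answer is 865 shots, which is the same order of magnitude as the stabilization figures quoted for finishing metrics.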
Correlation Coefficient
$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}$$
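The correlation formula can be checked term by term against NumPy. The two short arrays below are small illustrative samples, not real match data.

```python
import numpy as np

x = np.array([58, 52, 62, 45, 55], dtype=float)
y = np.array([75, 68, 82, 52, 70], dtype=float)

# Deviations from each mean
dx = x - x.mean()
dy = y - y.mean()

# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Cross-check against NumPy's built-in correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"manual r = {r_manual:.4f}, numpy r = {r_numpy:.4f}")
```

With only five points, even a very high r comes with a wide confidence interval, so the coefficient should be read alongside the sample size.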
Common Pitfalls to Avoid
1. Small Sample Conclusions
Wrong: "He's scored 5 from 10 shots (50%): elite finisher!"
Right: "50% conversion on 10 shots gives a 95% CI of roughly 19% to 81%; we can't conclude much yet."
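The interval quoted above matches the normal-approximation (Wald) calculation. The sketch below reproduces it and, as an additional point of comparison not taken from the text, computes the Wilson score interval, which generally behaves better at small n. The function names are hypothetical.

```python
import math

def wald_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

wald = wald_ci(5, 10)      # about (0.19, 0.81)
wilson = wilson_ci(5, 10)  # about (0.24, 0.76)

print(f"Wald:   {wald[0]:.2f} to {wald[1]:.2f}")
print(f"Wilson: {wilson[0]:.2f} to {wilson[1]:.2f}")
```

Either way the interval spans everything from poor to exceptional finishing, which is the point of the pitfall.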
2. Misinterpreting P-Values
Wrong: "p = 0.03 means 97% chance the treatment works"
Right: "p = 0.03 means that, if the null hypothesis were true, a result at least this extreme would occur about 3% of the time"
3. Ignoring Base Rates
Wrong: "He scored in 8 straight games—he's on fire!"
Right: "With 500 players, we'd expect ~9 such streaks purely by chance"
4. Confusing Correlation with Causation
Wrong: "Teams with more possession win more, so we should possess the ball more"
Right: "Better teams have more possession AND win more; possession may not be causal"
5. Ignoring Regression to the Mean
Wrong: "He outperformed xG by 8 goals; sign him immediately!"
Right: "Large xG overperformance is usually driven more by luck than by repeatable skill; expect future finishing to sit closer to xG"
Decision Framework
When Evaluating a Player
- Check sample size: Is there enough data for reliable conclusions?
- Calculate confidence intervals: What's the range of plausible true values?
- Consider base rates: How does this compare to similar players?
- Look for regression signals: Is performance above/below expectation?
- Account for context: What confounders might explain the pattern?
When Analyzing Relationships
- Start with scatter plots: Visualize before calculating
- Calculate correlation: Quantify the relationship strength
- Test significance: Is the relationship real or noise?
- Consider alternative explanations: Confounding? Reverse causation?
- Assess practical importance: Does the effect size matter?
Python Quick Reference
import numpy as np
from scipy import stats
# Descriptive statistics
data = [10, 15, 12, 18, 14, 25]
mean = np.mean(data)
median = np.median(data)
std = np.std(data, ddof=1) # Sample SD
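# Confidence interval for mean (n is small here, so use the t critical value rather than 1.96)
n_obs = len(data)
se = std / np.sqrt(n_obs)
t_crit = stats.t.ppf(0.975, df=n_obs - 1)
ci_95 = (mean - t_crit*se, mean + t_crit*se)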
# Confidence interval for proportion
p, n = 0.75, 80
se_prop = np.sqrt(p * (1-p) / n)
ci_prop = (p - 1.96*se_prop, p + 1.96*se_prop)
# Poisson probability
from scipy.stats import poisson
prob_2_goals = poisson.pmf(2, mu=1.5) # P(X=2) when lambda=1.5
# Hypothesis test for proportion
from statsmodels.stats.proportion import proportions_ztest
stat, pval = proportions_ztest(60, 80, value=0.70)
# Correlation
x = [58, 52, 62, 45, 55]
y = [75, 68, 82, 52, 70]
r, p_value = stats.pearsonr(x, y) # p_value avoids overwriting the proportion p defined above
Chapter Summary in One Paragraph
Soccer analytics requires moving beyond intuition to rigorous statistical thinking. Descriptive statistics summarize data but must include measures of spread, not just averages. Probability distributions, especially the Poisson for goals, allow us to quantify uncertainty and assess whether observed patterns exceed chance expectations. Statistical inference via confidence intervals and hypothesis tests provides frameworks for making decisions under uncertainty, but we must always distinguish statistical from practical significance. Sample size determines reliability; most soccer metrics require hundreds of observations to stabilize, making single-season player evaluation inherently uncertain. Regression to the mean implies that extreme performances tend to moderate, and ignoring this leads to systematic evaluation errors. Finally, correlation quantifies relationships but never proves causation: confounding variables, reverse causation, and coincidence must always be considered before drawing causal conclusions.