Learning Objectives
- Implement the Dixon-Coles model from scratch, including its Poisson goal-scoring framework and low-score correlation adjustment
- Build an expected goals (xG) model using logistic regression on shot-level data
- Understand and analyze Asian handicap betting markets, including quarter-ball lines
- Adjust predictive models for league-specific characteristics such as promotion/relegation, squad depth, and tactical style
- Model international and tournament soccer despite small sample sizes and infrequent matches
Chapter 19: Modeling Soccer
"Football is a simple game. Twenty-two men chase a ball for ninety minutes and at the end, the Germans always win." --- Gary Lineker
Lineker's quip, however dated, captures something essential about soccer modeling: the sport produces so few goals per match that individual game outcomes are dominated by randomness. A team that creates three times as many high-quality chances as its opponent can still lose 0--1 on a single counterattack. This inherent low-scoring nature makes soccer both frustrating and rewarding for the quantitative bettor. Frustrating because models are punished by variance in the short run; rewarding because the betting markets---particularly in the hundreds of leagues worldwide---offer deep, liquid opportunities for those who model the sport correctly.
This chapter develops the core quantitative framework for soccer modeling. We begin with the foundational Dixon-Coles model, which has been the starting point for serious soccer modelers since its publication in 1997. We then extend into expected goals (xG), the revolution in soccer analytics that has transformed how we evaluate team and player performance. From there, we move to the practical domain of Asian handicap markets, which dominate professional soccer betting. Finally, we address the complexities of league-specific adjustments and tournament modeling, which distinguish the competent modeler from the exceptional one.
Throughout this chapter, we assume familiarity with Poisson distributions (Chapter 7) and regression methods (Chapter 9). If those topics are not fresh, a brief review before proceeding will be well worth your time.
Chapter Overview
Soccer presents a unique set of modeling challenges compared to the North American sports covered in earlier chapters. The low-scoring nature of the game means that goals are approximately Poisson-distributed, which provides a natural starting point for modeling. However, the simplistic independent Poisson model---where each team's goals are treated as independent random variables---fails to capture important features of real match data, particularly the correlation between scores at low goal counts.
The Dixon-Coles model (1997) addressed this limitation with an elegant correction factor that adjusts the probabilities of scorelines like 0--0, 1--0, 0--1, and 1--1. This model, along with its extensions, remains the workhorse of soccer modeling nearly three decades after its introduction.
Expected goals (xG) represents a more recent revolution. Rather than modeling match outcomes directly, xG evaluates the quality of individual chances, providing a granular measure of team performance that is far more predictive than raw goal tallies. Building an xG model requires shot-level data and classification techniques, typically logistic regression or gradient-boosted trees.
On the market side, soccer betting is dominated by Asian handicap lines rather than the fixed-odds 1X2 markets familiar to casual bettors. Understanding the mechanics of Asian handicaps---including the often-confusing quarter-ball lines---is essential for anyone serious about soccer betting.
Finally, the global nature of soccer introduces modeling challenges that do not exist in, say, the NFL. A model calibrated for the English Premier League may perform poorly when applied to the Argentinian Primera Division. International tournaments introduce yet another layer of complexity, with small sample sizes, inconsistent team compositions, and host-nation effects all demanding careful treatment.
19.1 The Dixon-Coles Model
Historical Context and Motivation
In 1997, Mark Dixon and Stuart Coles published "Modelling Association Football Scores and Inefficiencies in the Football Betting Market" in the Journal of the Royal Statistical Society, Series C. The paper proposed what would become the most influential statistical model in soccer analytics. Their motivation was both academic and practical: the standard independent Poisson model for soccer scores, while a reasonable first approximation, systematically mispriced certain low-scoring outcomes.
The independent Poisson model works as follows. Each team $i$ is assigned an attack strength parameter $\alpha_i$ and a defense strength parameter $\beta_i$. When team $i$ plays team $j$ at home, the expected number of goals scored by the home team is:
$$\lambda_{ij} = \alpha_i \cdot \beta_j \cdot \gamma$$
where $\gamma$ is a home-advantage parameter. The expected number of goals scored by the away team is:
$$\mu_{ij} = \alpha_j \cdot \beta_i$$
Under the independent Poisson assumption, the probability of a specific scoreline $(x, y)$ is:
$$P(X_{ij} = x, Y_{ij} = y) = \frac{e^{-\lambda_{ij}} \lambda_{ij}^x}{x!} \cdot \frac{e^{-\mu_{ij}} \mu_{ij}^y}{y!}$$
This model captures the broad structure of soccer scoring well. Teams with strong attacks and teams facing weak defenses tend to score more goals, and the Poisson distribution matches the empirical distribution of goal counts reasonably well. However, Dixon and Coles identified a systematic problem: the independent model underestimates the frequency of draws (particularly 0--0 and 1--1) and overestimates certain low-scoring decisive results (1--0 and 0--1).
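Before adding the Dixon-Coles correction, it is worth seeing how little code the independent model requires. A minimal sketch with illustrative intensities (not fitted values): the scoreline matrix is an outer product of two Poisson pmfs, and the 1X2 probabilities are its triangular sums.

```python
import numpy as np
from scipy.stats import poisson

# Illustrative intensities for a home fixture (not fitted values)
lam, mu = 1.5, 1.2  # expected home and away goals

# Independent-Poisson probability of every scoreline up to 8 goals:
# probs[x, y] = P(home scores x) * P(away scores y)
goals = np.arange(9)
probs = np.outer(poisson.pmf(goals, lam), poisson.pmf(goals, mu))

p_home = np.tril(probs, k=-1).sum()  # cells where home > away
p_draw = np.trace(probs)             # diagonal: equal scores
p_away = np.triu(probs, k=1).sum()   # cells where away > home
print(f"1X2 under independence: {p_home:.3f} / {p_draw:.3f} / {p_away:.3f}")
print(f"P(0-0) = {probs[0, 0]:.3f}, P(1-1) = {probs[1, 1]:.3f}")
```

Truncating at 8 goals loses a negligible amount of probability mass for realistic intensities, which is why the matrix sums to essentially 1.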
The Correlation Adjustment
Dixon and Coles introduced a correction factor $\tau(x, y, \lambda, \mu, \rho)$ that multiplies the independent Poisson probabilities for low-scoring outcomes. The parameter $\rho$ (rho) captures the correlation between home and away goals at low scores:
$$\tau(x, y, \lambda, \mu, \rho) = \begin{cases} 1 - \lambda \mu \rho & \text{if } x = 0 \text{ and } y = 0 \\ 1 + \lambda \rho & \text{if } x = 0 \text{ and } y = 1 \\ 1 + \mu \rho & \text{if } x = 1 \text{ and } y = 0 \\ 1 - \rho & \text{if } x = 1 \text{ and } y = 1 \\ 1 & \text{otherwise} \end{cases}$$
The full Dixon-Coles probability of a scoreline $(x, y)$ is therefore:
$$P(X_{ij} = x, Y_{ij} = y) = \tau(x, y, \lambda_{ij}, \mu_{ij}, \rho) \cdot \frac{e^{-\lambda_{ij}} \lambda_{ij}^x}{x!} \cdot \frac{e^{-\mu_{ij}} \mu_{ij}^y}{y!}$$
When $\rho < 0$ (the typical case in practice), the adjustment increases the probability of the 0--0 and 1--1 draws (since $1 - \lambda\mu\rho > 1$ and $1 - \rho > 1$ when $\rho < 0$) and decreases the probability of the 1--0 and 0--1 scorelines. This matches the empirical observation that low-scoring draws occur more frequently than an independent Poisson model predicts.
The magnitude of $\rho$ is typically small (often in the range $-0.10$ to $-0.15$), but its effect on match outcome probabilities is non-trivial, particularly for matches between evenly matched teams where the draw probability is significant.
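The effect of $\tau$ can be seen directly by reweighting the four low-score cells of the independent matrix. A minimal sketch with illustrative values of $\lambda$, $\mu$, and $\rho$ (not fitted parameters); note that the four adjustments cancel exactly, so total probability is preserved:

```python
import numpy as np
from scipy.stats import poisson

lam, mu, rho = 1.4, 1.1, -0.10  # illustrative, not fitted

goals = np.arange(9)
indep = np.outer(poisson.pmf(goals, lam), poisson.pmf(goals, mu))

# Apply the Dixon-Coles tau factor to the four low-score cells
adj = indep.copy()
adj[0, 0] *= 1 - lam * mu * rho   # 0-0 boosted when rho < 0
adj[0, 1] *= 1 + lam * rho        # 0-1 reduced
adj[1, 0] *= 1 + mu * rho         # 1-0 reduced
adj[1, 1] *= 1 - rho              # 1-1 boosted

draw_indep = np.trace(indep)
draw_adj = np.trace(adj)
print(f"Draw prob: {draw_indep:.4f} (independent) -> {draw_adj:.4f} (Dixon-Coles)")
```

Even with $|\rho| = 0.10$, the draw probability moves by a percentage point or so, which is material relative to typical betting margins.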
Parameter Estimation via Maximum Likelihood
The model has $2n + 2$ parameters for a league with $n$ teams: $n$ attack parameters $\alpha_1, \ldots, \alpha_n$, $n$ defense parameters $\beta_1, \ldots, \beta_n$, a home-advantage parameter $\gamma$, and the correlation parameter $\rho$. Because multiplying every attack by a constant and dividing every defense by the same constant leaves the likelihood unchanged, the parameters are not identified without a constraint. We impose:
$$\sum_{i=1}^{n} \alpha_i = n$$
This anchors the attack parameters so that the average attack strength is 1, and the remaining parameters are identified relative to this average.
Given a dataset of observed match results $\{(x_k, y_k, i_k, j_k, t_k)\}$ where $x_k$ is home goals, $y_k$ is away goals, $i_k$ is the home team, $j_k$ is the away team, and $t_k$ is the match date, the log-likelihood function is:
$$\ell(\boldsymbol{\theta}) = \sum_{k} \left[ \log \tau(x_k, y_k, \lambda_k, \mu_k, \rho) + x_k \log \lambda_k - \lambda_k - \log(x_k!) + y_k \log \mu_k - \mu_k - \log(y_k!) \right]$$
where $\boldsymbol{\theta} = (\alpha_1, \ldots, \alpha_n, \beta_1, \ldots, \beta_n, \gamma, \rho)$ is the vector of parameters, $\lambda_k = \alpha_{i_k} \cdot \beta_{j_k} \cdot \gamma$, and $\mu_k = \alpha_{j_k} \cdot \beta_{i_k}$.
Time-Decay Weighting
Dixon and Coles recognized that recent matches are more informative than older matches for predicting future outcomes. They introduced a time-decay weighting function that down-weights older observations:
$$\phi(t) = e^{-\xi (T - t)}$$
where $T$ is the current date, $t$ is the match date, and $\xi > 0$ is the decay rate. The weighted log-likelihood becomes:
$$\ell(\boldsymbol{\theta}) = \sum_{k} \phi(t_k) \left[ \log \tau(x_k, y_k, \lambda_k, \mu_k, \rho) + x_k \log \lambda_k - \lambda_k + y_k \log \mu_k - \mu_k - \log(x_k!) - \log(y_k!) \right]$$
The decay parameter $\xi$ is typically chosen via cross-validation. Dixon and Coles used a half-life of roughly one year for English football, corresponding to $\xi \approx 0.0019$ per day (since $e^{-0.0019 \times 365} \approx 0.50$). However, the optimal decay rate varies by league. Leagues with higher player turnover (such as the MLS) may benefit from faster decay, while more stable leagues may warrant slower decay.
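The correspondence between half-life and decay rate is simply $\xi = \ln 2 / h$ for a half-life of $h$ days. A quick check of the figure quoted above:

```python
import numpy as np

def xi_from_half_life(half_life_days: float) -> float:
    """Decay rate such that a match half_life_days old gets weight 0.5."""
    return np.log(2) / half_life_days

xi = xi_from_half_life(365)
print(f"xi for a one-year half-life: {xi:.5f} per day")      # ~0.0019
print(f"weight of a 365-day-old match: {np.exp(-xi * 365):.3f}")  # 0.500
```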
Python Implementation from Scratch
The following implementation builds the full Dixon-Coles model. We use SciPy's optimization routines for maximum likelihood estimation.
"""
Dixon-Coles model for soccer match prediction.
Implements the bivariate Poisson model with correlation adjustment
as described in Dixon and Coles (1997).
"""
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy.stats import poisson
from typing import Dict, Tuple, Optional
import warnings
class DixonColesModel:
"""
Full Dixon-Coles model with time-decay weighting.
Parameters
----------
xi : float
Time-decay rate per day. Default 0.0019 (~1 year half-life).
max_goals : int
Maximum number of goals to consider in probability matrix.
"""
def __init__(self, xi: float = 0.0019, max_goals: int = 8):
self.xi = xi
self.max_goals = max_goals
self.params: Optional[Dict[str, float]] = None
self.teams: Optional[list] = None
@staticmethod
def tau(x: int, y: int, lam: float, mu: float, rho: float) -> float:
"""
Dixon-Coles correlation adjustment factor.
Adjusts probabilities for scorelines 0-0, 0-1, 1-0, and 1-1.
"""
if x == 0 and y == 0:
return 1.0 - lam * mu * rho
elif x == 0 and y == 1:
return 1.0 + lam * rho
elif x == 1 and y == 0:
return 1.0 + mu * rho
elif x == 1 and y == 1:
return 1.0 - rho
else:
return 1.0
def _build_param_vector(self, teams: list) -> np.ndarray:
"""Initialize parameter vector: attacks, defenses, home, rho."""
n = len(teams)
# attacks (n) + defenses (n) + home (1) + rho (1)
params = np.zeros(2 * n + 2)
params[:n] = 0.0 # log attack strengths (exp will give 1.0)
params[n:2*n] = 0.0 # log defense strengths
params[2*n] = 0.2 # log home advantage
params[2*n + 1] = -0.05 # rho
return params
    def _unpack_params(self, param_vec: np.ndarray, teams: list) -> Dict:
        """Convert parameter vector to named dictionary."""
        n = len(teams)
        # Center the log attacks so the reported strengths satisfy the
        # same identifiability constraint used inside the likelihood
        # (geometric mean of attack = 1). Without this, the flat
        # direction left by the optimizer could shift all attack values
        # by an arbitrary common factor.
        log_attack = param_vec[:n] - np.mean(param_vec[:n])
        params = {}
        for i, team in enumerate(teams):
            params[f"attack_{team}"] = np.exp(log_attack[i])
            params[f"defense_{team}"] = np.exp(param_vec[n + i])
        params["home"] = np.exp(param_vec[2 * n])
        params["rho"] = param_vec[2 * n + 1]
        return params
def _neg_log_likelihood(
self,
param_vec: np.ndarray,
home_teams: np.ndarray,
away_teams: np.ndarray,
home_goals: np.ndarray,
away_goals: np.ndarray,
weights: np.ndarray,
team_to_idx: Dict[str, int],
n_teams: int,
) -> float:
"""Compute negative weighted log-likelihood."""
# Unpack parameters
log_attack = param_vec[:n_teams]
log_defense = param_vec[n_teams:2*n_teams]
log_home = param_vec[2*n_teams]
rho = param_vec[2*n_teams + 1]
# Enforce constraint: mean of log_attack = 0
# (equivalent to geometric mean of attack = 1)
log_attack_centered = log_attack - np.mean(log_attack)
nll = 0.0
for k in range(len(home_goals)):
hi = team_to_idx[home_teams[k]]
ai = team_to_idx[away_teams[k]]
x = int(home_goals[k])
y = int(away_goals[k])
lam = np.exp(log_attack_centered[hi] + log_defense[ai] + log_home)
mu = np.exp(log_attack_centered[ai] + log_defense[hi])
# Poisson log-likelihoods
log_p_home = poisson.logpmf(x, lam)
log_p_away = poisson.logpmf(y, mu)
# Correlation adjustment
tau_val = self.tau(x, y, lam, mu, rho)
if tau_val <= 0:
tau_val = 1e-10 # numerical safety
log_lik = np.log(tau_val) + log_p_home + log_p_away
nll -= weights[k] * log_lik
return nll
def fit(self, df: pd.DataFrame, current_date: Optional[str] = None):
"""
Fit the Dixon-Coles model to match data.
Parameters
----------
df : DataFrame
Must contain columns: 'date', 'home_team', 'away_team',
'home_goals', 'away_goals'.
current_date : str, optional
Reference date for time decay (default: max date in data).
"""
df = df.copy()
df['date'] = pd.to_datetime(df['date'])
if current_date is None:
current_date = df['date'].max()
else:
current_date = pd.to_datetime(current_date)
# Time decay weights
days_ago = (current_date - df['date']).dt.days.values.astype(float)
weights = np.exp(-self.xi * days_ago)
# Team indexing
self.teams = sorted(
set(df['home_team'].unique()) | set(df['away_team'].unique())
)
team_to_idx = {t: i for i, t in enumerate(self.teams)}
n = len(self.teams)
# Initial parameters
x0 = self._build_param_vector(self.teams)
# Optimize
result = minimize(
self._neg_log_likelihood,
x0,
args=(
df['home_team'].values,
df['away_team'].values,
df['home_goals'].values,
df['away_goals'].values,
weights,
team_to_idx,
n,
),
method='L-BFGS-B',
options={'maxiter': 1000, 'disp': False},
)
if not result.success:
warnings.warn(f"Optimization did not converge: {result.message}")
self.params = self._unpack_params(result.x, self.teams)
return self
def predict_scoreline_probs(
self, home_team: str, away_team: str
) -> np.ndarray:
"""
Compute probability matrix for all scorelines up to max_goals.
Returns
-------
probs : ndarray of shape (max_goals+1, max_goals+1)
probs[i, j] = P(home scores i, away scores j)
"""
if self.params is None:
raise ValueError("Model must be fitted before prediction.")
attack_h = self.params[f"attack_{home_team}"]
defense_h = self.params[f"defense_{home_team}"]
attack_a = self.params[f"attack_{away_team}"]
defense_a = self.params[f"defense_{away_team}"]
home_adv = self.params["home"]
rho = self.params["rho"]
lam = attack_h * defense_a * home_adv
mu = attack_a * defense_h
mg = self.max_goals + 1
probs = np.zeros((mg, mg))
for i in range(mg):
for j in range(mg):
p_ind = poisson.pmf(i, lam) * poisson.pmf(j, mu)
tau_val = self.tau(i, j, lam, mu, rho)
probs[i, j] = p_ind * tau_val
return probs
def predict_match_odds(
self, home_team: str, away_team: str
) -> Dict[str, float]:
"""
Predict 1X2 probabilities for a match.
Returns dict with keys 'home', 'draw', 'away'.
"""
probs = self.predict_scoreline_probs(home_team, away_team)
home_win = np.sum(np.tril(probs, k=-1)) # home > away
draw = np.sum(np.diag(probs))
away_win = np.sum(np.triu(probs, k=1)) # away > home
# Normalize to account for truncation at max_goals
total = home_win + draw + away_win
return {
'home': home_win / total,
'draw': draw / total,
'away': away_win / total,
}
def get_team_strengths(self) -> pd.DataFrame:
"""Return a DataFrame of attack and defense strengths."""
if self.params is None:
raise ValueError("Model must be fitted before querying.")
records = []
for team in self.teams:
records.append({
'team': team,
'attack': self.params[f"attack_{team}"],
'defense': self.params[f"defense_{team}"],
})
return pd.DataFrame(records).sort_values(
'attack', ascending=False
).reset_index(drop=True)
# --- Demonstration ---
if __name__ == "__main__":
# Create sample Premier League data
np.random.seed(42)
teams = [
"Arsenal", "Aston Villa", "Chelsea", "Liverpool",
"Man City", "Man United", "Newcastle", "Tottenham",
]
# Generate synthetic match results for demonstration
dates = pd.date_range("2024-01-01", periods=56, freq="7D")
matches = []
match_idx = 0
for home in teams:
for away in teams:
if home != away:
matches.append({
'date': dates[match_idx % len(dates)],
'home_team': home,
'away_team': away,
'home_goals': np.random.poisson(1.5),
'away_goals': np.random.poisson(1.1),
})
match_idx += 1
df = pd.DataFrame(matches)
# Fit model
model = DixonColesModel(xi=0.0019)
model.fit(df)
# Predict a match
odds = model.predict_match_odds("Liverpool", "Chelsea")
print("Liverpool vs Chelsea (1X2 probabilities):")
print(f" Home win: {odds['home']:.3f}")
print(f" Draw: {odds['draw']:.3f}")
print(f" Away win: {odds['away']:.3f}")
print()
# Show team strengths
strengths = model.get_team_strengths()
print("Team Strengths:")
print(strengths.to_string(index=False))
print()
print(f"Correlation parameter (rho): {model.params['rho']:.4f}")
print(f"Home advantage: {model.params['home']:.4f}")
Interpreting the Model Output
The fitted Dixon-Coles model produces several interpretable outputs:
- Attack strengths ($\alpha_i$): Values greater than 1.0 indicate teams with above-average attacking ability. A team with $\alpha = 1.30$ is expected to score 30% more goals than the average team, holding the opponent fixed.
- Defense strengths ($\beta_i$): These are inverted in interpretation---a higher $\beta_i$ means the team concedes more goals. Some implementations reparameterize so that lower values represent stronger defenses.
- Home advantage ($\gamma$): Typically around 1.20--1.40 for major European leagues, meaning the home team scores 20--40% more than it would at a neutral venue.
- Correlation ($\rho$): Usually small and negative ($-0.05$ to $-0.15$), confirming the tendency for low-scoring matches to cluster toward draws more than independence predicts.
The scoreline probability matrix is particularly useful. For a fairly even home fixture with $\lambda = 1.4$ and $\mu = 1.1$, the most likely individual scoreline is typically 1--1 or 1--0, but the home win is the most likely outcome category because it aggregates all scorelines where the home team scores more.
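This distinction between the most likely scoreline and the most likely outcome category is easy to verify numerically. A sketch using the independent Poisson matrix for simplicity, with the $\lambda = 1.4$, $\mu = 1.1$ values from the example above:

```python
import numpy as np
from scipy.stats import poisson

lam, mu = 1.4, 1.1  # intensities from the text example

goals = np.arange(9)
probs = np.outer(poisson.pmf(goals, lam), poisson.pmf(goals, mu))

# Single most likely scoreline vs. most likely outcome category
best = np.unravel_index(np.argmax(probs), probs.shape)
p_home = np.tril(probs, k=-1).sum()
p_draw = np.trace(probs)
print(f"Most likely scoreline: {best[0]}-{best[1]} ({probs[best]:.3f})")
print(f"P(home win) = {p_home:.3f} vs P(draw) = {p_draw:.3f}")
```

The single likeliest scoreline carries only about 13% probability, while the home-win category, summing many scorelines, is far larger.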
Market Insight: The Value in Correct Scores
The scoreline probability matrix from Dixon-Coles can be directly compared to correct-score market odds. Because correct-score markets carry higher margins (often 20--40% overround), there can be significant value when your model identifies specific scorelines that the market underprices. The 0--0 draw is frequently mispriced because recreational bettors underestimate its probability.
19.2 Expected Goals (xG) in Soccer
What xG Measures
Expected goals (xG) is a metric that quantifies the quality of scoring chances by assigning a probability to each shot based on its characteristics. A penalty kick might have an xG value of 0.76, meaning that historically, 76% of penalties with similar characteristics result in goals. A long-range effort from 30 yards might carry an xG of 0.03. By summing the xG values of all shots taken by a team in a match, we obtain the team's total match xG---a measure of how many goals the team "deserved" to score based on the quality and quantity of their chances.
The conceptual breakthrough of xG is the separation of chance creation from finishing quality. Goals are noisy. A team can create five excellent chances (total xG of 2.5) and score none, while their opponent converts a single speculative effort (xG of 0.05) and wins 1--0. Over a single match, actual goals are a poor measure of team quality. Over a full season, xG converges with actual goals, but xG provides a much faster signal of true team strength.
For the bettor, xG is valuable because:
- xG correlates with future performance more strongly than past goals do, especially early in a season.
- xG differential (xG created minus xG conceded) is one of the best single-number predictors of future league standings.
- Teams that significantly overperform their xG (scoring more goals than expected) tend to regress, creating betting opportunities on the "under" or against them on the money line.
- Teams that significantly underperform their xG represent potential value bets.
Building an xG Model from Shot Data
An xG model is fundamentally a binary classification problem: for each shot, we predict the probability that it results in a goal. The features typically include:
| Feature | Description | Typical Effect |
|---|---|---|
| Distance to goal | Euclidean distance from shot location to center of goal | Negative (farther = lower xG) |
| Angle to goal | Angle subtended by the goal from the shot location | Positive (wider angle = higher xG) |
| Body part | Foot, head, or other | Headers typically lower xG |
| Shot type | Open play, set piece, penalty, free kick | Penalties highest xG |
| Assist type | Through ball, cross, pull-back, corner, none | Through balls typically higher xG |
| Previous action | Dribble, pass, rebound, direct play | Rebounds higher xG |
| Number of defenders | Defensive pressure at time of shot | More defenders = lower xG |
| Goalkeeper position | Distance of GK from goal line | GK off line = higher xG |
| Game state | Scoreline at time of shot | Leading teams may have lower-quality shots |
| Fast break | Whether the shot came from a counter-attack | Counter-attacks higher xG |
The most basic xG model uses only distance and angle. More sophisticated models incorporate all of the above features and more, often using tracking data that captures player positions at the moment of the shot.
Logistic Regression Approach
Logistic regression is the natural starting point for xG modeling. The response variable $Y$ is binary (1 = goal, 0 = no goal), and we model the log-odds of scoring as a linear function of features:
$$\log \frac{P(Y=1|\mathbf{x})}{1 - P(Y=1|\mathbf{x})} = \beta_0 + \beta_1 x_{\text{dist}} + \beta_2 x_{\text{angle}} + \beta_3 x_{\text{head}} + \cdots$$
The predicted probability of a goal is then:
$$\hat{P}(Y=1|\mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots)}} = \text{xG}$$
The model is fit by maximizing the log-likelihood:
$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{N} \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]$$
where $\hat{p}_i$ is the predicted probability for shot $i$.
Python Code: Building an xG Model
"""
Expected Goals (xG) model using logistic regression.
Builds a shot-level model from features including distance,
angle, body part, and assist type.
"""
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.calibration import calibration_curve
from sklearn.metrics import (
brier_score_loss,
log_loss,
roc_auc_score,
)
import matplotlib.pyplot as plt
from typing import Tuple
def compute_shot_features(
x: np.ndarray, y: np.ndarray, goal_x: float = 120.0, goal_y: float = 40.0
) -> Tuple[np.ndarray, np.ndarray]:
"""
Compute distance and angle to goal from shot coordinates.
Parameters
----------
x, y : ndarray
Shot coordinates (pitch length 120, width 80).
goal_x : float
x-coordinate of the goal line.
goal_y : float
y-coordinate of the center of the goal.
Returns
-------
distance : ndarray
Distance from shot location to center of goal.
angle : ndarray
Angle subtended by the goal (in radians).
"""
# Distance to center of goal
distance = np.sqrt((goal_x - x)**2 + (goal_y - y)**2)
    # Angle subtended by the goal posts. On the 120x80 (yard) pitch
    # grid, the 8-yard goal spans goal_y - 4 to goal_y + 4.
    goal_width = 8.0
    # Law of cosines on the triangle (shot location, post 1, post 2)
    d_post1 = np.sqrt((goal_x - x)**2 + (goal_y - goal_width/2 - y)**2)
    d_post2 = np.sqrt((goal_x - x)**2 + (goal_y + goal_width/2 - y)**2)
cos_angle = (d_post1**2 + d_post2**2 - goal_width**2) / (
2 * d_post1 * d_post2
)
cos_angle = np.clip(cos_angle, -1, 1) # numerical safety
angle = np.arccos(cos_angle)
return distance, angle
def build_xg_model(shots_df: pd.DataFrame) -> dict:
"""
Build and evaluate an xG model from shot-level data.
Parameters
----------
shots_df : DataFrame
Must contain columns: 'x', 'y', 'is_goal', 'body_part',
'situation' (open_play, set_piece, penalty, free_kick).
Returns
-------
dict with keys: 'model', 'feature_names', 'metrics'.
"""
df = shots_df.copy()
# Compute geometric features
distance, angle = compute_shot_features(df['x'].values, df['y'].values)
df['distance'] = distance
df['angle'] = angle
df['log_distance'] = np.log(distance + 1)
df['distance_sq'] = distance ** 2
# Encode categorical features
df['is_header'] = (df['body_part'] == 'head').astype(int)
df['is_penalty'] = (df['situation'] == 'penalty').astype(int)
df['is_free_kick'] = (df['situation'] == 'free_kick').astype(int)
df['is_set_piece'] = (df['situation'] == 'set_piece').astype(int)
# Feature matrix
feature_cols = [
'distance', 'angle', 'log_distance', 'distance_sq',
'is_header', 'is_penalty', 'is_free_kick', 'is_set_piece',
]
X = df[feature_cols].values
y = df['is_goal'].values
# Fit logistic regression
model = LogisticRegression(
penalty='l2',
C=1.0,
max_iter=1000,
solver='lbfgs',
)
model.fit(X, y)
# Predictions
y_pred_proba = model.predict_proba(X)[:, 1]
# Evaluation metrics
brier = brier_score_loss(y, y_pred_proba)
logloss = log_loss(y, y_pred_proba)
auc = roc_auc_score(y, y_pred_proba)
# Cross-validated Brier score
cv_brier = -cross_val_score(
model, X, y, cv=5, scoring='neg_brier_score'
).mean()
# Feature importances (coefficients)
coef_df = pd.DataFrame({
'feature': feature_cols,
'coefficient': model.coef_[0],
}).sort_values('coefficient', key=abs, ascending=False)
print("=== xG Model Summary ===")
print(f" Samples: {len(y)}")
print(f" Goal rate: {y.mean():.4f}")
print(f" Brier score: {brier:.4f}")
print(f" CV Brier score: {cv_brier:.4f}")
print(f" Log loss: {logloss:.4f}")
print(f" ROC AUC: {auc:.4f}")
print(f"\nFeature Coefficients:")
print(coef_df.to_string(index=False))
return {
'model': model,
'feature_names': feature_cols,
'metrics': {
'brier': brier,
'cv_brier': cv_brier,
'log_loss': logloss,
'auc': auc,
},
}
def predict_xg(
model_dict: dict,
x: float,
y: float,
body_part: str = 'foot',
situation: str = 'open_play',
) -> float:
"""Predict xG for a single shot."""
distance, angle = compute_shot_features(
np.array([x]), np.array([y])
)
features = np.array([[
distance[0],
angle[0],
np.log(distance[0] + 1),
distance[0] ** 2,
int(body_part == 'head'),
int(situation == 'penalty'),
int(situation == 'free_kick'),
int(situation == 'set_piece'),
]])
return model_dict['model'].predict_proba(features)[0, 1]
# --- Demonstration with synthetic data ---
if __name__ == "__main__":
np.random.seed(42)
n_shots = 5000
# Generate synthetic shot data
x_coord = np.random.uniform(85, 120, n_shots)
y_coord = np.random.uniform(15, 65, n_shots)
distance, angle = compute_shot_features(x_coord, y_coord)
body_parts = np.random.choice(
['foot', 'head'], n_shots, p=[0.75, 0.25]
)
situations = np.random.choice(
['open_play', 'set_piece', 'penalty', 'free_kick'],
n_shots, p=[0.70, 0.15, 0.05, 0.10]
)
# Generate goal outcomes based on realistic probabilities
base_prob = 0.05 + 0.25 * np.exp(-distance / 15) + 0.15 * angle
base_prob[body_parts == 'head'] *= 0.7
base_prob[situations == 'penalty'] = 0.76
base_prob = np.clip(base_prob, 0.01, 0.95)
is_goal = np.random.binomial(1, base_prob)
shots_df = pd.DataFrame({
'x': x_coord,
'y': y_coord,
'is_goal': is_goal,
'body_part': body_parts,
'situation': situations,
})
# Build model
result = build_xg_model(shots_df)
# Predict specific shots
print("\n=== Sample Predictions ===")
# Penalty
xg_pen = predict_xg(result, x=109, y=40, situation='penalty')
print(f"Penalty kick: xG = {xg_pen:.3f}")
# Close range
xg_close = predict_xg(result, x=115, y=40, body_part='foot')
print(f"Close range (foot): xG = {xg_close:.3f}")
# Header from cross
xg_header = predict_xg(result, x=113, y=38, body_part='head')
print(f"Header (close): xG = {xg_header:.3f}")
# Long range
xg_long = predict_xg(result, x=95, y=40, body_part='foot')
print(f"Long range (foot): xG = {xg_long:.3f}")
Calibration and Evaluation
A well-calibrated xG model should satisfy the property that among all shots assigned an xG of, say, 0.15, approximately 15% actually result in goals. Calibration can be assessed visually using calibration plots and quantitatively using the Brier score.
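The calibration check can be run with scikit-learn's `calibration_curve`. A sketch on synthetic stand-ins (beta-distributed "true" probabilities and simulated outcomes are assumptions for illustration; in practice you would pass held-out shots and your model's xG values):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# Synthetic stand-ins: skewed-low shot probabilities and simulated outcomes
p_true = rng.beta(1.2, 9.0, size=20000)
outcomes = rng.binomial(1, p_true)

# For a calibrated model, mean predicted xG per bin should track the
# empirical goal rate in that bin
prob_true, prob_pred = calibration_curve(
    outcomes, p_true, n_bins=10, strategy="quantile"
)
for emp, pred in zip(prob_true, prob_pred):
    print(f"predicted {pred:.3f} -> empirical {emp:.3f}")
```

Quantile binning keeps every bin well-populated, which matters when most shots carry low xG.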
The Brier score is defined as:
$$BS = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2$$
where $p_i$ is the predicted probability and $o_i$ is the observed outcome (0 or 1). A perfect model achieves a Brier score of 0, while predicting the base rate for every shot yields the "baseline" Brier score. A well-built xG model for top-flight soccer typically achieves a Brier score in the range 0.07--0.09, compared to a baseline of approximately 0.09--0.10 (since the overall goal rate is roughly 10%).
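The baseline figure is easy to verify: predicting the base rate $p$ for every shot yields a Brier score of $p(1-p)$ in expectation. A quick simulation with an assumed 10% goal rate:

```python
import numpy as np

rng = np.random.default_rng(7)
p = 0.10                                   # assumed overall goal rate
outcomes = rng.binomial(1, p, size=200_000)

# Brier score of the constant base-rate prediction
baseline = np.mean((p - outcomes) ** 2)
print(f"Simulated baseline Brier: {baseline:.4f}")  # ~= p*(1-p) = 0.09
```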
The relatively modest improvement over the baseline reflects a fundamental truth about soccer: even the best xG models explain only a fraction of the variance in individual shot outcomes. The value of xG lies not in predicting individual shots but in aggregating across many shots to obtain a more reliable measure of team quality than raw goals.
Using xG for Betting
The practical application of xG to betting follows several paths:
- xG-based power ratings: Replace raw goals with xG values in your Dixon-Coles model to obtain "xG-adjusted" team strengths. Teams whose attack strength is elevated by a few lucky goals will be properly discounted, while teams whose attack strength is depressed by poor finishing will be rated more accurately.
- Regression signals: When a team's actual goals significantly exceed their xG over the first 10--15 matches of a season, bet against them (or the under). The reverse applies for teams underperforming their xG.
- In-play xG: Real-time xG accumulation during a match provides information about which team is controlling the game. A team trailing 0--1 but winning the xG battle 2.5 to 0.8 is likely the better team and may offer value on in-play markets.
Market Insight: The xG Regression Trade
One of the most consistently profitable angles in soccer betting is identifying teams whose actual goals differ significantly from their xG. In the early weeks of a season, a team that has scored 12 goals from 6.0 xG is almost certainly due for regression. The market, influenced by recency bias and raw results, often fails to fully price in this regression. This is particularly exploitable in second-tier leagues where market efficiency is lower.
19.3 Asian Handicaps and Betting Markets
How Asian Handicaps Work
Asian handicaps (AH) are the dominant form of soccer betting in professional markets. Unlike the European 1X2 market, which offers three outcomes (home win, draw, away win), Asian handicaps eliminate the draw by applying a fractional goal advantage to one side. This simplifies the market to a two-outcome proposition and reduces the bookmaker's margin.
Asian handicaps come in three varieties:
Whole-ball handicaps (e.g., -1, -2, +1): These work like American point spreads. If you bet on a team at -1, they must win by 2 or more goals for your bet to win. If they win by exactly 1, the bet is a push (stake returned).
Half-ball handicaps (e.g., -0.5, -1.5, +0.5): There is no possibility of a push. A team at -1.5 must win by 2 or more. A team at +0.5 wins the bet if they draw or win outright.
Quarter-ball handicaps (e.g., -0.25, -0.75, +1.25): These are the most distinctive feature of Asian markets and the most confusing for newcomers. A quarter-ball line is equivalent to placing two half-bets on the adjacent half-ball lines. For example:
- A bet on Team A at -0.25 is equivalent to half the stake on -0 (draw no bet) and half on -0.5.
- A bet on Team A at -0.75 is equivalent to half on -0.5 and half on -1.0.
- A bet on Team B at +1.25 is equivalent to half on +1.0 and half on +1.5.
This split-line mechanic means you can win one half and lose the other, or win one half and push the other. The possible outcomes for a -0.25 bet are:
| Match Result | -0 Half | -0.5 Half | Overall |
|---|---|---|---|
| Home wins by 1+ | Win | Win | Full win |
| Draw | Push | Lose | Lose half stake |
| Home loses | Lose | Lose | Full loss |
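The table's payoff arithmetic can be reproduced directly. This is a minimal sketch for the home side at -0.25 only (the helper name is illustrative; a general resolver appears later in this section):

```python
def quarter_pnl_home_minus_025(goal_margin: int, odds: float) -> float:
    """PnL per unit stake for home -0.25: half on 0 (DNB), half on -0.5."""
    def half(line: float) -> float:
        adj = goal_margin + line
        if adj > 0:
            return odds - 1.0   # half-stake wins
        if adj == 0:
            return 0.0          # half-stake pushes
        return -1.0             # half-stake loses

    return 0.5 * (half(0.0) + half(-0.5))

print(quarter_pnl_home_minus_025(1, 1.90))   # ≈ 0.90  (full win)
print(quarter_pnl_home_minus_025(0, 1.90))   # -0.5   (draw: half stake lost)
print(quarter_pnl_home_minus_025(-1, 1.90))  # -1.0   (full loss)
```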
Converting Asian Handicaps to Probability
To convert Asian handicap odds to implied probabilities, we need to account for the split-line structure. For a half-ball line (e.g., -0.5 at decimal odds $d = 1.90$), the conversion is straightforward:

$$p_{\text{implied}} = \frac{1}{d}$$

so $1/1.90 \approx 52.6\%$, a figure that still includes the bookmaker's margin.
For a quarter-ball line, we must decompose into the two half-bets and compute the expected return:
$$\text{EV}(\text{AH} {-0.25}) = \frac{1}{2}\left[P(\text{home win}) \cdot (d-1) + P(\text{draw}) \cdot 0 + P(\text{away win}) \cdot (-1)\right] + \frac{1}{2}\left[P(\text{home win}) \cdot (d-1) + P(\text{draw}) \cdot (-1) + P(\text{away win}) \cdot (-1)\right]$$
Simplifying, with $p_H = P(\text{home win})$, $p_D = P(\text{draw})$, and $p_A = P(\text{away win})$, the expected value of AH $-0.25$ at decimal odds $d$ is:

$$\text{EV} = p_H (d - 1) - p_A - \tfrac{1}{2} p_D$$

Setting $\text{EV} = 0$ yields the market's implied constraint on $(p_H, p_D, p_A)$; equivalently, given model probabilities, solving for $d$ gives the fair quarter-ball odds to compare against the market price.
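Setting the EV expression above to zero and solving for $d$ gives $d = 1 + (p_A + \tfrac{1}{2}p_D)/p_H$. A quick sketch (the probabilities are illustrative):

```python
def fair_odds_home_minus_025(p_home: float, p_draw: float, p_away: float) -> float:
    # From EV = p_H (d - 1) - p_A - 0.5 p_D = 0:
    #   d = 1 + (p_A + 0.5 * p_D) / p_H
    return 1.0 + (p_away + 0.5 * p_draw) / p_home

d_fair = fair_odds_home_minus_025(0.45, 0.28, 0.27)
print(round(d_fair, 3))  # 1.911
```

If the market offers more than the fair price (say 1.95 against a fair 1.911), the bet has positive expected value under your model.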
Comparing Asian Handicaps to 1X2
The 1X2 market and Asian handicap market price the same underlying event but in different ways. In theory, the probabilities implied by the two markets should be consistent. In practice, they often diverge slightly due to:
- Different customer bases (sharp Asian bettors vs. European recreational bettors)
- Different margin structures (AH margins are typically 2--4%, 1X2 margins 5--10%)
- The draw being priced explicitly in 1X2 but eliminated in AH
This divergence creates opportunities for the astute bettor who can compare across markets.
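A minimal cross-market consistency check (the odds are illustrative, and proportional margin removal is only one of several de-vigging methods):

```python
def margin_free(odds):
    """Remove the overround proportionally from a set of decimal odds."""
    inv = [1.0 / o for o in odds]
    total = sum(inv)
    return [p / total for p in inv]

# Same match, two markets (illustrative numbers)
p_home_1x2, p_draw_1x2, p_away_1x2 = margin_free([2.00, 3.50, 3.80])
p_home_ah, p_away_ah = margin_free([2.04, 1.86])

# The home side of AH -0.5 covers exactly P(home win),
# so these two estimates should roughly agree:
print(round(p_home_1x2, 3), round(p_home_ah, 3))  # 0.477 0.477
```

A persistent gap between the two numbers, after margin removal, is a signal that one market is mispriced relative to the other.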
Python Code: Asian Handicap Analysis
"""
Asian Handicap analysis tools for soccer betting.
Provides conversion between AH lines and 1X2 probabilities,
expected value calculation, and market comparison.
"""
import numpy as np
from typing import Dict, Tuple, Optional
from dataclasses import dataclass
@dataclass
class AHResult:
"""Result of an Asian Handicap bet given a match outcome."""
pnl: float # profit/loss as fraction of stake
description: str # text description of outcome
def resolve_ah_bet(
home_goals: int,
away_goals: int,
handicap: float,
odds: float,
side: str = 'home',
) -> AHResult:
"""
Resolve an Asian Handicap bet given a match result.
Parameters
----------
home_goals, away_goals : int
Actual match score.
handicap : float
Handicap applied to the chosen side (e.g., -0.5, -0.25, +1).
odds : float
Decimal odds for the bet.
side : str
'home' or 'away' - which side the handicap applies to.
Returns
-------
AHResult with PnL and description.
"""
if side == 'home':
adjusted_margin = (home_goals - away_goals) + handicap
else:
adjusted_margin = (away_goals - home_goals) + handicap
# Check if quarter-ball line
is_quarter = (handicap * 4) % 1 == 0 and (handicap * 2) % 1 != 0
if is_quarter:
# Split into two half-bets
line1 = np.floor(handicap * 2) / 2 # lower line
line2 = np.ceil(handicap * 2) / 2 # upper line
pnl1 = _resolve_single_line(home_goals, away_goals, line1, odds, side)
pnl2 = _resolve_single_line(home_goals, away_goals, line2, odds, side)
total_pnl = (pnl1 + pnl2) / 2
desc_parts = []
if pnl1 > 0:
desc_parts.append(f"Half 1 ({line1:+.1f}): WIN")
elif pnl1 == 0:
desc_parts.append(f"Half 1 ({line1:+.1f}): PUSH")
else:
desc_parts.append(f"Half 1 ({line1:+.1f}): LOSE")
if pnl2 > 0:
desc_parts.append(f"Half 2 ({line2:+.1f}): WIN")
elif pnl2 == 0:
desc_parts.append(f"Half 2 ({line2:+.1f}): PUSH")
else:
desc_parts.append(f"Half 2 ({line2:+.1f}): LOSE")
return AHResult(pnl=total_pnl, description="; ".join(desc_parts))
else:
pnl = _resolve_single_line(
home_goals, away_goals, handicap, odds, side
)
if pnl > 0:
desc = "WIN"
elif pnl == 0:
desc = "PUSH"
else:
desc = "LOSE"
return AHResult(pnl=pnl, description=desc)
def _resolve_single_line(
home_goals: int,
away_goals: int,
handicap: float,
odds: float,
side: str,
) -> float:
"""Resolve a single half-ball or whole-ball AH line."""
if side == 'home':
margin = (home_goals - away_goals) + handicap
else:
margin = (away_goals - home_goals) + handicap
if margin > 0:
return odds - 1 # profit
elif margin == 0:
return 0.0 # push
else:
return -1.0 # lose stake
def ah_to_1x2_probabilities(
handicap: float,
home_odds: float,
away_odds: float,
method: str = 'balanced',
) -> Dict[str, float]:
"""
Convert Asian Handicap odds to approximate 1X2 probabilities.
Uses the relationship between AH lines and match outcome
probabilities to infer the underlying 1X2 distribution.
Parameters
----------
handicap : float
The handicap for the home team (negative = home favored).
home_odds : float
Decimal odds for the home side of the AH.
away_odds : float
Decimal odds for the away side of the AH.
method : str
'balanced' uses the overround to estimate fair probs.
Returns
-------
Dict with approximate 'home', 'draw', 'away' probabilities.
"""
# Implied probabilities (with margin)
p_home_ah = 1.0 / home_odds
p_away_ah = 1.0 / away_odds
overround = p_home_ah + p_away_ah
# Remove margin proportionally
p_home_fair = p_home_ah / overround
p_away_fair = p_away_ah / overround
# Map AH probabilities to 1X2 based on handicap level
# This is an approximation; exact conversion requires
# knowledge of the goal distribution
if handicap == 0:
# AH 0 = draw no bet
# p_home_ah covers home win only; draw is push
# Need to estimate draw probability
draw_est = 0.25 # prior estimate, can be refined
home_win = p_home_fair * (1 - draw_est)
away_win = p_away_fair * (1 - draw_est)
return {
'home': home_win,
'draw': draw_est,
'away': away_win,
}
elif handicap == -0.5:
# Home -0.5: home wins => AH win; draw or away win => AH loss
# p_home_fair = P(home win)
# p_away_fair = P(draw) + P(away win)
return {
'home': p_home_fair,
'draw': 0.0, # absorbed into away AH
'away': p_away_fair,
}
elif handicap == -0.25:
# Split: half on 0, half on -0.5
# Requires iterative solution; approximate here
draw_est = 0.25
home_win = p_home_fair
away_win = 1.0 - home_win - draw_est
if away_win < 0:
draw_est = 1.0 - home_win - 0.05
away_win = 0.05
return {
'home': home_win,
'draw': draw_est,
'away': away_win,
}
else:
# General case: use handicap to shift probabilities
return {
'home': p_home_fair,
'draw': 0.0,
'away': p_away_fair,
}
def compare_ah_to_model(
model_probs: Dict[str, float],
handicap: float,
ah_odds: float,
side: str = 'home',
n_simulations: int = 100_000,
avg_goals: float = 2.6,
) -> Dict[str, float]:
"""
Compare model probabilities to Asian Handicap odds via simulation.
Simulates match outcomes from model probabilities and evaluates
expected profit at given AH odds.
Parameters
----------
model_probs : dict
1X2 probabilities from your model ('home', 'draw', 'away').
handicap : float
Asian handicap for the chosen side.
ah_odds : float
Decimal odds offered.
side : str
'home' or 'away'.
n_simulations : int
Number of Monte Carlo simulations.
avg_goals : float
Average total goals per match (for score simulation).
Returns
-------
Dict with 'edge', 'win_rate', 'expected_pnl'.
"""
np.random.seed(42)
p_h = model_probs['home']
p_d = model_probs['draw']
p_a = model_probs['away']
# Simulate match outcomes
outcomes = np.random.choice(
['H', 'D', 'A'],
size=n_simulations,
p=[p_h, p_d, p_a],
)
# For each outcome, generate a plausible scoreline
total_pnl = 0.0
wins = 0
    for outcome in outcomes:
        if outcome == 'H':
            # Home win: draw the winning margin first, then build a
            # consistent scoreline (loser's goals + margin), so the
            # sampled margin is never silently clamped away
            margin = np.random.geometric(p=0.45)
            a_goals = np.random.poisson(0.9)
            h_goals = a_goals + margin
        elif outcome == 'D':
            goals = np.random.poisson(1.1)
            h_goals = goals
            a_goals = goals
        else:
            margin = np.random.geometric(p=0.45)
            h_goals = np.random.poisson(0.9)
            a_goals = h_goals + margin
result = resolve_ah_bet(h_goals, a_goals, handicap, ah_odds, side)
total_pnl += result.pnl
if result.pnl > 0:
wins += 1
avg_pnl = total_pnl / n_simulations
win_rate = wins / n_simulations
return {
'edge': avg_pnl,
'win_rate': win_rate,
'expected_pnl_per_unit': avg_pnl,
}
# --- Demonstration ---
if __name__ == "__main__":
print("=== Asian Handicap Resolution Examples ===\n")
# Example 1: Home -0.5 at 1.90, match ends 2-1
r1 = resolve_ah_bet(2, 1, handicap=-0.5, odds=1.90, side='home')
print(f"Home -0.5 @ 1.90, Score 2-1: {r1.description}, PnL: {r1.pnl:+.2f}")
# Example 2: Home -0.25 at 1.85, match ends 1-1
r2 = resolve_ah_bet(1, 1, handicap=-0.25, odds=1.85, side='home')
print(f"Home -0.25 @ 1.85, Score 1-1: {r2.description}, PnL: {r2.pnl:+.2f}")
# Example 3: Home -0.25 at 1.85, match ends 2-1
r3 = resolve_ah_bet(2, 1, handicap=-0.25, odds=1.85, side='home')
print(f"Home -0.25 @ 1.85, Score 2-1: {r3.description}, PnL: {r3.pnl:+.2f}")
# Example 4: Away +1.5 at 2.10, match ends 2-1
r4 = resolve_ah_bet(2, 1, handicap=1.5, odds=2.10, side='away')
print(f"Away +1.5 @ 2.10, Score 2-1: {r4.description}, PnL: {r4.pnl:+.2f}")
# Example 5: Home -0.75 at 2.05, match ends 1-0
r5 = resolve_ah_bet(1, 0, handicap=-0.75, odds=2.05, side='home')
print(f"Home -0.75 @ 2.05, Score 1-0: {r5.description}, PnL: {r5.pnl:+.2f}")
print("\n=== Model vs. Market Comparison ===\n")
# Your model says: Home 45%, Draw 28%, Away 27%
model_p = {'home': 0.45, 'draw': 0.28, 'away': 0.27}
# Market offers Home -0.25 at 1.95
comparison = compare_ah_to_model(
model_p, handicap=-0.25, ah_odds=1.95, side='home'
)
print(f"Model: H={model_p['home']:.0%}, D={model_p['draw']:.0%}, "
f"A={model_p['away']:.0%}")
print(f"Market: Home -0.25 @ 1.95")
print(f"Estimated edge: {comparison['edge']:+.4f}")
print(f"Win rate: {comparison['win_rate']:.3f}")
Practical Asian Handicap Strategy
The most important strategic insight for Asian handicap betting is understanding how the quarter-ball lines interact with the draw probability. Consider two matches:
- Match A: Home 50%, Draw 25%, Away 25%. Home is offered at AH -0.25.
- Match B: Home 50%, Draw 15%, Away 35%. Home is offered at AH -0.25.
In Match A, the draw probability is high, so the home team at -0.25 loses half the stake 25% of the time (on draws). In Match B, draws are rarer but the away team is stronger, so the -0.25 line carries different risk characteristics. The same AH line does not imply the same edge in both cases---you must model the full 1X2 distribution, not just the home win probability.
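The difference is easy to quantify with the quarter-ball EV formula from earlier in this section (odds of 1.90 are assumed for both matches, purely for illustration):

```python
def ev_home_minus_025(p_home: float, p_draw: float, p_away: float,
                      odds: float) -> float:
    # EV = p_H (d - 1) - p_A - 0.5 p_D
    return p_home * (odds - 1.0) - p_away - 0.5 * p_draw

ev_a = ev_home_minus_025(0.50, 0.25, 0.25, 1.90)  # Match A
ev_b = ev_home_minus_025(0.50, 0.15, 0.35, 1.90)  # Match B
print(round(ev_a, 3), round(ev_b, 3))  # 0.075 0.025
```

Identical home win probability and identical odds, yet a threefold difference in edge, driven entirely by how the remaining probability splits between draw and away win.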
Professional soccer bettors typically focus on Asian handicap markets because:
- Lower margins: Pinnacle Sports and Asian books offer AH margins of 2--3%, compared to 5--10% on 1X2.
- Higher limits: Sharp bettors can place larger bets before being limited.
- No draw exposure: Eliminating the draw simplifies bankroll management.
- Line movement information: AH lines move in response to sharp money, providing valuable signals about where the informed money is going.
19.4 League-Specific Considerations
Why One Model Does Not Fit All
Soccer is played across hundreds of leagues worldwide, and the statistical properties of these leagues vary dramatically. A model calibrated to the English Premier League will systematically mispredict outcomes in other leagues if applied without adjustment. The key dimensions of variation include:
- Scoring rates: The average goals per game varies from approximately 2.2 (Serie A in defensive eras) to 3.2+ (the Eredivisie and some South American leagues). Your Poisson parameters must reflect the scoring environment.
- Home advantage: Home advantage varies enormously. In pre-COVID European leagues, the home team won approximately 46% of matches. In MLS (with its cross-continental travel), home advantage was historically even stronger. Post-COVID, home advantage decreased substantially and has only partially recovered.
- Competitive balance: The Premier League has 4--6 genuinely elite teams; the Bundesliga is historically dominated by Bayern Munich; Ligue 1 by PSG. This affects the distribution of parameter strengths and the frequency of extreme scorelines.
- Tactical style: The prevalence of pressing, counter-attacking, or possession-based play affects scoring patterns. High-pressing leagues tend to produce more goals from turnovers, altering the xG distribution.
- Promotion and relegation: Unlike American sports, European leagues have promotion and relegation, meaning that 2--3 teams each season are replaced by teams from a lower division. This creates a structural challenge: newly promoted teams have no top-flight data, and relegated teams disappear from the dataset.
Adjusting for Scoring Environment
The most basic adjustment is to calibrate your model's baseline scoring rate to each league. If the Bundesliga averages 3.0 goals per game while Serie A averages 2.5, your attack and defense parameters should be interpreted relative to their league's average.
When comparing teams across leagues (e.g., for European competition), you need a cross-league adjustment factor. One approach is to use European competition results as a bridge:
$$\alpha_{i}^{\text{adj}} = \alpha_{i}^{\text{domestic}} \cdot \frac{\bar{\lambda}_{\text{reference}}}{\bar{\lambda}_{\text{league}}}$$
where $\bar{\lambda}_{\text{reference}}$ is the average scoring rate in a reference league (typically the Champions League or a composite of the top-5 leagues) and $\bar{\lambda}_{\text{league}}$ is the average in the team's domestic league.
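A quick numeric sketch of the adjustment (all rates are illustrative):

```python
alpha_domestic = 1.30   # team attack strength in its own league
lam_league = 2.60       # domestic league average goals per game
lam_reference = 2.90    # reference environment (e.g., top-5 composite)

# alpha_adj = alpha_domestic * (lambda_reference / lambda_league)
alpha_adj = alpha_domestic * (lam_reference / lam_league)
print(round(alpha_adj, 3))  # 1.45
```

The same rescaling applies to defense parameters, so that attack and defense remain interpretable relative to the reference scoring environment.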
Promotion and Relegation Impact
Newly promoted teams present a cold-start problem. With no top-flight data, the model must rely on prior information. Several approaches work:
- League-level priors: Assign newly promoted teams the average parameters of teams that were promoted in previous seasons.
- Transfer-weighted priors: Use the squad's transfer market value as a proxy for strength.
- xG from lower division: If xG data is available from the lower division, use it as a starting point, adjusting for the quality difference between divisions.
- Historical promotion performance: Teams that won the lower division tend to perform better in the top flight than playoff winners. This historical pattern can inform priors.
Python Code: League-Specific Adjustments
"""
League-specific model adjustments for soccer prediction.
Handles scoring rate differences, home advantage variation,
promotion/relegation, and cross-league comparisons.
"""
import numpy as np
import pandas as pd
from typing import Dict, List, Optional
from dataclasses import dataclass
@dataclass
class LeagueProfile:
"""Statistical profile of a soccer league."""
name: str
avg_goals_per_game: float
home_win_pct: float
draw_pct: float
away_win_pct: float
home_advantage_factor: float
n_teams: int
has_promotion_relegation: bool
relegated_per_season: int
avg_xg_per_shot: float
# League profiles based on historical data (approximate 2020-2024 averages)
LEAGUE_PROFILES = {
'premier_league': LeagueProfile(
name='English Premier League',
avg_goals_per_game=2.75,
home_win_pct=0.42,
draw_pct=0.25,
away_win_pct=0.33,
home_advantage_factor=1.28,
n_teams=20,
has_promotion_relegation=True,
relegated_per_season=3,
avg_xg_per_shot=0.11,
),
'bundesliga': LeagueProfile(
name='German Bundesliga',
avg_goals_per_game=3.10,
home_win_pct=0.43,
draw_pct=0.23,
away_win_pct=0.34,
home_advantage_factor=1.30,
n_teams=18,
has_promotion_relegation=True,
relegated_per_season=2, # plus playoff
avg_xg_per_shot=0.12,
),
'la_liga': LeagueProfile(
name='Spanish La Liga',
avg_goals_per_game=2.60,
home_win_pct=0.45,
draw_pct=0.24,
away_win_pct=0.31,
home_advantage_factor=1.35,
n_teams=20,
has_promotion_relegation=True,
relegated_per_season=3,
avg_xg_per_shot=0.10,
),
'serie_a': LeagueProfile(
name='Italian Serie A',
avg_goals_per_game=2.65,
home_win_pct=0.44,
draw_pct=0.25,
away_win_pct=0.31,
home_advantage_factor=1.33,
n_teams=20,
has_promotion_relegation=True,
relegated_per_season=3,
avg_xg_per_shot=0.10,
),
'mls': LeagueProfile(
name='Major League Soccer',
avg_goals_per_game=2.85,
home_win_pct=0.48,
draw_pct=0.22,
away_win_pct=0.30,
home_advantage_factor=1.42,
n_teams=29,
has_promotion_relegation=False,
relegated_per_season=0,
avg_xg_per_shot=0.10,
),
}
class LeagueAdjustedModel:
"""
Adjusts a base Dixon-Coles model for league-specific factors.
"""
def __init__(
self,
league: str,
profiles: Dict[str, LeagueProfile] = LEAGUE_PROFILES,
):
if league not in profiles:
raise ValueError(
f"Unknown league '{league}'. "
f"Available: {list(profiles.keys())}"
)
self.league = league
self.profile = profiles[league]
self.promoted_team_priors: Dict[str, dict] = {}
def get_scoring_adjustment(
self, reference_league: str = 'premier_league'
) -> float:
"""
Compute scoring rate adjustment relative to a reference league.
Returns multiplier to apply to attack parameters.
"""
ref_profile = LEAGUE_PROFILES[reference_league]
return (
self.profile.avg_goals_per_game / ref_profile.avg_goals_per_game
)
def set_promoted_team_priors(
self,
team: str,
lower_div_attack: float,
lower_div_defense: float,
promotion_type: str = 'champion',
) -> dict:
"""
Set prior parameters for a newly promoted team.
Parameters
----------
team : str
Team name.
lower_div_attack : float
Attack strength from lower division model.
lower_div_defense : float
Defense strength from lower division model.
promotion_type : str
'champion', 'runner_up', or 'playoff' - affects adjustment.
Returns
-------
dict with adjusted attack and defense priors.
"""
# Historical adjustment factors for promoted teams
# Champions tend to be stronger than playoff winners
adjustment_factors = {
'champion': 0.75, # retain 75% of lower-div strength
'runner_up': 0.70,
'playoff': 0.65,
}
factor = adjustment_factors.get(promotion_type, 0.70)
# Promoted teams are typically below-average in the top flight
# Scale their lower-division parameters toward the average (1.0)
adj_attack = 1.0 + (lower_div_attack - 1.0) * factor * 0.5
adj_defense = 1.0 + (lower_div_defense - 1.0) * factor * 0.6
# Promoted teams historically concede more than they score
# relative to their lower-division performance
adj_defense *= 1.15 # inflate defensive weakness
prior = {
'attack': adj_attack,
'defense': adj_defense,
'promotion_type': promotion_type,
}
self.promoted_team_priors[team] = prior
return prior
def adjust_for_squad_depth(
self,
team: str,
squad_value_millions: float,
league_avg_value: float,
european_competition: bool = False,
) -> float:
"""
Compute a squad-depth fatigue factor.
Teams with thin squads playing in European competition
tend to underperform domestically later in the season.
Returns adjustment multiplier (< 1.0 means expected decline).
"""
value_ratio = squad_value_millions / league_avg_value
if european_competition and value_ratio < 1.5:
# Thin squad in Europe: expect fatigue
fatigue_factor = 0.92 + 0.05 * min(value_ratio, 2.0)
elif european_competition and value_ratio >= 1.5:
# Deep squad handles the load
fatigue_factor = 0.98
else:
fatigue_factor = 1.0
return fatigue_factor
def cross_league_comparison(
self,
team_attack: float,
team_defense: float,
target_league: str,
) -> Dict[str, float]:
"""
Adjust team parameters for a different league context.
Useful for European competition predictions.
"""
target = LEAGUE_PROFILES[target_league]
scoring_ratio = target.avg_goals_per_game / self.profile.avg_goals_per_game
adj_attack = team_attack * scoring_ratio
adj_defense = team_defense * scoring_ratio
return {
'attack': adj_attack,
'defense': adj_defense,
'scoring_ratio': scoring_ratio,
}
def compare_leagues(leagues: Optional[List[str]] = None) -> pd.DataFrame:
"""Display a comparison table of league profiles."""
if leagues is None:
leagues = list(LEAGUE_PROFILES.keys())
records = []
for key in leagues:
p = LEAGUE_PROFILES[key]
records.append({
'League': p.name,
'Avg Goals': p.avg_goals_per_game,
'Home Win%': f"{p.home_win_pct:.0%}",
'Draw%': f"{p.draw_pct:.0%}",
'Away Win%': f"{p.away_win_pct:.0%}",
'Home Factor': p.home_advantage_factor,
'Teams': p.n_teams,
'Pro/Rel': 'Yes' if p.has_promotion_relegation else 'No',
})
return pd.DataFrame(records)
# --- Demonstration ---
if __name__ == "__main__":
print("=== League Comparison ===\n")
print(compare_leagues().to_string(index=False))
print("\n=== Promoted Team Priors ===\n")
model = LeagueAdjustedModel('premier_league')
# Leicester promoted as Championship champions
prior = model.set_promoted_team_priors(
'Leicester',
lower_div_attack=1.45,
lower_div_defense=0.85,
promotion_type='champion',
)
print(f"Leicester (champion): attack={prior['attack']:.3f}, "
f"defense={prior['defense']:.3f}")
# Ipswich promoted via playoff
prior2 = model.set_promoted_team_priors(
'Ipswich',
lower_div_attack=1.20,
lower_div_defense=0.95,
promotion_type='playoff',
)
print(f"Ipswich (playoff): attack={prior2['attack']:.3f}, "
f"defense={prior2['defense']:.3f}")
print("\n=== Cross-League Adjustment ===\n")
# Liverpool in Champions League (vs Bundesliga opponent)
adj = model.cross_league_comparison(
team_attack=1.45,
team_defense=0.72,
target_league='bundesliga',
)
print(f"Liverpool (PL -> BuLi context):")
print(f" Attack: 1.450 -> {adj['attack']:.3f}")
print(f" Defense: 0.720 -> {adj['defense']:.3f}")
print(f" Scoring ratio: {adj['scoring_ratio']:.3f}")
print("\n=== Squad Depth Factor ===\n")
# Aston Villa with modest squad in Champions League
fatigue = model.adjust_for_squad_depth(
'Aston Villa',
squad_value_millions=450,
league_avg_value=550,
european_competition=True,
)
print(f"Aston Villa fatigue factor: {fatigue:.3f}")
# Man City with deep squad in Champions League
fatigue2 = model.adjust_for_squad_depth(
'Man City',
squad_value_millions=1100,
league_avg_value=550,
european_competition=True,
)
print(f"Man City fatigue factor: {fatigue2:.3f}")
The Importance of Local Knowledge
Quantitative models capture the statistical regularities of soccer, but they inevitably miss league-specific idiosyncrasies that a knowledgeable observer would notice. Examples include:
- Referee tendencies: Some leagues (e.g., Serie A) have historically higher rates of penalties awarded, which inflates xG for certain types of attacks.
- Weather and altitude: Matches played at altitude (e.g., in Bolivia or parts of Mexico) produce markedly different results. Visiting teams from lowland areas suffer measurably.
- Calendar effects: Leagues that play through the winter (England, Germany) see different fatigue patterns than leagues with a winter break (Spain pre-2021, France).
- Derby matches: Local derbies often produce results that deviate from model expectations. The emotional intensity and crowd effects can reduce the advantage of the superior team.
- End-of-season dynamics: Teams with nothing to play for in the final weeks of the season behave differently---their motivation is reduced, and they may rest key players.
A complete soccer model combines the quantitative framework developed in this chapter with a systematic way to incorporate these qualitative factors, whether through explicit model adjustments or through a disciplined Bayesian updating process where the modeler's subjective beliefs interact with the statistical output.
19.5 Tournament and International Match Modeling
The Unique Challenges of International Soccer
Modeling international soccer---whether World Cups, European Championships, or qualifying matches---presents challenges that are fundamentally different from domestic league modeling:
- Small sample sizes: National teams play only 10--15 competitive matches per year, compared to 38+ for club teams. This means parameter estimates are noisy, and models must rely heavily on priors.
- Team composition changes: Unlike club teams, which maintain a relatively stable roster throughout a season, national team squads can change significantly between matches. A team that qualified for the World Cup may have a different starting XI at the tournament itself.
- Infrequent competitive matches: Friendly matches provide data but are much less informative than competitive games. Teams experiment with tactics and rest key players in friendlies.
- Host-nation effects: The host country at a major tournament typically performs well beyond what their historical record would predict. Home crowd support, absence of travel fatigue, and motivational factors all contribute.
- Pressure and experience: Tournament knockout matches are psychologically different from league matches. Teams with tournament experience (e.g., Germany, Brazil) may handle pressure differently than debutants.
- Tactical conservatism: High-stakes tournament matches, especially in the knockout rounds, tend to be lower-scoring than league matches because teams adopt more cautious tactics.
International Elo Ratings
The Elo rating system, originally designed for chess, has been adapted for international soccer and provides one of the best available frameworks for rating national teams. The World Football Elo system maintains ratings for every national team, updated after each match.
The core Elo update formula is:
$$R_{\text{new}} = R_{\text{old}} + K \cdot (W - W_e)$$
where:
- $R$ is the team's rating
- $K$ is the weight factor for the match (higher for World Cup matches, lower for friendlies)
- $W$ is the actual result (1 for win, 0.5 for draw, 0 for loss)
- $W_e$ is the expected result based on current ratings
The expected result is calculated using the logistic function:
$$W_e = \frac{1}{1 + 10^{-(R_{\text{home}} - R_{\text{away}} + H) / 400}}$$
where $H$ is a home-advantage bonus (typically around 100 Elo points for competitive matches, less for neutral venues).
The K-factor varies by match importance:
| Match Type | K-Factor |
|---|---|
| World Cup finals | 60 |
| Continental championship finals | 50 |
| World Cup qualifiers | 40 |
| Continental championship qualifiers | 35 |
| Friendlies | 20 |
The goal difference is incorporated as a multiplier $G$ on the K-factor:
$$G = \begin{cases} 1 & \text{if goal difference} \leq 1 \\ 1.5 & \text{if goal difference} = 2 \\ \frac{11 + \text{goal diff}}{8} & \text{if goal difference} \geq 3 \end{cases}$$
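Putting the pieces together for a single update (the ratings here are illustrative):

```python
def expected_result(r_home: float, r_away: float,
                    home_adv: float = 100.0) -> float:
    # Logistic expected score on the 400-point Elo scale
    return 1.0 / (1.0 + 10 ** (-(r_home - r_away + home_adv) / 400.0))

def g_multiplier(goal_diff: int) -> float:
    # Goal-difference scaling of the K-factor
    gd = abs(goal_diff)
    if gd <= 1:
        return 1.0
    if gd == 2:
        return 1.5
    return (11 + gd) / 8.0

# Illustrative: a 2000-rated host beats a 1800-rated visitor 3-0
# in a World Cup match (K = 60, G = 1.75)
we = expected_result(2000, 1800)
delta = 60 * g_multiplier(3) * (1.0 - we)
print(round(we, 3), round(delta, 1))  # 0.849 15.9
```

Note the asymmetry: because the favorite was already expected to win, even a 3-0 victory moves its rating by only about 16 points, while an upset of the same margin would have moved it by roughly 89.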
Python Code: Tournament Modeling
"""
International soccer and tournament modeling.
Implements Elo ratings for national teams, tournament simulation,
and World Cup/Euro prediction.
"""
import numpy as np
import pandas as pd
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass, field
from itertools import combinations
@dataclass
class EloTeam:
"""National team with Elo rating and metadata."""
name: str
rating: float = 1500.0
confederation: str = 'UEFA'
matches_played: int = 0
class InternationalElo:
"""
Elo rating system for international soccer.
Implements the World Football Elo approach with goal-difference
scaling and match-importance weighting.
"""
# K-factors by match type
K_FACTORS = {
'world_cup': 60,
'continental_final': 50,
'wc_qualifier': 40,
'continental_qualifier': 35,
'nations_league': 35,
'friendly': 20,
}
HOME_ADVANTAGE = 100 # Elo points
NEUTRAL_ADVANTAGE = 0
def __init__(self, teams: Optional[Dict[str, float]] = None):
"""
Initialize with optional starting ratings.
Parameters
----------
teams : dict, optional
Mapping of team name to initial Elo rating.
"""
self.teams: Dict[str, EloTeam] = {}
if teams:
for name, rating in teams.items():
self.teams[name] = EloTeam(name=name, rating=rating)
def _ensure_team(self, name: str) -> EloTeam:
"""Create team with default rating if not exists."""
if name not in self.teams:
self.teams[name] = EloTeam(name=name)
return self.teams[name]
def expected_result(
self,
team_a: str,
team_b: str,
neutral: bool = False,
) -> float:
"""
Expected result for team_a (1=win, 0.5=draw, 0=loss).
"""
a = self._ensure_team(team_a)
b = self._ensure_team(team_b)
h = self.NEUTRAL_ADVANTAGE if neutral else self.HOME_ADVANTAGE
diff = a.rating - b.rating + h
return 1.0 / (1.0 + 10 ** (-diff / 400.0))
def goal_diff_multiplier(self, goal_diff: int) -> float:
"""Compute goal-difference scaling factor."""
gd = abs(goal_diff)
if gd <= 1:
return 1.0
elif gd == 2:
return 1.5
else:
return (11 + gd) / 8.0
def update(
self,
home_team: str,
away_team: str,
home_goals: int,
away_goals: int,
match_type: str = 'friendly',
neutral: bool = False,
) -> Tuple[float, float]:
"""
Update Elo ratings after a match.
Returns
-------
Tuple of (new_home_rating, new_away_rating).
"""
home = self._ensure_team(home_team)
away = self._ensure_team(away_team)
# Actual results
if home_goals > away_goals:
w_home, w_away = 1.0, 0.0
elif home_goals == away_goals:
w_home, w_away = 0.5, 0.5
else:
w_home, w_away = 0.0, 1.0
# Expected results
we_home = self.expected_result(home_team, away_team, neutral)
we_away = 1.0 - we_home
# K-factor and goal difference
k = self.K_FACTORS.get(match_type, 20)
g = self.goal_diff_multiplier(home_goals - away_goals)
# Update
home.rating += k * g * (w_home - we_home)
away.rating += k * g * (w_away - we_away)
home.matches_played += 1
away.matches_played += 1
return home.rating, away.rating
def predict_match(
self,
home_team: str,
away_team: str,
neutral: bool = False,
avg_total_goals: float = 2.3,
) -> Dict[str, float]:
"""
Predict 1X2 probabilities from Elo ratings.
Uses the Elo expected result to generate approximate
1X2 probabilities via a calibrated mapping.
"""
we = self.expected_result(home_team, away_team, neutral)
# Map expected result to 1X2 using empirical calibration
# Based on historical international matches
draw_base = 0.24 # base draw probability
draw_adj = draw_base - 0.3 * abs(we - 0.5)
draw_adj = max(0.10, min(0.35, draw_adj))
home_win = we * (1 - draw_adj)
away_win = (1 - we) * (1 - draw_adj)
total = home_win + draw_adj + away_win
return {
'home': home_win / total,
'draw': draw_adj / total,
'away': away_win / total,
}
def get_rankings(self, top_n: int = 20) -> pd.DataFrame:
"""Return top N teams by Elo rating."""
records = [
{'Team': t.name, 'Rating': t.rating, 'Matches': t.matches_played}
for t in self.teams.values()
]
df = pd.DataFrame(records).sort_values(
'Rating', ascending=False
).reset_index(drop=True)
df.index += 1
df.index.name = 'Rank'
return df.head(top_n)
class TournamentSimulator:
"""
Simulate a World Cup or European Championship.
Handles group stage, knockout rounds, and overall
tournament outcome probabilities.
"""
def __init__(
self,
elo_system: InternationalElo,
avg_goals: float = 2.3,
):
self.elo = elo_system
self.avg_goals = avg_goals
def simulate_match(
self,
team_a: str,
team_b: str,
neutral: bool = True,
allow_draw: bool = True,
) -> Tuple[int, int]:
"""
Simulate a single match.
For knockout matches (allow_draw=False), extra time and
penalties are simulated if the match is drawn.
"""
        # Use Poisson simulation with rating-implied expected goals
        we = self.elo.expected_result(team_a, team_b, neutral)
# Distribute total goals based on strength
lam_a = self.avg_goals * we
lam_b = self.avg_goals * (1 - we)
goals_a = np.random.poisson(lam_a)
goals_b = np.random.poisson(lam_b)
if not allow_draw and goals_a == goals_b:
# Extra time: lower scoring rate
et_a = np.random.poisson(lam_a * 0.3)
et_b = np.random.poisson(lam_b * 0.3)
goals_a += et_a
goals_b += et_b
if goals_a == goals_b:
# Penalties: approximately 50-50 with slight edge
if np.random.random() < we * 0.55 + 0.225:
goals_a += 1
else:
goals_b += 1
return goals_a, goals_b
def simulate_group(
self,
group_teams: List[str],
n_sims: int = 10_000,
) -> pd.DataFrame:
"""
Simulate a 4-team group stage.
Returns advancement probabilities for each team.
"""
advancement = {team: 0 for team in group_teams}
group_winners = {team: 0 for team in group_teams}
for _ in range(n_sims):
points = {team: 0 for team in group_teams}
gd = {team: 0 for team in group_teams}
for t1, t2 in combinations(group_teams, 2):
g1, g2 = self.simulate_match(t1, t2, neutral=True)
gd[t1] += g1 - g2
gd[t2] += g2 - g1
if g1 > g2:
points[t1] += 3
elif g1 == g2:
points[t1] += 1
points[t2] += 1
else:
points[t2] += 3
# Rank by points, then goal difference
ranking = sorted(
group_teams,
key=lambda t: (points[t], gd[t]),
reverse=True,
)
# Top 2 advance
advancement[ranking[0]] += 1
advancement[ranking[1]] += 1
group_winners[ranking[0]] += 1
records = []
for team in group_teams:
records.append({
'Team': team,
'Elo': self.elo.teams[team].rating,
'Advance%': advancement[team] / n_sims * 100,
'Win Group%': group_winners[team] / n_sims * 100,
})
return pd.DataFrame(records).sort_values(
'Advance%', ascending=False
).reset_index(drop=True)
def simulate_tournament(
self,
groups: Dict[str, List[str]],
n_sims: int = 10_000,
) -> pd.DataFrame:
"""
Simulate a full tournament and return win probabilities.
Parameters
----------
groups : dict
Mapping of group name to list of 4 teams.
n_sims : int
Number of tournament simulations.
Returns
-------
DataFrame with tournament win probability for each team.
"""
all_teams = [t for teams in groups.values() for t in teams]
wins = {team: 0 for team in all_teams}
finals = {team: 0 for team in all_teams}
semis = {team: 0 for team in all_teams}
for _ in range(n_sims):
# Group stage
group_results = {}
for group_name, teams in groups.items():
points = {team: 0 for team in teams}
gd = {team: 0 for team in teams}
for t1, t2 in combinations(teams, 2):
g1, g2 = self.simulate_match(t1, t2, neutral=True)
gd[t1] += g1 - g2
gd[t2] += g2 - g1
if g1 > g2:
points[t1] += 3
elif g1 == g2:
points[t1] += 1
points[t2] += 1
else:
points[t2] += 3
ranking = sorted(
teams,
key=lambda t: (points[t], gd[t]),
reverse=True,
)
group_results[group_name] = (ranking[0], ranking[1])
# Simplified knockout: QF, SF, Final
group_names = sorted(groups.keys())
# Quarter-finals (simplified bracket)
qf_teams = []
for gn in group_names:
qf_teams.extend(group_results[gn])
# Pair them: 1A vs 2B, 1B vs 2A, etc. (simplified)
sf_teams = []
for i in range(0, len(qf_teams) - 1, 2):
g1, g2 = self.simulate_match(
qf_teams[i], qf_teams[i+1],
neutral=True, allow_draw=False
)
winner = qf_teams[i] if g1 > g2 else qf_teams[i+1]
sf_teams.append(winner)
semis[winner] += 1
# Semi-finals
finalists = []
for i in range(0, len(sf_teams) - 1, 2):
g1, g2 = self.simulate_match(
sf_teams[i], sf_teams[i+1],
neutral=True, allow_draw=False
)
winner = sf_teams[i] if g1 > g2 else sf_teams[i+1]
finalists.append(winner)
finals[winner] += 1
# Final
if len(finalists) >= 2:
g1, g2 = self.simulate_match(
finalists[0], finalists[1],
neutral=True, allow_draw=False
)
champion = finalists[0] if g1 > g2 else finalists[1]
wins[champion] += 1
records = []
for team in all_teams:
records.append({
'Team': team,
'Elo': self.elo.teams[team].rating,
'Win%': wins[team] / n_sims * 100,
'Final%': finals.get(team, 0) / n_sims * 100,
'Semi%': semis.get(team, 0) / n_sims * 100,
})
return pd.DataFrame(records).sort_values(
'Win%', ascending=False
).reset_index(drop=True)
# --- Demonstration ---
if __name__ == "__main__":
    # Initialize Elo with approximate current ratings
    initial_ratings = {
        'France': 2050, 'Brazil': 2000, 'Argentina': 2080,
        'England': 1980, 'Spain': 2020, 'Germany': 1960,
        'Netherlands': 1940, 'Portugal': 1970, 'Italy': 1920,
        'Belgium': 1900, 'Croatia': 1880, 'Uruguay': 1870,
        'Colombia': 1850, 'USA': 1810, 'Mexico': 1800,
        'Japan': 1830, 'South Korea': 1780, 'Australia': 1720,
        'Saudi Arabia': 1680, 'Qatar': 1640, 'Canada': 1750,
        'Morocco': 1870, 'Senegal': 1820, 'Ghana': 1700,
    }
    elo = InternationalElo(initial_ratings)

    print("=== International Elo Rankings ===\n")
    print(elo.get_rankings(10).to_string())

    print("\n=== Match Prediction ===\n")
    pred = elo.predict_match('England', 'Germany', neutral=True)
    print("England vs Germany (neutral):")
    print(f"  England: {pred['home']:.1%}")
    print(f"  Draw:    {pred['draw']:.1%}")
    print(f"  Germany: {pred['away']:.1%}")

    print("\n=== Group Stage Simulation ===\n")
    sim = TournamentSimulator(elo)
    group_a = ['Qatar', 'Netherlands', 'Senegal', 'USA']
    group_result = sim.simulate_group(group_a, n_sims=10_000)
    print("Group A:")
    print(group_result.to_string(index=False))

    print("\n=== Mini Tournament Simulation ===\n")
    # Each team appears in exactly one group, as in a real draw
    groups = {
        'A': ['Netherlands', 'Senegal', 'Qatar', 'Canada'],
        'B': ['England', 'USA', 'Japan', 'Ghana'],
        'C': ['Argentina', 'Mexico', 'Saudi Arabia', 'Australia'],
        'D': ['France', 'Germany', 'Spain', 'Morocco'],
    }
    results = sim.simulate_tournament(groups, n_sims=5_000)
    print(results.head(10).to_string(index=False))
Practical Considerations for Tournament Betting
Tournament betting markets offer several structural advantages for the informed bettor:
- Outright markets open early: Futures odds on World Cup and Euro winners are posted months before the tournament. Early prices often reflect outdated information and can offer value if your model incorporates recent form and squad updates.
- Group-stage dead rubbers: In the final round of group matches, some teams have already qualified and may rotate their squads. This is predictable and often underpriced by the market.
- Extra time and penalty dynamics: In knockout matches, the market prices the 90-minute result. Understanding how specific teams perform in extra time and penalty situations provides an additional edge.
- Host-nation adjustment: The historical advantage for host nations at World Cups is substantial---approximately 0.5 goals per game more than their rating would predict. Markets typically price this in but may not fully calibrate it, especially for hosts that lack historical tournament data.
- Confederation strength: Teams from weaker confederations (e.g., AFC, CONCACAF) tend to be overrated by Elo in cross-confederation play at major tournaments because their qualifying matches inflate their ratings without truly testing them against top-level opposition.
Market Insight: The Group Stage Value Window
The most exploitable window in tournament betting occurs during the group stage. After the first round of matches, the market overreacts to single-game results. A team that loses its opening match by a narrow margin often sees its odds over-adjusted for the second match day; conversely, a team that wins a fortunate 1--0 may be overvalued. Your pre-tournament Elo model, combined with a single-game xG update, provides a more stable assessment than the market's knee-jerk reaction.
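One illustrative way to operationalize this blend is a simple shrinkage of the one-match xG signal toward the pre-tournament rating. The function `blended_strength` and the weight `w` are hypothetical placeholders for the idea, not fitted values from any model in this chapter:

```python
# Hedged sketch: shrink a single-match xG signal toward the
# pre-tournament prior rather than reacting to it fully.
# The weight w = 0.15 is an illustrative assumption.

def blended_strength(prior_gd: float, match_xg_diff: float,
                     w: float = 0.15) -> float:
    """Per-match expected goal difference after one group game.

    prior_gd: pre-tournament expected GD per match (from Elo).
    match_xg_diff: xG differential observed in the opening match.
    """
    return (1 - w) * prior_gd + w * match_xg_diff

# A side expected to win by 0.8 that loses narrowly but posts a
# +1.2 xG differential barely moves in our assessment -- the
# market's knee-jerk repricing is where the value sits.
print(round(blended_strength(0.8, 1.2), 2))  # 0.86
```

The point of the low weight is exactly the "value window" above: one match carries little information relative to a rating built on years of results.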
19.6 Chapter Summary
This chapter developed the quantitative toolkit for modeling soccer, a sport whose low-scoring nature makes it particularly amenable to Poisson-based approaches but also particularly challenging due to the dominance of randomness in individual match outcomes.
Key Concepts
The Dixon-Coles Model extends the basic independent Poisson framework with a correlation adjustment factor $\tau$ that corrects for the observed excess frequency of low-scoring draws. The model's parameters---attack strength, defense strength, home advantage, and the correlation parameter $\rho$---are estimated via maximum likelihood with time-decay weighting to prioritize recent form. This model remains the foundation of soccer prediction nearly three decades after its introduction.
Expected Goals (xG) provides a shot-level measure of chance quality that separates chance creation from finishing. A logistic regression model using features such as distance to goal, angle, body part, and situation type produces calibrated probabilities that, when aggregated, give a more predictive measure of team quality than raw goals. xG is particularly valuable for identifying regression candidates: teams whose actual goals significantly differ from their xG are prime targets for contrarian betting.
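As a back-of-the-envelope check, over- or underperformance can be converted into rough standard-deviation units. The helper below is a hypothetical sketch: it treats each shot as an independent Bernoulli trial with a common conversion probability, which is a deliberate simplification of a real xG model's per-shot probabilities:

```python
# Hedged sketch: how far is actual finishing from xG, in sigmas?
# Assumes each shot converts with a common probability p = xG/shots,
# a simplification of shot-specific xG values.
import math

def finishing_z_score(goals: int, xg: float, shots: int) -> float:
    """Approximate z-score of (goals - xG) under a binomial model."""
    p = xg / shots
    var = shots * p * (1 - p)  # binomial variance of total goals
    return (goals - xg) / math.sqrt(var)

# A side with 15 goals from 9.2 xG on 120 shots is roughly +2 sigma
# hot -- a regression candidate to fade rather than follow.
print(round(finishing_z_score(15, 9.2, 120), 2))  # 1.99
```

Teams beyond about two sigmas in either direction are the contrarian targets the paragraph above describes.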
Asian Handicaps are the dominant professional soccer betting market. Understanding the mechanics of whole-ball, half-ball, and quarter-ball lines---and how to convert between AH odds and 1X2 probabilities---is essential. Quarter-ball lines, which split the stake across two adjacent half-ball lines, are the most distinctive and often confusing feature of Asian markets.
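The split-stake mechanics can be sketched in a few lines. The function `settle_quarter_ball` and its profit-based return convention are illustrative, and the default `odds` of 1.95 simply assumes a typical near-even AH price:

```python
# Hedged sketch: settle a quarter-ball Asian handicap by splitting
# the stake across the two adjacent half-ball lines.

def settle_quarter_ball(stake: float, line: float, margin: int,
                        odds: float = 1.95) -> float:
    """Profit (+) or loss (-) for a bet on the home side at a
    quarter-ball handicap `line` (e.g. -0.25), given the home
    goal margin. Half the stake goes on each adjacent half-line."""
    half = stake / 2
    profit = 0.0
    for sub_line in (line - 0.25, line + 0.25):
        adj = margin + sub_line          # handicap-adjusted margin
        if adj > 0:                      # this half wins
            profit += half * (odds - 1)
        elif adj < 0:                    # this half loses
            profit -= half
        # adj == 0: push, half stake returned, no profit change
    return profit

# Home -0.25: a draw loses half the stake (the -0 half pushes,
# the -0.5 half loses); any home win pays on both halves.
print(round(settle_quarter_ball(100, -0.25, 0), 2))   # -50.0
print(round(settle_quarter_ball(100, -0.25, 1), 2))   # 95.0
print(round(settle_quarter_ball(100, -0.25, -1), 2))  # -100.0
```

The half-win/half-push structure is why the draw probability from the full 1X2 distribution matters so much when pricing these lines.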
League-Specific Adjustments are necessary because soccer's statistical properties vary dramatically across leagues. Scoring rates, home advantage, competitive balance, tactical style, and the presence or absence of promotion/relegation all affect model calibration. Cross-league comparison for European competition requires explicit adjustment factors.
Tournament and International Modeling presents unique challenges: small samples, variable team compositions, and high-stakes knockout formats. International Elo ratings provide a robust framework for rating national teams, and Monte Carlo simulation is the standard approach for generating tournament outcome probabilities.
Key Formulas
The Dixon-Coles probability of a scoreline $(x, y)$:
$$P(X = x, Y = y) = \tau(x, y, \lambda, \mu, \rho) \cdot \frac{e^{-\lambda} \lambda^x}{x!} \cdot \frac{e^{-\mu} \mu^y}{y!}$$
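where the correction factor $\tau$, in the standard Dixon-Coles form, perturbs only the four lowest scorelines:

$$\tau(x, y, \lambda, \mu, \rho) =
\begin{cases}
1 - \lambda\mu\rho & \text{if } x = y = 0 \\
1 + \lambda\rho & \text{if } x = 0,\ y = 1 \\
1 + \mu\rho & \text{if } x = 1,\ y = 0 \\
1 - \rho & \text{if } x = y = 1 \\
1 & \text{otherwise}
\end{cases}$$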
The xG logistic regression model:
$$\text{xG} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{\text{dist}} + \beta_2 x_{\text{angle}} + \cdots)}}$$
The Elo expected result:
$$W_e = \frac{1}{1 + 10^{-(R_A - R_B + H)/400}}$$
The time-decay weight:
$$\phi(t) = e^{-\xi(T - t)}$$
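A minimal numeric sketch of the last two formulas follows. The home-advantage term defaults to zero (a neutral venue), and the $\xi$ value and day-based time unit are illustrative assumptions rather than fitted parameters:

```python
# Hedged numeric sketch of the Elo expected result and the
# Dixon-Coles time-decay weight. xi = 0.0065 per day is an
# illustrative assumption, not a fitted value.
import math

def elo_expected(r_a: float, r_b: float, home_adv: float = 0.0) -> float:
    """W_e = 1 / (1 + 10^(-(R_A - R_B + H) / 400))."""
    return 1.0 / (1.0 + 10 ** (-(r_a - r_b + home_adv) / 400))

def decay_weight(t: float, now: float, xi: float = 0.0065) -> float:
    """phi(t) = exp(-xi * (T - t)), with t measured in days."""
    return math.exp(-xi * (now - t))

print(round(elo_expected(2050, 1950), 3))  # 0.64: a 100-point gap
print(round(decay_weight(0, 365), 3))      # a year-old match: ~0.093
```

Note how quickly the decay weight shrinks: with this $\xi$, a season-old result contributes less than a tenth of a fresh one to the likelihood.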
Practical Guidelines
- Always validate your Dixon-Coles model against closing line probabilities. If your model consistently disagrees with the market by large margins, the model is more likely wrong than the market.
- Use xG as a complement to your goal-based model, not a replacement. The optimal approach blends goal-based and xG-based team ratings.
- In Asian handicap markets, always model the full 1X2 distribution, not just the binary AH outcome. The draw probability is crucial for evaluating quarter-ball lines.
- When modeling a new league, start with at least two full seasons of historical data and validate out-of-sample before betting.
- For tournament betting, run at least 50,000 simulations to get stable probability estimates for outcomes with low base rates (like a specific team winning the tournament).
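The 50,000-simulation guideline follows directly from the binomial standard error of a Monte Carlo probability estimate; a quick sketch:

```python
# Monte Carlo standard error of an estimated probability p over
# n simulations: sqrt(p * (1 - p) / n). Low-base-rate outcomes
# (a specific team winning the tournament) need many sims for
# the estimate to be usable relative to typical betting edges.
import math

def mc_standard_error(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

# A 3% tournament winner estimated from 5,000 vs 50,000 sims:
print(round(mc_standard_error(0.03, 5_000), 4))   # 0.0024 -- ~8% relative error
print(round(mc_standard_error(0.03, 50_000), 4))  # 0.0008 -- ~2.5% relative error
```

At 5,000 simulations the noise on a 3% outcome is comparable to the edge you are trying to detect; at 50,000 it is small enough to act on.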
Looking Ahead
Chapter 20 shifts from the world's most popular sport to a distinctly American domain: college sports. While the modeling principles are similar---Poisson-like scoring models, strength of schedule adjustments, and market efficiency analysis---the college sports landscape introduces unique challenges including massive team pools, the impact of coaching changes, and recruiting data as a predictive feature. The transition from modeling 20 teams in a European league to 130+ teams in college football requires fundamentally different approaches to parameter estimation and model validation.
Review Questions
- Explain why the independent Poisson model systematically misprices low-scoring soccer matches, and describe how the Dixon-Coles correction factor addresses this.
- What is the typical sign and magnitude of the correlation parameter $\rho$ in the Dixon-Coles model, and what does this imply about the relationship between home and away goals at low scores?
- A team has scored 15 goals from 9.2 xG through their first 10 matches. What does this suggest about their future scoring rate, and how would you incorporate this information into your betting strategy?
- Explain how a quarter-ball Asian handicap (e.g., -0.25) is resolved for each possible match outcome category (home win, draw, away win).
- Why do newly promoted teams present a cold-start problem for Poisson-based models? Describe two methods for generating reasonable prior parameters for these teams.
- Compare and contrast the challenges of modeling domestic league matches versus international tournament matches. What additional data sources or adjustments are needed for tournament modeling?
Exercises
- Dixon-Coles Implementation (Programming): Using the DixonColesModel class from Section 19.1, fit a model to a full season of Premier League data (available from football-data.co.uk). Compare the model's predicted 1X2 probabilities to the market closing prices. Compute the model's Brier score and compare it to the Brier score of the market-implied probabilities.
- xG Calibration (Programming): Build an xG model using the logistic regression approach from Section 19.2 with data from StatsBomb Open Data. Create a calibration plot and compute the Brier score. Then, replace raw goals with xG values in the Dixon-Coles model and compare the predictive accuracy of the xG-adjusted model to the goals-based model.
- Asian Handicap Value Finder (Programming): Write a program that takes 1X2 probabilities from your model and compares them to Asian handicap odds from a sportsbook. The program should identify all bets where the estimated edge exceeds a specified threshold (e.g., 3%).
- League Comparison (Analysis): Select three leagues from different countries. For each, compute the average goals per game, the home win percentage, the draw percentage, and the away win percentage over the last five seasons. Discuss how these differences would affect the parameters of a Dixon-Coles model fitted to each league.
- Tournament Simulation (Programming): Using the TournamentSimulator class from Section 19.5, simulate the most recent World Cup or European Championship draw. Compare your model's pre-tournament win probabilities to the pre-tournament betting odds. After the tournament, evaluate whether the model or the market produced better-calibrated probabilities.
Further Reading
- Dixon, M.J. and Coles, S.G. (1997). "Modelling Association Football Scores and Inefficiencies in the Football Betting Market." Journal of the Royal Statistical Society: Series C, 46(2), 265--280.
- Maher, M.J. (1982). "Modelling Association Football Scores." Statistica Neerlandica, 36(3), 109--118.
- Caley, M. (2015). "Premier League Projections and New Expected Goals." Cartilage Free Captain (blog).
- Eggels, H. (2016). "Expected Goals in Soccer: Explaining Match Results using Predictive Analytics." Master's thesis, Eindhoven University of Technology.
- Hvattum, L.M. and Arntzen, H. (2010). "Using ELO Ratings for Match Result Prediction in Association Football." International Journal of Forecasting, 26(3), 460--470.