Case Study 1: Predicting the Premier League — A Poisson Regression Approach

Overview

In this case study, we build a complete match prediction system for the English Premier League using the Dixon-Coles Poisson regression framework. We estimate team attack and defense parameters from historical match data, generate match-by-match probabilities, and simulate the remainder of a season to produce probabilistic league table forecasts. The system is evaluated against both bookmaker odds and naive baselines.

Motivation

Before the 2022-23 season, Arsenal led the Premier League by 8 points at the halfway mark. Pundits were divided: some declared the title race over, while others pointed to Arsenal's lack of experience in title challenges. A principled predictive model can cut through this noise by quantifying exactly how likely different outcomes are, given the current form and strength of each team.

Our goal is to build a model that can answer questions such as: - What is Arsenal's probability of winning the league from this position? - What is the probability of each team finishing in the top 4? - What is the most likely final points total for each team?

Data Description

We work with a synthetic dataset that mirrors the structure of real Premier League data. Each row represents a single match with the following columns:

Column Description
date Match date
home_team Home team name
away_team Away team name
home_goals Goals scored by home team
away_goals Goals scored by away team
matchday Matchday number (1-38)

Step 1: Exploratory Analysis

Before modeling, we examine the basic properties of the data.

Home advantage: In a typical Premier League season, the home team wins approximately 46% of matches, draws occur 27% of the time, and the away team wins 27%. The average home goals per match is about 1.55, while the average away goals is about 1.20.

import pandas as pd
import numpy as np

# Basic statistics
home_win_pct = (df["home_goals"] > df["away_goals"]).mean()
draw_pct = (df["home_goals"] == df["away_goals"]).mean()
away_win_pct = (df["home_goals"] < df["away_goals"]).mean()

print(f"Home Win: {home_win_pct:.1%}")
print(f"Draw:     {draw_pct:.1%}")
print(f"Away Win: {away_win_pct:.1%}")
print(f"Avg Home Goals: {df['home_goals'].mean():.2f}")
print(f"Avg Away Goals: {df['away_goals'].mean():.2f}")

Goal distribution: We verify that goals approximately follow a Poisson distribution by comparing observed frequencies with Poisson expectations:

from scipy.stats import poisson, chisquare

lambda_home = df["home_goals"].mean()
observed = df["home_goals"].value_counts().sort_index()
expected = [poisson.pmf(k, lambda_home) * len(df) for k in observed.index]

chi2, p_value = chisquare(observed.values, f_exp=expected)
print(f"Chi-squared test: statistic={chi2:.2f}, p-value={p_value:.4f}")

If the p-value is above 0.05, we cannot reject the Poisson assumption -- a necessary (but not sufficient) condition for our modeling approach.

Step 2: The Dixon-Coles Model

We implement the full Dixon-Coles model with time weighting.

Parameter Estimation

The model estimates the following parameters for each of the 20 teams: - Attack parameter $a_i$: How strong the team is at scoring relative to the league average. - Defense parameter $d_i$: How strong the team is at preventing goals relative to the league average. - Home advantage $\beta$: A global parameter capturing the home-field effect. - Correlation $\rho$: The Dixon-Coles correction for low-scoring game dependencies.

from scipy.optimize import minimize

def fit_dixon_coles(matches_df, teams, xi=0.005):
    """Fit Dixon-Coles model with time decay."""
    n_teams = len(teams)
    team_idx = {team: i for i, team in enumerate(teams)}

    home_idx = matches_df["home_team"].map(team_idx).values
    away_idx = matches_df["away_team"].map(team_idx).values
    home_goals = matches_df["home_goals"].values
    away_goals = matches_df["away_goals"].values

    # Time weights
    max_date = matches_df["date"].max()
    days_ago = (max_date - matches_df["date"]).dt.days.values
    weights = np.exp(-xi * days_ago)

    def neg_log_likelihood(params):
        alpha = params[0]
        home_adv = params[1]
        rho = params[2]
        attack = params[3:3 + n_teams]
        defense = params[3 + n_teams:3 + 2 * n_teams]

        # Sum-to-zero constraint via penalty
        penalty = 100 * (attack.sum()**2 + defense.sum()**2)

        lam = np.exp(alpha + home_adv + attack[home_idx] - defense[away_idx])
        mu = np.exp(alpha + attack[away_idx] - defense[home_idx])

        # Dixon-Coles tau correction
        tau = np.ones(len(matches_df))
        mask_00 = (home_goals == 0) & (away_goals == 0)
        mask_10 = (home_goals == 1) & (away_goals == 0)
        mask_01 = (home_goals == 0) & (away_goals == 1)
        mask_11 = (home_goals == 1) & (away_goals == 1)

        tau[mask_00] = 1 - rho * lam[mask_00] * mu[mask_00]
        tau[mask_10] = 1 + rho * mu[mask_10]
        tau[mask_01] = 1 + rho * lam[mask_01]
        tau[mask_11] = 1 - rho

        log_lik = weights * (
            home_goals * np.log(lam) - lam
            + away_goals * np.log(mu) - mu
            + np.log(np.maximum(tau, 1e-10))
        )

        return -np.sum(log_lik) + penalty

    # Initial parameters
    x0 = np.zeros(3 + 2 * n_teams)
    x0[0] = np.log(1.35)  # alpha ~ log of average goals
    x0[1] = 0.25           # home advantage

    result = minimize(neg_log_likelihood, x0, method="L-BFGS-B",
                      bounds=[(None, None), (None, None), (-0.5, 0.5)]
                      + [(None, None)] * (2 * n_teams))

    return result

Results Interpretation

After fitting, we extract the team parameters and rank teams by their combined offensive-defensive strength:

teams_sorted = sorted(
    zip(teams, attack_params, defense_params),
    key=lambda x: x[1] - x[2],  # attack - defense = overall strength
    reverse=True
)

for team, att, dfn in teams_sorted[:5]:
    print(f"{team:20s}  Attack: {att:+.3f}  Defense: {dfn:+.3f}  "
          f"Net: {att - dfn:+.3f}")

A typical output for a strong team might show attack = +0.35 and defense = -0.15, meaning they score about $e^{0.35} \approx 42\%$ more than average and concede about $e^{0.15} \approx 16\%$ less than average.

Step 3: Match Probability Calculation

Given the fitted parameters, we compute the probability of each scoreline for any match:

from scipy.stats import poisson

def match_probabilities(lam, mu, max_goals=8):
    """Compute scoreline probabilities from Poisson parameters."""
    probs = np.zeros((max_goals + 1, max_goals + 1))
    for i in range(max_goals + 1):
        for j in range(max_goals + 1):
            probs[i, j] = poisson.pmf(i, lam) * poisson.pmf(j, mu)

    home_win = np.tril(probs, -1).sum()
    draw = np.trace(probs)
    away_win = np.triu(probs, 1).sum()

    return {"home_win": home_win, "draw": draw, "away_win": away_win,
            "scoreline_probs": probs}

Step 4: Season Simulation

The most powerful application of the model is Monte Carlo simulation of the remaining season. We simulate 10,000 completions of the season and tabulate the results:

def simulate_season(remaining_fixtures, team_params, n_sims=10000):
    """Simulate remaining season fixtures."""
    all_tables = []

    for _ in range(n_sims):
        points = {}
        for _, fixture in remaining_fixtures.iterrows():
            home = fixture["home_team"]
            away = fixture["away_team"]

            lam = compute_lambda(home, away, team_params)
            mu = compute_mu(home, away, team_params)

            home_goals = np.random.poisson(lam)
            away_goals = np.random.poisson(mu)

            if home_goals > away_goals:
                points[home] = points.get(home, 0) + 3
            elif home_goals == away_goals:
                points[home] = points.get(home, 0) + 1
                points[away] = points.get(away, 0) + 1
            else:
                points[away] = points.get(away, 0) + 3

        all_tables.append(points)

    return all_tables

From the simulations, we can compute: - Title probability: Fraction of simulations where each team finishes first. - Top-4 probability: Fraction where each team finishes in the top 4 (Champions League). - Relegation probability: Fraction where each team finishes in the bottom 3. - Points distribution: Full distribution of final points totals for each team.

Step 5: Model Evaluation

We evaluate the model using out-of-sample predictions on the second half of the season:

Log-Loss Comparison

Model Log-Loss
Naive baseline (historical frequencies) 1.095
Basic Poisson (no correction) 1.035
Dixon-Coles (no time weighting) 1.022
Dixon-Coles (with time weighting) 1.010
Bookmaker consensus 0.990

The Dixon-Coles model with time weighting comes within 2% of the bookmaker consensus, which aggregates vast amounts of information including team news, market dynamics, and expert opinion.

Calibration

We assess calibration by binning predicted home win probabilities and comparing with observed frequencies. The model shows slight overconfidence for underdogs (predicted 15% home win, observed 18%) and slight underconfidence for favorites (predicted 65% home win, observed 62%).

Key Findings

  1. Home advantage has declined in recent seasons, particularly post-COVID. Our model estimates a home advantage parameter of about 0.22, corresponding to roughly 0.3 extra expected goals per match for the home team.

  2. Team strength parameters stabilize after approximately 10-12 matches. Early-season predictions carry wide uncertainty bands.

  3. The Dixon-Coles correction is small but statistically significant. The estimated $\rho \approx -0.04$ slightly increases the probability of draws relative to the pure independent Poisson model.

  4. Time weighting with a half-life of approximately 40 matches (roughly one season) provides the best out-of-sample performance.

  5. Season simulation reveals that mid-season point leads are less decisive than commonly believed. An 8-point lead at the halfway stage corresponds to approximately a 75% title probability, not the near-certainty that many assume.

Limitations and Extensions

  • The model treats teams as having fixed strength within a season (modulated only by time weighting). A state-space model with match-by-match updates would be more responsive.
  • Player-level information (injuries, suspensions, transfers) is not directly incorporated. Adding key-player availability as a covariate would improve accuracy.
  • The independence assumption between home and away goals, even with the Dixon-Coles correction, is an approximation. A bivariate Poisson or copula-based model could better capture the joint distribution.

Discussion Questions

  1. How would you modify the model to account for a team that makes major signings in the January transfer window?
  2. What additional data sources could improve the model beyond match results?
  3. How would you extend the model to handle a cup competition (knockout format) rather than a league?
  4. If a sporting director wanted to know "What is the probability we finish in the top 4 if we win our next 3 matches?", how would you modify the simulation to answer this conditional question?

Code Reference

The complete implementation is available in code/case-study-code.py and code/example-01-match-prediction.py.