Case Study: Poisson Modeling of Soccer — Predicting Match Outcomes

Introduction

The Poisson distribution is the workhorse of soccer analytics. Since the pioneering work of Moroney (1956) and Reep and Benjamin (1968), analysts have recognized that the number of goals a team scores in a match closely follows a Poisson process. This case study walks through the construction of a complete Poisson-based soccer prediction model, from estimating team-level parameters to generating match probabilities and identifying value bets against the market.

We will build the model step by step using real-world logic and Python code. By the end of this case study, you will have a functioning system that estimates attack and defense ratings for every team in a league, predicts scorelines for any fixture, computes probabilities for the most popular betting markets (1X2, Over/Under, Both Teams to Score), and compares those probabilities to bookmaker odds to find value.

Why Poisson? The Statistical Foundation

The Poisson distribution models the number of independent events occurring within a fixed interval when the events happen at a known average rate. For soccer goals, the "interval" is a 90-minute match and the "rate" is the team's average goal-scoring output.

The Poisson probability mass function is:

P(X = k) = (lambda^k * e^(-lambda)) / k!

where lambda is the expected rate (mean goals) and k is the specific number of goals.

Three key assumptions underlie the model:

Independence: Goals are scored independently of each other. One goal does not make the next more or less likely.
Constant rate: The scoring rate is constant throughout the match.
No simultaneous events: Two goals cannot occur at exactly the same instant.

These assumptions are approximately, though not perfectly, satisfied in soccer. We will revisit their limitations later.

Data Preparation

For this case study, we simulate a dataset that mirrors a typical European league season: 20 teams, each playing 38 matches (home and away against every other team), for a total of 380 matches. In practice, you would load real data from sources such as football-data.co.uk or the Understat API.

"""
Poisson Soccer Model — Data Preparation
"""
import numpy as np
import pandas as pd
from scipy.stats import poisson
from itertools import product

np.random.seed(42)

# Simulate team strengths
teams = [f"Team_{i+1:02d}" for i in range(20)]
true_attack = np.random.normal(1.0, 0.25, size=20)
true_defense = np.random.normal(1.0, 0.20, size=20)
home_advantage = 1.25  # Home teams score ~25% more

# Generate all fixtures (each pair plays home and away)
fixtures = []
for i, home in enumerate(teams):
    for j, away in enumerate(teams):
        if i != j:
            home_rate = true_attack[i] * true_defense[j] * home_advantage
            away_rate = true_attack[j] * true_defense[i]
            home_goals = np.random.poisson(home_rate)
            away_goals = np.random.poisson(away_rate)
            fixtures.append({
                "home_team": home,
                "away_team": away,
                "home_goals": home_goals,
                "away_goals": away_goals,
            })

matches = pd.DataFrame(fixtures)
print(f"Total matches: {len(matches)}")
print(matches.head(10))

This produces 380 matches with realistic scorelines. The true_attack and true_defense arrays represent each team's latent offensive and defensive strength, which our model will attempt to recover from the observed results.

Estimating Team Parameters

The standard Poisson regression approach models the expected goals for each team in a match as a function of the team's attack strength, the opponent's defense strength, and a home advantage factor. Specifically:

E[Home Goals] = alpha_home_attack * beta_away_defense * gamma
E[Away Goals] = alpha_away_attack * beta_home_defense

where gamma is the home advantage multiplier.

We estimate these parameters by maximizing the Poisson log-likelihood. In practice, we use a log-linear model and solve via standard Poisson regression.

"""
Poisson Soccer Model — Parameter Estimation
"""
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Reshape data for regression: each match contributes two rows
# (one for home goals, one for away goals)
home_df = matches[["home_team", "away_team", "home_goals"]].copy()
home_df.columns = ["team", "opponent", "goals"]
home_df["is_home"] = 1

away_df = matches[["away_team", "home_team", "away_goals"]].copy()
away_df.columns = ["team", "opponent", "goals"]
away_df["is_home"] = 0

model_data = pd.concat([home_df, away_df], ignore_index=True)

# Fit Poisson regression
formula = "goals ~ C(team) + C(opponent) + is_home"
poisson_model = smf.glm(
    formula=formula,
    data=model_data,
    family=sm.families.Poisson()
).fit()

print(poisson_model.summary())

The fitted model gives us coefficients for each team's attacking strength (the team factor), each team's defensive weakness (the opponent factor, with higher values meaning a weaker defense), and the home advantage effect.

Extracting Attack and Defense Ratings

"""
Poisson Soccer Model — Extract Ratings
"""

def extract_ratings(model, teams, baseline_team):
    """Extract attack and defense ratings from fitted Poisson model."""
    params = model.params

    # Baseline (intercept) represents the reference team's log-rate
    baseline = params["Intercept"]
    home_adv = params["is_home"]

    attack_ratings = {}
    defense_ratings = {}

    for team in teams:
        # Attack rating
        attack_key = f"C(team)[T.{team}]"
        attack_coef = params.get(attack_key, 0.0)
        attack_ratings[team] = np.exp(baseline + attack_coef)

        # Defense rating (how many goals opponents score against this team)
        defense_key = f"C(opponent)[T.{team}]"
        defense_coef = params.get(defense_key, 0.0)
        defense_ratings[team] = np.exp(defense_coef)

    return attack_ratings, defense_ratings, np.exp(home_adv)

attack, defense, home_adv = extract_ratings(poisson_model, teams, teams[0])

ratings_df = pd.DataFrame({
    "Team": teams,
    "Attack": [attack[t] for t in teams],
    "Defense": [defense[t] for t in teams],
})
ratings_df = ratings_df.sort_values("Attack", ascending=False)
print("\nTeam Ratings (sorted by attack strength):")
print(ratings_df.to_string(index=False))
print(f"\nHome advantage multiplier: {home_adv:.4f}")

Predicting Match Outcomes

With attack and defense ratings in hand, we can predict any fixture. The expected goals for each team are calculated, and then the Poisson distribution generates the full scoreline probability matrix.

"""
Poisson Soccer Model — Match Prediction
"""

def predict_match(
    home_attack: float,
    home_defense: float,
    away_attack: float,
    away_defense: float,
    home_advantage: float,
    league_avg_goals: float = 1.0,
    max_goals: int = 8
) -> dict:
    """
    Predict match outcome probabilities using independent Poisson model.

    Returns dictionary with scoreline matrix, 1X2 probs, O/U probs, BTTS.
    """
    # Expected goals
    home_lambda = home_attack * away_defense * home_advantage
    away_lambda = away_attack * home_defense

    # Scoreline probability matrix
    home_probs = [poisson.pmf(g, home_lambda) for g in range(max_goals + 1)]
    away_probs = [poisson.pmf(g, away_lambda) for g in range(max_goals + 1)]

    score_matrix = np.outer(home_probs, away_probs)

    # 1X2 probabilities
    home_win = np.sum(np.tril(score_matrix, -1))  # Below diagonal
    draw = np.trace(score_matrix)
    away_win = np.sum(np.triu(score_matrix, 1))    # Above diagonal

    # Note: numpy outer gives [home_goals, away_goals]
    # tril(k=-1) is where row > col, meaning home_goals > away_goals
    # So we need to be careful with orientation

    # Recalculate explicitly
    home_win_prob = 0.0
    draw_prob = 0.0
    away_win_prob = 0.0

    for hg in range(max_goals + 1):
        for ag in range(max_goals + 1):
            p = score_matrix[hg, ag]
            if hg > ag:
                home_win_prob += p
            elif hg == ag:
                draw_prob += p
            else:
                away_win_prob += p

    # Over/Under
    total_probs = {}
    for hg in range(max_goals + 1):
        for ag in range(max_goals + 1):
            total = hg + ag
            total_probs[total] = total_probs.get(total, 0) + score_matrix[hg, ag]

    over_under = {}
    for line in [0.5, 1.5, 2.5, 3.5, 4.5]:
        over = sum(p for t, p in total_probs.items() if t > line)
        over_under[f"Over {line}"] = over
        over_under[f"Under {line}"] = 1 - over

    # Both Teams to Score
    btts_no = sum(score_matrix[0, :]) + sum(score_matrix[:, 0]) - score_matrix[0, 0]
    btts_yes = 1 - btts_no

    return {
        "home_lambda": home_lambda,
        "away_lambda": away_lambda,
        "score_matrix": score_matrix,
        "home_win": home_win_prob,
        "draw": draw_prob,
        "away_win": away_win_prob,
        "over_under": over_under,
        "btts_yes": btts_yes,
        "btts_no": btts_no,
    }


# Example prediction
home_team = "Team_01"
away_team = "Team_10"
result = predict_match(
    home_attack=attack[home_team],
    home_defense=defense[home_team],
    away_attack=attack[away_team],
    away_defense=defense[away_team],
    home_advantage=home_adv,
)

print(f"\n{home_team} vs {away_team}")
print(f"Expected Goals — Home: {result['home_lambda']:.2f}, Away: {result['away_lambda']:.2f}")
print(f"\n1X2 Probabilities:")
print(f"  Home Win: {result['home_win']:.4f} (Fair odds: {1/result['home_win']:.2f})")
print(f"  Draw:     {result['draw']:.4f} (Fair odds: {1/result['draw']:.2f})")
print(f"  Away Win: {result['away_win']:.4f} (Fair odds: {1/result['away_win']:.2f})")
print(f"\nOver/Under:")
for market, prob in result["over_under"].items():
    print(f"  {market}: {prob:.4f}")
print(f"\nBTTS Yes: {result['btts_yes']:.4f}")
print(f"BTTS No:  {result['btts_no']:.4f}")

Visualizing the Scoreline Matrix

A heatmap of the scoreline probability matrix provides an intuitive view of the most likely outcomes.

"""
Poisson Soccer Model — Visualization
"""
import matplotlib.pyplot as plt
import seaborn as sns

def plot_scoreline_matrix(score_matrix: np.ndarray, home_team: str, away_team: str):
    """Plot scoreline probability matrix as a heatmap."""
    max_goals = min(6, score_matrix.shape[0])
    display_matrix = score_matrix[:max_goals, :max_goals]

    fig, ax = plt.subplots(figsize=(8, 6))
    sns.heatmap(
        display_matrix,
        annot=True,
        fmt=".3f",
        cmap="YlOrRd",
        xticklabels=range(max_goals),
        yticklabels=range(max_goals),
        ax=ax,
    )
    ax.set_xlabel(f"{away_team} Goals")
    ax.set_ylabel(f"{home_team} Goals")
    ax.set_title(f"Scoreline Probabilities: {home_team} vs {away_team}")
    plt.tight_layout()
    plt.savefig("scoreline_heatmap.png", dpi=150)
    plt.show()

plot_scoreline_matrix(result["score_matrix"], home_team, away_team)

Identifying Value Bets

The power of the Poisson model lies in comparing its probability estimates to bookmaker odds. When our model assigns a higher probability to an outcome than the implied probability from the market odds, we have a potential value bet.

"""
Poisson Soccer Model — Value Bet Identification
"""

def find_value_bets(
    model_probs: dict,
    market_odds: dict,
    min_edge: float = 0.05
) -> list:
    """
    Compare model probabilities to market odds and identify value bets.

    Args:
        model_probs: dict with keys like 'home_win', 'draw', 'away_win',
                     'Over 2.5', 'btts_yes', etc.
        market_odds: dict with same keys, values are decimal odds.
        min_edge: minimum edge (model_prob - implied_prob) to flag.

    Returns:
        List of dicts describing value bets found.
    """
    value_bets = []

    # Calculate total implied probability (overround)
    total_implied = sum(1 / odds for odds in market_odds.values())

    for market, model_prob in model_probs.items():
        if market not in market_odds:
            continue

        decimal_odds = market_odds[market]
        raw_implied = 1 / decimal_odds

        # Adjust for overround (normalize)
        adjusted_implied = raw_implied / total_implied

        edge = model_prob - adjusted_implied

        if edge >= min_edge:
            # Kelly criterion for optimal bet sizing
            kelly_fraction = edge / (decimal_odds - 1) if decimal_odds > 1 else 0

            value_bets.append({
                "market": market,
                "model_prob": model_prob,
                "implied_prob_raw": raw_implied,
                "implied_prob_adjusted": adjusted_implied,
                "decimal_odds": decimal_odds,
                "edge": edge,
                "kelly_fraction": kelly_fraction,
            })

    return sorted(value_bets, key=lambda x: -x["edge"])


# Example: compare model to hypothetical market odds
market_odds = {
    "home_win": 2.10,
    "draw": 3.40,
    "away_win": 3.80,
    "Over 2.5": 1.85,
    "Under 2.5": 2.00,
    "btts_yes": 1.80,
    "btts_no": 2.05,
}

model_probs = {
    "home_win": result["home_win"],
    "draw": result["draw"],
    "away_win": result["away_win"],
    "Over 2.5": result["over_under"]["Over 2.5"],
    "Under 2.5": result["over_under"]["Under 2.5"],
    "btts_yes": result["btts_yes"],
    "btts_no": result["btts_no"],
}

value_bets = find_value_bets(model_probs, market_odds, min_edge=0.03)

print("\nValue Bets Found:")
print("-" * 80)
for bet in value_bets:
    print(f"  {bet['market']:12s} | Model: {bet['model_prob']:.4f} | "
          f"Implied: {bet['implied_prob_adjusted']:.4f} | "
          f"Edge: {bet['edge']:.4f} | "
          f"Kelly: {bet['kelly_fraction']:.4f}")

Model Validation

A model is only useful if its predictions are well-calibrated. We validate by checking whether events we predict at probability p actually occur about p fraction of the time.

"""
Poisson Soccer Model — Validation
"""

def validate_model(matches_df, attack, defense, home_adv):
    """Validate Poisson model on historical matches."""
    predictions = []

    for _, row in matches_df.iterrows():
        ht = row["home_team"]
        at = row["away_team"]

        pred = predict_match(
            home_attack=attack[ht],
            home_defense=defense[ht],
            away_attack=attack[at],
            away_defense=defense[at],
            home_advantage=home_adv,
        )

        # Determine actual outcome
        if row["home_goals"] > row["away_goals"]:
            actual = "home_win"
        elif row["home_goals"] == row["away_goals"]:
            actual = "draw"
        else:
            actual = "away_win"

        predictions.append({
            "home_win_prob": pred["home_win"],
            "draw_prob": pred["draw"],
            "away_win_prob": pred["away_win"],
            "actual": actual,
            "over_2_5_prob": pred["over_under"]["Over 2.5"],
            "total_goals": row["home_goals"] + row["away_goals"],
        })

    pred_df = pd.DataFrame(predictions)

    # Brier score for 1X2
    brier = 0
    for _, row in pred_df.iterrows():
        for outcome in ["home_win", "draw", "away_win"]:
            actual_val = 1 if row["actual"] == outcome else 0
            brier += (row[f"{outcome}_prob"] - actual_val) ** 2
    brier /= (3 * len(pred_df))

    # Calibration check
    print(f"\nModel Validation Results:")
    print(f"Brier Score: {brier:.4f}")

    # Check calibration in probability bins
    for outcome in ["home_win", "draw", "away_win"]:
        col = f"{outcome}_prob"
        actuals = (pred_df["actual"] == outcome).astype(float)

        bins = [0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0]
        for i in range(len(bins) - 1):
            mask = (pred_df[col] >= bins[i]) & (pred_df[col] < bins[i+1])
            if mask.sum() > 0:
                predicted_avg = pred_df.loc[mask, col].mean()
                actual_avg = actuals[mask].mean()
                n = mask.sum()
                print(f"  {outcome:10s} bin [{bins[i]:.1f}, {bins[i+1]:.1f}): "
                      f"Predicted={predicted_avg:.3f}, Actual={actual_avg:.3f}, N={n}")

    return pred_df

validation_results = validate_model(matches, attack, defense, home_adv)

Limitations and Extensions

The independent Poisson model has well-documented limitations that a serious bettor must understand:

1. Independence Assumption. The model assumes the two teams' goal outputs are independent. In reality, there is weak negative correlation: when a team scores, the losing team may push forward, creating chances for both sides. Conversely, a team with a lead may adopt defensive tactics, suppressing both teams' rates. The Dixon-Coles (1997) model addresses this by introducing a dependence parameter for low-scoring matches (0-0, 1-0, 0-1, 1-1).

2. Constant Rate Assumption. Goal-scoring rates are not constant through a match. Teams may tire, change tactics, or respond to the scoreline. Late goals are more common when a trailing team is chasing the game. Time-varying Poisson models or marked point processes can capture these dynamics.

3. Equidispersion. The Poisson distribution constrains the variance to equal the mean. In practice, goal data is often slightly overdispersed (variance exceeds the mean) due to unobserved heterogeneity. The Negative Binomial distribution relaxes this constraint by introducing a dispersion parameter.

4. Key Absences and Team News. The model uses season-long averages and does not account for injuries, suspensions, rotation, or motivation. A team resting key players for a Champions League match will have different scoring rates than their season average.

5. Small Sample Sizes. With only 38 matches per season, parameter estimates carry substantial uncertainty. Bayesian approaches or hierarchical models can regularize estimates and produce more stable ratings.

Despite these limitations, the Poisson model remains a strong baseline. Many professional bettors use it as a foundation, layering additional adjustments (form, team news, market signals) on top of the base Poisson probabilities.

Conclusion

This case study demonstrated the complete pipeline for Poisson-based soccer modeling:

Data preparation — structuring match results for regression.
Parameter estimation — fitting a Poisson regression to extract attack ratings, defense ratings, and home advantage.
Prediction — computing scoreline probabilities and derived market probabilities (1X2, Over/Under, BTTS).
Value identification — comparing model probabilities to market odds and flagging positive-edge opportunities.
Validation — checking calibration and computing Brier scores.

The Poisson model is not perfect, but it is transparent, mathematically grounded, and provides a rigorous baseline against which more complex models can be measured. For the aspiring sports bettor, understanding and implementing this model is an essential skill.

Key Takeaways:

The Poisson distribution is a natural model for soccer goals because goals are rare, discrete events occurring at a roughly constant rate.
A log-linear Poisson regression extracts attack, defense, and home advantage parameters from historical data.
The independent Poisson model generates a complete scoreline probability matrix from which all major betting markets can be priced.
Value bets emerge when model probabilities exceed implied probabilities from market odds by a meaningful margin.
The model's limitations (independence, constant rate, equidispersion) are well understood and can be addressed with extensions like Dixon-Coles, Negative Binomial, or time-varying models.

End of Case Study 1