> "The question is not whether your model is wrong --- all models are wrong. The question is whether your model is useful, and that requires honest, rigorous evaluation."
Learning Objectives
- Evaluate probabilistic predictions using proper scoring rules including Brier score, log loss, and their decompositions
- Design and implement backtesting frameworks that simulate realistic betting scenarios with vigorish and transaction costs
- Apply walk-forward validation techniques including expanding window, sliding window, and purged cross-validation
- Construct and interpret calibration plots and reliability diagrams for assessing probability quality
- Compare competing models using statistical tests and information criteria to make rigorous selection decisions
Chapter 30: Model Evaluation and Selection
"The question is not whether your model is wrong --- all models are wrong. The question is whether your model is useful, and that requires honest, rigorous evaluation." --- Adapted from George E. P. Box
Chapter Overview
You have engineered features (Chapter 28), trained models (Chapter 29), and generated predictions. But predictions are worthless until you can answer three critical questions: How good are these predictions? Are they good enough to bet on? And which of my models is best?
These questions are deceptively difficult. A model that predicts the correct winner 65% of the time sounds impressive, but if the market's no-vig closing probability is already 64% and you pay roughly 4.76% vigorish on top, your 65% model loses money. A model with a 0.20 Brier score might be excellent for NBA totals but mediocre for NFL game outcomes. And a model that appears to outperform your baseline on last season's data might simply have overfit --- an apparent advantage that vanishes on genuinely new data.
Model evaluation for sports betting requires specialized techniques that account for the unique characteristics of our domain: outcomes are probabilistic (not deterministic), data is temporal (not i.i.d.), transaction costs are substantial (the vig), and the baseline we must beat is the market itself (not random chance). Standard machine learning evaluation metrics --- accuracy, precision, recall, F1 --- are inadequate. We need proper scoring rules that evaluate the full probability distribution, backtesting frameworks that simulate realistic betting conditions, validation schemes that respect temporal ordering, calibration analysis that assesses whether predicted probabilities are honest, and statistical tests that distinguish genuine model differences from noise.
This chapter provides the complete toolkit for rigorous model evaluation and selection. Every technique is implemented in Python with worked examples using realistic sports prediction scenarios.
In this chapter, you will learn to:
- Evaluate probabilistic predictions with Brier score, log loss, and their decompositions into calibration, resolution, and uncertainty components
- Build backtesting frameworks that simulate historical betting with realistic vig, bankroll constraints, and transaction costs
- Implement walk-forward validation schemes that prevent temporal data leakage in sports prediction
- Construct calibration plots and compute expected calibration error to assess probability quality
- Compare models rigorously using the Diebold-Mariano test, information criteria, and cross-validated scoring rules
30.1 Brier Score and Log Loss for Probabilistic Predictions
Why Accuracy Is Insufficient
Consider two models predicting NBA game outcomes:
- Model A predicts every game at 55% for the home team. It achieves 58% accuracy (roughly matching the home-court advantage in the NBA).
- Model B predicts probabilities ranging from 25% to 85%, producing well-calibrated probability estimates. It also achieves 58% accuracy.
By accuracy alone, these models are identical. But Model B is far more useful for betting: it identifies games where the true probability is far from 50% and can generate large expected value bets when the market disagrees. Model A provides no actionable information beyond the baseline home-court advantage.
The problem with accuracy is that it evaluates only the binary decision (did you pick the right side?) while discarding all information about confidence (how confident was your prediction?). For betting, confidence is everything --- it determines bet sizing and whether a wager has positive expected value.
Proper Scoring Rules
A proper scoring rule is a function $S(p, y)$ that evaluates a probability forecast $p$ against an outcome $y$ such that the expected score is optimized when the forecast equals the true probability. In other words, a proper scoring rule cannot be "gamed" by reporting a probability different from your true belief.
Formally, a scoring rule $S$ is strictly proper if for all true probability distributions $q$:
$$\mathbb{E}_q[S(q, Y)] < \mathbb{E}_q[S(p, Y)] \quad \forall p \neq q$$
This property ensures that the forecaster's optimal strategy is to report their honest probability estimate --- exactly what we want.
Brier Score
The Brier score is the mean squared error of probability forecasts:
$$\text{BS} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2$$
where $p_i$ is the predicted probability and $y_i \in \{0, 1\}$ is the binary outcome. The Brier score ranges from 0 (perfect predictions) to 1 (maximally wrong predictions). A forecast of 50% for every game yields a Brier score of 0.25 (since $(0.5 - y)^2 = 0.25$ regardless of the outcome).
The Brier score can be decomposed into three additive components that provide diagnostic insight:
$$\text{BS} = \text{Reliability} - \text{Resolution} + \text{Uncertainty}$$
where:
- Reliability (lower is better): Measures how close predicted probabilities are to observed frequencies. A perfectly calibrated model has reliability = 0.
$$\text{Reliability} = \frac{1}{n} \sum_{k=1}^{K} n_k (\bar{p}_k - \bar{y}_k)^2$$
- Resolution (higher is better): Measures how much predicted probabilities vary from the base rate. A model that always predicts the base rate has resolution = 0.
$$\text{Resolution} = \frac{1}{n} \sum_{k=1}^{K} n_k (\bar{y}_k - \bar{y})^2$$
- Uncertainty (constant for a given dataset): Measures the inherent difficulty of the prediction problem. It equals $\bar{y}(1 - \bar{y})$, which is maximized at 0.25 when the base rate is 50%.
$$\text{Uncertainty} = \bar{y}(1 - \bar{y})$$
Here, the forecasts are binned into $K$ groups, $n_k$ is the number of forecasts in bin $k$, $\bar{p}_k$ is the average predicted probability in bin $k$, $\bar{y}_k$ is the average outcome in bin $k$, and $\bar{y}$ is the overall base rate.
Brier Skill Score
The Brier Skill Score (BSS) compares a model's Brier score to that of a reference forecast (typically the base rate):
$$\text{BSS} = 1 - \frac{\text{BS}_{\text{model}}}{\text{BS}_{\text{reference}}}$$
A BSS of 0 means the model is no better than the reference. A BSS of 1 means perfect predictions. Negative BSS means the model is worse than the reference.
For sports betting, a useful reference forecast is the market-implied probability derived from closing lines. A positive BSS relative to the market implies your model contains information beyond what the market has already priced in --- a necessary (but not sufficient) condition for profitable betting.
Log Loss
Log loss (also called logarithmic loss or cross-entropy loss) is another proper scoring rule that penalizes confident wrong predictions more severely than the Brier score:
$$\text{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$
Log loss is unbounded: a prediction of $p = 0$ for an outcome that occurs receives infinite penalty. In practice, predictions are clipped to a range like $[0.001, 0.999]$ to avoid this.
The key difference between Brier score and log loss is how they weight errors at the extremes. A prediction of 0.99 for an event that does not occur receives a Brier penalty of $(0.99)^2 \approx 0.98$ but a log loss penalty of $-\log(0.01) \approx 4.61$. Log loss is therefore more sensitive to overconfident predictions --- which is highly relevant in sports betting, where overconfidence leads to large losses on individual bets.
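To make this concrete, a two-line check (a minimal sketch; nothing beyond NumPy is assumed) reproduces both penalties for an overconfident miss:
import numpy as np

# Overconfident wrong prediction: p = 0.99 for an event that does not occur (y = 0).
p, y = 0.99, 0
brier_penalty = (p - y) ** 2        # 0.9801, bounded above by 1
log_penalty = -np.log(1 - p)        # ~4.61, unbounded as p -> 1
print(f"Brier penalty: {brier_penalty:.4f}, log loss penalty: {log_penalty:.2f}")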
Log Loss Decomposition
Log loss can also be decomposed, though less cleanly than the Brier score. The calibration component of log loss measures the KL divergence between predicted and observed probabilities within each bin:
$$\text{CalibrationLoss} = \sum_{k=1}^{K} \frac{n_k}{n} D_{\text{KL}}(\bar{y}_k \| \bar{p}_k)$$
where $D_{\text{KL}}(q \| p) = q \log(q/p) + (1-q) \log((1-q)/(1-p))$ is the Kullback-Leibler divergence.
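The implementation below does not compute this decomposition directly, but a minimal sketch of the calibration component follows from the formula. The binning mirrors the Brier decomposition; the helper name log_loss_calibration_component is ours, introduced only for illustration.
import numpy as np

def log_loss_calibration_component(
    predictions: np.ndarray,
    outcomes: np.ndarray,
    n_bins: int = 10,
    epsilon: float = 1e-6,
) -> float:
    """Bin-weighted KL divergence between observed and predicted rates."""
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_indices = np.digitize(predictions, bin_edges[1:-1])
    n = len(predictions)
    total = 0.0
    for k in range(n_bins):
        mask = bin_indices == k
        n_k = mask.sum()
        if n_k == 0:
            continue
        q = np.clip(outcomes[mask].mean(), epsilon, 1 - epsilon)     # observed rate
        p = np.clip(predictions[mask].mean(), epsilon, 1 - epsilon)  # mean forecast
        total += (n_k / n) * (
            q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))
        )
    return float(total)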
import numpy as np
import pandas as pd
from typing import Optional
import matplotlib.pyplot as plt
def compute_brier_score(
predictions: np.ndarray,
outcomes: np.ndarray,
) -> float:
"""
Compute the Brier score for probabilistic predictions.
The Brier score is the mean squared error between predicted
probabilities and binary outcomes. Lower is better (0 = perfect,
0.25 = uninformative baseline at 50%).
Args:
predictions: Array of predicted probabilities in [0, 1].
outcomes: Array of binary outcomes in {0, 1}.
Returns:
The Brier score.
Example:
>>> preds = np.array([0.7, 0.3, 0.8, 0.5])
>>> actual = np.array([1, 0, 1, 0])
>>> compute_brier_score(preds, actual)
        0.1175
"""
return float(np.mean((predictions - outcomes) ** 2))
def compute_log_loss(
predictions: np.ndarray,
outcomes: np.ndarray,
epsilon: float = 1e-15,
) -> float:
"""
Compute the log loss (binary cross-entropy) for probabilistic predictions.
Clips predictions to [epsilon, 1 - epsilon] to avoid log(0).
More sensitive to overconfident wrong predictions than Brier score.
Args:
predictions: Array of predicted probabilities in [0, 1].
outcomes: Array of binary outcomes in {0, 1}.
epsilon: Clipping threshold to prevent log(0).
Returns:
The log loss.
"""
p = np.clip(predictions, epsilon, 1 - epsilon)
return float(-np.mean(
outcomes * np.log(p) + (1 - outcomes) * np.log(1 - p)
))
def brier_score_decomposition(
predictions: np.ndarray,
outcomes: np.ndarray,
n_bins: int = 10,
) -> dict[str, float]:
"""
Decompose the Brier score into reliability, resolution, and uncertainty.
This decomposition provides diagnostic insight:
- Reliability (lower is better): Are predicted probabilities well-calibrated?
- Resolution (higher is better): Do predictions vary meaningfully?
- Uncertainty (constant): How inherently difficult is the problem?
The relationship is: BS = Reliability - Resolution + Uncertainty.
Args:
predictions: Array of predicted probabilities.
outcomes: Array of binary outcomes.
n_bins: Number of bins for grouping predictions.
Returns:
Dictionary with keys: 'brier_score', 'reliability', 'resolution',
'uncertainty', and per-bin statistics.
"""
n = len(predictions)
base_rate = outcomes.mean()
uncertainty = base_rate * (1 - base_rate)
# Bin predictions
bin_edges = np.linspace(0, 1, n_bins + 1)
bin_indices = np.digitize(predictions, bin_edges[1:-1])
reliability = 0.0
resolution = 0.0
bin_stats = []
for k in range(n_bins):
mask = bin_indices == k
n_k = mask.sum()
if n_k == 0:
continue
avg_pred = predictions[mask].mean()
avg_outcome = outcomes[mask].mean()
reliability += n_k * (avg_pred - avg_outcome) ** 2
resolution += n_k * (avg_outcome - base_rate) ** 2
bin_stats.append({
"bin": k,
"n": int(n_k),
"avg_pred": float(avg_pred),
"avg_outcome": float(avg_outcome),
"reliability_contrib": float(n_k * (avg_pred - avg_outcome) ** 2 / n),
})
reliability /= n
resolution /= n
brier_score = compute_brier_score(predictions, outcomes)
return {
"brier_score": brier_score,
"reliability": float(reliability),
"resolution": float(resolution),
"uncertainty": float(uncertainty),
"decomposition_check": float(reliability - resolution + uncertainty),
"bin_stats": bin_stats,
}
def brier_skill_score(
predictions: np.ndarray,
outcomes: np.ndarray,
reference_predictions: Optional[np.ndarray] = None,
) -> float:
"""
Compute the Brier Skill Score relative to a reference forecast.
BSS = 1 - BS_model / BS_reference.
If no reference is provided, uses the base rate (climatological
forecast) as the reference. For sports betting, passing
market-implied probabilities as the reference answers the question:
"Does my model add value beyond the market?"
Args:
predictions: Model's predicted probabilities.
outcomes: Binary outcomes.
reference_predictions: Reference forecast probabilities. If None,
uses the base rate (constant prediction).
Returns:
The Brier Skill Score. Positive = better than reference.
"""
bs_model = compute_brier_score(predictions, outcomes)
if reference_predictions is None:
# Use base rate as reference
base_rate = outcomes.mean()
bs_reference = base_rate * (1 - base_rate)
else:
bs_reference = compute_brier_score(reference_predictions, outcomes)
if bs_reference == 0:
return 0.0
return float(1 - bs_model / bs_reference)
Worked Example: Comparing Two NBA Prediction Models
def scoring_rules_example():
"""
Compare two NBA game prediction models using proper scoring rules.
Model A: A simple model based on season win percentage.
Model B: A neural network with engineered features.
Reference: Market-implied probabilities from closing lines.
This example demonstrates how proper scoring rules reveal
differences in model quality that raw accuracy cannot detect.
"""
np.random.seed(42)
n_games = 500
# Simulate game outcomes (home team wins ~58% of the time)
true_probs = np.random.beta(3.5, 2.5, n_games)
outcomes = (np.random.random(n_games) < true_probs).astype(float)
# Model A: Simple model (predicts based on win pct, less varied)
model_a_preds = np.clip(
true_probs + np.random.normal(0, 0.15, n_games), 0.3, 0.7,
)
# Model B: Better model (more accurate, wider range)
model_b_preds = np.clip(
true_probs + np.random.normal(0, 0.08, n_games), 0.1, 0.9,
)
# Market: Very accurate (efficient market hypothesis)
market_preds = np.clip(
true_probs + np.random.normal(0, 0.05, n_games), 0.15, 0.85,
)
# Compute accuracy (binary threshold at 0.5)
acc_a = ((model_a_preds > 0.5) == outcomes).mean()
acc_b = ((model_b_preds > 0.5) == outcomes).mean()
acc_market = ((market_preds > 0.5) == outcomes).mean()
print("=== Accuracy (Binary) ===")
print(f" Model A: {acc_a:.3f}")
print(f" Model B: {acc_b:.3f}")
print(f" Market: {acc_market:.3f}")
# Compute Brier scores
bs_a = compute_brier_score(model_a_preds, outcomes)
bs_b = compute_brier_score(model_b_preds, outcomes)
bs_market = compute_brier_score(market_preds, outcomes)
print("\n=== Brier Score (lower is better) ===")
print(f" Model A: {bs_a:.4f}")
print(f" Model B: {bs_b:.4f}")
print(f" Market: {bs_market:.4f}")
# Compute log loss
ll_a = compute_log_loss(model_a_preds, outcomes)
ll_b = compute_log_loss(model_b_preds, outcomes)
ll_market = compute_log_loss(market_preds, outcomes)
print("\n=== Log Loss (lower is better) ===")
print(f" Model A: {ll_a:.4f}")
print(f" Model B: {ll_b:.4f}")
print(f" Market: {ll_market:.4f}")
# Brier Skill Score vs. market
bss_a = brier_skill_score(model_a_preds, outcomes, market_preds)
bss_b = brier_skill_score(model_b_preds, outcomes, market_preds)
print("\n=== Brier Skill Score vs. Market ===")
print(f" Model A: {bss_a:.4f}")
print(f" Model B: {bss_b:.4f}")
# Decomposition for Model B
decomp = brier_score_decomposition(model_b_preds, outcomes)
print("\n=== Brier Score Decomposition (Model B) ===")
print(f" Reliability: {decomp['reliability']:.4f}")
print(f" Resolution: {decomp['resolution']:.4f}")
print(f" Uncertainty: {decomp['uncertainty']:.4f}")
print(f" Check (REL-RES+UNC): {decomp['decomposition_check']:.4f}")
Intuition: The Brier score decomposition tells you exactly what is wrong with your model. High reliability means your model is poorly calibrated --- when it says 70%, the outcome happens less or more than 70% of the time. Low resolution means your model does not discriminate well --- its predictions are clustered near the base rate and do not distinguish between likely and unlikely outcomes. A good betting model needs both: well-calibrated probabilities (reliability near zero) AND strong discrimination (high resolution).
30.2 Backtesting Strategies for Betting Models
Why Backtesting Matters
A model with excellent Brier score and calibration is not automatically a profitable betting model. Profitability depends on:
- The market's efficiency: If the market-implied probabilities are already well-calibrated, your model needs to be better than the market to generate edge.
- The vigorish: You must overcome the sportsbook's margin on every bet. At standard -110/-110 juice, you pay approximately 4.76% in transaction costs.
- Bet selection: You should only bet when your model's probability diverges sufficiently from the market price. Betting on every game dilutes edge with vig.
- Bankroll management: Even positive-EV strategies can go bankrupt with improper sizing.
Backtesting simulates your model's betting strategy on historical data, accounting for all these factors. It answers the critical question: If I had used this model and this strategy over the past N seasons, how much money would I have made or lost?
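To make the vig figure from the list above concrete, the short sketch below computes the overround of a standard -110/-110 market (the helper american_to_decimal is ours, introduced only for this illustration):
def american_to_decimal(american: int) -> float:
    """Convert American odds to decimal odds."""
    if american > 0:
        return 1 + american / 100
    return 1 + 100 / abs(american)

# Both sides of a standard spread market priced at -110.
implied_home = 1 / american_to_decimal(-110)   # ~0.5238
implied_away = 1 / american_to_decimal(-110)   # ~0.5238
overround = implied_home + implied_away - 1    # ~0.0476, i.e. ~4.76%
print(f"Overround at -110/-110: {overround:.2%}")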
Avoiding Look-Ahead Bias
The most dangerous error in backtesting is look-ahead bias --- using information that would not have been available at the time of each prediction. Common sources include:
- Feature leakage: Using season-ending statistics to predict mid-season games.
- Model training leakage: Training on data from games you later "predict" in the backtest.
- Odds leakage: Using closing odds when the model would have bet at opening odds.
- Selection bias: Choosing to backtest a strategy only after seeing that it would have worked.
A properly constructed backtest retrains the model at each time step using only data available up to that point, makes predictions before observing outcomes, and records bets at odds that were actually available.
Backtesting Framework
from dataclasses import dataclass, field
from typing import Callable
@dataclass
class BetRecord:
"""Record of a single simulated bet."""
date: str
game_id: str
side: str
model_prob: float
market_prob: float
odds: float # Decimal odds
stake: float
outcome: int # 1 = win, 0 = loss
profit: float
bankroll_after: float
@dataclass
class BacktestResult:
"""Complete results of a backtesting simulation."""
bets: list[BetRecord] = field(default_factory=list)
initial_bankroll: float = 0.0
final_bankroll: float = 0.0
total_bets: int = 0
winning_bets: int = 0
total_staked: float = 0.0
total_profit: float = 0.0
roi: float = 0.0
max_drawdown: float = 0.0
sharpe_ratio: float = 0.0
def summary(self) -> str:
"""Generate a human-readable summary of backtest results."""
win_rate = self.winning_bets / self.total_bets if self.total_bets > 0 else 0
return (
f"Backtest Summary\n"
f"{'=' * 40}\n"
f"Total bets: {self.total_bets}\n"
f"Win rate: {win_rate:.1%}\n"
f"Total staked: ${self.total_staked:,.2f}\n"
f"Total profit: ${self.total_profit:,.2f}\n"
f"ROI: {self.roi:.2%}\n"
f"Initial bankroll: ${self.initial_bankroll:,.2f}\n"
f"Final bankroll: ${self.final_bankroll:,.2f}\n"
f"Max drawdown: {self.max_drawdown:.2%}\n"
f"Sharpe ratio: {self.sharpe_ratio:.2f}\n"
)
class BettingBacktester:
"""
Backtesting framework for sports betting models.
Simulates historical betting with realistic constraints including
vigorish, bet sizing limits, and bankroll tracking. Supports
configurable bet selection criteria and staking strategies.
The framework enforces temporal integrity: at each time step, only
data available before that time step can be used for predictions.
Args:
initial_bankroll: Starting bankroll in dollars.
min_edge: Minimum required edge (model_prob - implied_prob)
to trigger a bet.
max_bet_fraction: Maximum fraction of bankroll to wager on
a single bet.
staking_method: 'flat' for fixed stakes, 'kelly' for Kelly
criterion, 'fractional_kelly' for a fraction of Kelly.
kelly_fraction: Fraction of Kelly stake to use (only relevant
if staking_method is 'fractional_kelly').
"""
def __init__(
self,
initial_bankroll: float = 10000.0,
min_edge: float = 0.03,
max_bet_fraction: float = 0.05,
staking_method: str = "fractional_kelly",
kelly_fraction: float = 0.25,
):
self.initial_bankroll = initial_bankroll
self.min_edge = min_edge
self.max_bet_fraction = max_bet_fraction
self.staking_method = staking_method
self.kelly_fraction = kelly_fraction
def _calculate_stake(
self,
bankroll: float,
model_prob: float,
decimal_odds: float,
) -> float:
"""
Calculate the stake for a bet using the configured staking method.
Args:
bankroll: Current bankroll.
model_prob: Model's estimated win probability.
decimal_odds: Decimal odds offered.
Returns:
Dollar amount to stake.
"""
max_stake = bankroll * self.max_bet_fraction
if self.staking_method == "flat":
return min(100.0, max_stake)
elif self.staking_method in ("kelly", "fractional_kelly"):
# Kelly criterion: f* = (bp - q) / b
# where b = decimal_odds - 1, p = model_prob, q = 1 - p
b = decimal_odds - 1
p = model_prob
q = 1 - p
if b <= 0:
return 0.0
kelly_fraction_optimal = (b * p - q) / b
if kelly_fraction_optimal <= 0:
return 0.0
if self.staking_method == "fractional_kelly":
kelly_fraction_optimal *= self.kelly_fraction
stake = bankroll * kelly_fraction_optimal
return min(stake, max_stake)
else:
raise ValueError(f"Unknown staking method: {self.staking_method}")
def _implied_probability_from_odds(self, decimal_odds: float) -> float:
"""Convert decimal odds to implied probability."""
return 1.0 / decimal_odds
def run(
self,
games: pd.DataFrame,
model_prob_col: str = "model_prob",
market_odds_col: str = "decimal_odds",
outcome_col: str = "outcome",
date_col: str = "date",
game_id_col: str = "game_id",
side_col: str = "side",
) -> BacktestResult:
"""
Run the backtest on historical game data.
Processes games in chronological order, deciding whether to bet
on each game based on the edge criteria, calculating stakes,
and tracking bankroll evolution.
Args:
games: DataFrame with one row per betting opportunity, sorted
by date. Must contain model probabilities, market odds,
and outcomes.
model_prob_col: Column with model's predicted probability.
market_odds_col: Column with decimal odds available.
outcome_col: Column with binary outcome (1 = win, 0 = loss).
date_col: Column with game date.
game_id_col: Column with game identifier.
side_col: Column indicating which side is being considered.
Returns:
BacktestResult with complete simulation results.
"""
games = games.sort_values(date_col).copy()
bankroll = self.initial_bankroll
bets = []
peak_bankroll = bankroll
max_drawdown = 0.0
for _, game in games.iterrows():
model_prob = game[model_prob_col]
decimal_odds = game[market_odds_col]
implied_prob = self._implied_probability_from_odds(decimal_odds)
edge = model_prob - implied_prob
# Only bet if edge exceeds minimum threshold
if edge < self.min_edge:
continue
# Calculate stake
stake = self._calculate_stake(bankroll, model_prob, decimal_odds)
if stake <= 0 or stake > bankroll:
continue
# Resolve bet
outcome = int(game[outcome_col])
if outcome == 1:
profit = stake * (decimal_odds - 1)
else:
profit = -stake
bankroll += profit
# Track drawdown
peak_bankroll = max(peak_bankroll, bankroll)
drawdown = (peak_bankroll - bankroll) / peak_bankroll
max_drawdown = max(max_drawdown, drawdown)
bets.append(BetRecord(
date=str(game[date_col]),
game_id=str(game[game_id_col]),
side=str(game.get(side_col, "")),
model_prob=float(model_prob),
market_prob=float(implied_prob),
odds=float(decimal_odds),
stake=float(stake),
outcome=outcome,
profit=float(profit),
bankroll_after=float(bankroll),
))
# Compute summary statistics
result = BacktestResult(
bets=bets,
initial_bankroll=self.initial_bankroll,
final_bankroll=bankroll,
total_bets=len(bets),
winning_bets=sum(1 for b in bets if b.outcome == 1),
total_staked=sum(b.stake for b in bets),
total_profit=bankroll - self.initial_bankroll,
max_drawdown=max_drawdown,
)
if result.total_staked > 0:
result.roi = result.total_profit / result.total_staked
        # Compute Sharpe ratio from per-bet returns. The sqrt(252)
        # annualization assumes roughly one bet per trading day;
        # adjust the factor to your actual bet frequency.
        if len(bets) > 1:
            per_bet_returns = pd.Series([b.profit / b.stake for b in bets])
            if per_bet_returns.std() > 0:
                result.sharpe_ratio = (
                    per_bet_returns.mean() / per_bet_returns.std()
                    * np.sqrt(252)
                )
return result
def plot_backtest_results(result: BacktestResult) -> None:
"""
Visualize backtest results with bankroll trajectory and drawdown.
Creates a two-panel plot showing:
1. Bankroll over time with bets colored by outcome.
2. Running drawdown from peak bankroll.
"""
if not result.bets:
print("No bets to plot.")
return
dates = [b.date for b in result.bets]
bankrolls = [result.initial_bankroll] + [b.bankroll_after for b in result.bets]
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 8), sharex=True)
# Bankroll trajectory
ax1.plot(range(len(bankrolls)), bankrolls, "b-", linewidth=1)
ax1.axhline(
y=result.initial_bankroll, color="gray", linestyle="--",
alpha=0.5, label="Starting bankroll",
)
ax1.set_ylabel("Bankroll ($)")
ax1.set_title("Backtest Results: Bankroll Trajectory")
ax1.legend()
# Drawdown
peak = np.maximum.accumulate(bankrolls)
drawdown = (peak - bankrolls) / peak
ax2.fill_between(range(len(drawdown)), drawdown, alpha=0.3, color="red")
ax2.plot(range(len(drawdown)), drawdown, "r-", linewidth=1)
ax2.set_ylabel("Drawdown")
ax2.set_xlabel("Bet Number")
ax2.set_title(f"Drawdown (Max: {result.max_drawdown:.1%})")
plt.tight_layout()
plt.show()
Common Pitfall: The most subtle form of look-ahead bias in betting backtests is model selection bias. If you try 50 different model configurations and report the backtest results of the best one, you are implicitly using future information (the knowledge that this configuration would have performed best). The solution is to split your historical data into three periods: training, validation (for model selection), and test (for final backtest evaluation). The test period should be touched only once, after all model decisions are finalized.
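A minimal sketch of such a chronological three-way split is shown below (the fractions are illustrative assumptions, not prescriptions):
import pandas as pd

def chronological_three_way_split(
    games: pd.DataFrame,
    date_col: str = "date",
    train_frac: float = 0.6,
    val_frac: float = 0.2,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Split games chronologically into train / validation / test periods."""
    games = games.sort_values(date_col)
    n = len(games)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    train = games.iloc[:train_end]              # fit candidate models
    validation = games.iloc[train_end:val_end]  # select configurations
    test = games.iloc[val_end:]                 # final backtest, touched once
    return train, validation, test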
30.3 Walk-Forward Validation
Why Standard Cross-Validation Fails for Sports Data
Standard $k$-fold cross-validation randomly shuffles observations into folds and trains on $k-1$ folds to predict the remaining fold. This is valid when observations are independent and identically distributed (i.i.d.). Sports data violates this assumption in two fundamental ways:
- Temporal dependence: A team's performance in game $t$ is correlated with its performance in games $t-1$, $t-2$, etc. Random shuffling can place game $t$ in the training set and game $t-1$ in the test set, allowing the model to learn from future information.
- Non-stationarity: The data-generating process changes over time. Teams improve, decline, change personnel, and adjust strategies. A model trained on 2020 data may be poorly suited for 2024 data.
If you apply standard cross-validation to sports data, you will get optimistically biased performance estimates. The model will appear to perform better than it actually would in production, because it has access to future information during training.
Walk-Forward Validation Schemes
Walk-forward validation respects temporal ordering by always training on past data and evaluating on future data. There are several variants:
Expanding Window
In expanding window validation, the training set grows over time. At each step, the model is trained on all data from the beginning through time $t$, and evaluated on data from time $t+1$ through $t+h$:
- Step 1: Train on months 1--12, evaluate on month 13
- Step 2: Train on months 1--13, evaluate on month 14
- Step 3: Train on months 1--14, evaluate on month 15
This approach maximizes the training set size at each step but assumes that older data remains relevant.
Sliding Window
In sliding window validation, the training set has a fixed size $w$. At each step, the model is trained on the most recent $w$ observations and evaluated on the next $h$ observations:
- Step 1: Train on months 1--12, evaluate on month 13
- Step 2: Train on months 2--13, evaluate on month 14
- Step 3: Train on months 3--14, evaluate on month 15
This approach discards old data that may no longer be relevant, but uses a smaller training set.
Purged Cross-Validation
Purged cross-validation adapts standard cross-validation for time series by introducing a gap between training and test folds. This gap ensures that the training set does not contain observations that are temporally close to (and therefore correlated with) test observations.
For sports data, the purge gap should be at least as wide as the longest feature lookback window. If your features include 20-game rolling averages, the purge gap should be at least 20 games, ensuring that no training observation's features overlap with any test observation's features.
from sklearn.model_selection import BaseCrossValidator
class WalkForwardValidator:
"""
Walk-forward validation for time series sports data.
Implements both expanding window and sliding window validation
schemes with configurable window sizes and step lengths.
Args:
method: 'expanding' or 'sliding'.
initial_train_size: Number of observations in the initial
training set.
test_size: Number of observations in each test fold.
step_size: Number of observations to advance between folds.
max_train_size: Maximum training set size (only for 'sliding').
gap: Number of observations to skip between train and test
(purge gap to prevent leakage from features with lookback).
"""
def __init__(
self,
method: str = "expanding",
initial_train_size: int = 500,
test_size: int = 100,
step_size: int = 50,
max_train_size: Optional[int] = None,
gap: int = 0,
):
self.method = method
self.initial_train_size = initial_train_size
self.test_size = test_size
self.step_size = step_size
self.max_train_size = max_train_size
self.gap = gap
def split(self, n_samples: int) -> list[tuple[np.ndarray, np.ndarray]]:
"""
Generate train/test index splits for walk-forward validation.
Args:
n_samples: Total number of observations.
Returns:
List of (train_indices, test_indices) tuples.
"""
splits = []
test_start = self.initial_train_size + self.gap
while test_start + self.test_size <= n_samples:
# Define test indices
test_indices = np.arange(test_start, test_start + self.test_size)
# Define train indices
train_end = test_start - self.gap
if self.method == "expanding":
train_start = 0
elif self.method == "sliding":
if self.max_train_size is not None:
train_start = max(0, train_end - self.max_train_size)
else:
train_start = max(
0, train_end - self.initial_train_size
)
else:
raise ValueError(f"Unknown method: {self.method}")
train_indices = np.arange(train_start, train_end)
if len(train_indices) > 0 and len(test_indices) > 0:
splits.append((train_indices, test_indices))
test_start += self.step_size
return splits
def get_n_splits(self, n_samples: int) -> int:
"""Return the number of splits."""
return len(self.split(n_samples))
class PurgedKFoldCV:
"""
Purged K-Fold cross-validation for time series data.
Standard K-Fold with two modifications:
1. Folds are temporally ordered (not shuffled).
2. A purge gap is applied: observations within 'gap' steps of the
test fold boundary are removed from training to prevent
information leakage through overlapping feature windows.
Args:
n_splits: Number of folds.
gap: Number of observations to purge from each side of the
test fold boundary.
"""
def __init__(self, n_splits: int = 5, gap: int = 20):
self.n_splits = n_splits
self.gap = gap
def split(self, n_samples: int) -> list[tuple[np.ndarray, np.ndarray]]:
"""
Generate purged K-Fold train/test splits.
Returns:
List of (train_indices, test_indices) tuples.
"""
fold_size = n_samples // self.n_splits
splits = []
for i in range(self.n_splits):
test_start = i * fold_size
test_end = min((i + 1) * fold_size, n_samples)
test_indices = np.arange(test_start, test_end)
# Training indices: everything outside test + gap
purge_start = max(0, test_start - self.gap)
purge_end = min(n_samples, test_end + self.gap)
train_indices = np.concatenate([
np.arange(0, purge_start),
np.arange(purge_end, n_samples),
])
if len(train_indices) > 0 and len(test_indices) > 0:
splits.append((train_indices, test_indices))
return splits
def walk_forward_evaluate(
features: pd.DataFrame,
targets: pd.Series,
model_factory: Callable,
validator: WalkForwardValidator,
scoring_fn: Callable = compute_brier_score,
) -> dict:
"""
Run walk-forward evaluation with model retraining at each step.
At each validation step:
1. Trains a new model on the training fold.
2. Generates predictions on the test fold.
3. Computes the scoring metric.
This provides an honest estimate of how the model would perform
in production, where it is periodically retrained on new data.
Args:
features: Full feature DataFrame.
targets: Full target Series.
model_factory: Callable that returns a fresh, untrained model.
Must have .fit(X, y) and .predict_proba(X) methods.
validator: Walk-forward validation splitter.
scoring_fn: Function(predictions, outcomes) -> float.
Returns:
Dictionary with per-fold scores and aggregate statistics.
"""
n_samples = len(features)
splits = validator.split(n_samples)
fold_scores = []
all_predictions = []
all_targets = []
all_fold_indices = []
for fold_idx, (train_idx, test_idx) in enumerate(splits):
# Split data
X_train = features.iloc[train_idx]
y_train = targets.iloc[train_idx]
X_test = features.iloc[test_idx]
y_test = targets.iloc[test_idx]
# Train model
model = model_factory()
model.fit(X_train, y_train)
# Predict
if hasattr(model, "predict_proba"):
preds = model.predict_proba(X_test)
if preds.ndim == 2:
preds = preds[:, 1] # Probability of positive class
else:
preds = model.predict(X_test)
# Score
score = scoring_fn(preds, y_test.values)
fold_scores.append(score)
all_predictions.extend(preds)
all_targets.extend(y_test.values)
all_fold_indices.extend([fold_idx] * len(test_idx))
if (fold_idx + 1) % 5 == 0:
print(
f"Fold {fold_idx + 1}/{len(splits)} | "
f"Train: {len(train_idx)} | Test: {len(test_idx)} | "
f"Score: {score:.4f}"
)
# Aggregate results
overall_score = scoring_fn(
np.array(all_predictions), np.array(all_targets),
)
results = {
"fold_scores": fold_scores,
"mean_score": float(np.mean(fold_scores)),
"std_score": float(np.std(fold_scores)),
"overall_score": overall_score,
"n_folds": len(splits),
"total_predictions": len(all_predictions),
"all_predictions": np.array(all_predictions),
"all_targets": np.array(all_targets),
}
print(f"\nWalk-Forward Results ({validator.method}):")
print(f" Folds: {len(splits)}")
print(f" Mean score: {results['mean_score']:.4f} +/- {results['std_score']:.4f}")
print(f" Overall score: {results['overall_score']:.4f}")
return results
Intuition: Walk-forward validation answers the question that matters for betting: "If I had actually used this model in real-time, retraining periodically as new data arrived, how would it have performed?" This is fundamentally different from the question answered by standard cross-validation, which is: "How well does this model fit a randomly held-out subset of data?" The first question is about production performance; the second is about in-sample fit. For betting, only the first question matters.
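To see how the two windowing schemes differ in practice, the brief sketch below (assuming the WalkForwardValidator defined above and a purge gap matching a 20-game feature lookback) prints the first few train/test boundaries for each method:
def compare_window_schemes(n_samples: int = 1000) -> None:
    """Print the first few train/test boundaries for each windowing scheme."""
    for method in ("expanding", "sliding"):
        validator = WalkForwardValidator(
            method=method,
            initial_train_size=500,
            test_size=100,
            step_size=100,
            gap=20,
        )
        splits = validator.split(n_samples)
        print(f"{method} window: {len(splits)} folds")
        for train_idx, test_idx in splits[:3]:
            print(
                f"  train [{train_idx[0]:4d}-{train_idx[-1]:4d}]"
                f" -> test [{test_idx[0]:4d}-{test_idx[-1]:4d}]"
            )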
30.4 Calibration Plots and Reliability Diagrams
What Calibration Means
A model is well-calibrated if its predicted probabilities match observed frequencies. When the model predicts a 70% probability, the event should occur approximately 70% of the time across all such predictions. Formally:
$$\mathbb{P}(Y = 1 \mid p(X) = p) = p \quad \forall p \in [0, 1]$$
Calibration is essential for betting because the decision to bet depends on comparing the model's probability to the market's implied probability. If your model says 65% but is systematically overconfident (events it calls 65% actually occur 58% of the time), you will make losing bets that appear profitable.
Reliability Diagrams
A reliability diagram (also called a calibration plot) visualizes calibration by plotting predicted probabilities against observed frequencies. The procedure is:
- Sort all predictions by predicted probability.
- Divide into $K$ bins (e.g., [0.0--0.1), [0.1--0.2), ..., [0.9--1.0]).
- For each bin, compute the average predicted probability and the average observed outcome.
- Plot average predicted probability (x-axis) against average observed frequency (y-axis).
A perfectly calibrated model produces points along the diagonal $y = x$. Points above the diagonal indicate underconfidence (events occur more often than predicted); points below indicate overconfidence (events occur less often than predicted).
Expected Calibration Error (ECE)
The Expected Calibration Error quantifies the deviation from perfect calibration as a single number:
$$\text{ECE} = \sum_{k=1}^{K} \frac{n_k}{n} |\bar{y}_k - \bar{p}_k|$$
where the sum is over bins, $n_k$ is the number of predictions in bin $k$, $\bar{p}_k$ is the average predicted probability in that bin, and $\bar{y}_k$ is the average observed outcome. ECE weights each bin by its size, giving more influence to bins with more predictions.
A stricter variant, the Maximum Calibration Error (MCE), takes the worst-case bin:
$$\text{MCE} = \max_{k} |\bar{y}_k - \bar{p}_k|$$
def calibration_analysis(
predictions: np.ndarray,
outcomes: np.ndarray,
n_bins: int = 10,
strategy: str = "uniform",
plot: bool = True,
model_name: str = "Model",
) -> dict:
"""
Comprehensive calibration analysis with reliability diagram.
Bins predictions, computes calibration metrics, and optionally
generates a reliability diagram with confidence intervals.
Args:
predictions: Array of predicted probabilities in [0, 1].
outcomes: Array of binary outcomes in {0, 1}.
n_bins: Number of calibration bins.
strategy: 'uniform' for equal-width bins, 'quantile' for
equal-frequency bins.
plot: Whether to generate the reliability diagram.
model_name: Label for the plot legend.
Returns:
Dictionary containing:
'ece': Expected calibration error.
'mce': Maximum calibration error.
'bin_data': Per-bin statistics.
'overall_bias': Mean prediction minus mean outcome.
"""
if strategy == "uniform":
bin_edges = np.linspace(0, 1, n_bins + 1)
elif strategy == "quantile":
bin_edges = np.percentile(
predictions, np.linspace(0, 100, n_bins + 1),
)
bin_edges[0] = 0.0
bin_edges[-1] = 1.0
else:
raise ValueError(f"Unknown strategy: {strategy}")
bin_indices = np.digitize(predictions, bin_edges[1:-1])
bin_data = []
ece = 0.0
mce = 0.0
n = len(predictions)
for k in range(n_bins):
mask = bin_indices == k
n_k = mask.sum()
if n_k == 0:
continue
avg_pred = predictions[mask].mean()
avg_outcome = outcomes[mask].mean()
abs_error = abs(avg_outcome - avg_pred)
ece += (n_k / n) * abs_error
mce = max(mce, abs_error)
# Confidence interval for observed frequency (Wilson score)
z = 1.96 # 95% CI
p_hat = avg_outcome
denominator = 1 + z ** 2 / n_k
center = (p_hat + z ** 2 / (2 * n_k)) / denominator
half_width = z * np.sqrt(
(p_hat * (1 - p_hat) + z ** 2 / (4 * n_k)) / n_k
) / denominator
ci_lower = max(0, center - half_width)
ci_upper = min(1, center + half_width)
bin_data.append({
"bin_center": float((bin_edges[k] + bin_edges[k + 1]) / 2),
"avg_pred": float(avg_pred),
"avg_outcome": float(avg_outcome),
"n": int(n_k),
"abs_error": float(abs_error),
"ci_lower": float(ci_lower),
"ci_upper": float(ci_upper),
})
if plot and bin_data:
_plot_reliability_diagram(bin_data, ece, mce, model_name)
return {
"ece": float(ece),
"mce": float(mce),
"bin_data": bin_data,
"overall_bias": float(predictions.mean() - outcomes.mean()),
}
def _plot_reliability_diagram(
bin_data: list[dict],
ece: float,
mce: float,
model_name: str,
) -> None:
"""Plot a reliability diagram with confidence intervals."""
fig, (ax1, ax2) = plt.subplots(
1, 2, figsize=(14, 6),
gridspec_kw={"width_ratios": [2, 1]},
)
avg_preds = [b["avg_pred"] for b in bin_data]
avg_outcomes = [b["avg_outcome"] for b in bin_data]
ci_lowers = [b["ci_lower"] for b in bin_data]
ci_uppers = [b["ci_upper"] for b in bin_data]
counts = [b["n"] for b in bin_data]
# Reliability diagram
ax1.plot([0, 1], [0, 1], "k--", alpha=0.5, label="Perfect calibration")
ax1.errorbar(
avg_preds, avg_outcomes,
yerr=[
[o - l for o, l in zip(avg_outcomes, ci_lowers)],
[u - o for o, u in zip(avg_outcomes, ci_uppers)],
],
fmt="o-", capsize=4, label=model_name,
)
ax1.fill_between([0, 1], [0, 1], alpha=0.05, color="gray")
ax1.set_xlabel("Mean Predicted Probability")
ax1.set_ylabel("Observed Frequency")
ax1.set_title(
f"Reliability Diagram\nECE = {ece:.4f}, MCE = {mce:.4f}"
)
ax1.legend()
ax1.set_xlim(-0.05, 1.05)
ax1.set_ylim(-0.05, 1.05)
ax1.set_aspect("equal")
# Histogram of predictions
bin_centers = [b["bin_center"] for b in bin_data]
ax2.bar(bin_centers, counts, width=0.08, alpha=0.7, color="steelblue")
ax2.set_xlabel("Predicted Probability")
ax2.set_ylabel("Count")
ax2.set_title("Prediction Distribution")
plt.tight_layout()
plt.show()
def compare_calibration(
models: dict[str, tuple[np.ndarray, np.ndarray]],
n_bins: int = 10,
) -> pd.DataFrame:
"""
Compare calibration metrics across multiple models.
Args:
models: Dictionary mapping model names to
(predictions, outcomes) tuples.
n_bins: Number of calibration bins.
Returns:
DataFrame with calibration metrics for each model.
"""
results = []
for name, (preds, outcomes) in models.items():
cal = calibration_analysis(
preds, outcomes, n_bins=n_bins, plot=False, model_name=name,
)
results.append({
"model": name,
"ece": cal["ece"],
"mce": cal["mce"],
"bias": cal["overall_bias"],
"brier_score": compute_brier_score(preds, outcomes),
"log_loss": compute_log_loss(preds, outcomes),
})
return pd.DataFrame(results).sort_values("brier_score")
Recalibration Techniques
If a model is well-discriminating but poorly calibrated, recalibration can improve probability estimates without retraining the model. Common techniques include:
- Platt scaling: Fits a logistic regression $p_{\text{calibrated}} = \sigma(a \cdot p_{\text{raw}} + b)$ on a held-out calibration set. Simple and effective for monotonically miscalibrated models.
- Isotonic regression: Fits a non-parametric, monotonically increasing function mapping raw probabilities to calibrated ones. More flexible than Platt scaling but requires more calibration data.
- Beta calibration: Fits a beta distribution-based mapping that can handle more complex miscalibration patterns than Platt scaling while being more constrained than isotonic regression.
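The recalibrator below covers Platt scaling and isotonic regression. For completeness, a rough sketch of unconstrained beta calibration, fitting $\text{logit}(p_{\text{cal}}) = a \ln p - b \ln(1-p) + c$ via logistic regression on transformed probabilities, might look like this (fit_beta_calibration is our illustrative helper, not a library function, and it omits the usual non-negativity constraints on $a$ and $b$):
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_beta_calibration(
    raw_probabilities: np.ndarray,
    outcomes: np.ndarray,
    epsilon: float = 1e-6,
):
    """Fit a beta-calibration map and return a transform function."""
    p = np.clip(raw_probabilities, epsilon, 1 - epsilon)
    features = np.column_stack([np.log(p), -np.log(1 - p)])
    model = LogisticRegression(C=1e10, solver="lbfgs")
    model.fit(features, outcomes)

    def transform(new_probabilities: np.ndarray) -> np.ndarray:
        q = np.clip(new_probabilities, epsilon, 1 - epsilon)
        new_features = np.column_stack([np.log(q), -np.log(1 - q)])
        return model.predict_proba(new_features)[:, 1]

    return transform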
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
class ModelRecalibrator:
"""
Recalibrate model probabilities using Platt scaling or isotonic regression.
Fits the recalibration mapping on a held-out calibration set,
then applies it to transform raw model probabilities into
better-calibrated estimates.
Important: The calibration set must be separate from both the
training set (to avoid double-counting) and the test set (to
preserve test set integrity).
Args:
method: 'platt' for Platt scaling, 'isotonic' for isotonic
regression.
"""
def __init__(self, method: str = "isotonic"):
self.method = method
self.calibrator_ = None
def fit(
self,
raw_probabilities: np.ndarray,
outcomes: np.ndarray,
) -> "ModelRecalibrator":
"""
Fit the recalibration mapping on calibration data.
Args:
raw_probabilities: Raw model output probabilities.
outcomes: True binary outcomes.
Returns:
self
"""
if self.method == "platt":
self.calibrator_ = LogisticRegression(C=1e10, solver="lbfgs")
self.calibrator_.fit(
raw_probabilities.reshape(-1, 1), outcomes,
)
elif self.method == "isotonic":
self.calibrator_ = IsotonicRegression(
y_min=0.001, y_max=0.999, out_of_bounds="clip",
)
self.calibrator_.fit(raw_probabilities, outcomes)
else:
raise ValueError(f"Unknown method: {self.method}")
return self
def transform(self, raw_probabilities: np.ndarray) -> np.ndarray:
"""
Apply the learned recalibration mapping to new probabilities.
Args:
raw_probabilities: Raw model output probabilities.
Returns:
Recalibrated probabilities.
"""
if self.calibrator_ is None:
raise RuntimeError("Calibrator not fitted. Call fit() first.")
if self.method == "platt":
return self.calibrator_.predict_proba(
raw_probabilities.reshape(-1, 1)
)[:, 1]
elif self.method == "isotonic":
return self.calibrator_.predict(raw_probabilities)
else:
raise ValueError(f"Unknown method: {self.method}")
Real-World Application: In practice, most sports prediction models are somewhat overconfident --- they assign extreme probabilities (near 0 or 1) more often than the data warrants. This is partly because models overfit to training data and partly because sports outcomes have substantial irreducible randomness. Isotonic recalibration on a held-out season of data typically reduces ECE by 30--60%, which translates directly to better betting decisions: you stop making large bets on events your model is falsely confident about.
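A brief usage sketch (with synthetic stand-in data; in practice the calibration set would be a held-out season, separate from training and test data) illustrates the intended workflow:
import numpy as np

rng = np.random.default_rng(0)

# Stand-in calibration data: raw model probabilities and matching outcomes.
cal_probs = rng.uniform(0.05, 0.95, size=2000)
cal_outcomes = (rng.random(2000) < cal_probs).astype(float)

recalibrator = ModelRecalibrator(method="isotonic")
recalibrator.fit(cal_probs, cal_outcomes)

# Apply the learned mapping to new raw model outputs.
raw_test_probs = np.array([0.12, 0.48, 0.91])
calibrated = recalibrator.transform(raw_test_probs)
print(calibrated)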
30.5 Model Comparison and Selection Criteria
The Model Selection Problem
Given multiple candidate models --- say, a logistic regression, an XGBoost model, a feedforward neural network, and an LSTM --- how do you choose the best one? This question is more subtle than it appears. A model that achieves the best Brier score on one validation fold may not be the best on another. The difference between models may be within the range of statistical noise. And the "best" model on historical data may overfit in ways that do not generalize.
Rigorous model selection requires both information criteria (which penalize model complexity) and statistical tests (which determine whether observed performance differences are significant).
Information Criteria: AIC and BIC
Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) trade off model fit against complexity:
$$\text{AIC} = 2k - 2 \ln \hat{L}$$
$$\text{BIC} = k \ln n - 2 \ln \hat{L}$$
where $k$ is the number of model parameters, $n$ is the number of observations, and $\hat{L}$ is the maximized likelihood.
For probabilistic predictions where we use log loss as the negative log-likelihood:
$$-2 \ln \hat{L} = 2n \cdot \text{LogLoss}$$
So:
$$\text{AIC} = 2k + 2n \cdot \text{LogLoss}$$
$$\text{BIC} = k \ln n + 2n \cdot \text{LogLoss}$$
BIC penalizes model complexity more heavily than AIC (for $n > 7$), so it tends to select simpler models. For betting, where overfitting is a primary concern and prediction accuracy on new data matters more than capturing every nuance in historical data, BIC is generally the more appropriate criterion.
def compute_information_criteria(
predictions: np.ndarray,
outcomes: np.ndarray,
n_parameters: int,
epsilon: float = 1e-15,
) -> dict[str, float]:
"""
Compute AIC and BIC for a probabilistic prediction model.
Uses log loss as the negative log-likelihood per observation,
then applies the standard AIC and BIC formulas.
Args:
predictions: Model's predicted probabilities.
outcomes: True binary outcomes.
n_parameters: Number of trainable parameters in the model.
epsilon: Clipping value to prevent log(0).
Returns:
Dictionary with 'aic', 'bic', 'log_loss', and 'n_obs'.
Example:
>>> preds = np.array([0.7, 0.3, 0.8, 0.6])
>>> outcomes = np.array([1, 0, 1, 0])
>>> result = compute_information_criteria(preds, outcomes, n_parameters=10)
>>> print(f"AIC: {result['aic']:.2f}")
"""
n = len(predictions)
p = np.clip(predictions, epsilon, 1 - epsilon)
# Log-likelihood
log_likelihood = np.sum(
outcomes * np.log(p) + (1 - outcomes) * np.log(1 - p)
)
# Information criteria
aic = 2 * n_parameters - 2 * log_likelihood
bic = n_parameters * np.log(n) - 2 * log_likelihood
# AIC corrected for small sample size (AICc)
if n - n_parameters - 1 > 0:
aicc = aic + (2 * n_parameters * (n_parameters + 1)) / (n - n_parameters - 1)
else:
aicc = float("inf")
return {
"aic": float(aic),
"aicc": float(aicc),
"bic": float(bic),
"log_likelihood": float(log_likelihood),
"log_loss": float(-log_likelihood / n),
"n_obs": n,
"n_parameters": n_parameters,
}
The Diebold-Mariano Test
The Diebold-Mariano (DM) test is a statistical test for comparing the predictive accuracy of two forecasting models. It tests the null hypothesis that two models have equal expected loss:
$$H_0: \mathbb{E}[d_t] = 0 \quad \text{where} \quad d_t = L(e_{1,t}) - L(e_{2,t})$$
Here, $d_t$ is the difference in loss between models 1 and 2 at time $t$, and $L$ is the loss function (Brier score or log loss for individual predictions).
The DM test statistic is:
$$\text{DM} = \frac{\bar{d}}{\sqrt{\hat{V}(\bar{d})}}$$
where $\bar{d} = \frac{1}{n} \sum_{t=1}^{n} d_t$ is the mean loss differential and $\hat{V}(\bar{d})$ is the estimated variance of $\bar{d}$, accounting for autocorrelation using the Newey-West estimator:
$$\hat{V}(\bar{d}) = \frac{1}{n} \left[ \hat{\gamma}_0 + 2 \sum_{k=1}^{h} \left(1 - \frac{k}{h+1}\right) \hat{\gamma}_k \right]$$
where $\hat{\gamma}_k$ is the sample autocovariance of $d_t$ at lag $k$ and $h$ is the truncation lag.
Under the null hypothesis, the DM statistic follows a standard normal distribution asymptotically. A large positive DM statistic indicates that Model 2 is significantly better (lower loss) than Model 1.
from scipy import stats
def diebold_mariano_test(
predictions_1: np.ndarray,
predictions_2: np.ndarray,
outcomes: np.ndarray,
loss_fn: str = "brier",
max_lag: Optional[int] = None,
) -> dict[str, float]:
"""
Diebold-Mariano test for comparing two forecasting models.
Tests whether the difference in predictive accuracy between two
models is statistically significant, accounting for temporal
autocorrelation in the loss differential series.
A positive test statistic means Model 2 is better (lower loss).
A negative test statistic means Model 1 is better.
Args:
predictions_1: Probability forecasts from Model 1.
predictions_2: Probability forecasts from Model 2.
outcomes: True binary outcomes.
loss_fn: Loss function to use: 'brier' for squared error,
'log' for log loss.
max_lag: Maximum lag for the Newey-West variance estimator.
If None, uses int(n^(1/3)).
Returns:
Dictionary with keys:
'dm_statistic': The test statistic.
'p_value': Two-sided p-value.
'mean_diff': Mean loss difference (positive = Model 2 better).
'model_1_loss': Mean loss of Model 1.
'model_2_loss': Mean loss of Model 2.
'conclusion': Human-readable interpretation.
Example:
>>> preds_1 = np.array([0.6, 0.4, 0.7, 0.5, 0.8])
>>> preds_2 = np.array([0.65, 0.35, 0.72, 0.48, 0.82])
>>> outcomes = np.array([1, 0, 1, 0, 1])
>>> result = diebold_mariano_test(preds_1, preds_2, outcomes)
"""
n = len(outcomes)
# Compute individual losses
if loss_fn == "brier":
losses_1 = (predictions_1 - outcomes) ** 2
losses_2 = (predictions_2 - outcomes) ** 2
elif loss_fn == "log":
eps = 1e-15
p1 = np.clip(predictions_1, eps, 1 - eps)
p2 = np.clip(predictions_2, eps, 1 - eps)
losses_1 = -(outcomes * np.log(p1) + (1 - outcomes) * np.log(1 - p1))
losses_2 = -(outcomes * np.log(p2) + (1 - outcomes) * np.log(1 - p2))
else:
raise ValueError(f"Unknown loss function: {loss_fn}")
# Loss differentials
d = losses_1 - losses_2 # Positive means Model 2 is better
d_mean = d.mean()
# Newey-West variance estimate
if max_lag is None:
max_lag = int(n ** (1 / 3))
# Autocovariances
d_centered = d - d_mean
gamma_0 = np.mean(d_centered ** 2)
nw_sum = gamma_0
for k in range(1, max_lag + 1):
gamma_k = np.mean(d_centered[k:] * d_centered[:-k])
weight = 1 - k / (max_lag + 1) # Bartlett kernel
nw_sum += 2 * weight * gamma_k
variance = nw_sum / n
if variance <= 0:
return {
"dm_statistic": 0.0,
"p_value": 1.0,
"mean_diff": float(d_mean),
"model_1_loss": float(losses_1.mean()),
"model_2_loss": float(losses_2.mean()),
"conclusion": "Variance estimate non-positive; test inconclusive.",
}
# DM statistic
dm_stat = d_mean / np.sqrt(variance)
p_value = 2 * (1 - stats.norm.cdf(abs(dm_stat)))
# Interpretation
alpha = 0.05
if p_value < alpha:
if dm_stat > 0:
conclusion = (
f"Model 2 is significantly better than Model 1 "
f"(p = {p_value:.4f})"
)
else:
conclusion = (
f"Model 1 is significantly better than Model 2 "
f"(p = {p_value:.4f})"
)
else:
conclusion = (
f"No significant difference between models "
f"(p = {p_value:.4f})"
)
return {
"dm_statistic": float(dm_stat),
"p_value": float(p_value),
"mean_diff": float(d_mean),
"model_1_loss": float(losses_1.mean()),
"model_2_loss": float(losses_2.mean()),
"conclusion": conclusion,
}
Cross-Validated Brier Score for Model Selection
When comparing models with different complexity levels (e.g., logistic regression vs. deep neural network), walk-forward cross-validated Brier score provides a robust comparison that accounts for both predictive accuracy and generalization:
def cross_validated_model_comparison(
features: pd.DataFrame,
targets: pd.Series,
model_factories: dict[str, Callable],
validator: WalkForwardValidator,
) -> pd.DataFrame:
"""
Compare multiple models using walk-forward cross-validated Brier score.
For each model, runs the full walk-forward validation procedure
and collects per-fold Brier scores. Then performs pairwise
Diebold-Mariano tests to determine statistical significance of
differences.
Args:
features: Feature DataFrame.
targets: Target Series.
model_factories: Dict mapping model names to callables that
return untrained model instances.
validator: Walk-forward validation splitter.
Returns:
DataFrame with model comparison results.
"""
all_results = {}
all_predictions = {}
for name, factory in model_factories.items():
print(f"\n{'=' * 40}")
print(f"Evaluating: {name}")
print(f"{'=' * 40}")
result = walk_forward_evaluate(
features, targets, factory, validator,
)
all_results[name] = result
all_predictions[name] = (
result["all_predictions"],
result["all_targets"],
)
# Summary table
summary_rows = []
for name, result in all_results.items():
summary_rows.append({
"model": name,
"mean_brier": result["mean_score"],
"std_brier": result["std_score"],
"overall_brier": result["overall_score"],
"n_folds": result["n_folds"],
})
summary = pd.DataFrame(summary_rows).sort_values("mean_brier")
# Pairwise DM tests
model_names = list(all_predictions.keys())
dm_results = []
for i in range(len(model_names)):
for j in range(i + 1, len(model_names)):
name_1 = model_names[i]
name_2 = model_names[j]
preds_1 = all_predictions[name_1][0]
preds_2 = all_predictions[name_2][0]
targets_arr = all_predictions[name_1][1]
dm = diebold_mariano_test(preds_1, preds_2, targets_arr)
dm_results.append({
"model_1": name_1,
"model_2": name_2,
"dm_statistic": dm["dm_statistic"],
"p_value": dm["p_value"],
"conclusion": dm["conclusion"],
})
dm_table = pd.DataFrame(dm_results)
print("\n" + "=" * 60)
print("MODEL COMPARISON SUMMARY")
print("=" * 60)
print(summary.to_string(index=False))
print("\nPairwise Diebold-Mariano Tests:")
print(dm_table.to_string(index=False))
return summary
Practical Selection Criteria
Beyond statistical tests, practical considerations often drive model selection in sports betting:
| Criterion | Description | Importance |
|---|---|---|
| Predictive accuracy | Cross-validated Brier score or log loss | Primary |
| Calibration | ECE and reliability diagram quality | High |
| Profitability | Backtested ROI with realistic vig | High |
| Stability | Low variance in cross-validation scores across folds | Medium |
| Interpretability | Can you explain why the model made each prediction? | Medium |
| Computational cost | Training time, inference latency | Low-Medium |
| Complexity | Number of parameters, risk of overfitting | Medium |
| Maintainability | Ease of retraining, feature pipeline complexity | Medium |
A pragmatic approach is to select the simplest model whose performance is not statistically significantly worse than the best model (using the DM test). This follows the principle of parsimony: if a logistic regression achieves Brier score 0.215 and a deep neural network achieves 0.210 but the DM test shows $p = 0.18$, the logistic regression is the better choice --- it is simpler, more interpretable, faster to retrain, and less likely to overfit.
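A minimal sketch of this rule, assuming the compute_brier_score and diebold_mariano_test functions defined above plus a user-supplied complexity ranking (for example, parameter counts), might look like this:
def select_parsimonious_model(
    model_predictions: dict[str, np.ndarray],
    outcomes: np.ndarray,
    complexity: dict[str, int],
    alpha: float = 0.05,
) -> str:
    """Return the simplest model not significantly worse than the best one."""
    # Identify the best model by Brier score.
    brier = {
        name: compute_brier_score(preds, outcomes)
        for name, preds in model_predictions.items()
    }
    best = min(brier, key=brier.get)
    # Keep every model whose loss is not significantly worse than the best's.
    candidates = [best]
    for name, preds in model_predictions.items():
        if name == best:
            continue
        dm = diebold_mariano_test(preds, model_predictions[best], outcomes)
        if dm["p_value"] >= alpha:
            candidates.append(name)
    # Among the candidates, prefer the least complex model.
    return min(candidates, key=lambda name: complexity[name])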
Model Ensembling
When multiple models have comparable performance, ensembling (averaging their predictions) often outperforms any individual model. The simple average:
$$\hat{p}_{\text{ensemble}} = \frac{1}{M} \sum_{m=1}^{M} \hat{p}_m$$
works well when models make different types of errors (i.e., their errors are not perfectly correlated). More sophisticated approaches weight models by their cross-validated performance:
$$\hat{p}_{\text{weighted}} = \sum_{m=1}^{M} w_m \hat{p}_m \quad \text{where} \quad w_m \propto \frac{1}{\text{BS}_m}$$
def ensemble_predictions(
    model_predictions: dict[str, np.ndarray],
    outcomes: np.ndarray,
    method: str = "inverse_brier",
) -> np.ndarray:
    """
    Combine predictions from multiple models into an ensemble.

    Args:
        model_predictions: Dict mapping model names to prediction arrays.
        outcomes: True outcomes (used for weighting if method requires it).
        method: 'simple' for equal weights, 'inverse_brier' for weights
            inversely proportional to Brier score.

    Returns:
        Ensemble prediction array.
    """
    names = list(model_predictions.keys())
    preds_matrix = np.column_stack([model_predictions[n] for n in names])

    if method == "simple":
        weights = np.ones(len(names)) / len(names)
    elif method == "inverse_brier":
        brier_scores = []
        for name in names:
            bs = compute_brier_score(model_predictions[name], outcomes)
            brier_scores.append(bs)
        brier_scores = np.array(brier_scores)
        # Inverse Brier score weights (lower BS = higher weight)
        inverse_bs = 1.0 / np.maximum(brier_scores, 1e-10)
        weights = inverse_bs / inverse_bs.sum()
    else:
        raise ValueError(f"Unknown method: {method}")

    ensemble = preds_matrix @ weights

    # Report
    ensemble_bs = compute_brier_score(ensemble, outcomes)
    print("Ensemble Weights:")
    for name, w in zip(names, weights):
        bs = compute_brier_score(model_predictions[name], outcomes)
        print(f"  {name}: weight={w:.3f}, Brier={bs:.4f}")
    print(f"  Ensemble: Brier={ensemble_bs:.4f}")
    return ensemble
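A quick usage sketch on synthetic data shows the intended call pattern. The model names and noise levels below are made up purely for illustration; ensemble_predictions is the function defined above.

import numpy as np

rng = np.random.default_rng(42)
true_prob = rng.uniform(0.35, 0.75, size=500)              # latent home-win probabilities
outcomes = (rng.uniform(size=500) < true_prob).astype(float)

# Three hypothetical models, each a noisy view of the true probability
model_preds = {
    "logistic": np.clip(true_prob + rng.normal(0, 0.05, 500), 0.01, 0.99),
    "gbm": np.clip(true_prob + rng.normal(0, 0.04, 500), 0.01, 0.99),
    "neural_net": np.clip(true_prob + rng.normal(0, 0.07, 500), 0.01, 0.99),
}

blended = ensemble_predictions(model_preds, outcomes, method="inverse_brier")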
Common Pitfall: Model ensembles can hide overfitting. If you select which models to include in the ensemble based on their performance on the same data used to evaluate the ensemble, you introduce selection bias. Use a three-way split: (1) train individual models, (2) select ensemble composition and weights on a validation set, (3) evaluate the final ensemble on a held-out test set.
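Continuing the synthetic model_preds and outcomes from the sketch above, one way to respect this three-way split is to freeze the inverse-Brier weights on a chronologically earlier validation slice and apply them, unchanged, to a later test slice. The 60/20/20 proportions here are illustrative, not a recommendation.

n = len(outcomes)
train_end, val_end = int(0.6 * n), int(0.8 * n)            # train / validation / test boundaries
val_idx = np.arange(train_end, val_end)
test_idx = np.arange(val_end, n)

# Fit ensemble weights on the validation slice only
names = list(model_preds.keys())
val_bs = np.array([compute_brier_score(model_preds[m][val_idx], outcomes[val_idx])
                   for m in names])
weights = (1.0 / val_bs) / (1.0 / val_bs).sum()

# Apply the frozen weights to the untouched test slice and score exactly once
test_matrix = np.column_stack([model_preds[m][test_idx] for m in names])
test_ensemble = test_matrix @ weights
print(f"Held-out test Brier: {compute_brier_score(test_ensemble, outcomes[test_idx]):.4f}")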
30.6 Chapter Summary
Key Concepts
- Proper scoring rules (Brier score, log loss) are essential for evaluating probabilistic predictions because they reward both accuracy and calibration. Unlike raw accuracy, they capture the full quality of probability estimates.
- The Brier score decomposition into reliability, resolution, and uncertainty provides diagnostic insight: reliability measures calibration quality, resolution measures discriminative ability, and uncertainty is a constant reflecting problem difficulty.
- Brier Skill Score compares a model against a reference forecast. Using market-implied probabilities as the reference answers the crucial question: does your model add value beyond the market?
- Backtesting simulates a model's historical betting performance with realistic transaction costs. It must avoid look-ahead bias by ensuring that only historically available information is used at each prediction point.
- Walk-forward validation (expanding window or sliding window) is the correct cross-validation strategy for time series sports data. Standard k-fold cross-validation produces optimistically biased estimates due to temporal leakage.
- Purged cross-validation adds a gap between training and test folds to prevent leakage through feature lookback windows. The purge gap should equal or exceed the longest feature window.
- Calibration plots and ECE assess whether predicted probabilities match observed frequencies. Poorly calibrated models can be improved through recalibration (Platt scaling or isotonic regression) without retraining.
- The Diebold-Mariano test provides a statistical test for whether two models have significantly different predictive accuracy, accounting for autocorrelation in the loss differential series.
- Information criteria (AIC, BIC) balance model fit against complexity. BIC is generally more appropriate for betting models because it penalizes complexity more heavily, reducing overfitting risk.
- Model ensembling often outperforms any individual model when ensemble members make diverse errors. Weighting by inverse Brier score concentrates weight on better-performing models.
Key Formulas
| Formula | Description |
|---|---|
| $\text{BS} = \frac{1}{n}\sum(p_i - y_i)^2$ | Brier score |
| $\text{BS} = \text{REL} - \text{RES} + \text{UNC}$ | Brier decomposition |
| $\text{BSS} = 1 - \text{BS}_{\text{model}} / \text{BS}_{\text{ref}}$ | Brier Skill Score |
| $\text{LL} = -\frac{1}{n}\sum[y\log p + (1-y)\log(1-p)]$ | Log loss |
| $\text{ECE} = \sum \frac{n_k}{n} \lvert \bar{y}_k - \bar{p}_k \rvert$ | Expected Calibration Error |
| $\text{AIC} = 2k - 2\ln\hat{L}$ | Akaike Information Criterion |
| $\text{BIC} = k\ln n - 2\ln\hat{L}$ | Bayesian Information Criterion |
| $\text{DM} = \bar{d} / \sqrt{\hat{V}(\bar{d})}$ | Diebold-Mariano test statistic |
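As a quick sanity check, the snippet below recomputes most of these formulas directly with NumPy on a tiny made-up forecast set (ECE is omitted because it requires binning). All numbers are illustrative only.

import numpy as np

p = np.array([0.70, 0.55, 0.30, 0.80, 0.45])    # model probabilities (made up)
q = np.array([0.65, 0.52, 0.35, 0.75, 0.50])    # reference, e.g. market-implied
y = np.array([1, 0, 0, 1, 1], dtype=float)      # observed outcomes

bs_model = np.mean((p - y) ** 2)                            # Brier score
bs_ref = np.mean((q - y) ** 2)
bss = 1.0 - bs_model / bs_ref                               # Brier Skill Score
log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Information criteria for a hypothetical fitted model with k parameters
k, n = 4, len(y)
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
aic = 2 * k - 2 * log_likelihood
bic = k * np.log(n) - 2 * log_likelihood

# Diebold-Mariano statistic on the squared-error loss differential
# (naive variance of the mean; the chapter's implementation adds an
# autocorrelation correction)
d = (p - y) ** 2 - (q - y) ** 2
dm = d.mean() / np.sqrt(d.var(ddof=1) / len(d))

print(f"BS={bs_model:.4f}  BSS={bss:.4f}  LL={log_loss:.4f}  "
      f"AIC={aic:.2f}  BIC={bic:.2f}  DM={dm:.3f}")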
Key Code Patterns
- Scoring functions (compute_brier_score, compute_log_loss, brier_score_decomposition): Evaluate probabilistic predictions with proper scoring rules and diagnostic decompositions.
- Backtesting framework (BettingBacktester): Simulates historical betting with configurable staking strategies, edge thresholds, and bankroll tracking.
- Walk-forward validators (WalkForwardValidator, PurgedKFoldCV): Generate temporally valid train/test splits that prevent data leakage in time series sports data.
- Calibration analysis (calibration_analysis, ModelRecalibrator): Assess and improve the calibration of probability forecasts through reliability diagrams and recalibration.
- Model comparison (diebold_mariano_test, cross_validated_model_comparison): Statistically rigorous comparison of competing models with proper accounting for autocorrelation.
- Ensemble (ensemble_predictions): Combine multiple models with performance-weighted averaging.
Decision Framework: Model Evaluation and Selection
START: You have trained one or more models and generated predictions.
1. Compute proper scoring rules.
- Brier score and log loss for each model.
- Brier score decomposition for diagnostic insight.
- Brier Skill Score relative to market-implied probabilities.
2. Assess calibration.
- Generate reliability diagrams.
- Compute ECE and MCE.
- If poorly calibrated, apply Platt scaling or isotonic recalibration.
3. Validate with walk-forward scheme.
- Use expanding or sliding window (NOT standard k-fold).
- Ensure purge gap >= longest feature lookback window.
- Report mean and standard deviation of fold-level Brier scores.
4. Compare models statistically.
- Compute AIC/BIC for complexity-penalized comparison.
- Run Diebold-Mariano test for pairwise significance.
- If no significant difference, prefer the simpler model.
5. Backtest with realistic betting conditions.
- Apply vig (typically -110 both sides = 4.76%).
- Use fractional Kelly staking.
- Report ROI, max drawdown, and Sharpe ratio.
6. Select final model (or ensemble).
- If multiple models are comparable, ensemble them.
- Validate ensemble on held-out test period.
- Document model choice and rationale.
END: Deploy the selected model for live prediction.
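To make this framework concrete, here is a hedged skeleton that strings the steps together. It calls only the helpers whose signatures appear in this chapter's listings (compute_brier_score, diebold_mariano_test, ensemble_predictions); the calibration, walk-forward, and backtesting steps are left as comments because their constructor arguments are defined elsewhere in the chapter, and the function name evaluation_workflow is purely illustrative.

import numpy as np

def evaluation_workflow(
    model_predictions: dict[str, np.ndarray],
    market_probs: np.ndarray,
    outcomes: np.ndarray,
) -> str:
    """Sketch of the decision framework above; returns the selected model name."""
    # Step 1: proper scoring rules, including skill relative to the market
    bs_market = compute_brier_score(market_probs, outcomes)
    scores = {}
    for name, preds in model_predictions.items():
        bs = compute_brier_score(preds, outcomes)
        scores[name] = {"brier": bs, "bss_vs_market": 1.0 - bs / bs_market}

    # Step 2: assess calibration (calibration_analysis / ModelRecalibrator)
    # Step 3: walk-forward validation (WalkForwardValidator or PurgedKFoldCV)

    # Step 4: pairwise significance; on a statistical tie, prefer the simpler model
    best = min(scores, key=lambda m: scores[m]["brier"])
    for name, preds in model_predictions.items():
        if name == best:
            continue
        dm = diebold_mariano_test(preds, model_predictions[best], outcomes)
        print(f"{name} vs {best}: {dm['conclusion']} (p={dm['p_value']:.3f})")

    # Step 5: backtest with realistic vig and fractional Kelly (BettingBacktester)

    # Step 6: if several models are comparable, ensemble them and re-validate
    # on a held-out test period before deployment
    ensemble_predictions(model_predictions, outcomes, method="inverse_brier")
    return best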
What's Next
With the tools developed across Chapters 28 through 30 --- feature engineering, neural network architectures, and rigorous evaluation --- you now possess a complete machine learning pipeline for sports prediction. In the chapters ahead, we will explore advanced topics including ensemble methods and model stacking (combining multiple model types for superior performance), reinforcement learning for dynamic bet sizing (adapting strategy as bankroll and market conditions evolve), and live deployment considerations (moving from backtested models to production systems that make real-time predictions on live data). The evaluation framework from this chapter will remain central: every new technique must prove its worth through proper scoring rules, walk-forward validation, and realistic backtesting before it earns a place in your production system.
This chapter is part of The Sports Betting Textbook, a comprehensive guide to quantitative sports betting. All code examples use Python 3.11+ and are available in the companion repository.