
Learning Objectives

  • Evaluate probabilistic predictions using proper scoring rules including Brier score, log loss, and their decompositions
  • Design and implement backtesting frameworks that simulate realistic betting scenarios with vigorish and transaction costs
  • Apply walk-forward validation techniques including expanding window, sliding window, and purged cross-validation
  • Construct and interpret calibration plots and reliability diagrams for assessing probability quality
  • Compare competing models using statistical tests and information criteria to make rigorous selection decisions

Chapter 30: Model Evaluation and Selection

"The question is not whether your model is wrong --- all models are wrong. The question is whether your model is useful, and that requires honest, rigorous evaluation." --- Adapted from George E. P. Box

Chapter Overview

You have engineered features (Chapter 28), trained models (Chapter 29), and generated predictions. But predictions are worthless until you can answer three critical questions: How good are these predictions? Are they good enough to bet on? And which of your models is best?

These questions are deceptively difficult. A model that predicts the correct winner 65% of the time sounds impressive, but if the closing line already implies 64% and you pay 4.76% vigorish, your 65% model loses money. A model with 0.20 Brier score might be excellent for NBA totals but mediocre for NFL game outcomes. And a model that appears to outperform your baseline on last season's data might simply have overfit --- the apparent advantage might vanish on genuinely new data.

Model evaluation for sports betting requires specialized techniques that account for the unique characteristics of our domain: outcomes are probabilistic (not deterministic), data is temporal (not i.i.d.), transaction costs are substantial (the vig), and the baseline we must beat is the market itself (not random chance). Standard machine learning evaluation metrics --- accuracy, precision, recall, F1 --- are inadequate. We need proper scoring rules that evaluate the full probability distribution, backtesting frameworks that simulate realistic betting conditions, validation schemes that respect temporal ordering, calibration analysis that assesses whether predicted probabilities are honest, and statistical tests that distinguish genuine model differences from noise.

This chapter provides the complete toolkit for rigorous model evaluation and selection. Every technique is implemented in Python with worked examples using realistic sports prediction scenarios.

In this chapter, you will learn to:

  • Evaluate probabilistic predictions with Brier score, log loss, and their decompositions into calibration, resolution, and uncertainty components
  • Build backtesting frameworks that simulate historical betting with realistic vig, bankroll constraints, and transaction costs
  • Implement walk-forward validation schemes that prevent temporal data leakage in sports prediction
  • Construct calibration plots and compute expected calibration error to assess probability quality
  • Compare models rigorously using the Diebold-Mariano test, information criteria, and cross-validated scoring rules


30.1 Brier Score and Log Loss for Probabilistic Predictions

Why Accuracy Is Insufficient

Consider two models predicting NBA game outcomes:

  • Model A predicts every game at 55% for the home team. Because it always picks the home side, its accuracy equals the home-team win rate of roughly 58% in the NBA.
  • Model B predicts probabilities ranging from 25% to 85%, producing well-calibrated probability estimates. It also achieves 58% accuracy.

By accuracy alone, these models are identical. But Model B is far more useful for betting: it identifies games where the true probability is far from 50% and can generate large expected value bets when the market disagrees. Model A provides no actionable information beyond the baseline home-court advantage.

The problem with accuracy is that it evaluates only the binary decision (did you pick the right side?) while discarding all information about confidence (how confident was your prediction?). For betting, confidence is everything --- it determines bet sizing and whether a wager has positive expected value.

Proper Scoring Rules

A proper scoring rule is a function $S(p, y)$ that evaluates a probability forecast $p$ against an outcome $y$ such that the expected score is optimized when the forecast equals the true probability. In other words, a proper scoring rule cannot be "gamed" by reporting a probability different from your true belief.

Formally, a scoring rule $S$ is strictly proper if for all true probability distributions $q$:

$$\mathbb{E}_q[S(q, Y)] < \mathbb{E}_q[S(p, Y)] \quad \forall p \neq q$$

This property ensures that the forecaster's optimal strategy is to report their honest probability estimate --- exactly what we want.

Brier Score

The Brier score is the mean squared error of probability forecasts:

$$\text{BS} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2$$

where $p_i$ is the predicted probability and $y_i \in \{0, 1\}$ is the binary outcome. The Brier score ranges from 0 (perfect predictions) to 1 (maximally wrong predictions). A forecast of 50% for every game yields a Brier score of 0.25 (since $(0.5 - y)^2 = 0.25$ regardless of the outcome).

The Brier score can be decomposed into three additive components that provide diagnostic insight:

$$\text{BS} = \text{Reliability} - \text{Resolution} + \text{Uncertainty}$$

where:

  • Reliability (lower is better): Measures how close predicted probabilities are to observed frequencies. A perfectly calibrated model has reliability = 0.

$$\text{Reliability} = \frac{1}{n} \sum_{k=1}^{K} n_k (\bar{p}_k - \bar{y}_k)^2$$

  • Resolution (higher is better): Measures how much predicted probabilities vary from the base rate. A model that always predicts the base rate has resolution = 0.

$$\text{Resolution} = \frac{1}{n} \sum_{k=1}^{K} n_k (\bar{y}_k - \bar{y})^2$$

  • Uncertainty (constant for a given dataset): Measures the inherent difficulty of the prediction problem. It equals $\bar{y}(1 - \bar{y})$, which is maximized at 0.25 when the base rate is 50%.

$$\text{Uncertainty} = \bar{y}(1 - \bar{y})$$

Here, the forecasts are binned into $K$ groups, $n_k$ is the number of forecasts in bin $k$, $\bar{p}_k$ is the average predicted probability in bin $k$, $\bar{y}_k$ is the average outcome in bin $k$, and $\bar{y}$ is the overall base rate.

Brier Skill Score

The Brier Skill Score (BSS) compares a model's Brier score to that of a reference forecast (typically the base rate):

$$\text{BSS} = 1 - \frac{\text{BS}_{\text{model}}}{\text{BS}_{\text{reference}}}$$

A BSS of 0 means the model is no better than the reference. A BSS of 1 means perfect predictions. Negative BSS means the model is worse than the reference.

For sports betting, a useful reference forecast is the market-implied probability derived from closing lines. A positive BSS relative to the market implies your model contains information beyond what the market has already priced in --- a necessary (but not sufficient) condition for profitable betting.
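
To construct that market reference in practice, the quoted odds must first be converted to probabilities with the vig removed. The short sketch below uses proportional normalization for a two-way market; the function name and the choice of devigging method are illustrative assumptions rather than a prescribed standard.

def no_vig_probability(decimal_odds_side: float, decimal_odds_other: float) -> float:
    """Implied probability of one side of a two-way market, vig removed proportionally."""
    raw_side = 1.0 / decimal_odds_side
    raw_other = 1.0 / decimal_odds_other
    overround = raw_side + raw_other  # Sums to more than 1 because of the vig
    return raw_side / overround


# A -110/-110 market quoted as 1.909 decimal on both sides devigs to 50/50
print(no_vig_probability(1.909, 1.909))  # ~0.5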

Log Loss

Log loss (also called logarithmic loss or cross-entropy loss) is another proper scoring rule that penalizes confident wrong predictions more severely than the Brier score:

$$\text{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

Log loss is unbounded: a prediction of $p = 0$ for an outcome that occurs receives infinite penalty. In practice, predictions are clipped to a range like $[0.001, 0.999]$ to avoid this.

The key difference between Brier score and log loss is how they weight errors at the extremes. A prediction of 0.99 for an event that does not occur receives a Brier penalty of $(0.99)^2 = 0.98$ but a log loss penalty of $-\log(0.01) = 4.61$. Log loss is therefore more sensitive to overconfident predictions --- which is highly relevant in sports betting, where overconfidence leads to large losses on individual bets.

Log Loss Decomposition

Log loss can also be decomposed, though less cleanly than the Brier score. The calibration component of log loss measures the KL divergence between predicted and observed probabilities within each bin:

$$\text{CalibrationLoss} = \sum_{k=1}^{K} \frac{n_k}{n} D_{\text{KL}}(\bar{y}_k \| \bar{p}_k)$$

where $D_{\text{KL}}(q \| p) = q \log(q/p) + (1-q) \log((1-q)/(1-p))$ is the Kullback-Leibler divergence.

import numpy as np
import pandas as pd
from typing import Optional
import matplotlib.pyplot as plt


def compute_brier_score(
    predictions: np.ndarray,
    outcomes: np.ndarray,
) -> float:
    """
    Compute the Brier score for probabilistic predictions.

    The Brier score is the mean squared error between predicted
    probabilities and binary outcomes. Lower is better (0 = perfect,
    0.25 = uninformative baseline at 50%).

    Args:
        predictions: Array of predicted probabilities in [0, 1].
        outcomes: Array of binary outcomes in {0, 1}.

    Returns:
        The Brier score.

    Example:
        >>> preds = np.array([0.7, 0.3, 0.8, 0.5])
        >>> actual = np.array([1, 0, 1, 0])
        >>> compute_brier_score(preds, actual)
        0.1175
    """
    return float(np.mean((predictions - outcomes) ** 2))


def compute_log_loss(
    predictions: np.ndarray,
    outcomes: np.ndarray,
    epsilon: float = 1e-15,
) -> float:
    """
    Compute the log loss (binary cross-entropy) for probabilistic predictions.

    Clips predictions to [epsilon, 1 - epsilon] to avoid log(0).
    More sensitive to overconfident wrong predictions than Brier score.

    Args:
        predictions: Array of predicted probabilities in [0, 1].
        outcomes: Array of binary outcomes in {0, 1}.
        epsilon: Clipping threshold to prevent log(0).

    Returns:
        The log loss.
    """
    p = np.clip(predictions, epsilon, 1 - epsilon)
    return float(-np.mean(
        outcomes * np.log(p) + (1 - outcomes) * np.log(1 - p)
    ))


def brier_score_decomposition(
    predictions: np.ndarray,
    outcomes: np.ndarray,
    n_bins: int = 10,
) -> dict[str, float]:
    """
    Decompose the Brier score into reliability, resolution, and uncertainty.

    This decomposition provides diagnostic insight:
    - Reliability (lower is better): Are predicted probabilities well-calibrated?
    - Resolution (higher is better): Do predictions vary meaningfully?
    - Uncertainty (constant): How inherently difficult is the problem?

    The relationship is: BS = Reliability - Resolution + Uncertainty.

    Args:
        predictions: Array of predicted probabilities.
        outcomes: Array of binary outcomes.
        n_bins: Number of bins for grouping predictions.

    Returns:
        Dictionary with keys: 'brier_score', 'reliability', 'resolution',
        'uncertainty', and per-bin statistics.
    """
    n = len(predictions)
    base_rate = outcomes.mean()
    uncertainty = base_rate * (1 - base_rate)

    # Bin predictions
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_indices = np.digitize(predictions, bin_edges[1:-1])

    reliability = 0.0
    resolution = 0.0
    bin_stats = []

    for k in range(n_bins):
        mask = bin_indices == k
        n_k = mask.sum()

        if n_k == 0:
            continue

        avg_pred = predictions[mask].mean()
        avg_outcome = outcomes[mask].mean()

        reliability += n_k * (avg_pred - avg_outcome) ** 2
        resolution += n_k * (avg_outcome - base_rate) ** 2

        bin_stats.append({
            "bin": k,
            "n": int(n_k),
            "avg_pred": float(avg_pred),
            "avg_outcome": float(avg_outcome),
            "reliability_contrib": float(n_k * (avg_pred - avg_outcome) ** 2 / n),
        })

    reliability /= n
    resolution /= n

    brier_score = compute_brier_score(predictions, outcomes)

    return {
        "brier_score": brier_score,
        "reliability": float(reliability),
        "resolution": float(resolution),
        "uncertainty": float(uncertainty),
        "decomposition_check": float(reliability - resolution + uncertainty),
        "bin_stats": bin_stats,
    }


def brier_skill_score(
    predictions: np.ndarray,
    outcomes: np.ndarray,
    reference_predictions: Optional[np.ndarray] = None,
) -> float:
    """
    Compute the Brier Skill Score relative to a reference forecast.

    BSS = 1 - BS_model / BS_reference.

    If no reference is provided, uses the base rate (climatological
    forecast) as the reference. For sports betting, passing
    market-implied probabilities as the reference answers the question:
    "Does my model add value beyond the market?"

    Args:
        predictions: Model's predicted probabilities.
        outcomes: Binary outcomes.
        reference_predictions: Reference forecast probabilities. If None,
            uses the base rate (constant prediction).

    Returns:
        The Brier Skill Score. Positive = better than reference.
    """
    bs_model = compute_brier_score(predictions, outcomes)

    if reference_predictions is None:
        # Use base rate as reference
        base_rate = outcomes.mean()
        bs_reference = base_rate * (1 - base_rate)
    else:
        bs_reference = compute_brier_score(reference_predictions, outcomes)

    if bs_reference == 0:
        return 0.0

    return float(1 - bs_model / bs_reference)
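
The functions above implement the Brier decomposition; the KL-based calibration component of log loss defined earlier has no counterpart in the chapter's main toolkit, so here is a minimal sketch using the same uniform-binning convention. The function name is illustrative.

def log_loss_calibration_component(
    predictions: np.ndarray,
    outcomes: np.ndarray,
    n_bins: int = 10,
    epsilon: float = 1e-6,
) -> float:
    """Binned KL-divergence calibration term of log loss (lower is better)."""
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_indices = np.digitize(predictions, bin_edges[1:-1])
    n = len(predictions)
    calibration_loss = 0.0

    for k in range(n_bins):
        mask = bin_indices == k
        n_k = mask.sum()
        if n_k == 0:
            continue
        p_bar = np.clip(predictions[mask].mean(), epsilon, 1 - epsilon)
        y_bar = np.clip(outcomes[mask].mean(), epsilon, 1 - epsilon)
        # D_KL(y_bar || p_bar) for Bernoulli distributions
        kl = (
            y_bar * np.log(y_bar / p_bar)
            + (1 - y_bar) * np.log((1 - y_bar) / (1 - p_bar))
        )
        calibration_loss += (n_k / n) * kl

    return float(calibration_loss)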

Worked Example: Comparing Two NBA Prediction Models

def scoring_rules_example():
    """
    Compare two NBA game prediction models using proper scoring rules.

    Model A: A simple model based on season win percentage.
    Model B: A neural network with engineered features.
    Reference: Market-implied probabilities from closing lines.

    This example demonstrates how proper scoring rules reveal
    differences in model quality that raw accuracy cannot detect.
    """
    np.random.seed(42)
    n_games = 500

    # Simulate game outcomes (home team wins ~58% of the time)
    true_probs = np.random.beta(3.5, 2.5, n_games)
    outcomes = (np.random.random(n_games) < true_probs).astype(float)

    # Model A: Simple model (predicts based on win pct, less varied)
    model_a_preds = np.clip(
        true_probs + np.random.normal(0, 0.15, n_games), 0.3, 0.7,
    )

    # Model B: Better model (more accurate, wider range)
    model_b_preds = np.clip(
        true_probs + np.random.normal(0, 0.08, n_games), 0.1, 0.9,
    )

    # Market: Very accurate (efficient market hypothesis)
    market_preds = np.clip(
        true_probs + np.random.normal(0, 0.05, n_games), 0.15, 0.85,
    )

    # Compute accuracy (binary threshold at 0.5)
    acc_a = ((model_a_preds > 0.5) == outcomes).mean()
    acc_b = ((model_b_preds > 0.5) == outcomes).mean()
    acc_market = ((market_preds > 0.5) == outcomes).mean()

    print("=== Accuracy (Binary) ===")
    print(f"  Model A:  {acc_a:.3f}")
    print(f"  Model B:  {acc_b:.3f}")
    print(f"  Market:   {acc_market:.3f}")

    # Compute Brier scores
    bs_a = compute_brier_score(model_a_preds, outcomes)
    bs_b = compute_brier_score(model_b_preds, outcomes)
    bs_market = compute_brier_score(market_preds, outcomes)

    print("\n=== Brier Score (lower is better) ===")
    print(f"  Model A:  {bs_a:.4f}")
    print(f"  Model B:  {bs_b:.4f}")
    print(f"  Market:   {bs_market:.4f}")

    # Compute log loss
    ll_a = compute_log_loss(model_a_preds, outcomes)
    ll_b = compute_log_loss(model_b_preds, outcomes)
    ll_market = compute_log_loss(market_preds, outcomes)

    print("\n=== Log Loss (lower is better) ===")
    print(f"  Model A:  {ll_a:.4f}")
    print(f"  Model B:  {ll_b:.4f}")
    print(f"  Market:   {ll_market:.4f}")

    # Brier Skill Score vs. market
    bss_a = brier_skill_score(model_a_preds, outcomes, market_preds)
    bss_b = brier_skill_score(model_b_preds, outcomes, market_preds)

    print("\n=== Brier Skill Score vs. Market ===")
    print(f"  Model A:  {bss_a:.4f}")
    print(f"  Model B:  {bss_b:.4f}")

    # Decomposition for Model B
    decomp = brier_score_decomposition(model_b_preds, outcomes)
    print("\n=== Brier Score Decomposition (Model B) ===")
    print(f"  Reliability:  {decomp['reliability']:.4f}")
    print(f"  Resolution:   {decomp['resolution']:.4f}")
    print(f"  Uncertainty:  {decomp['uncertainty']:.4f}")
    print(f"  Check (REL-RES+UNC): {decomp['decomposition_check']:.4f}")

Intuition: The Brier score decomposition tells you exactly what is wrong with your model. High reliability means your model is poorly calibrated --- when it says 70%, the outcome occurs more or less often than 70% of the time. Low resolution means your model does not discriminate well --- its predictions are clustered near the base rate and do not distinguish between likely and unlikely outcomes. A good betting model needs both: well-calibrated probabilities (reliability near zero) AND strong discrimination (high resolution).


30.2 Backtesting Strategies for Betting Models

Why Backtesting Matters

A model with excellent Brier score and calibration is not automatically a profitable betting model. Profitability depends on:

  1. The market's efficiency: If the market-implied probabilities are already well-calibrated, your model needs to be better than the market to generate edge.
  2. The vigorish: You must overcome the sportsbook's margin on every bet. At standard -110/-110 juice, you pay approximately 4.76% in transaction costs.
  3. Bet selection: You should only bet when your model's probability diverges sufficiently from the market price. Betting on every game dilutes edge with vig.
  4. Bankroll management: Even positive-EV strategies can go bankrupt with improper sizing.

Backtesting simulates your model's betting strategy on historical data, accounting for all these factors. It answers the critical question: If I had used this model and this strategy over the past N seasons, how much money would I have made or lost?
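
As a quick arithmetic check on the transaction-cost point above, the sketch below converts a standard -110/-110 market to decimal odds and computes the overround; the helper name is illustrative and separate from the backtester that follows.

def american_to_decimal(american: float) -> float:
    """Convert American odds to decimal odds."""
    if american < 0:
        return 1.0 + 100.0 / abs(american)
    return 1.0 + american / 100.0


decimal_odds = american_to_decimal(-110)        # ~1.909
implied_each_side = 1.0 / decimal_odds          # ~0.524
overround = 2 * implied_each_side - 1.0         # ~0.0476, the ~4.76% vig cited above
print(f"{overround:.4f}")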

Avoiding Look-Ahead Bias

The most dangerous error in backtesting is look-ahead bias --- using information that would not have been available at the time of each prediction. Common sources include:

  • Feature leakage: Using season-ending statistics to predict mid-season games.
  • Model training leakage: Training on data from games you later "predict" in the backtest.
  • Odds leakage: Using closing odds when the model would have bet at opening odds.
  • Selection bias: Choosing to backtest a strategy only after seeing that it would have worked.

A properly constructed backtest retrains the model at each time step using only data available up to that point, makes predictions before observing outcomes, and records bets at odds that were actually available.

Backtesting Framework

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BetRecord:
    """Record of a single simulated bet."""
    date: str
    game_id: str
    side: str
    model_prob: float
    market_prob: float
    odds: float  # Decimal odds
    stake: float
    outcome: int  # 1 = win, 0 = loss
    profit: float
    bankroll_after: float


@dataclass
class BacktestResult:
    """Complete results of a backtesting simulation."""
    bets: list[BetRecord] = field(default_factory=list)
    initial_bankroll: float = 0.0
    final_bankroll: float = 0.0
    total_bets: int = 0
    winning_bets: int = 0
    total_staked: float = 0.0
    total_profit: float = 0.0
    roi: float = 0.0
    max_drawdown: float = 0.0
    sharpe_ratio: float = 0.0

    def summary(self) -> str:
        """Generate a human-readable summary of backtest results."""
        win_rate = self.winning_bets / self.total_bets if self.total_bets > 0 else 0
        return (
            f"Backtest Summary\n"
            f"{'=' * 40}\n"
            f"Total bets:      {self.total_bets}\n"
            f"Win rate:         {win_rate:.1%}\n"
            f"Total staked:    ${self.total_staked:,.2f}\n"
            f"Total profit:    ${self.total_profit:,.2f}\n"
            f"ROI:              {self.roi:.2%}\n"
            f"Initial bankroll: ${self.initial_bankroll:,.2f}\n"
            f"Final bankroll:  ${self.final_bankroll:,.2f}\n"
            f"Max drawdown:     {self.max_drawdown:.2%}\n"
            f"Sharpe ratio:     {self.sharpe_ratio:.2f}\n"
        )


class BettingBacktester:
    """
    Backtesting framework for sports betting models.

    Simulates historical betting with realistic constraints including
    vigorish, bet sizing limits, and bankroll tracking. Supports
    configurable bet selection criteria and staking strategies.

    The framework enforces temporal integrity: at each time step, only
    data available before that time step can be used for predictions.

    Args:
        initial_bankroll: Starting bankroll in dollars.
        min_edge: Minimum required edge (model_prob - implied_prob)
            to trigger a bet.
        max_bet_fraction: Maximum fraction of bankroll to wager on
            a single bet.
        staking_method: 'flat' for fixed stakes, 'kelly' for Kelly
            criterion, 'fractional_kelly' for a fraction of Kelly.
        kelly_fraction: Fraction of Kelly stake to use (only relevant
            if staking_method is 'fractional_kelly').
    """

    def __init__(
        self,
        initial_bankroll: float = 10000.0,
        min_edge: float = 0.03,
        max_bet_fraction: float = 0.05,
        staking_method: str = "fractional_kelly",
        kelly_fraction: float = 0.25,
    ):
        self.initial_bankroll = initial_bankroll
        self.min_edge = min_edge
        self.max_bet_fraction = max_bet_fraction
        self.staking_method = staking_method
        self.kelly_fraction = kelly_fraction

    def _calculate_stake(
        self,
        bankroll: float,
        model_prob: float,
        decimal_odds: float,
    ) -> float:
        """
        Calculate the stake for a bet using the configured staking method.

        Args:
            bankroll: Current bankroll.
            model_prob: Model's estimated win probability.
            decimal_odds: Decimal odds offered.

        Returns:
            Dollar amount to stake.
        """
        max_stake = bankroll * self.max_bet_fraction

        if self.staking_method == "flat":
            return min(100.0, max_stake)

        elif self.staking_method in ("kelly", "fractional_kelly"):
            # Kelly criterion: f* = (bp - q) / b
            # where b = decimal_odds - 1, p = model_prob, q = 1 - p
            b = decimal_odds - 1
            p = model_prob
            q = 1 - p

            if b <= 0:
                return 0.0

            kelly_fraction_optimal = (b * p - q) / b

            if kelly_fraction_optimal <= 0:
                return 0.0

            if self.staking_method == "fractional_kelly":
                kelly_fraction_optimal *= self.kelly_fraction

            stake = bankroll * kelly_fraction_optimal
            return min(stake, max_stake)

        else:
            raise ValueError(f"Unknown staking method: {self.staking_method}")

    def _implied_probability_from_odds(self, decimal_odds: float) -> float:
        """Convert decimal odds to implied probability."""
        return 1.0 / decimal_odds

    def run(
        self,
        games: pd.DataFrame,
        model_prob_col: str = "model_prob",
        market_odds_col: str = "decimal_odds",
        outcome_col: str = "outcome",
        date_col: str = "date",
        game_id_col: str = "game_id",
        side_col: str = "side",
    ) -> BacktestResult:
        """
        Run the backtest on historical game data.

        Processes games in chronological order, deciding whether to bet
        on each game based on the edge criteria, calculating stakes,
        and tracking bankroll evolution.

        Args:
            games: DataFrame with one row per betting opportunity, sorted
                by date. Must contain model probabilities, market odds,
                and outcomes.
            model_prob_col: Column with model's predicted probability.
            market_odds_col: Column with decimal odds available.
            outcome_col: Column with binary outcome (1 = win, 0 = loss).
            date_col: Column with game date.
            game_id_col: Column with game identifier.
            side_col: Column indicating which side is being considered.

        Returns:
            BacktestResult with complete simulation results.
        """
        games = games.sort_values(date_col).copy()
        bankroll = self.initial_bankroll
        bets = []
        peak_bankroll = bankroll
        max_drawdown = 0.0

        for _, game in games.iterrows():
            model_prob = game[model_prob_col]
            decimal_odds = game[market_odds_col]
            implied_prob = self._implied_probability_from_odds(decimal_odds)
            edge = model_prob - implied_prob

            # Only bet if edge exceeds minimum threshold
            if edge < self.min_edge:
                continue

            # Calculate stake
            stake = self._calculate_stake(bankroll, model_prob, decimal_odds)
            if stake <= 0 or stake > bankroll:
                continue

            # Resolve bet
            outcome = int(game[outcome_col])
            if outcome == 1:
                profit = stake * (decimal_odds - 1)
            else:
                profit = -stake

            bankroll += profit

            # Track drawdown
            peak_bankroll = max(peak_bankroll, bankroll)
            drawdown = (peak_bankroll - bankroll) / peak_bankroll
            max_drawdown = max(max_drawdown, drawdown)

            bets.append(BetRecord(
                date=str(game[date_col]),
                game_id=str(game[game_id_col]),
                side=str(game.get(side_col, "")),
                model_prob=float(model_prob),
                market_prob=float(implied_prob),
                odds=float(decimal_odds),
                stake=float(stake),
                outcome=outcome,
                profit=float(profit),
                bankroll_after=float(bankroll),
            ))

        # Compute summary statistics
        result = BacktestResult(
            bets=bets,
            initial_bankroll=self.initial_bankroll,
            final_bankroll=bankroll,
            total_bets=len(bets),
            winning_bets=sum(1 for b in bets if b.outcome == 1),
            total_staked=sum(b.stake for b in bets),
            total_profit=bankroll - self.initial_bankroll,
            max_drawdown=max_drawdown,
        )

        if result.total_staked > 0:
            result.roi = result.total_profit / result.total_staked

        # Compute a Sharpe-like ratio from per-bet returns, annualized
        # assuming roughly 252 betting periods per year
        if len(bets) > 1:
            bet_returns = pd.Series([b.profit / b.stake for b in bets])
            if bet_returns.std() > 0:
                result.sharpe_ratio = (
                    bet_returns.mean() / bet_returns.std()
                    * np.sqrt(252)  # Annualization factor
                )

        return result


def plot_backtest_results(result: BacktestResult) -> None:
    """
    Visualize backtest results with bankroll trajectory and drawdown.

    Creates a two-panel plot showing:
    1. Bankroll over time with bets colored by outcome.
    2. Running drawdown from peak bankroll.
    """
    if not result.bets:
        print("No bets to plot.")
        return

    dates = [b.date for b in result.bets]
    bankrolls = [result.initial_bankroll] + [b.bankroll_after for b in result.bets]

    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 8), sharex=True)

    # Bankroll trajectory
    ax1.plot(range(len(bankrolls)), bankrolls, "b-", linewidth=1)
    ax1.axhline(
        y=result.initial_bankroll, color="gray", linestyle="--",
        alpha=0.5, label="Starting bankroll",
    )
    ax1.set_ylabel("Bankroll ($)")
    ax1.set_title("Backtest Results: Bankroll Trajectory")
    ax1.legend()

    # Drawdown
    peak = np.maximum.accumulate(bankrolls)
    drawdown = (peak - bankrolls) / peak
    ax2.fill_between(range(len(drawdown)), drawdown, alpha=0.3, color="red")
    ax2.plot(range(len(drawdown)), drawdown, "r-", linewidth=1)
    ax2.set_ylabel("Drawdown")
    ax2.set_xlabel("Bet Number")
    ax2.set_title(f"Drawdown (Max: {result.max_drawdown:.1%})")

    plt.tight_layout()
    plt.show()

Common Pitfall: The most subtle form of look-ahead bias in betting backtests is model selection bias. If you try 50 different model configurations and report the backtest results of the best one, you are implicitly using future information (the knowledge that this configuration would have performed best). The solution is to split your historical data into three periods: training, validation (for model selection), and test (for final backtest evaluation). The test period should be touched only once, after all model decisions are finalized.
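
One simple way to operationalize this three-period discipline is a chronological split by date. The sketch below is illustrative: the cutoff arguments are placeholders, and the date column name follows the backtester's convention.

def chronological_three_way_split(
    games: pd.DataFrame,
    validation_start: str,
    test_start: str,
    date_col: str = "date",
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Split games into training, validation (model selection), and held-out test periods."""
    games = games.sort_values(date_col)
    train = games[games[date_col] < validation_start]
    validation = games[
        (games[date_col] >= validation_start) & (games[date_col] < test_start)
    ]
    # The test period is touched only once, after all model decisions are final
    test = games[games[date_col] >= test_start]
    return train, validation, test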


30.3 Walk-Forward Validation

Why Standard Cross-Validation Fails for Sports Data

Standard $k$-fold cross-validation randomly shuffles observations into folds and trains on $k-1$ folds to predict the remaining fold. This is valid when observations are independent and identically distributed (i.i.d.). Sports data violates this assumption in two fundamental ways:

  1. Temporal dependence: A team's performance in game $t$ is correlated with its performance in games $t-1$, $t-2$, etc. Random shuffling can place game $t$ in the training set and game $t-1$ in the test set, allowing the model to learn from future information.

  2. Non-stationarity: The data-generating process changes over time. Teams improve, decline, change personnel, and adjust strategies. A model trained on 2020 data may be poorly suited for 2024 data.

If you apply standard cross-validation to sports data, you will get optimistically biased performance estimates. The model will appear to perform better than it actually would in production, because it has access to future information during training.

Walk-Forward Validation Schemes

Walk-forward validation respects temporal ordering by always training on past data and evaluating on future data. There are several variants:

Expanding Window

In expanding window validation, the training set grows over time. At each step, the model is trained on all data from the beginning through time $t$, and evaluated on data from time $t+1$ through $t+h$:

  • Step 1: Train on months 1--12, evaluate on month 13
  • Step 2: Train on months 1--13, evaluate on month 14
  • Step 3: Train on months 1--14, evaluate on month 15

This approach maximizes the training set size at each step but assumes that older data remains relevant.

Sliding Window

In sliding window validation, the training set has a fixed size $w$. At each step, the model is trained on the most recent $w$ observations and evaluated on the next $h$ observations:

  • Step 1: Train on months 1--12, evaluate on month 13
  • Step 2: Train on months 2--13, evaluate on month 14
  • Step 3: Train on months 3--14, evaluate on month 15

This approach discards old data that may no longer be relevant, but uses a smaller training set.

Purged Cross-Validation

Purged cross-validation adapts standard cross-validation for time series by introducing a gap between training and test folds. This gap ensures that the training set does not contain observations that are temporally close to (and therefore correlated with) test observations.

For sports data, the purge gap should be at least as wide as the longest feature lookback window. If your features include 20-game rolling averages, the purge gap should be at least 20 games, ensuring that no training observation's features overlap with any test observation's features.


class WalkForwardValidator:
    """
    Walk-forward validation for time series sports data.

    Implements both expanding window and sliding window validation
    schemes with configurable window sizes and step lengths.

    Args:
        method: 'expanding' or 'sliding'.
        initial_train_size: Number of observations in the initial
            training set.
        test_size: Number of observations in each test fold.
        step_size: Number of observations to advance between folds.
        max_train_size: Maximum training set size (only for 'sliding').
        gap: Number of observations to skip between train and test
            (purge gap to prevent leakage from features with lookback).
    """

    def __init__(
        self,
        method: str = "expanding",
        initial_train_size: int = 500,
        test_size: int = 100,
        step_size: int = 50,
        max_train_size: Optional[int] = None,
        gap: int = 0,
    ):
        self.method = method
        self.initial_train_size = initial_train_size
        self.test_size = test_size
        self.step_size = step_size
        self.max_train_size = max_train_size
        self.gap = gap

    def split(self, n_samples: int) -> list[tuple[np.ndarray, np.ndarray]]:
        """
        Generate train/test index splits for walk-forward validation.

        Args:
            n_samples: Total number of observations.

        Returns:
            List of (train_indices, test_indices) tuples.
        """
        splits = []
        test_start = self.initial_train_size + self.gap

        while test_start + self.test_size <= n_samples:
            # Define test indices
            test_indices = np.arange(test_start, test_start + self.test_size)

            # Define train indices
            train_end = test_start - self.gap

            if self.method == "expanding":
                train_start = 0
            elif self.method == "sliding":
                if self.max_train_size is not None:
                    train_start = max(0, train_end - self.max_train_size)
                else:
                    train_start = max(
                        0, train_end - self.initial_train_size
                    )
            else:
                raise ValueError(f"Unknown method: {self.method}")

            train_indices = np.arange(train_start, train_end)

            if len(train_indices) > 0 and len(test_indices) > 0:
                splits.append((train_indices, test_indices))

            test_start += self.step_size

        return splits

    def get_n_splits(self, n_samples: int) -> int:
        """Return the number of splits."""
        return len(self.split(n_samples))


class PurgedKFoldCV:
    """
    Purged K-Fold cross-validation for time series data.

    Standard K-Fold with two modifications:
    1. Folds are temporally ordered (not shuffled).
    2. A purge gap is applied: observations within 'gap' steps of the
       test fold boundary are removed from training to prevent
       information leakage through overlapping feature windows.

    Args:
        n_splits: Number of folds.
        gap: Number of observations to purge from each side of the
            test fold boundary.
    """

    def __init__(self, n_splits: int = 5, gap: int = 20):
        self.n_splits = n_splits
        self.gap = gap

    def split(self, n_samples: int) -> list[tuple[np.ndarray, np.ndarray]]:
        """
        Generate purged K-Fold train/test splits.

        Returns:
            List of (train_indices, test_indices) tuples.
        """
        fold_size = n_samples // self.n_splits
        splits = []

        for i in range(self.n_splits):
            test_start = i * fold_size
            test_end = min((i + 1) * fold_size, n_samples)

            test_indices = np.arange(test_start, test_end)

            # Training indices: everything outside test + gap
            purge_start = max(0, test_start - self.gap)
            purge_end = min(n_samples, test_end + self.gap)

            train_indices = np.concatenate([
                np.arange(0, purge_start),
                np.arange(purge_end, n_samples),
            ])

            if len(train_indices) > 0 and len(test_indices) > 0:
                splits.append((train_indices, test_indices))

        return splits


def walk_forward_evaluate(
    features: pd.DataFrame,
    targets: pd.Series,
    model_factory: Callable,
    validator: WalkForwardValidator,
    scoring_fn: Callable = compute_brier_score,
) -> dict:
    """
    Run walk-forward evaluation with model retraining at each step.

    At each validation step:
    1. Trains a new model on the training fold.
    2. Generates predictions on the test fold.
    3. Computes the scoring metric.

    This provides an honest estimate of how the model would perform
    in production, where it is periodically retrained on new data.

    Args:
        features: Full feature DataFrame.
        targets: Full target Series.
        model_factory: Callable that returns a fresh, untrained model.
            Must have .fit(X, y) and .predict_proba(X) methods.
        validator: Walk-forward validation splitter.
        scoring_fn: Function(predictions, outcomes) -> float.

    Returns:
        Dictionary with per-fold scores and aggregate statistics.
    """
    n_samples = len(features)
    splits = validator.split(n_samples)

    fold_scores = []
    all_predictions = []
    all_targets = []
    all_fold_indices = []

    for fold_idx, (train_idx, test_idx) in enumerate(splits):
        # Split data
        X_train = features.iloc[train_idx]
        y_train = targets.iloc[train_idx]
        X_test = features.iloc[test_idx]
        y_test = targets.iloc[test_idx]

        # Train model
        model = model_factory()
        model.fit(X_train, y_train)

        # Predict
        if hasattr(model, "predict_proba"):
            preds = model.predict_proba(X_test)
            if preds.ndim == 2:
                preds = preds[:, 1]  # Probability of positive class
        else:
            preds = model.predict(X_test)

        # Score
        score = scoring_fn(preds, y_test.values)
        fold_scores.append(score)

        all_predictions.extend(preds)
        all_targets.extend(y_test.values)
        all_fold_indices.extend([fold_idx] * len(test_idx))

        if (fold_idx + 1) % 5 == 0:
            print(
                f"Fold {fold_idx + 1}/{len(splits)} | "
                f"Train: {len(train_idx)} | Test: {len(test_idx)} | "
                f"Score: {score:.4f}"
            )

    # Aggregate results
    overall_score = scoring_fn(
        np.array(all_predictions), np.array(all_targets),
    )

    results = {
        "fold_scores": fold_scores,
        "mean_score": float(np.mean(fold_scores)),
        "std_score": float(np.std(fold_scores)),
        "overall_score": overall_score,
        "n_folds": len(splits),
        "total_predictions": len(all_predictions),
        "all_predictions": np.array(all_predictions),
        "all_targets": np.array(all_targets),
    }

    print(f"\nWalk-Forward Results ({validator.method}):")
    print(f"  Folds:          {len(splits)}")
    print(f"  Mean score:     {results['mean_score']:.4f} +/- {results['std_score']:.4f}")
    print(f"  Overall score:  {results['overall_score']:.4f}")

    return results
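
A minimal usage sketch follows, assuming scikit-learn is available; the synthetic features and the logistic-regression factory are placeholders for your own feature set and model.

from sklearn.linear_model import LogisticRegression

# Synthetic data purely for illustration: one informative feature drives the outcome
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(2000, 5)), columns=[f"f{i}" for i in range(5)])
probs = 1 / (1 + np.exp(-X["f0"].to_numpy()))
y = pd.Series((rng.random(2000) < probs).astype(int))

validator = WalkForwardValidator(
    method="expanding", initial_train_size=500, test_size=100, step_size=100, gap=20,
)
results = walk_forward_evaluate(
    features=X,
    targets=y,
    model_factory=lambda: LogisticRegression(max_iter=1000),
    validator=validator,
    scoring_fn=compute_brier_score,
)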

Intuition: Walk-forward validation answers the question that matters for betting: "If I had actually used this model in real-time, retraining periodically as new data arrived, how would it have performed?" This is fundamentally different from the question answered by standard cross-validation, which is: "How well does this model fit a randomly held-out subset of data?" The first question is about production performance; the second is about in-sample fit. For betting, only the first question matters.


30.4 Calibration Plots and Reliability Diagrams

What Calibration Means

A model is well-calibrated if its predicted probabilities match observed frequencies. When the model predicts a 70% probability, the event should occur approximately 70% of the time across all such predictions. Formally:

$$\mathbb{P}(Y = 1 \mid p(X) = p) = p \quad \forall p \in [0, 1]$$

Calibration is essential for betting because the decision to bet depends on comparing the model's probability to the market's implied probability. If your model says 65% but is systematically overconfident (events it calls 65% actually occur 58% of the time), you will make losing bets that appear profitable.

Reliability Diagrams

A reliability diagram (also called a calibration plot) visualizes calibration by plotting predicted probabilities against observed frequencies. The procedure is:

  1. Sort all predictions by predicted probability.
  2. Divide into $K$ bins (e.g., [0.0--0.1), [0.1--0.2), ..., [0.9--1.0]).
  3. For each bin, compute the average predicted probability and the average observed outcome.
  4. Plot average predicted probability (x-axis) against average observed frequency (y-axis).

A perfectly calibrated model produces points along the diagonal $y = x$. Points above the diagonal indicate underconfidence (events occur more often than predicted); points below indicate overconfidence (events occur less often than predicted).

Expected Calibration Error (ECE)

The Expected Calibration Error quantifies the deviation from perfect calibration as a single number:

$$\text{ECE} = \sum_{k=1}^{K} \frac{n_k}{n} |\bar{y}_k - \bar{p}_k|$$

where the sum is over bins, $n_k$ is the number of predictions in bin $k$, $\bar{p}_k$ is the average predicted probability in that bin, and $\bar{y}_k$ is the average observed outcome. ECE weights each bin by its size, giving more influence to bins with more predictions.

A stricter variant, the Maximum Calibration Error (MCE), takes the worst-case bin:

$$\text{MCE} = \max_{k} |\bar{y}_k - \bar{p}_k|$$

def calibration_analysis(
    predictions: np.ndarray,
    outcomes: np.ndarray,
    n_bins: int = 10,
    strategy: str = "uniform",
    plot: bool = True,
    model_name: str = "Model",
) -> dict:
    """
    Comprehensive calibration analysis with reliability diagram.

    Bins predictions, computes calibration metrics, and optionally
    generates a reliability diagram with confidence intervals.

    Args:
        predictions: Array of predicted probabilities in [0, 1].
        outcomes: Array of binary outcomes in {0, 1}.
        n_bins: Number of calibration bins.
        strategy: 'uniform' for equal-width bins, 'quantile' for
            equal-frequency bins.
        plot: Whether to generate the reliability diagram.
        model_name: Label for the plot legend.

    Returns:
        Dictionary containing:
            'ece': Expected calibration error.
            'mce': Maximum calibration error.
            'bin_data': Per-bin statistics.
            'overall_bias': Mean prediction minus mean outcome.
    """
    if strategy == "uniform":
        bin_edges = np.linspace(0, 1, n_bins + 1)
    elif strategy == "quantile":
        bin_edges = np.percentile(
            predictions, np.linspace(0, 100, n_bins + 1),
        )
        bin_edges[0] = 0.0
        bin_edges[-1] = 1.0
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

    bin_indices = np.digitize(predictions, bin_edges[1:-1])

    bin_data = []
    ece = 0.0
    mce = 0.0
    n = len(predictions)

    for k in range(n_bins):
        mask = bin_indices == k
        n_k = mask.sum()

        if n_k == 0:
            continue

        avg_pred = predictions[mask].mean()
        avg_outcome = outcomes[mask].mean()
        abs_error = abs(avg_outcome - avg_pred)

        ece += (n_k / n) * abs_error
        mce = max(mce, abs_error)

        # Confidence interval for observed frequency (Wilson score)
        z = 1.96  # 95% CI
        p_hat = avg_outcome
        denominator = 1 + z ** 2 / n_k
        center = (p_hat + z ** 2 / (2 * n_k)) / denominator
        half_width = z * np.sqrt(
            (p_hat * (1 - p_hat) + z ** 2 / (4 * n_k)) / n_k
        ) / denominator
        ci_lower = max(0, center - half_width)
        ci_upper = min(1, center + half_width)

        bin_data.append({
            "bin_center": float((bin_edges[k] + bin_edges[k + 1]) / 2),
            "avg_pred": float(avg_pred),
            "avg_outcome": float(avg_outcome),
            "n": int(n_k),
            "abs_error": float(abs_error),
            "ci_lower": float(ci_lower),
            "ci_upper": float(ci_upper),
        })

    if plot and bin_data:
        _plot_reliability_diagram(bin_data, ece, mce, model_name)

    return {
        "ece": float(ece),
        "mce": float(mce),
        "bin_data": bin_data,
        "overall_bias": float(predictions.mean() - outcomes.mean()),
    }


def _plot_reliability_diagram(
    bin_data: list[dict],
    ece: float,
    mce: float,
    model_name: str,
) -> None:
    """Plot a reliability diagram with confidence intervals."""
    fig, (ax1, ax2) = plt.subplots(
        1, 2, figsize=(14, 6),
        gridspec_kw={"width_ratios": [2, 1]},
    )

    avg_preds = [b["avg_pred"] for b in bin_data]
    avg_outcomes = [b["avg_outcome"] for b in bin_data]
    ci_lowers = [b["ci_lower"] for b in bin_data]
    ci_uppers = [b["ci_upper"] for b in bin_data]
    counts = [b["n"] for b in bin_data]

    # Reliability diagram
    ax1.plot([0, 1], [0, 1], "k--", alpha=0.5, label="Perfect calibration")
    ax1.errorbar(
        avg_preds, avg_outcomes,
        yerr=[
            [o - l for o, l in zip(avg_outcomes, ci_lowers)],
            [u - o for o, u in zip(avg_outcomes, ci_uppers)],
        ],
        fmt="o-", capsize=4, label=model_name,
    )
    ax1.fill_between([0, 1], [0, 1], alpha=0.05, color="gray")
    ax1.set_xlabel("Mean Predicted Probability")
    ax1.set_ylabel("Observed Frequency")
    ax1.set_title(
        f"Reliability Diagram\nECE = {ece:.4f}, MCE = {mce:.4f}"
    )
    ax1.legend()
    ax1.set_xlim(-0.05, 1.05)
    ax1.set_ylim(-0.05, 1.05)
    ax1.set_aspect("equal")

    # Histogram of predictions
    bin_centers = [b["bin_center"] for b in bin_data]
    ax2.bar(bin_centers, counts, width=0.08, alpha=0.7, color="steelblue")
    ax2.set_xlabel("Predicted Probability")
    ax2.set_ylabel("Count")
    ax2.set_title("Prediction Distribution")

    plt.tight_layout()
    plt.show()


def compare_calibration(
    models: dict[str, tuple[np.ndarray, np.ndarray]],
    n_bins: int = 10,
) -> pd.DataFrame:
    """
    Compare calibration metrics across multiple models.

    Args:
        models: Dictionary mapping model names to
            (predictions, outcomes) tuples.
        n_bins: Number of calibration bins.

    Returns:
        DataFrame with calibration metrics for each model.
    """
    results = []
    for name, (preds, outcomes) in models.items():
        cal = calibration_analysis(
            preds, outcomes, n_bins=n_bins, plot=False, model_name=name,
        )
        results.append({
            "model": name,
            "ece": cal["ece"],
            "mce": cal["mce"],
            "bias": cal["overall_bias"],
            "brier_score": compute_brier_score(preds, outcomes),
            "log_loss": compute_log_loss(preds, outcomes),
        })

    return pd.DataFrame(results).sort_values("brier_score")

Recalibration Techniques

If a model is well-discriminating but poorly calibrated, recalibration can improve probability estimates without retraining the model. Common techniques include:

  1. Platt scaling: Fits a logistic regression $p_{\text{calibrated}} = \sigma(a \cdot p_{\text{raw}} + b)$ on a held-out calibration set. Simple and effective for monotonically miscalibrated models.

  2. Isotonic regression: Fits a non-parametric, monotonically increasing function mapping raw probabilities to calibrated ones. More flexible than Platt scaling but requires more calibration data.

  3. Beta calibration: Fits a beta distribution-based mapping that can handle more complex miscalibration patterns than Platt scaling while being more constrained than isotonic regression.

from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression


class ModelRecalibrator:
    """
    Recalibrate model probabilities using Platt scaling or isotonic regression.

    Fits the recalibration mapping on a held-out calibration set,
    then applies it to transform raw model probabilities into
    better-calibrated estimates.

    Important: The calibration set must be separate from both the
    training set (to avoid double-counting) and the test set (to
    preserve test set integrity).

    Args:
        method: 'platt' for Platt scaling, 'isotonic' for isotonic
            regression.
    """

    def __init__(self, method: str = "isotonic"):
        self.method = method
        self.calibrator_ = None

    def fit(
        self,
        raw_probabilities: np.ndarray,
        outcomes: np.ndarray,
    ) -> "ModelRecalibrator":
        """
        Fit the recalibration mapping on calibration data.

        Args:
            raw_probabilities: Raw model output probabilities.
            outcomes: True binary outcomes.

        Returns:
            self
        """
        if self.method == "platt":
            self.calibrator_ = LogisticRegression(C=1e10, solver="lbfgs")
            self.calibrator_.fit(
                raw_probabilities.reshape(-1, 1), outcomes,
            )
        elif self.method == "isotonic":
            self.calibrator_ = IsotonicRegression(
                y_min=0.001, y_max=0.999, out_of_bounds="clip",
            )
            self.calibrator_.fit(raw_probabilities, outcomes)
        else:
            raise ValueError(f"Unknown method: {self.method}")

        return self

    def transform(self, raw_probabilities: np.ndarray) -> np.ndarray:
        """
        Apply the learned recalibration mapping to new probabilities.

        Args:
            raw_probabilities: Raw model output probabilities.

        Returns:
            Recalibrated probabilities.
        """
        if self.calibrator_ is None:
            raise RuntimeError("Calibrator not fitted. Call fit() first.")

        if self.method == "platt":
            return self.calibrator_.predict_proba(
                raw_probabilities.reshape(-1, 1)
            )[:, 1]
        elif self.method == "isotonic":
            return self.calibrator_.predict(raw_probabilities)
        else:
            raise ValueError(f"Unknown method: {self.method}")

Real-World Application: In practice, most sports prediction models are somewhat overconfident --- they assign extreme probabilities (near 0 or 1) more often than the data warrants. This is partly because models overfit to training data and partly because sports outcomes have substantial irreducible randomness. Isotonic recalibration on a held-out season of data typically reduces ECE by 30--60%, which translates directly to better betting decisions: you stop making large bets on events your model is falsely confident about.


30.5 Model Comparison and Selection Criteria

The Model Selection Problem

Given multiple candidate models --- say, a logistic regression, an XGBoost model, a feedforward neural network, and an LSTM --- how do you choose the best one? This question is more subtle than it appears. A model that achieves the best Brier score on one validation fold may not be the best on another. The difference between models may be within the range of statistical noise. And the "best" model on historical data may overfit in ways that do not generalize.

Rigorous model selection requires both information criteria (which penalize model complexity) and statistical tests (which determine whether observed performance differences are significant).

Information Criteria: AIC and BIC

Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) trade off model fit against complexity:

$$\text{AIC} = 2k - 2 \ln \hat{L}$$

$$\text{BIC} = k \ln n - 2 \ln \hat{L}$$

where $k$ is the number of model parameters, $n$ is the number of observations, and $\hat{L}$ is the maximized likelihood.

For probabilistic predictions where we use log loss as the negative log-likelihood:

$$-2 \ln \hat{L} = 2n \cdot \text{LogLoss}$$

So:

$$\text{AIC} = 2k + 2n \cdot \text{LogLoss}$$

$$\text{BIC} = k \ln n + 2n \cdot \text{LogLoss}$$

BIC penalizes model complexity more heavily than AIC whenever $\ln n > 2$ (i.e., $n \geq 8$), so it tends to select simpler models. For betting, where overfitting is a primary concern and prediction accuracy on new data matters more than capturing every nuance in historical data, BIC is generally the more appropriate criterion.

def compute_information_criteria(
    predictions: np.ndarray,
    outcomes: np.ndarray,
    n_parameters: int,
    epsilon: float = 1e-15,
) -> dict[str, float]:
    """
    Compute AIC and BIC for a probabilistic prediction model.

    Uses log loss as the negative log-likelihood per observation,
    then applies the standard AIC and BIC formulas.

    Args:
        predictions: Model's predicted probabilities.
        outcomes: True binary outcomes.
        n_parameters: Number of trainable parameters in the model.
        epsilon: Clipping value to prevent log(0).

    Returns:
        Dictionary with 'aic', 'aicc', 'bic', 'log_likelihood',
        'log_loss', 'n_obs', and 'n_parameters'.

    Example:
        >>> preds = np.array([0.7, 0.3, 0.8, 0.6])
        >>> outcomes = np.array([1, 0, 1, 0])
        >>> result = compute_information_criteria(preds, outcomes, n_parameters=10)
        >>> print(f"AIC: {result['aic']:.2f}")
    """
    n = len(predictions)
    p = np.clip(predictions, epsilon, 1 - epsilon)

    # Log-likelihood
    log_likelihood = np.sum(
        outcomes * np.log(p) + (1 - outcomes) * np.log(1 - p)
    )

    # Information criteria
    aic = 2 * n_parameters - 2 * log_likelihood
    bic = n_parameters * np.log(n) - 2 * log_likelihood

    # AIC corrected for small sample size (AICc)
    if n - n_parameters - 1 > 0:
        aicc = aic + (2 * n_parameters * (n_parameters + 1)) / (n - n_parameters - 1)
    else:
        aicc = float("inf")

    return {
        "aic": float(aic),
        "aicc": float(aicc),
        "bic": float(bic),
        "log_likelihood": float(log_likelihood),
        "log_loss": float(-log_likelihood / n),
        "n_obs": n,
        "n_parameters": n_parameters,
    }

The Diebold-Mariano Test

The Diebold-Mariano (DM) test is a statistical test for comparing the predictive accuracy of two forecasting models. It tests the null hypothesis that two models have equal expected loss:

$$H_0: \mathbb{E}[d_t] = 0 \quad \text{where} \quad d_t = L(e_{1,t}) - L(e_{2,t})$$

Here, $d_t$ is the difference in loss between models 1 and 2 at time $t$, and $L$ is the loss function (Brier score or log loss for individual predictions).

The DM test statistic is:

$$\text{DM} = \frac{\bar{d}}{\sqrt{\hat{V}(\bar{d})}}$$

where $\bar{d} = \frac{1}{n} \sum_{t=1}^{n} d_t$ is the mean loss differential and $\hat{V}(\bar{d})$ is the estimated variance of $\bar{d}$, accounting for autocorrelation using the Newey-West estimator:

$$\hat{V}(\bar{d}) = \frac{1}{n} \left[ \hat{\gamma}_0 + 2 \sum_{k=1}^{h} \left(1 - \frac{k}{h+1}\right) \hat{\gamma}_k \right]$$

where $\hat{\gamma}_k$ is the sample autocovariance of $d_t$ at lag $k$ and $h$ is the truncation lag.

Under the null hypothesis, the DM statistic follows a standard normal distribution asymptotically. A large positive DM statistic indicates that Model 2 is significantly better (lower loss) than Model 1.

from scipy import stats


def diebold_mariano_test(
    predictions_1: np.ndarray,
    predictions_2: np.ndarray,
    outcomes: np.ndarray,
    loss_fn: str = "brier",
    max_lag: Optional[int] = None,
) -> dict[str, float]:
    """
    Diebold-Mariano test for comparing two forecasting models.

    Tests whether the difference in predictive accuracy between two
    models is statistically significant, accounting for temporal
    autocorrelation in the loss differential series.

    A positive test statistic means Model 2 is better (lower loss).
    A negative test statistic means Model 1 is better.

    Args:
        predictions_1: Probability forecasts from Model 1.
        predictions_2: Probability forecasts from Model 2.
        outcomes: True binary outcomes.
        loss_fn: Loss function to use: 'brier' for squared error,
            'log' for log loss.
        max_lag: Maximum lag for the Newey-West variance estimator.
            If None, uses int(n^(1/3)).

    Returns:
        Dictionary with keys:
            'dm_statistic': The test statistic.
            'p_value': Two-sided p-value.
            'mean_diff': Mean loss difference (positive = Model 2 better).
            'model_1_loss': Mean loss of Model 1.
            'model_2_loss': Mean loss of Model 2.
            'conclusion': Human-readable interpretation.

    Example:
        >>> preds_1 = np.array([0.6, 0.4, 0.7, 0.5, 0.8])
        >>> preds_2 = np.array([0.65, 0.35, 0.72, 0.48, 0.82])
        >>> outcomes = np.array([1, 0, 1, 0, 1])
        >>> result = diebold_mariano_test(preds_1, preds_2, outcomes)
    """
    n = len(outcomes)

    # Compute individual losses
    if loss_fn == "brier":
        losses_1 = (predictions_1 - outcomes) ** 2
        losses_2 = (predictions_2 - outcomes) ** 2
    elif loss_fn == "log":
        eps = 1e-15
        p1 = np.clip(predictions_1, eps, 1 - eps)
        p2 = np.clip(predictions_2, eps, 1 - eps)
        losses_1 = -(outcomes * np.log(p1) + (1 - outcomes) * np.log(1 - p1))
        losses_2 = -(outcomes * np.log(p2) + (1 - outcomes) * np.log(1 - p2))
    else:
        raise ValueError(f"Unknown loss function: {loss_fn}")

    # Loss differentials
    d = losses_1 - losses_2  # Positive means Model 2 is better
    d_mean = d.mean()

    # Newey-West variance estimate
    if max_lag is None:
        max_lag = int(n ** (1 / 3))

    # Autocovariances
    d_centered = d - d_mean
    gamma_0 = np.mean(d_centered ** 2)

    nw_sum = gamma_0
    for k in range(1, max_lag + 1):
        gamma_k = np.mean(d_centered[k:] * d_centered[:-k])
        weight = 1 - k / (max_lag + 1)  # Bartlett kernel
        nw_sum += 2 * weight * gamma_k

    variance = nw_sum / n

    if variance <= 0:
        return {
            "dm_statistic": 0.0,
            "p_value": 1.0,
            "mean_diff": float(d_mean),
            "model_1_loss": float(losses_1.mean()),
            "model_2_loss": float(losses_2.mean()),
            "conclusion": "Variance estimate non-positive; test inconclusive.",
        }

    # DM statistic
    dm_stat = d_mean / np.sqrt(variance)
    p_value = 2 * (1 - stats.norm.cdf(abs(dm_stat)))

    # Interpretation
    alpha = 0.05
    if p_value < alpha:
        if dm_stat > 0:
            conclusion = (
                f"Model 2 is significantly better than Model 1 "
                f"(p = {p_value:.4f})"
            )
        else:
            conclusion = (
                f"Model 1 is significantly better than Model 2 "
                f"(p = {p_value:.4f})"
            )
    else:
        conclusion = (
            f"No significant difference between models "
            f"(p = {p_value:.4f})"
        )

    return {
        "dm_statistic": float(dm_stat),
        "p_value": float(p_value),
        "mean_diff": float(d_mean),
        "model_1_loss": float(losses_1.mean()),
        "model_2_loss": float(losses_2.mean()),
        "conclusion": conclusion,
    }
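
A brief usage sketch on simulated forecasts (the data below is invented for illustration; only the diebold_mariano_test function above is assumed):

rng = np.random.default_rng(7)
n_games = 800
true_p = rng.uniform(0.30, 0.70, size=n_games)
outcomes = rng.binomial(1, true_p)

# Model 2 tracks the true probabilities slightly more closely than Model 1
preds_1 = np.clip(true_p + rng.normal(0, 0.10, size=n_games), 0.01, 0.99)
preds_2 = np.clip(true_p + rng.normal(0, 0.06, size=n_games), 0.01, 0.99)

result = diebold_mariano_test(preds_1, preds_2, outcomes, loss_fn="brier")
print(result["conclusion"])
print(
    f"DM statistic: {result['dm_statistic']:.2f}, "
    f"Model 1 Brier: {result['model_1_loss']:.4f}, "
    f"Model 2 Brier: {result['model_2_loss']:.4f}"
)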

Cross-Validated Brier Score for Model Selection

When comparing models of different complexity (e.g., logistic regression vs. a deep neural network), the walk-forward cross-validated Brier score provides a robust comparison because every score is computed on data the model did not see during training; a more flexible model earns credit only when its extra complexity genuinely generalizes:

def cross_validated_model_comparison(
    features: pd.DataFrame,
    targets: pd.Series,
    model_factories: dict[str, Callable],
    validator: WalkForwardValidator,
) -> pd.DataFrame:
    """
    Compare multiple models using walk-forward cross-validated Brier score.

    For each model, runs the full walk-forward validation procedure
    and collects per-fold Brier scores. Then performs pairwise
    Diebold-Mariano tests to determine statistical significance of
    differences.

    Args:
        features: Feature DataFrame.
        targets: Target Series.
        model_factories: Dict mapping model names to callables that
            return untrained model instances.
        validator: Walk-forward validation splitter.

    Returns:
        Summary DataFrame with per-model Brier score statistics, sorted
        from best to worst. The pairwise Diebold-Mariano table is printed
        but not returned.
    """
    all_results = {}
    all_predictions = {}

    for name, factory in model_factories.items():
        print(f"\n{'=' * 40}")
        print(f"Evaluating: {name}")
        print(f"{'=' * 40}")

        result = walk_forward_evaluate(
            features, targets, factory, validator,
        )
        all_results[name] = result
        all_predictions[name] = (
            result["all_predictions"],
            result["all_targets"],
        )

    # Summary table
    summary_rows = []
    for name, result in all_results.items():
        summary_rows.append({
            "model": name,
            "mean_brier": result["mean_score"],
            "std_brier": result["std_score"],
            "overall_brier": result["overall_score"],
            "n_folds": result["n_folds"],
        })

    summary = pd.DataFrame(summary_rows).sort_values("mean_brier")

    # Pairwise DM tests
    model_names = list(all_predictions.keys())
    dm_results = []

    for i in range(len(model_names)):
        for j in range(i + 1, len(model_names)):
            name_1 = model_names[i]
            name_2 = model_names[j]
            preds_1 = all_predictions[name_1][0]
            preds_2 = all_predictions[name_2][0]
            targets_arr = all_predictions[name_1][1]

            dm = diebold_mariano_test(preds_1, preds_2, targets_arr)
            dm_results.append({
                "model_1": name_1,
                "model_2": name_2,
                "dm_statistic": dm["dm_statistic"],
                "p_value": dm["p_value"],
                "conclusion": dm["conclusion"],
            })

    dm_table = pd.DataFrame(dm_results)

    print("\n" + "=" * 60)
    print("MODEL COMPARISON SUMMARY")
    print("=" * 60)
    print(summary.to_string(index=False))
    print("\nPairwise Diebold-Mariano Tests:")
    print(dm_table.to_string(index=False))

    return summary
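
A usage sketch is shown below. It assumes the feature DataFrame and target Series built earlier in the pipeline, the WalkForwardValidator class defined earlier in this chapter, and two scikit-learn candidate models; the validator constructor arguments are illustrative placeholders, not the class's actual signature.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Candidate models: callables that return *untrained* estimators.
model_factories = {
    "logistic": lambda: LogisticRegression(max_iter=1000),
    "gbm": lambda: GradientBoostingClassifier(n_estimators=200, max_depth=3),
}

# NOTE: the arguments below are placeholders; use the constructor
# parameters of the WalkForwardValidator defined earlier in this chapter.
validator = WalkForwardValidator(n_splits=8, min_train_size=1000)

# `features` and `targets` come from the feature pipeline (Chapter 28).
summary = cross_validated_model_comparison(
    features, targets, model_factories, validator
)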

Practical Selection Criteria

Beyond statistical tests, practical considerations often drive model selection in sports betting:

| Criterion | Description | Importance |
|---|---|---|
| Predictive accuracy | Cross-validated Brier score or log loss | Primary |
| Calibration | ECE and reliability diagram quality | High |
| Profitability | Backtested ROI with realistic vig | High |
| Stability | Low variance in cross-validation scores across folds | Medium |
| Interpretability | Can you explain why the model made each prediction? | Medium |
| Computational cost | Training time, inference latency | Low-Medium |
| Complexity | Number of parameters, risk of overfitting | Medium |
| Maintainability | Ease of retraining, feature pipeline complexity | Medium |

A pragmatic approach is to select the simplest model whose performance is not statistically significantly worse than the best model (using the DM test). This follows the principle of parsimony: if a logistic regression achieves Brier score 0.215 and a deep neural network achieves 0.210 but the DM test shows $p = 0.18$, the logistic regression is the better choice --- it is simpler, more interpretable, faster to retrain, and less likely to overfit.
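
One way to operationalize this rule is sketched below, assuming out-of-sample predictions for each candidate and the diebold_mariano_test function defined above. The helper name and signature are illustrative, not from earlier chapters.

def select_parsimonious_model(
    candidates: list[str],
    predictions: dict[str, np.ndarray],
    outcomes: np.ndarray,
    alpha: float = 0.05,
) -> str:
    """
    Pick the simplest model that is not significantly worse than the best.

    `candidates` must be ordered from simplest to most complex, and
    `predictions` maps each candidate name to its out-of-sample forecasts.
    """
    # Identify the best model by mean Brier score
    brier = {
        name: float(np.mean((predictions[name] - outcomes) ** 2))
        for name in candidates
    }
    best = min(brier, key=brier.get)

    for name in candidates:  # simplest first
        if name == best:
            return name
        dm = diebold_mariano_test(predictions[name], predictions[best], outcomes)
        # dm_statistic > 0 means the second model (here: the best one) has
        # lower loss; keep the simpler model unless that gap is significant.
        if not (dm["p_value"] < alpha and dm["dm_statistic"] > 0):
            return name
    return best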

Model Ensembling

When multiple models have comparable performance, ensembling (averaging their predictions) often outperforms any individual model. The simple average:

$$\hat{p}_{\text{ensemble}} = \frac{1}{M} \sum_{m=1}^{M} \hat{p}_m$$

works well when models make different types of errors (i.e., their errors are not perfectly correlated). More sophisticated approaches weight models by their cross-validated performance:

$$\hat{p}_{\text{weighted}} = \sum_{m=1}^{M} w_m \hat{p}_m \quad \text{where} \quad w_m \propto \frac{1}{\text{BS}_m}$$

def ensemble_predictions(
    model_predictions: dict[str, np.ndarray],
    outcomes: np.ndarray,
    method: str = "inverse_brier",
) -> np.ndarray:
    """
    Combine predictions from multiple models into an ensemble.

    Args:
        model_predictions: Dict mapping model names to prediction arrays.
        outcomes: True outcomes (used for weighting if method requires it).
        method: 'simple' for equal weights, 'inverse_brier' for weights
            inversely proportional to Brier score.

    Returns:
        Ensemble prediction array.
    """
    names = list(model_predictions.keys())
    preds_matrix = np.column_stack([model_predictions[n] for n in names])

    if method == "simple":
        weights = np.ones(len(names)) / len(names)

    elif method == "inverse_brier":
        brier_scores = []
        for name in names:
            bs = compute_brier_score(model_predictions[name], outcomes)
            brier_scores.append(bs)
        brier_scores = np.array(brier_scores)

        # Inverse Brier score weights (lower BS = higher weight)
        inverse_bs = 1.0 / np.maximum(brier_scores, 1e-10)
        weights = inverse_bs / inverse_bs.sum()

    else:
        raise ValueError(f"Unknown method: {method}")

    ensemble = preds_matrix @ weights

    # Report
    ensemble_bs = compute_brier_score(ensemble, outcomes)
    print("Ensemble Weights:")
    for name, w in zip(names, weights):
        bs = compute_brier_score(model_predictions[name], outcomes)
        print(f"  {name}: weight={w:.3f}, Brier={bs:.4f}")
    print(f"  Ensemble: Brier={ensemble_bs:.4f}")

    return ensemble
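
A usage sketch on simulated predictions (the model names and noise levels are invented; only the ensemble_predictions function above, and the compute_brier_score it calls, are assumed):

rng = np.random.default_rng(11)
n_games = 600
true_p = rng.uniform(0.30, 0.70, size=n_games)
val_outcomes = rng.binomial(1, true_p)

# Three hypothetical models with different error levels
model_predictions = {
    "logistic": np.clip(true_p + rng.normal(0, 0.09, size=n_games), 0.01, 0.99),
    "gbm": np.clip(true_p + rng.normal(0, 0.07, size=n_games), 0.01, 0.99),
    "neural_net": np.clip(true_p + rng.normal(0, 0.08, size=n_games), 0.01, 0.99),
}

ensemble_preds = ensemble_predictions(
    model_predictions, val_outcomes, method="inverse_brier"
)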

Common Pitfall: Model ensembles can hide overfitting. If you select which models to include in the ensemble based on their performance on the same data used to evaluate the ensemble, you introduce selection bias. Use a three-way split: (1) train individual models, (2) select ensemble composition and weights on a validation set, (3) evaluate the final ensemble on a held-out test set.
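
A minimal sketch of such a chronological three-way split follows; the `games` DataFrame, the `game_date` column name, and the split dates are hypothetical placeholders.

# `games` is a hypothetical DataFrame of historical games with a
# `game_date` column; the split dates are illustrative placeholders.
games = games.sort_values("game_date")

train = games[games["game_date"] < "2022-07-01"]
validation = games[
    (games["game_date"] >= "2022-07-01") & (games["game_date"] < "2023-07-01")
]
test = games[games["game_date"] >= "2023-07-01"]

# 1) Fit each candidate model on `train`.
# 2) Choose ensemble members and weights using predictions on `validation`.
# 3) Evaluate the frozen ensemble exactly once on `test`.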


30.6 Chapter Summary

Key Concepts

  1. Proper scoring rules (Brier score, log loss) are essential for evaluating probabilistic predictions because they reward both accuracy and calibration. Unlike raw accuracy, they capture the full quality of probability estimates.

  2. The Brier score decomposition into reliability, resolution, and uncertainty provides diagnostic insight: reliability measures calibration quality, resolution measures discriminative ability, and uncertainty is a constant reflecting problem difficulty.

  3. Brier Skill Score compares a model against a reference forecast. Using market-implied probabilities as the reference answers the crucial question: does your model add value beyond the market?

  4. Backtesting simulates a model's historical betting performance with realistic transaction costs. It must avoid look-ahead bias by ensuring that only historically available information is used at each prediction point.

  5. Walk-forward validation (expanding window or sliding window) is the correct cross-validation strategy for time series sports data. Standard k-fold cross-validation produces optimistically biased estimates due to temporal leakage.

  6. Purged cross-validation adds a gap between training and test folds to prevent leakage through feature lookback windows. The purge gap should equal or exceed the longest feature window.

  7. Calibration plots and ECE assess whether predicted probabilities match observed frequencies. Poorly calibrated models can be improved through recalibration (Platt scaling or isotonic regression) without retraining.

  8. The Diebold-Mariano test provides a statistical test for whether two models have significantly different predictive accuracy, accounting for autocorrelation in the loss differential series.

  9. Information criteria (AIC, BIC) balance model fit against complexity. BIC is generally more appropriate for betting models because it penalizes complexity more heavily, reducing overfitting risk.

  10. Model ensembling often outperforms any individual model when ensemble members make diverse errors. Weighting by inverse Brier score concentrates weight on better-performing models.

Key Formulas

| Formula | Description |
|---|---|
| $\text{BS} = \frac{1}{n}\sum(p_i - y_i)^2$ | Brier score |
| $\text{BS} = \text{REL} - \text{RES} + \text{UNC}$ | Brier decomposition |
| $\text{BSS} = 1 - \text{BS}_{\text{model}} / \text{BS}_{\text{ref}}$ | Brier Skill Score |
| $\text{LL} = -\frac{1}{n}\sum[y\log p + (1-y)\log(1-p)]$ | Log loss |
| $\text{ECE} = \sum_k \frac{n_k}{n} \lvert \bar{y}_k - \bar{p}_k \rvert$ | Expected Calibration Error |
| $\text{AIC} = 2k - 2\ln\hat{L}$ | Akaike Information Criterion |
| $\text{BIC} = k\ln n - 2\ln\hat{L}$ | Bayesian Information Criterion |
| $\text{DM} = \bar{d} / \sqrt{\hat{V}(\bar{d})}$ | Diebold-Mariano test statistic |

Key Code Patterns

  1. Scoring functions (compute_brier_score, compute_log_loss, brier_score_decomposition): Evaluate probabilistic predictions with proper scoring rules and diagnostic decompositions.

  2. Backtesting framework (BettingBacktester): Simulates historical betting with configurable staking strategies, edge thresholds, and bankroll tracking.

  3. Walk-forward validators (WalkForwardValidator, PurgedKFoldCV): Generate temporally valid train/test splits that prevent data leakage in time series sports data.

  4. Calibration analysis (calibration_analysis, ModelRecalibrator): Assess and improve the calibration of probability forecasts through reliability diagrams and recalibration.

  5. Model comparison (diebold_mariano_test, cross_validated_model_comparison): Statistically rigorous comparison of competing models with proper accounting for autocorrelation.

  6. Ensemble (ensemble_predictions): Combine multiple models with performance-weighted averaging.

Decision Framework: Model Evaluation and Selection

START: You have trained one or more models and generated predictions.

1. Compute proper scoring rules.
   - Brier score and log loss for each model.
   - Brier score decomposition for diagnostic insight.
   - Brier Skill Score relative to market-implied probabilities.

2. Assess calibration.
   - Generate reliability diagrams.
   - Compute ECE and MCE.
   - If poorly calibrated, apply Platt scaling or isotonic recalibration.

3. Validate with walk-forward scheme.
   - Use expanding or sliding window (NOT standard k-fold).
   - Ensure purge gap >= longest feature lookback window.
   - Report mean and standard deviation of fold-level Brier scores.

4. Compare models statistically.
   - Compute AIC/BIC for complexity-penalized comparison.
   - Run Diebold-Mariano test for pairwise significance.
   - If no significant difference, prefer the simpler model.

5. Backtest with realistic betting conditions.
   - Apply vig (typically -110 both sides = 4.76%).
   - Use fractional Kelly staking.
   - Report ROI, max drawdown, and Sharpe ratio.

6. Select final model (or ensemble).
   - If multiple models are comparable, ensemble them.
   - Validate ensemble on held-out test period.
   - Document model choice and rationale.

END: Deploy the selected model for live prediction.

What's Next

With the tools developed across Chapters 28 through 30 --- feature engineering, neural network architectures, and rigorous evaluation --- you now possess a complete machine learning pipeline for sports prediction. In the chapters ahead, we will explore advanced topics including ensemble methods and model stacking (combining multiple model types for superior performance), reinforcement learning for dynamic bet sizing (adapting strategy as bankroll and market conditions evolve), and live deployment considerations (moving from backtested models to production systems that make real-time predictions on live data). The evaluation framework from this chapter will remain central: every new technique must prove its worth through proper scoring rules, walk-forward validation, and realistic backtesting before it earns a place in your production system.


This chapter is part of The Sports Betting Textbook, a comprehensive guide to quantitative sports betting. All code examples use Python 3.11+ and are available in the companion repository.