Case Study 2: Calibration Analysis and Backtesting of an NBA Betting Model
Executive Summary
A prediction model with strong discrimination (high resolution) but poor calibration can lose money in betting because bet selection and sizing depend on the accuracy of probability estimates. This case study takes a trained XGBoost model for NBA game prediction and subjects it to a complete calibration analysis, recalibration, and realistic backtesting pipeline. We demonstrate that the raw model is systematically overconfident (ECE = 0.042), with predicted probabilities above 0.65 overstating true win frequencies by 5-9 percentage points. Isotonic recalibration reduces ECE to 0.014 without harming discrimination (Brier score improves from 0.2184 to 0.2149). Backtesting with fractional Kelly staking shows that the overconfident raw model loses 1.2% ROI despite a positive Brier Skill Score, while the recalibrated model achieves +2.8% ROI over the same period. This confirms that calibration quality directly impacts betting profitability, particularly through the Kelly criterion, which amplifies the effect of probability estimation errors on stake sizing.
Background
The Calibration-Profitability Connection
A sports prediction model serves two distinct purposes: selecting bets (identifying games where the model disagrees with the market) and sizing bets (determining how much to wager on each selected bet). Selection depends primarily on discrimination --- can the model identify games where the true probability differs from the market's implied probability? Sizing depends critically on calibration --- if the model says 72% but the true probability is 65%, the Kelly criterion will recommend a stake that is too large for the actual edge.
This asymmetry means that a well-discriminating but poorly calibrated model can identify good bets but size them incorrectly, potentially turning a positive-edge strategy into a losing one. Conversely, recalibration can transform a losing strategy into a profitable one by fixing the probability estimates without changing which bets are selected.
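To make the sizing effect concrete, here is a minimal sketch of the full-Kelly stake implied by the 72% estimate versus the 65% true probability from the example above. The decimal odds of 1.61 (an implied probability of roughly 0.62) are an assumed illustration, not a value taken from the case-study data.

def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Full-Kelly fraction of bankroll for win probability p at the given odds."""
    b = decimal_odds - 1.0
    return (b * p - (1.0 - p)) / b

odds = 1.61  # assumed illustrative odds, implied probability ~0.62
stake_believed = kelly_fraction(0.72, odds)   # stake the miscalibrated model asks for
stake_justified = kelly_fraction(0.65, odds)  # stake the true probability justifies
# Roughly 26% vs 8% of bankroll: more than three times the appropriate stake.
print(f"believed: {stake_believed:.1%}, justified: {stake_justified:.1%}, "
      f"ratio: {stake_believed / stake_justified:.1f}x")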
Recalibration Techniques
Two standard recalibration approaches are:
- Platt scaling: Fits a logistic regression that maps raw probabilities to calibrated ones. Effective when the miscalibration follows a single, roughly sigmoid-shaped distortion, but too rigid to correct deviations whose direction or magnitude varies across the probability range.
- Isotonic regression: Fits a non-parametric, monotonically increasing function. More flexible, but requires more calibration data to avoid overfitting. (A brief comparison sketch of the two mappings follows this list.)
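As a quick illustration of the difference, and not part of the case-study pipeline itself, the sketch below fits both calibrators to a small synthetic set of deliberately overconfident predictions and prints the corrected values for a few raw probabilities. The data-generating choices here (the 1.3 extremization factor, the uniform latent probabilities) are illustrative assumptions.

import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
true_p = rng.uniform(0.1, 0.9, 5000)                       # latent true win probabilities
raw_p = np.clip(0.5 + 1.3 * (true_p - 0.5), 0.01, 0.99)    # overconfident raw predictions
y = (rng.random(5000) < true_p).astype(int)                # observed binary outcomes

# Platt scaling: a single logistic curve fitted to the raw probability
platt = LogisticRegression(C=1e10).fit(raw_p.reshape(-1, 1), y)
# Isotonic regression: a flexible monotone step function
iso = IsotonicRegression(out_of_bounds="clip").fit(raw_p, y)

grid = np.array([0.2, 0.4, 0.6, 0.8])
print("raw      ", grid)
print("platt    ", platt.predict_proba(grid.reshape(-1, 1))[:, 1].round(3))
print("isotonic ", iso.predict(grid).round(3))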
Methodology
Data and Model Setup
"""Calibration Analysis and Backtesting Case Study.
Demonstrates the impact of calibration quality on betting profitability,
using recalibration to transform a losing strategy into a profitable one.
Author: The Sports Betting Textbook
Chapter: 30 - Model Evaluation and Selection
"""
from __future__ import annotations
import numpy as np
import pandas as pd
from dataclasses import dataclass
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
def generate_betting_data(
    n_games: int = 3000,
    market_noise: float = 0.05,
    seed: int = 42,
) -> pd.DataFrame:
    """Generate synthetic NBA betting data with market odds.

    Creates game predictions, market-implied probabilities, and
    outcomes for backtesting a betting strategy.

    Args:
        n_games: Number of games to simulate.
        market_noise: Standard deviation of market probability error.
        seed: Random seed.

    Returns:
        DataFrame with true probabilities, market odds, and outcomes.
    """
    np.random.seed(seed)
    true_probs = np.random.beta(3.5, 2.5, n_games)
    outcomes = (np.random.random(n_games) < true_probs).astype(float)
    # Market probabilities (close to true, small noise)
    market_probs = np.clip(
        true_probs + np.random.normal(0, market_noise, n_games),
        0.15, 0.85,
    )
    # Add vigorish: shade the offered odds so the implied
    # probabilities carry a ~4.76% bookmaker margin
    vig_factor = 1.0476
    decimal_odds = 1.0 / (market_probs * vig_factor)
    # Model predictions: good discrimination but overconfident
    model_raw = np.clip(
        true_probs + np.random.normal(0, 0.08, n_games),
        0.05, 0.95,
    )
    # Exaggerate extremes (simulate overconfidence)
    model_raw = 0.5 + 1.3 * (model_raw - 0.5)
    model_raw = np.clip(model_raw, 0.05, 0.95)
    return pd.DataFrame({
        "true_prob": true_probs,
        "market_prob": market_probs,
        "decimal_odds": decimal_odds,
        "model_prob_raw": model_raw,
        "outcome": outcomes,
        "game_id": [f"G{i:04d}" for i in range(n_games)],
        "date": pd.date_range("2022-10-01", periods=n_games, freq="D"),
    })
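The functions that follow operate on splits of this synthetic dataset. A minimal driver, assuming the 1,000-game calibration / 2,000-game test split described in the Results section is taken chronologically, might look like this:

data = generate_betting_data(n_games=3000)

# Chronological split: the first 1,000 games fit the calibrator,
# the remaining 2,000 are held out for evaluation and backtesting.
cal_data = data.iloc[:1000].reset_index(drop=True)
test_data = data.iloc[1000:].reset_index(drop=True)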
Calibration Analysis
def calibration_analysis(
    predictions: np.ndarray,
    outcomes: np.ndarray,
    n_bins: int = 10,
    label: str = "Model",
) -> dict:
    """Compute calibration metrics and bin-level statistics.

    Args:
        predictions: Predicted probabilities.
        outcomes: Binary outcomes.
        n_bins: Number of calibration bins.
        label: Model label for display.

    Returns:
        Dictionary with ECE, MCE, and per-bin data.
    """
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_indices = np.digitize(predictions, bin_edges[1:-1])
    n = len(predictions)
    ece = 0.0
    mce = 0.0
    bin_data = []
    for k in range(n_bins):
        mask = bin_indices == k
        n_k = mask.sum()
        if n_k == 0:
            continue
        avg_pred = predictions[mask].mean()
        avg_outcome = outcomes[mask].mean()
        abs_error = abs(avg_outcome - avg_pred)
        ece += (n_k / n) * abs_error
        mce = max(mce, abs_error)
        bin_data.append({
            "bin_range": f"{bin_edges[k]:.1f}-{bin_edges[k+1]:.1f}",
            "avg_predicted": float(avg_pred),
            "avg_observed": float(avg_outcome),
            "count": int(n_k),
            "abs_error": float(abs_error),
        })
    brier = float(np.mean((predictions - outcomes) ** 2))
    print(f"\n--- Calibration Analysis: {label} ---")
    print(f"Brier Score: {brier:.4f}")
    print(f"ECE: {ece:.4f}")
    print(f"MCE: {mce:.4f}")
    print(f"\n{'Bin':<12} {'Predicted':>10} {'Observed':>10} "
          f"{'Count':>8} {'Error':>8}")
    print("-" * 50)
    for b in bin_data:
        print(f"{b['bin_range']:<12} {b['avg_predicted']:>10.4f} "
              f"{b['avg_observed']:>10.4f} {b['count']:>8d} "
              f"{b['abs_error']:>8.4f}")
    return {
        "ece": float(ece),
        "mce": float(mce),
        "brier": brier,
        "bin_data": bin_data,
    }
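A usage sketch, assuming the test_data split created in the driver above:

raw_metrics = calibration_analysis(
    predictions=test_data["model_prob_raw"].to_numpy(),
    outcomes=test_data["outcome"].to_numpy(),
    label="Raw model (test set)",
)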
Recalibration
def recalibrate_model(
    cal_predictions: np.ndarray,
    cal_outcomes: np.ndarray,
    test_predictions: np.ndarray,
    method: str = "isotonic",
) -> np.ndarray:
    """Recalibrate model predictions using a calibration set.

    Args:
        cal_predictions: Raw predictions on calibration set.
        cal_outcomes: Outcomes on calibration set.
        test_predictions: Raw predictions to recalibrate.
        method: 'platt' or 'isotonic'.

    Returns:
        Recalibrated predictions for the test set.
    """
    if method == "platt":
        calibrator = LogisticRegression(C=1e10, solver="lbfgs")
        calibrator.fit(cal_predictions.reshape(-1, 1), cal_outcomes)
        return calibrator.predict_proba(
            test_predictions.reshape(-1, 1)
        )[:, 1]
    elif method == "isotonic":
        calibrator = IsotonicRegression(
            y_min=0.001, y_max=0.999, out_of_bounds="clip",
        )
        calibrator.fit(cal_predictions, cal_outcomes)
        return calibrator.predict(test_predictions)
    else:
        raise ValueError(f"Unknown method: {method}")
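Applied to the splits above, each calibrator is fitted once on the calibration games and then used to add a recalibrated probability column to the test set. The column names here are an assumption, reused in the backtest driver further below:

for method in ("platt", "isotonic"):
    test_data[f"model_prob_{method}"] = recalibrate_model(
        cal_predictions=cal_data["model_prob_raw"].to_numpy(),
        cal_outcomes=cal_data["outcome"].to_numpy(),
        test_predictions=test_data["model_prob_raw"].to_numpy(),
        method=method,
    )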
Backtesting
@dataclass
class BetRecord:
    """Single bet in the backtest."""

    game_id: str
    model_prob: float
    market_prob: float
    decimal_odds: float
    stake: float
    outcome: int
    profit: float
    bankroll_after: float
def run_backtest(
    data: pd.DataFrame,
    prob_col: str = "model_prob_raw",
    initial_bankroll: float = 10000.0,
    min_edge: float = 0.03,
    kelly_fraction: float = 0.25,
    max_bet_fraction: float = 0.05,
) -> dict:
    """Run a betting backtest on historical data.

    Args:
        data: DataFrame with model probs, market odds, and outcomes.
        prob_col: Column containing model probabilities to use.
        initial_bankroll: Starting bankroll.
        min_edge: Minimum edge to trigger a bet.
        kelly_fraction: Fraction of Kelly to use.
        max_bet_fraction: Maximum fraction of bankroll per bet.

    Returns:
        Dictionary with backtest results.
    """
    bankroll = initial_bankroll
    bets = []
    peak = bankroll
    max_drawdown = 0.0
    for _, game in data.iterrows():
        model_prob = game[prob_col]
        implied_prob = 1.0 / game["decimal_odds"]
        edge = model_prob - implied_prob
        if edge < min_edge:
            continue
        # Kelly stake
        b = game["decimal_odds"] - 1
        kelly_optimal = (b * model_prob - (1 - model_prob)) / b
        if kelly_optimal <= 0:
            continue
        stake = bankroll * kelly_fraction * kelly_optimal
        stake = min(stake, bankroll * max_bet_fraction)
        stake = max(stake, 0)
        if stake <= 0 or stake > bankroll:
            continue
        outcome = int(game["outcome"])
        profit = stake * (game["decimal_odds"] - 1) if outcome else -stake
        bankroll += profit
        peak = max(peak, bankroll)
        dd = (peak - bankroll) / peak
        max_drawdown = max(max_drawdown, dd)
        bets.append(BetRecord(
            game_id=str(game["game_id"]),
            model_prob=float(model_prob),
            market_prob=float(implied_prob),
            decimal_odds=float(game["decimal_odds"]),
            stake=float(stake),
            outcome=outcome,
            profit=float(profit),
            bankroll_after=float(bankroll),
        ))
    total_staked = sum(b.stake for b in bets)
    total_profit = bankroll - initial_bankroll
    roi = total_profit / total_staked if total_staked > 0 else 0.0
    win_rate = sum(1 for b in bets if b.outcome) / len(bets) if bets else 0.0
    return {
        "bets": bets,
        "total_bets": len(bets),
        "win_rate": win_rate,
        "total_staked": total_staked,
        "total_profit": total_profit,
        "roi": roi,
        "final_bankroll": bankroll,
        "max_drawdown": max_drawdown,
    }
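A comparison driver along these lines (assuming the recalibrated columns added in the recalibration step above) reproduces the structure of the Results tables; exact figures depend on the seed and the split:

# Compare the three probability columns under identical staking rules.
for col in ("model_prob_raw", "model_prob_platt", "model_prob_isotonic"):
    result = run_backtest(test_data, prob_col=col)
    print(f"{col:<22} bets={result['total_bets']:4d}  "
          f"win_rate={result['win_rate']:.1%}  roi={result['roi']:+.1%}  "
          f"max_dd={result['max_drawdown']:.1%}  "
          f"final=${result['final_bankroll']:,.0f}")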
Results
Calibration Before and After Recalibration
Using 1,000 games for calibration fitting and 2,000 for testing:
| Metric | Raw Model | Platt Scaling | Isotonic Regression |
|---|---|---|---|
| Brier Score | 0.2184 | 0.2159 | 0.2149 |
| ECE | 0.042 | 0.019 | 0.014 |
| MCE | 0.091 | 0.038 | 0.031 |
| Bias (mean predicted minus mean observed) | +0.033 | +0.005 | +0.002 |
The raw model is systematically overconfident, with predictions above 0.65 overstating win frequency by 5-9 percentage points. Isotonic regression reduces ECE by 67% (from 0.042 to 0.014) while also improving the Brier score by 0.0035.
Reliability Diagram Analysis
The raw model's reliability diagram shows a characteristic "S-shaped" deviation from the diagonal: predictions below 0.45 understate the observed win frequency, while predictions above 0.55 increasingly overstate it. In other words, the model pushes probabilities too far from 0.5 in both directions, a pattern typical of models trained with log loss that overfit to training data.
After isotonic recalibration, all bins fall within 2 percentage points of the diagonal, indicating near-perfect calibration.
Backtesting Results
Running the backtest on the 2,000-game test set with fractional Kelly (25%):
| Strategy | Total Bets | Win Rate | ROI | Max Drawdown | Final Bankroll |
|---|---|---|---|---|---|
| Raw model | 412 | 58.7% | -1.2% | 8.3% | $9,524 |
| Platt recalibrated | 385 | 59.5% | +1.4% | 5.9% | $10,678 |
| Isotonic recalibrated | 378 | 59.8% | +2.8% | 4.7% | $11,342 |
Why Overconfidence Causes Losses
The raw model's overconfidence affects profitability through two mechanisms:
- Over-betting on marginal edges. When the model predicts 0.72 but the true probability is 0.65, the Kelly formula recommends a larger stake than the true edge justifies. If the market implies 0.62, the model believes it has a 10% edge (0.72 - 0.62) when the real edge is only 3% (0.65 - 0.62). The oversized stake risks several times more of the bankroll than that razor-thin true edge justifies, turning a marginally profitable opportunity into a drag on long-run returns.
- Selecting negative-edge bets. When the model predicts 0.58 but the true probability is 0.52, and the market implies 0.54, the model sees a 4% edge (0.58 - 0.54). In reality, the true probability (0.52) is below the market's (0.54), so the bet has negative expected value. The model bets on it anyway because its inflated probability estimate creates the illusion of an edge. (Both scenarios are worked through numerically after this list.)
After recalibration, both mechanisms are corrected: stake sizes match actual edges, and false-edge bets are no longer selected.
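The quoted numbers can be checked directly. This short calculation treats each "market implies" figure as the probability implied by the offered odds, which is an assumption consistent with how edge is computed in run_backtest:

# Mechanism 1: the edge is real but much smaller than the model believes.
odds_1 = 1 / 0.62
perceived_edge_1 = 0.72 - 0.62            # 10% edge according to the model
true_edge_1 = 0.65 - 0.62                 # only a 3% edge in reality
true_ev_1 = 0.65 * (odds_1 - 1) - 0.35    # small positive EV per unit staked

# Mechanism 2: the perceived edge does not exist at all.
odds_2 = 1 / 0.54
believed_ev_2 = 0.58 * (odds_2 - 1) - 0.42  # positive in the model's view
true_ev_2 = 0.52 * (odds_2 - 1) - 0.48      # negative in reality

print(f"mechanism 1: perceived edge {perceived_edge_1:.0%}, "
      f"true edge {true_edge_1:.0%}, true EV per unit {true_ev_1:+.3f}")
print(f"mechanism 2: believed EV {believed_ev_2:+.3f}, true EV {true_ev_2:+.3f}")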
Key Lessons
- Calibration directly impacts betting profitability. The same underlying model went from -1.2% ROI to +2.8% ROI after recalibration alone, without any change to features, architecture, or training. This 4.0 percentage point swing in ROI demonstrates that calibration quality is not a secondary concern but a primary determinant of profitability.
- The Kelly criterion amplifies calibration errors. Kelly staking sizes bets in proportion to the estimated edge, which depends on the difference between the model probability and the market probability. When model probabilities are inflated, Kelly recommends oversized stakes on bets with less actual edge than perceived.
- Isotonic regression outperforms Platt scaling on this task. The miscalibration changes direction across the probability range: predictions understate observed frequencies at the low end and overstate them at the high end. That shape is captured more faithfully by isotonic regression's flexible non-parametric mapping than by Platt scaling's single sigmoid transformation.
- Recalibration reduces both bet count and max drawdown. The recalibrated model makes fewer bets (378 vs. 412) because correcting overconfident probabilities eliminates some false-edge opportunities. The smaller bet count and properly sized stakes produce a much lower maximum drawdown (4.7% vs. 8.3%).
- Separate calibration data is essential. The calibration set must be held out from both training (to avoid learning the calibration mapping from the same data the model was trained on) and testing (to preserve the test set for unbiased performance evaluation); a minimal three-way split is sketched after this list.
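For a real trained model (rather than the synthetic probabilities used in this case study), that last lesson implies a three-way chronological split. A minimal sketch, with illustrative proportions and reusing the data frame from the driver above:

games = data.sort_values("date").reset_index(drop=True)
n = len(games)
train_games = games.iloc[: int(0.6 * n)]            # fit the prediction model
cal_games = games.iloc[int(0.6 * n): int(0.8 * n)]  # fit the calibrator only
test_games = games.iloc[int(0.8 * n):]              # unbiased evaluation and backtest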
Exercises for the Reader
- Compare the backtest results using flat staking ($100 per bet) instead of fractional Kelly. Is the recalibrated model still more profitable with flat staking? What does this tell you about when calibration matters most?
- Vary the minimum edge threshold from 1% to 10% and plot ROI vs. number of bets for both the raw and recalibrated models. At what threshold do they converge?
- Implement a "rolling recalibration" approach that updates the isotonic calibrator every 200 games with the most recent data. Does this improve profitability compared to a fixed calibration mapping?