Learning Objectives

  • Understand the theoretical foundations of Monte Carlo simulation including convergence properties and the law of large numbers
  • Build complete season and tournament simulations for NFL, NBA, and March Madness bracket analysis
  • Apply bootstrap methods to construct confidence intervals for betting performance metrics
  • Design and execute permutation tests to evaluate sports hypotheses and model improvements
  • Implement variance reduction techniques to improve simulation efficiency by orders of magnitude

Chapter 24: Simulation and Monte Carlo Methods

"God does not play dice with the universe --- but we can, and should, play dice with our models." --- Adapted from Albert Einstein

Chapter Overview

Many of the most important questions in sports betting cannot be answered analytically. What is the probability that a 48-win NBA team makes it past the second round of the playoffs? How confident should you be that your 55% historical win rate reflects genuine skill rather than luck? If you simulate the remaining NFL season 10,000 times given current standings, what is the distribution of playoff seeds?

These questions require simulation --- the art and science of using random number generation to explore the behavior of complex systems. Monte Carlo methods, named after the famous casino in Monaco, use repeated random sampling to obtain numerical results that would be intractable to compute analytically. Where Chapter 23 built models from historical data, this chapter generates synthetic data from models to answer forward-looking questions.

Monte Carlo simulation is arguably the most versatile tool in the quantitative bettor's arsenal. It can handle arbitrary complexity: non-linear interactions, non-normal distributions, correlated outcomes, path-dependent strategies, and multi-stage tournaments. The price of this versatility is computational cost, which is why we dedicate the final section to variance reduction techniques that make simulations run faster and produce more precise estimates.

This chapter assumes familiarity with probability distributions (Chapter 7) and basic statistical modeling (Chapter 9). We will work entirely in Python, building simulation tools from the ground up.

In this chapter, you will learn to:

  • Build Monte Carlo simulations from first principles with proper convergence diagnostics
  • Simulate entire sports seasons and playoff brackets to compute probabilities of interest
  • Use bootstrap resampling to quantify uncertainty in your betting track record
  • Design permutation tests that rigorously evaluate sports hypotheses
  • Apply variance reduction techniques that can make your simulations 10x to 100x more efficient

24.1 Fundamentals of Monte Carlo Simulation

The Core Idea

The fundamental insight of Monte Carlo simulation is this: if you want to know the expected value of a function $g(X)$ where $X$ is a random variable with distribution $F$, you can estimate it by drawing $N$ independent samples $X_1, X_2, \ldots, X_N$ from $F$ and computing the sample mean:

$$\hat{\mu}_N = \frac{1}{N} \sum_{i=1}^{N} g(X_i) \approx E[g(X)] = \int g(x) \, dF(x)$$

By the Law of Large Numbers, $\hat{\mu}_N \to E[g(X)]$ as $N \to \infty$ (almost surely). By the Central Limit Theorem, for large $N$:

$$\frac{\hat{\mu}_N - E[g(X)]}{\sigma / \sqrt{N}} \xrightarrow{d} N(0, 1)$$

where $\sigma^2 = \text{Var}(g(X))$. This gives us:

  1. Convergence: Our estimate gets closer to the truth as we increase $N$.
  2. Convergence rate: The standard error of our estimate decreases as $O(1/\sqrt{N})$. To halve the error, we need four times as many samples.
  3. Confidence intervals: We can construct approximate confidence intervals: $\hat{\mu}_N \pm z_{\alpha/2} \cdot \hat{\sigma}/\sqrt{N}$.

The $O(1/\sqrt{N})$ convergence rate is both a blessing and a curse. It is a blessing because it is dimension-free --- unlike numerical integration methods whose cost explodes with dimensionality, Monte Carlo works just as well in 1,000 dimensions as in one. It is a curse because improving precision is expensive: going from 1% precision to 0.1% precision requires 100 times as many samples.
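
A quick illustration of this rate: the sketch below (illustrative numbers only) estimates the win probability of a hypothetical 2.5-point favorite whose margin is modeled as Normal(2.5, 13.5), a quantity we can also compute exactly, and shows the standard error shrinking like $1/\sqrt{N}$.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Exact answer for comparison: P(margin > 0) under Normal(2.5, 13.5)
exact = 1 - stats.norm.cdf(0, loc=2.5, scale=13.5)

for n in [100, 1_000, 10_000, 100_000, 1_000_000]:
    margins = rng.normal(2.5, 13.5, size=n)
    p_hat = np.mean(margins > 0)
    se = np.sqrt(p_hat * (1 - p_hat) / n)  # shrinks like 1/sqrt(n)
    print(f"N={n:>9,}  estimate={p_hat:.4f}  SE={se:.4f}  "
          f"error={abs(p_hat - exact):.4f}")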

Random Number Generation

All Monte Carlo simulation rests on the ability to generate sequences of random (more precisely, pseudorandom) numbers. Modern pseudorandom number generators (PRNGs) like the Mersenne Twister (used by NumPy's legacy RandomState) and the PCG family (used by NumPy's modern Generator) produce sequences that pass stringent statistical tests for randomness while being deterministically reproducible given a seed.

Key principles for simulation in practice:

  • Always set and record seeds for reproducibility. A simulation that cannot be reproduced is worthless for peer review or debugging.
  • Use NumPy's modern Generator API (np.random.default_rng(seed)) rather than the legacy np.random functions, which use a global state and are less suitable for parallel simulations.
  • Generate in bulk rather than one at a time. rng.normal(0, 1, size=10000) is orders of magnitude faster than calling rng.normal(0, 1) in a loop 10,000 times due to NumPy's vectorized operations.
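
As a brief illustration of these practices (timings vary by machine), the sketch below seeds a Generator, samples in bulk rather than in a loop, and derives independent child streams for parallel work via SeedSequence.

import time
import numpy as np

rng = np.random.default_rng(seed=2024)  # record this seed with your results

t0 = time.time()
bulk = rng.normal(0, 1, size=100_000)  # one vectorized call
t_bulk = time.time() - t0

t0 = time.time()
loop = np.array([rng.normal(0, 1) for _ in range(100_000)])  # slow path
t_loop = time.time() - t0

print(f"Bulk generation: {t_bulk:.4f}s   Loop generation: {t_loop:.4f}s")

# Independent streams for parallel simulations, no shared global state
child_seeds = np.random.SeedSequence(2024).spawn(4)
child_rngs = [np.random.default_rng(s) for s in child_seeds]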

Estimating Probabilities and Expectations

The simplest application of Monte Carlo is estimating a probability. If $A$ is an event, then:

$$P(A) = E[\mathbf{1}_A] \approx \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}_{A}(X_i)$$

where $\mathbf{1}_A$ is the indicator function that equals 1 when event $A$ occurs and 0 otherwise. The count of simulations where $A$ occurs, divided by $N$, estimates $P(A)$.

For betting applications, common quantities to estimate include:

  • The probability that a team wins the championship
  • The expected return of a betting strategy over a season
  • The probability of ruin (bankroll reaching zero) under a given staking plan
  • The distribution of season-end records given current performance

Python Foundation

import numpy as np
import pandas as pd
from scipy import stats
from typing import Callable
import time


class MonteCarloSimulator:
    """
    Foundation class for Monte Carlo simulation with convergence
    diagnostics and confidence interval estimation.

    Provides a structured framework for running simulations,
    tracking convergence, and reporting results with proper
    uncertainty quantification.
    """

    def __init__(self, seed: int = 42):
        """
        Initialize the simulator with a reproducible random seed.

        Args:
            seed: Random seed for reproducibility.
        """
        self.seed = seed
        self.rng = np.random.default_rng(seed)
        self.results = None
        self.running_mean = None
        self.running_se = None

    def run(
        self,
        simulation_func: Callable,
        n_simulations: int = 10000,
        batch_size: int = 1000,
        **kwargs,
    ) -> dict:
        """
        Execute a Monte Carlo simulation with convergence tracking.

        Args:
            simulation_func: A callable that takes an rng object and
                keyword arguments and returns a numeric result.
            n_simulations: Total number of simulation replications.
            batch_size: Number of simulations per batch (for tracking
                convergence).
            **kwargs: Additional arguments passed to simulation_func.

        Returns:
            Dictionary with point estimate, confidence interval,
            standard error, and convergence history.
        """
        results = np.zeros(n_simulations)
        running_means = []
        running_ses = []

        start_time = time.time()

        for i in range(n_simulations):
            results[i] = simulation_func(self.rng, **kwargs)

            # Track convergence at batch boundaries
            if (i + 1) % batch_size == 0:
                current_results = results[: i + 1]
                running_means.append(current_results.mean())
                running_ses.append(current_results.std() / np.sqrt(i + 1))

        elapsed = time.time() - start_time
        self.results = results
        self.running_mean = np.array(running_means)
        self.running_se = np.array(running_ses)

        mean_estimate = results.mean()
        std_estimate = results.std(ddof=1)
        se_estimate = std_estimate / np.sqrt(n_simulations)
        ci_95 = (
            mean_estimate - 1.96 * se_estimate,
            mean_estimate + 1.96 * se_estimate,
        )

        return {
            "mean": mean_estimate,
            "std": std_estimate,
            "standard_error": se_estimate,
            "ci_95_lower": ci_95[0],
            "ci_95_upper": ci_95[1],
            "n_simulations": n_simulations,
            "elapsed_seconds": elapsed,
            "convergence_means": self.running_mean,
            "convergence_ses": self.running_se,
            "raw_results": results,
        }

    def estimate_probability(
        self,
        event_func: Callable,
        n_simulations: int = 10000,
        **kwargs,
    ) -> dict:
        """
        Estimate the probability of an event via Monte Carlo.

        Args:
            event_func: A callable that takes an rng object and
                keyword arguments and returns True/False.
            n_simulations: Number of simulations to run.
            **kwargs: Additional arguments passed to event_func.

        Returns:
            Dictionary with probability estimate, exact binomial
            confidence interval, and simulation details.
        """
        # Wrap boolean function to return 0/1
        def indicator(rng, **kw):
            return float(event_func(rng, **kw))

        result = self.run(indicator, n_simulations, **kwargs)

        # Exact binomial (Clopper-Pearson) CI, which has better coverage
        # than the normal approximation near 0 or 1
        successes = int(round(result["mean"] * n_simulations))
        k, n = successes, n_simulations
        ci_lower = stats.beta.ppf(0.025, k, n - k + 1) if k > 0 else 0.0
        ci_upper = stats.beta.ppf(0.975, k + 1, n - k) if k < n else 1.0

        result["probability"] = result["mean"]
        result["exact_ci_95_lower"] = ci_lower
        result["exact_ci_95_upper"] = ci_upper
        result["successes"] = successes

        return result

    def required_samples(
        self,
        target_se: float,
        pilot_n: int = 1000,
        simulation_func: Callable = None,
        **kwargs,
    ) -> int:
        """
        Estimate the number of samples needed to achieve a target
        standard error, using a pilot simulation.

        Args:
            target_se: Desired standard error.
            pilot_n: Number of pilot simulations to estimate variance.
            simulation_func: The simulation function.
            **kwargs: Arguments for the simulation function.

        Returns:
            Estimated number of samples required.
        """
        pilot = self.run(simulation_func, pilot_n, **kwargs)
        variance = pilot["std"] ** 2
        required_n = int(np.ceil(variance / target_se ** 2))
        return required_n


# Worked Example: Estimating the probability of a parlay hitting
def simulate_three_leg_parlay(
    rng: np.random.Generator,
    win_probs: tuple[float, float, float] = (0.55, 0.52, 0.58),
    correlation: float = 0.0,
) -> float:
    """
    Simulate a three-leg parlay with optional correlation between legs.

    Args:
        rng: NumPy random generator.
        win_probs: Win probability for each leg.
        correlation: Pairwise correlation between leg outcomes.
            0 = independent, >0 = positive correlation.

    Returns:
        1.0 if all three legs win, 0.0 otherwise.
    """
    if correlation == 0:
        # Independent legs
        outcomes = [rng.random() < p for p in win_probs]
    else:
        # Correlated legs using Gaussian copula
        mean = np.zeros(3)
        cov = np.full((3, 3), correlation)
        np.fill_diagonal(cov, 1.0)
        z = rng.multivariate_normal(mean, cov)
        u = stats.norm.cdf(z)
        outcomes = [u[i] < win_probs[i] for i in range(3)]

    return float(all(outcomes))


# Run the simulation
sim = MonteCarloSimulator(seed=42)

print("MONTE CARLO PARLAY SIMULATION")
print("=" * 60)

# Independent legs
result_indep = sim.estimate_probability(
    simulate_three_leg_parlay,
    n_simulations=100000,
    win_probs=(0.55, 0.52, 0.58),
    correlation=0.0,
)
theoretical_indep = 0.55 * 0.52 * 0.58
print(f"\nIndependent legs:")
print(f"  Simulated probability: {result_indep['probability']:.4f}")
print(f"  Theoretical:           {theoretical_indep:.4f}")
print(f"  95% CI: ({result_indep['ci_95_lower']:.4f}, "
      f"{result_indep['ci_95_upper']:.4f})")
print(f"  Standard error:        {result_indep['standard_error']:.5f}")

# Correlated legs (e.g., same-game parlay)
sim2 = MonteCarloSimulator(seed=42)
result_corr = sim2.estimate_probability(
    simulate_three_leg_parlay,
    n_simulations=100000,
    win_probs=(0.55, 0.52, 0.58),
    correlation=0.3,
)
print(f"\nCorrelated legs (rho=0.3):")
print(f"  Simulated probability: {result_corr['probability']:.4f}")
print(f"  95% CI: ({result_corr['ci_95_lower']:.4f}, "
      f"{result_corr['ci_95_upper']:.4f})")
print(f"  Difference from independent: "
      f"{result_corr['probability'] - result_indep['probability']:+.4f}")

Key Insight: The correlation between parlay legs matters enormously for pricing. A same-game parlay (e.g., Team A to win AND the over to hit) has positively correlated legs --- if the team wins big, the over is more likely. This correlation increases the true probability of all legs hitting relative to the naive product of individual probabilities, which is why sportsbooks offer worse odds on same-game parlays than the individual leg prices would imply.


24.2 Season and Tournament Simulations

Simulating a Full Season

A season simulation takes as input a model of game outcomes and runs the entire schedule many times, producing a distribution of possible season outcomes. This is one of the most practically useful applications of Monte Carlo in sports betting, enabling us to:

  • Estimate playoff probabilities for each team
  • Compute the distribution of season win totals (for over/under futures bets)
  • Evaluate the likelihood of specific seeding matchups
  • Assess the impact of a single game on playoff odds ("leverage")

The key modeling decision is how to simulate individual game outcomes. The simplest approach uses win probabilities derived from team power ratings:

$$P(\text{Team A wins}) = \frac{1}{1 + 10^{-(R_A - R_B + H) / s}}$$

where $R_A$ and $R_B$ are team ratings, $H$ is the home-court/home-field advantage, and $s$ is a scaling parameter that controls the relationship between rating differences and win probabilities. This is essentially a logistic model.

For point-spread markets, we may want to simulate the actual score difference, not just the winner. A common approach models the point differential as:

$$\text{Margin} \sim N(R_A - R_B + H, \sigma^2)$$

where $\sigma$ is the game-level standard deviation (approximately 13.5 for the NFL, 12 for the NBA).
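
The margin model also converts directly into moneyline and spread-cover probabilities through the normal CDF. The short sketch below uses hypothetical ratings to show the conversion for a single game; the season simulator that follows applies the same model game by game.

from scipy import stats

rating_home, rating_away, hfa, sigma = 4.0, 1.0, 2.5, 13.5
mu = rating_home - rating_away + hfa  # expected margin

# Continuous-margin approximations (pushes on whole-number spreads ignored)
p_home_win = 1 - stats.norm.cdf(0, loc=mu, scale=sigma)
p_cover_minus3 = 1 - stats.norm.cdf(3.0, loc=mu, scale=sigma)

print(f"Expected margin:       {mu:.1f}")
print(f"P(home wins outright): {p_home_win:.3f}")
print(f"P(home covers -3):     {p_cover_minus3:.3f}")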

Python Implementation: Full Season Simulation

import numpy as np
import pandas as pd
from collections import defaultdict
from typing import Optional


class TeamRatings:
    """
    Container for team power ratings used in simulation.
    """

    def __init__(self, ratings: dict[str, float]):
        """
        Args:
            ratings: Dictionary mapping team names to power ratings.
                A rating of 0 represents a league-average team.
                Positive ratings indicate above-average teams.
        """
        self.ratings = ratings

    def win_probability(
        self,
        home_team: str,
        away_team: str,
        home_advantage: float = 3.0,
        scale: float = 10.0,
    ) -> float:
        """
        Calculate the home team's win probability.

        Uses a logistic model based on the rating difference.

        Args:
            home_team: Name of the home team.
            away_team: Name of the away team.
            home_advantage: Home court/field advantage in rating points.
            scale: Scaling parameter (higher = more parity).

        Returns:
            Probability that the home team wins.
        """
        diff = (
            self.ratings[home_team]
            - self.ratings[away_team]
            + home_advantage
        )
        return 1.0 / (1.0 + 10.0 ** (-diff / scale))

    def expected_margin(
        self,
        home_team: str,
        away_team: str,
        home_advantage: float = 3.0,
    ) -> float:
        """
        Calculate the expected point margin (home - away).

        Args:
            home_team: Name of the home team.
            away_team: Name of the away team.
            home_advantage: Home advantage in points.

        Returns:
            Expected margin (positive favors home team).
        """
        return (
            self.ratings[home_team]
            - self.ratings[away_team]
            + home_advantage
        )


class SeasonSimulator:
    """
    Monte Carlo simulator for full sports seasons.

    Simulates an entire season schedule N times to produce
    distributions of wins, losses, playoff probabilities, and
    other outcomes of interest.
    """

    def __init__(
        self,
        schedule: pd.DataFrame,
        ratings: TeamRatings,
        seed: int = 42,
    ):
        """
        Args:
            schedule: DataFrame with columns 'home_team' and 'away_team'
                representing the full season schedule.
            ratings: TeamRatings object with power ratings for all teams.
            seed: Random seed for reproducibility.
        """
        self.schedule = schedule
        self.ratings = ratings
        self.rng = np.random.default_rng(seed)
        self.teams = sorted(
            set(schedule["home_team"]) | set(schedule["away_team"])
        )

    def simulate_season(
        self,
        margin_std: float = 13.5,
    ) -> dict[str, int]:
        """
        Simulate one complete season and return win totals.

        Args:
            margin_std: Standard deviation of game margins.

        Returns:
            Dictionary mapping team names to win counts.
        """
        wins = defaultdict(int)

        for _, game in self.schedule.iterrows():
            home = game["home_team"]
            away = game["away_team"]

            expected = self.ratings.expected_margin(home, away)
            margin = self.rng.normal(expected, margin_std)

            if margin > 0:
                wins[home] += 1
            elif margin < 0:
                wins[away] += 1
            else:
                # Exact tie: coin flip (extremely rare with continuous dist)
                if self.rng.random() < 0.5:
                    wins[home] += 1
                else:
                    wins[away] += 1

        return dict(wins)

    def run_simulations(
        self,
        n_simulations: int = 10000,
        margin_std: float = 13.5,
        playoff_threshold: int | None = None,
        n_playoff_spots: int | None = None,
    ) -> dict:
        """
        Run N season simulations and aggregate results.

        Args:
            n_simulations: Number of seasons to simulate.
            margin_std: Standard deviation of game margins.
            playoff_threshold: Win count needed for playoffs (if known).
            n_playoff_spots: Number of playoff spots (rank-based cutoff).

        Returns:
            Dictionary with win distributions, playoff probabilities,
            and other aggregate statistics for each team.
        """
        # Track results
        win_totals = {team: [] for team in self.teams}
        playoff_counts = {team: 0 for team in self.teams}
        best_record_counts = {team: 0 for team in self.teams}

        for sim in range(n_simulations):
            season_wins = self.simulate_season(margin_std)

            for team in self.teams:
                w = season_wins.get(team, 0)
                win_totals[team].append(w)

            # Determine playoff teams
            if n_playoff_spots is not None:
                sorted_teams = sorted(
                    self.teams,
                    key=lambda t: season_wins.get(t, 0),
                    reverse=True,
                )
                for team in sorted_teams[:n_playoff_spots]:
                    playoff_counts[team] += 1
                best_record_counts[sorted_teams[0]] += 1

            elif playoff_threshold is not None:
                for team in self.teams:
                    if season_wins.get(team, 0) >= playoff_threshold:
                        playoff_counts[team] += 1

        # Compile results
        results = {}
        for team in self.teams:
            wins = np.array(win_totals[team])
            results[team] = {
                "mean_wins": wins.mean(),
                "std_wins": wins.std(),
                "median_wins": np.median(wins),
                "min_wins": wins.min(),
                "max_wins": wins.max(),
                "win_distribution": wins,
                "p10_wins": np.percentile(wins, 10),
                "p90_wins": np.percentile(wins, 90),
                "playoff_probability": playoff_counts[team] / n_simulations,
                "best_record_probability": (
                    best_record_counts[team] / n_simulations
                ),
            }

        return results


def create_nfl_schedule(teams: list[str], games_per_team: int = 17) -> pd.DataFrame:
    """
    Create a simplified NFL-style schedule.

    Each team plays approximately `games_per_team` games with
    a roughly even home/away split.

    Args:
        teams: List of team names.
        games_per_team: Target games per team.

    Returns:
        DataFrame with 'home_team' and 'away_team' columns.
    """
    games = []
    n_teams = len(teams)
    games_scheduled = {team: 0 for team in teams}

    # Round-robin: each pair meets at most once, alternating home/away
    # by index parity so no team plays every game at home
    for i in range(n_teams):
        for j in range(i + 1, n_teams):
            if (games_scheduled[teams[i]] >= games_per_team
                    or games_scheduled[teams[j]] >= games_per_team):
                continue
            if (i + j) % 2 == 0:
                games.append({"home_team": teams[i], "away_team": teams[j]})
            else:
                games.append({"home_team": teams[j], "away_team": teams[i]})
            games_scheduled[teams[i]] += 1
            games_scheduled[teams[j]] += 1

    return pd.DataFrame(games)


# Worked Example: NFL Season Simulation
print("NFL SEASON SIMULATION")
print("=" * 60)

# Define 8 teams with power ratings
nfl_teams = {
    "Chiefs": 5.0,
    "Bills": 4.0,
    "Eagles": 3.5,
    "49ers": 3.0,
    "Lions": 2.0,
    "Ravens": 1.5,
    "Cowboys": -1.0,
    "Jets": -2.0,
}

ratings = TeamRatings(nfl_teams)
schedule = create_nfl_schedule(list(nfl_teams.keys()), games_per_team=7)

# Run simulations
simulator = SeasonSimulator(schedule, ratings, seed=42)
results = simulator.run_simulations(
    n_simulations=10000,
    margin_std=13.5,
    n_playoff_spots=4,
)

# Display results
print(f"\n{'Team':<12} {'Mean W':>7} {'Std':>6} {'P10-P90':>12} "
      f"{'Playoff%':>10} {'#1 Seed%':>10}")
print("-" * 60)
for team in sorted(nfl_teams.keys(), key=lambda t: -results[t]["mean_wins"]):
    r = results[team]
    print(f"{team:<12} {r['mean_wins']:>7.1f} {r['std_wins']:>6.1f} "
          f"{r['p10_wins']:>5.0f}-{r['p90_wins']:<5.0f} "
          f"{r['playoff_probability']:>9.1%} "
          f"{r['best_record_probability']:>9.1%}")

Tournament and Bracket Simulation

Tournament simulations extend season simulations to multi-round elimination formats. The most famous application is March Madness bracket pools, where millions of brackets are filled out each year and the combinatorial explosion makes analytical computation intractable.

class BracketSimulator:
    """
    Monte Carlo simulator for single-elimination tournament brackets.

    Supports power-rating-based win probabilities with seed-based
    upset adjustments, suitable for NCAA March Madness or NFL
    playoff simulation.
    """

    def __init__(self, seed: int = 42):
        self.rng = np.random.default_rng(seed)

    def simulate_game(
        self,
        team_a: dict,
        team_b: dict,
        upset_factor: float = 10.0,
    ) -> dict:
        """
        Simulate a single tournament game.

        Args:
            team_a: Dict with 'name', 'rating', and optionally 'seed'.
            team_b: Dict with 'name', 'rating', and optionally 'seed'.
            upset_factor: Scaling parameter for win probability.

        Returns:
            The winning team's dictionary.
        """
        diff = team_a["rating"] - team_b["rating"]
        prob_a = 1.0 / (1.0 + 10.0 ** (-diff / upset_factor))
        return team_a if self.rng.random() < prob_a else team_b

    def simulate_bracket(
        self,
        teams: list[dict],
        upset_factor: float = 10.0,
    ) -> dict:
        """
        Simulate a complete single-elimination bracket.

        Teams are assumed to be ordered by seed (1-seed first).
        Standard bracket matchups are used (1v16, 8v9, etc.).

        Args:
            teams: List of team dictionaries, ordered by seed.
            upset_factor: Scaling parameter for upset probability.

        Returns:
            Dictionary with champion, finalist, round-by-round results.
        """
        n = len(teams)
        if n & (n - 1) != 0:
            raise ValueError("Number of teams must be a power of 2.")

        # Standard bracket ordering (1v16, 8v9, 5v12, 4v13, ...)
        if n == 16:
            matchup_order = [0, 15, 7, 8, 4, 11, 3, 12, 5, 10, 2, 13, 6, 9, 1, 14]
        elif n == 8:
            matchup_order = [0, 7, 3, 4, 2, 5, 1, 6]
        elif n == 4:
            matchup_order = [0, 3, 1, 2]
        else:
            matchup_order = list(range(n))

        bracket_teams = [teams[i] for i in matchup_order]
        round_results = []
        current_round = bracket_teams

        round_num = 1
        while len(current_round) > 1:
            next_round = []
            round_matchups = []
            for i in range(0, len(current_round), 2):
                winner = self.simulate_game(
                    current_round[i], current_round[i + 1], upset_factor
                )
                next_round.append(winner)
                round_matchups.append({
                    "team_a": current_round[i]["name"],
                    "team_b": current_round[i + 1]["name"],
                    "winner": winner["name"],
                })
            round_results.append({
                "round": round_num,
                "matchups": round_matchups,
            })
            current_round = next_round
            round_num += 1

        return {
            "champion": current_round[0],
            "rounds": round_results,
        }

    def run_tournament_simulations(
        self,
        teams: list[dict],
        n_simulations: int = 10000,
        upset_factor: float = 10.0,
    ) -> pd.DataFrame:
        """
        Simulate a tournament many times and compute advancement
        probabilities.

        Args:
            teams: List of team dictionaries with 'name' and 'rating'.
            n_simulations: Number of tournament simulations.
            upset_factor: Scaling parameter.

        Returns:
            DataFrame with each team's probability of winning each
            round and the championship.
        """
        n_rounds = int(np.log2(len(teams)))
        team_names = [t["name"] for t in teams]

        # Track advancement counts
        advancement = {
            name: [0] * (n_rounds + 1) for name in team_names
        }

        # Everyone "advances" to round 0 (they're in the tournament)
        for name in team_names:
            advancement[name][0] = n_simulations

        for _ in range(n_simulations):
            result = self.simulate_bracket(teams, upset_factor)

            # Track round-by-round winners
            for round_info in result["rounds"]:
                r = round_info["round"]
                for matchup in round_info["matchups"]:
                    advancement[matchup["winner"]][r] += 1

        # Convert to probabilities
        round_names = ["Field"] + [f"Round {i+1}" for i in range(n_rounds - 1)] + ["Champion"]
        prob_df = pd.DataFrame(
            {name: [c / n_simulations for c in counts]
             for name, counts in advancement.items()},
            index=round_names,
        ).T
        prob_df.index.name = "Team"

        return prob_df


# Worked Example: March Madness Region Simulation
print("\nMARCH MADNESS REGION SIMULATION")
print("=" * 60)

# 16-team region with seed-based ratings
region_teams = [
    {"name": "1-Purdue", "seed": 1, "rating": 12.0},
    {"name": "2-Marquette", "seed": 2, "rating": 10.5},
    {"name": "3-Creighton", "seed": 3, "rating": 9.5},
    {"name": "4-Duke", "seed": 4, "rating": 9.0},
    {"name": "5-Wisconsin", "seed": 5, "rating": 7.5},
    {"name": "6-Texas Tech", "seed": 6, "rating": 7.0},
    {"name": "7-Florida", "seed": 7, "rating": 6.5},
    {"name": "8-Nebraska", "seed": 8, "rating": 5.5},
    {"name": "9-Memphis", "seed": 9, "rating": 5.0},
    {"name": "10-Colorado St", "seed": 10, "rating": 4.5},
    {"name": "11-NC State", "seed": 11, "rating": 4.0},
    {"name": "12-James Madison", "seed": 12, "rating": 3.5},
    {"name": "13-Vermont", "seed": 13, "rating": 2.0},
    {"name": "14-Oakland", "seed": 14, "rating": 1.5},
    {"name": "15-W. Kentucky", "seed": 15, "rating": 0.5},
    {"name": "16-Montana St", "seed": 16, "rating": -1.0},
]

bracket_sim = BracketSimulator(seed=42)
probs = bracket_sim.run_tournament_simulations(
    region_teams, n_simulations=50000, upset_factor=11.0
)

print(f"\n{'Team':<20} {'R32':>6} {'S16':>6} {'E8':>6} {'Champ':>8}")
print("-" * 50)
for team in probs.index:
    row = probs.loc[team]
    print(f"{team:<20} {row.iloc[1]:>6.1%} {row.iloc[2]:>6.1%} "
          f"{row.iloc[3]:>6.1%} {row.iloc[4]:>8.1%}")

Real-World Application: Season simulation is the backbone of sites like FiveThirtyEight's forecasts. For betting, compare your simulated playoff probabilities against the implied probabilities from futures markets. If your simulation gives a team a 25% chance to make the playoffs but the futures market implies only 15%, that represents a potential betting opportunity --- provided you trust your ratings and model.
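
As a hedged sketch of that comparison (hypothetical odds and probabilities), the snippet below converts an American futures price to an implied probability and sets it against a simulated probability; the implied number still contains the book's vig, so treat the gap as an upper bound on any edge.

def american_to_prob(odds: float) -> float:
    """Convert American odds to implied probability (vig included)."""
    if odds > 0:
        return 100.0 / (odds + 100.0)
    return -odds / (-odds + 100.0)

simulated_playoff_prob = 0.25  # e.g., output of your season simulation
futures_odds = 550             # hypothetical market price of +550

implied = american_to_prob(futures_odds)
print(f"Implied probability:   {implied:.3f}")
print(f"Simulated probability: {simulated_playoff_prob:.3f}")
print(f"Apparent edge:         {simulated_playoff_prob - implied:+.3f}")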


24.3 Bootstrap Methods for Confidence Intervals

The Bootstrap Principle

The bootstrap, introduced by Bradley Efron in 1979, is one of the most powerful ideas in modern statistics. The core insight is deceptively simple: if we want to know the sampling distribution of a statistic $\hat{\theta}$, but we cannot derive it analytically, we can approximate it by resampling from our observed data.

Given a sample $x_1, x_2, \ldots, x_n$ from an unknown distribution $F$, the bootstrap proceeds as follows:

  1. Draw a bootstrap sample $x_1^*, x_2^*, \ldots, x_n^*$ by sampling $n$ observations with replacement from the original sample.
  2. Compute the statistic of interest $\hat{\theta}^*$ on the bootstrap sample.
  3. Repeat steps 1-2 $B$ times to obtain bootstrap replicates $\hat{\theta}_1^*, \hat{\theta}_2^*, \ldots, \hat{\theta}_B^*$.
  4. Use the distribution of bootstrap replicates to estimate the standard error, bias, and confidence intervals for $\hat{\theta}$.
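
Before building the full toolkit below, here is the four-step recipe implemented literally as a minimal sketch: a percentile bootstrap for mean profit per bet on a small toy sample.

import numpy as np

rng = np.random.default_rng(0)
profits = np.array([90.91, -100, 90.91, 90.91, -100,
                    -100, 90.91, 90.91, -100, 90.91])  # toy data

B = 10_000
boot_means = np.empty(B)
for b in range(B):
    # Step 1: resample with replacement; Step 2: recompute the statistic
    resample = rng.choice(profits, size=len(profits), replace=True)
    boot_means[b] = resample.mean()
# Steps 3-4: repeat B times, then summarize the replicate distribution
ci = np.percentile(boot_means, [2.5, 97.5])
print(f"Mean profit/bet: {profits.mean():.2f}, 95% percentile CI: {ci}")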

For sports betting, the bootstrap is invaluable because it answers questions like:

  • "My ROI over 500 bets is 3.2%. What is the confidence interval around this estimate?"
  • "My model improved its Brier score by 0.005 compared to the baseline. Is this statistically significant?"
  • "My win rate is 54%. How likely is it that my true win rate is above 52% (the break-even point at -110)?"

Types of Bootstrap Confidence Intervals

Percentile Method: The simplest approach takes the $\alpha/2$ and $1 - \alpha/2$ percentiles of the bootstrap distribution:

$$CI_{1-\alpha} = [\hat{\theta}^*_{(\alpha/2)}, \hat{\theta}^*_{(1-\alpha/2)}]$$

This is easy to compute but can have poor coverage when the bootstrap distribution is skewed.

Basic (Pivotal) Bootstrap: Uses the bootstrap to estimate the distribution of $\hat{\theta} - \theta$:

$$CI_{1-\alpha} = [2\hat{\theta} - \hat{\theta}^*_{(1-\alpha/2)}, \, 2\hat{\theta} - \hat{\theta}^*_{(\alpha/2)}]$$

BCa (Bias-Corrected and Accelerated): The gold standard for bootstrap confidence intervals. It corrects for both bias and skewness in the bootstrap distribution:

$$CI_{1-\alpha} = [\hat{\theta}^*_{(\alpha_1)}, \hat{\theta}^*_{(\alpha_2)}]$$

where $\alpha_1$ and $\alpha_2$ are adjusted percentiles that account for the bias correction factor $\hat{z}_0$ and the acceleration factor $\hat{a}$:

$$\alpha_1 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}\right)$$

$$\alpha_2 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}\right)$$

The bias correction $\hat{z}_0$ measures how the center of the bootstrap distribution differs from the original estimate. The acceleration $\hat{a}$ measures the rate at which the standard error changes with the parameter value.

Python Implementation

import numpy as np
import pandas as pd
from scipy import stats
from typing import Callable, Optional


class BettingBootstrap:
    """
    Bootstrap analysis tools for evaluating betting performance.

    Provides methods for constructing confidence intervals around
    key betting metrics (ROI, win rate, Sharpe ratio) using
    non-parametric, parametric, and BCa bootstrap methods.
    """

    def __init__(self, seed: int = 42):
        self.rng = np.random.default_rng(seed)

    def non_parametric_bootstrap(
        self,
        data: np.ndarray,
        statistic_func: Callable[[np.ndarray], float],
        n_bootstrap: int = 10000,
        ci_level: float = 0.95,
        method: str = "bca",
    ) -> dict:
        """
        Compute bootstrap confidence intervals for a statistic.

        Args:
            data: The observed data array.
            statistic_func: Function that computes the statistic
                from a data array.
            n_bootstrap: Number of bootstrap replicates.
            ci_level: Confidence level (e.g., 0.95 for 95% CI).
            method: CI method: 'percentile', 'basic', or 'bca'.

        Returns:
            Dictionary with point estimate, bootstrap standard error,
            confidence interval, and bootstrap distribution.
        """
        n = len(data)
        observed_stat = statistic_func(data)

        # Generate bootstrap replicates
        boot_stats = np.zeros(n_bootstrap)
        for b in range(n_bootstrap):
            boot_sample = self.rng.choice(data, size=n, replace=True)
            boot_stats[b] = statistic_func(boot_sample)

        # Bootstrap standard error
        boot_se = boot_stats.std(ddof=1)

        # Bootstrap bias
        boot_bias = boot_stats.mean() - observed_stat

        alpha = 1 - ci_level

        if method == "percentile":
            ci_lower = np.percentile(boot_stats, 100 * alpha / 2)
            ci_upper = np.percentile(boot_stats, 100 * (1 - alpha / 2))

        elif method == "basic":
            ci_lower = 2 * observed_stat - np.percentile(
                boot_stats, 100 * (1 - alpha / 2)
            )
            ci_upper = 2 * observed_stat - np.percentile(
                boot_stats, 100 * alpha / 2
            )

        elif method == "bca":
            # Bias correction factor
            z0 = stats.norm.ppf(np.mean(boot_stats < observed_stat))

            # Acceleration factor (jackknife estimate)
            jackknife_stats = np.zeros(n)
            for i in range(n):
                jack_sample = np.delete(data, i)
                jackknife_stats[i] = statistic_func(jack_sample)
            jack_mean = jackknife_stats.mean()
            numerator = np.sum((jack_mean - jackknife_stats) ** 3)
            denominator = 6 * (
                np.sum((jack_mean - jackknife_stats) ** 2) ** 1.5
            )
            a_hat = numerator / denominator if denominator != 0 else 0

            # Adjusted percentiles
            z_alpha_low = stats.norm.ppf(alpha / 2)
            z_alpha_high = stats.norm.ppf(1 - alpha / 2)

            alpha1 = stats.norm.cdf(
                z0 + (z0 + z_alpha_low) / (1 - a_hat * (z0 + z_alpha_low))
            )
            alpha2 = stats.norm.cdf(
                z0 + (z0 + z_alpha_high) / (1 - a_hat * (z0 + z_alpha_high))
            )

            ci_lower = np.percentile(boot_stats, 100 * alpha1)
            ci_upper = np.percentile(boot_stats, 100 * alpha2)

        else:
            raise ValueError(f"Unknown method: {method}")

        return {
            "observed_statistic": observed_stat,
            "bootstrap_mean": boot_stats.mean(),
            "bootstrap_se": boot_se,
            "bootstrap_bias": boot_bias,
            "ci_lower": ci_lower,
            "ci_upper": ci_upper,
            "ci_level": ci_level,
            "method": method,
            "n_bootstrap": n_bootstrap,
            "bootstrap_distribution": boot_stats,
        }

    def betting_performance_analysis(
        self,
        bet_results: np.ndarray,
        stake: float = 100.0,
        n_bootstrap: int = 10000,
    ) -> dict:
        """
        Comprehensive bootstrap analysis of betting performance.

        Analyzes an array of bet returns (positive = profit,
        negative = loss) and produces confidence intervals for
        key metrics.

        Args:
            bet_results: Array of profit/loss for each bet.
                E.g., +91 for a winning -110 bet, -100 for a loss.
            stake: Standard stake size (for normalization).
            n_bootstrap: Number of bootstrap replicates.

        Returns:
            Dictionary with CI for ROI, win rate, Sharpe ratio, and
            other metrics.
        """
        n_bets = len(bet_results)

        # Define statistic functions
        def roi(x):
            return x.sum() / (len(x) * stake) * 100

        def win_rate(x):
            return np.mean(x > 0)

        def sharpe_ratio(x):
            return x.mean() / x.std() if x.std() > 0 else 0

        def max_drawdown(x):
            cumsum = np.cumsum(x)
            running_max = np.maximum.accumulate(cumsum)
            drawdowns = running_max - cumsum
            return drawdowns.max()

        # Bootstrap each metric
        metrics = {}
        for name, func in [
            ("roi_pct", roi),
            ("win_rate", win_rate),
            ("sharpe_ratio", sharpe_ratio),
            ("max_drawdown", max_drawdown),
        ]:
            metrics[name] = self.non_parametric_bootstrap(
                bet_results, func, n_bootstrap, method="bca"
            )

        # Bootstrap probability of a positive ROI (a resampling summary,
        # not a Bayesian posterior probability of skill)
        roi_dist = metrics["roi_pct"]["bootstrap_distribution"]
        prob_positive = np.mean(roi_dist > 0)
        metrics["prob_positive_roi"] = prob_positive

        # Probability of beating the vig (~52.4% win rate at -110)
        wr_dist = metrics["win_rate"]["bootstrap_distribution"]
        prob_beating_vig = np.mean(wr_dist > 0.524)
        metrics["prob_beating_vig"] = prob_beating_vig

        return metrics


# Worked Example: Evaluating a Bettor's Track Record
print("BOOTSTRAP ANALYSIS OF BETTING PERFORMANCE")
print("=" * 60)

# Simulate a bettor's results: 500 bets at -110 odds
rng = np.random.default_rng(42)
true_win_rate = 0.54  # Genuine edge
n_bets = 500
stake = 100

wins = rng.binomial(1, true_win_rate, size=n_bets)
# At -110: win +90.91, lose -100
bet_results = np.where(wins, 90.91, -100.0)

bootstrap = BettingBootstrap(seed=42)
analysis = bootstrap.betting_performance_analysis(
    bet_results, stake=stake, n_bootstrap=10000
)

print(f"\nSample size: {n_bets} bets")
print(f"\nROI:")
roi_result = analysis["roi_pct"]
print(f"  Point estimate: {roi_result['observed_statistic']:.2f}%")
print(f"  95% BCa CI:     ({roi_result['ci_lower']:.2f}%, "
      f"{roi_result['ci_upper']:.2f}%)")
print(f"  Bootstrap SE:   {roi_result['bootstrap_se']:.2f}%")

print(f"\nWin Rate:")
wr_result = analysis["win_rate"]
print(f"  Point estimate: {wr_result['observed_statistic']:.3f}")
print(f"  95% BCa CI:     ({wr_result['ci_lower']:.3f}, "
      f"{wr_result['ci_upper']:.3f})")

print(f"\nSharpe Ratio:")
sr_result = analysis["sharpe_ratio"]
print(f"  Point estimate: {sr_result['observed_statistic']:.4f}")
print(f"  95% BCa CI:     ({sr_result['ci_lower']:.4f}, "
      f"{sr_result['ci_upper']:.4f})")

print(f"\nMax Drawdown:")
md_result = analysis["max_drawdown"]
print(f"  Point estimate: ${md_result['observed_statistic']:.0f}")
print(f"  95% BCa CI:     (${md_result['ci_lower']:.0f}, "
      f"${md_result['ci_upper']:.0f})")

print(f"\nP(positive ROI):     {analysis['prob_positive_roi']:.3f}")
print(f"P(beating the vig):  {analysis['prob_beating_vig']:.3f}")

Key Insight: A bettor with a 54% win rate over 500 bets at -110 has a genuinely profitable record, but the bootstrap analysis reveals substantial uncertainty. The 95% confidence interval for the win rate might span from 50% to 58%, meaning we cannot rule out the possibility that the bettor is break-even. This is why serious bettors need thousands, not hundreds, of bets before drawing firm conclusions about skill. The bootstrap provides an honest assessment of how much we really know.
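
A rough power calculation makes the "thousands, not hundreds" point concrete. Under simplifying assumptions (flat -110 bets, independent outcomes, normal approximation), the sketch below estimates how many bets are needed before a true 54% win rate is reliably distinguishable from the 52.38% break-even rate.

import numpy as np
from scipy import stats

p0, p1 = 0.5238, 0.54      # break-even win rate vs. assumed true skill
alpha, power = 0.05, 0.80  # one-sided test at 5% with 80% power

z_a = stats.norm.ppf(1 - alpha)
z_b = stats.norm.ppf(power)
n = ((z_a * np.sqrt(p0 * (1 - p0)) + z_b * np.sqrt(p1 * (1 - p1)))
     / (p1 - p0)) ** 2
print(f"Bets required: {int(np.ceil(n)):,}")  # on the order of several thousand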


24.4 Permutation Tests for Sports Hypotheses

The Logic of Permutation Testing

A permutation test (also called a randomization test) evaluates a hypothesis by comparing an observed test statistic to the distribution of that statistic under random permutations of the data. The logic is elegant:

  1. Compute the test statistic $T_{\text{obs}}$ from the observed data.
  2. Under the null hypothesis, the labels (e.g., "home" vs. "away", "before model change" vs. "after model change") are exchangeable --- assigning them randomly should not affect the test statistic.
  3. Generate the permutation distribution by computing $T$ for many random rearrangements of the labels.
  4. The p-value is the proportion of permutation statistics that are at least as extreme as $T_{\text{obs}}$.

Permutation tests have several advantages over parametric tests:

  • No distributional assumptions: They work regardless of the underlying distribution.
  • Exact tests: Given enough permutations, the p-value is exact, not approximate.
  • Flexible test statistics: You can use any test statistic, not just those with known null distributions.

Applications in Sports Betting

Common questions where permutation tests excel:

  1. Is home-field advantage real? Permute the home/away labels and see if the observed home advantage falls in the tail of the permutation distribution.
  2. Did my model improve? Compare the prediction accuracy before and after a model change. Permute the "before"/"after" labels.
  3. Is a streaky player genuinely streaky? Permute the temporal order of outcomes and see if the observed streak length is extreme (a minimal sketch follows this list).
  4. Do conference games differ from non-conference games? Permute the conference/non-conference labels.
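
As a minimal sketch of the third question (toy data, with the longest winning streak as the test statistic), one can permute the temporal order of outcomes and ask whether the observed streak is extreme relative to random orderings of the same results.

import numpy as np

def longest_run(x: np.ndarray) -> int:
    """Length of the longest consecutive run of 1s."""
    best = current = 0
    for v in x:
        current = current + 1 if v == 1 else 0
        best = max(best, current)
    return best

rng = np.random.default_rng(3)
outcomes = rng.binomial(1, 0.55, size=82)  # hypothetical season of results

observed = longest_run(outcomes)
perm_runs = np.array([
    longest_run(rng.permutation(outcomes)) for _ in range(10_000)
])
p_value = np.mean(perm_runs >= observed)
print(f"Longest streak: {observed}, permutation p-value: {p_value:.3f}")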

Python Implementation

import numpy as np
import pandas as pd
from scipy import stats


def permutation_test(
    group_a: np.ndarray,
    group_b: np.ndarray,
    statistic: str = "mean_diff",
    n_permutations: int = 10000,
    alternative: str = "two-sided",
    seed: int = 42,
) -> dict:
    """
    Perform a permutation test comparing two groups.

    Args:
        group_a: Data for group A.
        group_b: Data for group B.
        statistic: Test statistic to use. Options:
            'mean_diff': Difference in means.
            'median_diff': Difference in medians.
            'ks': Kolmogorov-Smirnov statistic.
        n_permutations: Number of random permutations.
        alternative: 'two-sided', 'greater', or 'less'.
        seed: Random seed.

    Returns:
        Dictionary with observed statistic, p-value, permutation
        distribution, and effect size.
    """
    rng = np.random.default_rng(seed)
    combined = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    n_total = len(combined)

    # Compute test statistic
    def compute_stat(a, b):
        if statistic == "mean_diff":
            return a.mean() - b.mean()
        elif statistic == "median_diff":
            return np.median(a) - np.median(b)
        elif statistic == "ks":
            return stats.ks_2samp(a, b).statistic
        else:
            raise ValueError(f"Unknown statistic: {statistic}")

    observed = compute_stat(group_a, group_b)

    # Generate permutation distribution
    perm_stats = np.zeros(n_permutations)
    for i in range(n_permutations):
        perm = rng.permutation(combined)
        perm_a = perm[:n_a]
        perm_b = perm[n_a:]
        perm_stats[i] = compute_stat(perm_a, perm_b)

    # Compute p-value
    if alternative == "two-sided":
        p_value = np.mean(np.abs(perm_stats) >= np.abs(observed))
    elif alternative == "greater":
        p_value = np.mean(perm_stats >= observed)
    elif alternative == "less":
        p_value = np.mean(perm_stats <= observed)
    else:
        raise ValueError(f"Unknown alternative: {alternative}")

    # Effect size (Cohen's d)
    pooled_std = np.sqrt(
        (group_a.var() * (n_a - 1) + group_b.var() * (len(group_b) - 1))
        / (n_total - 2)
    )
    cohens_d = (group_a.mean() - group_b.mean()) / pooled_std if pooled_std > 0 else 0

    return {
        "observed_statistic": observed,
        "p_value": p_value,
        "n_permutations": n_permutations,
        "alternative": alternative,
        "permutation_distribution": perm_stats,
        "cohens_d": cohens_d,
        "significant_005": p_value < 0.05,
        "significant_001": p_value < 0.01,
    }


def test_home_advantage(
    home_margins: np.ndarray,
    n_permutations: int = 10000,
    seed: int = 42,
) -> dict:
    """
    Test whether home-field advantage is statistically significant
    using a permutation test.

    Under the null hypothesis, the home/away label does not matter,
    so each margin could equally well be positive or negative.
    This is equivalent to a sign-flipping test.

    Args:
        home_margins: Array of (home score - away score) for each game.
        n_permutations: Number of permutations.
        seed: Random seed.

    Returns:
        Dictionary with test results.
    """
    rng = np.random.default_rng(seed)
    observed_mean = home_margins.mean()

    # Sign-flipping permutation: under H0, each margin's sign is random
    perm_means = np.zeros(n_permutations)
    for i in range(n_permutations):
        signs = rng.choice([-1, 1], size=len(home_margins))
        perm_means[i] = (home_margins * signs).mean()

    p_value = np.mean(perm_means >= observed_mean)

    return {
        "observed_home_advantage": observed_mean,
        "p_value": p_value,
        "permutation_distribution": perm_means,
        "significant": p_value < 0.05,
        "n_games": len(home_margins),
        "perm_mean": perm_means.mean(),
        "perm_std": perm_means.std(),
    }


def test_model_improvement(
    errors_old: np.ndarray,
    errors_new: np.ndarray,
    metric: str = "mse",
    n_permutations: int = 10000,
    seed: int = 42,
) -> dict:
    """
    Test whether a new prediction model is significantly better
    than an old one using paired permutation test.

    This is a paired test: for each observation, we have predictions
    from both models. Under the null hypothesis, the two models are
    equally good, and any observed difference is due to chance.

    Args:
        errors_old: Prediction errors from the old model.
        errors_new: Prediction errors from the new model.
        metric: Error metric: 'mse' (squared errors) or 'mae'
            (absolute errors).
        n_permutations: Number of permutations.
        seed: Random seed.

    Returns:
        Dictionary with test results.
    """
    rng = np.random.default_rng(seed)

    if metric == "mse":
        old_metric = (errors_old ** 2).mean()
        new_metric = (errors_new ** 2).mean()
        loss_old = errors_old ** 2
        loss_new = errors_new ** 2
    elif metric == "mae":
        old_metric = np.abs(errors_old).mean()
        new_metric = np.abs(errors_new).mean()
        loss_old = np.abs(errors_old)
        loss_new = np.abs(errors_new)
    else:
        raise ValueError(f"Unknown metric: {metric}")

    # Paired differences in loss
    loss_diff = loss_old - loss_new  # Positive = new model is better
    observed_improvement = loss_diff.mean()

    # Under H0, the sign of each difference is random
    perm_improvements = np.zeros(n_permutations)
    for i in range(n_permutations):
        signs = rng.choice([-1, 1], size=len(loss_diff))
        perm_improvements[i] = (loss_diff * signs).mean()

    # One-sided test: is new model significantly better?
    p_value = np.mean(perm_improvements >= observed_improvement)

    return {
        "old_model_metric": old_metric,
        "new_model_metric": new_metric,
        "observed_improvement": observed_improvement,
        "pct_improvement": (
            (old_metric - new_metric) / old_metric * 100
            if old_metric > 0 else 0
        ),
        "p_value": p_value,
        "significant": p_value < 0.05,
        "n_observations": len(errors_old),
        "permutation_distribution": perm_improvements,
    }


# Worked Example 1: Testing Home-Field Advantage
print("PERMUTATION TEST: HOME-FIELD ADVANTAGE")
print("=" * 60)

rng = np.random.default_rng(42)

# Simulate NFL-like margins with true home advantage of ~2.5 points
n_games = 272  # Full NFL regular season (32 teams x 17 games / 2)
true_hfa = 2.5
home_margins = rng.normal(true_hfa, 14.0, size=n_games)

hfa_result = test_home_advantage(home_margins, n_permutations=10000)
print(f"Number of games: {hfa_result['n_games']}")
print(f"Observed home advantage: {hfa_result['observed_home_advantage']:.2f} points")
print(f"Permutation p-value: {hfa_result['p_value']:.4f}")
print(f"Significant at alpha=0.05: {hfa_result['significant']}")

# Worked Example 2: Testing Model Improvement
print("\nPERMUTATION TEST: MODEL IMPROVEMENT")
print("=" * 60)

# Simulate prediction errors from two models
n_games = 200
true_outcomes = rng.normal(0, 14, size=n_games)

# Old model: slightly worse predictions
old_predictions = true_outcomes + rng.normal(0, 4.5, size=n_games)
new_predictions = true_outcomes + rng.normal(0, 4.0, size=n_games)

errors_old = true_outcomes - old_predictions
errors_new = true_outcomes - new_predictions

model_result = test_model_improvement(
    errors_old, errors_new, metric="mse", n_permutations=10000
)
print(f"Old model MSE: {model_result['old_model_metric']:.2f}")
print(f"New model MSE: {model_result['new_model_metric']:.2f}")
print(f"Improvement:   {model_result['pct_improvement']:.1f}%")
print(f"P-value:       {model_result['p_value']:.4f}")
print(f"Significant:   {model_result['significant']}")

# Worked Example 3: Comparing Two Groups
print("\nPERMUTATION TEST: REST VS. BACK-TO-BACK")
print("=" * 60)

rest_performance = rng.normal(3.0, 12, size=150)
b2b_performance = rng.normal(0.5, 12, size=50)

b2b_result = permutation_test(
    rest_performance, b2b_performance,
    statistic="mean_diff",
    alternative="greater",
    n_permutations=10000,
)
print(f"Rest games mean:  {rest_performance.mean():.2f}")
print(f"B2B games mean:   {b2b_performance.mean():.2f}")
print(f"Observed diff:    {b2b_result['observed_statistic']:.2f}")
print(f"Cohen's d:        {b2b_result['cohens_d']:.3f}")
print(f"P-value:          {b2b_result['p_value']:.4f}")
print(f"Significant:      {b2b_result['significant_005']}")

Practical Tip: When testing whether your betting model has improved, always use a paired permutation test rather than an unpaired one. Each game has two predictions (one from each model), and the paired test controls for game-level difficulty. An unpaired test would be less powerful because it ignores the fact that some games are inherently harder to predict than others.


24.5 Variance Reduction Techniques

The Efficiency Problem

The standard Monte Carlo convergence rate of $O(1/\sqrt{N})$ means that reducing the standard error by a factor of 10 requires 100 times as many simulations. When each simulation is computationally expensive --- for example, simulating an entire season with complex player-level models --- this can be prohibitive.

Variance reduction techniques improve efficiency by producing more precise estimates for a given number of simulations, or equivalently, achieving the same precision with fewer simulations. The key insight is that we can modify how we sample or how we compute our estimate without changing the expected value, while reducing the variance.

The efficiency gain of a variance reduction technique is measured by the ratio:

$$\text{Efficiency} = \frac{\text{Var}(\hat{\mu}_{\text{naive}})}{\text{Var}(\hat{\mu}_{\text{reduced}})}$$

An efficiency gain of 10 means the variance-reduced estimator achieves the same precision with 10x fewer samples.

Antithetic Variates

The simplest variance reduction technique pairs each simulation with its "mirror image." If $U_1, U_2, \ldots, U_n$ are uniform random variables used to generate sample $X_1$, then $1 - U_1, 1 - U_2, \ldots, 1 - U_n$ generate the antithetic sample $X_1'$. The estimator is:

$$\hat{\mu}_{\text{AV}} = \frac{1}{2N} \sum_{i=1}^{N} [g(X_i) + g(X_i')]$$

This works because $X_i$ and $X_i'$ are negatively correlated: when one overestimates, the other tends to underestimate, and the average is closer to the truth.

For the antithetic estimator, the variance is:

$$\text{Var}(\hat{\mu}_{\text{AV}}) = \frac{1}{N} \cdot \frac{\text{Var}(g(X)) + \text{Cov}(g(X), g(X'))}{2}$$

When $\text{Cov}(g(X), g(X')) < 0$ (which holds when $g$ is a monotone function of the underlying uniforms), this beats plain Monte Carlo on an equal budget: $2N$ independent evaluations would give variance $\text{Var}(g(X))/(2N)$, which is larger whenever the covariance is negative.

Control Variates

If we have a variable $C$ whose expected value $E[C] = \mu_C$ is known, and $C$ is correlated with our quantity of interest $g(X)$, we can use the control variate estimator:

$$\hat{\mu}_{\text{CV}} = \hat{\mu}_N - \beta(\hat{C}_N - \mu_C)$$

where $\hat{C}_N$ is the sample mean of $C$ and $\beta$ is a coefficient chosen to minimize variance:

$$\beta^* = \frac{\text{Cov}(g(X), C)}{\text{Var}(C)}$$

The resulting variance is:

$$\text{Var}(\hat{\mu}_{\text{CV}}) = \text{Var}(\hat{\mu}_N)(1 - \rho^2_{g,C})$$

where $\rho_{g,C}$ is the correlation between $g(X)$ and $C$. If $|\rho| = 0.9$, the variance reduction is $1 - 0.81 = 0.19$, a 5x improvement.

In sports betting, natural control variates include:

  • The mean of simulated win probabilities (which should average to 0.5 in a zero-sum league)
  • The total number of simulated wins across all teams (which is fixed by the schedule)
  • Known analytical results for simplified versions of the problem

Importance Sampling

Importance sampling estimates $E_f[g(X)]$ by sampling from a different distribution $q$ and reweighting:

$$E_f[g(X)] = E_q\left[g(X) \frac{f(X)}{q(X)}\right] \approx \frac{1}{N} \sum_{i=1}^{N} g(X_i) \frac{f(X_i)}{q(X_i)}$$

where $X_i \sim q$. The ratio $w(X_i) = f(X_i)/q(X_i)$ is the importance weight.

This is particularly useful for estimating rare-event probabilities. If we want to estimate the probability that a 35-win team wins the NBA championship (a rare event), naive simulation might produce zero championships in 10,000 trials. Importance sampling can oversample favorable scenarios and reweight to get the correct probability.
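
A minimal, self-contained sketch of the idea uses a rare normal tail probability whose exact value is known; the same reweighting logic applies when oversampling upset-heavy tournament scenarios and weighting them back down.

import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
N = 100_000

# Naive Monte Carlo: only a handful of the N draws land in the tail
z = rng.normal(0, 1, size=N)
naive = np.mean(z > 4)

# Importance sampling: draw from the shifted proposal q = N(4, 1),
# then reweight each draw by f(x)/q(x)
x = rng.normal(4, 1, size=N)
weights = stats.norm.pdf(x, 0, 1) / stats.norm.pdf(x, 4, 1)
is_estimate = np.mean((x > 4) * weights)

exact = 1 - stats.norm.cdf(4)
print(f"Exact: {exact:.2e}   Naive: {naive:.2e}   IS: {is_estimate:.2e}")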

Stratified Sampling

Stratified sampling divides the sample space into non-overlapping strata $S_1, S_2, \ldots, S_K$ and draws a fixed number of samples from each stratum. If stratum $k$ has probability $p_k$ and we draw $n_k$ samples from it:

$$\hat{\mu}_{\text{strat}} = \sum_{k=1}^{K} p_k \hat{\mu}_k$$

where $\hat{\mu}_k$ is the within-stratum sample mean. The variance is:

$$\text{Var}(\hat{\mu}_{\text{strat}}) = \sum_{k=1}^{K} \frac{p_k^2 \sigma_k^2}{n_k}$$

which is always less than or equal to the unstratified variance when samples are allocated proportionally.
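
The following toy illustration stratifies the unit interval into equal sub-intervals; the same bookkeeping applies when the strata are scenario buckets such as "star player active" versus "star player out", provided the stratum probabilities are known.

import numpy as np

rng = np.random.default_rng(5)

def g(u):
    return u ** 2  # true mean is 1/3

N, K = 10_000, 10
n_k = N // K  # proportional allocation, p_k = 1/K for every stratum

# Plain Monte Carlo
plain = g(rng.random(N))

# Stratified: draw uniforms inside each sub-interval [k/K, (k+1)/K)
strata_means = [g((k + rng.random(n_k)) / K).mean() for k in range(K)]
stratified_estimate = float(np.mean(strata_means))

print(f"Plain MC:   {plain.mean():.5f} "
      f"(SE {plain.std(ddof=1) / np.sqrt(N):.5f})")
print(f"Stratified: {stratified_estimate:.5f}")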

Python Implementation

import numpy as np
import pandas as pd
from scipy import stats
import time


class VarianceReduction:
    """
    Implements variance reduction techniques for Monte Carlo
    simulation in sports betting contexts.
    """

    def __init__(self, seed: int = 42):
        self.rng = np.random.default_rng(seed)

    def naive_monte_carlo(
        self,
        simulation_func: callable,
        n_simulations: int = 10000,
        **kwargs,
    ) -> dict:
        """Standard Monte Carlo without variance reduction."""
        results = np.array([
            simulation_func(self.rng, **kwargs) for _ in range(n_simulations)
        ])
        return {
            "mean": results.mean(),
            "variance": results.var(ddof=1),
            "se": results.std(ddof=1) / np.sqrt(n_simulations),
            "n_simulations": n_simulations,
        }

    def antithetic_variates(
        self,
        simulation_func: callable,
        n_pairs: int = 5000,
        **kwargs,
    ) -> dict:
        """
        Monte Carlo with antithetic variates.

        Generates pairs of negatively correlated samples to reduce
        variance. The simulation function must accept a `uniforms`
        parameter --- an array of uniform random values used as the
        source of randomness.

        Args:
            simulation_func: Function taking (uniforms, **kwargs).
            n_pairs: Number of antithetic pairs (total simulations = 2 * n_pairs).
            **kwargs: Additional arguments.

        Returns:
            Dictionary with estimates and efficiency metrics.
        """
        results = np.zeros(n_pairs)
        antithetic_results = np.zeros(n_pairs)

        for i in range(n_pairs):
            # Generate uniform randoms
            u = self.rng.random(size=kwargs.get("n_uniforms", 100))
            # Original sample
            results[i] = simulation_func(u, **kwargs)
            # Antithetic sample
            antithetic_results[i] = simulation_func(1 - u, **kwargs)

        # Combined estimates
        paired_means = (results + antithetic_results) / 2
        n_total = 2 * n_pairs

        return {
            "mean": paired_means.mean(),
            "variance": paired_means.var(ddof=1),
            "se": paired_means.std(ddof=1) / np.sqrt(n_pairs),
            "n_simulations": n_total,
            "naive_variance": np.concatenate([results, antithetic_results]).var(ddof=1),
            "correlation": np.corrcoef(results, antithetic_results)[0, 1],
        }

    def control_variates(
        self,
        simulation_func: callable,
        control_func: callable,
        control_mean: float,
        n_simulations: int = 10000,
        **kwargs,
    ) -> dict:
        """
        Monte Carlo with control variates.

        Uses a correlated variable with known expectation to reduce
        variance.

        Args:
            simulation_func: Primary simulation function.
            control_func: Control variate function (with known mean).
            control_mean: Known expected value of the control variate.
            n_simulations: Number of simulations.
            **kwargs: Additional arguments.

        Returns:
            Dictionary with estimates and efficiency metrics.
        """
        primary = np.zeros(n_simulations)
        control = np.zeros(n_simulations)

        for i in range(n_simulations):
            state = self.rng.bit_generator.state
            primary[i] = simulation_func(self.rng, **kwargs)
            self.rng.bit_generator.state = state
            control[i] = control_func(self.rng, **kwargs)

        # Optimal beta
        cov_pc = np.cov(primary, control)[0, 1]
        var_c = control.var(ddof=1)
        beta = cov_pc / var_c if var_c > 0 else 0

        # Adjusted estimates
        adjusted = primary - beta * (control - control_mean)

        # Compute correlation for efficiency reporting
        correlation = np.corrcoef(primary, control)[0, 1]

        return {
            "mean": adjusted.mean(),
            "variance": adjusted.var(ddof=1),
            "se": adjusted.std(ddof=1) / np.sqrt(n_simulations),
            "naive_variance": primary.var(ddof=1),
            "beta": beta,
            "correlation": correlation,
            "efficiency_gain": (
                primary.var(ddof=1) / adjusted.var(ddof=1)
                if adjusted.var(ddof=1) > 0 else np.inf
            ),
            "n_simulations": n_simulations,
        }

    def stratified_sampling(
        self,
        simulation_func: callable,
        n_strata: int = 10,
        n_per_stratum: int = 1000,
        **kwargs,
    ) -> dict:
        """
        Monte Carlo with stratified sampling.

        Divides the primary uniform random variable into equal strata
        and samples uniformly within each stratum.

        Args:
            simulation_func: Function taking (uniform_value, **kwargs).
            n_strata: Number of strata.
            n_per_stratum: Samples per stratum.
            **kwargs: Additional arguments.

        Returns:
            Dictionary with estimates and efficiency metrics.
        """
        n_total = n_strata * n_per_stratum
        stratum_means = np.zeros(n_strata)
        stratum_vars = np.zeros(n_strata)

        for k in range(n_strata):
            lower = k / n_strata
            upper = (k + 1) / n_strata
            # Sample uniformly within stratum
            u_stratified = lower + (upper - lower) * self.rng.random(
                size=n_per_stratum
            )
            stratum_results = np.array([
                simulation_func(u, **kwargs) for u in u_stratified
            ])
            stratum_means[k] = stratum_results.mean()
            stratum_vars[k] = stratum_results.var(ddof=1)

        # Equal-probability strata: p_k = 1 / n_strata
        overall_mean = stratum_means.mean()
        # Var(mu_strat) = sum_k p_k^2 * sigma_k^2 / n_k. Only within-stratum
        # sampling noise contributes; the systematic spread of the stratum
        # means is exactly what stratification removes from the estimator.
        stratified_variance = (
            np.sum(stratum_vars / n_per_stratum) / n_strata**2
        )

        # Compare with naive (generate same total number of samples)
        naive_results = np.array([
            simulation_func(self.rng.random(), **kwargs)
            for _ in range(n_total)
        ])
        naive_variance = naive_results.var(ddof=1) / n_total

        return {
            "mean": overall_mean,
            "variance": stratified_variance,
            "se": np.sqrt(stratified_variance),
            "naive_variance": naive_variance,
            "efficiency_gain": (
                naive_variance / stratified_variance
                if stratified_variance > 0 else np.inf
            ),
            "n_simulations": n_total,
            "n_strata": n_strata,
        }


# Worked Example: Comparing Variance Reduction Techniques
print("VARIANCE REDUCTION TECHNIQUES COMPARISON")
print("=" * 60)

# Problem: Estimate the probability that a team with a 55% win rate
# finishes with >= 50 wins in an 82-game season

def simulate_season_wins_naive(rng, win_prob=0.55, n_games=82, target=50):
    """Returns 1 if team reaches target wins, 0 otherwise."""
    wins = rng.binomial(n_games, win_prob)
    return float(wins >= target)

def simulate_season_wins_av(uniforms, win_prob=0.55, n_games=82,
                            target=50, n_uniforms=82):
    """Antithetic version using pre-supplied uniforms."""
    wins = np.sum(uniforms[:n_games] < win_prob)
    return float(wins >= target)

# Analytical answer for comparison
from scipy.stats import binom
true_prob = 1 - binom.cdf(49, 82, 0.55)
print(f"True probability (analytical): {true_prob:.6f}")

vr = VarianceReduction(seed=42)

# Naive Monte Carlo
print(f"\nNaive Monte Carlo (N=10,000):")
naive = vr.naive_monte_carlo(
    simulate_season_wins_naive, n_simulations=10000,
    win_prob=0.55, n_games=82, target=50
)
print(f"  Estimate: {naive['mean']:.4f}")
print(f"  SE:       {naive['se']:.4f}")
print(f"  Variance: {naive['variance']:.6f}")

# Antithetic Variates
print(f"\nAntithetic Variates (N=10,000):")
av_result = vr.antithetic_variates(
    simulate_season_wins_av, n_pairs=5000,
    win_prob=0.55, n_games=82, target=50, n_uniforms=82
)
print(f"  Estimate:    {av_result['mean']:.4f}")
print(f"  SE:          {av_result['se']:.4f}")
print(f"  Correlation: {av_result['correlation']:.4f}")
print(f"  Variance ratio (naive/AV): "
      f"{av_result['naive_variance'] / av_result['variance']:.2f}x")

# Summary comparison
print(f"\nSUMMARY")
print("-" * 40)
print(f"{'Method':<25} {'Estimate':>10} {'SE':>10} {'Efficiency':>12}")
print("-" * 60)
print(f"{'Analytical':<25} {true_prob:>10.4f} {'---':>10} {'---':>12}")
print(f"{'Naive MC':<25} {naive['mean']:>10.4f} {naive['se']:>10.4f} {'1.00x':>12}")
print(f"{'Antithetic':<25} {av_result['mean']:>10.4f} {av_result['se']:>10.4f} "
      f"{av_result['naive_variance'] / max(av_result['variance'], 1e-10):>11.2f}x")

Practical Guidelines for Variance Reduction

Choosing the right variance reduction technique depends on the problem:

| Technique | Best When | Typical Gain | Complexity |
|---|---|---|---|
| Antithetic variates | Simulation output is monotone in the random inputs | 2-5x | Low |
| Control variates | A correlated quantity with known mean is available | 2-50x | Medium |
| Importance sampling | Estimating rare-event probabilities | 10-1000x | High |
| Stratified sampling | One-dimensional source of randomness dominates | 2-10x | Low-Medium |

For most sports betting applications, control variates offer the best combination of simplicity and efficiency. The total wins in a league (which must equal a fixed number) and the average simulated win probability (which should be 0.5) are natural control variates.

Common Pitfall: Importance sampling can dramatically reduce variance when done correctly, but it can also increase variance catastrophically when the proposal distribution $q$ is poorly chosen. If $q$ assigns too little probability to regions where $f \cdot g$ is large, the importance weights will occasionally be enormous, producing a high-variance estimator. Always check the effective sample size: $N_{\text{eff}} = (\sum w_i)^2 / \sum w_i^2$. If $N_{\text{eff}} \ll N$, your importance distribution is poor.


24.6 Chapter Summary

This chapter developed the Monte Carlo simulation toolkit for sports betting analysis:

Fundamentals (Section 24.1): We established the theoretical foundations --- the Law of Large Numbers guarantees convergence, the Central Limit Theorem provides confidence intervals, and the $O(1/\sqrt{N})$ convergence rate determines the computational cost of precision. We built a reusable MonteCarloSimulator class and demonstrated probability estimation for correlated parlays.

Season and Tournament Simulation (Section 24.2): We constructed full season simulators that translate team power ratings into distributions of wins, playoff probabilities, and seeding outcomes. We extended this to single-elimination tournament brackets, computing round-by-round advancement probabilities. These tools directly support futures betting and bracket pool strategy.

Bootstrap Methods (Section 24.3): We implemented non-parametric bootstrap analysis for betting performance evaluation, including the BCa method for bias-corrected confidence intervals. The bootstrap provides honest uncertainty quantification for ROI, win rate, and other performance metrics --- essential for distinguishing skill from luck.

Permutation Tests (Section 24.4): We developed distribution-free hypothesis tests for common sports questions: home-field advantage, model improvement, and group comparisons. Permutation tests avoid parametric assumptions and work with any test statistic, making them ideal for the irregular distributions common in sports data.

Variance Reduction (Section 24.5): We implemented four techniques --- antithetic variates, control variates, importance sampling, and stratified sampling --- that can improve simulation efficiency by orders of magnitude. Each technique exploits different structural properties of the problem.

Key Takeaways

  1. Monte Carlo simulation is the most versatile analytical tool available: any question that can be formulated probabilistically can be answered by simulation, given enough computing time.

  2. Always quantify simulation uncertainty. A point estimate from a simulation without a standard error or confidence interval is incomplete and potentially misleading.

  3. Bootstrap confidence intervals (especially BCa) provide honest uncertainty quantification for betting performance, revealing how much of your observed edge is genuinely supported by the data versus plausibly explained by luck.

  4. Permutation tests are the gold standard for hypothesis testing in sports because they make no distributional assumptions and work with any test statistic.

  5. Variance reduction techniques are not optional for serious simulation work. A 10x efficiency gain means you can run overnight what would otherwise take 10 nights.

What Comes Next

Chapter 25 moves from simulation to optimization. Where this chapter asked "what are the probabilities?", Chapter 25 asks "given these probabilities, what is the optimal action?" We will develop linear programming, portfolio optimization, arbitrage detection, and multi-objective optimization methods --- the tools that translate probabilistic insights into optimal betting decisions.


Chapter 24 Exercises

  1. Build a Monte Carlo simulator for a five-leg parlay where legs have individual win probabilities of (0.58, 0.55, 0.52, 0.60, 0.51) and pairwise correlations of 0.15 between all pairs. Compare the simulated probability of all legs hitting versus the naive product-of-probabilities calculation. How much does correlation change the probability?

  2. Simulate the remaining NBA season 10,000 times using current standings and power ratings (you may use any public source for ratings). Compute each team's probability of making the playoffs, winning the conference, and winning the championship. Compare your playoff probabilities against the current futures market and identify any discrepancies.

  3. Collect your personal betting history (or generate a synthetic one with 300 bets at -110 with a 53% win rate). Compute BCa bootstrap confidence intervals for your ROI, win rate, and Sharpe ratio. What is the probability that your true win rate exceeds the 52.4% breakeven point?

  4. Using permutation tests, evaluate whether NBA home-court advantage has changed significantly between the 2018-19 season (pre-COVID) and the 2023-24 season. Use the difference in mean home margins as your test statistic.

  5. Implement all four variance reduction techniques for the following problem: estimate the probability that a 10-win NFL team (true win probability 0.59) wins 13 or more games. Compare the efficiency gains and discuss which technique is most practical for this application.