Case Study: Bootstrap and Permutation Analysis of a Sports Betting Track Record
Executive Summary
Every sports bettor faces a fundamental question: is my edge real, or am I just lucky? With 500 or even 1,000 bets, the answer is rarely obvious from the raw numbers alone. A 55% win rate over 500 bets at -110 odds looks impressive, but the confidence interval around this estimate is wide enough to include both "genuinely skilled" and "slightly lucky but break-even" explanations. This case study applies the full toolkit of resampling methods (non-parametric bootstrap with BCa intervals, two-sample bootstrap comparison, and permutation tests) to rigorously evaluate a bettor's track record. We analyze a 750-bet history across NFL and NBA markets, quantify the uncertainty in every performance metric, and test whether the bettor's claimed model improvements are statistically significant. The analysis reveals both the power and the humility of honest statistical evaluation: even a profitable bettor may need years of data to prove their edge beyond reasonable doubt.
Background
The Problem of Small Samples in Betting
Sports betting is a domain where signal-to-noise ratios are exceptionally low. A skilled bettor at standard -110 odds might have a true win rate of 54%, which translates to roughly a 3.1% ROI at flat stakes. This is a genuine and meaningful edge, but it is also small enough to be obscured by variance over hundreds or even thousands of bets.
Consider the math: at a 54% true win rate, the standard deviation of the sample win rate over $n$ bets is $\sqrt{0.54 \times 0.46 / n}$. For $n = 500$, this is approximately 2.23 percentage points. A 95% confidence interval for the win rate spans from about 49.6% to 58.4% --- a range that includes values both above and below the 52.4% breakeven point. The bettor could be skilled (54% true) or unskilled (50% true) and still produce the observed result.
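These back-of-the-envelope numbers are easy to verify. The following quick check (a standalone sketch, separate from the case-study code below) reproduces the breakeven threshold, the ROI at a 54% win rate, and the confidence interval:

```python
import numpy as np
from scipy import stats

p, n = 0.54, 500
breakeven = 110 / 210                 # win rate needed to break even at -110
roi = p * (100 / 110) - (1 - p)       # expected profit per dollar staked
se = np.sqrt(p * (1 - p) / n)         # SE of the sample win rate
z = stats.norm.ppf(0.975)
print(f"breakeven: {breakeven:.3f}")  # 0.524
print(f"ROI: {roi:.3f}")              # 0.031
print(f"95% CI: ({p - z * se:.3f}, {p + z * se:.3f})")  # (0.496, 0.584)
```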
This uncertainty is not a flaw in the analysis; it is a fundamental property of the problem. The contribution of resampling methods is to quantify this uncertainty honestly, providing the bettor with a clear-eyed assessment of what the data do and do not support.
The Bettor's Track Record
Our subject is a bettor who has placed 750 bets over two NFL seasons and one NBA season, all against the spread at -110 odds. The bettor uses a quantitative model that they updated midway through the track record (after bet 400), and they claim the updated model is significantly better.
```python
import numpy as np
import pandas as pd
from scipy import stats
from typing import Callable


def generate_betting_history(seed: int = 42) -> pd.DataFrame:
    """
    Generate a realistic synthetic betting track record.

    Simulates a bettor with a modest but genuine edge who
    updates their model partway through the history.

    Args:
        seed: Random seed for reproducibility.

    Returns:
        DataFrame with bet-level results and metadata.
    """
    rng = np.random.default_rng(seed)
    records = []
    # Phase 1: original model (bets 1-400), true win rate 53.0%.
    for i in range(400):
        sport = rng.choice(["NFL", "NBA"], p=[0.4, 0.6])
        win = rng.random() < 0.530
        # Flat $100 stakes at -110: a win returns 100 * (100 / 110).
        profit = 90.91 if win else -100.0
        records.append({
            "bet_id": i + 1,
            "sport": sport,
            "phase": "original",
            "win": int(win),
            "profit": profit,
            "stake": 100.0,
        })
    # Phase 2: updated model (bets 401-750), true win rate 55.5%.
    for i in range(350):
        sport = rng.choice(["NFL", "NBA"], p=[0.4, 0.6])
        win = rng.random() < 0.555
        profit = 90.91 if win else -100.0
        records.append({
            "bet_id": 401 + i,
            "sport": sport,
            "phase": "updated",
            "win": int(win),
            "profit": profit,
            "stake": 100.0,
        })
    return pd.DataFrame(records)
```
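Before any resampling, it is worth eyeballing the raw record. A minimal per-phase summary sketch (the exact values it prints depend on the seed):

```python
df = generate_betting_history()
summary = df.groupby("phase").agg(
    n_bets=("bet_id", "count"),
    win_rate=("win", "mean"),
    total_profit=("profit", "sum"),
)
print(summary)
```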
Bootstrap Analysis of Overall Performance
Setting Up the Bootstrap
We compute BCa bootstrap confidence intervals for five key performance metrics: win rate, ROI, Sharpe ratio, maximum drawdown, and profit factor.
```python
class BettingBootstrapAnalyzer:
    """
    Comprehensive bootstrap analysis for betting track records.

    Provides BCa confidence intervals, bias estimation, and
    probability assessments for key performance metrics.
    """

    def __init__(self, seed: int = 42, n_bootstrap: int = 10000):
        """
        Args:
            seed: Random seed for reproducibility.
            n_bootstrap: Number of bootstrap replicates.
        """
        self.rng = np.random.default_rng(seed)
        self.n_bootstrap = n_bootstrap

    def _bca_interval(
        self,
        data: np.ndarray,
        stat_func: Callable,
        alpha: float = 0.05,
    ) -> dict:
        """
        Compute a BCa bootstrap confidence interval.

        Args:
            data: Original data array.
            stat_func: Function computing the statistic.
            alpha: Significance level.

        Returns:
            Dictionary with CI bounds and diagnostics.
        """
        n = len(data)
        observed = stat_func(data)
        boot_stats = np.array([
            stat_func(self.rng.choice(data, size=n, replace=True))
            for _ in range(self.n_bootstrap)
        ])
        # Bias correction: z0 measures median bias of the bootstrap distribution.
        prop_less = np.mean(boot_stats < observed)
        prop_less = np.clip(prop_less, 1e-10, 1 - 1e-10)
        z0 = stats.norm.ppf(prop_less)
        # Acceleration: jackknife estimate of the statistic's skewness.
        jack_stats = np.array([
            stat_func(np.delete(data, i)) for i in range(n)
        ])
        jack_mean = jack_stats.mean()
        num = np.sum((jack_mean - jack_stats) ** 3)
        den = 6.0 * np.sum((jack_mean - jack_stats) ** 2) ** 1.5
        a_hat = num / den if den != 0 else 0.0
        # Adjusted percentiles.
        z_lo = stats.norm.ppf(alpha / 2)
        z_hi = stats.norm.ppf(1 - alpha / 2)

        def adjusted_pct(z_alpha):
            z_adj = z0 + z_alpha
            denom = 1 - a_hat * z_adj
            return stats.norm.cdf(z0 + z_adj / denom) if denom != 0 else 0.5

        p_lo = adjusted_pct(z_lo)
        p_hi = adjusted_pct(z_hi)
        ci_lo = float(np.percentile(boot_stats, 100 * p_lo))
        ci_hi = float(np.percentile(boot_stats, 100 * p_hi))
        return {
            "observed": observed,
            "boot_mean": float(boot_stats.mean()),
            "boot_se": float(boot_stats.std(ddof=1)),
            "bias": float(boot_stats.mean() - observed),
            "ci_lower": ci_lo,
            "ci_upper": ci_hi,
            "z0": z0,
            "a_hat": a_hat,
            "boot_distribution": boot_stats,
        }

    def full_analysis(self, bet_profits: np.ndarray) -> dict:
        """
        Run a complete bootstrap analysis on betting profits.

        Args:
            bet_profits: Array of per-bet profit/loss values.

        Returns:
            Dictionary of metric names to BCa analysis results.
        """
        def win_rate(x):
            return np.mean(x > 0)

        def roi(x):
            # ROI in percent: total profit / total staked * 100 (flat $100 stakes).
            return np.sum(x) / (len(x) * 100) * 100

        def sharpe(x):
            return x.mean() / x.std(ddof=1) if x.std(ddof=1) > 0 else 0

        def max_drawdown(x):
            # Note: i.i.d. resampling scrambles bet order, so bootstrap
            # drawdowns treat the sequence as exchangeable.
            cumsum = np.cumsum(x)
            peak = np.maximum.accumulate(cumsum)
            return float((peak - cumsum).max())

        def profit_factor(x):
            gross_win = x[x > 0].sum()
            gross_loss = -x[x < 0].sum()
            return gross_win / gross_loss if gross_loss > 0 else np.inf

        metrics = {}
        for name, func in [
            ("win_rate", win_rate),
            ("roi_pct", roi),
            ("sharpe", sharpe),
            ("max_drawdown", max_drawdown),
            ("profit_factor", profit_factor),
        ]:
            metrics[name] = self._bca_interval(bet_profits, func)
        # Probability assessments (breakeven at -110 is 110/210, about 52.4%).
        wr_dist = metrics["win_rate"]["boot_distribution"]
        metrics["prob_above_breakeven"] = float(np.mean(wr_dist > 0.524))
        roi_dist = metrics["roi_pct"]["boot_distribution"]
        metrics["prob_positive_roi"] = float(np.mean(roi_dist > 0))
        return metrics
```
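Wiring the generator and the analyzer together takes only a few lines. A minimal driver sketch (the printed values depend on the seed and the number of replicates):

```python
df = generate_betting_history()
analyzer = BettingBootstrapAnalyzer()
results = analyzer.full_analysis(df["profit"].to_numpy())

for name in ("win_rate", "roi_pct", "sharpe", "max_drawdown", "profit_factor"):
    r = results[name]
    print(f"{name:>13}: {r['observed']:8.3f}  "
          f"BCa 95% CI [{r['ci_lower']:.3f}, {r['ci_upper']:.3f}]  "
          f"SE {r['boot_se']:.3f}")
print(f"P(win rate > breakeven): {results['prob_above_breakeven']:.2f}")
print(f"P(ROI > 0):              {results['prob_positive_roi']:.2f}")
```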
Two-Sample Bootstrap: Comparing the Two Model Phases
The bettor claims their updated model (Phase 2, bets 401-750) is better than the original (Phase 1, bets 1-400). The two phases are independent samples of different sizes, not paired observations of the same events, so a paired comparison is impossible. Instead, we use a two-sample bootstrap that resamples within each phase independently and computes the difference in statistics.
```python
def bootstrap_phase_comparison(
    phase1_profits: np.ndarray,
    phase2_profits: np.ndarray,
    n_bootstrap: int = 10000,
    seed: int = 42,
) -> dict:
    """
    Bootstrap comparison of two phases of betting.

    Args:
        phase1_profits: Profit/loss array for original model.
        phase2_profits: Profit/loss array for updated model.
        n_bootstrap: Number of bootstrap replicates.
        seed: Random seed.

    Returns:
        Dictionary with comparison results and CIs for differences.
    """
    rng = np.random.default_rng(seed)
    n1, n2 = len(phase1_profits), len(phase2_profits)

    def win_rate(x):
        return np.mean(x > 0)

    def roi(x):
        return np.sum(x) / (len(x) * 100) * 100

    boot_diffs_wr = np.zeros(n_bootstrap)
    boot_diffs_roi = np.zeros(n_bootstrap)
    for b in range(n_bootstrap):
        s1 = rng.choice(phase1_profits, size=n1, replace=True)
        s2 = rng.choice(phase2_profits, size=n2, replace=True)
        boot_diffs_wr[b] = win_rate(s2) - win_rate(s1)
        boot_diffs_roi[b] = roi(s2) - roi(s1)
    obs_wr_diff = win_rate(phase2_profits) - win_rate(phase1_profits)
    obs_roi_diff = roi(phase2_profits) - roi(phase1_profits)
    return {
        "observed_wr_diff": obs_wr_diff,
        "wr_diff_ci": (
            float(np.percentile(boot_diffs_wr, 2.5)),
            float(np.percentile(boot_diffs_wr, 97.5)),
        ),
        "prob_wr_improvement": float(np.mean(boot_diffs_wr > 0)),
        "observed_roi_diff": obs_roi_diff,
        "roi_diff_ci": (
            float(np.percentile(boot_diffs_roi, 2.5)),
            float(np.percentile(boot_diffs_roi, 97.5)),
        ),
        "prob_roi_improvement": float(np.mean(boot_diffs_roi > 0)),
    }
```
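Continuing from the driver sketch above, splitting the record by phase and running the comparison looks like this:

```python
p1 = df.loc[df["phase"] == "original", "profit"].to_numpy()
p2 = df.loc[df["phase"] == "updated", "profit"].to_numpy()
comp = bootstrap_phase_comparison(p1, p2)
print(f"win-rate diff: {comp['observed_wr_diff']:+.3f}  "
      f"95% CI {comp['wr_diff_ci']}")
print(f"ROI diff:      {comp['observed_roi_diff']:+.2f}  "
      f"95% CI {comp['roi_diff_ci']}")
print(f"P(improvement in win rate): {comp['prob_wr_improvement']:.2f}")
```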
Permutation Test: Is the Model Improvement Real?
While the bootstrap quantifies the magnitude of the difference between phases, the permutation test directly addresses the question: under the null hypothesis that both phases have the same underlying win rate, could the observed difference have arisen by chance?
```python
def permutation_test_phases(
    phase1_profits: np.ndarray,
    phase2_profits: np.ndarray,
    n_permutations: int = 10000,
    seed: int = 42,
) -> dict:
    """
    Permutation test for model improvement between phases.

    Under the null hypothesis, the phase labels are exchangeable:
    any bet could have come from either phase.

    Args:
        phase1_profits: Profit array for original model phase.
        phase2_profits: Profit array for updated model phase.
        n_permutations: Number of permutations.
        seed: Random seed.

    Returns:
        Dictionary with test results.
    """
    rng = np.random.default_rng(seed)
    combined = np.concatenate([phase1_profits, phase2_profits])
    n1 = len(phase1_profits)

    def win_rate(x):
        return np.mean(x > 0)

    observed_diff = win_rate(phase2_profits) - win_rate(phase1_profits)
    perm_diffs = np.zeros(n_permutations)
    for i in range(n_permutations):
        perm = rng.permutation(combined)
        perm_diffs[i] = win_rate(perm[n1:]) - win_rate(perm[:n1])
    p_value = float(np.mean(perm_diffs >= observed_diff))
    return {
        "observed_difference": observed_diff,
        "p_value_one_sided": p_value,
        "p_value_two_sided": float(
            np.mean(np.abs(perm_diffs) >= abs(observed_diff))
        ),
        "permutation_mean": float(perm_diffs.mean()),
        "permutation_std": float(perm_diffs.std()),
        "permutation_distribution": perm_diffs,
    }
```
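Reusing the p1 and p2 arrays from the comparison sketch, the test runs as:

```python
perm = permutation_test_phases(p1, p2)
print(f"observed win-rate difference: {perm['observed_difference']:+.4f}")
print(f"one-sided p-value:            {perm['p_value_one_sided']:.3f}")
print(f"two-sided p-value:            {perm['p_value_two_sided']:.3f}")
```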
Results and Interpretation
Overall Performance
Running the full bootstrap analysis on all 750 bets produces the following results:
| Metric | Point Estimate | BCa 95% CI | Bootstrap SE |
|---|---|---|---|
| Win Rate | 54.1% | (50.7%, 57.5%) | 1.82% |
| ROI | +3.3% | (-3.2%, +9.8%) | 3.47% |
| Sharpe Ratio | 0.035 | (-0.034, +0.104) | 0.037 |
| Max Drawdown | $2,150 | ($1,050, $3,620) | $660 |
| Profit Factor | 1.07 | (0.94, 1.23) | 0.08 |
Because every bet risks a flat $100 at -110 odds, positive ROI is exactly equivalent to a win rate above the 52.4% breakeven. The bootstrap probability of clearing this threshold, and hence of a positive true ROI, is approximately 80%.
Phase Comparison
The bootstrap comparison between the original model (Phase 1) and the updated model (Phase 2) shows:
- Win rate improvement: +2.5 percentage points (95% CI: -4.7 to +9.7 pp)
- ROI improvement: +4.8 percentage points (95% CI: -8.9 to +18.4 pp)
- Bootstrap probability that the difference favors the updated model: approximately 75%
Permutation Test
The permutation test for model improvement yields a one-sided p-value of approximately 0.25. An improvement of this size is well within what chance alone produces under the null hypothesis of equal win rates. The bettor cannot yet claim, with standard statistical rigor, that their model update was a genuine improvement.
Lessons Learned
1. Honest uncertainty quantification is humbling. A 54% win rate over 750 bets looks impressive, but the BCa confidence interval includes break-even. The bootstrap forces us to confront how much we do not know.
2. The BCa method matters for asymmetric metrics. For the Sharpe ratio and profit factor, the percentile and BCa intervals differ meaningfully because these statistics have skewed sampling distributions. The BCa interval provides more accurate coverage.
3. Phase comparisons require enormous samples. Detecting a 2.5 percentage point improvement in win rate takes roughly 3,000 bets per phase just for the expected difference to reach the 5% significance threshold (about 50% power), and over 6,000 bets per phase at the conventional 80% power (see the sketch after this list). With only 350-400 bets per phase, we simply do not have enough data for a definitive conclusion.
4. Multiple methods provide complementary insights. The bootstrap tells us about the magnitude and uncertainty of each metric. The permutation test provides a clean hypothesis test for the phase comparison. Together, they give a more complete picture than either alone.
5. The bettor should keep betting. Despite the statistical uncertainty, the evidence tilts toward genuine skill: roughly an 80% bootstrap probability that the true win rate clears breakeven, which for flat stakes is the same event as a positive true ROI. The optimal strategy is to continue betting (with appropriate bankroll management) while collecting more data to sharpen the inference.
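The sample-size claim in lesson 3 comes from the standard two-proportion sample-size formula. A back-of-the-envelope sketch (a normal-approximation estimate using the true phase win rates from the simulation setup, not an exact power analysis):

```python
from scipy import stats

p1, p2 = 0.53, 0.555                 # true phase win rates from the setup
delta = p2 - p1
p_bar = (p1 + p2) / 2
var2 = 2 * p_bar * (1 - p_bar)       # pooled variance term for the difference

z_alpha = stats.norm.ppf(0.975)      # two-sided 5% test
z_beta = stats.norm.ppf(0.80)        # 80% power

# n per phase at which the expected difference just reaches significance
# (roughly 50% power):
n_50 = z_alpha**2 * var2 / delta**2
# n per phase for the conventional 80% power:
n_80 = (z_alpha + z_beta)**2 * var2 / delta**2
print(f"n per phase (~50% power): {n_50:,.0f}")  # ~3,050
print(f"n per phase (80% power):  {n_80:,.0f}")  # ~6,200
```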
Your Turn: Extension Projects
- Block bootstrap for temporal dependence. If bet outcomes are correlated over time (e.g., due to seasonal patterns), the standard bootstrap may underestimate uncertainty. Implement a block bootstrap that preserves temporal structure and compare the CI widths.
- Sport-specific analysis. Split the track record by sport (NFL vs. NBA) and run separate bootstrap analyses. Is the bettor's edge concentrated in one sport?
- Rolling bootstrap. Compute bootstrap CIs for win rate using a rolling window of 100 bets. Plot the rolling CI over time to visualize how the bettor's performance evolves.
- Bootstrap sample size planning. Using the current bootstrap SE, estimate how many additional bets are needed to achieve a CI width of 3 percentage points for ROI.
- Bayesian bootstrap. Replace the standard (Efron) bootstrap with the Bayesian bootstrap, which assigns random Dirichlet weights to observations instead of resampling. Compare the two approaches and discuss when each is preferred.
Discussion Questions
- The bettor's P(positive ROI) is roughly 80%. At what threshold would you consider the evidence "beyond reasonable doubt" for genuine skill? Is 95% the right threshold, or should it be higher given the financial stakes?
- Why is the permutation test more conservative than the bootstrap comparison for evaluating model improvement?
- If the bettor cherry-picked their best 750-bet stretch from a longer history, how would this affect the bootstrap analysis? What safeguard could prevent this?
- How would you modify this analysis for a bettor who uses variable stake sizes (e.g., Kelly criterion) rather than flat betting?
- The analysis treats all bets as exchangeable (same market conditions, same edge). Under what circumstances would this assumption fail, and how would you adapt the bootstrap?