Case Study 1: Is This Bettor Skilled or Lucky? A Statistical Investigation

Introduction

Mike has been betting on NFL and NBA games for the past two and a half seasons. He tracks every bet meticulously in a spreadsheet and has compiled a record he's proud of: 428 wins out of 800 bets against the spread at standard -110 juice. That's a 53.5% win rate, comfortably above the 52.38% breakeven threshold required to turn a profit at -110.

Mike's friends are impressed. His bookie is starting to limit his action. A sports betting forum has taken notice. But the question remains: is Mike genuinely skilled, or could these results be the product of random chance?

This case study applies the full toolkit of hypothesis testing to answer that question rigorously.

The Data

Mike's complete betting record breaks down as follows:

Metric	Value
Total bets	800
Wins	428
Losses	372
Win rate	53.50%
Average odds	-110
Unit size	$100
Total wagered	$80,000
Total profit	+$3,527
ROI	4.41%

His record by sport and season:

Period	Sport	Bets	Wins	Win %
Season 1	NFL	95	52	54.7%
Season 1	NBA	160	87	54.4%
Season 2	NFL	100	54	54.0%
Season 2	NBA	175	92	52.6%
Season 3	NFL	80	44	55.0%
Season 3	NBA	190	99	52.1%

Step 1: Formulating the Hypotheses

We begin by clearly stating our hypotheses:

Null Hypothesis (H0): Mike has no predictive ability. His true win rate is 50% (equivalent to flipping a fair coin against the spread).

Alternative Hypothesis (H1): Mike has genuine predictive ability. His true win rate is greater than 50%.

We choose a one-sided test because we are specifically interested in whether Mike performs better than chance. If his win rate were significantly below 50%, that would also be interesting (suggesting consistently bad judgment), but that is not the claim being evaluated here.

Significance Level: We set alpha = 0.05, meaning we require less than a 5% probability of observing results this extreme under the null hypothesis before concluding Mike is skilled.

Step 2: The Z-Test for Proportions

The most straightforward approach is a z-test comparing Mike's observed win rate to the hypothesized rate.

Observed proportion: p_hat = 428 / 800 = 0.535

Null proportion: p0 = 0.50

Standard error under H0: SE = sqrt(p0 * (1 - p0) / n) = sqrt(0.50 * 0.50 / 800) = sqrt(0.0003125) = 0.01768

Z-statistic: z = (p_hat - p0) / SE = (0.535 - 0.50) / 0.01768 = 0.035 / 0.01768 = 1.980

One-sided p-value: P(Z > 1.980) = 0.0239

The p-value is 0.024, which is below our significance threshold of 0.05. Under the standard framework, we reject the null hypothesis.

But let us not stop here. A single test tells only part of the story.

Step 3: Confidence Intervals

A 95% confidence interval for Mike's true win rate gives us a range of plausible values:

Standard Error (using observed proportion): SE = sqrt(0.535 * 0.465 / 800) = 0.01764

95% CI: 0.535 +/- 1.96 * 0.01764 = (0.5004, 0.5696)

Interpretation: We are 95% confident Mike's true win rate falls between 50.04% and 56.96%. This interval barely excludes 50%, which is consistent with the borderline significance we found above.

Critical observation: The interval includes 52.38% (the -110 breakeven), so we cannot be 95% confident that Mike is actually profitable. He might be better than a coin flip but still losing money to the vig.

Let us also compute the Wilson score interval, which has better coverage properties:

Wilson 95% CI: (0.5010, 0.5686)

The Wilson interval is similar in this case because the sample size is large and the proportion is not near 0 or 1.

Step 4: Testing Profitability (Not Just Skill)

Testing against 50% tells us whether Mike is better than random. But the real question for a bettor is whether they are profitable.

New null hypothesis: H0: p = 0.5238 (the breakeven rate at -110)

Z-statistic: z = (0.535 - 0.5238) / sqrt(0.5238 * 0.4762 / 800) = 0.0112 / 0.01765 = 0.635

One-sided p-value: P(Z > 0.635) = 0.263

We cannot reject the null hypothesis that Mike's true win rate equals the breakeven rate. While he appears to be winning, we have no statistically significant evidence that he is truly profitable in the long run.

This is a crucial distinction that many bettors overlook.

Step 5: The Exact Binomial Test

For added rigor, we can use the exact binomial test rather than the normal approximation:

P(X >= 428 | n = 800, p = 0.50) = sum from k=428 to 800 of C(800, k) * 0.5^800

Computing this exactly (or using Python):

Exact one-sided p-value: 0.0243

This is very close to the z-test p-value of 0.024, confirming that the normal approximation is reasonable for this sample size.

Step 6: Subsample Consistency Analysis

A skilled bettor should show some consistency across different periods and sports. If Mike's edge is concentrated in one small subsample, that is more likely to be a fluke.

NFL overall: 150 wins / 275 bets = 54.5% z = (0.545 - 0.50) / sqrt(0.25/275) = 0.045 / 0.03015 = 1.493, p = 0.068

NBA overall: 278 wins / 525 bets = 52.95% z = (0.5295 - 0.50) / sqrt(0.25/525) = 0.0295 / 0.02182 = 1.352, p = 0.088

Neither sport is individually significant at the 5% level. The overall significance emerges only from the combined sample. This is not necessarily damning -- the edge might simply be small and require a large sample to detect -- but it weakens the case somewhat.

Season-by-season consistency: The win rates range from 52.1% to 55.0%, which is relatively consistent. There is no obvious pattern of deterioration, which would suggest the market adapting to an exploitable pattern.

Step 7: What Would Chance Look Like?

To build intuition, let us simulate what chance alone could produce. If we simulated 10,000 bettors, each placing 800 fair-coin-flip bets:

About 240 would have win rates at or above 53.5% (matching our one-sided p-value of 0.024)
About 50 would have win rates at or above 55%
About 5 would have win rates at or above 56.5%

Mike's results are unusual under the null hypothesis, but not extraordinarily so. Out of the thousands of active sports bettors, many would be expected to achieve records at least as good as his by pure chance.

Step 8: Python Implementation

"""
Case Study 1: Statistical Evaluation of a Bettor's Record
Is This Bettor Skilled or Lucky?
"""

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from typing import Tuple, Dict


def evaluate_bettor_record(
    wins: int,
    total_bets: int,
    null_proportion: float = 0.50,
    alpha: float = 0.05,
) -> Dict[str, float]:
    """
    Comprehensive hypothesis test evaluation of a bettor's record.

    Args:
        wins: Number of winning bets.
        total_bets: Total number of bets placed.
        null_proportion: The hypothesized win rate under H0.
        alpha: Significance level for the test.

    Returns:
        Dictionary containing test results and statistics.
    """
    p_hat = wins / total_bets
    se_null = np.sqrt(null_proportion * (1 - null_proportion) / total_bets)
    se_obs = np.sqrt(p_hat * (1 - p_hat) / total_bets)

    # Z-test
    z_stat = (p_hat - null_proportion) / se_null
    p_value_one_sided = 1 - stats.norm.cdf(z_stat)
    p_value_two_sided = 2 * (1 - stats.norm.cdf(abs(z_stat)))

    # Exact binomial test
    binom_p_value = 1 - stats.binom.cdf(wins - 1, total_bets, null_proportion)

    # Confidence intervals (Wald)
    ci_95_lower = p_hat - 1.96 * se_obs
    ci_95_upper = p_hat + 1.96 * se_obs

    # Wilson score interval
    z_crit = stats.norm.ppf(1 - alpha / 2)
    denominator = 1 + z_crit**2 / total_bets
    center = (p_hat + z_crit**2 / (2 * total_bets)) / denominator
    margin = (z_crit / denominator) * np.sqrt(
        p_hat * (1 - p_hat) / total_bets + z_crit**2 / (4 * total_bets**2)
    )
    wilson_lower = center - margin
    wilson_upper = center + margin

    return {
        "observed_win_rate": p_hat,
        "null_proportion": null_proportion,
        "standard_error": se_null,
        "z_statistic": z_stat,
        "p_value_one_sided": p_value_one_sided,
        "p_value_two_sided": p_value_two_sided,
        "binomial_p_value": binom_p_value,
        "ci_95_wald": (ci_95_lower, ci_95_upper),
        "ci_95_wilson": (wilson_lower, wilson_upper),
        "reject_h0": p_value_one_sided < alpha,
    }


def print_evaluation_report(results: Dict[str, float], label: str = "") -> None:
    """Print a formatted evaluation report."""
    print(f"\n{'=' * 60}")
    print(f"  HYPOTHESIS TEST EVALUATION {label}")
    print(f"{'=' * 60}")
    print(f"  Observed win rate:    {results['observed_win_rate']:.4f}")
    print(f"  Null hypothesis:      p = {results['null_proportion']:.4f}")
    print(f"  Standard error:       {results['standard_error']:.4f}")
    print(f"  Z-statistic:          {results['z_statistic']:.4f}")
    print(f"  P-value (one-sided):  {results['p_value_one_sided']:.4f}")
    print(f"  P-value (two-sided):  {results['p_value_two_sided']:.4f}")
    print(f"  Binomial p-value:     {results['binomial_p_value']:.4f}")
    ci_wald = results["ci_95_wald"]
    ci_wilson = results["ci_95_wilson"]
    print(f"  95% CI (Wald):        ({ci_wald[0]:.4f}, {ci_wald[1]:.4f})")
    print(f"  95% CI (Wilson):      ({ci_wilson[0]:.4f}, {ci_wilson[1]:.4f})")
    verdict = "REJECT H0" if results["reject_h0"] else "FAIL TO REJECT H0"
    print(f"  Decision:             {verdict}")
    print(f"{'=' * 60}\n")


def plot_sampling_distribution(
    wins: int,
    total_bets: int,
    null_proportion: float = 0.50,
) -> None:
    """
    Visualize the sampling distribution under H0, the observed value,
    and the rejection region.
    """
    p_hat = wins / total_bets
    se = np.sqrt(null_proportion * (1 - null_proportion) / total_bets)

    x = np.linspace(null_proportion - 4 * se, null_proportion + 4 * se, 500)
    y = stats.norm.pdf(x, null_proportion, se)

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(x, y, "b-", linewidth=2, label="Distribution under H0")

    # Shade rejection region (one-sided, right tail)
    critical_value = null_proportion + 1.645 * se
    x_reject = x[x >= critical_value]
    y_reject = stats.norm.pdf(x_reject, null_proportion, se)
    ax.fill_between(x_reject, y_reject, alpha=0.3, color="red",
                    label=f"Rejection region (alpha=0.05)")

    # Mark observed value
    ax.axvline(x=p_hat, color="green", linestyle="--", linewidth=2,
               label=f"Observed win rate = {p_hat:.3f}")

    # Mark null hypothesis
    ax.axvline(x=null_proportion, color="blue", linestyle=":", linewidth=1,
               label=f"H0: p = {null_proportion:.3f}")

    ax.set_xlabel("Win Rate", fontsize=12)
    ax.set_ylabel("Density", fontsize=12)
    ax.set_title("Sampling Distribution Under H0 with Observed Result",
                 fontsize=14)
    ax.legend(fontsize=10)
    plt.tight_layout()
    plt.savefig("sampling_distribution.png", dpi=150)
    plt.show()


def plot_confidence_interval_progression(
    cumulative_wins: list,
    cumulative_bets: list,
) -> None:
    """
    Plot how the confidence interval narrows as the sample size grows.
    """
    fig, ax = plt.subplots(figsize=(12, 6))

    win_rates = [w / n for w, n in zip(cumulative_wins, cumulative_bets)]
    ci_lowers = []
    ci_uppers = []

    for w, n in zip(cumulative_wins, cumulative_bets):
        p = w / n
        se = np.sqrt(p * (1 - p) / n)
        ci_lowers.append(p - 1.96 * se)
        ci_uppers.append(p + 1.96 * se)

    ax.plot(cumulative_bets, win_rates, "b-o", markersize=4, label="Win Rate")
    ax.fill_between(cumulative_bets, ci_lowers, ci_uppers, alpha=0.2,
                    color="blue", label="95% CI")
    ax.axhline(y=0.50, color="red", linestyle="--", label="H0: p = 0.50")
    ax.axhline(y=0.5238, color="orange", linestyle="--",
               label="Breakeven (52.38%)")

    ax.set_xlabel("Number of Bets", fontsize=12)
    ax.set_ylabel("Win Rate", fontsize=12)
    ax.set_title("Win Rate and 95% Confidence Interval Over Time", fontsize=14)
    ax.legend(fontsize=10)
    ax.set_ylim(0.40, 0.65)
    plt.tight_layout()
    plt.savefig("confidence_interval_progression.png", dpi=150)
    plt.show()


def simulate_null_distribution(
    total_bets: int = 800,
    n_simulations: int = 10_000,
    seed: int = 42,
) -> np.ndarray:
    """
    Simulate the distribution of win rates under the null hypothesis.
    """
    rng = np.random.default_rng(seed)
    simulated_wins = rng.binomial(total_bets, 0.50, n_simulations)
    simulated_rates = simulated_wins / total_bets
    return simulated_rates


def plot_null_simulation(
    observed_rate: float,
    total_bets: int = 800,
    n_simulations: int = 10_000,
) -> None:
    """
    Plot the simulated null distribution and compare with observed result.
    """
    simulated_rates = simulate_null_distribution(total_bets, n_simulations)

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.hist(simulated_rates, bins=50, density=True, alpha=0.7,
            color="steelblue", edgecolor="white",
            label="Simulated null distribution")
    ax.axvline(x=observed_rate, color="red", linestyle="--", linewidth=2,
               label=f"Observed: {observed_rate:.3f}")

    proportion_above = np.mean(simulated_rates >= observed_rate)
    ax.set_xlabel("Win Rate", fontsize=12)
    ax.set_ylabel("Density", fontsize=12)
    ax.set_title(
        f"Null Distribution (n={total_bets}): "
        f"{proportion_above:.1%} of simulations >= {observed_rate:.3f}",
        fontsize=13,
    )
    ax.legend(fontsize=10)
    plt.tight_layout()
    plt.savefig("null_simulation.png", dpi=150)
    plt.show()


# ─── Main Analysis ───────────────────────────────────────────────────────────

if __name__ == "__main__":

    # Mike's record
    WINS = 428
    TOTAL_BETS = 800
    BREAKEVEN_RATE = 0.5238

    # Test 1: Is Mike better than a coin flip?
    results_skill = evaluate_bettor_record(WINS, TOTAL_BETS, null_proportion=0.50)
    print_evaluation_report(results_skill, label="(Skill Test: p0 = 0.50)")

    # Test 2: Is Mike profitable (beating the vig)?
    results_profit = evaluate_bettor_record(
        WINS, TOTAL_BETS, null_proportion=BREAKEVEN_RATE
    )
    print_evaluation_report(results_profit, label="(Profitability Test: p0 = 0.5238)")

    # Subsample analysis
    subsamples = [
        ("NFL", 150, 275),
        ("NBA", 278, 525),
        ("Season 1", 139, 255),
        ("Season 2", 146, 275),
        ("Season 3", 143, 270),
    ]

    print("\n" + "=" * 60)
    print("  SUBSAMPLE ANALYSIS")
    print("=" * 60)
    for label, wins, bets in subsamples:
        p = wins / bets
        se = np.sqrt(0.25 / bets)
        z = (p - 0.50) / se
        pval = 1 - stats.norm.cdf(z)
        sig = "*" if pval < 0.05 else ""
        print(f"  {label:12s}: {wins}/{bets} = {p:.3f}  "
              f"z = {z:.3f}  p = {pval:.3f} {sig}")
    print("=" * 60)

    # Generate visualizations
    plot_sampling_distribution(WINS, TOTAL_BETS)
    plot_null_simulation(WINS / TOTAL_BETS, TOTAL_BETS)

    # Simulate progressive confidence intervals
    # (Using Mike's approximate running totals)
    cum_bets = [50, 100, 150, 200, 300, 400, 500, 600, 700, 800]
    cum_wins = [28, 55, 83, 110, 162, 216, 268, 322, 375, 428]
    plot_confidence_interval_progression(cum_wins, cum_bets)

Step 9: Considering the Base Rate

Here is where hypothesis testing alone can be misleading. We need to think about the base rate of skilled bettors.

Suppose that among all regular sports bettors who believe they have an edge: - 5% are truly skilled (have a genuine long-term edge) - 95% are unskilled (their true win rate is 50%)

If we test 1000 bettors at alpha = 0.05: - Of the 50 truly skilled bettors, perhaps 60% would test as significant (depending on their edge and our power). That gives us roughly 30 true positives. - Of the 950 unskilled bettors, 5% would test as significant by chance. That gives us roughly 48 false positives.

So of the 78 "significant" results, only 30 (38%) would represent genuinely skilled bettors. Mike's significant p-value does not make it probable that he is skilled -- it just means he is more likely to be skilled than a bettor with a non-significant result.

This is the base rate fallacy in action, and it underscores why a single hypothesis test should not be the final word.

Step 10: Overall Assessment

Let us weigh all the evidence:

Evidence FOR skill: - Win rate of 53.5% is statistically significant at the 5% level (z = 1.98, p = 0.024) against the null of 50% - Consistent performance across sports and seasons (no evidence of a single lucky streak) - Profitable in absolute terms (+$3,527)

Evidence AGAINST (or at least not conclusive): - The p-value (0.024) is significant but not overwhelming; it would not survive a more stringent threshold (alpha = 0.01, p = 0.024 > 0.01) - The confidence interval for win rate (50.04% to 56.96%) barely excludes 50% - Cannot reject the null that his true win rate equals the breakeven rate (p = 0.263) - No individual sport reaches significance on its own - Base rate considerations suggest many bettors could achieve this record by chance

Verdict: The evidence is suggestive but not conclusive. Mike's record is consistent with genuine skill, but it is also consistent with being a slightly lucky unskilled bettor. The most responsible assessment is:

"Mike's results are promising and mildly statistically significant. However, the evidence is not strong enough to confidently conclude genuine skill, particularly when considering the profitability threshold rather than the 50% threshold. Continued tracking with an additional 500-1000 bets would substantially clarify the picture. If his win rate remains above 53% over the next 500 bets, the combined evidence would become much more compelling."

Key Takeaways

Statistical significance is not binary wisdom. A p-value of 0.024 is different from 0.001, even though both are "significant" at alpha = 0.05.
Test the right null hypothesis. Beating 50% is not the same as beating the vig. The economically meaningful test is profitability, not just above-random performance.
Consider base rates. In a world with many bettors, even low false positive rates can generate many false positives in absolute terms.
Look at consistency. A bettor whose edge appears across multiple sports, seasons, and bet types is more credible than one whose results are concentrated in a small subsample.
More data resolves ambiguity. The single best recommendation for any bettor in Mike's position is to keep tracking and keep betting. Time and sample size are the ultimate arbiters.
Confidence intervals are more informative than p-values. The interval (50.04%, 56.96%) tells us far more than "p < 0.05" alone. It shows the range of plausible skill levels and highlights the remaining uncertainty.

End of Case Study 1