39 min read

> "The first principle is that you must not fool yourself --- and you are the easiest person to fool."

Learning Objectives

Formulate null and alternative hypotheses for common betting questions such as bettor skill, home field advantage, and model improvement
Calculate and correctly interpret p-values, avoiding the most common misinterpretations that plague both academic research and betting analysis
Determine the sample size required to detect a given betting edge with specified power and confidence, and understand why most bettors drastically underestimate this number
Apply chi-squared tests to categorical sports data including score distributions, key number analysis, and tests of independence between sporting variables
Recognize and correct for the multiple comparisons problem when evaluating many betting angles simultaneously, using Bonferroni and Benjamini-Hochberg procedures

In This Chapter

Chapter Overview
8.1 Framing Betting Questions as Hypotheses
8.2 P-Values and Their Interpretation
8.3 Sample Size Requirements for Betting Claims
8.4 Chi-Squared Tests for Categorical Sports Data
8.5 Multiple Testing Corrections and Data Snooping
8.6 Chapter Summary
What's Next: Chapter 9 (Regression Analysis)
Chapter 8 Exercises
Further Reading

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Chapter 8: Hypothesis Testing and Statistical Significance

"The first principle is that you must not fool yourself --- and you are the easiest person to fool." --- Richard Feynman, Cargo Cult Science (1974)

Chapter Overview

Every serious sports bettor eventually confronts the same uncomfortable question: Is my edge real, or am I just lucky? You have built a model that hits at 54% against the spread over 400 bets. Your friend claims he has a profitable NFL totals system based on weather data. A tout service advertises a 60% win rate over their "last 200 picks." In each case, the central problem is the same: how do you distinguish a genuine signal from the random noise that is guaranteed to produce apparent patterns in any sufficiently large dataset?

Hypothesis testing is the formal statistical framework for answering exactly this kind of question. Developed over the first half of the twentieth century by Ronald Fisher, Jerzy Neyman, and Egon Pearson, hypothesis testing provides a disciplined procedure for evaluating claims against the backdrop of random variation. It is, at its core, a method for quantifying how surprised you should be by an observed result if nothing interesting is actually happening.

For the sports bettor, hypothesis testing is not merely an academic exercise. It is the difference between deploying capital on a genuine edge and hemorrhaging money on an illusion. It is the tool that separates the disciplined analyst from the gambler who mistakes a hot streak for skill. And it is the framework that protects you from the most insidious trap in betting analysis: finding patterns in historical data that exist only because you looked hard enough.

This chapter will equip you with the mathematical machinery and, more importantly, the interpretive discipline to make these determinations rigorously. We begin with the formulation of hypotheses, proceed through p-value calculation and interpretation, tackle the critical question of sample size, apply chi-squared tests to categorical sports data, and conclude with what may be the single most important topic for the aspiring quantitative bettor: the multiple testing problem and data snooping.

In this chapter, you will learn to: - Translate betting questions into formal statistical hypotheses and select appropriate tests - Compute p-values from scratch and interpret them correctly in betting contexts - Calculate the number of bets required to confirm edges of various sizes - Apply chi-squared tests to real sports data questions - Protect yourself from false discoveries when evaluating multiple betting angles

8.1 Framing Betting Questions as Hypotheses

The Logic of Hypothesis Testing

Hypothesis testing works by a kind of indirect reasoning that may feel unnatural at first. Rather than directly measuring the probability that a bettor is skilled, we ask: If this bettor had no skill whatsoever, how likely would we be to observe results at least this good? If that probability is very small, we conclude that the "no skill" assumption is implausible and reject it in favor of the alternative --- that something real is going on.

This logic has two components:

The null hypothesis ($H_0$): A statement of "nothing interesting is happening." In betting contexts, this is typically: the bettor has no edge, the model adds no value, the factor has no effect on outcomes.
The alternative hypothesis ($H_a$ or $H_1$): A statement that something interesting is happening. The bettor does have skill, the model does improve predictions, the factor does affect outcomes.

The entire framework is built around the null hypothesis. We assume $H_0$ is true, compute the probability of observing data as extreme as what we actually observed, and then make a judgment about whether that probability is small enough to warrant rejecting $H_0$.

Common Betting Hypotheses

Let us translate several common betting questions into formal hypothesis tests.

Question 1: "Is this bettor skilled or lucky?"

Suppose a bettor has won 270 out of 500 against-the-spread bets (54.0% win rate). Standard ATS bets at -110 require approximately 52.4% accuracy to break even.

$$H_0: p = 0.50 \quad \text{(the bettor has no skill; wins are coin flips)}$$

$$H_a: p > 0.50 \quad \text{(the bettor has genuine predictive ability)}$$

Here, $p$ represents the bettor's true long-run win probability on ATS bets. Notice that we set $H_0$ at 0.50, not at 0.524. This is a deliberate choice: we are testing whether the bettor can pick winners at a rate better than chance (50%), which is a necessary condition for profitability. An alternative approach would set $H_0: p = 0.524$ to test directly whether the bettor is profitable after vig, but the 50% null is the more standard formulation and the more conservative test.

Question 2: "Does home field advantage exist in the NFL?"

Using historical data, we observe that the home team has won 57.1% of NFL games over a given period.

$$H_0: p_{\text{home}} = 0.50 \quad \text{(no home field advantage)}$$

$$H_a: p_{\text{home}} > 0.50 \quad \text{(home teams win more often than chance)}$$

Question 3: "Did the rule change affect scoring?"

The NFL moved the extra-point distance from the 2-yard line to the 15-yard line before the 2015 season. We want to test whether this changed the rate of successful conversions.

$$H_0: p_{\text{before}} = p_{\text{after}} \quad \text{(rule change had no effect)}$$

$$H_a: p_{\text{before}} \neq p_{\text{after}} \quad \text{(conversion rates differ)}$$

Question 4: "Does my new model feature improve predictions?"

You add a weather variable to your NFL totals model. The model's log-likelihood improves slightly.

$$H_0: \beta_{\text{weather}} = 0 \quad \text{(weather variable adds no predictive power)}$$

$$H_a: \beta_{\text{weather}} \neq 0 \quad \text{(weather variable is informatively related to totals)}$$

One-Tailed vs. Two-Tailed Tests

The choice between a one-tailed and two-tailed test depends on the specificity of your alternative hypothesis.

A one-tailed test is appropriate when you have a directional hypothesis --- you expect the effect to go in a specific direction. In Question 1 above, we test $H_a: p > 0.50$ because we are specifically interested in whether the bettor wins more than 50%. We do not care about the possibility that the bettor is significantly worse than a coin flip (though that would also be informative in a different way).

A two-tailed test is appropriate when the effect could plausibly go in either direction. In Question 3, the rule change could have either increased or decreased the conversion rate, so we test $H_a: p_{\text{before}} \neq p_{\text{after}}$.

The practical difference is that a one-tailed test is more powerful (more likely to detect a real effect) for a given sample size, but only in the specified direction. A two-tailed test splits its rejection region across both tails of the distribution, effectively requiring stronger evidence to reject $H_0$.

Rule of thumb for betting applications: Use one-tailed tests when evaluating bettor skill (you care about positive edge), model improvement (you care about better, not worse predictions), and directional market hypotheses. Use two-tailed tests when evaluating rule changes, comparing two populations, or any situation where a significant effect in either direction would be meaningful.

Python Code: Hypothesis Testing Framework

The following code establishes a reusable framework for setting up and executing common hypothesis tests in betting contexts.

import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import Optional


@dataclass
class HypothesisTest:
    """
    A framework for setting up and executing hypothesis tests
    in sports betting contexts.

    Attributes:
        name: A descriptive name for the test.
        null_value: The parameter value under H0.
        alternative: Direction of H_a ('greater', 'less', or 'two-sided').
        alpha: Significance level (default 0.05).
    """
    name: str
    null_value: float
    alternative: str = "greater"  # 'greater', 'less', or 'two-sided'
    alpha: float = 0.05

    def test_proportion(self, successes: int, trials: int) -> dict:
        """
        One-sample z-test for a proportion.

        Tests whether an observed proportion differs significantly
        from the null hypothesis value.

        Args:
            successes: Number of successes (e.g., winning bets).
            trials: Total number of trials (e.g., total bets).

        Returns:
            Dictionary with test statistic, p-value, and conclusion.
        """
        p_hat = successes / trials
        p_0 = self.null_value

        # Standard error under the null hypothesis
        se = np.sqrt(p_0 * (1 - p_0) / trials)

        # Z test statistic
        z = (p_hat - p_0) / se

        # P-value depends on alternative hypothesis direction
        if self.alternative == "greater":
            p_value = 1 - stats.norm.cdf(z)
        elif self.alternative == "less":
            p_value = stats.norm.cdf(z)
        else:  # two-sided
            p_value = 2 * (1 - stats.norm.cdf(abs(z)))

        reject = p_value < self.alpha

        return {
            "test_name": self.name,
            "observed_proportion": round(p_hat, 4),
            "null_value": p_0,
            "n": trials,
            "z_statistic": round(z, 4),
            "p_value": round(p_value, 6),
            "alpha": self.alpha,
            "reject_null": reject,
            "conclusion": (
                f"Reject H0 at alpha={self.alpha}. Evidence supports H_a."
                if reject else
                f"Fail to reject H0 at alpha={self.alpha}. "
                f"Insufficient evidence against the null."
            ),
        }

    def test_two_proportions(
        self, successes_1: int, trials_1: int,
        successes_2: int, trials_2: int
    ) -> dict:
        """
        Two-sample z-test for the difference between two proportions.

        Useful for comparing win rates across different conditions,
        time periods, or groups.

        Args:
            successes_1: Successes in group 1.
            trials_1: Trials in group 1.
            successes_2: Successes in group 2.
            trials_2: Trials in group 2.

        Returns:
            Dictionary with test statistic, p-value, and conclusion.
        """
        p1 = successes_1 / trials_1
        p2 = successes_2 / trials_2

        # Pooled proportion under H0
        p_pool = (successes_1 + successes_2) / (trials_1 + trials_2)

        se = np.sqrt(p_pool * (1 - p_pool) * (1/trials_1 + 1/trials_2))

        z = (p1 - p2) / se

        if self.alternative == "greater":
            p_value = 1 - stats.norm.cdf(z)
        elif self.alternative == "less":
            p_value = stats.norm.cdf(z)
        else:
            p_value = 2 * (1 - stats.norm.cdf(abs(z)))

        reject = p_value < self.alpha

        return {
            "test_name": self.name,
            "proportion_1": round(p1, 4),
            "proportion_2": round(p2, 4),
            "difference": round(p1 - p2, 4),
            "z_statistic": round(z, 4),
            "p_value": round(p_value, 6),
            "reject_null": reject,
        }


# --- Example usage ---

# Test 1: Is a bettor with 270/500 wins skilled?
bettor_test = HypothesisTest(
    name="Bettor Skill Test (ATS)",
    null_value=0.50,
    alternative="greater",
    alpha=0.05,
)
result = bettor_test.test_proportion(successes=270, trials=500)
print("=" * 60)
print(f"Test: {result['test_name']}")
print(f"Observed win rate: {result['observed_proportion']:.1%}")
print(f"Null hypothesis: p = {result['null_value']}")
print(f"Z-statistic: {result['z_statistic']}")
print(f"P-value: {result['p_value']:.6f}")
print(f"Decision: {result['conclusion']}")
print("=" * 60)

# Test 2: Home field advantage in NFL
# Suppose home teams won 2,856 out of 5,003 games
home_test = HypothesisTest(
    name="NFL Home Field Advantage",
    null_value=0.50,
    alternative="greater",
    alpha=0.05,
)
result_home = home_test.test_proportion(successes=2856, trials=5003)
print(f"\nTest: {result_home['test_name']}")
print(f"Observed home win rate: {result_home['observed_proportion']:.1%}")
print(f"Z-statistic: {result_home['z_statistic']}")
print(f"P-value: {result_home['p_value']:.6f}")
print(f"Decision: {result_home['conclusion']}")

# Test 3: Did a rule change affect conversion rates?
# Before: 1,230/1,260 successful PATs. After: 1,100/1,180.
rule_test = HypothesisTest(
    name="Extra Point Rule Change Effect",
    null_value=0.0,
    alternative="two-sided",
    alpha=0.05,
)
result_rule = rule_test.test_two_proportions(
    successes_1=1230, trials_1=1260,
    successes_2=1100, trials_2=1180,
)
print(f"\nTest: {result_rule['test_name']}")
print(f"Before rate: {result_rule['proportion_1']:.1%}")
print(f"After rate: {result_rule['proportion_2']:.1%}")
print(f"Z-statistic: {result_rule['z_statistic']}")
print(f"P-value: {result_rule['p_value']:.6f}")
print(f"Reject null: {result_rule['reject_null']}")

Running this code produces output that immediately contextualizes each betting question. The bettor with 270/500 wins generates a z-statistic of approximately 1.789, yielding a one-tailed p-value of about 0.037 --- enough to reject the null at the 5% level, but not at the 1% level. The NFL home field advantage test, with its much larger sample size, produces overwhelming evidence. The rule change test reveals a statistically significant difference in conversion rates.

Key Insight: Notice how the same fundamental procedure --- compute a test statistic, obtain a p-value, compare to a threshold --- applies to vastly different betting questions. The art lies not in the mechanics of the test but in the careful framing of the hypotheses and the honest interpretation of the results.

8.2 P-Values and Their Interpretation

What P-Values Actually Mean

The p-value is the single most important --- and most misunderstood --- quantity in all of statistical inference. Let us state its definition precisely:

The p-value is the probability of observing a test statistic at least as extreme as the one actually observed, assuming the null hypothesis is true.

Mathematically, for a one-tailed test with test statistic $Z$:

$$p = P(Z \geq z_{\text{obs}} \mid H_0 \text{ is true})$$

For a two-tailed test:

$$p = P(|Z| \geq |z_{\text{obs}}| \mid H_0 \text{ is true})$$

The p-value answers a very specific question: Given that there is no real effect, how unusual is what I observed? A small p-value means the observed result would be very unlikely under the null hypothesis, which we interpret as evidence against $H_0$.

What P-Values Do NOT Mean

The misinterpretation of p-values is so pervasive that the American Statistical Association issued an unprecedented formal statement on the topic in 2016. The following are the most common --- and most dangerous --- misinterpretations:

Misinterpretation 1: "The p-value is the probability that the null hypothesis is true."

This is flatly wrong. The p-value is $P(\text{data} \mid H_0)$, not $P(H_0 \mid \text{data})$. Confusing these two is the "prosecutor's fallacy" and it can lead to wildly incorrect conclusions. A p-value of 0.03 does not mean there is a 3% chance the bettor is unskilled. It means that if the bettor were unskilled, there would be a 3% chance of seeing results this good or better.

Misinterpretation 2: "A p-value of 0.05 means there is a 95% chance the effect is real."

This follows from the first misinterpretation. The p-value tells you nothing directly about the probability that your hypothesis is correct. The probability that an effect is real depends on the prior probability of the effect (how plausible was it before you looked at the data?) combined with the evidence, as formalized by Bayes' theorem.

Misinterpretation 3: "If p > 0.05, there is no effect."

Failure to reject the null hypothesis is not the same as accepting it. A non-significant result may simply reflect insufficient sample size rather than the absence of an effect. A bettor with a genuine 52% edge may easily produce a p-value above 0.05 in a sample of 200 bets.

Misinterpretation 4: "A smaller p-value means a larger effect."

The p-value is a function of both effect size and sample size. A tiny, practically meaningless edge can produce a highly significant p-value with enough data. An NFL home field advantage of 50.5% (barely distinguishable from a coin flip) would yield a p-value well below 0.001 if measured across 100,000 games.

Misinterpretation 5: "P-values from different studies are directly comparable."

A p-value of 0.01 from a study of 100 bets and a p-value of 0.01 from a study of 10,000 bets carry very different implications about effect size. The smaller study requires a much larger observed effect to achieve the same p-value.

Calculating P-Values from Test Statistics

The mechanics of p-value calculation follow a consistent pattern:

Step 1: Choose an appropriate test statistic based on the type of data and hypothesis.

Step 2: Compute the observed value of the test statistic from the data.

Step 3: Determine the sampling distribution of the test statistic under $H_0$.

Step 4: Calculate the probability of observing a test statistic at least as extreme as the observed value.

For proportions (the most common case in betting), the z-test statistic is:

$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$$

where $\hat{p}$ is the observed proportion, $p_0$ is the hypothesized proportion under $H_0$, and $n$ is the sample size. Under $H_0$, this statistic follows a standard normal distribution $N(0, 1)$ for large $n$.

For means with known variance:

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

For means with unknown variance (more realistic):

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where $s$ is the sample standard deviation. This statistic follows a $t$-distribution with $n - 1$ degrees of freedom.

Significance Levels and Their Arbitrary Nature

The significance level $\alpha$ is the threshold below which we reject $H_0$. The conventional choice is $\alpha = 0.05$ (5%), though $\alpha = 0.01$ (1%) and $\alpha = 0.10$ (10%) are also common. It is essential to understand that this threshold is entirely arbitrary --- a social convention, not a mathematical law.

The choice of $\alpha$ represents a trade-off:

$\alpha$ Level	Type I Error Rate	Interpretation	When to Use
0.10	10%	Lenient; more discoveries, more false positives	Exploratory analysis, preliminary screening
0.05	5%	Conventional; balance of sensitivity and specificity	Standard hypothesis tests, general reporting
0.01	1%	Stringent; fewer discoveries, fewer false positives	High-stakes decisions, expensive follow-up
0.005	0.5%	Very stringent; recently proposed as new standard	Replicability-focused research
0.001	0.1%	Extremely stringent	Extraordinary claims, physics-style discovery

For betting applications, the appropriate $\alpha$ depends on the cost of a false positive. If rejecting $H_0$ means you will deploy real capital on an apparent edge, you should demand stringent evidence --- perhaps $\alpha = 0.01$ or lower. If you are merely screening angles for further investigation, a more lenient threshold is acceptable.

The P-Value Controversy and Bayesian Alternatives

The frequentist p-value framework has come under increasing criticism, culminating in calls from some statisticians to abandon the concept entirely. The core objection is that p-values answer the wrong question: bettors want to know $P(\text{edge is real} \mid \text{data})$, but the p-value gives $P(\text{data} \mid \text{no edge})$.

The Bayesian alternative addresses this directly through Bayes' theorem:

$$P(H_1 \mid \text{data}) = \frac{P(\text{data} \mid H_1) \cdot P(H_1)}{P(\text{data})}$$

where $P(H_1)$ is the prior probability that the alternative hypothesis is true --- your belief before seeing the data that the bettor is skilled, the model works, or the angle is real.

This prior matters enormously. If a random person off the street claims a 54% ATS win rate over 500 bets, the prior probability that they have genuine skill is very low (most bettors lose), so even a p-value of 0.03 should not be particularly convincing. If a known professional bettor with a decade of verified results shows the same numbers, the prior is much higher.

We will not develop full Bayesian hypothesis testing here (that requires Chapter 10's treatment of Bayesian methods), but the key takeaway is this: a p-value should never be interpreted in isolation. It must be weighed against the prior plausibility of the claim.

Python Code: P-Value Calculation for Common Scenarios

import numpy as np
from scipy import stats


def bettor_skill_test(wins: int, total_bets: int, null_prob: float = 0.50,
                      alpha: float = 0.05) -> dict:
    """
    Test whether a bettor's win rate is significantly better than chance.

    Performs a one-sample z-test for proportions, reporting the p-value,
    confidence interval, and a plain-English interpretation.

    Args:
        wins: Number of winning bets.
        total_bets: Total number of bets placed.
        null_prob: Win probability under H0 (default 0.50).
        alpha: Significance level.

    Returns:
        Dictionary with comprehensive test results.
    """
    p_hat = wins / total_bets
    se_null = np.sqrt(null_prob * (1 - null_prob) / total_bets)
    se_obs = np.sqrt(p_hat * (1 - p_hat) / total_bets)

    z = (p_hat - null_prob) / se_null
    p_value_one_tail = 1 - stats.norm.cdf(z)
    p_value_two_tail = 2 * (1 - stats.norm.cdf(abs(z)))

    # Confidence interval for the true proportion
    z_crit = stats.norm.ppf(1 - alpha / 2)
    ci_lower = p_hat - z_crit * se_obs
    ci_upper = p_hat + z_crit * se_obs

    return {
        "win_rate": p_hat,
        "wins": wins,
        "total_bets": total_bets,
        "z_statistic": z,
        "p_value_one_tailed": p_value_one_tail,
        "p_value_two_tailed": p_value_two_tail,
        "confidence_interval": (ci_lower, ci_upper),
        "significant_one_tailed": p_value_one_tail < alpha,
        "significant_two_tailed": p_value_two_tail < alpha,
    }


def profit_significance_test(profits: list, alpha: float = 0.05) -> dict:
    """
    Test whether a bettor's average profit per bet is significantly
    different from zero using a one-sample t-test.

    This is often more appropriate than testing win rate because
    it accounts for varying odds and stake sizes.

    Args:
        profits: List of profit/loss values for each bet.
        alpha: Significance level.

    Returns:
        Dictionary with t-test results.
    """
    profits = np.array(profits)
    n = len(profits)
    mean_profit = np.mean(profits)
    se = stats.sem(profits)
    t_stat = mean_profit / se
    p_value = 1 - stats.t.cdf(t_stat, df=n - 1)  # One-tailed

    return {
        "n_bets": n,
        "mean_profit": mean_profit,
        "std_profit": np.std(profits, ddof=1),
        "t_statistic": t_stat,
        "p_value_one_tailed": p_value,
        "significant": p_value < alpha,
    }


# --- Worked Example: 54% win rate on 500 bets ---

print("=" * 65)
print("WORKED EXAMPLE: Testing a 54% Win Rate Over 500 ATS Bets")
print("=" * 65)

result = bettor_skill_test(wins=270, total_bets=500)

print(f"\nObserved record: {result['wins']}-{result['total_bets'] - result['wins']}"
      f" ({result['win_rate']:.1%})")
print(f"Null hypothesis: p = 0.50 (no skill, pure coin flip)")
print(f"Alternative: p > 0.50 (bettor has skill)")
print(f"\nZ-statistic: {result['z_statistic']:.4f}")
print(f"P-value (one-tailed): {result['p_value_one_tailed']:.6f}")
print(f"P-value (two-tailed): {result['p_value_two_tailed']:.6f}")
print(f"95% CI for true win rate: ({result['confidence_interval'][0]:.3f}, "
      f"{result['confidence_interval'][1]:.3f})")

print(f"\nAt alpha = 0.05 (one-tailed): "
      f"{'REJECT H0' if result['significant_one_tailed'] else 'FAIL TO REJECT H0'}")
print(f"At alpha = 0.01 (one-tailed): "
      f"{'REJECT H0' if result['p_value_one_tailed'] < 0.01 else 'FAIL TO REJECT H0'}")

print("\nInterpretation:")
print("The bettor's 54% win rate over 500 bets produces a z-statistic")
print("of approximately 1.79 and a one-tailed p-value of about 0.037.")
print("This means: IF the bettor had no skill at all, there would be")
print("only a 3.7% chance of seeing 270+ wins in 500 fair bets.")
print("\nThis is significant at the 5% level but NOT at the 1% level.")
print("A cautious analyst would want more data before committing capital.")

# --- Comparison across different sample sizes ---
print("\n" + "=" * 65)
print("HOW SAMPLE SIZE AFFECTS SIGNIFICANCE (all at 54% win rate)")
print("=" * 65)
print(f"{'Bets':>8} {'Wins':>8} {'Z-stat':>10} {'P-value':>12} {'Sig (5%)?':>12}")
print("-" * 55)

for n_bets in [100, 200, 500, 1000, 2000, 5000]:
    wins = int(0.54 * n_bets)
    r = bettor_skill_test(wins=wins, total_bets=n_bets)
    sig = "YES" if r['significant_one_tailed'] else "NO"
    print(f"{n_bets:>8} {wins:>8} {r['z_statistic']:>10.3f}"
          f" {r['p_value_one_tailed']:>12.6f} {sig:>12}")

This example illustrates a crucial point: the same 54% win rate is not significant at 100 bets (p = 0.212), borderline at 500 bets (p = 0.037), and overwhelmingly significant at 5,000 bets (p < 0.000001). The win rate has not changed --- only the evidence for it has accumulated.

Worked Example: Testing Whether a Bettor's 54% Win Rate on 500 Bets Is Significant

Let us walk through this example with full detail, as it represents the single most common hypothesis test a sports bettor will encounter.

Setup: A bettor places 500 against-the-spread bets at standard -110 odds and wins 270, for a 54.0% win rate.

Step 1: State the hypotheses.

$$H_0: p = 0.50 \quad \text{(no skill)}$$ $$H_a: p > 0.50 \quad \text{(genuine skill)}$$

Step 2: Compute the test statistic.

$$z = \frac{0.54 - 0.50}{\sqrt{\frac{0.50 \times 0.50}{500}}} = \frac{0.04}{0.02236} = 1.789$$

Step 3: Find the p-value.

Using the standard normal distribution:

$$p = P(Z \geq 1.789) = 1 - \Phi(1.789) = 0.0368$$

Step 4: Make a decision.

At $\alpha = 0.05$: Reject $H_0$. The result is statistically significant. At $\alpha = 0.01$: Fail to reject $H_0$. The result is not significant at this more stringent level.

Step 5: Compute a confidence interval.

A 95% confidence interval for the true win rate:

$$\hat{p} \pm z_{0.025} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.54 \pm 1.96 \times 0.02229 = (0.496, 0.584)$$

Step 6: Interpret practically.

The 95% confidence interval ranges from 49.6% to 58.4%. This means the true win rate could plausibly be below 50% --- barely. The evidence is suggestive but not overwhelming. A disciplined bettor would continue tracking results rather than aggressively scaling up position sizes based on this evidence alone.

Furthermore, note the profitability question. At -110 odds, the breakeven win rate is:

$$p_{\text{breakeven}} = \frac{110}{110 + 100} = 0.5238 = 52.38\%$$

If we test $H_0: p = 0.5238$ instead:

$$z = \frac{0.54 - 0.5238}{\sqrt{\frac{0.5238 \times 0.4762}{500}}} = \frac{0.0162}{0.02234} = 0.725$$

$$p = P(Z \geq 0.725) = 0.234$$

Against the profitability null, the evidence is far weaker. The bettor's results are fully consistent with a true win rate at or near breakeven. This underscores a critical point: statistical significance relative to 50% does not imply profitability, and profitability is much harder to establish statistically than mere predictive ability.

Common Pitfall: Many bettors celebrate when their win rate is "significantly above 50%" without realizing that the relevant question is whether it is significantly above the breakeven rate. The vig makes this a much harder bar to clear.

8.3 Sample Size Requirements for Betting Claims

The Bettor's Most Painful Question

"How long until I know if I'm good?" This is perhaps the most frequently asked --- and most frequently underestimated --- question in sports betting. The answer is almost always: much longer than you think.

The reason is fundamental: sports betting edges are small, and small effects require large samples to detect. A card counter in blackjack might have a 1-2% edge. A sharp sports bettor might have a 2-5% edge over closing lines. Detecting an effect of this size against the background noise of random variation requires hundreds or thousands of observations.

Power Analysis for Betting Contexts

Statistical power is the probability of correctly rejecting $H_0$ when the alternative hypothesis is true --- that is, the probability of detecting a real edge when one exists.

$$\text{Power} = 1 - \beta = P(\text{Reject } H_0 \mid H_a \text{ is true})$$

where $\beta$ is the Type II error rate (failing to detect a real effect).

The four quantities in a power analysis are interrelated:

Effect size ($\delta$): The magnitude of the true effect. In betting, this is the difference between the true win rate and the null hypothesis rate.
Sample size ($n$): The number of bets.
Significance level ($\alpha$): The Type I error threshold.
Power ($1 - \beta$): The desired probability of detecting the effect.

Given any three of these, you can solve for the fourth.

Type I and Type II Errors in Betting

	$H_0$ True (No Edge)	$H_0$ False (Edge Exists)
Reject $H_0$	Type I Error ($\alpha$): False positive --- you think you have an edge but you don't	Correct: You detect a real edge
Fail to Reject $H_0$	Correct: You correctly conclude no edge	Type II Error ($\beta$): False negative --- you miss a real edge

In betting, the costs of these errors are asymmetric:

Type I Error (false positive): You deploy capital on a non-existent edge. You lose money through the vig on bets that are no better than coin flips. Cost: real financial loss plus opportunity cost.
Type II Error (false negative): You fail to detect a genuine edge. You miss potential profits but do not lose money. Cost: opportunity cost only.

This asymmetry suggests that bettors should generally demand high confidence (low $\alpha$) before deploying capital, even at the cost of reduced power. It is better to miss some real edges than to bet aggressively on illusory ones.

Sample Size Formula

For a one-sided z-test of a proportion, the required sample size to detect a true proportion $p_1$ against a null proportion $p_0$ with significance level $\alpha$ and power $1 - \beta$ is:

$$n = \left(\frac{z_{\alpha}\sqrt{p_0(1-p_0)} + z_{\beta}\sqrt{p_1(1-p_1)}}{p_1 - p_0}\right)^2$$

where $z_{\alpha}$ is the critical value for significance level $\alpha$ (e.g., 1.645 for $\alpha = 0.05$ one-tailed) and $z_{\beta}$ is the critical value for power $1 - \beta$ (e.g., 0.842 for 80% power).

For a simplified approximation when $p_0 = 0.50$:

$$n \approx \left(\frac{z_{\alpha} + z_{\beta}}{2(p_1 - 0.50)}\right)^2$$

How Many Bets to Confirm a 2% Edge?

Let us work through the most practically relevant calculation. Suppose a bettor has a true win rate of 52% (a 2% edge over the 50% null). How many bets are needed to detect this at $\alpha = 0.05$ with 80% power?

$$n = \left(\frac{1.645\sqrt{0.50 \times 0.50} + 0.842\sqrt{0.52 \times 0.48}}{0.52 - 0.50}\right)^2$$

$$n = \left(\frac{1.645 \times 0.500 + 0.842 \times 0.4996}{0.02}\right)^2$$

$$n = \left(\frac{0.8225 + 0.4207}{0.02}\right)^2 = \left(\frac{1.2432}{0.02}\right)^2 = (62.16)^2 \approx 3864$$

Nearly 4,000 bets are required to detect a 2% edge with 80% power at the 5% significance level. If you place 5 bets per day, this is over two years of betting. At 2 bets per day, it is over five years.

This result is sobering, and it explains why most bettors never truly know whether they are skilled. The edges in sports betting are small enough that the sample sizes required for statistical confirmation exceed the patience --- and often the bankroll --- of most participants.

Comprehensive Sample Size Table

The following table shows the required number of bets for various edge sizes, significance levels, and power levels:

True Win Rate	Edge Over 50%	$\alpha = 0.05$, Power = 80%	$\alpha = 0.05$, Power = 90%	$\alpha = 0.01$, Power = 80%	$\alpha = 0.01$, Power = 90%
51%	1%	15,366	20,510	22,548	28,726
52%	2%	3,842	5,128	5,637	7,182
53%	3%	1,708	2,280	2,506	3,192
54%	4%	961	1,283	1,410	1,796
55%	5%	615	821	903	1,150
56%	6%	427	570	627	799
57%	7%	314	419	461	587
58%	8%	241	321	353	450
60%	10%	154	206	226	288

Study this table carefully. A bettor with a 52% true win rate --- which would be considered a strong edge in most markets --- needs nearly 4,000 bets to achieve statistical significance at standard thresholds. At 1%, the threshold used for merely marginal edges, the required sample exceeds 15,000 bets. These numbers explain why serious quantitative bettors treat all short-term results with deep skepticism.

Python Code: Sample Size Calculator for Betting Edge Detection

import numpy as np
from scipy import stats
import warnings


def required_sample_size(
    true_win_rate: float,
    null_win_rate: float = 0.50,
    alpha: float = 0.05,
    power: float = 0.80,
    alternative: str = "greater",
) -> int:
    """
    Calculate the number of bets required to detect a given edge.

    Uses the exact formula for the sample size of a one-sample
    z-test for proportions.

    Args:
        true_win_rate: The bettor's assumed true win probability.
        null_win_rate: The win probability under H0 (default 0.50).
        alpha: Significance level (Type I error rate).
        power: Desired power (1 - Type II error rate).
        alternative: 'greater' for one-tailed, 'two-sided' for two-tailed.

    Returns:
        Required sample size (number of bets), rounded up.
    """
    if true_win_rate <= null_win_rate:
        raise ValueError("True win rate must exceed null win rate for "
                         "a 'greater' alternative test.")

    beta = 1 - power

    if alternative == "greater":
        z_alpha = stats.norm.ppf(1 - alpha)
    else:
        z_alpha = stats.norm.ppf(1 - alpha / 2)

    z_beta = stats.norm.ppf(power)

    numerator = (z_alpha * np.sqrt(null_win_rate * (1 - null_win_rate))
                 + z_beta * np.sqrt(true_win_rate * (1 - true_win_rate)))
    denominator = true_win_rate - null_win_rate

    n = (numerator / denominator) ** 2
    return int(np.ceil(n))


def time_to_significance(
    true_win_rate: float,
    bets_per_day: float,
    null_win_rate: float = 0.50,
    alpha: float = 0.05,
    power: float = 0.80,
) -> dict:
    """
    Calculate how long it takes to statistically confirm a betting edge.

    Translates abstract sample size requirements into practical
    timelines that bettors can plan around.

    Args:
        true_win_rate: Assumed true win probability.
        bets_per_day: Average number of bets placed per day.
        null_win_rate: Win probability under H0.
        alpha: Significance level.
        power: Desired power.

    Returns:
        Dictionary with sample size and time estimates.
    """
    n = required_sample_size(true_win_rate, null_win_rate, alpha, power)
    days = n / bets_per_day
    weeks = days / 7
    months = days / 30.44
    years = days / 365.25

    return {
        "required_bets": n,
        "bets_per_day": bets_per_day,
        "days": round(days, 1),
        "weeks": round(weeks, 1),
        "months": round(months, 1),
        "years": round(years, 2),
    }


# --- Generate the comprehensive sample size table ---
print("REQUIRED SAMPLE SIZES FOR DETECTING BETTING EDGES")
print("=" * 75)
print(f"{'Win Rate':>10} {'Edge':>6} {'a=.05':>10} {'a=.05':>10} "
      f"{'a=.01':>10} {'a=.01':>10}")
print(f"{'':>10} {'':>6} {'Pwr=80%':>10} {'Pwr=90%':>10} "
      f"{'Pwr=80%':>10} {'Pwr=90%':>10}")
print("-" * 75)

for wr in [0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.60]:
    edge = wr - 0.50
    n1 = required_sample_size(wr, alpha=0.05, power=0.80)
    n2 = required_sample_size(wr, alpha=0.05, power=0.90)
    n3 = required_sample_size(wr, alpha=0.01, power=0.80)
    n4 = required_sample_size(wr, alpha=0.01, power=0.90)
    print(f"{wr:>10.0%} {edge:>6.0%} {n1:>10,} {n2:>10,} {n3:>10,} {n4:>10,}")

# --- Practical timeline for a 53% bettor ---
print("\n" + "=" * 75)
print("TIME TO SIGNIFICANCE: 53% True Win Rate, alpha=0.05, power=80%")
print("=" * 75)

for bpd in [1, 2, 3, 5, 10]:
    timeline = time_to_significance(0.53, bpd)
    print(f"  At {bpd} bet(s)/day: {timeline['required_bets']:,} bets "
          f"= {timeline['months']:.1f} months ({timeline['years']:.1f} years)")

# --- Visualize the relationship between edge size and required sample ---
print("\n" + "=" * 75)
print("EDGE SIZE vs. REQUIRED BETS (alpha=0.05, power=80%)")
print("=" * 75)

edges = np.arange(0.01, 0.11, 0.005)
for edge in edges:
    wr = 0.50 + edge
    n = required_sample_size(wr)
    bar = "#" * max(1, int(np.log10(n) * 10))
    print(f"  Edge = {edge:>5.1%}  |  n = {n:>6,}  {bar}")

Key Insight: The relationship between edge size and required sample size is approximately inverse-square. Halving the edge quadruples the required sample. This mathematical fact is the single most important reason why sports betting is so difficult to evaluate rigorously: the edges are small, so the samples must be enormous.

The Implications for Betting Practice

These sample size requirements have profound practical implications:

Most bettors will never achieve statistical significance. A recreational bettor placing 2 bets per day with a genuine 2% edge needs over 5 years of data to confirm that edge. Most bettors do not maintain consistent records for this long, let alone maintain a consistent strategy.
Tout services cannot prove their claims. A service advertising a 55% ATS record needs at least 615 bets to achieve significance at the 5% level. Many tout services cherry-pick time periods, sports, or bet types to construct favorable-looking samples that are far too small for any statistical conclusion.
Backtesting results must be interpreted with extreme caution. A backtested system that shows 54% over 300 historical bets is nowhere near statistical significance. It is more likely to represent noise, overfitting, or data snooping than a genuine edge.
Bankroll management must account for uncertainty. Even if you have a genuine edge, the statistical uncertainty in your estimated win rate means your true edge could be substantially smaller than your point estimate. Bankroll strategies should be based on conservative estimates of edge, not optimistic ones.

8.4 Chi-Squared Tests for Categorical Sports Data

When Proportions Are Not Enough

The z-tests we have used so far are designed for binary outcomes: win or lose, over or under, home or away. But many interesting sports betting questions involve categorical data with more than two categories. For these situations, we turn to the chi-squared ($\chi^2$) test.

The chi-squared test comes in three main flavors:

Goodness of fit: Does observed data match an expected distribution?
Test of independence: Are two categorical variables independent?
Test of homogeneity: Do different populations have the same distribution of a categorical variable?

The Chi-Squared Statistic

The chi-squared test statistic is:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed count in category $i$, $E_i$ is the expected count under $H_0$, and $k$ is the number of categories. Under $H_0$, this statistic follows a chi-squared distribution with degrees of freedom that depend on the specific test:

Goodness of fit: $df = k - 1$ (where $k$ is the number of categories)
Independence: $df = (r - 1)(c - 1)$ (where $r$ and $c$ are the numbers of rows and columns)

Large values of $\chi^2$ indicate poor fit between observed and expected data, leading to rejection of $H_0$.

Goodness of Fit: Are NFL Scores Uniformly Distributed Across Quarters?

A natural question for totals bettors: is scoring in NFL games evenly distributed across the four quarters? If not, the pattern of scoring could inform live betting strategies or quarter-specific props.

Suppose we observe the following distribution of points scored across 256 regular-season games in a given NFL season:

Quarter	Points Scored	Percentage
Q1	2,847	21.2%
Q2	3,652	27.2%
Q3	2,683	20.0%
Q4	4,244	31.6%
Total	13,426	100%

If scoring were uniformly distributed, we would expect 25% in each quarter ($E_i = 13,426 / 4 = 3,356.5$).

$$\chi^2 = \frac{(2847 - 3356.5)^2}{3356.5} + \frac{(3652 - 3356.5)^2}{3356.5} + \frac{(2683 - 3356.5)^2}{3356.5} + \frac{(4244 - 3356.5)^2}{3356.5}$$

$$\chi^2 = \frac{(-509.5)^2}{3356.5} + \frac{(295.5)^2}{3356.5} + \frac{(-673.5)^2}{3356.5} + \frac{(887.5)^2}{3356.5}$$

$$\chi^2 = 77.35 + 26.02 + 135.15 + 234.64 = 473.16$$

With $df = 4 - 1 = 3$ degrees of freedom, the critical value at $\alpha = 0.05$ is 7.815. Our observed $\chi^2 = 473.16$ vastly exceeds this, giving a p-value that is essentially zero. NFL scoring is emphatically not uniform across quarters.

The practical implication: fourth quarters produce significantly more scoring than other quarters (driven by two-minute drills, garbage time touchdowns, and strategic changes), while third quarters tend to be lowest-scoring. This pattern is relevant to quarter-specific prop bets and live betting strategies.

Another common question: does team performance depend on the day of the week? This matters for bettors who specialize in specific scheduling situations.

Suppose we observe the following results for a particular NFL team over several seasons:

	Win	Loss	Total
Sunday (1pm)	45	35	80
Sunday (4pm)	22	18	40
Sunday Night	8	12	20
Monday Night	6	9	15
Thursday Night	4	8	12
Total	85	82	167

$H_0$: Win/loss outcome is independent of game time slot.

$H_a$: Win/loss outcome depends on game time slot.

Under independence, the expected count for each cell is:

$$E_{ij} = \frac{(\text{row total}_i) \times (\text{column total}_j)}{\text{grand total}}$$

For example, the expected number of Sunday 1pm wins:

$$E = \frac{80 \times 85}{167} = 40.72$$

Chi-Squared Critical Values Reference

Degrees of Freedom	$\alpha = 0.10$	$\alpha = 0.05$	$\alpha = 0.01$	$\alpha = 0.001$
1	2.706	3.841	6.635	10.828
2	4.605	5.991	9.210	13.816
3	6.251	7.815	11.345	16.266
4	7.779	9.488	13.277	18.467
5	9.236	11.070	15.086	20.515
6	10.645	12.592	16.812	22.458
8	13.362	15.507	20.090	26.124
10	15.987	18.307	23.209	29.588
15	22.307	24.996	30.578	37.697
20	28.412	31.410	37.566	45.315

Python Code: Chi-Squared Analysis of Sports Categorical Data

import numpy as np
from scipy import stats
from typing import Optional


def chi_squared_goodness_of_fit(
    observed: list,
    expected: Optional[list] = None,
    categories: Optional[list] = None,
    alpha: float = 0.05,
) -> dict:
    """
    Chi-squared goodness-of-fit test for sports data.

    Tests whether observed categorical data matches an expected
    distribution (default: uniform).

    Args:
        observed: List of observed counts per category.
        expected: List of expected counts (default: uniform distribution).
        categories: Optional labels for each category.
        alpha: Significance level.

    Returns:
        Dictionary with test results and per-category contributions.
    """
    observed = np.array(observed)
    k = len(observed)
    n_total = observed.sum()

    if expected is None:
        expected = np.full(k, n_total / k)
    else:
        expected = np.array(expected)

    if categories is None:
        categories = [f"Cat_{i+1}" for i in range(k)]

    # Chi-squared statistic
    contributions = (observed - expected) ** 2 / expected
    chi2_stat = contributions.sum()
    df = k - 1
    p_value = 1 - stats.chi2.cdf(chi2_stat, df)

    # Per-category analysis
    detail = []
    for i in range(k):
        detail.append({
            "category": categories[i],
            "observed": int(observed[i]),
            "expected": round(expected[i], 1),
            "residual": round(observed[i] - expected[i], 1),
            "std_residual": round(
                (observed[i] - expected[i]) / np.sqrt(expected[i]), 2
            ),
            "contribution": round(contributions[i], 2),
        })

    return {
        "chi2_statistic": round(chi2_stat, 4),
        "degrees_of_freedom": df,
        "p_value": p_value,
        "reject_null": p_value < alpha,
        "detail": detail,
    }


def chi_squared_independence(
    contingency_table: list,
    row_labels: Optional[list] = None,
    col_labels: Optional[list] = None,
    alpha: float = 0.05,
) -> dict:
    """
    Chi-squared test of independence for a contingency table.

    Tests whether two categorical variables are independent.

    Args:
        contingency_table: 2D list of observed counts.
        row_labels: Labels for rows.
        col_labels: Labels for columns.
        alpha: Significance level.

    Returns:
        Dictionary with test results, expected values, and residuals.
    """
    table = np.array(contingency_table)
    chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)

    # Cramér's V for effect size
    n = table.sum()
    min_dim = min(table.shape) - 1
    cramers_v = np.sqrt(chi2_stat / (n * min_dim)) if min_dim > 0 else 0

    return {
        "chi2_statistic": round(chi2_stat, 4),
        "degrees_of_freedom": dof,
        "p_value": p_value,
        "reject_null": p_value < alpha,
        "expected_table": np.round(expected, 1),
        "cramers_v": round(cramers_v, 4),
        "effect_size_interpretation": (
            "negligible" if cramers_v < 0.1 else
            "small" if cramers_v < 0.3 else
            "medium" if cramers_v < 0.5 else
            "large"
        ),
    }


# --- Worked Example: NFL Scoring by Quarter ---
print("=" * 65)
print("GOODNESS OF FIT: NFL Scoring Distribution Across Quarters")
print("=" * 65)

quarters = ["Q1", "Q2", "Q3", "Q4"]
points = [2847, 3652, 2683, 4244]

result = chi_squared_goodness_of_fit(
    observed=points,
    categories=quarters,
)

print(f"\nChi-squared statistic: {result['chi2_statistic']}")
print(f"Degrees of freedom: {result['degrees_of_freedom']}")
print(f"P-value: {result['p_value']:.2e}")
print(f"Reject null (uniform scoring): {result['reject_null']}")

print(f"\n{'Quarter':>10} {'Observed':>10} {'Expected':>10} "
      f"{'Std Resid':>12} {'Contribution':>14}")
print("-" * 60)
for d in result["detail"]:
    print(f"{d['category']:>10} {d['observed']:>10} {d['expected']:>10.1f} "
          f"{d['std_residual']:>12.2f} {d['contribution']:>14.2f}")

print("\nConclusion: NFL scoring is emphatically non-uniform across")
print("quarters. Q4 is the highest-scoring period, Q3 the lowest.")

# --- Worked Example: NFL Key Numbers Test ---
print("\n" + "=" * 65)
print("GOODNESS OF FIT: Are NFL Key Numbers Real?")
print("=" * 65)
print("Testing whether NFL final margins follow a uniform distribution")
print("(they should not, if key numbers are real)")

margins = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]
observed_games = [58, 52, 192, 64, 45, 77, 115, 38, 28, 70]

result_margins = chi_squared_goodness_of_fit(
    observed=observed_games,
    categories=margins,
)

print(f"\nChi-squared statistic: {result_margins['chi2_statistic']}")
print(f"P-value: {result_margins['p_value']:.2e}")
print(f"Reject uniform distribution: {result_margins['reject_null']}")

print(f"\n{'Margin':>8} {'Observed':>10} {'Expected':>10} {'Std Resid':>12}")
print("-" * 45)
for d in result_margins["detail"]:
    flag = " ***" if abs(d["std_residual"]) > 3 else ""
    print(f"{d['category']:>8} {d['observed']:>10} {d['expected']:>10.1f} "
          f"{d['std_residual']:>12.2f}{flag}")

print("\n*** = Standardized residual > 3 (highly unusual category)")
print("\nThe margins of 3 and 7 show extremely large positive residuals,")
print("confirming that NFL key numbers are statistically real.")

# --- Test of Independence: Game Slot vs. Outcome ---
print("\n" + "=" * 65)
print("TEST OF INDEPENDENCE: Game Slot vs. Win/Loss")
print("=" * 65)

table = [
    [45, 35],   # Sunday 1pm
    [22, 18],   # Sunday 4pm
    [8, 12],    # Sunday Night
    [6, 9],     # Monday Night
    [4, 8],     # Thursday Night
]
slots = ["Sun 1pm", "Sun 4pm", "Sun Night", "Mon Night", "Thu Night"]

result_indep = chi_squared_independence(
    contingency_table=table,
    row_labels=slots,
    col_labels=["Win", "Loss"],
)

print(f"\nChi-squared statistic: {result_indep['chi2_statistic']}")
print(f"Degrees of freedom: {result_indep['degrees_of_freedom']}")
print(f"P-value: {result_indep['p_value']:.4f}")
print(f"Cramer's V (effect size): {result_indep['cramers_v']}")
print(f"Effect size: {result_indep['effect_size_interpretation']}")
print(f"Reject independence: {result_indep['reject_null']}")

print("\nExpected counts under independence:")
for i, slot in enumerate(slots):
    exp_w = result_indep["expected_table"][i, 0]
    exp_l = result_indep["expected_table"][i, 1]
    print(f"  {slot:>12}: Win={exp_w:.1f}, Loss={exp_l:.1f}")

Worked Example: Testing Whether NFL Key Numbers Are Real

The concept of "key numbers" --- that certain final margins (especially 3 and 7) occur with disproportionate frequency in NFL games --- is one of the most well-known claims in sports betting. Let us test it rigorously.

Setup: We examine a sample of NFL games and record the final margin of victory for margins 1 through 10 (excluding ties and margins greater than 10 for simplicity).

Hypotheses:

$$H_0: \text{Final margins are uniformly distributed across 1-10}$$

$$H_a: \text{Some margins occur more frequently than others}$$

Using the data from our Python example above (which reflects realistic NFL proportions), the chi-squared test produces a test statistic that vastly exceeds the critical value, confirming what every experienced NFL bettor knows: key numbers are real, and they are not even close to questionable.

The standardized residuals tell us which margins drive the result. A margin of 3 shows a standardized residual above +10, and a margin of 7 shows a standardized residual above +5. These are enormous deviations from uniformity.

Betting implication: The statistical reality of key numbers means that half-point line movements crossing 3 or 7 are worth more than half-point movements at other numbers. A bet at +3 is substantially more valuable than a bet at +2.5, while the difference between +4 and +3.5 is much smaller. Sharp bettors consistently exploit this by buying half-points across key numbers and by placing specific teaser strategies that cross both 3 and 7.

Common Pitfall: While key numbers are statistically real at the game-margin level, this does not automatically mean that betting strategies based on key numbers are profitable. The market is aware of key numbers, and line movements across these numbers are priced accordingly. The edge, if any, comes from finding specific situations where the market undervalues the key number effect.

8.5 Multiple Testing Corrections and Data Snooping

The Most Dangerous Trap in Betting Analysis

Everything we have discussed so far assumes you are testing a single, pre-specified hypothesis. But this is not how most bettors actually operate. In practice, a bettor might screen dozens or hundreds of potential angles:

Home underdogs on Monday Night Football
Teams coming off a bye week playing at home
Unders in games where both teams rank in the top 10 in pace
Divisional road favorites of 3 or more points
Teams with a losing record against teams with a winning record in September

Each of these is a hypothesis test. And here lies the trap: if you test enough hypotheses, some will appear significant by pure chance, even when no real effects exist.

This is the multiple comparisons problem (also called the multiple testing problem or the look-elsewhere effect), and it is arguably the single greatest source of false discoveries in sports betting analysis.

Understanding the Problem Mathematically

If you conduct a single hypothesis test at $\alpha = 0.05$, the probability of a false positive is 5%. But if you conduct $m$ independent tests, each at $\alpha = 0.05$, the probability of at least one false positive is:

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$

For various numbers of tests:

Tests ($m$)	$P(\geq 1$ false positive$)$
1	5.0%
5	22.6%
10	40.1%
20	64.2%
50	92.3%
100	99.4%

If you test 100 potential betting angles, there is a 99.4% chance that at least one will appear significant at the 5% level even if none of them are real. If you then cherry-pick this one "significant" result and present it as a proven system, you have committed one of the most common analytical sins in sports betting.

This is not a theoretical concern. It is the primary business model of most tout services and the primary self-deception of most amateur system builders. The process is predictable:

Screen hundreds of situational angles using historical data.
Find a handful that show "significant" results.
Ignore the hundreds that failed.
Market the surviving angles as "proven systems."

The statistical term for this is data snooping (or data dredging). The informal term is "p-hacking." Whatever you call it, it produces apparent edges that vanish the moment real money is at stake.

The Bonferroni Correction

The simplest and most conservative correction for multiple testing is the Bonferroni correction. If you are conducting $m$ tests and want to maintain a family-wise error rate (FWER) of $\alpha$, you test each individual hypothesis at level $\alpha / m$.

$$\alpha_{\text{adjusted}} = \frac{\alpha}{m}$$

For $m = 20$ tests with a desired FWER of 0.05:

$$\alpha_{\text{adjusted}} = \frac{0.05}{20} = 0.0025$$

Each individual test must achieve a p-value below 0.0025 to be considered significant.

The Bonferroni correction is simple and guarantees that the probability of any false positive is at most $\alpha$. Its disadvantage is that it is very conservative --- it substantially reduces power, especially when $m$ is large. If you test 100 angles, each must achieve a p-value below 0.0005, which requires an enormous effect size or sample.

The False Discovery Rate (Benjamini-Hochberg Procedure)

An alternative approach controls not the probability of any false positive but the false discovery rate (FDR) --- the expected proportion of rejected hypotheses that are false positives.

The Benjamini-Hochberg (BH) procedure works as follows:

Conduct all $m$ tests and obtain their p-values.
Sort the p-values in ascending order: $p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(m)}$.
For each $i$, compute the BH threshold: $\alpha_i = \frac{i}{m} \times q$, where $q$ is the desired FDR (e.g., 0.05).
Find the largest $k$ such that $p_{(k)} \leq \alpha_k$.
Reject all hypotheses with p-values $\leq p_{(k)}$.

The BH procedure is less conservative than Bonferroni and is generally preferred when you are testing many hypotheses and expect some to be true. It controls the FDR at level $q$, meaning that among all the angles you declare significant, no more than $q$ fraction are expected to be false discoveries (on average).

Comparison of Correction Methods

Method	Controls	Formula	Conservatism	Best For
None (uncorrected)	Nothing	$\alpha$	None	Single pre-specified test
Bonferroni	Family-wise error rate	$\alpha / m$	Very high	Few tests, high-stakes decisions
Holm-Bonferroni	Family-wise error rate	Sequential step-down	High	Few-to-moderate tests
Benjamini-Hochberg	False discovery rate	$\frac{i}{m} q$	Moderate	Many tests, screening
Benjamini-Yekutieli	FDR (dependent tests)	$\frac{i}{m \cdot c_m} q$	Moderate-High	Correlated tests

Data Snooping and Overfitting in Angle Hunting

The multiple testing problem is exacerbated in sports betting by several features of the data:

Large numbers of potential variables. Modern sports databases contain hundreds of variables per game: team statistics, player statistics, weather, rest days, travel distance, time of day, historical matchup data, referee assignments, and more. The number of potential "angles" derived from combinations of these variables is astronomical.

Survivorship bias. You hear about the angles that "work" and never about the hundreds that failed during screening. This applies to your own analysis (you remember your good finds, not your dead ends) and especially to information from others.

Overfitting to noise. With enough variables, you can always find a combination that fits historical data perfectly. A system that says "bet unders on Thursday Night Football in October when the road team is favored by 1-3 points and the over/under is between 42 and 45" may have a perfect historical record --- because the sample size is 7 games. This is not a system; it is a coincidence.

The garden of forking paths. Even when you test only one hypothesis, the choices you make along the way (how to define the sample, how to handle ties, which seasons to include, which teams to exclude) can create implicit multiple testing. If any of these decisions are influenced by peeking at the results, the effective number of tests is larger than 1.

Pre-Registration of Betting Hypotheses

The gold standard for avoiding data snooping is pre-registration: formally specifying your hypothesis, test, and decision rule before looking at the data. In academic research, this involves registering your study plan on a public platform. In betting, it means:

Specify the angle before examining performance data. "I believe teams with 10+ days of rest playing at home against divisional opponents are undervalued."
Define the test. "I will test this by examining ATS records for such games over the last 5 seasons using a one-tailed z-test at alpha = 0.05."
Collect the data and run the test. No modifications, no "well, let me also check if it works for non-divisional games."
Report the result. Whether significant or not.

Pre-registration does not prevent you from conducting exploratory analysis. It simply requires you to clearly distinguish between exploratory findings (which need confirmation) and confirmatory tests (which provide genuine evidence). This distinction is the single most important analytical habit a sports bettor can develop.

Python Code: Multiple Testing Correction Applied to Betting Angle Search

import numpy as np
from scipy import stats


def simulate_angle_search(
    n_angles: int = 100,
    n_bets_per_angle: int = 200,
    n_real_edges: int = 3,
    real_edge_size: float = 0.05,
    alpha: float = 0.05,
    seed: int = 42,
) -> dict:
    """
    Simulate searching through many potential betting angles,
    where most have no edge and a few have a genuine edge.

    This demonstrates the multiple testing problem in a realistic
    betting context.

    Args:
        n_angles: Total number of angles tested.
        n_bets_per_angle: Sample size for each angle.
        n_real_edges: Number of angles with a genuine edge.
        real_edge_size: True edge for the real angles (e.g., 0.05 = 55%).
        alpha: Nominal significance level.
        seed: Random seed for reproducibility.

    Returns:
        Dictionary with raw and corrected results.
    """
    np.random.seed(seed)

    true_win_rates = np.full(n_angles, 0.50)
    true_win_rates[:n_real_edges] = 0.50 + real_edge_size

    # Simulate betting results for each angle
    wins = np.array([
        np.random.binomial(n_bets_per_angle, p)
        for p in true_win_rates
    ])
    observed_rates = wins / n_bets_per_angle

    # Calculate p-values for each angle (one-tailed test: p > 0.50)
    z_stats = (observed_rates - 0.50) / np.sqrt(0.50 * 0.50 / n_bets_per_angle)
    p_values = 1 - stats.norm.cdf(z_stats)

    # --- Uncorrected ---
    uncorrected_sig = p_values < alpha
    uncorrected_discoveries = np.sum(uncorrected_sig)
    uncorrected_true_pos = np.sum(uncorrected_sig[:n_real_edges])
    uncorrected_false_pos = np.sum(uncorrected_sig[n_real_edges:])

    # --- Bonferroni correction ---
    bonf_alpha = alpha / n_angles
    bonf_sig = p_values < bonf_alpha
    bonf_discoveries = np.sum(bonf_sig)
    bonf_true_pos = np.sum(bonf_sig[:n_real_edges])
    bonf_false_pos = np.sum(bonf_sig[n_real_edges:])

    # --- Benjamini-Hochberg correction ---
    sorted_indices = np.argsort(p_values)
    sorted_pvals = p_values[sorted_indices]
    m = n_angles
    bh_thresholds = np.arange(1, m + 1) / m * alpha

    # Find the largest k where p_(k) <= threshold_k
    bh_sig = np.zeros(m, dtype=bool)
    rejected = sorted_pvals <= bh_thresholds
    if np.any(rejected):
        max_k = np.max(np.where(rejected)[0])
        bh_sig_sorted = np.zeros(m, dtype=bool)
        bh_sig_sorted[:max_k + 1] = True
        # Map back to original indices
        bh_sig[sorted_indices[bh_sig_sorted]] = True

    bh_discoveries = np.sum(bh_sig)
    bh_true_pos = np.sum(bh_sig[:n_real_edges])
    bh_false_pos = np.sum(bh_sig[n_real_edges:])

    return {
        "n_angles": n_angles,
        "n_real_edges": n_real_edges,
        "n_null_true": n_angles - n_real_edges,
        "p_values": p_values,
        "observed_rates": observed_rates,
        "uncorrected": {
            "discoveries": uncorrected_discoveries,
            "true_positives": uncorrected_true_pos,
            "false_positives": uncorrected_false_pos,
            "fdr": (uncorrected_false_pos / uncorrected_discoveries
                    if uncorrected_discoveries > 0 else 0),
        },
        "bonferroni": {
            "adjusted_alpha": bonf_alpha,
            "discoveries": bonf_discoveries,
            "true_positives": bonf_true_pos,
            "false_positives": bonf_false_pos,
            "fdr": (bonf_false_pos / bonf_discoveries
                    if bonf_discoveries > 0 else 0),
        },
        "benjamini_hochberg": {
            "discoveries": bh_discoveries,
            "true_positives": bh_true_pos,
            "false_positives": bh_false_pos,
            "fdr": (bh_false_pos / bh_discoveries
                    if bh_discoveries > 0 else 0),
        },
    }


# --- Run the simulation ---
print("=" * 70)
print("THE 100 ANGLES PROBLEM: Searching for Significance")
print("=" * 70)
print("\nScenario: A bettor tests 100 potential betting angles.")
print("Reality: Only 3 angles have a genuine 5% edge (55% true win rate).")
print("The other 97 angles have NO edge (50% true win rate).")
print("Each angle is evaluated over 200 bets.\n")

result = simulate_angle_search(
    n_angles=100,
    n_bets_per_angle=200,
    n_real_edges=3,
    real_edge_size=0.05,
    alpha=0.05,
    seed=42,
)

print("-" * 70)
print(f"{'Method':<25} {'Discoveries':>12} {'True Pos':>10} "
      f"{'False Pos':>11} {'FDR':>8}")
print("-" * 70)

for method_name, method_key in [
    ("Uncorrected (alpha=.05)", "uncorrected"),
    ("Bonferroni", "bonferroni"),
    ("Benjamini-Hochberg", "benjamini_hochberg"),
]:
    m = result[method_key]
    fdr_str = f"{m['fdr']:.1%}" if m['discoveries'] > 0 else "N/A"
    print(f"{method_name:<25} {m['discoveries']:>12} {m['true_positives']:>10} "
          f"{m['false_positives']:>11} {fdr_str:>8}")

print("-" * 70)

print(f"\nKey observations:")
print(f"  - Without correction, ~{result['uncorrected']['false_positives']} "
      f"of the 97 null angles appear 'significant'")
print(f"  - This means {result['uncorrected']['fdr']:.0%} of 'discoveries' "
      f"are FALSE (the false discovery rate)")
print(f"  - Bonferroni eliminates false positives but may miss real edges")
print(f"  - BH procedure provides a middle ground")

# --- Show the most 'significant' angles ---
print("\n" + "=" * 70)
print("TOP 10 ANGLES BY P-VALUE (sorted)")
print("=" * 70)
print(f"{'Rank':>5} {'Angle':>8} {'Win Rate':>10} {'P-value':>12} "
      f"{'Real Edge?':>12} {'Uncorr':>8} {'Bonf':>8} {'BH':>8}")
print("-" * 70)

sorted_idx = np.argsort(result["p_values"])
for rank, idx in enumerate(sorted_idx[:10], 1):
    wr = result["observed_rates"][idx]
    pv = result["p_values"][idx]
    is_real = "YES" if idx < 3 else "NO"
    uncorr = "Sig" if pv < 0.05 else "-"
    bonf = "Sig" if pv < 0.05 / 100 else "-"
    # For BH, we check if this p-value would be rejected
    bh_status = "Sig" if pv <= (rank / 100) * 0.05 else "-"
    print(f"{rank:>5} {idx+1:>8} {wr:>10.1%} {pv:>12.6f} "
          f"{is_real:>12} {uncorr:>8} {bonf:>8} {bh_status:>8}")

# --- Multiple runs to show average behavior ---
print("\n" + "=" * 70)
print("AVERAGE RESULTS OVER 1,000 SIMULATIONS")
print("=" * 70)

n_simulations = 1000
results_summary = {
    "uncorrected": {"fp": [], "tp": [], "disc": []},
    "bonferroni": {"fp": [], "tp": [], "disc": []},
    "benjamini_hochberg": {"fp": [], "tp": [], "disc": []},
}

for sim in range(n_simulations):
    r = simulate_angle_search(
        n_angles=100, n_bets_per_angle=200,
        n_real_edges=3, real_edge_size=0.05,
        seed=sim * 7 + 13,
    )
    for method in results_summary:
        results_summary[method]["fp"].append(r[method]["false_positives"])
        results_summary[method]["tp"].append(r[method]["true_positives"])
        results_summary[method]["disc"].append(r[method]["discoveries"])

print(f"\n{'Method':<25} {'Avg Disc':>10} {'Avg TP':>10} "
      f"{'Avg FP':>10} {'Avg FDR':>10}")
print("-" * 70)
for method_name, method_key in [
    ("Uncorrected", "uncorrected"),
    ("Bonferroni", "bonferroni"),
    ("Benjamini-Hochberg", "benjamini_hochberg"),
]:
    s = results_summary[method_key]
    avg_disc = np.mean(s["disc"])
    avg_tp = np.mean(s["tp"])
    avg_fp = np.mean(s["fp"])
    avg_fdr = avg_fp / avg_disc if avg_disc > 0 else 0
    print(f"{method_name:<25} {avg_disc:>10.1f} {avg_tp:>10.1f} "
          f"{avg_fp:>10.1f} {avg_fdr:>10.1%}")

print("\nThe uncorrected approach produces ~5 false positives per search")
print("on average --- meaning most 'discoveries' are illusions.")
print("The BH procedure maintains the false discovery rate near 5%.")

The "100 Angles" Problem: Searching for Significance

Let us make the multiple testing problem as concrete as possible with a scenario every sports bettor should internalize.

The scenario: You have a database of 10 years of NFL data. You decide to search for profitable betting angles by testing 100 different situational factors: home/away, rest days, division vs. non-division, indoor vs. outdoor, grass vs. turf, temperature ranges, wind ranges, time of day, day of week, prime-time vs. non-prime-time, and various combinations thereof. For each angle, you check the ATS record and compute a p-value.

The reality: Suppose that none of these 100 angles represent a genuine edge. Every single one has a true win rate of exactly 50%.

The outcome: At $\alpha = 0.05$, you expect 5 of the 100 tests to produce "significant" results purely by chance. These 5 false positives will look indistinguishable from real edges. They will have p-values below 0.05, they may have interesting narrative explanations ("of course Thursday Night Football unders work --- short rest suppresses offense!"), and they will seem like genuine discoveries.

If you then bet on these 5 "angles" going forward, you will discover to your financial detriment that they have no predictive power. The historical pattern was noise; the narrative was post-hoc rationalization; the statistical significance was an artifact of multiple testing.

The solution: Apply correction methods before drawing conclusions:

Bonferroni: Require $p < 0.0005$ for each individual test. This almost certainly eliminates all false positives but may also eliminate some real edges (if they exist).
Benjamini-Hochberg: Control the false discovery rate at 5%. Among the angles you declare significant, expect roughly 5% to be false. This is a more practical approach when you are screening many angles and can tolerate some false positives in exchange for greater sensitivity.
Out-of-sample validation: The most practical approach for bettors. Split your data into a "discovery" sample and a "confirmation" sample. Find angles in the discovery sample, then test them (without modification) in the confirmation sample. Any angle that is significant in both samples has much stronger evidence in its favor.
Pre-registration: As discussed above, specify your hypothesis before looking at the data.

A Framework for Disciplined Angle Evaluation

Given everything we have discussed, here is a practical workflow for evaluating potential betting angles:

Phase 1: Exploratory Analysis (Discovery) - Search freely through historical data for patterns. - Use the BH correction to control false discoveries. - Treat all findings as provisional hypotheses, not confirmed edges.

Phase 2: Confirmation - Test provisional hypotheses on held-out data (different seasons, different leagues, or prospective tracking). - Apply pre-registered, one-shot hypothesis tests. - Require significance at $\alpha = 0.01$ or stricter.

Phase 3: Economic Validation - Even statistically significant angles may not be profitable after accounting for vig, line movement, and execution costs. - Simulate actual betting (with realistic odds, bet timing, and bankroll constraints). - Require positive expected value after all costs.

Phase 4: Monitored Deployment - Begin betting the angle with small stakes. - Track results prospectively. - Compare ongoing results to pre-deployment expectations. - Establish clear criteria for abandoning the angle if results deteriorate.

Key Insight: The multiple testing problem is not just a statistical technicality. It is the primary reason most "systems" fail in live betting. The bettor who internalizes this lesson --- who demands out-of-sample confirmation for every angle and treats in-sample results with appropriate skepticism --- has a profound advantage over the vast majority of bettors who do not.

8.6 Chapter Summary

This chapter has provided the formal statistical framework for evaluating betting claims. Let us consolidate the key takeaways:

Core Concepts

Hypothesis testing is the disciplined process of evaluating claims against the null hypothesis of "nothing interesting is happening." In betting, the null hypothesis typically asserts that a bettor has no edge, a model adds no value, or a situational factor has no effect on outcomes.

P-values quantify the probability of observing results at least as extreme as the actual data, assuming the null hypothesis is true. They are not the probability that the null hypothesis is true, and they should never be interpreted in isolation from sample size, effect size, and prior plausibility.

Sample size requirements are the sobering mathematical reality that confronts every sports bettor. Detecting a 2% edge over the 50% null requires nearly 4,000 bets at standard significance and power levels. Detecting a 1% edge requires over 15,000. Most bettors do not have enough data to draw any statistically valid conclusions about their skill.

Chi-squared tests extend hypothesis testing to categorical data, enabling analysis of score distributions, key number frequency, and the independence of sporting variables. The test of NFL key numbers provides a textbook example of a chi-squared goodness-of-fit test with an unambiguous result.

Multiple testing corrections are essential whenever more than one hypothesis is tested. The Bonferroni correction controls the family-wise error rate but is conservative. The Benjamini-Hochberg procedure controls the false discovery rate and is better suited to screening many angles. Failure to apply corrections is the single most common source of illusory edges in sports betting.

Key Formulas

Z-test for a proportion:

$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$$

Required sample size (one-tailed z-test for proportion):

$$n = \left(\frac{z_{\alpha}\sqrt{p_0(1-p_0)} + z_{\beta}\sqrt{p_1(1-p_1)}}{p_1 - p_0}\right)^2$$

Chi-squared statistic:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

Bonferroni corrected threshold:

$$\alpha_{\text{adjusted}} = \frac{\alpha}{m}$$

Family-wise error rate for $m$ independent tests:

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$

Practical Guidelines

Situation	Recommended Approach
Testing a single, pre-specified bettor claim	One-sample z-test at $\alpha = 0.01$
Comparing two groups (e.g., before/after rule change)	Two-sample z-test or t-test, two-tailed
Evaluating categorical distributions	Chi-squared goodness-of-fit
Testing whether two variables are related	Chi-squared test of independence
Screening many potential angles	BH correction with out-of-sample confirmation
Deciding how many bets to track before concluding	Power analysis with conservative edge estimate
Determining if a tout service's record is meaningful	Sample size table --- almost certainly insufficient

The Hierarchy of Evidence

Not all statistical evidence is created equal. In order from weakest to strongest:

Anecdotal: "I've been winning lately." (No formal test; meaningless.)
In-sample, uncorrected: "This angle is 58% over 150 games." (Likely data-mined; unreliable.)
In-sample, corrected: "This angle survives Bonferroni correction across 50 tests." (Interesting but still in-sample.)
Out-of-sample confirmation: "This angle was identified in 2015-2019 data and confirmed in 2020-2024 data." (Substantially more credible.)
Prospective validation: "I pre-registered this angle and it has been profitable over 500 tracked bets." (Strong evidence.)
Mechanistic understanding + prospective validation: "I understand why this angle works (market inefficiency mechanism), and prospective results confirm it." (Strongest evidence.)

The vast majority of betting claims fall into categories 1 and 2. Your goal as a quantitative bettor is to operate exclusively at levels 4 through 6.

Common Errors to Avoid

Confusing statistical significance with practical significance. A result can be statistically significant (low p-value) but practically meaningless (tiny edge eaten by vig).
Ignoring the base rate. Most betting systems do not work. Most bettors are not profitable. The prior probability that any given claim is true is low, which means p-values must be very small to be convincing.
Testing after peeking. If you check your results partway through and then decide to "keep going until significant," you have invalidated the test. The significance level applies only to a pre-specified sample size.
Failing to correct for multiple tests. If you examined 20 angles before finding this one, your effective significance level is not 0.05 --- it is much higher.
Treating non-significance as evidence of no effect. A non-significant result at $n = 200$ says almost nothing about whether a 2% edge exists. It simply says you do not yet have enough data.
Over-relying on p-values. P-values are one tool among many. Confidence intervals, effect sizes, Bayesian posterior probabilities, and out-of-sample validation all provide complementary information. No single number captures the totality of evidence.

What's Next: Chapter 9 (Regression Analysis)

With hypothesis testing in hand, you can now determine whether individual effects are statistically significant. But betting outcomes are influenced by many factors simultaneously: team strength, home field advantage, rest, weather, injuries, and more. Evaluating these factors in isolation --- as hypothesis testing does --- misses the interactions and confounding relationships between them.

Chapter 9: Regression Analysis for Sports Betting introduces the framework for modeling outcomes as a function of multiple predictors simultaneously. You will learn to build linear regression models for point spreads and totals, logistic regression models for win probabilities, and to interpret regression coefficients as the marginal effect of each factor while controlling for all others. Regression analysis transforms the question from "Is this factor significant?" to "How much does this factor matter, after accounting for everything else?" It is the bridge between hypothesis testing and the predictive models that form the core of quantitative sports betting.

Chapter 8 Exercises

Exercise 8.1: A bettor claims a 56% win rate on NBA totals over 300 bets. Set up and carry out the appropriate hypothesis test. What is the p-value? Would you be willing to invest in this bettor's picks? What additional information would you want?

Exercise 8.2: Calculate the number of bets required to detect a 3% edge (53% true win rate) with 90% power at the 1% significance level. How long would this take at 3 bets per day? What does this imply about annual performance evaluation?

Exercise 8.3: Using the chi-squared test, determine whether the following distribution of NBA game outcomes by day of week is consistent with uniform play quality across days (data represents home win rates): Monday: 142/250, Tuesday: 168/290, Wednesday: 155/280, Thursday: 88/160, Friday: 130/240, Saturday: 145/250, Sunday: 72/130.

Exercise 8.4: A tout service tests 40 different NFL betting angles and reports that 3 of them are significant at the 5% level. Apply both the Bonferroni correction and the BH procedure. How many angles survive each correction? What would you conclude?

Exercise 8.5: Design a pre-registered hypothesis test for a betting angle of your choice. Specify: (a) the angle and its theoretical justification, (b) the null and alternative hypotheses, (c) the test statistic and significance level, (d) the sample you will use, and (e) the decision rule. Then explain why pre-registration matters.

Exercise 8.6 (Programming): Modify the simulate_angle_search function to explore how the false discovery rate changes as you vary: (a) the number of angles tested (10, 50, 100, 500), (b) the proportion of real edges (0%, 5%, 10%, 20%), and (c) the true edge size (2%, 5%, 10%). Create a summary table of your findings.

Learning Objectives

In This Chapter

Chapter 8: Hypothesis Testing and Statistical Significance

Chapter Overview

8.1 Framing Betting Questions as Hypotheses

The Logic of Hypothesis Testing

Common Betting Hypotheses

One-Tailed vs. Two-Tailed Tests

Python Code: Hypothesis Testing Framework

8.2 P-Values and Their Interpretation

What P-Values Actually Mean

What P-Values Do NOT Mean

Calculating P-Values from Test Statistics

Significance Levels and Their Arbitrary Nature

The P-Value Controversy and Bayesian Alternatives

Python Code: P-Value Calculation for Common Scenarios

Worked Example: Testing Whether a Bettor's 54% Win Rate on 500 Bets Is Significant

8.3 Sample Size Requirements for Betting Claims

The Bettor's Most Painful Question

Power Analysis for Betting Contexts

Type I and Type II Errors in Betting

Sample Size Formula

How Many Bets to Confirm a 2% Edge?

Comprehensive Sample Size Table

Python Code: Sample Size Calculator for Betting Edge Detection

The Implications for Betting Practice

8.4 Chi-Squared Tests for Categorical Sports Data

When Proportions Are Not Enough

The Chi-Squared Statistic

Goodness of Fit: Are NFL Scores Uniformly Distributed Across Quarters?

Test of Independence: Is Performance Related to Day of Week?

Chi-Squared Critical Values Reference

Python Code: Chi-Squared Analysis of Sports Categorical Data

Worked Example: Testing Whether NFL Key Numbers Are Real

8.5 Multiple Testing Corrections and Data Snooping

The Most Dangerous Trap in Betting Analysis

Understanding the Problem Mathematically

The Bonferroni Correction

The False Discovery Rate (Benjamini-Hochberg Procedure)

Comparison of Correction Methods

Data Snooping and Overfitting in Angle Hunting

Pre-Registration of Betting Hypotheses

Python Code: Multiple Testing Correction Applied to Betting Angle Search

The "100 Angles" Problem: Searching for Significance

A Framework for Disciplined Angle Evaluation

8.6 Chapter Summary

Core Concepts

Key Formulas

Practical Guidelines

The Hierarchy of Evidence

Common Errors to Avoid

What's Next: Chapter 9 (Regression Analysis)

Chapter 8 Exercises

Further Reading