Chapter 8: Key Takeaways

Hypothesis Testing and Statistical Significance


1. The Foundation: Null and Alternative Hypotheses

  • Every statistical evaluation of a betting strategy begins with a clearly stated null hypothesis (typically: "this bettor or system has no edge") and an alternative hypothesis ("there is a genuine edge").
  • The null hypothesis is the default assumption. We require sufficient evidence to overturn it, much like the presumption of innocence in a legal trial.
  • The choice between a one-sided test (the bettor is better than random) and a two-sided test (the bettor differs from random in either direction) should be made before looking at the data.

2. P-Values: What They Are and What They Are Not

  • A p-value is the probability of observing results as extreme or more extreme than those obtained, assuming the null hypothesis is true.
  • A p-value is not the probability that the null hypothesis is true, the probability that the bettor is unskilled, or the probability that the results occurred by chance.
  • Small p-values indicate the data is unlikely under the null hypothesis, but they do not quantify the magnitude or practical importance of the effect.
  • The threshold for significance (alpha, typically 0.05) is a convention, not a law of nature. Results with p = 0.049 and p = 0.051 should not be treated as fundamentally different.
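
As a concrete illustration of how such a p-value is computed, the minimal sketch below (not code from the chapter; the 540-of-1,000 record is invented) runs a one-sample z-test of a win rate against a null value and reports both the one-sided and the two-sided p-value.

    import math

    def z_test_win_rate(wins, n, p0=0.5, one_sided=True):
        """One-sample z-test of an observed win rate against the null value p0."""
        p_hat = wins / n
        se = math.sqrt(p0 * (1 - p0) / n)               # standard error under H0
        z = (p_hat - p0) / se
        upper_tail = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal
        p_value = upper_tail if one_sided else 2 * min(upper_tail, 1 - upper_tail)
        return z, p_value

    # Hypothetical record: 540 wins in 1,000 bets, tested against "no edge" (50%)
    z, p_one = z_test_win_rate(540, 1000, p0=0.50, one_sided=True)
    _, p_two = z_test_win_rate(540, 1000, p0=0.50, one_sided=False)
    print(f"z = {z:.2f}, one-sided p = {p_one:.4f}, two-sided p = {p_two:.4f}")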

3. The Breakeven Threshold Matters More Than 50%

  • In sports betting, demonstrating a win rate above 50% is necessary but not sufficient for profitability.
  • The relevant null hypothesis for profitability is the breakeven rate, which depends on the vig (e.g., approximately 52.38% for standard -110 odds).
  • A bettor can be statistically significantly above 50% while still having no evidence of being above the breakeven rate. Always test against the economically meaningful threshold.
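
A minimal sketch of where the 52.38% figure comes from, and of why the choice of null matters; the 540-of-1,000 record reuses the invented example above and is not real data.

    import math

    def breakeven_rate(american_odds):
        """Breakeven win probability implied by American odds."""
        if american_odds < 0:
            return -american_odds / (-american_odds + 100)    # -110 -> 110/210 ~ 0.5238
        return 100 / (american_odds + 100)                    # +150 -> 100/250 = 0.40

    p0 = breakeven_rate(-110)
    wins, n = 540, 1000
    p_hat = wins / n
    z_vs_50   = (p_hat - 0.50) / math.sqrt(0.50 * 0.50 / n)
    z_vs_even = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    print(f"breakeven = {p0:.4f}")
    print(f"z vs 50%: {z_vs_50:.2f}   z vs breakeven: {z_vs_even:.2f}")
    # About 2.5 vs 50% (significant) but only about 1.0 vs breakeven (not significant)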

4. Sample Size Is the Biggest Challenge in Betting

  • Detecting small edges requires enormous sample sizes. A bettor with a genuine 53% win rate may need 2,000 to 4,000 bets to achieve statistical significance.
  • At 5 bets per week, this represents 8 to 15 years of betting — far longer than most bettors' patience or most strategies' lifespans.
  • Small samples produce wide confidence intervals and low statistical power, making it difficult to distinguish skill from luck over typical timeframes.
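
To make the scale of the problem concrete, the sketch below uses the standard normal approximation for a one-sample test of a proportion to estimate the required number of bets; the win rates passed to it are illustrative.

    import math
    from statistics import NormalDist

    def required_sample_size(p0, p1, alpha=0.05, power=0.80, one_sided=True):
        """Approximate bets needed for a z-test of H0: p = p0 to detect a true
        win rate of p1 with the requested power (normal approximation)."""
        z_a = NormalDist().inv_cdf((1 - alpha) if one_sided else (1 - alpha / 2))
        z_b = NormalDist().inv_cdf(power)
        numer = z_a * math.sqrt(p0 * (1 - p0)) + z_b * math.sqrt(p1 * (1 - p1))
        return math.ceil((numer / (p1 - p0)) ** 2)

    print(required_sample_size(0.50, 0.53, power=0.90))    # roughly 2,400 bets vs a fair coin
    print(required_sample_size(0.5238, 0.53, power=0.90))  # tens of thousands vs the -110 breakeven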

5. Type I and Type II Errors Have Real Costs

  • Type I error (false positive): Concluding a bettor is skilled when they are not. Cost: betting real money on a system that has no edge, leading to long-term losses from the vig.
  • Type II error (false negative): Failing to identify a skilled bettor. Cost: missing a profitable opportunity.
  • The appropriate balance between these errors depends on context. For personal betting decisions, being conservative (lower alpha) protects your bankroll. For sportsbook risk management, different considerations apply.

6. Statistical Power Determines What You Can Detect

  • Power is the probability of correctly identifying a real effect (rejecting H0 when H0 is false).
  • Power increases with larger sample sizes, larger true effects, and higher significance levels (i.e., a larger alpha).
  • Before conducting any analysis, perform a power calculation to determine whether your dataset is large enough to detect the effect you are looking for. There is little value in testing a hypothesis if you lack the power to detect realistic alternatives.
  • A non-significant result with low power is uninformative — it does not mean the effect does not exist; it means you could not detect it even if it did.
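
The same approximation can be turned around to ask how much power a given number of bets provides. The sketch below assumes a true win rate of 53% tested one-sided against 50%; the sample sizes in the loop are illustrative.

    import math
    from statistics import NormalDist

    def power_of_z_test(n, p0, p1, alpha=0.05):
        """Power of a one-sided z-test of H0: p = p0 when the true win rate is p1."""
        z_a = NormalDist().inv_cdf(1 - alpha)
        crit = p0 + z_a * math.sqrt(p0 * (1 - p0) / n)    # win rate that just rejects H0
        z = (crit - p1) / math.sqrt(p1 * (1 - p1) / n)    # chance the true rate clears it
        return 1 - NormalDist().cdf(z)

    for n in (200, 500, 1000, 2000, 5000):
        print(n, round(power_of_z_test(n, p0=0.50, p1=0.53), 2))   # roughly 0.21 ... 0.99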

7. Multiple Testing Is the Silent Killer of Betting Research

  • Testing many strategies, sports, timeframes, or angles simultaneously inflates the probability of finding at least one "significant" result by chance.
  • If you test 20 strategies at alpha = 0.05, you expect 1 false positive even if none of the strategies have a real edge.
  • Correction methods (Bonferroni, Holm-Bonferroni, Benjamini-Hochberg) adjust for multiple testing but reduce power.
  • The Bonferroni correction controls the family-wise error rate but is conservative. The Benjamini-Hochberg procedure controls the false discovery rate and is generally more powerful.
  • Pre-registration — specifying your hypotheses and analysis plan before looking at the data — is the strongest defense against multiple testing problems.
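
A minimal sketch of two of these corrections applied to a hypothetical batch of 20 strategy p-values (the values are invented): two are nominally below 0.05, but neither survives either correction.

    def bonferroni(pvals, alpha=0.05):
        """Reject H0_i only if p_i <= alpha / m (controls the family-wise error rate)."""
        m = len(pvals)
        return [p <= alpha / m for p in pvals]

    def benjamini_hochberg(pvals, alpha=0.05):
        """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
        m = len(pvals)
        order = sorted(range(m), key=lambda i: pvals[i])
        max_rank = 0
        for rank, i in enumerate(order, start=1):
            if pvals[i] <= rank * alpha / m:
                max_rank = rank                   # largest rank passing its threshold
        reject = [False] * m
        for rank, i in enumerate(order, start=1):
            if rank <= max_rank:
                reject[i] = True
        return reject

    pvals = [0.03, 0.20, 0.45, 0.07, 0.81, 0.12, 0.64, 0.33, 0.02, 0.55,
             0.90, 0.15, 0.48, 0.71, 0.26, 0.09, 0.38, 0.60, 0.84, 0.41]
    print(sum(bonferroni(pvals)), sum(benjamini_hochberg(pvals)))   # 0 0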

8. Confidence Intervals Tell a Richer Story Than P-Values

  • A confidence interval provides a range of plausible values for the true parameter (e.g., true win rate), conveying both the estimate and its precision.
  • A 95% CI that barely excludes 50% carries a different message than one that excludes 55%. Both might yield "significant" p-values, but the practical implications differ dramatically.
  • The Wilson score interval is preferred over the Wald interval for proportions, especially with small samples or extreme proportions.
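
The two intervals are easy to compare side by side. The sketch below implements both from their textbook formulas for a hypothetical 30-of-50 record; note how the Wilson interval is pulled slightly toward 0.5.

    import math
    from statistics import NormalDist

    def wald_interval(wins, n, conf=0.95):
        """Wald interval: p_hat +/- z * sqrt(p_hat (1 - p_hat) / n)."""
        z = NormalDist().inv_cdf(0.5 + conf / 2)
        p = wins / n
        half = z * math.sqrt(p * (1 - p) / n)
        return p - half, p + half

    def wilson_interval(wins, n, conf=0.95):
        """Wilson score interval: better coverage, especially for small samples."""
        z = NormalDist().inv_cdf(0.5 + conf / 2)
        p = wins / n
        denom = 1 + z**2 / n
        centre = (p + z**2 / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return centre - half, centre + half

    print(wald_interval(30, 50))     # about (0.464, 0.736)
    print(wilson_interval(30, 50))   # about (0.462, 0.724), centred closer to 0.5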

9. The Chi-Squared Test for Categorical Betting Data

  • The chi-squared test assesses whether observed frequencies match expected frequencies across categories (goodness-of-fit) or whether two categorical variables are independent.
  • It is useful for testing whether cover rates vary across spread ranges, whether outcomes differ by day of week, or whether a sportsbook's lines are well-calibrated across categories.
  • The chi-squared test is inherently two-sided and requires adequate expected cell counts (typically at least 5 per cell).
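
A short example of a test of independence on an invented cover/no-cover table, assuming scipy is available (the counts are illustrative, not real data):

    # Assumes scipy is installed; the counts below are invented for illustration.
    from scipy.stats import chi2_contingency

    # Rows: covered / did not cover; columns: four hypothetical spread buckets
    observed = [[55, 62, 48, 51],
                [45, 58, 52, 49]]

    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
    print((expected >= 5).all())   # check the expected-count condition before trusting the test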

10. Base Rates and the False Discovery Problem

  • Even with a low alpha, the proportion of "significant" results that are truly real depends on the base rate of true effects.
  • If most tested betting strategies have no edge (i.e., the base rate of true effects is low), then a large fraction of significant findings will be false positives, even at alpha = 0.05.
  • This is why skepticism toward published betting systems is warranted: publication bias and the file drawer problem ensure that you see the winners but not the losers.
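
A quick calculation makes this concrete: among strategies that test "significant", the share that truly have an edge depends on the base rate and the power, not just on alpha. The base rate and power below are illustrative assumptions.

    def share_of_true_positives(base_rate, alpha=0.05, power=0.80):
        """Among 'significant' results, the expected fraction that reflect a real edge.
        base_rate is the prior probability that a tested strategy has a true edge."""
        true_pos = base_rate * power
        false_pos = (1 - base_rate) * alpha
        return true_pos / (true_pos + false_pos)

    # If only 5% of tested systems have a real edge and power is a modest 50%,
    # only about a third of "significant" findings are genuine.
    print(round(share_of_true_positives(0.05, alpha=0.05, power=0.50), 2))   # ~0.34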

11. Sequential Testing and Optional Stopping

  • Repeatedly checking your p-value and stopping when you reach significance inflates the false positive rate far beyond the nominal alpha.
  • If you plan to monitor results over time, use formal sequential testing methods (e.g., group sequential designs, alpha spending functions) that control the overall Type I error rate.
  • Pre-committing to a fixed sample size before analysis is the simplest way to avoid this problem.
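
A small simulation (a sketch assuming a no-edge bettor who peeks at a one-sided z-test every 50 bets) shows how optional stopping inflates the false positive rate:

    import math
    import random

    def stops_significant(n_bets=2000, check_every=50, seed=None):
        """Simulate a no-edge bettor (true win rate 50%) who checks a one-sided
        z-test every `check_every` bets and stops at the first 'significant' result."""
        rng = random.Random(seed)
        z_crit = 1.645                     # nominal one-sided 5% critical value
        wins = 0
        for i in range(1, n_bets + 1):
            wins += rng.random() < 0.5
            if i % check_every == 0:
                z = (wins / i - 0.5) / math.sqrt(0.25 / i)
                if z > z_crit:
                    return True
        return False

    trials = 2000
    hits = sum(stops_significant(seed=s) for s in range(trials))
    print(f"false positive rate with optional stopping: {hits / trials:.1%}")
    # Well above the nominal 5%, even though each individual look uses alpha = 0.05.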

12. Practical Guidelines for the Sports Bettor

  • Track everything. Record every bet with date, sport, odds, stake, and result. You cannot evaluate what you do not measure.
  • Set a sample size target before testing. Do not test your system after 50 bets and conclude it works. Determine in advance how many bets are needed.
  • Test against the breakeven rate, not 50%. The question is not "Am I better than a coin flip?" but "Am I profitable after the vig?"
  • Look for consistency. A system that works across multiple sports, seasons, and conditions is more credible than one concentrated in a narrow subsample.
  • Be honest about how many things you tested. If you tried 10 strategies before finding one that works, your evidence is much weaker than if you pre-specified a single strategy.
  • Use confidence intervals, not just p-values. They tell you both the direction and the plausible magnitude of your edge.
  • Remember: absence of evidence is not evidence of absence. A non-significant result with a small sample does not prove you have no skill. It means you need more data.
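
As a closing sketch, a minimal bet log and a progress check against a pre-set sample size target; the field names and the 2,400-bet target are illustrative assumptions, not a prescribed schema.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class Bet:
        """One tracked wager; the fields mirror the record-keeping bullet above."""
        placed: date
        sport: str
        odds: int       # American odds, e.g. -110
        stake: float
        won: bool

    def summarize(log, target_n):
        """Report progress toward a pre-set sample size target and the raw win rate."""
        n = len(log)
        if n == 0:
            print("no bets logged yet")
            return
        wins = sum(b.won for b in log)
        print(f"bets: {n}/{target_n}   wins: {wins}   win rate: {wins / n:.3f}")
        if n < target_n:
            print("sample size target not reached -- keep logging before drawing conclusions")

    log = [Bet(date(2024, 1, 6), "NFL", -110, 100.0, True),
           Bet(date(2024, 1, 7), "NFL", -110, 100.0, False)]
    summarize(log, target_n=2400)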

These takeaways form the essential statistical foundation for evaluating any betting strategy, system, or track record. The tools introduced in this chapter — z-tests, chi-squared tests, confidence intervals, power analysis, and multiple testing corrections — are the primary instruments for separating signal from noise in sports betting.


End of Chapter 8 Key Takeaways