Chapter 8 Exercises: Hypothesis Testing and Statistical Significance
Part A: Conceptual Questions (8 Exercises)
Exercise A-1: Null and Alternative Hypotheses in Betting
A sports bettor claims they have a profitable system for betting on NBA totals (over/under). They provide you with their last 200 bets and ask you to evaluate whether their results demonstrate genuine skill.
- State the null hypothesis (H0) and alternative hypothesis (H1) for this scenario in both plain language and mathematical notation.
- Explain why we assume the null hypothesis is true until evidence suggests otherwise. Why is this particularly important in sports betting?
- If the bettor has a win rate of 52%, would you frame the alternative hypothesis as one-sided or two-sided? Justify your choice.
- How would the hypotheses change if you were testing whether the bettor's system is profitable (accounting for the vig) rather than simply testing their win rate?
Exercise A-2: Understanding P-Values
A researcher tests whether home teams in the English Premier League cover the spread more often than expected. They find a p-value of 0.03.
- Explain what this p-value means in precise statistical language. What common misinterpretation should be avoided?
- Does this p-value tell us the probability that home teams truly have an advantage? Why or why not?
- If the researcher had chosen a significance level of alpha = 0.01 before conducting the study, what would their conclusion be?
- Another researcher repeats the study with a larger dataset and finds a p-value of 0.08. Does this mean the first study was wrong? Discuss what might explain the discrepancy.
- Can a result be statistically significant but practically meaningless in a betting context? Provide a concrete example.
Exercise A-3: Type I and Type II Errors in Betting Strategy Evaluation
You are an analyst at a sportsbook evaluating whether a customer is a "sharp" bettor (someone with a genuine edge) versus a recreational bettor who has been lucky.
- Define Type I and Type II errors in the context of this specific scenario.
- What is the real-world consequence of each type of error for the sportsbook?
- If the sportsbook sets a very low significance level (alpha = 0.001), what happens to the rate of Type I errors? What happens to Type II errors?
- From the sportsbook's perspective, which type of error is more costly? How might this influence the choice of significance level?
- From the bettor's perspective (trying to prove their own skill), which error matters more?
Exercise A-4: Sample Size and Statistical Power
A bettor has been betting on NFL point spreads for three seasons (approximately 150 bets per season).
- After one season (150 bets) with a 55% win rate, they cannot reject the null hypothesis at alpha = 0.05. Explain intuitively why 150 bets may not be enough.
- Calculate the approximate standard error of the win rate for n = 150 bets under the null hypothesis (p = 0.50). What win rate would be needed to reject H0 at the 5% level?
- If the bettor truly has a 54% win rate (a genuine edge), approximately how many bets would they need to demonstrate statistical significance at the 5% level with 80% power?
- Why is this sample size problem particularly acute in sports betting compared to, say, medical trials?
- A bettor claims "I don't need statistics, I've been profitable for 10 years." Discuss the strengths and limitations of this argument from a hypothesis testing perspective.
Exercise A-5: The Multiple Testing Problem
A data analyst tests 20 different NFL betting systems simultaneously (e.g., bet on home underdogs, bet against teams on short rest, bet the under in cold weather, etc.).
- If all 20 systems have no real edge (all null hypotheses are true), what is the probability that at least one system appears significant at alpha = 0.05?
- Explain the concept of "data snooping" or "p-hacking" in the context of sports betting research.
- What is the family-wise error rate (FWER) and how does it differ from the individual test significance level?
- Describe the Bonferroni correction and apply it to this 20-test scenario. What would the adjusted significance level be?
- Why might the Bonferroni correction be overly conservative? What is an alternative approach?
- A bettor tells you: "I tested 50 different strategies and found 3 that are significant at the 5% level." Should you be impressed? Explain quantitatively.
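The family-wise error rate and the Bonferroni threshold discussed above can be checked numerically. The following is a minimal sketch that assumes the tests are independent, which the exercise does not state; the function names are illustrative and not part of the chapter.

```python
def family_wise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across n_tests
    independent tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests


def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of 'significant' results when no strategy has an edge."""
    return n_tests * alpha
```

Calling these with different numbers of tests shows how quickly the chance of a spurious "winner" grows as more systems are examined.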
Exercise A-6: One-Sided vs. Two-Sided Tests
Consider the following betting scenarios and determine whether a one-sided or two-sided test is more appropriate for each. Justify every answer.
- Testing whether a new predictive model for NBA games performs better than picking at random.
- Testing whether the "hot hand" effect exists in basketball free throw shooting (i.e., whether making a free throw changes the probability of making the next one).
- Testing whether a sportsbook's closing lines are perfectly calibrated (i.e., teams listed as -3 win by exactly 3 points on average).
- Testing whether your personal betting record over the past year shows genuine skill.
- Testing whether the introduction of legalized sports betting in a state has changed game attendance.
- Under what circumstances might choosing a one-sided test be considered "cheating" or intellectually dishonest?
Exercise A-7: Confidence Intervals vs. Hypothesis Tests
A bettor has won 268 out of 500 bets against the spread (53.6% win rate).
- Construct a 95% confidence interval for the bettor's true win rate.
- Does this confidence interval include 50%? What does this tell you about the result of a hypothesis test at the 5% significance level?
- Does this confidence interval include 52.4% (the approximate breakeven rate with standard -110 vig)? What is the practical significance of this observation?
- Explain why confidence intervals often provide more useful information than a simple "reject/fail to reject" decision.
- How would the confidence interval change if the bettor had the same win rate (53.6%) but over 2000 bets instead of 500?
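A Wald (normal-approximation) interval for a win rate can be sketched as follows. This is one common construction, assumed here because the exercise does not name a specific interval method; `wald_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(wins: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Normal-approximation confidence interval for a binomial proportion."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    p_hat = wins / n
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se
```

Because the standard error shrinks with the square root of n, quadrupling the number of bets at the same win rate roughly halves the interval width.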
Exercise A-8: Bayesian vs. Frequentist Approaches
Two analysts evaluate the same bettor who has won 540 out of 1000 bets (54% win rate).
Analyst A (frequentist) performs a z-test and reports a p-value. Analyst B (Bayesian) starts with a prior belief that most bettors are not skilled (prior centered at 50% with moderate uncertainty) and updates this prior with the observed data.
- What would Analyst A's p-value be? What would their conclusion be at alpha = 0.05?
- Describe qualitatively what Analyst B's posterior distribution would look like. Would it be centered at exactly 54%? Why or why not?
- If Analyst B used a very skeptical prior (strongly centered at 50%), how would this affect the posterior compared to using a flat (uninformative) prior?
- In what ways is the Bayesian approach more natural for evaluating betting skill? In what ways might the frequentist approach be preferred?
- How does the concept of "extraordinary claims require extraordinary evidence" relate to the Bayesian framework?
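Analyst B's updating step can be illustrated with a conjugate Beta-Binomial model. This is a sketch under an assumption: the exercise only says the prior is "centered at 50% with moderate uncertainty", so the Beta(50, 50) default below is a hypothetical choice, not a value from the chapter.

```python
def beta_posterior(wins: int, losses: int,
                   prior_a: float = 50.0, prior_b: float = 50.0):
    """Update a Beta(prior_a, prior_b) prior on the win rate with observed results.

    Returns the posterior parameters and the posterior mean.
    A symmetric prior (prior_a == prior_b) is centered at 50%; larger
    values encode stronger skepticism about skill.
    """
    post_a = prior_a + wins
    post_b = prior_b + losses
    posterior_mean = post_a / (post_a + post_b)
    return post_a, post_b, posterior_mean
```

Comparing a flat prior (`prior_a = prior_b = 1`) with a strongly skeptical one on the same record shows how the posterior mean is pulled from the observed win rate toward 50%.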
Part B: Calculation Exercises (7 Exercises)
Exercise B-1: Z-Test for Betting Records
A bettor has the following record betting on NFL sides at -110 odds:
- Total bets: 600
- Wins: 324
- Losses: 276
Perform the following calculations:
- Calculate the observed win rate and the expected win rate under the null hypothesis (no skill, p = 0.50).
- Calculate the standard error under the null hypothesis.
- Compute the z-statistic.
- Find the one-sided p-value (testing whether the bettor is better than random).
- Find the two-sided p-value.
- State your conclusion at alpha = 0.05 for both one-sided and two-sided tests.
- Calculate the 95% and 99% confidence intervals for the true win rate.
- Determine whether the bettor is profitable after accounting for the standard -110 vig (breakeven at approximately 52.38%).
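The steps above follow the usual one-sample z-test for a proportion. A minimal sketch, using only the Python standard library and an illustrative function name, might look like this:

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(wins: int, n: int, p0: float = 0.50):
    """One-sample z-test for a win rate against the null proportion p0.

    Returns (z, one_sided_p, two_sided_p). The one-sided p-value tests
    H1: true win rate > p0. The standard error uses p0, i.e. it is
    computed under the null hypothesis.
    """
    p_hat = wins / n
    se_null = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se_null
    normal = NormalDist()
    one_sided = 1 - normal.cdf(z)
    two_sided = 2 * (1 - normal.cdf(abs(z)))
    return z, one_sided, two_sided
```

Passing `p0=110/210` instead of the default tests the record against the -110 breakeven rate rather than against a coin flip.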
Exercise B-2: Required Sample Size Calculations
For each of the following scenarios, calculate the minimum number of bets required to achieve statistical significance at alpha = 0.05 with 80% power:
- A bettor with a true win rate of 55% against the spread (testing against p0 = 0.50).
- A bettor with a true win rate of 53% against the spread (testing against p0 = 0.50).
- A bettor with a true win rate of 52% against the spread (testing against p0 = 0.50).
- A totals bettor with a true win rate of 56% (testing against p0 = 0.50).
- Plot or describe the relationship between true win rate and required sample size. What pattern do you observe?
- A bettor wants to prove profitability (not just above 50%, but above the 52.38% breakeven). How does this change the required sample sizes for the first four scenarios above?
Use the formula: n = ((z_alpha + z_beta)^2 * p0 * (1 - p0)) / (p1 - p0)^2, where z_alpha = 1.645 (one-sided) or 1.96 (two-sided), z_beta = 0.842 for 80% power, p0 is the null proportion, and p1 is the true proportion.
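The formula translates directly into code. The sketch below computes the critical values from the normal distribution instead of hard-coding 1.645, 1.96, and 0.842; the function name and the rounding up to a whole bet are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def required_bets(p1: float, p0: float = 0.50, alpha: float = 0.05,
                  power: float = 0.80, two_sided: bool = False) -> int:
    """Minimum bets from n = (z_alpha + z_beta)^2 * p0 * (1 - p0) / (p1 - p0)^2."""
    if p1 == p0:
        raise ValueError("p1 must differ from p0")
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2 if two_sided else 1 - alpha)
    z_beta = normal.inv_cdf(power)
    n = (z_alpha + z_beta) ** 2 * p0 * (1 - p0) / (p1 - p0) ** 2
    return ceil(n)  # round up: a fractional bet cannot be placed
```

Since the edge `p1 - p0` is squared in the denominator, halving the edge roughly quadruples the required number of bets.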
Exercise B-3: Chi-Squared Test for Betting Market Efficiency
The following table shows the results of 1000 NFL games categorized by the point spread and the actual outcome (cover or not cover):
| Spread Range | Games | Covers | Expected Covers (50%) |
|---|---|---|---|
| 1 to 3 | 350 | 185 | 175 |
| 3.5 to 6.5 | 300 | 148 | 150 |
| 7 to 10 | 200 | 94 | 100 |
| 10.5 to 14 | 100 | 56 | 50 |
| 14.5+ | 50 | 29 | 25 |
- State the null hypothesis for this chi-squared test.
- Calculate the chi-squared test statistic using the formula: chi2 = sum((O - E)^2 / E).
- How many degrees of freedom does this test have?
- Find the critical value at alpha = 0.05 for this number of degrees of freedom.
- What is the approximate p-value? What is your conclusion?
- Which spread range contributes most to the chi-squared statistic? What might this indicate about market efficiency?
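The test statistic in the formula above can be computed with a few lines of Python. This sketch also returns each category's contribution, which helps with the last question. It implements only the sum as written; how to lay out the observed and expected counts (for example, whether non-covers get their own cells) is left to the exercise, and the function name is illustrative.

```python
def chi_squared_statistic(observed, expected):
    """Return (chi2, contributions) where chi2 = sum((O - E)^2 / E)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    return sum(contributions), contributions
```

The Python standard library has no chi-squared distribution, so the critical value and p-value would come from a table or from a statistics package.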
Exercise B-4: Comparing Two Proportions
You want to test whether a bettor performs differently on favorites versus underdogs:
- Favorites: 180 wins out of 350 bets (51.4%)
- Underdogs: 165 wins out of 280 bets (58.9%)
- State the null and alternative hypotheses.
- Calculate the pooled proportion.
- Calculate the standard error of the difference in proportions.
- Compute the z-statistic for the difference.
- Find the two-sided p-value.
- Construct a 95% confidence interval for the difference in win rates.
- Is there statistically significant evidence that the bettor performs differently on favorites vs. underdogs? Discuss both statistical and practical significance.
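The pooled two-proportion z-test outlined in these steps can be sketched as below. The helper name is hypothetical; the pooled standard error is the standard choice when the null hypothesis is that both win rates are equal.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(wins1: int, n1: int, wins2: int, n2: int):
    """Pooled z-test for H0: p1 == p2. Returns (z, two_sided_p)."""
    p1, p2 = wins1 / n1, wins2 / n2
    pooled = (wins1 + wins2) / (n1 + n2)
    # Standard error of the difference under H0 (common proportion).
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    two_sided_p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, two_sided_p
```

For the confidence interval of the difference, the unpooled standard error `sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)` is normally used instead, because the interval does not assume the two rates are equal.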
Exercise B-5: Sequential Testing and Stopping Rules
A bettor decides to track their results and stop betting on a system as soon as either:
- They achieve a statistically significant positive result (p < 0.05), or
- They have placed 500 bets without significance.
After every 50 bets, they calculate a running p-value. Their results:
| Bets | Cumulative Wins | Cumulative Win % | Running p-value |
|---|---|---|---|
| 50 | 29 | 58.0% | 0.129 |
| 100 | 56 | 56.0% | 0.115 |
| 150 | 83 | 55.3% | 0.074 |
| 200 | 112 | 56.0% | 0.028 |
| 250 | 136 | 54.4% | 0.048 |
- Verify the p-value calculation for n = 200 bets (112 wins).
- The bettor stops at 250 bets, claiming significance. What is wrong with this approach?
- Explain the concept of "optional stopping" and why it inflates Type I error rates.
- If the bettor had pre-committed to exactly 500 bets, what would the significance threshold be?
- Describe how a sequential testing procedure (such as the O'Brien-Fleming method or alpha spending function) could be properly applied here.
- Calculate the adjusted significance thresholds using the Pocock boundary for 5 interim analyses.
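The effect of optional stopping can be demonstrated by simulation. The sketch below simulates bettors with no edge (true win rate 50%), checks a one-sided z-test every 50 bets, and compares how often "significance" is seen at any look against how often it is seen only at the final, pre-committed sample size. The look schedule, simulation count, and function names are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def one_sided_p(wins: int, n: int, p0: float = 0.5) -> float:
    """One-sided z-test p-value for H1: win rate > p0."""
    z = (wins / n - p0) / sqrt(p0 * (1 - p0) / n)
    return 1 - NormalDist().cdf(z)


def simulate_optional_stopping(n_sims: int = 2000, max_bets: int = 500,
                               look_every: int = 50, alpha: float = 0.05,
                               seed: int = 0) -> tuple[float, float]:
    """Return (false positive rate with peeking, rate with a fixed sample size)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        wins = 0
        flagged = False
        for bet in range(1, max_bets + 1):
            wins += rng.random() < 0.5  # a coin-flip bettor with no skill
            # A real bettor would stop at the first significant look; we keep
            # going so the same path also gives the fixed-sample result.
            if bet % look_every == 0 and one_sided_p(wins, bet) < alpha:
                flagged = True
        peeking_hits += flagged
        fixed_hits += one_sided_p(wins, max_bets) < alpha
    return peeking_hits / n_sims, fixed_hits / n_sims
```

Running `simulate_optional_stopping()` and comparing the two returned rates shows how repeated looks inflate the Type I error rate relative to a single pre-committed test.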
Exercise B-6: Power Analysis for Betting Research
A researcher wants to study whether NFL home underdogs cover the spread more than 50% of the time. Historical data suggests the true cover rate might be 52.5%.
- Calculate the power of a test with n = 500 games at alpha = 0.05 (one-sided) to detect a true proportion of 52.5%.
- Calculate the power for n = 1000, 2000, and 5000 games.
- What sample size is needed to achieve 90% power?
- The researcher has access to 15 years of NFL data (approximately 4000 games). Is this sufficient to detect a 52.5% cover rate with 80% power?
- Create a power curve showing power as a function of sample size for true proportions of 51%, 52%, 53%, 54%, and 55%.
- Discuss the practical implications: if the effect is real but requires 5000+ games to detect, what does this mean for the individual bettor?
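Power for a one-sided test of a proportion can be approximated with the normal distribution. The sketch below is one standard approximation, not necessarily the exact method the chapter intends; `power_one_sided` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided(n: int, p1: float, p0: float = 0.50,
                    alpha: float = 0.05) -> float:
    """Approximate power of a one-sided z-test (H1: p > p0) when the true rate is p1."""
    normal = NormalDist()
    # Smallest observed win rate that rejects H0 at level alpha.
    critical_rate = p0 + normal.inv_cdf(1 - alpha) * sqrt(p0 * (1 - p0) / n)
    # Probability of exceeding that rate when the true win rate is p1.
    se_alt = sqrt(p1 * (1 - p1) / n)
    return 1 - normal.cdf((critical_rate - p1) / se_alt)
```

Evaluating this function over a grid of sample sizes for each true proportion gives the points needed for a power curve.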
Exercise B-7: Multiple Testing Correction
A researcher tests 15 different betting angles in college football. The p-values obtained are:
0.003, 0.012, 0.024, 0.031, 0.048, 0.055, 0.067, 0.089, 0.112, 0.156, 0.234, 0.345, 0.456, 0.678, 0.891
- How many of these are significant at the uncorrected alpha = 0.05 level?
- Apply the Bonferroni correction. What is the adjusted significance threshold? How many results remain significant?
- Apply the Holm-Bonferroni (step-down) procedure. Show each step and identify which results remain significant.
- Apply the Benjamini-Hochberg procedure to control the False Discovery Rate at 5%. Show each step and identify which results are significant.
- Compare the results of all three correction methods. Which is most conservative? Which is most liberal?
- In the context of sports betting research, which correction method would you recommend and why?
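The two step-wise procedures can be sketched compactly. Both functions below return one True/False decision per input p-value, in the original order; the function names are illustrative.

```python
def holm_bonferroni(p_values, alpha: float = 0.05):
    """Holm step-down procedure: controls the family-wise error rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank 0 is the smallest p-value
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first failure; all larger p-values are retained
    return reject


def benjamini_hochberg(p_values, alpha: float = 0.05):
    """Benjamini-Hochberg step-up procedure: controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_passing_rank = rank
    reject = [False] * m
    # Reject every hypothesis ranked at or below the largest passing rank.
    for idx in order[:largest_passing_rank]:
        reject[idx] = True
    return reject
```

Holm stops at the first p-value that misses its threshold, while Benjamini-Hochberg looks for the largest rank that passes and rejects everything up to it, which is why it typically rejects at least as many hypotheses.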
Part C: Programming Exercises (5 Exercises)
Exercise C-1: Hypothesis Testing Framework
Build a comprehensive Python class called BettingHypothesisTest that implements the following:
Requirements:
- Accept betting records as input (wins, losses, pushes, odds for each bet)
- Implement a z-test for proportions (one-sided and two-sided)
- Implement a t-test for profit/loss per bet (one-sided and two-sided)
- Calculate exact binomial test p-values
- Generate confidence intervals (both Wald and Wilson score intervals)
- Produce a summary report including:
  - Test statistics and p-values
  - Confidence intervals
  - Effect size measures
  - Plain-language interpretation of results
- Handle edge cases (small samples, extreme win rates, all wins/all losses)
Testing:
- Test with a simulated bettor who has a 54% true win rate over 500 bets
- Test with a simulated bettor who has a 50% true win rate over 500 bets (should usually fail to reject)
- Test with a small sample (30 bets) to observe wide confidence intervals
Exercise C-2: Power and Sample Size Calculator
Build a Python tool called BettingSampleSizeCalculator that performs power analysis specifically for sports betting scenarios.
Requirements:
- Calculate required sample size given: significance level, power, null proportion, alternative proportion
- Calculate power given: significance level, sample size, null proportion, alternative proportion
- Handle both one-sided and two-sided tests
- Include preset scenarios for common betting situations:
  - "Can this bettor beat the no-vig line?" (p0 = 0.50)
  - "Can this bettor beat the vig?" (p0 = 0.5238 for -110 lines)
  - "Can this bettor beat the vig on heavy juice?" (p0 = 0.5349 for -115 lines)
- Generate power curves (plot power vs. sample size for various true win rates)
- Generate a "years to significance" table: given bets per week and true win rate, how many years would it take to achieve significance?
Exercise C-3: Multiple Testing Correction Tool
Build a Python class called MultipleTestingCorrector that implements various correction methods for multiple hypothesis testing.
Requirements:
- Bonferroni correction
- Holm-Bonferroni (step-down) procedure
- Benjamini-Hochberg (BH) procedure for FDR control
- Benjamini-Yekutieli procedure (for dependent tests)
- Simulation of "data snooping": generate N strategies with no edge, show how many appear significant before and after correction
- Visualization: plot raw p-values vs. adjusted p-values for each method
- Summary table showing which hypotheses are rejected under each method
Data Snooping Simulation:
- Simulate 100 betting strategies, each with 200 bets at true 50% win rate
- Show the distribution of p-values (should be approximately uniform)
- Demonstrate how many strategies appear "significant" by chance
- Apply corrections and show the reduction in false discoveries
Exercise C-4: Comprehensive Betting Record Significance Tester
Build an end-to-end tool called BettingRecordAnalyzer that takes a bettor's complete record and produces a thorough statistical evaluation.
Requirements:
- Input: CSV file or list of bets with columns (date, sport, bet_type, odds, stake, result)
- Tests to perform:
  - Overall win rate significance test
  - Profitability test (is ROI significantly greater than zero?)
  - Subsample consistency (does skill persist across time periods?)
  - Sport-by-sport breakdown with multiple testing correction
  - Bet type breakdown (spreads, totals, moneylines) with correction
  - Streak analysis (are winning/losing streaks consistent with randomness?)
  - Closing line value analysis (if closing odds are available)
- Output: formatted report with visualizations
- Include a "skepticism score" that summarizes the overall evidence for skill
Exercise C-5: Monte Carlo Hypothesis Testing Simulator
Build a simulation tool that demonstrates key hypothesis testing concepts through Monte Carlo methods.
Requirements:
- Simulate the distribution of test statistics under H0 to verify theoretical p-values
- Demonstrate Type I error rates: run 10,000 experiments with no true effect and count false positives
- Demonstrate Type II error rates: run 10,000 experiments with a known true effect and count false negatives
- Demonstrate the multiple testing problem: simulate testing many strategies simultaneously
- Demonstrate optional stopping bias: simulate the effect of peeking at results
- Demonstrate the relationship between sample size and power empirically
- All simulations should be configurable (sample sizes, true proportions, significance levels)
- Produce publication-quality plots for each demonstration
Part D: Analysis Exercises (5 Exercises)
Exercise D-1: Real-World Betting Record Evaluation
Below is a simulated record for a sports bettor over three seasons:
| Season | Sport | Bets | Wins | Losses | Avg Odds | Profit/Loss |
|---|---|---|---|---|---|---|
| 2022 | NFL | 180 | 98 | 82 | -110 | +$1,236 |
| 2022 | NBA | 420 | 218 | 202 | -108 | +$2,845 |
| 2022 | MLB | 300 | 152 | 148 | +102 | +$1,567 |
| 2023 | NFL | 175 | 90 | 85 | -110 | +$312 |
| 2023 | NBA | 450 | 225 | 225 | -110 | -$2,250 |
| 2023 | MLB | 280 | 148 | 132 | +105 | +$3,920 |
| 2024 | NFL | 190 | 105 | 85 | -110 | +$3,045 |
| 2024 | NBA | 400 | 212 | 188 | -110 | +$2,618 |
| 2024 | MLB | 310 | 165 | 145 | +100 | +$3,300 |
- For each sport, calculate the overall win rate across all three seasons.
- Test whether the overall win rate for each sport is significantly above 50%.
- Test whether the overall win rate for each sport is significantly above the breakeven threshold given the average odds.
- Apply multiple testing correction (since you are testing three sports simultaneously).
- Analyze whether the bettor's performance is consistent across seasons or shows significant variation.
- What is your overall assessment? Is this bettor skilled, lucky, or is the evidence inconclusive?
Exercise D-2: Market Efficiency Analysis
The following data shows the ATS (against the spread) records for NFL teams grouped by point spread magnitude over 10 seasons:
| Category | Total Games | Home Covers | Away Covers | Push |
|---|---|---|---|---|
| Pick'em (0) | 120 | 64 | 52 | 4 |
| Small favorites (1-3) | 580 | 275 | 290 | 15 |
| Medium favorites (3.5-7) | 820 | 398 | 405 | 17 |
| Large favorites (7.5-10) | 350 | 182 | 160 | 8 |
| Very large favorites (10.5+) | 230 | 125 | 98 | 7 |
- For each category, test whether the home team covers at a rate significantly different from 50% (excluding pushes).
- Is there a trend in home team cover rates as the spread increases? Perform a test for trend.
- The "large favorites" category shows a home cover rate above 53%. Is this significant? What is the p-value?
- Apply the Benjamini-Hochberg correction across all five categories. Do any categories show significant deviations from 50%?
- Discuss the practical implications. Even if some results are statistically significant, are they exploitable given the vig?
- What additional data or tests would you want to conduct before concluding that a market inefficiency exists?
Exercise D-3: Before-and-After Study
A sportsbook changes its line-setting algorithm. You have data from before and after the change:
Before (old algorithm): 2000 games, bettors collectively won 1020 (51.0%)
After (new algorithm): 1500 games, bettors collectively won 735 (49.0%)
- Test whether the bettor win rate decreased significantly after the algorithm change.
- Construct a 95% confidence interval for the change in bettor win rate.
- From the sportsbook's perspective, estimate the additional profit per 1000 bets generated by the new algorithm (assuming $100 average bet at -110).
- Could the difference be due to factors other than the algorithm change? List at least three confounding variables.
- Design a more rigorous study to evaluate the algorithm change. What would an ideal experimental design look like?
- If the sportsbook wants to detect a 1-percentage-point improvement in their hold with 90% power, how many games do they need to observe with each algorithm?
Exercise D-4: Publication Bias and the File Drawer Problem
A sports analytics website publishes 10 articles per year, each presenting a "profitable betting system." Assume that each article tests one system using historical data, that for every published article 9 systems were tested but not published because they didn't show significant results, and that the significance level used is alpha = 0.05.
- If none of the tested systems have a real edge, how many would you expect to appear significant per year?
- What proportion of the published systems are likely to be false positives?
- This is related to the "positive predictive value" of a statistical test. Express this mathematically using Bayes' theorem, given a prior probability pi that any given system has a real edge.
- If you believe 5% of tested systems might have a real edge, what is the probability that a published significant result represents a truly effective system?
- How would requiring alpha = 0.01 change the positive predictive value?
- What practices should consumers of sports betting research adopt to protect themselves from publication bias?
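The positive predictive value follows from Bayes' theorem once a power value is fixed. The exercise does not specify the power of the tests, so the 80% default in the sketch below is an assumption, as is the function name.

```python
def positive_predictive_value(prior: float, alpha: float = 0.05,
                              power: float = 0.80) -> float:
    """P(system has a real edge | test is significant).

    prior: probability that a tested system truly has an edge (pi).
    alpha: false positive rate for systems with no edge.
    power: probability a truly effective system tests significant (assumed).
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)
```

Holding the prior fixed and lowering `alpha` shrinks the false positive term, so the predictive value rises, although in practice a stricter threshold also reduces power unless the sample size grows.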
Exercise D-5: Evaluating a Tipster Service
A sports betting tipster claims to have a verified 56% win rate over 400 bets at average odds of -110.
- Test this claim statistically. Is 56% over 400 bets significant?
- The tipster has been active for 2 years. How many total picks would they likely have made? Does the 400-bet sample represent their complete record or a subset?
- If the tipster selected their best 400 out of 800 total bets to advertise, how would this affect the interpretation?
- Calculate the expected win rate of the "best half" of bets from a bettor with no skill (true p = 0.50) when selecting the best 400 out of 800 bets. (Hint: think about order statistics or simulation.)
- The tipster offers a subscription for $200/month. If you bet $100 per game and follow all their picks (approximately 20 per month), what win rate do you need just to cover the subscription cost plus vig?
- Design a prospective evaluation plan for this tipster. How many months of tracked picks would you need, and what criteria would you use to assess their skill?
Part E: Research Exercises (5 Exercises)
Exercise E-1: Literature Review on Betting Market Efficiency
Conduct a literature review on hypothesis testing in the context of betting market efficiency.
- Find and summarize at least three academic papers that test the efficient market hypothesis in sports betting markets.
- What statistical methods are most commonly used? (z-tests, chi-squared, regression, etc.)
- What sample sizes are typical in these studies?
- How do the papers address the multiple testing problem?
- What are the most commonly reported "inefficiencies," and do they survive out-of-sample testing?
- Write a 500-word synthesis of the current state of knowledge on this topic.
Exercise E-2: Replication Study Design
Design a replication study for the following published finding: "NFL home underdogs of 7 or more points cover the spread 55% of the time."
- State the precise null and alternative hypotheses for your replication.
- Conduct a power analysis: how many games do you need to replicate this finding with 80% power at alpha = 0.05?
- Describe your data collection plan (data sources, time period, inclusion/exclusion criteria).
- Pre-register your analysis plan: specify exactly what tests you will run, what corrections you will apply, and what criteria you will use to evaluate the replication.
- Discuss potential threats to validity (changes in the market over time, data quality issues, definitional differences).
- What would a "successful" replication look like? What would a "failed" replication tell us?
Exercise E-3: Monte Carlo Study of Test Properties
Design and conduct a Monte Carlo simulation study to investigate the following question: "When evaluating sports bettors, how does the choice of test statistic affect the probability of correctly identifying skilled bettors?"
Compare the following approaches:
1. Z-test for proportions (win rate vs. 50%)
2. T-test for profit per bet (mean profit vs. 0)
3. Exact binomial test
4. Bootstrap test
5. Bayesian test with a skeptical prior
For each approach, estimate:
- Type I error rate (using simulated 50% bettors)
- Power (using simulated 54% bettors)
- Power for a "realistic" edge (52% bettors)
- Sensitivity to model misspecification (e.g., varying odds across bets)
Write up your findings in a 500-word report.
Exercise E-4: Historical Analysis of a Betting System
Choose one well-known betting system or angle (examples: bet against the public, fade the Monday Night Football favorite, bet the under in September NFL games).
- Clearly state the system's rules and the hypotheses you will test.
- Collect or simulate historical data for at least 5 seasons.
- Perform a rigorous hypothesis test, including:
  - Primary test with pre-specified significance level
  - Multiple testing correction if you examine sub-categories
  - Robustness checks (different time periods, different sub-samples)
- Conduct an out-of-sample test: use the first half of your data to identify the system and the second half to test it.
- Report your results following best practices (effect sizes, confidence intervals, not just p-values).
- Discuss whether the system would have been profitable after transaction costs (vig).
Exercise E-5: Developing a Personal Hypothesis Testing Protocol
Create a personal statistical protocol that you will use to evaluate any future betting strategy.
Your protocol should include:
- Pre-registration template: A form you fill out before testing any strategy, including hypotheses, sample size justification, analysis plan, and stopping rules.
- Significance standards: Your chosen significance level and justification. Consider whether a single threshold (like 0.05) is appropriate or whether a tiered approach (suggestive, significant, highly significant) is better.
- Multiple testing policy: How you will handle the fact that you may test many strategies over your career.
- Minimum sample sizes: A table of minimum bets required before you will even consider a result meaningful, for different assumed edges.
- Reporting standards: What information you will record and report for every strategy evaluation.
- Decision criteria: Clear rules for when you will begin betting real money on a strategy, when you will increase stakes, and when you will abandon a strategy.
- Annual review process: How you will periodically re-evaluate your active strategies.
Write this protocol as a 1-2 page document that you could actually use in practice.
End of Chapter 8 Exercises