Chapter 8 Exercises: Hypothesis Testing and Statistical Significance


Part A: Conceptual Questions (8 Exercises)

Exercise A-1: Null and Alternative Hypotheses in Betting

A sports bettor claims they have a profitable system for betting on NBA totals (over/under). They provide you with their last 200 bets and ask you to evaluate whether their results demonstrate genuine skill.

  1. State the null hypothesis (H0) and alternative hypothesis (H1) for this scenario in both plain language and mathematical notation.
  2. Explain why we assume the null hypothesis is true until evidence suggests otherwise. Why is this particularly important in sports betting?
  3. If the bettor has a win rate of 52%, would you frame the alternative hypothesis as one-sided or two-sided? Justify your choice.
  4. How would the hypotheses change if you were testing whether the bettor's system is profitable (accounting for the vig) rather than simply testing their win rate?

Exercise A-2: Understanding P-Values

A researcher tests whether home teams in the English Premier League cover the spread more often than expected. They find a p-value of 0.03.

  1. Explain what this p-value means in precise statistical language. What common misinterpretation should be avoided?
  2. Does this p-value tell us the probability that home teams truly have an advantage? Why or why not?
  3. If the researcher had chosen a significance level of alpha = 0.01 before conducting the study, what would their conclusion be?
  4. Another researcher repeats the study with a larger dataset and finds a p-value of 0.08. Does this mean the first study was wrong? Discuss what might explain the discrepancy.
  5. Can a result be statistically significant but practically meaningless in a betting context? Provide a concrete example.

Exercise A-3: Type I and Type II Errors in Betting Strategy Evaluation

You are an analyst at a sportsbook evaluating whether a customer is a "sharp" bettor (someone with a genuine edge) versus a recreational bettor who has been lucky.

  1. Define Type I and Type II errors in the context of this specific scenario.
  2. What is the real-world consequence of each type of error for the sportsbook?
  3. If the sportsbook sets a very low significance level (alpha = 0.001), what happens to the rate of Type I errors? What happens to Type II errors?
  4. From the sportsbook's perspective, which type of error is more costly? How might this influence the choice of significance level?
  5. From the bettor's perspective (trying to prove their own skill), which error matters more?

Exercise A-4: Sample Size and Statistical Power

A bettor has been betting on NFL point spreads for three seasons (approximately 150 bets per season).

  1. After one season (150 bets) with a 55% win rate, they cannot reject the null hypothesis at alpha = 0.05. Explain intuitively why 150 bets may not be enough.
  2. Calculate the approximate standard error of the win rate for n = 150 bets under the null hypothesis (p = 0.50). What win rate would be needed to reject H0 at the 5% level?
  3. If the bettor truly has a 54% win rate (a genuine edge), approximately how many bets would they need to demonstrate statistical significance at the 5% level with 80% power?
  4. Why is this sample size problem particularly acute in sports betting compared to, say, medical trials?
  5. A bettor claims "I don't need statistics, I've been profitable for 10 years." Discuss the strengths and limitations of this argument from a hypothesis testing perspective.

Exercise A-5: The Multiple Testing Problem

A data analyst tests 20 different NFL betting systems simultaneously (e.g., bet on home underdogs, bet against teams on short rest, bet the under in cold weather, etc.).

  1. If all 20 systems have no real edge (all null hypotheses are true), what is the probability that at least one system appears significant at alpha = 0.05?
  2. Explain the concept of "data snooping" or "p-hacking" in the context of sports betting research.
  3. What is the family-wise error rate (FWER) and how does it differ from the individual test significance level?
  4. Describe the Bonferroni correction and apply it to this 20-test scenario. What would the adjusted significance level be?
  5. Why might the Bonferroni correction be overly conservative? What is an alternative approach?
  6. A bettor tells you: "I tested 50 different strategies and found 3 that are significant at the 5% level." Should you be impressed? Explain quantitatively.
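
For checking hand calculations in this exercise, the short sketch below computes the family-wise error rate and the Bonferroni threshold. It assumes the tests are independent, which the exercise does not state; the function names are illustrative only.

```python
def family_wise_error_rate(alpha: float, m: int) -> float:
    """P(at least one false positive) over m independent tests when every null is true."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / m


def expected_false_positives(alpha: float, m: int) -> float:
    """Expected number of 'significant' results among m true nulls."""
    return alpha * m


if __name__ == "__main__":
    print(family_wise_error_rate(0.05, 20))
    print(bonferroni_threshold(0.05, 20))
    print(expected_false_positives(0.05, 50))
```

Comparing the expected number of false positives with the number of "discoveries" a bettor reports is one quick way to approach question 6.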

Exercise A-6: One-Sided vs. Two-Sided Tests

Consider the following betting scenarios and determine whether a one-sided or two-sided test is more appropriate for each. Justify every answer.

  1. Testing whether a new predictive model for NBA games performs better than picking at random.
  2. Testing whether the "hot hand" effect exists in basketball free throw shooting (i.e., whether making a free throw changes the probability of making the next one).
  3. Testing whether a sportsbook's closing lines are perfectly calibrated (i.e., teams listed as -3 win by exactly 3 points on average).
  4. Testing whether your personal betting record over the past year shows genuine skill.
  5. Testing whether the introduction of legalized sports betting in a state has changed game attendance.
  6. Under what circumstances might choosing a one-sided test be considered "cheating" or intellectually dishonest?

Exercise A-7: Confidence Intervals vs. Hypothesis Tests

A bettor has won 268 out of 500 bets against the spread (53.6% win rate).

  1. Construct a 95% confidence interval for the bettor's true win rate.
  2. Does this confidence interval include 50%? What does this tell you about the result of a hypothesis test at the 5% significance level?
  3. Does this confidence interval include 52.4% (the approximate breakeven rate with standard -110 vig)? What is the practical significance of this observation?
  4. Explain why confidence intervals often provide more useful information than a simple "reject/fail to reject" decision.
  5. How would the confidence interval change if the bettor had the same win rate (53.6%) but over 2000 bets instead of 500?
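
A Wald (normal-approximation) interval is one common way to build the confidence interval in question 1. The sketch below is a minimal version, assuming the Wald form is acceptable here; the exercise does not prescribe a specific interval, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(wins: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Normal-approximation confidence interval for a win rate."""
    p_hat = wins / n
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


if __name__ == "__main__":
    low, high = wald_interval(268, 500)
    print(f"95% CI: ({low:.4f}, {high:.4f})")
```

Because the half-width scales with 1/sqrt(n), calling the function with the same win rate and four times as many bets halves the width of the interval.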

Exercise A-8: Bayesian vs. Frequentist Approaches

Two analysts evaluate the same bettor who has won 540 out of 1000 bets (54% win rate).

Analyst A (frequentist) performs a z-test and reports a p-value. Analyst B (Bayesian) starts with a prior belief that most bettors are not skilled (prior centered at 50% with moderate uncertainty) and updates this prior with the observed data.

  1. What would Analyst A's p-value be? What would their conclusion be at alpha = 0.05?
  2. Describe qualitatively what Analyst B's posterior distribution would look like. Would it be centered at exactly 54%? Why or why not?
  3. If Analyst B used a very skeptical prior (strongly centered at 50%), how would this affect the posterior compared to using a flat (uninformative) prior?
  4. In what ways is the Bayesian approach more natural for evaluating betting skill? In what ways might the frequentist approach be preferred?
  5. How does the concept of "extraordinary claims require extraordinary evidence" relate to the Bayesian framework?
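
To explore questions 2 and 3 numerically, a conjugate Beta prior is a convenient choice. The sketch below assumes a Beta(a, b) prior on the win rate; the specific prior strengths (Beta(50, 50) as "moderate", Beta(500, 500) as "very skeptical", Beta(1, 1) as flat) are illustrative assumptions, not values given in the exercise.

```python
from math import sqrt


def beta_posterior(wins: int, losses: int, prior_a: float, prior_b: float):
    """Update a Beta(prior_a, prior_b) prior with binomial data.

    Returns the posterior parameters, posterior mean, and posterior standard deviation.
    """
    a = prior_a + wins
    b = prior_b + losses
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return a, b, mean, sd


if __name__ == "__main__":
    # Hypothetical priors, all centered at 50% with different strengths.
    for label, (pa, pb) in {"flat": (1, 1), "moderate": (50, 50), "skeptical": (500, 500)}.items():
        _, _, mean, sd = beta_posterior(540, 460, pa, pb)
        print(f"{label:>9}: posterior mean = {mean:.4f}, sd = {sd:.4f}")
```

The stronger the prior, the further the posterior mean is pulled from the observed 54% back toward 50%.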

Part B: Calculation Exercises (7 Exercises)

Exercise B-1: Z-Test for Betting Records

A bettor has the following record betting on NFL sides at -110 odds:

  • Total bets: 600
  • Wins: 324
  • Losses: 276

Perform the following calculations:

  1. Calculate the observed win rate and the expected win rate under the null hypothesis (no skill, p = 0.50).
  2. Calculate the standard error under the null hypothesis.
  3. Compute the z-statistic.
  4. Find the one-sided p-value (testing whether the bettor is better than random).
  5. Find the two-sided p-value.
  6. State your conclusion at alpha = 0.05 for both one-sided and two-sided tests.
  7. Calculate the 95% and 99% confidence intervals for the true win rate.
  8. Determine whether the bettor is profitable after accounting for the standard -110 vig (breakeven at approximately 52.38%).
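
The steps in questions 1 through 5 can be checked with a few lines of code. This is a minimal sketch of a one-sample z-test for a proportion, using the standard error under the null hypothesis as the exercise specifies; the function name and return layout are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(wins: int, n: int, p0: float = 0.50):
    """One-sample z-test for a win rate against the null proportion p0.

    Returns (z, one-sided upper-tail p-value, two-sided p-value).
    """
    p_hat = wins / n
    se_null = sqrt(p0 * (1 - p0) / n)  # standard error assuming H0 is true
    z = (p_hat - p0) / se_null
    normal = NormalDist()
    p_one_sided = 1.0 - normal.cdf(z)
    p_two_sided = 2.0 * (1.0 - normal.cdf(abs(z)))
    return z, p_one_sided, p_two_sided


if __name__ == "__main__":
    z, p1, p2 = proportion_z_test(324, 600)
    print(f"z = {z:.4f}, one-sided p = {p1:.4f}, two-sided p = {p2:.4f}")
    # Same record tested against the -110 breakeven rate instead of 50%.
    print(proportion_z_test(324, 600, p0=110 / 210))
```

Passing p0 = 110/210 instead of 0.50 turns the same function into a test against the breakeven rate in question 8.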

Exercise B-2: Required Sample Size Calculations

For each of the following scenarios, calculate the minimum number of bets required to achieve statistical significance at alpha = 0.05 with 80% power:

  1. A bettor with a true win rate of 55% against the spread (testing against p0 = 0.50).
  2. A bettor with a true win rate of 53% against the spread (testing against p0 = 0.50).
  3. A bettor with a true win rate of 52% against the spread (testing against p0 = 0.50).
  4. A totals bettor with a true win rate of 56% (testing against p0 = 0.50).
  5. Plot or describe the relationship between true win rate and required sample size. What pattern do you observe?
  6. A bettor wants to prove profitability (not just above 50%, but above the 52.38% breakeven). How does this change the required sample sizes for scenarios 1-4?

Use the formula: n = ((z_alpha + z_beta)^2 * p0 * (1 - p0)) / (p1 - p0)^2, where z_alpha = 1.645 (one-sided) or 1.96 (two-sided), z_beta = 0.842 for 80% power, p0 is the null proportion, and p1 is the true proportion.
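
The formula above translates directly into code. The sketch below implements it exactly as written (a normal approximation that uses the null variance only), deriving z_alpha and z_beta from the requested alpha and power rather than hard-coding them; the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_bets(p1: float, p0: float = 0.50, alpha: float = 0.05,
                  power: float = 0.80, two_sided: bool = False) -> int:
    """Minimum n from n = (z_alpha + z_beta)^2 * p0 * (1 - p0) / (p1 - p0)^2."""
    if p1 == p0:
        raise ValueError("p1 must differ from p0; no effect means no finite sample size")
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_beta = normal.inv_cdf(power)
    n = (z_alpha + z_beta) ** 2 * p0 * (1 - p0) / (p1 - p0) ** 2
    return ceil(n)  # round up: a fractional bet is not possible


if __name__ == "__main__":
    for true_rate in (0.52, 0.53, 0.55, 0.56):
        print(true_rate, required_bets(true_rate), required_bets(true_rate, p0=110 / 210))
```

The second column of output uses p0 = 110/210 (the -110 breakeven rate), which is the change question 6 asks about.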


Exercise B-3: Chi-Squared Test for Betting Market Efficiency

The following table shows the results of 1000 NFL games categorized by the point spread and the actual outcome (cover or not cover):

Spread Range  Games  Covers  Expected Covers (50%)
1 to 3          350     185                    175
3.5 to 6.5      300     148                    150
7 to 10         200      94                    100
10.5 to 14      100      56                     50
14.5+            50      29                     25

  1. State the null hypothesis for this chi-squared test.
  2. Calculate the chi-squared test statistic using the formula: chi2 = sum((O - E)^2 / E).
  3. How many degrees of freedom does this test have?
  4. Find the critical value at alpha = 0.05 for this number of degrees of freedom.
  5. What is the approximate p-value? What is your conclusion?
  6. Which spread range contributes most to the chi-squared statistic? What might this indicate about market efficiency?
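
The statistic in question 2 can be computed with a small helper. The sketch below applies the formula exactly as written to whatever observed and expected counts it is given; here it is fed the Covers column from the table. Whether the matching non-cover cells should also be included, and how that affects the degrees of freedom, is part of the exercise and is not decided by this code.

```python
def chi_squared_terms(observed, expected):
    """Per-cell contributions (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return [(o - e) ** 2 / e for o, e in zip(observed, expected)]


def chi_squared_statistic(observed, expected) -> float:
    """Sum of (O - E)^2 / E over the supplied cells."""
    return sum(chi_squared_terms(observed, expected))


if __name__ == "__main__":
    covers = [185, 148, 94, 56, 29]            # Covers column from the table
    expected_covers = [175, 150, 100, 50, 25]  # Expected Covers (50%) column
    for term in chi_squared_terms(covers, expected_covers):
        print(round(term, 4))
    print("total:", round(chi_squared_statistic(covers, expected_covers), 4))
```

Printing the individual terms makes it easy to see which spread range dominates the total, as question 6 asks.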

Exercise B-4: Comparing Two Proportions

You want to test whether a bettor performs differently on favorites versus underdogs:

  • Favorites: 180 wins out of 350 bets (51.4%)
  • Underdogs: 165 wins out of 280 bets (58.9%)

  1. State the null and alternative hypotheses.
  2. Calculate the pooled proportion.
  3. Calculate the standard error of the difference in proportions.
  4. Compute the z-statistic for the difference.
  5. Find the two-sided p-value.
  6. Construct a 95% confidence interval for the difference in win rates.
  7. Is there statistically significant evidence that the bettor performs differently on favorites vs. underdogs? Discuss both statistical and practical significance.
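
Questions 2 through 5 follow the standard pooled two-proportion z-test. The sketch below is a minimal version under that assumption; the function name and the order of the returned values are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(wins1: int, n1: int, wins2: int, n2: int):
    """Pooled z-test for H0: p1 == p2.

    Returns (pooled proportion, standard error, z, two-sided p-value).
    """
    p1, p2 = wins1 / n1, wins2 / n2
    pooled = (wins1 + wins2) / (n1 + n2)
    # Under H0 both groups share the pooled proportion.
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return pooled, se, z, p_two_sided


if __name__ == "__main__":
    pooled, se, z, p = two_proportion_z_test(180, 350, 165, 280)
    print(f"pooled = {pooled:.4f}, SE = {se:.4f}, z = {z:.4f}, p = {p:.4f}")
```

The confidence interval in question 6 normally uses the unpooled standard error instead, so it needs a separate calculation.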

Exercise B-5: Sequential Testing and Stopping Rules

A bettor decides to track their results and stop betting on a system as soon as either:

  • They achieve a statistically significant positive result (p < 0.05), or
  • They have placed 500 bets without significance.

After every 50 bets, they calculate a running p-value. Their results:

Bets  Cumulative Wins  Cumulative Win %  Running p-value
  50               29             58.0%            0.129
 100               56             56.0%            0.115
 150               83             55.3%            0.074
 200              112             56.0%            0.028
 250              136             54.4%            0.048

  1. Verify the p-value calculation for n = 200 bets (112 wins).
  2. The bettor stops at 250 bets, claiming significance. What is wrong with this approach?
  3. Explain the concept of "optional stopping" and why it inflates Type I error rates.
  4. If the bettor had pre-committed to exactly 500 bets, what would the significance threshold be?
  5. Describe how a sequential testing procedure (such as the O'Brien-Fleming method or alpha spending function) could be properly applied here.
  6. Calculate the adjusted significance thresholds using the Pocock boundary for 5 interim analyses.
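
The inflation described in question 3 can be seen by simulation. The sketch below assumes a one-sided normal-approximation z-test against p0 = 0.50 at each look (the table does not say which test produced its p-values), a bettor with no edge, and the five checkpoints shown in the table. The function names, the seed, and the number of simulations are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(wins: int, n: int, p0: float = 0.50) -> float:
    """Upper-tail p-value from a z-test of the win rate against p0."""
    z = (wins / n - p0) / sqrt(p0 * (1 - p0) / n)
    return 1.0 - NormalDist().cdf(z)


def false_positive_rate_with_peeking(n_sims: int = 2000,
                                     looks=(50, 100, 150, 200, 250),
                                     alpha: float = 0.05,
                                     seed: int = 0) -> float:
    """Share of no-edge bettors who hit p < alpha at any interim look."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        wins, placed = 0, 0
        for n in looks:
            # Place the bets between the previous look and this one (true p = 0.50).
            wins += sum(rng.random() < 0.5 for _ in range(n - placed))
            placed = n
            if one_sided_p_value(wins, n) < alpha:
                false_positives += 1
                break  # the bettor stops and declares significance
    return false_positives / n_sims


if __name__ == "__main__":
    print(false_positive_rate_with_peeking())
```

Passing a single look, for example looks=(250,), gives the fixed-sample error rate for comparison.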

Exercise B-6: Power Analysis for Betting Research

A researcher wants to study whether NFL home underdogs cover the spread more than 50% of the time. Historical data suggests the true cover rate might be 52.5%.

  1. Calculate the power of a test with n = 500 games at alpha = 0.05 (one-sided) to detect a true proportion of 52.5%.
  2. Calculate the power for n = 1000, 2000, and 5000 games.
  3. What sample size is needed to achieve 90% power?
  4. The researcher has access to 15 years of NFL data (approximately 4000 games). Is this sufficient to detect a 52.5% cover rate with 80% power?
  5. Create a power curve showing power as a function of sample size for true proportions of 51%, 52%, 53%, 54%, and 55%.
  6. Discuss the practical implications: if the effect is real but requires 5000+ games to detect, what does this mean for the individual bettor?
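
Power for a one-sided test of a proportion can be approximated with the normal distribution. The sketch below is one common formulation, assuming the rejection threshold is set using the null standard error and the power is evaluated using the standard error at the true proportion; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided(n: int, p1: float, p0: float = 0.50, alpha: float = 0.05) -> float:
    """Approximate power of a one-sided z-test of H0: p = p0 against a true rate p1 > p0."""
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha)
    # Smallest observed win rate that rejects H0.
    critical_rate = p0 + z_alpha * sqrt(p0 * (1 - p0) / n)
    se_true = sqrt(p1 * (1 - p1) / n)
    # Probability that the observed rate exceeds the threshold when p1 is true.
    return 1.0 - normal.cdf((critical_rate - p1) / se_true)


if __name__ == "__main__":
    for n in (500, 1000, 2000, 4000, 5000):
        print(n, round(power_one_sided(n, 0.525), 3))
```

Looping over several true proportions as well as sample sizes produces the data for the power curve in question 5.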

Exercise B-7: Multiple Testing Correction

A researcher tests 15 different betting angles in college football. The p-values obtained are:

0.003, 0.012, 0.024, 0.031, 0.048, 0.055, 0.067, 0.089, 0.112, 0.156, 0.234, 0.345, 0.456, 0.678, 0.891

  1. How many of these are significant at the uncorrected alpha = 0.05 level?
  2. Apply the Bonferroni correction. What is the adjusted significance threshold? How many results remain significant?
  3. Apply the Holm-Bonferroni (step-down) procedure. Show each step and identify which results remain significant.
  4. Apply the Benjamini-Hochberg procedure to control the False Discovery Rate at 5%. Show each step and identify which results are significant.
  5. Compare the results of all three correction methods. Which is most conservative? Which is most liberal?
  6. In the context of sports betting research, which correction method would you recommend and why?
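
The three procedures in questions 2 through 4 are short enough to implement directly, which helps when checking the step-by-step work. The sketch below uses the textbook definitions of each procedure and returns, for each p-value in its original position, whether it is rejected; the function names are illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject when p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Step-down: compare the k-th smallest p-value with alpha / (m - k + 1); stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


def benjamini_hochberg(p_values, q=0.05):
    """Step-up: find the largest k with p_(k) <= (k / m) * q and reject the k smallest."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


if __name__ == "__main__":
    angles = [0.003, 0.012, 0.024, 0.031, 0.048, 0.055, 0.067, 0.089,
              0.112, 0.156, 0.234, 0.345, 0.456, 0.678, 0.891]
    for name, method in (("Bonferroni", bonferroni), ("Holm", holm), ("BH", benjamini_hochberg)):
        print(name, sum(method(angles)))
```

Note that Holm stops at the first failure while Benjamini-Hochberg keeps scanning for the largest qualifying rank, which is why the two can disagree on other sets of p-values.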

Part C: Programming Exercises (5 Exercises)

Exercise C-1: Hypothesis Testing Framework

Build a comprehensive Python class called BettingHypothesisTest that implements the following:

Requirements:

  • Accept betting records as input (wins, losses, pushes, odds for each bet)
  • Implement a z-test for proportions (one-sided and two-sided)
  • Implement a t-test for profit/loss per bet (one-sided and two-sided)
  • Calculate exact binomial test p-values
  • Generate confidence intervals (both Wald and Wilson score intervals)
  • Produce a summary report including:
      • Test statistics and p-values
      • Confidence intervals
      • Effect size measures
      • Plain-language interpretation of results
  • Handle edge cases (small samples, extreme win rates, all wins/all losses)

Testing:

  • Test with a simulated bettor who has a 54% true win rate over 500 bets
  • Test with a simulated bettor who has a 50% true win rate over 500 bets (should usually fail to reject)
  • Test with a small sample (30 bets) to observe wide confidence intervals


Exercise C-2: Power and Sample Size Calculator

Build a Python tool called BettingSampleSizeCalculator that performs power analysis specifically for sports betting scenarios.

Requirements:

  • Calculate required sample size given: significance level, power, null proportion, alternative proportion
  • Calculate power given: significance level, sample size, null proportion, alternative proportion
  • Handle both one-sided and two-sided tests
  • Include preset scenarios for common betting situations:
      • "Can this bettor beat the no-vig line?" (p0 = 0.50)
      • "Can this bettor beat the vig?" (p0 = 0.5238 for -110 lines)
      • "Can this bettor beat the vig on heavy juice?" (p0 = 0.5349 for -115 lines)
  • Generate power curves (plot power vs. sample size for various true win rates)
  • Generate a "years to significance" table: given bets per week and true win rate, how many years would it take to achieve significance?
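
The preset null proportions are breakeven win rates implied by the odds. A small helper like the one below could supply them inside the calculator; it is only a sketch, and the function name is a placeholder rather than part of the required interface.

```python
def breakeven_win_rate(american_odds: int) -> float:
    """Win rate at which a flat bettor breaks even at the given American odds."""
    if american_odds == 0:
        raise ValueError("American odds cannot be zero")
    if american_odds < 0:
        risk, win = -american_odds, 100   # e.g. -110: risk 110 to win 100
    else:
        risk, win = 100, american_odds    # e.g. +105: risk 100 to win 105
    return risk / (risk + win)


if __name__ == "__main__":
    for odds in (-110, -115, 100, 105):
        print(odds, round(breakeven_win_rate(odds), 4))
```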


Exercise C-3: Multiple Testing Correction Tool

Build a Python class called MultipleTestingCorrector that implements various correction methods for multiple hypothesis testing.

Requirements:

  • Bonferroni correction
  • Holm-Bonferroni (step-down) procedure
  • Benjamini-Hochberg (BH) procedure for FDR control
  • Benjamini-Yekutieli procedure (for dependent tests)
  • Simulation of "data snooping": generate N strategies with no edge, show how many appear significant before and after correction
  • Visualization: plot raw p-values vs. adjusted p-values for each method
  • Summary table showing which hypotheses are rejected under each method

Data Snooping Simulation:

  • Simulate 100 betting strategies, each with 200 bets at true 50% win rate
  • Show the distribution of p-values (should be approximately uniform)
  • Demonstrate how many strategies appear "significant" by chance
  • Apply corrections and show the reduction in false discoveries


Exercise C-4: Comprehensive Betting Record Significance Tester

Build an end-to-end tool called BettingRecordAnalyzer that takes a bettor's complete record and produces a thorough statistical evaluation.

Requirements:

  • Input: CSV file or list of bets with columns (date, sport, bet_type, odds, stake, result)
  • Tests to perform:
      • Overall win rate significance test
      • Profitability test (is ROI significantly greater than zero?)
      • Subsample consistency (does skill persist across time periods?)
      • Sport-by-sport breakdown with multiple testing correction
      • Bet type breakdown (spreads, totals, moneylines) with correction
      • Streak analysis (are winning/losing streaks consistent with randomness?)
      • Closing line value analysis (if closing odds are available)
  • Output: formatted report with visualizations
  • Include a "skepticism score" that summarizes the overall evidence for skill


Exercise C-5: Monte Carlo Hypothesis Testing Simulator

Build a simulation tool that demonstrates key hypothesis testing concepts through Monte Carlo methods.

Requirements:

  • Simulate the distribution of test statistics under H0 to verify theoretical p-values
  • Demonstrate Type I error rates: run 10,000 experiments with no true effect and count false positives
  • Demonstrate Type II error rates: run 10,000 experiments with a known true effect and count false negatives
  • Demonstrate the multiple testing problem: simulate testing many strategies simultaneously
  • Demonstrate optional stopping bias: simulate the effect of peeking at results
  • Demonstrate the relationship between sample size and power empirically
  • All simulations should be configurable (sample sizes, true proportions, significance levels)
  • Produce publication-quality plots for each demonstration
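
As a possible starting point for the Type I and Type II demonstrations, the sketch below estimates how often a one-sided z-test against 50% rejects H0 for a simulated bettor. It assumes every bet is an independent win/loss with a fixed true win rate; the function name, the seed, and the defaults are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist


def rejection_rate(true_p: float, n_bets: int, n_experiments: int = 10_000,
                   alpha: float = 0.05, seed: int = 0) -> float:
    """Fraction of simulated records that reject H0: p = 0.50 (one-sided z-test).

    With true_p = 0.50 this estimates the Type I error rate;
    with true_p > 0.50 it estimates power (1 minus the Type II error rate).
    """
    rng = random.Random(seed)
    z_critical = NormalDist().inv_cdf(1 - alpha)
    se_null = sqrt(0.25 / n_bets)
    rejections = 0
    for _ in range(n_experiments):
        wins = sum(rng.random() < true_p for _ in range(n_bets))
        z = (wins / n_bets - 0.5) / se_null
        if z > z_critical:
            rejections += 1
    return rejections / n_experiments


if __name__ == "__main__":
    print("Type I error estimate:", rejection_rate(0.50, 500))
    print("Power estimate at 54%:", rejection_rate(0.54, 500))
```

Because the win count is discrete, the estimated Type I error rate will generally sit near alpha rather than exactly on it.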


Part D: Analysis Exercises (5 Exercises)

Exercise D-1: Real-World Betting Record Evaluation

Below is a simulated record for a sports bettor over three seasons:

Season  Sport  Bets  Wins  Losses  Avg Odds  Profit/Loss
2022    NFL     180    98      82      -110      +$1,236
2022    NBA     420   218     202      -108      +$2,845
2022    MLB     300   152     148      +102      +$1,567
2023    NFL     175    90      85      -110        +$312
2023    NBA     450   225     225      -110      -$2,250
2023    MLB     280   148     132      +105      +$3,920
2024    NFL     190   105      85      -110      +$3,045
2024    NBA     400   212     188      -110      +$2,618
2024    MLB     310   165     145      +100      +$3,300

  1. For each sport, calculate the overall win rate across all three seasons.
  2. Test whether the overall win rate for each sport is significantly above 50%.
  3. Test whether the overall win rate for each sport is significantly above the breakeven threshold given the average odds.
  4. Apply multiple testing correction (since you are testing three sports simultaneously).
  5. Analyze whether the bettor's performance is consistent across seasons or shows significant variation.
  6. What is your overall assessment? Is this bettor skilled, lucky, or is the evidence inconclusive?

Exercise D-2: Market Efficiency Analysis

The following data shows the ATS (against the spread) records for NFL teams grouped by point spread magnitude over 10 seasons:

Category                      Total Games  Home Covers  Away Covers  Push
Pick'em (0)                           120           64           52     4
Small favorites (1-3)                 580          275          290    15
Medium favorites (3.5-7)              820          398          405    17
Large favorites (7.5-10)              350          182          160     8
Very large favorites (10.5+)          230          125           98     7

  1. For each category, test whether the home team covers at a rate significantly different from 50% (excluding pushes).
  2. Is there a trend in home team cover rates as the spread increases? Perform a test for trend.
  3. The "large favorites" category shows a home cover rate above 53%. Is this significant? What is the p-value?
  4. Apply the Benjamini-Hochberg correction across all five categories. Do any categories show significant deviations from 50%?
  5. Discuss the practical implications. Even if some results are statistically significant, are they exploitable given the vig?
  6. What additional data or tests would you want to conduct before concluding that a market inefficiency exists?

Exercise D-3: Before-and-After Study

A sportsbook changes its line-setting algorithm. You have data from before and after the change:

  • Before (old algorithm): 2000 games, bettors collectively won 1020 (51.0%)
  • After (new algorithm): 1500 games, bettors collectively won 735 (49.0%)

  1. Test whether the bettor win rate decreased significantly after the algorithm change.
  2. Construct a 95% confidence interval for the change in bettor win rate.
  3. From the sportsbook's perspective, estimate the additional profit per 1000 bets generated by the new algorithm (assuming $100 average bet at -110).
  4. Could the difference be due to factors other than the algorithm change? List at least three confounding variables.
  5. Design a more rigorous study to evaluate the algorithm change. What would an ideal experimental design look like?
  6. If the sportsbook wants to detect a 1-percentage-point improvement in their hold with 90% power, how many games do they need to observe with each algorithm?

Exercise D-4: Publication Bias and the File Drawer Problem

A sports analytics website publishes 10 articles per year, each presenting a "profitable betting system." Assume:

  • Each article tests one system using historical data.
  • For every published article, 9 systems were tested but not published because they didn't show significant results.
  • The significance level used is alpha = 0.05.

  1. If none of the tested systems have a real edge, how many would you expect to appear significant per year?
  2. What proportion of the published systems are likely to be false positives?
  3. This is related to the "positive predictive value" of a statistical test. Express this mathematically using Bayes' theorem, given a prior probability pi that any given system has a real edge.
  4. If you believe 5% of tested systems might have a real edge, what is the probability that a published significant result represents a truly effective system?
  5. How would requiring alpha = 0.01 change the positive predictive value?
  6. What practices should consumers of sports betting research adopt to protect themselves from publication bias?
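
The Bayes' theorem expression in question 3 can be evaluated numerically once a power value is chosen. The exercise does not give the power of the tests, so the 80% used in the sketch below is purely an assumption for illustration; the function name is hypothetical.

```python
def positive_predictive_value(prior: float, power: float, alpha: float) -> float:
    """P(system has a real edge | test was significant), by Bayes' theorem.

    prior: share of tested systems with a real edge (pi)
    power: P(significant | real edge)
    alpha: P(significant | no edge)
    """
    true_positive_mass = prior * power
    false_positive_mass = (1 - prior) * alpha
    total = true_positive_mass + false_positive_mass
    if total == 0:
        raise ValueError("no significant results are possible with these inputs")
    return true_positive_mass / total


if __name__ == "__main__":
    assumed_power = 0.80  # assumption; not specified in the exercise
    for alpha in (0.05, 0.01):
        print(alpha, round(positive_predictive_value(0.05, assumed_power, alpha), 4))
```

Trying a lower assumed power shows how quickly the positive predictive value deteriorates when the underlying studies are small.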

Exercise D-5: Evaluating a Tipster Service

A sports betting tipster claims to have a verified 56% win rate over 400 bets at average odds of -110.

  1. Test this claim statistically. Is 56% over 400 bets significant?
  2. The tipster has been active for 2 years. How many total picks would they likely have made? Does the 400-bet sample represent their complete record or a subset?
  3. If the tipster selected their best 400 out of 800 total bets to advertise, how would this affect the interpretation?
  4. Calculate the expected win rate of the "best half" of bets from a bettor with no skill (true p = 0.50) when selecting the best 400 out of 800 bets. (Hint: think about order statistics or simulation.)
  5. The tipster offers a subscription for $200/month. If you bet $100 per game and follow all their picks (approximately 20 per month), what win rate do you need just to cover the subscription cost plus vig?
  6. Design a prospective evaluation plan for this tipster. How many months of tracked picks would you need, and what criteria would you use to assess their skill?
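
For question 5, the breakeven win rate can be found by setting the expected profit per bet equal to the subscription cost per bet. The sketch below assumes flat stakes, every pick priced at the same American odds, and no pushes; the function name is illustrative.

```python
def breakeven_with_subscription(monthly_fee: float, bets_per_month: int,
                                stake: float, american_odds: int = -110) -> float:
    """Win rate w solving w * payout - (1 - w) * stake = fee per bet."""
    if american_odds == 0:
        raise ValueError("American odds cannot be zero")
    if american_odds < 0:
        payout = stake * 100 / -american_odds   # profit on a winning bet
    else:
        payout = stake * american_odds / 100
    fee_per_bet = monthly_fee / bets_per_month
    return (stake + fee_per_bet) / (stake + payout)


if __name__ == "__main__":
    print(round(breakeven_with_subscription(200, 20, 100), 4))
    print(round(breakeven_with_subscription(0, 20, 100), 4))  # vig only, no subscription
```

Increasing the stake spreads the fixed subscription cost over more money at risk, which lowers the required win rate toward the vig-only breakeven.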

Part E: Research Exercises (5 Exercises)

Exercise E-1: Literature Review on Betting Market Efficiency

Conduct a literature review on hypothesis testing in the context of betting market efficiency.

  1. Find and summarize at least three academic papers that test the efficient market hypothesis in sports betting markets.
  2. What statistical methods are most commonly used? (z-tests, chi-squared, regression, etc.)
  3. What sample sizes are typical in these studies?
  4. How do the papers address the multiple testing problem?
  5. What are the most commonly reported "inefficiencies," and do they survive out-of-sample testing?
  6. Write a 500-word synthesis of the current state of knowledge on this topic.

Exercise E-2: Replication Study Design

Design a replication study for the following published finding: "NFL home underdogs of 7 or more points cover the spread 55% of the time."

  1. State the precise null and alternative hypotheses for your replication.
  2. Conduct a power analysis: how many games do you need to replicate this finding with 80% power at alpha = 0.05?
  3. Describe your data collection plan (data sources, time period, inclusion/exclusion criteria).
  4. Pre-register your analysis plan: specify exactly what tests you will run, what corrections you will apply, and what criteria you will use to evaluate the replication.
  5. Discuss potential threats to validity (changes in the market over time, data quality issues, definitional differences).
  6. What would a "successful" replication look like? What would a "failed" replication tell us?

Exercise E-3: Monte Carlo Study of Test Properties

Design and conduct a Monte Carlo simulation study to investigate the following question: "When evaluating sports bettors, how does the choice of test statistic affect the probability of correctly identifying skilled bettors?"

Compare the following approaches:

  1. Z-test for proportions (win rate vs. 50%)
  2. T-test for profit per bet (mean profit vs. 0)
  3. Exact binomial test
  4. Bootstrap test
  5. Bayesian test with a skeptical prior

For each approach, estimate:

  • Type I error rate (using simulated 50% bettors)
  • Power (using simulated 54% bettors)
  • Power for a "realistic" edge (52% bettors)
  • Sensitivity to model misspecification (e.g., varying odds across bets)

Write up your findings in a 500-word report.


Exercise E-4: Historical Analysis of a Betting System

Choose one well-known betting system or angle (examples: bet against the public, fade the Monday Night Football favorite, bet the under in September NFL games).

  1. Clearly state the system's rules and the hypotheses you will test.
  2. Collect or simulate historical data for at least 5 seasons.
  3. Perform a rigorous hypothesis test, including:
       • Primary test with pre-specified significance level
       • Multiple testing correction if you examine sub-categories
       • Robustness checks (different time periods, different sub-samples)
  4. Conduct an out-of-sample test: use the first half of your data to identify the system and the second half to test it.
  5. Report your results following best practices (effect sizes, confidence intervals, not just p-values).
  6. Discuss whether the system would have been profitable after transaction costs (vig).

Exercise E-5: Developing a Personal Hypothesis Testing Protocol

Create a personal statistical protocol that you will use to evaluate any future betting strategy.

Your protocol should include:

  1. Pre-registration template: A form you fill out before testing any strategy, including hypotheses, sample size justification, analysis plan, and stopping rules.
  2. Significance standards: Your chosen significance level and justification. Consider whether a single threshold (like 0.05) is appropriate or whether a tiered approach (suggestive, significant, highly significant) is better.
  3. Multiple testing policy: How you will handle the fact that you may test many strategies over your career.
  4. Minimum sample sizes: A table of minimum bets required before you will even consider a result meaningful, for different assumed edges.
  5. Reporting standards: What information you will record and report for every strategy evaluation.
  6. Decision criteria: Clear rules for when you will begin betting real money on a strategy, when you will increase stakes, and when you will abandon a strategy.
  7. Annual review process: How you will periodically re-evaluate your active strategies.

Write this protocol as a 1-2 page document that you could actually use in practice.


End of Chapter 8 Exercises