Chapter 7 Exercises: Probability Distributions in Betting
Part A: Conceptual Questions (8 Exercises)
Exercise A.1: Choosing the Right Distribution for Goals Scored
A soccer analyst needs to model the number of goals scored by a team in a single match. The team averages 1.4 goals per game over the season.
- Which probability distribution is most appropriate for modeling goals scored in a single match? Justify your choice by referencing the properties of the distribution and the nature of the data.
- What assumptions does this distribution require, and are they reasonably satisfied in the context of soccer goals?
- Under what circumstances might this distribution become a poor fit? Describe at least two scenarios where the assumptions break down.
Exercise A.2: Modeling Season Win Totals
An NFL bettor wants to model the total number of wins a team will achieve across a 17-game season. The team has a per-game win probability estimated at 0.65.
- Which distribution is appropriate for modeling total season wins? Explain why.
- What are the key parameters of this distribution, and how would you estimate them in this context?
- If we instead modeled the number of wins across a much larger number of games (say, 170), and the win probability were lower (say, 0.10), which alternative distribution could serve as a good approximation? Explain the connection between the two distributions.
Exercise A.3: Point Spread Margins
Historical data shows that the margin against the spread in NFL games (actual score difference minus the spread) follows an approximately symmetric, bell-shaped distribution centered near zero with a standard deviation of about 13.5 points.
- Which continuous distribution would you use to model spread margins? Why?
- What key properties of this distribution make it suitable for spread analysis?
- Describe a situation in sports betting where the tails of this distribution (extreme outcomes) are more important than the center. How might the chosen distribution underestimate or overestimate tail probabilities?
Exercise A.4: Updating Beliefs About a Team's Strength
A Bayesian analyst wants to model their uncertainty about a team's true win rate. Before the season, they believe the team wins about 55% of games but are not highly confident. After observing the first 20 games (12 wins, 8 losses), they want to update their belief.
- Which distribution is used to represent prior and posterior beliefs about a probability parameter? Why is this distribution uniquely suited for this purpose?
- How does this distribution change shape as we accumulate more data? Describe what happens to the distribution after 20 games, 80 games, and 500 games.
- Explain the concept of conjugate priors in the context of this problem. Why is conjugacy computationally convenient?
Exercise A.5: Distribution Selection for Different Sports Metrics
For each of the following sports metrics, identify the most appropriate probability distribution and briefly justify your choice:
- The number of three-pointers made by an NBA player in a single game (the player attempts approximately 8 per game and makes them at a 38% rate).
- The total number of penalty minutes in an NHL game.
- The difference between a basketball team's actual total points and the over/under line.
- A bettor's belief about the true probability that a tennis player wins a given match.
- The number of aces served by a tennis player in a match where they serve approximately 80 times with a 10% ace rate.
- Whether a football team converts a specific fourth-down attempt (a single yes/no event).
Exercise A.6: Poisson vs. Negative Binomial for Scoring Events
Some analysts argue that the Poisson distribution underestimates the variance of goals scored in soccer because it assumes the mean equals the variance (equidispersion).
- Explain the concept of overdispersion and why it might occur in sports scoring data.
- How does the Negative Binomial distribution address this limitation? What additional parameter does it introduce?
- Describe a practical test you could perform on historical data to determine whether the Poisson or Negative Binomial provides a better fit.
Exercise A.7: The Role of the Normal Distribution in Betting Markets
The normal (Gaussian) distribution appears frequently in sports betting, even when the underlying data is not perfectly normal.
- Explain how the Central Limit Theorem justifies using the normal distribution for point totals in basketball, even though individual scoring events are not normally distributed.
- Why do sportsbooks implicitly rely on normal distribution assumptions when they set point spread lines?
- Describe a scenario where assuming normality for a betting market would lead to significantly incorrect probability estimates. What distribution might be more appropriate?
Exercise A.8: Comparing Discrete and Continuous Models
A bettor is analyzing the total number of runs scored in a baseball game. The historical average is about 8.5 runs per game.
- A colleague suggests using a Poisson distribution. Another suggests a normal distribution. Compare and contrast these two approaches for this specific application.
- What are the practical consequences of using a continuous distribution (normal) to model a discrete outcome (integer runs)? When does the continuity correction matter?
- Under what conditions on the rate parameter (lambda) does the Poisson distribution converge toward the normal distribution? Show how the parameters relate.
Part B: Calculation Exercises (7 Exercises)
Exercise B.1: Poisson Probability Calculations
A soccer team averages 1.3 goals per match. Assuming goals follow a Poisson distribution:
- Calculate the probability that the team scores exactly 0, 1, 2, 3, and 4 goals in a single match.
- Calculate the probability that the team scores 3 or more goals.
- If the opposing team averages 0.9 goals per match (and goal scoring is independent), calculate the probability of a 0-0 draw, a 1-0 home win, and a 2-1 away win.
- Calculate the probability that the total goals in the match exceeds 2.5 (the Over 2.5 goals line).
- If the Poisson rate for the home team increases by 15% due to home advantage, recalculate the probability of Over 2.5 goals.
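A minimal sketch of how the first few calculations could be set up, using only the Python standard library. The helper names poisson_pmf and prob_total_over are illustrative and do not come from the chapter. The Over 2.5 step relies on the fact that the sum of two independent Poisson variables is itself Poisson with the summed rate.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def prob_total_over(line, lam_total):
    """P(total goals > line) for a half-goal line such as 2.5."""
    highest_under = math.floor(line)
    p_under = sum(poisson_pmf(k, lam_total) for k in range(highest_under + 1))
    return 1.0 - p_under

home_rate, away_rate = 1.3, 0.9  # rates given in the exercise

# Independence lets us multiply the two teams' probabilities.
p_0_0 = poisson_pmf(0, home_rate) * poisson_pmf(0, away_rate)

# Total goals ~ Poisson(home_rate + away_rate) under independence.
p_over_2_5 = prob_total_over(2.5, home_rate + away_rate)

print(f"P(0-0) = {p_0_0:.4f}, P(Over 2.5) = {p_over_2_5:.4f}")
```

For the home-advantage part, the same functions can be reused with home_rate multiplied by 1.15.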
Exercise B.2: Binomial Season Simulation
An NBA team has an estimated per-game win probability of 0.585 across an 82-game season. Using the binomial distribution:
- Calculate the expected number of wins and the standard deviation of wins.
- Calculate the probability that the team wins exactly 48 games.
- Calculate the probability that the team wins 50 or more games (the Over on a season win total of 49.5).
- Calculate the probability that the team wins between 44 and 52 games (inclusive).
- The sportsbook sets the Over/Under at 47.5 wins with -110 on both sides. Determine the implied probability of the Over and compare it to your calculated probability. Is there value on either side?
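One way to organize these binomial calculations is sketched below with the standard library only. The function names are illustrative, and implied_prob_american returns the raw implied probability with the vig still included.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_at_least(k_min, n, p):
    """P(X >= k_min) by direct summation of the PMF."""
    return sum(binom_pmf(k, n, p) for k in range(k_min, n + 1))

def implied_prob_american(odds):
    """Implied probability (vig included) from American odds."""
    if odds < 0:
        return -odds / (-odds + 100)
    return 100 / (odds + 100)

n_games, win_prob = 82, 0.585  # values given in the exercise

mean_wins = n_games * win_prob
sd_wins = math.sqrt(n_games * win_prob * (1 - win_prob))

# Over 47.5 wins means 48 or more wins.
p_over_47_5 = binom_at_least(48, n_games, win_prob)
breakeven = implied_prob_american(-110)

print(f"mean = {mean_wins:.2f}, sd = {sd_wins:.2f}")
print(f"P(Over 47.5) = {p_over_47_5:.4f}, break-even at -110 = {breakeven:.4f}")
```

The Over has value in this framing only if the model probability exceeds the break-even probability implied by the price.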
Exercise B.3: Normal Distribution for Spread Analysis
NFL point spread margins (actual margin minus the line) follow a normal distribution with mean 0 and standard deviation 13.86.
- A team is favored by 7 points. Calculate the probability that they cover the spread (win by more than 7).
- Calculate the probability that the game lands exactly on the spread (a "push") using a reasonable interval (e.g., the actual margin is between 6.5 and 7.5).
- A bettor takes an alternate spread of -3. What is the probability of covering?
- Calculate the probability that the favorite wins the game outright (margin > 0).
- If a teaser moves the spread from -7 to -1, calculate the increase in cover probability.
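The sketch below shows one possible setup, under the assumption that a 7-point favorite's actual margin is normal with mean 7 and standard deviation 13.86 (equivalent to the margin-minus-line residual having mean 0). The function name prob_margin_exceeds is hypothetical.

```python
from statistics import NormalDist

SD_MARGIN = 13.86  # standard deviation given in the exercise

def prob_margin_exceeds(threshold, expected_margin, sd=SD_MARGIN):
    """P(favorite's winning margin > threshold) under a normal model."""
    return 1.0 - NormalDist(mu=expected_margin, sigma=sd).cdf(threshold)

# Assumption: the closing line of -7 is the expected margin.
expected_margin = 7.0

p_cover = prob_margin_exceeds(7.0, expected_margin)  # cover -7
p_alt_3 = prob_margin_exceeds(3.0, expected_margin)  # cover alternate -3
p_win = prob_margin_exceeds(0.0, expected_margin)    # win outright

print(f"cover -7: {p_cover:.4f}, cover -3: {p_alt_3:.4f}, win: {p_win:.4f}")
```

Because the model is continuous, the push probability has to be approximated by an interval, for example the difference between the CDF values at 7.5 and 6.5.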
Exercise B.4: Beta Distribution Parameters
A Bayesian bettor uses a Beta distribution to model their belief about an NFL team's win probability.
- Starting with a uniform prior Beta(1, 1), update the posterior after observing 7 wins in 10 games. Write the posterior distribution parameters.
- Starting with an informative prior Beta(8, 6) (representing a prior belief of roughly 57% win rate), update after the same 7 wins in 10 games. Write the posterior distribution parameters.
- For each posterior from the first two parts, calculate the posterior mean and the 95% credible interval. (Use the Beta quantile function or the normal approximation.)
- Calculate the probability that the team's true win rate exceeds 0.60 under each posterior.
- Explain why the two posteriors differ and describe what would happen as you observe more and more games.
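The conjugate update itself is only a pair of additions, as the sketch below illustrates. All function names are illustrative. The interval uses the normal approximation that the exercise permits, which is rough for small parameter values; an exact interval would need a Beta quantile function such as the one in SciPy.

```python
import math

def beta_update(alpha, beta, wins, losses):
    """Conjugate update: Beta(alpha, beta) prior plus binomial data."""
    return alpha + wins, beta + losses

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

def beta_sd(alpha, beta):
    total = alpha + beta
    return math.sqrt(alpha * beta / (total ** 2 * (total + 1)))

def approx_interval(alpha, beta, z=1.96):
    """Normal-approximation 95% interval, clipped to [0, 1]."""
    m, s = beta_mean(alpha, beta), beta_sd(alpha, beta)
    return max(0.0, m - z * s), min(1.0, m + z * s)

# 7 wins and 3 losses, as in the exercise.
uniform_post = beta_update(1, 1, wins=7, losses=3)
informed_post = beta_update(8, 6, wins=7, losses=3)

for name, post in [("uniform", uniform_post), ("informative", informed_post)]:
    low, high = approx_interval(*post)
    print(f"{name}: Beta{post}, mean = {beta_mean(*post):.3f}, "
          f"approx 95% interval = ({low:.3f}, {high:.3f})")
```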
Exercise B.5: Combined Distribution Calculations
A soccer match has the following estimated Poisson rates: Home team = 1.6 goals, Away team = 1.1 goals.
- Construct the full scoreline probability matrix for scores from 0-0 to 5-5.
- Calculate the probability of a home win, draw, and away win (the 1X2 market).
- Calculate the probabilities for Over/Under 2.5, Over/Under 1.5, and Over/Under 3.5 goals.
- Calculate the probability of Both Teams to Score (BTTS) — Yes and No.
- If the bookmaker offers the following decimal odds — Home 2.10, Draw 3.40, Away 3.80 — calculate the implied probabilities (adjusting for overround) and identify any value bets relative to your model.
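A possible skeleton for this exercise is sketched below. It builds the matrix out to 10 goals per side rather than 5 so that the 1X2 probabilities lose very little mass to truncation; the 0-0 to 5-5 matrix the exercise asks for is the top-left block. The function names are illustrative, and remove_overround uses simple proportional normalization, which is only one of several ways to strip the margin.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def score_matrix(home_rate, away_rate, max_goals=10):
    """matrix[h][a] = P(home scores h, away scores a), assuming independence."""
    return [[poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
             for a in range(max_goals + 1)]
            for h in range(max_goals + 1)]

def one_x_two(matrix):
    """Return (home win, draw, away win) probabilities from the matrix."""
    home = draw = away = 0.0
    for h, row in enumerate(matrix):
        for a, p in enumerate(row):
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    return home, draw, away

def remove_overround(decimal_odds):
    """Proportionally normalize raw implied probabilities to sum to 1."""
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)
    return [r / total for r in raw]

matrix = score_matrix(1.6, 1.1)                       # rates from the exercise
model_probs = one_x_two(matrix)
market_probs = remove_overround([2.10, 3.40, 3.80])   # odds from the exercise

for label, m, k in zip(["Home", "Draw", "Away"], model_probs, market_probs):
    print(f"{label}: model = {m:.4f}, market = {k:.4f}, edge = {m - k:+.4f}")
```

Over/Under and BTTS probabilities can be read off the same matrix by summing the cells that satisfy each condition.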
Exercise B.6: The Poisson Approximation to the Binomial
A basketball player shoots 25 free throws in a game at a success rate of 92%.
- Calculate the exact binomial probability of making all 25 free throws.
- Instead of modeling successes, model the number of misses. What are the Poisson approximation parameters for the number of misses?
- Using the Poisson approximation, calculate the probability of 0 misses (i.e., making all 25).
- Compare the exact binomial and Poisson approximation answers. How close are they?
- At what success rate does the Poisson approximation begin to deviate significantly from the binomial (say, by more than 1 percentage point for the "make all" probability)?
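The comparison in this exercise takes only a few lines to compute. The sketch below uses illustrative variable names; the Poisson rate for misses is the number of attempts times the miss probability.

```python
import math

n_shots, make_rate = 25, 0.92  # values given in the exercise
miss_rate = 1 - make_rate

# Exact binomial: all 25 attempts succeed.
exact_all_made = make_rate ** n_shots

# Poisson approximation: misses ~ Poisson(n * miss_rate), so P(0 misses) = exp(-lambda).
lam_misses = n_shots * miss_rate
approx_all_made = math.exp(-lam_misses)

gap = abs(exact_all_made - approx_all_made)
print(f"exact = {exact_all_made:.4f}, Poisson = {approx_all_made:.4f}, gap = {gap:.4f}")
```

Wrapping the last few lines in a loop over make_rate values is one way to approach the final part of the exercise.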
Exercise B.7: Normal Approximation to the Binomial
An MLB team has a 0.540 win probability and plays 162 games.
- Using the exact binomial, calculate P(Wins >= 90).
- Using the normal approximation (with continuity correction), calculate P(Wins >= 90).
- Compare the two answers. How close is the approximation?
- The sportsbook sets the season Over/Under at 87.5 wins. Using the normal approximation, calculate the probability of going Over.
- If the team's true win probability is actually 0.555 (the market underestimates them), calculate the expected value of a $100 bet on Over 87.5 at -110 odds.
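The following sketch puts the exact binomial tail, the continuity-corrected normal approximation, and a flat-bet expected value side by side. The function names are illustrative. The expected value is expressed as profit on the stake, with a winning bet at -110 returning 100/110 of the stake as profit.

```python
import math
from statistics import NormalDist

def binom_at_least(k_min, n, p):
    """Exact P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))

def normal_approx_at_least(k_min, n, p):
    """Normal approximation to P(X >= k_min) with continuity correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return 1.0 - NormalDist(mean, sd).cdf(k_min - 0.5)

def ev_at_american_odds(p_win, stake, odds):
    """Expected profit of a bet at American odds (e.g. -110 or +150)."""
    profit_if_win = stake * 100 / -odds if odds < 0 else stake * odds / 100
    return p_win * profit_if_win - (1 - p_win) * stake

n_games = 162
exact_90 = binom_at_least(90, n_games, 0.540)
approx_90 = normal_approx_at_least(90, n_games, 0.540)
print(f"P(Wins >= 90): exact = {exact_90:.4f}, normal approx = {approx_90:.4f}")

# Over 87.5 means 88 or more wins; here the true win probability is 0.555.
p_over_true = normal_approx_at_least(88, n_games, 0.555)
print(f"EV of $100 on Over 87.5 at -110: {ev_at_american_odds(p_over_true, 100, -110):+.2f}")
```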
Part C: Programming Exercises (5 Exercises)
Exercise C.1: Poisson Goal Model for Soccer
Build a complete Poisson-based soccer match prediction model in Python.
Requirements:
- Write a function estimate_team_rates(matches) that takes a list of match results (home_team, away_team, home_goals, away_goals) and estimates attack and defense strength parameters for each team using maximum likelihood. Include a home advantage factor.
- Write a function predict_match(home_attack, home_defense, away_attack, away_defense, home_advantage) that returns a dictionary containing:
  - The scoreline probability matrix (up to 8 goals each)
  - 1X2 probabilities (home win, draw, away win)
  - Over/Under probabilities for 0.5, 1.5, 2.5, 3.5, and 4.5 goals
  - Both Teams to Score probability
  - Correct Score probabilities for the 20 most likely scorelines
- Write a function find_value_bets(model_probs, market_odds, min_edge=0.05) that compares model probabilities to market odds and returns bets where the model edge exceeds the minimum threshold.
- Test your model on at least one full season of data (you may simulate or use a publicly available dataset).
- Include visualization: plot the scoreline probability matrix as a heatmap.
Exercise C.2: Binomial Season Simulator
Build a Monte Carlo simulator for season win totals using the binomial distribution.
Requirements:
- Write a function simulate_season(win_prob, num_games, num_simulations=100000) that simulates season outcomes and returns the distribution of win totals.
- Create a visualization showing the simulated distribution overlaid with the theoretical binomial PMF.
- Write a function over_under_analysis(win_prob, num_games, line, vig=-110) that calculates:
  - The probability of going Over and Under
  - The expected value of each bet at the given vig
  - The Kelly criterion bet size for each side
- Extend the simulator to handle variable win probabilities across a season (e.g., a team that starts at 0.55 win probability but improves to 0.65 by season's end). Compare results to the fixed-probability model.
- Generate a report for a specific team showing the probability distribution of possible season outcomes, including probabilities of reaching specific win totals (e.g., 50+ wins in the NBA, 10+ wins in the NFL).
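As a starting point, the simulation step could look like the sketch below. It is one possible shape for simulate_season, not a reference solution: the return format (a dictionary of relative frequencies) and the extra seed argument for reproducibility are assumptions that go beyond the signature in the requirements.

```python
import random
from collections import Counter

def simulate_season(win_prob, num_games, num_simulations=100000, seed=None):
    """Return {win_total: relative frequency} from simulated seasons.

    Each game is an independent Bernoulli trial with probability win_prob.
    The seed argument is an addition for reproducibility.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_simulations):
        wins = sum(rng.random() < win_prob for _ in range(num_games))
        counts[wins] += 1
    return {w: c / num_simulations for w, c in sorted(counts.items())}

# Smaller simulation count than the default to keep the demo quick.
dist = simulate_season(0.585, 82, num_simulations=20000, seed=7)

mean_sim = sum(w * f for w, f in dist.items())
p_over_49_5 = sum(f for w, f in dist.items() if w >= 50)
print(f"simulated mean wins = {mean_sim:.2f}, P(50+ wins) = {p_over_49_5:.3f}")
```

For the variable-probability extension, win_prob could be replaced by a sequence of per-game probabilities, with each game drawn against its own entry.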
Exercise C.3: Normal Spread Model
Build a complete point spread analysis tool using the normal distribution.
Requirements:
- Write a function spread_cover_probability(spread, mean_margin=0, std_margin=13.86) that returns the probability of covering the spread.
- Write a function teaser_analysis(original_spread, teaser_points, std_margin=13.86) that calculates how much a teaser improves cover probability and whether the improved probability justifies the reduced payout.
- Write a function alternate_spread_fair_odds(spread_range, mean_margin, std_margin) that generates fair odds for a range of alternate spreads.
- Implement a function backtest_spread_model(historical_data, std_margin) that backtests the normal spread model against historical results and reports calibration metrics (e.g., Brier score, calibration plot).
- Create visualizations including:
  - The probability density of spread margins with the spread line marked
  - A calibration plot comparing predicted probabilities to observed frequencies
  - A profit/loss chart for a flat-betting strategy on model-identified value bets
Exercise C.4: Distribution Fitter
Build a general-purpose tool that fits multiple probability distributions to sports data and selects the best fit.
Requirements:
- Write a function fit_distributions(data, distributions=None) that fits a list of candidate distributions to the data using maximum likelihood estimation. Default distributions should include: Normal, Poisson, Negative Binomial, Log-Normal, Gamma, and Beta (where applicable).
- For each fitted distribution, compute:
  - Parameter estimates
  - Log-likelihood
  - AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion)
  - Chi-squared goodness-of-fit test statistic and p-value
  - Kolmogorov-Smirnov test statistic and p-value (for continuous distributions)
- Write a function compare_distributions(fit_results) that ranks distributions by AIC/BIC and displays a comparison table.
- Create a visualization function that overlays fitted PDFs/PMFs on the empirical histogram of the data.
- Test your tool on at least three different sports datasets:
  - Soccer goals per match (discrete)
  - NFL point spread margins (continuous)
  - NBA player points per game (continuous)
Exercise C.5: Bayesian Beta Updater
Build an interactive Bayesian updating tool using the Beta-Binomial model.
Requirements:
- Write a class BetaUpdater with:
  - An __init__ method that takes prior alpha and beta parameters
  - An update(successes, failures) method that updates the posterior
  - A posterior_mean() method
  - A posterior_mode() method
  - A credible_interval(confidence=0.95) method
  - A probability_above(threshold) method
  - A plot_posterior() method that shows the current posterior distribution
- Write a function animate_bayesian_updates(results, prior_alpha, prior_beta) that creates a visualization showing the posterior distribution evolving as new data arrives (e.g., game by game through a season).
- Implement a function compare_priors(results, priors_list) that shows how different prior choices lead to different posteriors and at what sample size the priors "wash out."
- Apply the tool to a real scenario: Track an NFL team's win probability through a season, updating after each game. Show how uncertainty decreases over the course of the season.
- Implement a simple betting strategy that bets when the posterior probability of a team winning exceeds the implied probability from market odds by a threshold. Backtest this strategy on simulated seasons.
Part D: Analysis Exercises (5 Exercises)
Exercise D.1: Evaluating the Poisson Assumption for Soccer Goals
Obtain or simulate a dataset of at least 380 soccer matches (one full season of a major league).
- Calculate the observed frequency of 0, 1, 2, 3, 4, 5, and 6+ goals per team per match.
- Estimate the Poisson parameter (lambda) from the data and calculate the expected frequencies under the Poisson model.
- Perform a chi-squared goodness-of-fit test. Does the Poisson model provide a statistically adequate fit?
- Calculate the observed variance-to-mean ratio. Is the data equidispersed, overdispersed, or underdispersed?
- If the Poisson model is rejected, fit a Negative Binomial distribution and compare its fit to the Poisson. Discuss the practical implications for a betting model.
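The dispersion check and the expected Poisson frequencies can be computed as sketched below. The goals list is a small hypothetical sample used only to make the example runnable; a real analysis would use the full dataset of at least 380 matches. The function names are illustrative.

```python
import math
from statistics import mean, variance

def variance_to_mean_ratio(goals):
    """Sample variance divided by sample mean; about 1 under a Poisson model."""
    return variance(goals) / mean(goals)

def poisson_expected_counts(lam, n_obs, max_k=5):
    """Expected counts for 0..max_k goals plus a final (max_k + 1)+ bucket."""
    probs = [math.exp(-lam) * lam ** k / math.factorial(k)
             for k in range(max_k + 1)]
    probs.append(1.0 - sum(probs))  # tail bucket, e.g. 6+ goals
    return [n_obs * p for p in probs]

# Hypothetical goals-per-team-per-match observations (illustration only).
goals = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0, 4, 1]

lam_hat = mean(goals)  # maximum likelihood estimate of the Poisson rate
ratio = variance_to_mean_ratio(goals)
expected = poisson_expected_counts(lam_hat, len(goals))

print(f"lambda = {lam_hat:.3f}, variance/mean = {ratio:.3f}")
print("expected counts (0..5, 6+):", [round(e, 2) for e in expected])
```

A ratio well above 1 points toward overdispersion, though with a sample this small the estimate is far too noisy to support any conclusion.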
Exercise D.2: Are Margins Against the Closing Spread Normally Distributed?
Using historical NFL game data (or a simulated dataset based on known parameters):
- Collect the difference between actual game margins and closing point spreads for at least 256 games (one full regular season under the 16-game schedule; a 17-game schedule has 272).
- Plot a histogram of spread margins and overlay a fitted normal distribution.
- Perform the Shapiro-Wilk test and the Anderson-Darling test for normality. Report the test statistics and p-values.
- Calculate the kurtosis and skewness of the spread margin distribution. Is the distribution leptokurtic (heavy-tailed) or platykurtic (light-tailed)?
- Discuss the implications of any deviation from normality for spread betting. If the distribution is heavy-tailed, how would this affect strategies that rely on tail probabilities (e.g., teaser bets, alternate spreads)?
Exercise D.3: Binomial Model for Parlay Success Rates
A bettor consistently picks winners at a 55% rate against the spread.
- Model the number of successful picks in a 5-leg parlay using the binomial distribution. What is the probability that all 5 legs win?
- Model the number of successful 5-leg parlays across 100 attempts. What is the expected number of winning parlays?
- Calculate the expected value of each parlay if the standard payout for a 5-leg parlay is 25:1. Is this a positive or negative expected value strategy?
- Compare the expected value and variance of: (a) 100 flat bets at -110, (b) 20 five-leg parlays, and (c) 10 ten-leg parlays, all with a 55% individual pick rate.
- At what individual pick success rate does a 5-leg parlay become a positive expected value bet at 25:1 odds? What about a 10-leg parlay at 700:1?
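The core parlay arithmetic can be sketched as follows, assuming independent legs and reading "25:1" as 25 units of profit per unit staked on a win. The function names are illustrative.

```python
def parlay_win_prob(p_leg, legs):
    """Probability that every leg wins, assuming independent legs."""
    return p_leg ** legs

def parlay_ev_per_unit(p_leg, legs, payout_to_one):
    """Expected profit per unit staked at payout_to_one : 1 odds."""
    p_win = parlay_win_prob(p_leg, legs)
    return p_win * payout_to_one - (1 - p_win)

def breakeven_leg_rate(legs, payout_to_one):
    """Per-leg win rate at which the parlay has zero expected value."""
    # Zero EV when p_leg ** legs == 1 / (payout_to_one + 1).
    return (1.0 / (payout_to_one + 1)) ** (1.0 / legs)

p_all_five = parlay_win_prob(0.55, 5)
ev_five = parlay_ev_per_unit(0.55, 5, 25)
be_five = breakeven_leg_rate(5, 25)
be_ten = breakeven_leg_rate(10, 700)

print(f"P(all 5 win) = {p_all_five:.4f}, EV per unit = {ev_five:+.4f}")
print(f"break-even leg rate: 5-leg = {be_five:.4f}, 10-leg = {be_ten:.4f}")
```

The expected number of winning parlays in 100 attempts is then 100 times the all-legs-win probability, since the count of winning parlays is itself binomial.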
Exercise D.4: Distribution of Player Props
Analyze the statistical distribution of a specific player prop market.
- Choose a player prop (e.g., NBA player points, NFL quarterback passing yards, MLB strikeouts). Collect or simulate at least 50 game observations.
- Fit at least three candidate distributions to the data (e.g., Normal, Poisson, Negative Binomial, Log-Normal, Gamma).
- Use AIC/BIC to rank the candidate distributions. Which provides the best fit?
- Using your best-fit distribution, calculate the probability of going Over and Under for a specific prop line.
- Compare your model's probabilities to the market odds. Discuss whether any discrepancies represent genuine value or model error.
Exercise D.5: The Impact of Prior Selection on Bayesian Win Probability
Investigate how different prior distributions affect posterior estimates of a team's win probability, especially early in the season.
- Define three Beta priors: (a) Uninformative: Beta(1, 1), (b) Weakly informative: Beta(5, 5), (c) Strongly informative: Beta(40, 30) — encoding a prior belief of approximately 57%.
- Simulate a team that has a true win probability of 0.60 over a 16-game NFL season. After each game, update all three posteriors.
- Plot all three posterior means over the course of the season. At what point do they converge?
- Calculate the 95% credible interval width for each prior at game 1, game 4, game 8, and game 16.
- Discuss the tradeoffs between informative and uninformative priors in the context of sports betting. When would a bettor prefer one over the other?
Part E: Research Exercises (5 Exercises)
Exercise E.1: The Dixon-Coles Model
Research the Dixon-Coles model, a refinement of the basic Poisson model for soccer.
- Read the original paper: Dixon, M. & Coles, S. (1997). "Modelling Association Football Scores and Inefficiencies in the Football Betting Market." Applied Statistics, 46(2), 265-280.
- Explain the key modification Dixon and Coles made to the independent Poisson model. What specific weakness of the basic model does it address?
- Describe the correlation adjustment factor (tau) and explain how it modifies probabilities for low-scoring outcomes (0-0, 1-0, 0-1, 1-1).
- Implement the Dixon-Coles model in Python and compare its predictions to the basic Poisson model on a dataset of at least 200 matches.
- Evaluate whether the Dixon-Coles modification meaningfully improves betting performance (measured by log-loss, Brier score, or simulated profit) compared to the basic Poisson model.
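To make the tau adjustment concrete, the sketch below applies the low-score correction on top of independent Poisson probabilities. The piecewise form of tau is written as it is commonly stated for this model and should be checked against the paper. The rates and the value of rho are illustrative placeholders, not estimates, and the time-weighting of past matches that the paper also proposes is not shown.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def tau(x, y, lam, mu, rho):
    """Low-score adjustment; x = home goals, y = away goals."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0  # all other scorelines are left unchanged

def dixon_coles_prob(x, y, lam, mu, rho):
    """Adjusted scoreline probability: tau times the independent Poisson product."""
    return tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)

lam, mu, rho = 1.6, 1.1, -0.1  # illustrative home rate, away rate, dependence

for x, y in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    independent = poisson_pmf(x, lam) * poisson_pmf(y, mu)
    adjusted = dixon_coles_prob(x, y, lam, mu, rho)
    print(f"{x}-{y}: independent = {independent:.4f}, adjusted = {adjusted:.4f}")

# The four adjustments cancel, so the probabilities still sum to (nearly) 1.
total = sum(dixon_coles_prob(x, y, lam, mu, rho)
            for x in range(15) for y in range(15))
print(f"total probability over the grid = {total:.6f}")
```

With a negative rho, the 0-0 and 1-1 draws gain probability at the expense of 1-0 and 0-1, which is the direction of correction the exercise asks you to explain.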
Exercise E.2: Extreme Value Theory in Sports Betting
Research Extreme Value Theory (EVT) and its potential applications in sports betting.
- Read an introductory text on EVT and summarize the three types of extreme value distributions (Gumbel, Fréchet, Weibull).
- Explain the Generalized Extreme Value (GEV) distribution and the Generalized Pareto Distribution (GPD). When is each appropriate?
- Describe at least three sports betting scenarios where extreme value theory would be more appropriate than standard distributions (Normal, Poisson).
- Using either historical data or simulated data, fit a GPD to the tail of NFL spread margins (e.g., games where the margin exceeds 20 points). Compare tail probability estimates from the GPD to those from a normal distribution.
- Discuss how a bettor could use EVT to improve their assessment of rare but high-impact outcomes, such as large underdogs covering or extreme totals.
Exercise E.3: Mixture Models for Sports Outcomes
Research Gaussian mixture models and their applications in sports analytics.
- Explain the concept of a mixture model. Why might a single distribution be insufficient to model certain sports outcomes?
- Describe at least two sports scenarios where a mixture model would provide a meaningfully better fit than a single distribution. (Example: The distribution of NFL game margins may be a mixture of "competitive games" and "blowouts.")
- Implement a two-component Gaussian mixture model for NFL point margins. Estimate the parameters using the Expectation-Maximization (EM) algorithm or a library such as scikit-learn.
- Compare the fit of the mixture model to a single Gaussian using AIC/BIC and visual inspection.
- Discuss how mixture models could improve betting market analysis. Could identifying the "component" a game belongs to (competitive vs. blowout) improve spread betting strategies?
Exercise E.4: The Weibull Distribution for Time-Based Events
Research the Weibull distribution and its application to time-between-events in sports.
- Describe the Weibull distribution, its parameters, and its relationship to the exponential distribution.
- Identify at least three sports scenarios where the Weibull distribution could model time-based data (e.g., time between goals in soccer, time to first score in a basketball game).
- Fit a Weibull distribution to time-between-goals data from at least 50 soccer matches. Compare the fit to an exponential distribution.
- If the Weibull shape parameter is significantly different from 1, what does this imply about the scoring process? (Is it memoryless? Does the rate increase or decrease over time?)
- Discuss how a non-exponential time-between-events model could create betting opportunities in live/in-play betting markets.
Exercise E.5: Copulas for Modeling Dependent Outcomes
Research copula functions and their potential for modeling dependence between sports outcomes.
- Explain what a copula is and why it is useful for modeling multivariate distributions where the marginal distributions are known but the dependence structure is not.
- Describe the Gaussian copula, the Clayton copula, and the Frank copula. What types of dependence does each capture?
- Identify at least two sports betting scenarios where dependence between outcomes matters and a copula approach could improve modeling. (Example: The correlation between goals scored by two teams in a match, or the dependence between a quarterback's passing yards and the team's total score.)
- Implement a simple bivariate model using a Gaussian copula: Model the joint distribution of home and away goals in soccer, where each margin follows a Poisson distribution but they are correlated. Compare to the standard independent Poisson model.
- Discuss the practical challenges and benefits of using copulas in sports betting models.
End of Chapter 7 Exercises