Chapter 42 Exercises: Research Frontiers
Instructions: Complete all exercises in the parts assigned by your instructor. These exercises explore frontier topics and require creative problem formulation, research design, and independent thinking. For programming challenges, include comments explaining your logic. Several exercises are intentionally open-ended to encourage original approaches.
Part A: Conceptual Understanding
Each problem is worth 5 points. Answer in complete sentences unless otherwise directed.
Exercise A.1 --- Open Problems Classification
Section 42.1 identifies six categories of open problems in sports betting research. For each category, classify it as primarily (a) a modeling problem, (b) a market structure problem, or (c) an operational problem. Justify each classification in one to two sentences and identify one category that could be argued to belong to a different class.
Exercise A.2 --- Causal vs. Predictive Reasoning
Explain the difference between a predictive question and a causal question using a concrete sports example. Then explain why a model that performs well for prediction might give incorrect answers to the corresponding causal question.
Exercise A.3 --- DAG Construction
Draw a Directed Acyclic Graph (DAG) for soccer match outcomes that includes at least eight variables. Identify: (a) one confounder, (b) one mediator, (c) one collider, and (d) one potential instrumental variable. For each, explain its role in the causal structure.
Exercise A.4 --- Reinforcement Learning Framing
Formally define the sports betting problem as a Markov decision process (MDP) for a single-sport bettor. Specify: (a) the state space, (b) the action space, (c) the transition dynamics, (d) the reward function, and (e) the discount factor. Discuss why the choice of discount factor matters in the betting context.
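As a starting point for part (e), the following minimal sketch compares the discounted value of one profit stream under two discount factors. The function name `discounted_return` and the daily profit figures are hypothetical illustrations, not taken from Section 42.3.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence (t starts at 0)."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Hypothetical daily profits in betting units: small early losses, later gains.
daily_profit = [-1.0, -1.0, 0.5, 2.0, 3.0]

patient = discounted_return(daily_profit, gamma=1.0)  # weights every day equally
myopic = discounted_return(daily_profit, gamma=0.5)   # heavily favors early days

print(f"gamma=1.0 -> {patient:.4f}")
print(f"gamma=0.5 -> {myopic:.4f}")
```

With gamma = 1.0 this stream has positive value, while with gamma = 0.5 the same stream has negative value, so a heavily discounting agent would reject a strategy whose payoff arrives late.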
Exercise A.5 --- Exploration vs. Exploitation
A bettor has five market types to choose from, with limited capital for each day. Explain the exploration-exploitation tradeoff in this context. Why is pure exploitation (always choosing the historically best market) suboptimal? Why is pure exploration (sampling all markets equally) also suboptimal? What information would you need to determine the right balance?
Exercise A.6 --- Market Microstructure Concepts
Define and distinguish between the following pairs of concepts from market microstructure: (a) adverse selection vs. inventory risk, (b) price discovery vs. price efficiency, (c) informed trading vs. noise trading. For each pair, give an example from sports betting markets.
Exercise A.7 --- Regression Discontinuity Requirements
Explain the key assumptions required for a valid regression discontinuity design. Then evaluate whether the NFL playoff cutoff (7th seed vs. 8th seed in each conference under the current 14-team format) satisfies these assumptions. What threats to validity exist?
Exercise A.8 --- Emerging Methodologies
Section 42.5 discusses five emerging methodologies: foundation models, graph neural networks, conformal prediction, causal machine learning, and synthetic data. Rank these from "most likely to impact practical bettors within 3 years" to "least likely" and justify your ranking.
Part B: Calculations and Short Problems
Each problem is worth 5 points. Show all work.
Exercise B.1 --- Instrumental Variable Estimation
A researcher wants to estimate the causal effect of home attendance on team performance (win margin) in the NBA. They propose using gameday weather as an instrument (rain reduces attendance).
First-stage regression: Attendance = 15000 + 2500 × SunnyDay, F-stat = 28.5
Second-stage regression: WinMargin = -2.1 + 0.0004 × PredictedAttendance
OLS regression (naive): WinMargin = 1.8 + 0.0002 × Attendance
(a) What is the IV estimate of the causal effect of attendance on win margin?
(b) Is the instrument strong? Justify.
(c) What is the estimated bias from confounding in the OLS estimate?
(d) State one potential violation of the exclusion restriction for this instrument.
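The arithmetic behind parts (a) through (c) can be sketched as follows, assuming a single instrument, in which case the IV estimate equals the reduced-form coefficient divided by the first-stage coefficient. The function names and all numbers below are hypothetical and deliberately differ from the exercise's values; the F > 10 threshold is the common rule of thumb, not a formal test.

```python
def wald_iv_estimate(reduced_form_coef, first_stage_coef):
    """IV estimate with one instrument: effect of Z on Y divided by effect of Z on X."""
    if first_stage_coef == 0:
        raise ValueError("Instrument has no first-stage effect; IV is undefined.")
    return reduced_form_coef / first_stage_coef


def looks_strong(first_stage_f, threshold=10.0):
    """Rule of thumb: treat a first-stage F-statistic above about 10 as strong."""
    return first_stage_f > threshold


# Hypothetical numbers (NOT the values from this exercise):
# the instrument shifts the treatment by 400 units and the outcome by 0.2 units.
iv_effect = wald_iv_estimate(reduced_form_coef=0.2, first_stage_coef=400.0)
ols_effect = 0.0008                      # hypothetical naive OLS slope
implied_bias = ols_effect - iv_effect    # OLS minus IV, attributed to confounding

print(iv_effect, implied_bias, looks_strong(6.0))
```

Reading the bias as "OLS minus IV" assumes the instrument is valid, which is exactly what part (d) asks you to question.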
Exercise B.2 --- Thompson Sampling Calculation
A Thompson Sampling bandit has three arms with the following Beta posterior parameters after 100 rounds, starting from a uniform Beta(1, 1) prior (so Alpha = Wins + 1 and Beta = Losses + 1):
| Arm | Alpha | Beta | Pulls | Wins |
|---|---|---|---|---|
| NFL Sides | 28 | 20 | 46 | 27 |
| NBA Totals | 22 | 15 | 35 | 21 |
| MLB ML | 9 | 12 | 19 | 8 |
(a) Calculate the posterior mean win probability for each arm.
(b) Which arm has the widest 95% credible interval? (Hint: for Beta(a, b), the variance is ab/((a+b)^2(a+b+1)).)
(c) If the algorithm selects the arm with the highest Thompson sample, which arm is most likely to be selected in the next round? Justify without simulation.
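A small helper for checking hand calculations in parts (a) and (b) might look like the sketch below. It uses the variance formula from the hint and a normal approximation for the interval, which is an assumption that becomes rough for small parameter values. The two example arms are hypothetical and are not the arms in the table.

```python
import math


def beta_mean(a, b):
    """Posterior mean of Beta(a, b)."""
    return a / (a + b)


def beta_variance(a, b):
    """Variance of Beta(a, b): ab / ((a + b)^2 (a + b + 1))."""
    return (a * b) / ((a + b) ** 2 * (a + b + 1))


def approx_95_interval(a, b):
    """Normal approximation: mean +/- 1.96 standard deviations, clipped to [0, 1]."""
    m = beta_mean(a, b)
    half = 1.96 * math.sqrt(beta_variance(a, b))
    return max(0.0, m - half), min(1.0, m + half)


# Hypothetical arms (NOT the table above): same posterior mean, different evidence.
few_pulls = (3, 2)      # Beta(3, 2)
many_pulls = (30, 20)   # Beta(30, 20)

for a, b in (few_pulls, many_pulls):
    low, high = approx_95_interval(a, b)
    print(f"Beta({a}, {b}): mean={beta_mean(a, b):.3f}, width={high - low:.3f}")
```

Both example arms have the same posterior mean, but the arm with fewer pulls has a much wider interval, which is why Thompson Sampling keeps sampling it occasionally.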
Exercise B.3 --- Kyle's Lambda
In a betting market, the standard deviation of the informed bettor's signal is estimated at sigma_v = 0.08, and the standard deviation of uninformed betting volume is sigma_u = 0.25.
(a) Calculate Kyle's lambda.
(b) If net order flow for a game is +$50,000 (net buying on the home team), what is the expected price impact? State the units you assume for sigma_u and express the order flow in the same units.
(c) How would lambda change if the market attracted twice as many uninformed bettors? What does this imply about market quality?
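The sketch below assumes the single-period Kyle model, in which lambda = sigma_v / (2 * sigma_u) and the price moves linearly with net order flow; if Section 42.4 defines lambda differently, follow the chapter. The inputs are hypothetical, and treating "twice as many independent uninformed bettors" as a doubling of noise variance is also an assumption.

```python
import math


def kyle_lambda(sigma_v, sigma_u):
    """Single-period Kyle price impact coefficient: sigma_v / (2 * sigma_u)."""
    return sigma_v / (2.0 * sigma_u)


def price_impact(lam, net_order_flow):
    """Linear pricing rule: price change = lambda * net order flow."""
    return lam * net_order_flow


# Hypothetical inputs (NOT this exercise's values). Order flow must be expressed
# in the same units as sigma_u for the product to be meaningful.
lam = kyle_lambda(sigma_v=0.10, sigma_u=0.50)
lam_more_noise = kyle_lambda(0.10, 0.50 * math.sqrt(2))  # noise variance doubled

print(lam, lam_more_noise, price_impact(lam, 0.2))
```

More noise flow lowers lambda, meaning a given bet moves the line less and the market is deeper.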
Exercise B.4 --- Regression Discontinuity Analysis
A researcher studies the effect of making the NBA playoffs on next-season attendance. Teams within 3 games of the playoff cutoff show:
| Group | N | Avg Next-Season Attendance | Avg Win Diff from Cutoff |
|---|---|---|---|
| Barely Made Playoffs | 24 | 17,200 | +1.5 games |
| Barely Missed Playoffs | 22 | 16,100 | -1.8 games |
(a) What is the estimated causal effect of playoff qualification on next-season attendance?
(b) The bandwidth is 3 games. What happens to the estimate's precision if the bandwidth is narrowed to 1 game?
(c) What covariate balance check should the researcher perform? Describe one specific test.
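To see why a simple difference in group means can differ from a regression discontinuity estimate, consider the following sketch. It fits a separate straight line on each side of the cutoff and measures the jump at the cutoff itself. The data are hypothetical and noiseless, and the helper names `fit_line` and `rd_estimate` are illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rd_estimate(running, outcome, cutoff=0.0, bandwidth=3.0):
    """Jump in the fitted outcome at the cutoff, using separate lines on each side."""
    left = [(r - cutoff, y) for r, y in zip(running, outcome)
            if cutoff - bandwidth <= r < cutoff]
    right = [(r - cutoff, y) for r, y in zip(running, outcome)
             if cutoff <= r <= cutoff + bandwidth]
    left_intercept, _ = fit_line(*zip(*left))
    right_intercept, _ = fit_line(*zip(*right))
    return right_intercept - left_intercept


# Hypothetical noiseless data: attendance rises 200 per game of margin, plus a
# built-in jump of 500 for teams at or above the cutoff.
games_from_cutoff = [-3, -2, -1, 1, 2, 3]
attendance = [16000 + 200 * g + (500 if g >= 0 else 0) for g in games_from_cutoff]

jump = rd_estimate(games_from_cutoff, attendance)
naive_gap = sum(attendance[3:]) / 3 - sum(attendance[:3]) / 3

print(jump, naive_gap)
```

In this constructed example the local linear estimate recovers the built-in jump, while the difference in means is larger because it also absorbs the slope in the running variable.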
Exercise B.5 --- PIN Model Basics
The Probability of Informed Trading (PIN) model decomposes trading activity into informed and uninformed components. In a simplified betting market:
- Probability of an information event: delta = 0.3
- Arrival rate of informed bettors: mu = 50 bets/hour
- Arrival rate of uninformed bettors on each side: epsilon = 100 bets/hour
(a) Calculate the PIN (probability that a random bet is from an informed bettor).
(b) If the sportsbook observes 180 bets in an hour on the home side and 110 bets on the away side, is this pattern more consistent with an information event favoring the home team than with no information event? Show your reasoning.
(c) How would the sportsbook use this inference to adjust the line?
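One way to organize the reasoning in parts (a) and (b) is sketched below, assuming Poisson arrivals and assuming that all informed bettors back the home side when the event favors the home team. The parameter `event_prob` plays the role of delta in this exercise, and the numbers are hypothetical rather than the exercise's own.

```python
import math


def pin(event_prob, mu, epsilon):
    """PIN = expected informed arrivals / expected total arrivals."""
    informed = event_prob * mu
    return informed / (informed + 2.0 * epsilon)


def poisson_log_pmf(k, rate):
    """log P(K = k) for a Poisson variable with the given rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def log_likelihood_ratio(home_bets, mu, epsilon):
    """log[P(home count | informed on home) / P(home count | no event)].

    The away-side count has rate epsilon in both scenarios, so it cancels.
    """
    return (poisson_log_pmf(home_bets, epsilon + mu)
            - poisson_log_pmf(home_bets, epsilon))


# Hypothetical parameters (NOT this exercise's values).
example_pin = pin(event_prob=0.2, mu=40.0, epsilon=80.0)   # 8 / 168
llr = log_likelihood_ratio(home_bets=115, mu=40.0, epsilon=80.0)

print(f"PIN = {example_pin:.4f}, log-likelihood ratio = {llr:.3f}")
```

A positive log-likelihood ratio means the observed home count is better explained by the information-event scenario; a full answer would also weight the scenarios by their prior probabilities.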
Exercise B.6 --- Conformal Prediction Interval
A conformal prediction model produces the following prediction intervals for five upcoming games' home win probabilities:
| Game | Point Estimate | 90% Interval |
|---|---|---|
| A | 0.58 | [0.51, 0.65] |
| B | 0.62 | [0.53, 0.71] |
| C | 0.55 | [0.49, 0.61] |
| D | 0.70 | [0.58, 0.82] |
| E | 0.51 | [0.44, 0.58] |
(a) For which games does the 90% interval include the typical market no-vig probability of 0.50? What does this imply about betting those games?
(b) Which game has the most uncertain prediction? How should bet sizing account for this uncertainty?
(c) Game D has the highest point estimate but the widest interval. A bettor's model says bet Home on Game D. What caution does the interval suggest?
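For context on where such intervals come from and how they feed a betting decision, the sketch below shows the split-conformal quantile rule and a conservative "bet only if the whole interval clears the market" check. The calibration scores, the symmetric interval, and the function names are hypothetical assumptions; Section 42.5 may construct its intervals differently.

```python
import math


def conformal_halfwidth(calibration_scores, alpha=0.10):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this coverage level
    return sorted(calibration_scores)[rank - 1]


def interval_around(point, halfwidth):
    """Symmetric interval clipped to the valid probability range."""
    return max(0.0, point - halfwidth), min(1.0, point + halfwidth)


def clears_market(interval, market_prob):
    """True only if the whole interval lies above the market probability."""
    return interval[0] > market_prob


# Hypothetical calibration scores: |model probability - reference probability|.
scores = [0.02, 0.05, 0.03, 0.08, 0.04, 0.06, 0.01, 0.07, 0.09]
half = conformal_halfwidth(scores, alpha=0.10)
interval = interval_around(0.60, half)

print(half, interval, clears_market(interval, 0.50), clears_market(interval, 0.52))
```

The same point estimate can pass or fail the check depending on interval width, which is the caution parts (b) and (c) ask you to articulate.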
Part C: Programming Challenges
Each problem is worth 10 points. Include working Python code with comments and sample output.
Exercise C.1 --- Instrumental Variable Estimator
Implement the two-stage least squares (2SLS) estimator from Section 42.2 and test it on synthetic data:
(a) Generate data where a treatment variable (pace) has a causal effect on outcome (wins), but both are confounded by an unobserved variable (talent).
(b) Generate an instrument (altitude) that affects pace but not wins directly.
(c) Run OLS and 2SLS, showing that OLS is biased and 2SLS recovers the true causal effect.
(d) Vary the instrument strength (relevance) and show how weak instruments affect the 2SLS estimate.
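A possible skeleton for this exercise is shown below. It is a sketch under stated assumptions, not the estimator from Section 42.2: with one instrument and one treatment, 2SLS reduces to cov(z, y) / cov(z, x), and that shortcut is used here instead of two explicit regressions. The coefficients (true effect 0.5, confounding strength 2.0) are hypothetical choices.

```python
import random


def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def simulate(n, true_effect, instrument_strength, seed=0):
    """Return (OLS slope, IV slope) for one synthetic sample."""
    rng = random.Random(seed)
    altitude, pace, wins = [], [], []
    for _ in range(n):
        talent = rng.gauss(0, 1)   # unobserved confounder
        z = rng.gauss(0, 1)        # instrument: affects pace, not wins directly
        x = instrument_strength * z + talent + rng.gauss(0, 1)
        y = true_effect * x + 2.0 * talent + rng.gauss(0, 1)
        altitude.append(z)
        pace.append(x)
        wins.append(y)
    ols = covariance(pace, wins) / covariance(pace, pace)
    iv = covariance(altitude, wins) / covariance(altitude, pace)
    return ols, iv


ols_slope, iv_slope = simulate(n=20000, true_effect=0.5, instrument_strength=1.0)
print(f"strong instrument: OLS={ols_slope:.3f}, IV={iv_slope:.3f}")

# Weaker instruments: the IV estimate becomes increasingly erratic.
for strength in (0.2, 0.02):
    ols_w, iv_w = simulate(n=20000, true_effect=0.5, instrument_strength=strength)
    print(f"strength={strength}: OLS={ols_w:.3f}, IV={iv_w:.3f}")
```

Under these settings OLS is biased upward (roughly 0.5 + 2/3 in expectation with the strong instrument) because talent raises both pace and wins, while the IV estimate should sit near 0.5.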
Exercise C.2 --- Thompson Sampling Market Selector
Implement the ThompsonSamplingBandit class and run a full simulation:
(a) Create 6 arms representing different sport/market combinations with different true win rates.
(b) Run for 1,000 rounds, recording the arm selected and reward at each round.
(c) Plot cumulative regret over time (regret = optimal reward - actual reward).
(d) Plot the allocation of bets across arms over time, showing convergence to the best arm.
(e) Compare Thompson Sampling to an epsilon-greedy strategy (epsilon = 0.1) over the same 1,000 rounds.
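The core loop for parts (a) through (c) could be organized along the lines of the sketch below. The class here is named `BetaBernoulliBandit` to make clear that it is a hypothetical stand-in and not the chapter's ThompsonSamplingBandit class; it assumes Bernoulli win/loss rewards, a Beta(1, 1) prior on every arm, and expected (pseudo-)regret rather than realized regret.

```python
import random


class BetaBernoulliBandit:
    """Thompson Sampling with an independent Beta(1, 1) prior on each arm."""

    def __init__(self, n_arms, rng):
        self.alpha = [1] * n_arms   # 1 + observed wins
        self.beta = [1] * n_arms    # 1 + observed losses
        self.rng = rng

    def select_arm(self):
        # Draw one sample per arm from its posterior and play the largest.
        samples = [self.rng.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return samples.index(max(samples))

    def update(self, arm, won):
        if won:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1


def run(true_rates, rounds, seed=0):
    """Simulate the bandit and return (pulls per arm, cumulative expected regret)."""
    rng = random.Random(seed)
    bandit = BetaBernoulliBandit(len(true_rates), rng)
    pulls = [0] * len(true_rates)
    regret = 0.0
    best = max(true_rates)
    for _ in range(rounds):
        arm = bandit.select_arm()
        won = rng.random() < true_rates[arm]
        bandit.update(arm, won)
        pulls[arm] += 1
        regret += best - true_rates[arm]
    return pulls, regret


# Hypothetical win rates for three arms; the last arm is best.
pulls, cumulative_regret = run([0.45, 0.50, 0.60], rounds=2000)
print(pulls, round(cumulative_regret, 2))
```

Recording `regret` after every round instead of only at the end gives the curve requested in part (c); an epsilon-greedy baseline for part (e) only needs a different `select_arm`.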
Exercise C.3 --- Betting Environment for RL
Extend the BettingEnvironment class from Section 42.3:
(a) Add multiple games per day with varying edges and odds.
(b) Add a simplified account limitation mechanism: if the agent's cumulative win rate exceeds 55% at a sportsbook, the maximum stake is halved.
(c) Add the option to bet at two different sportsbooks with slightly different odds.
(d) Implement a simple policy (e.g., bet when edge > 3%, stake proportional to edge) and run for 100 simulated seasons. Report the distribution of final bankrolls.
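The limitation rule in part (b) leaves some design choices open. The standalone sketch below makes two of them explicit as assumptions: the rule is only checked after a minimum number of bets, and the halving is applied once and is permanent. `StakeLimiter` is a hypothetical helper, not part of the BettingEnvironment class from Section 42.3.

```python
class StakeLimiter:
    """Tracks results at one sportsbook and caps the stake once the win rate is too high."""

    def __init__(self, base_max_stake, win_rate_trigger=0.55, min_bets=20):
        self.base_max_stake = base_max_stake
        self.win_rate_trigger = win_rate_trigger
        self.min_bets = min_bets   # assumption: ignore the rule on tiny samples
        self.bets = 0
        self.wins = 0
        self.limited = False

    def record(self, won):
        """Log one settled bet and apply the limit if the trigger is exceeded."""
        self.bets += 1
        self.wins += int(won)
        if (self.bets >= self.min_bets
                and self.wins / self.bets > self.win_rate_trigger):
            self.limited = True    # assumption: the limit is permanent once applied

    def max_stake(self):
        return self.base_max_stake / 2 if self.limited else self.base_max_stake

    def clip(self, requested_stake):
        """Largest stake the book will accept for this request."""
        return min(requested_stake, self.max_stake())


book = StakeLimiter(base_max_stake=500.0)
for i in range(20):
    book.record(won=(i % 5 != 0))   # 16 wins in 20 bets, an 80% win rate

print(book.limited, book.max_stake(), book.clip(400.0))
```

Keeping one limiter per sportsbook also supports part (c), since the agent can then route a bet to whichever book still accepts the desired stake.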
Exercise C.4 --- Causal Discovery with DAGs
Using synthetic sports data:
(a) Generate data from a known causal structure (DAG) with at least 6 variables relevant to game outcomes.
(b) Implement a simple causal discovery algorithm (e.g., PC algorithm using conditional independence tests) to recover the DAG structure from data alone.
(c) Compare the discovered DAG to the true DAG. Which edges were correctly identified? Which were missed or reversed?
(d) Demonstrate how conditioning on a collider creates a spurious correlation, and how conditioning on a confounder removes a spurious one.
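The collider half of part (d) can be demonstrated with a few lines of simulation, as in the sketch below. The variable names (offense, defense, win total) and the selection threshold are hypothetical; the two causes are independent by construction, so any correlation that appears after selection is induced purely by conditioning on their common effect.

```python
import random


def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)
n = 20000

# Offense and defense quality are independent; both cause the collider.
offense = [rng.gauss(0, 1) for _ in range(n)]
defense = [rng.gauss(0, 1) for _ in range(n)]
win_total = [o + d for o, d in zip(offense, defense)]

corr_all = correlation(offense, defense)

# Conditioning on the collider: keep only teams with a high win total.
kept = [(o, d) for o, d, w in zip(offense, defense, win_total) if w > 1.0]
corr_selected = correlation([o for o, _ in kept], [d for _, d in kept])

print(f"all teams: {corr_all:.3f}, high-win teams only: {corr_selected:.3f}")
```

Among the selected teams the correlation turns clearly negative: a team that reached a high win total with weak offense must have had strong defense, and vice versa.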
Exercise C.5 --- Market Microstructure Simulation
Build a simplified betting market simulation:
(a) Implement a market maker that sets opening odds and adjusts them based on incoming bet flow.
(b) Implement three types of bettors: an informed bettor (knows the true probability within noise), noise bettors (random), and a momentum bettor (follows recent line moves).
(c) Run the simulation for 1,000 time steps and track how the price converges to the true value.
(d) Calculate Kyle's lambda from the simulated data and compare it to the theoretical value.
(e) Vary the fraction of informed bettors and show how it affects price discovery speed and the market maker's P&L.
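A deliberately reduced starting point for parts (a) through (c) is sketched below. It assumes unit-sized bets, a market maker who moves the implied probability by a fixed `impact` per bet, and only two bettor types; the momentum bettor and the market maker's P&L are left for you to add. All parameter values are hypothetical.

```python
import random


def simulate_market(true_prob, opening_prob, steps, informed_share, impact, seed=0):
    """Return the price path of a market maker who moves the line with bet flow."""
    rng = random.Random(seed)
    price = opening_prob
    path = [price]
    for _ in range(steps):
        if rng.random() < informed_share:
            # Informed bettor backs the home side only when it is underpriced.
            flow = 1 if price < true_prob else -1
        else:
            # Noise bettor picks a side at random.
            flow = rng.choice([1, -1])
        # Linear price adjustment, kept inside a valid probability range.
        price = min(0.99, max(0.01, price + impact * flow))
        path.append(price)
    return path


# Hypothetical settings: true home win probability 0.60, line opened at 0.50.
path = simulate_market(true_prob=0.60, opening_prob=0.50, steps=1000,
                       informed_share=0.3, impact=0.002)
opening_error = abs(path[0] - 0.60)
closing_error = abs(path[-1] - 0.60)

print(f"opening error {opening_error:.3f}, closing error {closing_error:.3f}")
```

Because the informed flow always pushes toward the true value while noise flow averages out, the price drifts to the neighborhood of 0.60; rerunning with different `informed_share` values is a direct route into part (e).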
Part D: Analysis and Interpretation
Each problem is worth 10 points. Write clear, structured analyses.
Exercise D.1 --- Research Proposal
Choose one open problem from Section 42.1 and write a two-page research proposal. Include:
(a) Problem statement: What is the specific question, and why does it matter?
(b) Data and methods: What data would you need, and what analytical approach would you use?
(c) Expected contribution: What would a successful answer look like, and who would benefit?
(d) Limitations and challenges: What could go wrong, and how would you address it?
Exercise D.2 --- Natural Experiment Analysis
The 2020 NBA Bubble (games played without fans at a neutral site in Orlando) provides a natural experiment for isolating the causes of home-court advantage. Design a study to estimate the causal effect of playing at home in front of fans:
(a) Define the treatment, control, outcome, and key covariates.
(b) Describe the identification strategy: why can the bubble be treated as a quasi-random assignment of teams to the no-crowd, neutral-site condition?
(c) Identify at least two threats to internal validity and explain how you would address them.
(d) Discuss the external validity question: does the bubble estimate tell us about normal home-court advantage?
Exercise D.3 --- Future of Sports Betting Essay
Write a structured argument (800-1,000 words) addressing: "What will be the three most important technical skills for quantitative sports bettors in 2035?" Ground your argument in specific trends discussed in Chapter 42, and for each skill, explain (a) why it will be important, (b) what current developments point in this direction, and (c) what the skill enables that current approaches cannot achieve.
Part E: Integration and Synthesis
Each problem is worth 10 points. These problems require creative synthesis across chapters.
Exercise E.1 --- Causal Model for Betting Edge
Build a causal model (DAG + analysis) for understanding why a specific betting strategy generates edge:
(a) Choose a strategy (e.g., "bet unders in NBA back-to-back games").
(b) Draw a DAG showing the causal pathway from the structural feature (back-to-back game) to the outcome (total points) to the betting result (under wins).
(c) Identify the specific causal mechanism that the market may be underpricing.
(d) Design a test to distinguish whether the strategy's historical profitability reflects a genuine causal relationship or a spurious correlation.
Exercise E.2 --- RL Agent for Multi-Sport Allocation
Design and implement an RL agent that learns to allocate capital across sports:
(a) Create an environment where 3-5 sports have different (and changing) edge profiles over a simulated season.
(b) The agent's action is the allocation of daily betting capital across sports.
(c) Train the agent using a policy gradient method.
(d) Compare the trained agent's performance to a fixed-allocation baseline and to an oracle that knows the true edges.
(e) Analyze what the agent has learned: does it shift allocation toward currently profitable sports?
Exercise E.3 --- Cross-Domain Transfer Learning
Investigate whether a model trained on one sport transfers to another:
(a) Train a game outcome prediction model on synthetic NBA data using generic features (home advantage, team strength differential, rest differential).
(b) Apply the trained model (without retraining) to synthetic NFL data with analogous features.
(c) Measure the performance drop.
(d) Fine-tune the NBA model on a small amount of NFL data and compare performance to a model trained from scratch on NFL data alone.
(e) Discuss what this experiment reveals about the potential and limitations of transfer learning in sports betting.
Exercise E.4 --- Market Efficiency Measurement Framework
Design a comprehensive framework for measuring market efficiency across different sports and market types:
(a) Define at least three distinct tests of market efficiency (e.g., profitability of simple strategies, closing line efficiency, prediction market accuracy).
(b) Implement each test using simulated market data.
(c) Apply the framework to compare the efficiency of "sides" markets vs. "totals" markets vs. "player props" markets.
(d) Discuss what your results imply about where the best opportunities lie for quantitative bettors.
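Three building blocks that the tests in part (a) commonly rely on are sketched below: removing the bookmaker's margin from quoted odds, the return from flat betting, and a calibration score. Proportional normalization is only one of several ways to remove the vig, and the odds and results shown are hypothetical.

```python
def no_vig_probabilities(decimal_odds):
    """Normalize inverse odds so the implied probabilities sum to 1."""
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)   # exceeds 1 by the bookmaker's margin
    return [r / total for r in raw]


def flat_bet_roi(bets):
    """ROI of staking one unit on each (decimal_odds, won) pair."""
    profit = sum((odds - 1.0) if won else -1.0 for odds, won in bets)
    return profit / len(bets)


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Hypothetical closing odds and results, for illustration only.
fair = no_vig_probabilities([1.80, 2.10])
roi = flat_bet_roi([(1.91, True), (1.91, False), (1.91, True), (1.91, False)])
score = brier_score([0.5, 0.5], [1, 0])

print([round(p, 4) for p in fair], round(roi, 4), score)
```

In an efficient market, flat betting at the quoted odds should lose roughly the margin, and closing no-vig probabilities should be hard to beat on Brier score; comparing these quantities across sides, totals, and props is the substance of part (c).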
Exercise E.5 --- Responsible Research Ethics
Write a thoughtful analysis (600-800 words) on the ethical considerations of sports betting research. Address:
(a) The tension between advancing analytical methods and the potential for those methods to harm vulnerable individuals through problem gambling.
(b) Whether researchers have obligations regarding the publication of profitable strategies (if publishing a strategy causes its edge to disappear, is publication itself a form of responsible disclosure?).
(c) The role of regulation and self-regulation in ensuring that quantitative advantages are used responsibly.
(d) Specific safeguards that researchers and practitioners should adopt.