Chapter 17 Exercises: Modeling MLB
Part A: Conceptual Questions (Exercises 1--8)
Exercise 1. Explain the fundamental difference between wOBA and batting average as measures of offensive production. Describe three specific in-game situations where a hitter with a .260 batting average could have a substantially higher wOBA than a hitter batting .290. Why does this distinction matter for projecting run scoring in a betting context?
Exercise 2. FIP and ERA both purport to measure pitching quality. A pitcher has posted a 4.60 ERA and a 3.30 FIP over the first half of the season. Explain the likely causes of this discrepancy, identify which metric is more predictive of second-half performance, and describe how a bettor could exploit the market's likely reaction to this pitcher's surface-level results.
Exercise 3. The Pythagorean expectation formula uses an exponent of 1.83 for MLB:
$$\text{Win\%} = \frac{\text{RS}^{1.83}}{\text{RS}^{1.83} + \text{RA}^{1.83}}$$
Explain the intuition behind this formula. Why is the exponent 1.83 rather than 2? Under what circumstances would a team's actual win percentage diverge significantly from its Pythagorean expectation, and how would you use this information for betting?
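As a minimal sketch for experimenting with the formula, the function below evaluates it for any exponent. The team totals (750 runs scored, 700 allowed) are hypothetical values chosen only for illustration.

```python
def pythagorean_win_pct(runs_scored: float, runs_allowed: float,
                        exponent: float = 1.83) -> float:
    """Expected win percentage from runs scored (RS) and runs allowed (RA)."""
    if runs_scored < 0 or runs_allowed < 0:
        raise ValueError("run totals must be non-negative")
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    if rs + ra == 0:
        raise ValueError("at least one run total must be positive")
    return rs / (rs + ra)


# Hypothetical team: 750 runs scored, 700 allowed over a full season.
expected_pct = pythagorean_win_pct(750, 700)
print(f"Pythagorean win%: {expected_pct:.3f}")
print(f"Expected wins over 162 games: {162 * expected_pct:.1f}")

# Changing the exponent shows how sensitive the estimate is to that choice.
print(f"With exponent 2.00: {pythagorean_win_pct(750, 700, exponent=2.0):.3f}")
```

Comparing the two exponents on the same inputs is one way to build intuition for why the fitted value is slightly below 2.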
Exercise 4. Describe the concept of metric stabilization in baseball. Why does strikeout rate stabilize after approximately 60 plate appearances while BABIP requires 800 or more? What are the practical implications for an in-season betting model that must decide how much weight to assign current-year data versus preseason projections?
Exercise 5. A team's lineup features six left-handed batters. Tonight they face a left-handed starting pitcher. Using the approximate platoon split data from the chapter (LHB vs. LHP: wOBA ~ .305; LHB vs. RHP: wOBA ~ .330), estimate the aggregate wOBA reduction for the lineup compared to facing a right-handed pitcher. Translate this reduction into approximate runs per game and explain how the moneyline should shift.
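One possible way to organize the arithmetic is sketched below. It rests on several simplifying assumptions that are not from the chapter: every lineup slot is weighted equally, the three remaining batters are held constant, and the wOBA-to-runs conversion uses a wOBA scale of about 1.20 and about 38 plate appearances per team game. Treat the result as a rough order of magnitude, not a definitive answer.

```python
LHB_VS_LHP = 0.305   # from the exercise
LHB_VS_RHP = 0.330   # from the exercise
N_LEFT_HANDED = 6
LINEUP_SIZE = 9

WOBA_SCALE = 1.20    # assumption: typical value, varies by season
PA_PER_GAME = 38     # assumption: approximate team plate appearances per game

# Each left-handed batter loses this much wOBA against a lefty.
per_batter_drop = LHB_VS_RHP - LHB_VS_LHP

# Average the drop over all nine slots (equal weighting is an assumption).
lineup_drop = per_batter_drop * N_LEFT_HANDED / LINEUP_SIZE

# Convert wOBA points to runs: (delta wOBA / scale) runs per PA, times PA.
runs_drop = lineup_drop / WOBA_SCALE * PA_PER_GAME

print(f"Lineup wOBA reduction: {lineup_drop:.4f}")
print(f"Approximate runs per game lost: {runs_drop:.2f}")
```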
Exercise 6. Discuss why the negative binomial distribution is preferred over the Poisson distribution for modeling MLB run scoring. Define overdispersion in the context of run distributions and explain the mechanism in baseball (multi-run innings) that produces it. Under what circumstances might the simpler Poisson model be adequate?
Exercise 7. Explain the concept of "closing line value" (CLV) in MLB betting. Why is CLV considered a more reliable indicator of long-term betting skill than raw profit-and-loss? Describe a scenario where a bettor shows positive CLV but negative short-term profits, and explain why this bettor should continue their approach.
Exercise 8. Coors Field has a park factor of approximately 1.35 for runs. A naive modeler simply multiplies all Rockies hitters' statistics by 1/1.35 to "de-Coors" their numbers. Explain why this approach is flawed. Describe at least three separate adjustments that a more sophisticated model should apply when evaluating Rockies players for road games.
Part B: Calculation Problems (Exercises 9--15)
Exercise 9. Compute the wOBA for a hitter with the following season totals: 80 singles, 25 doubles, 5 triples, 20 home runs, 45 walks, 3 HBP, 0 IBB, 400 AB, 6 SF. Use the standard weights: $w_{BB} = 0.690$, $w_{HBP} = 0.720$, $w_{1B} = 0.880$, $w_{2B} = 1.245$, $w_{3B} = 1.575$, $w_{HR} = 2.015$. Interpret the result relative to the league-average wOBA of .315.
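To check a hand calculation, a short sketch like the following can help. It uses the weights given in the exercise and the commonly used denominator $AB + BB - IBB + SF + HBP$. The function name and argument order are illustrative, and the walk total is assumed to include any intentional walks.

```python
WEIGHTS = {"BB": 0.690, "HBP": 0.720, "1B": 0.880,
           "2B": 1.245, "3B": 1.575, "HR": 2.015}


def woba(singles: int, doubles: int, triples: int, home_runs: int,
         walks: int, hbp: int, ibb: int, at_bats: int, sac_flies: int) -> float:
    """Weighted on-base average; intentional walks are excluded."""
    unintentional_bb = walks - ibb
    numerator = (WEIGHTS["BB"] * unintentional_bb
                 + WEIGHTS["HBP"] * hbp
                 + WEIGHTS["1B"] * singles
                 + WEIGHTS["2B"] * doubles
                 + WEIGHTS["3B"] * triples
                 + WEIGHTS["HR"] * home_runs)
    denominator = at_bats + walks - ibb + sac_flies + hbp
    return numerator / denominator


# Season totals from the exercise.
hitter_woba = woba(singles=80, doubles=25, triples=5, home_runs=20,
                   walks=45, hbp=3, ibb=0, at_bats=400, sac_flies=6)
LEAGUE_WOBA = 0.315
print(f"wOBA: {hitter_woba:.3f} ({hitter_woba - LEAGUE_WOBA:+.3f} vs. league)")
```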
Exercise 10. A pitcher has the following season statistics over 150 innings: 180 strikeouts, 45 walks, 8 HBP, 18 home runs. The FIP constant for this season is $C_{\text{FIP}} = 3.15$. Calculate the pitcher's FIP and compare it to a league-average ERA of 4.20. Then compute xFIP assuming a fly ball count of 160 and a league HR/FB rate of 11.5%.
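The sketch below evaluates the standard FIP formula, $\text{FIP} = \frac{13 \cdot HR + 3 \cdot (BB + HBP) - 2 \cdot K}{IP} + C_{\text{FIP}}$, and obtains xFIP by replacing actual home runs with fly balls times the league HR/FB rate. The function names are illustrative; if the chapter defines either metric differently, follow the chapter.

```python
def fip(home_runs: float, walks: int, hbp: int, strikeouts: int,
        innings: float, constant: float) -> float:
    """Fielding Independent Pitching."""
    return (13 * home_runs + 3 * (walks + hbp) - 2 * strikeouts) / innings + constant


def xfip(fly_balls: int, league_hr_per_fb: float, walks: int, hbp: int,
         strikeouts: int, innings: float, constant: float) -> float:
    """FIP with home runs replaced by expected home runs on fly balls."""
    expected_hr = fly_balls * league_hr_per_fb
    return fip(expected_hr, walks, hbp, strikeouts, innings, constant)


# Inputs from the exercise.
pitcher_fip = fip(home_runs=18, walks=45, hbp=8, strikeouts=180,
                  innings=150, constant=3.15)
pitcher_xfip = xfip(fly_balls=160, league_hr_per_fb=0.115, walks=45, hbp=8,
                    strikeouts=180, innings=150, constant=3.15)
print(f"FIP:  {pitcher_fip:.2f}")
print(f"xFIP: {pitcher_xfip:.2f}")
```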
Exercise 11. Using the Pitcher Quality Score formula from the chapter:
$$\text{PQS} = 100 + 15 \times (0.30 \cdot z_{\text{FIP}} + 0.25 \cdot z_{\text{K\%}} + 0.20 \cdot z_{\text{BB\%}} + 0.10 \cdot z_{\text{GB\%}} + 0.15 \cdot z_{\text{Stuff+}})$$
compute the PQS for a pitcher with FIP = 3.50, K% = 26.0, BB% = 7.0, GB% = 46.0, Stuff+ = 108. Use the league-average reference points: FIP = 4.20 (SD 0.70), K% = 22.0 (SD 5.0), BB% = 8.0 (SD 2.0), GB% = 43.0 (SD 5.0), Stuff+ = 100.0 (SD 15.0).
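A small sketch of the PQS computation follows. It assumes that every z-score is oriented so that a positive value means better than average, which requires flipping the sign for FIP and BB% (where lower is better). Check that orientation against the chapter's definition before relying on the output.

```python
# (league mean, league SD, direction): direction = -1 when lower is better.
# The direction flags are an assumption about how the z-scores are signed.
LEAGUE = {
    "FIP":    (4.20, 0.70, -1),
    "K%":     (22.0, 5.0, +1),
    "BB%":    (8.0, 2.0, -1),
    "GB%":    (43.0, 5.0, +1),
    "Stuff+": (100.0, 15.0, +1),
}
PQS_WEIGHTS = {"FIP": 0.30, "K%": 0.25, "BB%": 0.20, "GB%": 0.10, "Stuff+": 0.15}


def pitcher_quality_score(stats: dict[str, float]) -> float:
    """PQS = 100 + 15 * weighted sum of oriented z-scores."""
    composite = 0.0
    for metric, weight in PQS_WEIGHTS.items():
        mean, sd, direction = LEAGUE[metric]
        z = direction * (stats[metric] - mean) / sd
        composite += weight * z
    return 100 + 15 * composite


# Pitcher from the exercise.
pqs = pitcher_quality_score(
    {"FIP": 3.50, "K%": 26.0, "BB%": 7.0, "GB%": 46.0, "Stuff+": 108}
)
print(f"PQS: {pqs:.1f}")
```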
Exercise 12. A game at Wrigley Field has a base park factor of 1.05. Game-time conditions: temperature 88 degrees F, wind blowing out at 14 mph, humidity 40%. Using the environmental adjustment coefficients from the chapter ($\text{TEMP\_COEFF} = 0.002$, $\text{WIND\_OUT\_COEFF} = 0.008$, $\text{HUMIDITY\_COEFF} = 0.0003$, baselines: 72 F, 5 mph, 50%), compute the total adjusted park factor. If the neutral-site run projection for each team is 4.3, what is the weather-adjusted projected total?
Exercise 13. Using a Poisson model with $\lambda_{\text{home}} = 4.8$ and $\lambda_{\text{away}} = 3.5$:
(a) Calculate $P(\text{home scores exactly 5 runs})$.
(b) Calculate $P(\text{away scores 0 runs})$, i.e., a shutout.
(c) Estimate $P(\text{home wins})$ by summing over the joint distribution (you may truncate at 12 runs).
(d) Calculate the probability that the total exceeds 8.5.
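The four parts can be checked numerically with a sketch such as the one below, which uses only the standard library. It assumes the two teams' run totals are independent, and it handles regulation ties by reallocating them in proportion to the decided outcomes, which is one simple convention among several. The function names are illustrative.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def regulation_outcomes(lam_home: float, lam_away: float,
                        max_runs: int = 12) -> dict[str, float]:
    """Sum the truncated joint distribution, assuming independent scores."""
    home_win = tie = away_win = 0.0
    for h in range(max_runs + 1):
        for a in range(max_runs + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home_win += p
            elif h == a:
                tie += p
            else:
                away_win += p
    return {"home_win": home_win, "tie": tie, "away_win": away_win}


def prob_total_over(lam_home: float, lam_away: float, line: float) -> float:
    """P(total > line); the sum of independent Poissons is itself Poisson."""
    lam_total = lam_home + lam_away
    at_or_under = sum(poisson_pmf(k, lam_total)
                      for k in range(math.floor(line) + 1))
    return 1.0 - at_or_under


outcomes = regulation_outcomes(4.8, 3.5)
decided = outcomes["home_win"] + outcomes["away_win"]

print(f"(a) P(home = 5):  {poisson_pmf(5, 4.8):.4f}")
print(f"(b) P(away = 0):  {poisson_pmf(0, 3.5):.4f}")
print(f"(c) P(home wins): {outcomes['home_win'] / decided:.4f} (ties reallocated)")
print(f"(d) P(total > 8.5): {prob_total_over(4.8, 3.5, 8.5):.4f}")
```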
Exercise 14. A game has projected run totals of $\lambda_A = 5.0$ (favorite) and $\lambda_B = 3.5$ (underdog) using a negative binomial model with $r = 6$. The run line is set at $-1.5$ for the favorite at +135 odds (implied probability 42.6%). Using the negative binomial PMF to compute $P(\text{favorite wins by 2+})$, determine whether the run line offers positive expected value. Show your work.
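A sketch of the computation is given below. It assumes the negative binomial is parameterized by its mean $\mu$ and dispersion $r$ (so that $p = r / (r + \mu)$ and the variance is $\mu + \mu^2 / r$), that both teams share $r = 6$, and that the two scores are independent. If the chapter uses a different parameterization, adjust `nb_pmf` accordingly.

```python
import math


def nb_pmf(k: int, mean: float, r: float) -> float:
    """Negative binomial P(X = k) with the given mean and dispersion r."""
    p = r / (r + mean)
    log_coeff = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return math.exp(log_coeff + r * math.log(p) + k * math.log(1 - p))


def prob_win_by_two_or_more(mean_fav: float, mean_dog: float, r: float,
                            max_runs: int = 40) -> float:
    """P(favorite margin >= 2) over a truncated joint distribution."""
    fav = [nb_pmf(k, mean_fav, r) for k in range(max_runs + 1)]
    dog = [nb_pmf(k, mean_dog, r) for k in range(max_runs + 1)]
    return sum(fav[f] * dog[d]
               for f in range(max_runs + 1)
               for d in range(max_runs + 1)
               if f - d >= 2)


def ev_per_unit(prob: float, american_odds: int) -> float:
    """Expected profit per 1 unit staked at the given American odds."""
    profit = american_odds / 100 if american_odds > 0 else 100 / abs(american_odds)
    return prob * profit - (1 - prob)


p_cover = prob_win_by_two_or_more(5.0, 3.5, r=6.0)
print(f"P(favorite wins by 2+): {p_cover:.4f}")
print(f"Break-even probability at +135: {100 / 235:.4f}")
print(f"EV per unit staked: {ev_per_unit(p_cover, 135):+.4f}")
```

The run line has positive expected value only if `p_cover` exceeds the break-even probability implied by the price.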
Exercise 15. A bettor's model gives the home team a 58% win probability. The market offers the home team at $-145$ (raw implied probability 59.2% including vig, or approximately 57.3% after removing 4% total vig). Calculate: (a) the expected value of a \$100 bet on the home team, (b) the Kelly criterion optimal bet fraction assuming a \$10,000 bankroll, and (c) the half-Kelly bet amount.
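The three quantities can be computed with a short sketch like this one. It uses the usual Kelly formula $f^* = (bp - q) / b$, where $b$ is the net profit per unit staked, and it floors the fraction at zero on the assumption that a negative result means no bet. The function names are illustrative.

```python
def profit_per_unit(american_odds: int) -> float:
    """Net profit on a winning 1-unit stake at American odds."""
    return american_odds / 100 if american_odds > 0 else 100 / abs(american_odds)


def expected_value(prob: float, american_odds: int, stake: float) -> float:
    """Expected profit of a bet given the bettor's win probability."""
    b = profit_per_unit(american_odds)
    return stake * (prob * b - (1 - prob))


def kelly_fraction(prob: float, american_odds: int) -> float:
    """Kelly fraction f* = (b*p - q) / b, floored at zero (no bet)."""
    b = profit_per_unit(american_odds)
    return max(0.0, (b * prob - (1 - prob)) / b)


MODEL_PROB = 0.58
ODDS = -145
BANKROLL = 10_000

ev = expected_value(MODEL_PROB, ODDS, stake=100)
full_kelly = kelly_fraction(MODEL_PROB, ODDS)
half_kelly_amount = 0.5 * full_kelly * BANKROLL

print(f"(a) EV of a $100 bet: {ev:+.2f}")
print(f"(b) Kelly fraction:   {full_kelly:.4f}")
print(f"(c) Half-Kelly stake: ${half_kelly_amount:,.2f}")
```

Note that the expected value depends on the price actually offered, not on the vig-free probability.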
Part C: Programming Exercises (Exercises 16--20)
Exercise 16. Using the pybaseball library (or synthetic data if API access is unavailable), write a Python function that pulls individual pitcher statistics for a given season and computes the Pitcher Quality Score for every qualified pitcher (minimum 100 IP). Return a sorted DataFrame ranking pitchers from best to worst PQS. Include proper type hints and docstrings.
Exercise 17. Implement a complete park factor calculator class in Python. The class should:

- Accept multi-year team-level home/road scoring data
- Compute raw park factors for runs and home runs
- Apply regression toward 1.0 based on the number of years of data (use a regression factor of $\frac{n}{n+k}$ with $k = 3$ years)
- Return both raw and regressed park factors for each venue

Include a worked example using at least 5 parks with realistic data.
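As a possible starting point, the two core calculations might look like the sketch below. The raw factor uses a common simple definition (combined runs per game at home divided by combined runs per game on the road), which is an assumption here; the chapter's definition takes precedence. The numbers in the usage lines are hypothetical.

```python
def raw_park_factor(home_rs: float, home_ra: float, home_games: int,
                    road_rs: float, road_ra: float, road_games: int) -> float:
    """Combined runs per game at home divided by the same rate on the road."""
    home_rate = (home_rs + home_ra) / home_games
    road_rate = (road_rs + road_ra) / road_games
    return home_rate / road_rate


def regressed_park_factor(raw: float, n_years: int, k: float = 3.0) -> float:
    """Shrink the raw factor toward 1.0 with weight n / (n + k)."""
    weight = n_years / (n_years + k)
    return 1.0 + weight * (raw - 1.0)


# Hypothetical venue: 3 seasons of data with a raw run factor of 1.30.
print(f"Regressed factor: {regressed_park_factor(1.30, n_years=3):.3f}")

# Hypothetical single-season totals for a hitter-friendly park.
print(f"Raw factor: {raw_park_factor(400, 400, 81, 350, 350, 81):.3f}")
```

A full solution would wrap these in a class, loop over venues and seasons, and repeat the calculation for home runs.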
Exercise 18. Write a Python function that takes two team expected run totals ($\lambda_A$ and $\lambda_B$) and produces a complete betting analysis using both Poisson and negative binomial models. The function should return:

- Moneyline probabilities and fair American odds
- Run line ($\pm 1.5$) cover probabilities
- Over/under probabilities for a given total
- First-five-innings projections

Compare the outputs of the two models for at least three different matchups spanning blowout, close, and high-scoring game scenarios.
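One building block the function needs is a conversion from a win probability to fair (no-vig) American odds. A minimal sketch, with an illustrative function name, is:

```python
def fair_american_odds(prob: float) -> float:
    """Convert a win probability to no-vig American odds."""
    if not 0.0 < prob < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    if prob >= 0.5:
        # Favorite: amount risked to win 100, shown as a negative number.
        return -100.0 * prob / (1.0 - prob)
    # Underdog: amount won on a 100 stake, shown as a positive number.
    return 100.0 * (1.0 - prob) / prob


for p in (0.40, 0.50, 0.60):
    print(f"p = {p:.2f} -> {fair_american_odds(p):+.0f}")
```

By this convention an even-money probability of 0.50 is reported as -100, which is equivalent to +100.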
Exercise 19. Build a platoon matchup simulator. Given a pitcher's handedness and quality score, and a list of nine batters with their handedness and wOBA, the simulator should:

- Apply platoon adjustment factors to each matchup
- Estimate the lineup's aggregate wOBA against this pitcher
- Convert to expected runs per game
- Compare the result against a lineup with opposite-handed batters

Run the simulation for at least two pitcher profiles (an ace lefty and a back-end righty) against the same lineup.
Exercise 20. Create an MLB market analysis dashboard script that generates a synthetic dataset of 2,000 games and computes:

- Reverse line movement detection and profitability
- Umpire impact on totals (top 5 and bottom 5 umpires)
- Underdog profitability by odds bucket
- Monthly over/under patterns

Output the results in a well-formatted report. Use matplotlib or a similar library to produce at least two visualizations.
Part D: Analysis Exercises (Exercises 21--25)
Exercise 21. Consider two pitchers with the following profiles:
| Metric | Pitcher X | Pitcher Y |
|---|---|---|
| ERA | 3.20 | 4.80 |
| FIP | 4.00 | 3.50 |
| xFIP | 4.10 | 3.60 |
| K% | 19.0% | 27.5% |
| BB% | 5.5% | 8.0% |
| BABIP | .240 | .340 |
| GB% | 52.0% | 38.0% |
(a) Which pitcher is likely performing above or below true talent, and why?
(b) Calculate the PQS for each pitcher and determine who your model would project as better going forward.
(c) If the market sets a moneyline based heavily on ERA, describe the specific betting opportunity this creates.
Exercise 22. You are building a model for a game at Oracle Park (San Francisco) on an evening with the following conditions: temperature 56 degrees F, wind blowing in at 18 mph, humidity 85%, fog advisory. The starting pitchers are both ground-ball specialists with FIPs of 3.40 and 3.55. Walk through the full projection process:

- Start with neutral-site team run projections
- Apply the base park factor
- Apply environmental adjustments
- Determine the implied total
- Compare to the posted total of 7.5 and assess whether an over or under bet has value
Exercise 23. Analyze the relationship between a team's first-half record and its second-half record using the concept of Pythagorean expectation. A team finishes the first half (81 games) with a 50--31 record (.617), having scored 420 runs and allowed 360 runs. Calculate the team's Pythagorean win percentage and determine whether their record is sustainable. Project their second-half win total under two scenarios: (a) their run scoring and run prevention rates continue, and (b) their record regresses to their Pythagorean expectation.
Exercise 24. You discover that games called by a particular home-plate umpire have averaged 0.8 more runs than the league average over a sample of 120 games. Assess whether this effect is statistically significant (assume the standard deviation of runs per game is 3.5). If it is significant, calculate how the over/under probability should shift for a game with this umpire behind the plate when the posted total is 8.5 and your model projects a total of 9.0. Discuss sample size considerations.
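For the significance check, one reasonable approach is a one-sample z-test, sketched below with the standard library only. Treating the 3.5-run standard deviation as known and the 120 games as independent draws are assumptions of this sketch.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def one_sample_z(effect: float, sd: float, n: int) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) for a mean difference."""
    standard_error = sd / math.sqrt(n)
    z = effect / standard_error
    p_two_sided = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p_two_sided


z_stat, p_value = one_sample_z(effect=0.8, sd=3.5, n=120)
print(f"z = {z_stat:.3f}, two-sided p = {p_value:.4f}")
```

When interpreting the p-value, keep in mind how many umpires were screened to find this one, since the largest effect among many candidates is likely to be overstated.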
Exercise 25. An MLB team trades for an ace starting pitcher at the trade deadline. The ace has a FIP of 2.80 and a PQS of 125. The pitcher he replaces in the rotation had a FIP of 4.50 and a PQS of 88. The team pitches a five-man rotation. Estimate the impact of this acquisition on: (a) the team's expected run prevention per game (averaged across the rotation), (b) the team's projected win total over the remaining 65 games, and (c) the moneyline adjustment in games where the new ace starts versus where the old pitcher would have started. State all assumptions clearly.
Part E: Research Exercises (Exercises 26--30)
Exercise 26. Research the historical profitability of betting MLB underdogs at various price points (+100 to +120, +120 to +150, +150 to +200, +200+). Summarize the academic and industry findings. Discuss whether this "underdog bias" still exists and identify the structural reasons that would cause it to persist or diminish. Propose a specific filtering strategy that would isolate the most profitable subset of underdogs.
Exercise 27. Investigate the impact of Statcast data (exit velocity, launch angle, xwOBA) on MLB betting markets since its introduction in 2015. Has the market become more efficient as Statcast data has become widely available? Find at least three specific examples where the gap between traditional statistics and Statcast metrics would have identified a betting edge, and assess whether those edges would still be available today.
Exercise 28. Study the effect of bullpen usage patterns on MLB totals. Research how the rise of the "opener" strategy, increased bullpen usage, and shorter starter outings have affected run-scoring patterns over the past decade. Analyze whether traditional models that assume 6 innings from the starter need to be updated, and propose specific adjustments for the modern game.
Exercise 29. Investigate weather-based betting strategies for MLB totals. Research the empirical relationship between wind direction, wind speed, temperature, and run scoring at wind-sensitive parks (Wrigley Field, Kauffman Stadium, Oracle Park). Determine whether a strategy of betting overs on days with strong outward winds and warm temperatures shows historical profitability after accounting for the market's own adjustments to the total.
Exercise 30. Design and document a complete MLB betting system for a 162-game season. Your system should integrate: (a) pitcher matchup projections, (b) park and weather adjustments, (c) bullpen availability tracking, (d) lineup-level platoon analysis, and (e) market data for identifying line value. Describe the data pipeline, model architecture, bet selection criteria, bankroll management rules, and performance evaluation framework. Address how you would handle the cold start problem at the beginning of the season.