Chapter 9 Exercises: Regression Analysis for Sports Modeling
Instructions: Complete all exercises in the parts assigned by your instructor. Show all work for calculation problems. For programming challenges, include comments explaining your logic and provide sample output. For analysis and research problems, cite your sources where applicable.
Part A: Conceptual Understanding
Each problem is worth 5 points. Answer in complete sentences unless otherwise directed.
Exercise A.1 --- Linear vs. Logistic Regression
Explain the fundamental difference between linear regression and logistic regression in terms of (a) the type of outcome variable each models, (b) the link function used, and (c) the loss function minimized during fitting. Provide a sports betting example where each would be the appropriate choice.
Exercise A.2 --- Assumptions of Ordinary Least Squares
List the four core assumptions of ordinary least squares (OLS) linear regression (linearity, independence, homoscedasticity, normality of residuals). For each assumption, describe (a) what it means in plain language, (b) how you would test or diagnose a violation, and (c) a specific sports modeling scenario where the assumption might be violated.
Exercise A.3 --- Multicollinearity and Its Consequences
Define multicollinearity in the context of multiple regression. Explain (a) why multicollinearity does not bias coefficient estimates but inflates their standard errors, (b) how the Variance Inflation Factor (VIF) is computed and what threshold indicates a problem, and (c) why multicollinearity is particularly common when building sports models with correlated team statistics (e.g., yards per play and points scored).
Exercise A.4 --- Overfitting and the Bias-Variance Tradeoff
A modeler builds an NFL game totals model using 47 features and 256 regular-season games from a single season. Explain (a) why this model is at high risk of overfitting, (b) the bias-variance tradeoff and how it applies here, (c) what cross-validation is and how it mitigates the risk, and (d) why out-of-sample performance is the only honest measure of a betting model's quality.
Exercise A.5 --- Interpreting Logistic Regression Coefficients
In logistic regression, the coefficients represent log-odds rather than direct changes in probability. Explain (a) what a log-odds ratio of 0.35 for the feature "home_team" means in practical terms, (b) how to convert this to an odds ratio, (c) why the marginal effect on probability depends on the baseline probability, and (d) why this nonlinearity matters when translating logistic regression outputs into betting probabilities.
Exercise A.6 --- Feature Engineering for Sports Models
Describe the concept of feature engineering and explain why it is often more important than algorithm selection in sports modeling. Provide five specific examples of engineered features for an NFL game prediction model, explaining for each (a) the raw data inputs, (b) the transformation applied, and (c) why the engineered feature is more predictive than the raw inputs alone.
Exercise A.7 --- Regularization Methods
Compare and contrast Ridge regression (L2 penalty), Lasso regression (L1 penalty), and Elastic Net. For each, explain (a) the form of the penalty term, (b) its effect on coefficient estimates, (c) when it is preferable, and (d) how you would select the regularization hyperparameter in a sports betting context. Why are regularized methods particularly valuable when building models with many correlated sports statistics?
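As a reference point for your comparison, here is a minimal, hypothetical scikit-learn sketch (the feature matrix and target are synthetic placeholders) showing how each penalty is applied and how the regularization strength can be chosen by cross-validation:

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler

# Hypothetical design matrix of correlated team statistics and a point-margin target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))
y = X[:, 0] * 3 + X[:, 1] * 2 + rng.normal(scale=5, size=500)

# Standardize first: both penalties are sensitive to feature scale.
X_std = StandardScaler().fit_transform(X)

# Each CV estimator searches its penalty strength (alpha) by cross-validation.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X_std, y)
lasso = LassoCV(cv=5).fit(X_std, y)
enet = ElasticNetCV(cv=5, l1_ratio=[0.2, 0.5, 0.8]).fit(X_std, y)

print("Ridge alpha:", ridge.alpha_)
print("Lasso alpha:", lasso.alpha_, "nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))
print("ElasticNet alpha:", enet.alpha_, "l1_ratio:", enet.l1_ratio_)
```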
Exercise A.8 --- Regression to the Mean in Sports Contexts
Explain regression to the mean and why it is one of the most important concepts in sports analytics. Provide (a) a precise statistical definition, (b) an example in NFL scoring where a team that scores 35 points in Week 1 is likely to score fewer in Week 2, (c) how a regression model naturally accounts for this phenomenon, and (d) how failing to account for regression to the mean leads to biased betting decisions.
Part B: Calculations
Each problem is worth 5 points. Show all work and round final answers to the indicated precision.
Exercise B.1 --- Interpreting Regression Coefficients
A linear regression model predicts NFL game point differentials (home score minus away score) with the following estimated equation:
$$\hat{y} = 2.48 + 0.12 \times \text{home\_off\_epa} - 0.09 \times \text{away\_off\_epa} + 0.07 \times \text{home\_def\_epa} - 0.05 \times \text{rest\_advantage}$$
Where:
- home_off_epa = home team's offensive EPA per play (season average)
- away_off_epa = away team's offensive EPA per play (season average)
- home_def_epa = home team's defensive EPA per play (lower is better)
- rest_advantage = days of rest difference (home minus away)
(a) Interpret the intercept of 2.48 in the context of the model.
(b) Interpret the coefficient 0.12 on home_off_epa.
(c) If the home team has home_off_epa = 0.08, away_off_epa = 0.04, home_def_epa = -0.05, and rest_advantage = 3, what is the predicted point differential?
(d) The sportsbook line is Home -3.5. Based on the model, should you bet the home team or the away team against the spread?
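If you want to verify your hand calculation for part (c), a few lines of Python that evaluate the fitted equation will do; the helper name below is illustrative:

```python
def predict_point_diff(home_off_epa: float, away_off_epa: float,
                       home_def_epa: float, rest_advantage: float) -> float:
    """Evaluate the fitted linear equation from Exercise B.1."""
    return (2.48
            + 0.12 * home_off_epa
            - 0.09 * away_off_epa
            + 0.07 * home_def_epa
            - 0.05 * rest_advantage)

# Feature values from part (c); compare the result to the -3.5 line in part (d).
print(round(predict_point_diff(0.08, 0.04, -0.05, 3), 2))
```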
Exercise B.2 --- R-Squared Interpretation
A model predicting NBA game totals (combined points scored by both teams) using pace, offensive rating, and defensive rating yields the following training results:
- $R^2 = 0.42$
- Adjusted $R^2 = 0.41$
- $n = 1230$, $p = 3$ predictors
(a) Interpret $R^2 = 0.42$ in the context of this model.
(b) Why is the adjusted $R^2$ lower, and what does the small gap suggest?
(c) The model's RMSE is 14.2 points. Explain what this means for betting purposes if the sportsbook total is set at 224.5.
(d) A colleague adds 12 more features (player-level stats); the in-sample $R^2$ increases to 0.47, but the cross-validated (out-of-sample) $R^2$ drops to 0.39. What happened, and what should you conclude?
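For parts (a) and (b), recall the definition of adjusted $R^2$:

$$\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$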
Exercise B.3 --- Variance Inflation Factor (VIF)
You build a model with the following features and their computed VIF values:
| Feature | VIF |
|---|---|
| Points per game | 8.7 |
| Yards per game | 12.3 |
| Turnovers per game | 1.4 |
| Third-down conversion rate | 4.2 |
| Red zone scoring rate | 3.8 |
| Time of possession | 9.1 |
(a) Which features show concerning multicollinearity? Justify your threshold choice.
(b) Explain why "yards per game" and "points per game" have high VIFs.
(c) Propose two strategies to address the multicollinearity without losing important predictive information.
(d) If you remove "yards per game," would you expect the VIF of "points per game" to decrease? Why?
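For reference, VIF values like those in the table are obtained by regressing each feature on all the others; with statsmodels the computation looks roughly like the sketch below (the DataFrame name and columns in the usage comment are hypothetical):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def vif_table(features: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the rest."""
    X = add_constant(features)  # include an intercept so the auxiliary regressions are well specified
    vifs = {col: variance_inflation_factor(X.values, i)
            for i, col in enumerate(X.columns) if col != "const"}
    return pd.Series(vifs).sort_values(ascending=False)

# usage (hypothetical DataFrame of team statistics):
# print(vif_table(team_stats[["points_pg", "yards_pg", "turnovers_pg"]]))
```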
Exercise B.4 --- Logistic Regression Probability Calculation
A logistic regression model for NBA home wins produces the following equation:
$$\log\left(\frac{p}{1-p}\right) = -0.15 + 0.42 \times \text{net\_rating\_diff} + 0.28 \times \text{home\_court} + 0.11 \times \text{rest\_advantage}$$
(a) For a game where net_rating_diff = 3.5, home_court = 1, and rest_advantage = 1, calculate the predicted log-odds, odds, and probability of a home win.
(b) The sportsbook implies a 62% home win probability. Does the model suggest value on the home team or the away team?
(c) Calculate the predicted probability if net_rating_diff changes from 3.5 to 5.5, holding other features constant. How much did the probability change?
(d) Explain why the same 2.0-unit increase in net_rating_diff would produce a different probability change if the baseline probability were 50% instead of the value computed in (a).
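To check your work in parts (a) and (c), the log-odds to odds to probability conversion takes only a few lines; the function name is illustrative:

```python
import math

def home_win_probability(net_rating_diff: float, home_court: int, rest_advantage: float) -> float:
    """Evaluate the fitted logistic equation and map log-odds to a probability."""
    log_odds = -0.15 + 0.42 * net_rating_diff + 0.28 * home_court + 0.11 * rest_advantage
    odds = math.exp(log_odds)
    return odds / (1 + odds)

p1 = home_win_probability(3.5, 1, 1)   # part (a)
p2 = home_win_probability(5.5, 1, 1)   # part (c)
print(round(p1, 3), round(p2, 3), round(p2 - p1, 3))
```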
Exercise B.5 --- Residual Analysis
A regression model predicting MLB run totals produces the following residual statistics:
- Mean residual: 0.02
- Standard deviation of residuals: 2.8 runs
- Skewness of residuals: 0.45
- Kurtosis of residuals: 4.1
- Durbin-Watson statistic: 1.85
(a) Assess whether the residuals have approximately zero mean. Is the model biased?
(b) The skewness is 0.45 (positive). What does this indicate about the distribution of prediction errors?
(c) The kurtosis is 4.1 (normal is 3.0). What does excess kurtosis of 1.1 suggest about tail behavior?
(d) Interpret the Durbin-Watson statistic of 1.85. Is there evidence of autocorrelation?
(e) Overall, how concerned should you be about assumption violations in this model?
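If you later want to reproduce statistics like these from a model of your own, scipy and statsmodels provide the pieces; a rough sketch using placeholder residuals:

```python
import numpy as np
from scipy.stats import skew, kurtosis
from statsmodels.stats.stattools import durbin_watson

residuals = np.random.default_rng(1).normal(scale=2.8, size=500)  # placeholder residuals

print("mean:", round(residuals.mean(), 3))
print("std:", round(residuals.std(ddof=1), 3))
print("skewness:", round(skew(residuals), 3))
print("kurtosis (Pearson, normal = 3):", round(kurtosis(residuals, fisher=False), 3))
print("Durbin-Watson:", round(durbin_watson(residuals), 3))
```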
Exercise B.6 --- Comparing Nested Models with F-Test
You compare two models predicting NHL game goal differentials:
- Model A (restricted): 4 features, RSS = 1240, $R^2 = 0.18$
- Model B (full): 8 features, RSS = 1180, $R^2 = 0.22$
- Sample size: $n = 820$
(a) Compute the F-statistic for comparing Model A to Model B:
$$F = \frac{(RSS_A - RSS_B) / (p_B - p_A)}{RSS_B / (n - p_B - 1)}$$
(b) With numerator df = 4 and denominator df = 811, the critical $F$ at $\alpha = 0.05$ is approximately 2.38. Is Model B significantly better than Model A?
(c) Despite statistical significance, Model B only improves $R^2$ by 0.04. Discuss whether this improvement is practically meaningful for betting purposes.
(d) How would you use AIC or BIC to make this model selection decision instead?
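A short script can confirm your F-statistic in part (a) and supply an exact p-value (values copied from the exercise):

```python
from scipy.stats import f

rss_a, rss_b = 1240.0, 1180.0
p_a, p_b, n = 4, 8, 820

df_num = p_b - p_a              # 4
df_den = n - p_b - 1            # 811
f_stat = ((rss_a - rss_b) / df_num) / (rss_b / df_den)
p_value = f.sf(f_stat, df_num, df_den)   # right-tail probability
f_crit = f.ppf(0.95, df_num, df_den)     # critical value at alpha = 0.05

print(round(f_stat, 2), round(p_value, 5), round(f_crit, 2))
```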
Exercise B.7 --- Cross-Validation Metrics
You perform 5-fold cross-validation on a point-spread prediction model with these results:
| Fold | Training RMSE | Validation RMSE | Training $R^2$ | Validation $R^2$ |
|---|---|---|---|---|
| 1 | 10.2 | 13.1 | 0.34 | 0.22 |
| 2 | 10.4 | 12.8 | 0.33 | 0.24 |
| 3 | 10.1 | 14.5 | 0.35 | 0.16 |
| 4 | 10.3 | 12.6 | 0.33 | 0.25 |
| 5 | 10.5 | 13.3 | 0.32 | 0.20 |
(a) Calculate the mean and standard deviation of the validation RMSE across folds.
(b) Calculate the mean training and validation $R^2$. What does the gap between them indicate?
(c) Fold 3 has a notably higher validation RMSE. What might cause one fold to perform significantly worse?
(d) Based on these results, what is the expected out-of-sample RMSE when this model is deployed for betting? How does it compare to a naive baseline that simply predicts every game's final margin to equal the closing spread (RMSE = 13.8)?
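A couple of lines of numpy will confirm your fold summaries for parts (a) and (b):

```python
import numpy as np

val_rmse = np.array([13.1, 12.8, 14.5, 12.6, 13.3])
train_r2 = np.array([0.34, 0.33, 0.35, 0.33, 0.32])
val_r2 = np.array([0.22, 0.24, 0.16, 0.25, 0.20])

print("validation RMSE: mean =", round(val_rmse.mean(), 2),
      "sd =", round(val_rmse.std(ddof=1), 2))  # sample standard deviation
print("mean train R2 =", round(train_r2.mean(), 3),
      "mean validation R2 =", round(val_r2.mean(), 3))
```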
Part C: Programming Challenges
Each problem is worth 10 points. Write clean, well-documented Python code. Include docstrings, type hints, and at least three test cases per function.
Exercise C.1 --- Linear Regression Model for NFL Point Spreads
Build a complete linear regression pipeline for predicting NFL game point differentials using scikit-learn and statsmodels.
Requirements:
- Generate synthetic NFL game data with realistic features (offensive EPA, defensive EPA, turnover margin, home/away indicator, rest days, etc.) or use a provided dataset.
- Split data into training (80%) and testing (20%) sets with proper temporal ordering (no future data leakage).
- Fit an OLS model using statsmodels and print the full summary (coefficients, p-values, $R^2$).
- Evaluate out-of-sample performance with RMSE, MAE, and $R^2$.
- Produce residual diagnostic plots: residuals vs. fitted, Q-Q plot, residuals histogram, and residuals vs. each feature.
- Compare predicted point differentials against synthetic sportsbook lines to identify simulated value bets.
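A minimal starting scaffold, not a full solution, is sketched below under the assumption that you generate a synthetic DataFrame with a game_date column; extend it with the diagnostic plots and the sportsbook-line comparison the requirements call for:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# --- synthetic data (placeholder; substitute the provided dataset if you have one) ---
rng = np.random.default_rng(42)
n = 800
games = pd.DataFrame({
    "game_date": pd.date_range("2021-09-01", periods=n, freq="D"),
    "home_off_epa": rng.normal(0.0, 0.08, n),
    "away_off_epa": rng.normal(0.0, 0.08, n),
    "turnover_margin": rng.integers(-3, 4, n),
    "rest_diff": rng.integers(-3, 4, n),
})
games["point_diff"] = (2.0 + 18 * games["home_off_epa"] - 18 * games["away_off_epa"]
                       + 2.5 * games["turnover_margin"] + 0.3 * games["rest_diff"]
                       + rng.normal(0, 12, n))

# --- temporal 80/20 split: train only on earlier games, so no look-ahead leakage ---
games = games.sort_values("game_date")
cut = int(len(games) * 0.8)
train, test = games.iloc[:cut], games.iloc[cut:]

features = ["home_off_epa", "away_off_epa", "turnover_margin", "rest_diff"]
X_train = sm.add_constant(train[features])
X_test = sm.add_constant(test[features])

model = sm.OLS(train["point_diff"], X_train).fit()
print(model.summary())

pred = model.predict(X_test)
rmse = np.sqrt(np.mean((test["point_diff"] - pred) ** 2))
print("out-of-sample RMSE:", round(rmse, 2))
```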
Exercise C.2 --- Logistic Regression Win Probability Model
Build a logistic regression model that predicts the probability of the home team winning an NBA game.
Requirements:
- Use features including net rating differential, home-court advantage, rest days, back-to-back indicator, travel distance, and recent form (last 10 games).
- Implement proper train/test splitting with stratified sampling to maintain class balance.
- Fit using scikit-learn's LogisticRegression with appropriate regularization.
- Evaluate with accuracy, log-loss, Brier score, and a calibration plot.
- Convert predicted probabilities to implied odds and compare against synthetic moneylines.
- Output a function get_edge(home_features, away_features, market_odds) that returns the estimated edge.
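The sketch below is one possible starting point, using synthetic features and a simplified edge helper (the full exercise asks for the richer get_edge(home_features, away_features, market_odds) signature):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, brier_score_loss, log_loss

# placeholder synthetic features: net rating diff, home court, rest diff, back-to-back flag
rng = np.random.default_rng(7)
n = 2000
X = np.column_stack([
    rng.normal(0, 5, n),        # net_rating_diff
    np.ones(n),                 # home_court
    rng.integers(-2, 3, n),     # rest_advantage
    rng.integers(0, 2, n),      # back_to_back
])
logit = -0.1 + 0.15 * X[:, 0] + 0.25 * X[:, 1] + 0.10 * X[:, 2] - 0.20 * X[:, 3]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

clf = LogisticRegression(C=1.0, max_iter=1000).fit(X_tr, y_tr)  # C controls L2 strength
p = clf.predict_proba(X_te)[:, 1]
print("accuracy:", round(accuracy_score(y_te, clf.predict(X_te)), 3))
print("log-loss:", round(log_loss(y_te, p), 3))
print("Brier:", round(brier_score_loss(y_te, p), 3))

def get_edge(model_prob: float, market_decimal_odds: float) -> float:
    """Simplified edge: model probability minus the probability implied by the market price."""
    return model_prob - 1.0 / market_decimal_odds
```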
Exercise C.3 --- Feature Selection Pipeline
Implement a comprehensive feature selection pipeline that compares multiple methods on the same sports dataset.
Requirements:
- Implement at least four feature selection methods: (1) correlation-based filtering, (2) Variance Inflation Factor (VIF) sequential removal, (3) Lasso-based selection (L1 regularization path), and (4) Recursive Feature Elimination (RFE) with cross-validation.
- Apply all four methods to a dataset with at least 15 candidate features for predicting a sports outcome.
- For each method, record the selected features, the cross-validated performance, and the computation time.
- Produce a summary comparison table and a visualization showing which features are selected by each method.
- Recommend a final feature set with justification.
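As a starting point, the following sketch illustrates methods (3) and (4) on a synthetic dataset; you will still need to add the correlation filter, the VIF removal loop, timing, and the comparison table:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LassoCV, LinearRegression

# placeholder dataset: 18 candidate features, only a handful informative
X, y = make_regression(n_samples=600, n_features=18, n_informative=6, noise=10.0, random_state=3)

# (3) Lasso path: keep features whose coefficients survive the L1 penalty
lasso = LassoCV(cv=5).fit(X, y)
lasso_selected = np.where(lasso.coef_ != 0)[0]

# (4) Recursive feature elimination with cross-validation
rfe = RFECV(LinearRegression(), step=1, cv=5).fit(X, y)
rfe_selected = np.where(rfe.support_)[0]

print("Lasso keeps feature indices:", lasso_selected)
print("RFECV keeps feature indices:", rfe_selected)
```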
Exercise C.4 --- Regression Diagnostic Tool
Build a reusable RegressionDiagnostics class that accepts a fitted model and produces a comprehensive diagnostic report.
Requirements:
- Accept fitted models from both statsmodels and scikit-learn.
- Compute and report: $R^2$, adjusted $R^2$, RMSE, MAE, Durbin-Watson statistic, condition number, VIF for all features, Breusch-Pagan test for heteroscedasticity, Jarque-Bera test for residual normality, and Cook's distance for influential observations.
- Generate a four-panel diagnostic plot: (1) residuals vs. fitted, (2) Q-Q plot, (3) scale-location plot, (4) residuals vs. leverage with Cook's distance contours.
- Flag any diagnostic that exceeds a configurable threshold and print a plain-English warning.
- Include a summary() method that prints all diagnostics in a formatted report.
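A bare-bones skeleton of the class, showing how a few of the required statistics map onto statsmodels functions, might look like this; the remaining diagnostics, plots, and threshold warnings are left to you:

```python
import numpy as np
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson, jarque_bera

class RegressionDiagnostics:
    """Skeleton: wraps residuals and a design matrix; extend per the requirements."""

    def __init__(self, residuals: np.ndarray, exog: np.ndarray):
        self.residuals = np.asarray(residuals)
        self.exog = np.asarray(exog)  # design matrix including an intercept column

    def summary(self) -> dict:
        bp_stat, bp_pvalue, _, _ = het_breuschpagan(self.residuals, self.exog)
        jb_stat, jb_pvalue, _, _ = jarque_bera(self.residuals)
        return {
            "durbin_watson": durbin_watson(self.residuals),
            "breusch_pagan_p": bp_pvalue,
            "jarque_bera_p": jb_pvalue,
        }
```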
Exercise C.5 --- End-to-End Regression-to-Betting Pipeline
Build a complete pipeline that goes from raw data to betting recommendations.
Requirements:
- Data ingestion: load game data and historical odds data (use synthetic data if needed).
- Feature engineering: compute rolling averages, strength-of-schedule adjustments, rest factors, and situational flags.
- Model training: fit a regularized linear regression (Ridge or Elastic Net) with cross-validated hyperparameter tuning using GridSearchCV.
- Prediction: generate predictions for upcoming games with confidence intervals.
- Betting logic: compare model predictions to closing lines, compute expected value, and output a recommended bet list sorted by edge.
- Bankroll management: apply Kelly Criterion sizing to each recommended bet.
- Reporting: output a formatted table of recommendations and a backtest summary showing historical P&L if the model had been deployed over the test set.
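For the bankroll-management step, a simple Kelly sizing helper is sketched below (the fractional-Kelly default of 0.5 is an illustrative choice, not a prescription):

```python
def kelly_fraction(win_prob: float, decimal_odds: float, fraction: float = 0.5) -> float:
    """Kelly stake as a fraction of bankroll: f* = (b*p - q) / b, with b = decimal_odds - 1.

    A fractional multiplier (e.g., half-Kelly) tempers variance; a negative raw Kelly
    means no edge, so the stake is floored at zero.
    """
    b = decimal_odds - 1.0
    q = 1.0 - win_prob
    raw = (b * win_prob - q) / b
    return max(0.0, raw * fraction)

# usage: model says 55% on a bet priced at -110 (decimal odds about 1.909)
print(round(kelly_fraction(0.55, 1.909), 4))
```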
Part D: Analysis & Interpretation
Each problem is worth 5 points. Provide structured, well-reasoned responses.
Exercise D.1 --- Interpreting a Published Regression Model
A published paper reports the following regression results for predicting NFL game totals:
| Variable | Coefficient | Std Error | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | 34.2 | 4.1 | 8.34 | <0.001 |
| Home offensive yards/game | 0.042 | 0.008 | 5.25 | <0.001 |
| Away offensive yards/game | 0.038 | 0.009 | 4.22 | <0.001 |
| Home defensive yards allowed/game | 0.029 | 0.010 | 2.90 | 0.004 |
| Away defensive yards allowed/game | 0.025 | 0.011 | 2.27 | 0.024 |
| Dome indicator | 2.8 | 1.2 | 2.33 | 0.020 |
| Wind speed (mph) | -0.15 | 0.06 | -2.50 | 0.013 |
$R^2 = 0.31$, Adjusted $R^2 = 0.29$, RMSE = 12.4, $n = 512$
(a) Interpret the coefficient on "dome indicator" in context.
(b) Is the model practically useful for betting, given that RMSE = 12.4 points?
(c) What important features appear to be missing from this model?
(d) The $R^2$ is 0.31. A classmate says "the model only explains 31% of variance --- it's useless." Construct a counter-argument.
(e) How would you use this model to identify value bets on game totals?
Exercise D.2 --- Model Comparison for Betting Purposes
You have built three models to predict the outcome of soccer matches (home win, draw, away win):
| Model | Training Accuracy | Test Accuracy | Log-Loss | Brier Score | Calibration |
|---|---|---|---|---|---|
| Logistic Regression | 52% | 50% | 1.02 | 0.215 | Good |
| Random Forest | 61% | 48% | 1.08 | 0.228 | Poor |
| Gradient Boosting | 58% | 51% | 1.00 | 0.210 | Moderate |
(a) Which model shows the clearest signs of overfitting? How can you tell?
(b) For betting purposes, which metric is most important: accuracy, log-loss, or Brier score? Justify your answer.
(c) Why does the logistic regression model have the best calibration despite lower training accuracy?
(d) Which model would you deploy for betting, and what would you do to improve it?
Exercise D.3 --- Diagnosing Model Failure
You build a regression model to predict NBA player points scored per game. The model has excellent in-sample fit ($R^2 = 0.72$) but performs terribly when applied to the current season ($R^2 = 0.15$). You investigate and find:
- The model was trained on 2019-2022 data.
- The NBA changed its foul-calling rules before the 2022-23 season, reducing free throw attempts league-wide by 12%.
- One key feature is "free throw attempts per game" which has shifted downward for nearly all players.
- The model's largest errors are on high-usage guards who previously generated many free throws.
(a) Explain why the model's performance degraded in terms of distributional shift.
(b) What specific type of model failure is this (concept drift, covariate shift, label shift)?
(c) Propose three approaches to fix or adapt the model for the new rules environment.
(d) How could you have designed the original model to be more robust to this type of rule change?
(e) What lesson does this case teach about the shelf life of sports betting models?
Exercise D.4 --- Feature Importance vs. Causation
A regression model for predicting college football game outcomes shows that "average recruiting ranking over the past 3 years" is the single most important feature, with a coefficient nearly twice as large as any other predictor.
(a) Does this mean recruiting ranking causes winning? Explain the difference between predictive importance and causal inference.
(b) Identify at least two confounding variables that might explain the correlation between recruiting ranking and winning.
(c) Could you use this feature in a betting model even if the relationship is not causal? Why or why not?
(d) How would you test whether recruiting ranking adds predictive value beyond what is already captured by more direct performance metrics (e.g., yards per play, turnover margin)?
Exercise D.5 --- Evaluating a Regression-Based Betting Strategy
A bettor reports the following backtested results from a regression-based NFL totals model over five seasons:
| Season | Games Bet | Win Rate | ROI | Units Profit |
|---|---|---|---|---|
| 2020 | 68 | 58.8% | +7.2% | +4.9 |
| 2021 | 72 | 55.6% | +3.1% | +2.2 |
| 2022 | 65 | 56.9% | +5.0% | +3.3 |
| 2023 | 70 | 52.9% | +0.2% | +0.1 |
| 2024 | 74 | 50.0% | -4.5% | -3.3 |
(a) Describe the trend in performance over time. What pattern do you observe?
(b) Propose three explanations for why the model's performance appears to be decaying.
(c) Is the overall 5-season sample of 349 bets at a combined ~54.8% win rate statistically significant at the 5% level against a null hypothesis of 50%? (Use the normal approximation to the binomial.)
(d) What steps would you take before continuing to use this model with real money in 2025?
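For part (c), you can check your normal-approximation calculation with a short script; whether a one-sided or two-sided test is appropriate is part of your answer (the sketch below reports the two-sided p-value):

```python
import math
from scipy.stats import norm

n_bets, wins = 349, round(0.548 * 349)   # approximate win count from the reported rate
p_hat, p0 = wins / n_bets, 0.5

# z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n), one-sample proportion test against 50%
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n_bets)
p_value_two_sided = 2 * norm.sf(abs(z))
print(round(z, 2), round(p_value_two_sided, 4))
```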
Part E: Research & Extension
Each problem is worth 5 points. These require independent research beyond Chapter 9. Cite all sources.
Exercise E.1 --- History of Regression in Sports Analytics
Research and write a brief essay (500-700 words) tracing the history of regression analysis in sports analytics. Cover (a) Bill James and early sabermetrics, (b) the adoption of regression models in football analytics (e.g., Football Outsiders, Pro Football Focus), (c) the role of regression in the "Moneyball" revolution, (d) how modern sports betting operations use regression, and (e) the current state of the art and where regression fits alongside machine learning methods.
Exercise E.2 --- Regularization in Practice
Research how professional sports analytics teams use regularization (Ridge, Lasso, Elastic Net) in their models. Find at least two published examples (blog posts, papers, or talks) where analysts describe their regularization approach. For each example, report (a) the sport and prediction target, (b) the regularization method used and why, (c) how the regularization parameter was selected, and (d) the impact of regularization on model performance.
Exercise E.3 --- Alternative Regression Methods
Research one advanced regression method not covered in Chapter 9: Quantile regression, Gaussian process regression, or Bayesian linear regression. Write a 400-600 word summary explaining (a) how it differs from OLS, (b) what advantages it offers for sports modeling, (c) a concrete example of how it could be applied to a betting problem, and (d) its limitations.
Exercise E.4 --- The Closing Line as Ground Truth
Research the concept of using the closing line (rather than game outcomes) as the target variable for regression model evaluation. Find and summarize at least two sources that argue closing line value (CLV) is a better measure of model quality than win rate. Explain (a) the theoretical justification, (b) empirical evidence supporting this view, (c) how you would modify a standard regression evaluation pipeline to incorporate CLV measurement, and (d) any limitations of the CLV framework.
Exercise E.5 --- Regression in Live Betting Markets
Research how regression models are adapted for live (in-play) betting markets. Address (a) the unique challenges of in-game prediction (changing game state, time remaining, score differential), (b) how features must be engineered differently for live models vs. pre-game models, (c) any published examples of live-game regression models, (d) the role of expected points models in football and win probability models in basketball, and (e) why latency and computational speed matter for live model deployment.
Scoring Guide
| Part | Problems | Points Each | Total Points |
|---|---|---|---|
| A: Conceptual Understanding | 8 | 5 | 40 |
| B: Calculations | 7 | 5 | 35 |
| C: Programming Challenges | 5 | 10 | 50 |
| D: Analysis & Interpretation | 5 | 5 | 25 |
| E: Research & Extension | 5 | 5 | 25 |
| Total | 30 | --- | 175 |
Grading Criteria
Part A (Conceptual): Full credit requires clear, accurate explanations that demonstrate understanding of the underlying statistical concepts and their relevance to sports betting. Partial credit for incomplete but correct reasoning.
Part B (Calculations): Full credit requires correct final answers with all work shown. Partial credit for correct methodology with arithmetic errors.
Part C (Programming): Graded on correctness (40%), code quality and documentation (30%), and test coverage (30%). Code must execute without errors.
Part D (Analysis): Graded on analytical depth, logical reasoning, and appropriate application of regression concepts to real-world betting scenarios. Multiple valid approaches may exist.
Part E (Research): Graded on research quality, source credibility, analytical depth, and clear writing. Minimum source requirements specified per problem.
Solutions: Complete worked solutions for all exercises are available in code/exercise-solutions.py. For programming challenges, reference implementations are provided in the code/ directory.