Chapter 27 Exercises: Advanced Regression and Classification

Instructions: Complete all exercises in the parts assigned by your instructor. Show all work for calculation problems. For programming challenges, include comments explaining your logic and provide sample output. For analysis and research problems, cite your sources where applicable.


Part A: Conceptual Understanding

Each problem is worth 5 points. Answer in complete sentences unless otherwise directed.


Exercise A.1 --- Gradient Boosting Mechanics

Explain the gradient boosting algorithm at a high level. Address (a) why the algorithm builds trees sequentially rather than in parallel, (b) what "fitting to residuals" means in the context of the first three iterations, (c) how the learning rate $\eta$ controls the contribution of each tree and why values of 0.01-0.1 are preferred over 1.0, and (d) how XGBoost's use of second-order gradients (Hessians) improves upon basic gradient boosting.


Exercise A.2 --- Regularization in XGBoost

XGBoost adds a regularization term $\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum w_j^2$ to the objective function. Explain (a) what $T$ (number of leaves) and $w_j$ (leaf weights) represent, (b) how $\gamma$ acts as a pruning mechanism that prevents overly complex trees, (c) how $\lambda$ shrinks leaf weights and why this reduces overfitting, and (d) the difference between L1 (reg_alpha) and L2 (reg_lambda) regularization in XGBoost and when each is most useful.


Exercise A.3 --- Temporal Cross-Validation

Explain why standard k-fold cross-validation is inappropriate for sports prediction and why time-series cross-validation must be used instead. Address (a) the data leakage problem that occurs when future games are used to train models predicting past games, (b) how TimeSeriesSplit implements expanding-window validation, (c) why each fold in time-series CV has a different training set size and what implications this has, and (d) how to properly implement walk-forward validation for a sports betting model deployed in production.
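
As a reference for parts (b) and (c), here is a minimal sketch of scikit-learn's TimeSeriesSplit on synthetic placeholder data; note how the training window expands fold by fold while the validation window always lies strictly in the future.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Illustrative stand-in for a chronologically sorted season of games.
n_games = 1000
X = np.random.rand(n_games, 5)          # synthetic features
y = np.random.randint(0, 2, n_games)    # synthetic outcomes

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X), start=1):
    # Training indices always precede validation indices: no future leakage.
    print(f"Fold {fold}: train games 0-{train_idx[-1]}, "
          f"validate games {val_idx[0]}-{val_idx[-1]}")
```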


Exercise A.4 --- Random Forests vs. XGBoost

Compare random forests and XGBoost as predictive models for sports outcomes. Address (a) why random forests are less prone to overfitting than XGBoost without regularization, (b) why XGBoost often achieves higher predictive accuracy when properly tuned, (c) the role of feature subsampling in decorrelating trees (random forests' max_features vs. XGBoost's colsample_bytree), and (d) when a sports bettor should prefer random forests over XGBoost.


Exercise A.5 --- Probability Calibration

Define probability calibration and explain why it is critical for sports betting. Address (a) the formal definition of perfect calibration, (b) what a reliability diagram (calibration curve) shows and how to interpret common miscalibration patterns (overconfident, underconfident, biased), (c) the difference between Platt scaling and isotonic regression, including when each is preferred, and (d) why calibration should be performed on held-out data rather than training data.


Exercise A.6 --- SHAP Values and Game Theory

Explain SHAP values from both a theoretical and practical perspective. Address (a) the connection to Shapley values from cooperative game theory and what "fair attribution" means, (b) the three key properties of SHAP (local accuracy, consistency, missingness), (c) why TreeSHAP is computationally practical for XGBoost models while naive Shapley computation is exponential, and (d) how a sports bettor would use SHAP values to audit a model's prediction for a specific game.


Exercise A.7 --- Handling Imbalanced Outcomes

Discuss the challenge of predicting rare outcomes (upsets, blowouts) in sports. Explain (a) why a model trained on imbalanced data tends to predict the majority class, (b) why accuracy is a misleading metric for imbalanced problems and what metrics should be used instead, (c) the difference between class weighting and SMOTE as rebalancing techniques, and (d) why probability calibration must be applied after any rebalancing technique.


Exercise A.8 --- Feature Engineering for Sports Models

Describe the importance of feature engineering in sports prediction. Explain (a) why rolling averages are preferred over season-to-date averages for team statistics, (b) how to create matchup-specific features (e.g., "home offense vs. away defense"), (c) why market-derived features (opening line, line movement) are powerful predictors and what this implies about market efficiency, and (d) the risk of feature leakage when engineering features from game-level data.
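
As a reference for parts (a) and (d), the sketch below (with hypothetical column names) computes a leakage-safe rolling average in pandas; the shift(1) call is what keeps the current game's own result out of its feature.

```python
import pandas as pd

# Hypothetical game log, sorted chronologically within each team.
games = pd.DataFrame({
    "team": ["BOS"] * 6,
    "points": [112, 98, 120, 105, 110, 117],
})

# shift(1) excludes the current game, so the feature uses only past games;
# without it, the rolling mean would leak the game's own result.
games["pts_rolling_3"] = (
    games.groupby("team")["points"]
    .transform(lambda s: s.shift(1).rolling(3).mean())
)
print(games)
```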


Part B: Calculations

Each problem is worth 5 points. Show all work and round final answers to the indicated precision.


Exercise B.1 --- Gradient Boosting Step-by-Step

A gradient boosting model for predicting game totals (over/under) has completed two iterations. The initial prediction for all games is the mean total: $\hat{y}^{(0)} = 215$. The learning rate is $\eta = 0.1$.

After fitting tree 1 to the residuals, the tree predicts $f_1(x) = +8$ for Game A. After fitting tree 2 to the new residuals, the tree predicts $f_2(x) = +3$ for Game A.

(a) What is the model's prediction for Game A after iteration 1?

(b) What is the model's prediction for Game A after iteration 2?

(c) The actual total for Game A was 230. What is the residual after iteration 2?

(d) If the over/under line is set at 224.5, would the model bet the over or the under after iteration 2?

(e) After 100 trees, the model predicts 227.3 for Game A. How many "points" has the ensemble added to the initial prediction of 215?
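
You can check your work on parts (a)-(c) with a few lines of Python implementing the update rule $\hat{y}^{(m)} = \hat{y}^{(m-1)} + \eta\, f_m(x)$:

```python
# Gradient boosting prediction update: y_hat_m = y_hat_{m-1} + eta * f_m(x)
base_prediction = 215.0
eta = 0.1
tree_outputs = [8.0, 3.0]  # f_1(x), f_2(x) for Game A

prediction = base_prediction
for m, f_m in enumerate(tree_outputs, start=1):
    prediction += eta * f_m
    print(f"After iteration {m}: {prediction:.2f}")

actual = 230.0
print(f"Residual after iteration 2: {actual - prediction:.2f}")
```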


Exercise B.2 --- Expected Calibration Error

A model's predictions for 100 games are grouped into 5 bins:

Bin   Mean Predicted   Actual Frequency   Count
 1         0.25              0.30           20
 2         0.38              0.35           20
 3         0.52              0.55           20
 4         0.65              0.58           20
 5         0.80              0.72           20

(a) Compute the calibration error for each bin: $|p_{\text{predicted}} - p_{\text{actual}}|$.

(b) Compute the Expected Calibration Error (ECE) as the weighted average of bin errors, where weights are proportional to bin counts.

(c) Compute the Maximum Calibration Error (MCE).

(d) Is this model overconfident, underconfident, or well-calibrated? Justify your answer by examining the pattern in the bins.

(e) The model's overall Brier score is 0.215. Is this good for sports prediction? Compare to the Brier score of a naive model that always predicts 0.50.
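
A short script to verify parts (a)-(c), using the bin statistics from the table above:

```python
import numpy as np

# Bin statistics from the table above.
predicted = np.array([0.25, 0.38, 0.52, 0.65, 0.80])
actual    = np.array([0.30, 0.35, 0.55, 0.58, 0.72])
counts    = np.array([20, 20, 20, 20, 20])

bin_errors = np.abs(predicted - actual)
ece = np.sum(counts / counts.sum() * bin_errors)  # count-weighted mean error
mce = bin_errors.max()                            # worst single bin
print(f"Bin errors: {bin_errors}")
print(f"ECE = {ece:.3f}, MCE = {mce:.3f}")
```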


Exercise B.3 --- SHAP Value Interpretation

An XGBoost model predicts P(home win) = 0.72 for a specific NBA game. The SHAP base value (average prediction) is 0.55. The SHAP values for the top features are:

Feature            Value    SHAP Value
elo_diff           +150       +0.08
rest_days_diff     +2         +0.04
home_win_pct_L10   0.80       +0.03
away_def_rating    108.5      +0.02
pace_diff          -3.2       -0.01
travel_distance    1200       +0.01

(a) Verify that the SHAP values approximately sum to the difference between the prediction and the base value: $0.72 - 0.55 = 0.17$. (Note: there may be additional small SHAP values for other features.)

(b) Which feature contributes most to this prediction being above average?

(c) The pace_diff is negative (the home team plays at a slower pace). Interpret the SHAP value of -0.01: what does it mean that this slower relative pace slightly decreases the predicted home win probability?

(d) If the sportsbook's implied probability is P(home) = 0.65, the model sees a 7-percentage-point edge. Based on the SHAP breakdown, where does this edge come from?

(e) Would you trust this prediction? What additional information would you want before placing a bet?
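
A quick script for the additivity check in part (a), using the values from the table (any small remainder belongs to features omitted from the table):

```python
# Local-accuracy check: SHAP contributions should (approximately) sum to
# the gap between this game's prediction and the base value.
shap_values = {
    "elo_diff": 0.08, "rest_days_diff": 0.04, "home_win_pct_L10": 0.03,
    "away_def_rating": 0.02, "pace_diff": -0.01, "travel_distance": 0.01,
}
base_value, prediction = 0.55, 0.72

total = sum(shap_values.values())
print(f"Sum of listed SHAP values: {total:+.2f}")
print(f"Prediction - base value:  {prediction - base_value:+.2f}")
```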


Exercise B.4 --- Class Imbalance Metrics

A model for predicting NBA upsets (underdog wins) produces the following confusion matrix on a test set of 200 games:

                  Predicted Upset   Predicted No Upset
Actual Upset          18 (TP)            22 (FN)
Actual No Upset       12 (FP)           148 (TN)

(a) Compute precision, recall, and F1 score for the upset class.

(b) Compute the accuracy of the model. Is accuracy a useful metric here?

(c) The base rate of upsets is $(18 + 22)/200 = 20\%$. Compare the model's precision (from part a) to the base rate. Is the model useful?

(d) Suppose a bettor wagers one unit on each of the 30 games the model predicts as upsets, and the average upset moneyline pays +180. Using the confusion matrix counts (18 winning bets, 12 losing bets, each loss costing the one-unit stake), what is the expected profit per bet?

(e) Compute the model's false positive rate. Why does this matter for a bettor's bankroll?
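
The metric definitions needed for parts (a), (b), and (e), expressed as a small verification script:

```python
# Metrics for the upset (positive) class from the confusion matrix above.
tp, fn, fp, tn = 18, 22, 12, 148

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)
fpr = fp / (fp + tn)  # false positives among actual non-upsets

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.3f} fpr={fpr:.3f}")
```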


Exercise B.5 --- Platt Scaling Calculation

An uncalibrated model produces the following raw probabilities and outcomes for 8 games:

Game   Raw P(home)   Outcome
  1       0.85          1
  2       0.75          0
  3       0.90          1
  4       0.60          1
  5       0.80          1
  6       0.55          0
  7       0.70          1
  8       0.65          0

(a) Convert each raw probability to log-odds: $f = \ln(p/(1-p))$.

(b) A Platt scaling calibrator fits the model $P(y=1 | f) = 1/(1 + \exp(-(Af + B)))$ and finds $A = 0.8$ and $B = -0.3$. Apply this calibration to games 1 and 6.

(c) For game 1, how did calibration change the probability? Was the raw model overconfident for this game?

(d) Compute the log-loss for the raw predictions on all 8 games.

(e) Compute the log-loss after Platt scaling (apply the calibrator to all 8 games). Did calibration improve log-loss?
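
A sketch for verifying parts (a), (b), (d), and (e); the log_loss helper here is the standard mean negative log-likelihood:

```python
import numpy as np

raw_p = np.array([0.85, 0.75, 0.90, 0.60, 0.80, 0.55, 0.70, 0.65])
y = np.array([1, 0, 1, 1, 1, 0, 1, 0])
A, B = 0.8, -0.3  # fitted Platt parameters from part (b)

log_odds = np.log(raw_p / (1 - raw_p))              # part (a)
calibrated = 1 / (1 + np.exp(-(A * log_odds + B)))  # part (b)

def log_loss(p, y):
    # Mean negative log-likelihood of the observed outcomes.
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(f"raw log-loss:        {log_loss(raw_p, y):.4f}")
print(f"calibrated log-loss: {log_loss(calibrated, y):.4f}")
```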


Exercise B.6 --- Random Forest Feature Importance

A random forest model with 500 trees reports the following feature importances (Gini impurity reduction):

Feature           Importance   Normalized
elo_diff             0.25         1.00
spread               0.20         0.80
off_rating_diff      0.15         0.60
def_rating_diff      0.12         0.48
rest_diff            0.08         0.32
win_pct_diff         0.07         0.28
pace_diff            0.05         0.20
travel_dist          0.04         0.16
altitude             0.02         0.08
jersey_color         0.02         0.08

(a) The top 4 features account for what percentage of total importance?

(b) jersey_color has the same importance as altitude. What does this suggest about possible overfitting or data artifacts?

(c) spread (the sportsbook line) is the second most important feature. What does this imply about market efficiency?

(d) A colleague suggests removing all features below 0.05 importance. What are the pros and cons of this approach?

(e) If you retrained the model using only the top 5 features, would you expect the model's performance to improve, stay the same, or degrade? Why?


Exercise B.7 --- Stacking Ensemble Evaluation

A stacking ensemble uses three base models: logistic regression (LR), random forest (RF), and XGBoost (XGB). The meta-learner (logistic regression) has the following coefficients for the base model predictions:

Base Model   Coefficient
LR              0.35
RF              0.25
XGB             0.85

Meta-learner intercept: -0.12

For a specific game, the base model predictions are: LR = 0.60, RF = 0.55, XGB = 0.68.

(a) Compute the meta-learner's log-odds: $z = -0.12 + 0.35 \times 0.60 + 0.25 \times 0.55 + 0.85 \times 0.68$.

(b) Convert the log-odds to a probability: $P = 1/(1 + e^{-z})$.

(c) The XGB coefficient (0.85) is much larger than LR (0.35) and RF (0.25). What does this imply about the meta-learner's trust in each base model?

(d) If the base models agreed perfectly (all predicting 0.65), would the stacking ensemble also predict 0.65? Compute and verify.

(e) What is the advantage of this stacking approach over a simple average of the three base model predictions?
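
A few lines to verify parts (a), (b), and (d):

```python
import math

intercept = -0.12
coefs = {"LR": 0.35, "RF": 0.25, "XGB": 0.85}
preds = {"LR": 0.60, "RF": 0.55, "XGB": 0.68}

# Meta-learner log-odds: linear combination of base model probabilities.
z = intercept + sum(coefs[m] * preds[m] for m in coefs)
p = 1 / (1 + math.exp(-z))
print(f"z = {z:.4f}, P = {p:.4f}")

# Part (d): what happens if every base model outputs 0.65?
z_agree = intercept + sum(c * 0.65 for c in coefs.values())
print(f"agreement case: P = {1 / (1 + math.exp(-z_agree)):.4f}")
```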


Part C: Programming Challenges

Each problem is worth 10 points. Write clean, well-documented Python code. Include docstrings, type hints, and at least three test cases per function.


Exercise C.1 --- XGBoost Sports Prediction Pipeline

Build a complete XGBoost pipeline for predicting NBA game outcomes.

Requirements:

- Generate synthetic NBA game data with at least 10 features, including Elo ratings, offensive/defensive ratings, rest days, win percentages, and pace.
- Implement time-series cross-validation (at least 5 folds) for hyperparameter tuning.
- Tune at least 4 hyperparameters: max_depth, learning_rate, subsample, reg_lambda.
- Evaluate on a held-out test set with log-loss, Brier score, AUC-ROC, and accuracy.
- Report feature importance using both XGBoost's built-in gain-based importance and mean absolute SHAP values.
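
A minimal starting point, not a full solution: the core time-series CV loop around one arbitrary candidate hyperparameter setting, on synthetic placeholder data. Extending this loop over a grid, adding the remaining metrics, and computing SHAP importances are left as the exercise.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 10))                         # placeholder features
y = (X[:, 0] + rng.normal(size=n) > 0).astype(int)   # placeholder outcomes

tscv = TimeSeriesSplit(n_splits=5)
params = {"max_depth": 4, "learning_rate": 0.05,
          "subsample": 0.8, "reg_lambda": 1.0}  # one candidate setting

fold_losses = []
for train_idx, val_idx in tscv.split(X):
    model = xgb.XGBClassifier(n_estimators=200, **params)
    model.fit(X[train_idx], y[train_idx])
    p = model.predict_proba(X[val_idx])[:, 1]
    fold_losses.append(log_loss(y[val_idx], p))

print(f"mean CV log-loss: {np.mean(fold_losses):.4f}")
```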


Exercise C.2 --- Calibration Analysis Tool

Build a comprehensive probability calibration toolkit.

Requirements:

- Implement both Platt scaling and isotonic regression calibrators from scratch (without using sklearn's calibration classes for the core logic).
- Generate synthetic model predictions with known miscalibration (overconfident).
- Apply both calibrators on a calibration set and evaluate on a separate test set.
- Compute ECE, MCE, log-loss, and Brier score before and after calibration.
- Produce a formatted calibration report showing 10-bin reliability statistics.
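
One way to begin the from-scratch Platt calibrator: fit $A$ and $B$ by minimizing log-loss over the calibration set. This is a sketch under those assumptions, using scipy for the optimization and a toy overconfident data generator:

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(raw_p, y):
    """Fit P(y=1|f) = sigmoid(A*f + B) on log-odds f by minimizing log-loss."""
    f = np.log(raw_p / (1 - raw_p))

    def nll(params):
        A, B = params
        p = 1 / (1 + np.exp(-(A * f + B)))
        p = np.clip(p, 1e-12, 1 - 1e-12)  # numerical safety
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    return minimize(nll, x0=[1.0, 0.0]).x  # fitted (A, B)

# Toy overconfident model: true win rates are milder than raw_p claims.
rng = np.random.default_rng(1)
raw_p = rng.uniform(0.05, 0.95, 5000)
true_p = 0.5 + 0.5 * (raw_p - 0.5)   # shrink toward 0.5
y = (rng.uniform(size=5000) < true_p).astype(int)

A, B = fit_platt(raw_p, y)
print(f"fitted A={A:.3f}, B={B:.3f}")  # A < 1 indicates shrinkage
```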


Exercise C.3 --- Stacking Ensemble Builder

Build a model stacking framework that combines logistic regression, random forest, and XGBoost.

Requirements:

- Implement manual out-of-fold stacking using time-series cross-validation.
- Train a logistic regression meta-learner on the out-of-fold predictions.
- Compare stacking performance against each base model and simple averaging.
- Report meta-learner coefficients and interpret which base model is most trusted.
- Demonstrate that the stacking ensemble outperforms individual models on synthetic data.
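
A sketch of the out-of-fold mechanic at the heart of this exercise, with two base models and synthetic placeholder data. Note that rows in the first fold's initial training window never receive out-of-fold predictions and must be dropped before fitting the meta-learner:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(3)
X = rng.normal(size=(1500, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1500) > 0).astype(int)

base_models = [LogisticRegression(max_iter=1000),
               RandomForestClassifier(n_estimators=100, random_state=0)]

# Out-of-fold predictions: each game is predicted only by models that
# were trained on strictly earlier games.
oof = np.full((len(y), len(base_models)), np.nan)
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    for j, m in enumerate(base_models):
        m.fit(X[train_idx], y[train_idx])
        oof[val_idx, j] = m.predict_proba(X[val_idx])[:, 1]

mask = ~np.isnan(oof).any(axis=1)  # drop rows with no OOF predictions
meta = LogisticRegression().fit(oof[mask], y[mask])
print("meta-learner coefficients:", np.round(meta.coef_[0], 3))
```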


Exercise C.4 --- SHAP Interpretability Dashboard

Build a model interpretability module that produces comprehensive explanations.

Requirements:

- Train an XGBoost model on synthetic sports data with at least 8 features.
- Compute SHAP values for the test set using TreeSHAP.
- Implement functions for: (1) global feature importance (mean |SHAP|), (2) single-game explanations, (3) feature dependence analysis.
- For a single game, produce a text-based "SHAP report" showing the base prediction, each feature's contribution, and the final prediction.
- Compare SHAP importance rankings to built-in XGBoost gain-based rankings.
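
A minimal sketch of the TreeSHAP computation and the mean-|SHAP| global importance, on synthetic placeholder data (the report formatting and dependence analysis are the exercise; exact return shapes can vary across shap versions):

```python
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 8))
y = (X[:, 0] - X[:, 1] + rng.normal(size=500) > 0).astype(int)

model = xgb.XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)   # TreeSHAP: polynomial, not exponential
shap_values = explainer.shap_values(X)  # one row of contributions per game

global_importance = np.abs(shap_values).mean(axis=0)  # mean |SHAP| per feature
print("mean |SHAP| by feature:", np.round(global_importance, 3))
```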


Exercise C.5 --- End-to-End Sports Betting Model

Build a complete pipeline from feature engineering through calibration to betting recommendations.

Requirements:

- Generate synthetic season data for 30 NBA teams with realistic features.
- Engineer at least 15 features including rolling statistics, matchup features, and contextual features.
- Train a calibrated XGBoost model using the pipeline: train -> tune -> calibrate -> evaluate.
- For each test game, produce: predicted probability, calibrated probability, market-implied probability, estimated edge, and recommended bet size (Kelly criterion).
- Report backtest results: total games, bets placed (edge > 3%), win rate, ROI, and maximum drawdown.
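
For the bet-sizing step, a sketch of the full-Kelly formula $f^* = (bp - q)/b$, where $b$ is the net decimal odds, $p$ is the model's win probability, and $q = 1 - p$:

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake: f* = (b*p - q) / b, with b = net odds."""
    b = decimal_odds - 1.0
    q = 1.0 - p_win
    return max(0.0, (b * p_win - q) / b)  # never bet a negative edge

# Example: calibrated model says 55% vs. market-implied ~52.4% (-110 odds).
p_model = 0.55
decimal_odds = 1.909  # American -110 expressed as decimal odds
print(f"full Kelly stake: {kelly_fraction(p_model, decimal_odds):.3%}")
# Many bettors use fractional Kelly (e.g., 0.25x) to reduce variance.
```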


Part D: Analysis & Interpretation

Each problem is worth 5 points. Provide structured, well-reasoned responses.


Exercise D.1 --- Interpreting a Calibration Failure

You build an XGBoost model for NFL game prediction. The model achieves a test-set AUC-ROC of 0.68 (reasonable for NFL), but the calibration curve shows severe overconfidence: when the model predicts 75% win probability, the actual win rate is only 62%.

(a) Explain why a model can have a good AUC-ROC but poor calibration. What does each metric measure?

(b) You apply Platt scaling and the calibration improves, but the model now appears slightly underconfident in the 40-60% range. What might cause this?

(c) A colleague suggests using isotonic regression instead. Under what conditions would isotonic regression fix the residual miscalibration that Platt scaling missed?

(d) After calibration, your model's edge over the market shrinks from 5% to 2%. Why does this happen, and is the calibrated model still more useful?

(e) How would you monitor calibration drift throughout a season, and what would trigger a recalibration?


Exercise D.2 --- Feature Importance Discrepancies

You train an XGBoost model for NBA prediction. The built-in gain-based feature importance ranks "travel_distance" as the #2 feature, but SHAP ranks it #7. Meanwhile, "elo_diff" is #1 by both methods.

(a) Explain why gain-based importance and SHAP can produce different rankings.

(b) Which ranking should you trust more for understanding the model's true reliance on each feature? Why?

(c) If travel_distance has high gain-based importance but low SHAP importance, what does this suggest about how the model uses this feature?

(d) You discover that travel_distance is correlated with conference (Western Conference teams travel more). How might this confound the importance analysis?

(e) Design an experiment to determine whether travel_distance genuinely adds predictive value beyond what is captured by other features.


Exercise D.3 --- Ensemble Diminishing Returns

You build stacking ensembles with increasing numbers of base models:

Ensemble   Base Models                      Test Log-Loss   Improvement
    1      XGBoost only                         0.672           ---
    2      XGBoost + LR                         0.658           2.1%
    3      XGBoost + LR + RF                    0.651           1.1%
    4      XGBoost + LR + RF + NN               0.649           0.3%
    5      XGBoost + LR + RF + NN + SVM         0.650          -0.2%

(a) Describe the pattern of diminishing returns. Why does each additional model contribute less?

(b) Ensemble 5 is slightly worse than Ensemble 4. Explain two possible reasons.

(c) The SVM may be adding noise rather than signal. How would you test whether a base model should be included in the stack?

(d) From a betting profitability perspective, is the improvement from Ensemble 1 to Ensemble 3 likely to be economically meaningful? Estimate the impact on ROI.

(e) Given the computational cost of training 5 models, when would you recommend the 3-model ensemble versus the single XGBoost model?


Exercise D.4 --- SMOTE for Upset Prediction

You build an upset prediction model for college basketball. You try three approaches:

Approach                     Precision   Recall    F1    Log-Loss   Calibration
Baseline (no rebalancing)      0.45       0.22    0.30     0.58       Good
Class weights (balanced)       0.38       0.48    0.42     0.61       Moderate
SMOTE                          0.35       0.55    0.43     0.64       Poor

(a) Explain the precision-recall tradeoff evident across these three approaches.

(b) For a bettor, which approach is most useful? Consider that betting on upsets requires well-calibrated probabilities, not just correct classifications.

(c) SMOTE has the highest recall but the worst log-loss and calibration. Why does synthetic oversampling hurt probability estimates?

(d) Propose a strategy that combines the benefits of class weighting with proper calibration.

(e) A fourth approach uses the baseline model but optimizes the decision threshold. Under what circumstances would this outperform all three approaches above?


Exercise D.5 --- Model Shelf Life

You deploy an XGBoost model for NFL prediction at the start of the 2024 season, trained on 2019-2023 data. By week 8, you notice the model's weekly log-loss has increased steadily from 0.67 (weeks 1-4) to 0.73 (weeks 5-8).

(a) List three possible explanations for this performance degradation.

(b) The NFL introduced a new kickoff rule in 2024 that significantly changed field position. How would this rule change affect your model's features and predictions?

(c) Describe a concrete procedure for determining whether to retrain the model mid-season versus continuing with the original model.

(d) If you retrain on weeks 1-8 of the 2024 season, what is the risk of overfitting to an 8-week sample?

(e) Propose a model updating strategy that balances responsiveness to changing conditions with robustness against small-sample overfitting.


Part E: Research & Extension

Each problem is worth 5 points. These require independent research beyond Chapter 27. Cite all sources.


Exercise E.1 --- Gradient Boosting Variants

Research the landscape of gradient boosting implementations beyond XGBoost: LightGBM, CatBoost, and NGBoost. For each, explain (a) its key innovation relative to XGBoost, (b) when it would be preferred for sports prediction, (c) any published benchmarks comparing it to XGBoost on tabular data, and (d) its approach to handling categorical features (relevant for team identifiers and venue types).


Exercise E.2 --- Neural Networks for Sports Prediction

Research the application of neural networks (deep learning) to sports outcome prediction. Address (a) specific architectures used (feedforward, recurrent, attention-based), (b) at least two published papers or projects that apply neural networks to sports prediction with reported results, (c) whether neural networks consistently outperform gradient-boosted trees for tabular sports data, and (d) the unique challenges of applying deep learning to sports prediction (small datasets, non-stationarity).


Exercise E.3 --- Conformal Prediction for Sports

Research conformal prediction as an alternative to traditional probability calibration. Explain (a) the key idea behind conformal prediction and how it differs from calibration, (b) the coverage guarantee that conformal prediction provides, (c) how conformal prediction sets could be applied to sports outcomes (e.g., "the true outcome is in this set with 90% probability"), and (d) any published applications of conformal prediction in sports or gambling contexts.


Exercise E.4 --- Automated Machine Learning (AutoML)

Research AutoML tools (AutoGluon, H2O, TPOT, Auto-sklearn) and their applicability to sports betting. Address (a) what AutoML automates (feature engineering, model selection, hyperparameter tuning), (b) whether AutoML tools can handle the temporal constraints of sports data, (c) at least one case study or benchmark of AutoML applied to sports prediction, and (d) the advantages and disadvantages of AutoML versus hand-tuned models for a professional sports bettor.


Exercise E.5 --- Explainable AI Regulations

Research the regulatory landscape around explainable AI and its implications for sports betting models. Address (a) the EU's AI Act and its requirements for transparency in high-risk AI systems, (b) whether sports betting models would fall under these regulations, (c) how SHAP and LIME satisfy or fail to satisfy explainability requirements, and (d) the tension between model complexity (for accuracy) and explainability (for regulatory compliance and user trust).


Scoring Guide

Part                           Problems   Points Each   Total Points
A: Conceptual Understanding        8           5             40
B: Calculations                    7           5             35
C: Programming Challenges          5          10             50
D: Analysis & Interpretation       5           5             25
E: Research & Extension            5           5             25
Total                             30          ---           175

Grading Criteria

Part A (Conceptual): Full credit requires clear, accurate explanations that demonstrate understanding of the underlying ML concepts and their relevance to sports betting. Partial credit for incomplete but correct reasoning.

Part B (Calculations): Full credit requires correct final answers with all work shown. Partial credit for correct methodology with arithmetic errors.

Part C (Programming): Graded on correctness (40%), code quality and documentation (30%), and test coverage (30%). Code must execute without errors.

Part D (Analysis): Graded on analytical depth, logical reasoning, and appropriate application of ML concepts to real-world betting scenarios. Multiple valid approaches may exist.

Part E (Research): Graded on research quality, source credibility, analytical depth, and clear writing. Minimum source requirements specified per problem.


Solutions: Complete worked solutions for all exercises are available in code/exercise-solutions.py. For programming challenges, reference implementations are provided in the code/ directory.