Chapter 30 Exercises: Model Evaluation and Selection

These exercises progress from foundational scoring rule concepts through walk-forward validation design to full model comparison pipelines. Complete them in order; later exercises build on earlier solutions.


Part A: Proper Scoring Rules (Exercises 1-6)

Exercise 1 (Conceptual). A model predicts P(home win) = 0.60 for a game. The home team wins. Compute (a) the Brier score contribution for this single prediction, (b) the log loss contribution, and (c) the squared error if you had instead predicted 0.90. Explain why proper scoring rules penalize overconfident wrong predictions more than slightly miscalibrated correct predictions.

Exercise 2 (Computational). Given the following five predictions and outcomes, compute the Brier score and log loss by hand:

  Game   P(home win)   Outcome
  1      0.75          1
  2      0.40          0
  3      0.60          0
  4      0.85          1
  5      0.30          1

Verify your Brier score using compute_brier_score from the chapter code.
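
To check your hand calculations, a minimal NumPy snippet such as the one below is enough; it assumes only that outcomes are coded 1 for a home win and 0 otherwise, and does not reproduce the signature of the chapter's compute_brier_score.

    import numpy as np

    preds = np.array([0.75, 0.40, 0.60, 0.85, 0.30])
    outcomes = np.array([1, 0, 0, 1, 1])

    brier = np.mean((preds - outcomes) ** 2)                   # mean squared error of the probabilities
    log_loss = -np.mean(outcomes * np.log(preds)
                        + (1 - outcomes) * np.log(1 - preds))  # average negative log-likelihood
    print(f"Brier score: {brier:.4f}, log loss: {log_loss:.4f}")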

Exercise 3 (Conceptual). Explain why accuracy is not a proper scoring rule. Give a concrete example of two forecasters who achieve the same accuracy but have very different Brier scores. Which forecaster would you prefer for sports betting, and why?

Exercise 4 (Computational). Implement a function brier_score_decomposition_manual(predictions, outcomes, n_bins=10) that computes the reliability, resolution, and uncertainty components of the Brier score. Test it on the following scenario: draw 1000 true probabilities true_probs from a Beta(3, 2) distribution, use them directly as the predictions, generate outcomes with np.random.random(1000) < true_probs, and verify that reliability - resolution + uncertainty approximately equals the Brier score computed directly as the mean squared error.
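
A minimal sketch of the synthetic setup, assuming plain NumPy and treating the sampled Beta(3, 2) probabilities as both the true probabilities and the model's predictions; the decomposition function itself is left to you.

    import numpy as np

    np.random.seed(42)
    true_probs = np.random.beta(3, 2, size=1000)            # sampled "true" win probabilities
    outcomes = (np.random.random(1000) < true_probs).astype(int)
    predictions = true_probs                                # a perfectly informed forecaster

    brier_direct = np.mean((predictions - outcomes) ** 2)   # the value REL - RES + UNC should match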

Exercise 5 (Applied). You have two models predicting NFL game outcomes over a 17-week season (272 games):

  • Model A: Brier score = 0.225, BSS vs. market = -0.02
  • Model B: Brier score = 0.218, BSS vs. market = +0.01

Explain what the BSS values tell you about each model's relationship to the market. Which model should you use for betting? Could Model A still be useful even with a negative BSS?

Exercise 6 (Computational). Write a function compare_scoring_rules(predictions, outcomes) that computes Brier score, log loss, Brier skill score (vs. base rate), and the Brier decomposition. Apply it to three synthetic models: (a) a well-calibrated model, (b) an overconfident model, and (c) a model that always predicts the base rate. Display results in a formatted table.
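
One way to generate the three synthetic models, as a sketch that assumes the well-calibrated model reports the true probability exactly; the particular distortion used for the overconfident model is an arbitrary choice.

    import numpy as np

    np.random.seed(0)
    true_probs = np.random.beta(3, 2, size=2000)
    outcomes = (np.random.random(2000) < true_probs).astype(int)

    model_calibrated = true_probs
    model_overconfident = np.clip(0.5 + 1.5 * (true_probs - 0.5), 0.01, 0.99)  # pushed toward 0 and 1
    model_base_rate = np.full_like(true_probs, outcomes.mean())                # constant prediction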


Part B: Backtesting and Simulation (Exercises 7-12)

Exercise 7 (Conceptual). List four sources of look-ahead bias in sports betting backtests. For each, explain (a) how it arises, (b) how it inflates backtest results, and (c) how to prevent it.

Exercise 8 (Applied). Using the BettingBacktester class from the chapter, run a backtest on synthetic NBA data with the following parameters: initial bankroll = $10,000, minimum edge = 0.03, fractional Kelly (25%), maximum bet fraction = 0.05. Generate 500 games where your model has a true edge of 2% over the market. Report the ROI, win rate, maximum drawdown, and Sharpe ratio.
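
One crude way to synthesize games with a persistent 2-percentage-point edge is to let the model see the true probability while the market's implied probability sits two points low on the home side; this encoding is an assumption of the sketch, not the chapter's data generator, and the BettingBacktester constructor arguments are not reproduced here.

    import numpy as np

    np.random.seed(7)
    n_games = 500
    true_probs = np.random.beta(5, 5, size=n_games)          # true home-win probabilities
    outcomes = (np.random.random(n_games) < true_probs).astype(int)

    market_probs = np.clip(true_probs - 0.02, 0.05, 0.95)    # market underrates the home side by 2 points
    model_probs = true_probs                                 # the model knows the true probability
    decimal_odds = 1.0 / market_probs                        # vig-free odds implied by the market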

Exercise 9 (Computational). Extend the BettingBacktester class to support a half_kelly staking method that uses exactly 50% of the full Kelly stake. Add a method monthly_summary() that groups bets by month and reports per-month ROI, number of bets, and running bankroll.
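
The half-Kelly stake itself reduces to a few lines. A standalone helper such as the following can be adapted into the new staking method; it makes no assumptions about the chapter class's attribute names.

    def half_kelly_stake(bankroll, p_win, decimal_odds):
        """Stake 50% of the full Kelly fraction; never bet when the Kelly fraction is non-positive."""
        b = decimal_odds - 1.0                         # net odds received per unit staked on a win
        full_kelly = (b * p_win - (1.0 - p_win)) / b   # classic Kelly fraction for a binary bet
        return max(0.0, 0.5 * full_kelly) * bankroll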

Exercise 10 (Applied). Run two backtests on the same synthetic data: one using flat $100 stakes and one using fractional Kelly (25%). Compare the ROI, final bankroll, maximum drawdown, and Sharpe ratio. Under what conditions does Kelly staking outperform flat staking? When might flat staking be preferred?

Exercise 11 (Conceptual). You backtest 50 different model configurations and find that the best one achieves +8% ROI over 3 seasons. Your friend says this proves the model is profitable. Explain why this reasoning is flawed. How would you adjust the backtest design to account for multiple comparisons?

Exercise 12 (Computational). Implement a confidence_interval_roi(backtest_result, n_bootstrap=1000, alpha=0.05) function that uses bootstrap resampling of the bet-level profits to compute a 95% confidence interval for the ROI. Apply it to the backtest from Exercise 8.
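
A sketch of the bootstrap, written against plain arrays of per-bet profits and stakes rather than the chapter's backtest_result object, whose internal structure is not assumed here.

    import numpy as np

    def confidence_interval_roi(profits, stakes, n_bootstrap=1000, alpha=0.05, seed=0):
        """Percentile bootstrap CI for ROI = total profit / total amount staked."""
        profits, stakes = np.asarray(profits, float), np.asarray(stakes, float)
        rng = np.random.default_rng(seed)
        n = len(profits)
        rois = np.empty(n_bootstrap)
        for i in range(n_bootstrap):
            idx = rng.integers(0, n, size=n)           # resample bets with replacement
            rois[i] = profits[idx].sum() / stakes[idx].sum()
        return np.quantile(rois, [alpha / 2, 1 - alpha / 2])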


Part C: Walk-Forward Validation (Exercises 13-18)

Exercise 13 (Conceptual). A colleague uses 5-fold cross-validation (with random shuffling) to evaluate an NBA prediction model and reports a Brier score of 0.210. You suspect this estimate is optimistically biased. Explain why standard cross-validation produces biased estimates for time series sports data. Estimate the direction and approximate magnitude of the bias.

Exercise 14 (Computational). Using the WalkForwardValidator class, create the following validation schemes for a dataset of 2,000 NBA games:

(a) Expanding window: initial training size = 1000, test size = 100, step = 50.
(b) Sliding window: max training size = 800, test size = 100, step = 50.
(c) Expanding window with a purge gap of 20 games.

For each scheme, report the number of folds, the size of the largest and smallest training sets, and the total number of test predictions.
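
To sanity-check the fold counts the class reports, an expanding-window index generator can be written in a few lines; this sketch is independent of the WalkForwardValidator API, and the sliding-window variant differs only in how the training start index is computed.

    def expanding_window_splits(n_samples, initial_train, test_size, step, gap=0):
        """Yield (train_indices, test_indices) for an expanding-window walk-forward scheme."""
        train_end = initial_train
        while train_end + gap + test_size <= n_samples:
            yield list(range(0, train_end)), list(range(train_end + gap, train_end + gap + test_size))
            train_end += step

For example, list(expanding_window_splits(2000, 1000, 100, 50)) enumerates the folds for scheme (a), and passing gap=20 gives scheme (c).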

Exercise 15 (Applied). Implement walk-forward evaluation for a logistic regression model on synthetic NBA data. Compare the results from expanding window versus sliding window validation. Under what circumstances does sliding window produce better (more realistic) estimates?

Exercise 16 (Computational). Implement the PurgedKFoldCV class with n_splits=5 and gap=20. Apply it to a dataset of 1000 observations and print the train/test indices for each fold. Verify that no training observation is within 20 indices of any test observation.
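
The verification step at the end can be automated with a small check, assuming the train and test indices are plain integer sequences.

    import numpy as np

    def check_purge_gap(train_idx, test_idx, gap):
        """True when every training index is more than `gap` positions away from every test index."""
        dists = np.abs(np.asarray(train_idx)[:, None] - np.asarray(test_idx)[None, :])
        return dists.min() > gap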

Exercise 17 (Applied). Train a gradient-boosted tree model (using scikit-learn's GradientBoostingClassifier) on synthetic NBA data using walk-forward validation with expanding window. At each fold, compute the Brier score and ECE. Plot how model performance evolves as the training set grows.
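
A compact version of the fold loop, using only scikit-learn calls (GradientBoostingClassifier and brier_score_loss) on hypothetical synthetic features; the fold sizes and hyperparameters shown are arbitrary, and the ECE computation and the plot are left to you.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import brier_score_loss

    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 6))                                # hypothetical game features
    y = (rng.random(2000) < 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))).astype(int)

    fold_briers, train_end, test_size, step = [], 1000, 100, 100
    while train_end + test_size <= len(y):
        gbt = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.05)
        gbt.fit(X[:train_end], y[:train_end])
        probs = gbt.predict_proba(X[train_end:train_end + test_size])[:, 1]
        fold_briers.append(brier_score_loss(y[train_end:train_end + test_size], probs))
        train_end += step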

Exercise 18 (Conceptual). You are building a model that uses 20-game rolling averages as features. What minimum purge gap should you use in walk-forward validation? Explain what happens if you use no purge gap. Would the same gap be needed if your features were instantaneous (e.g., rest days, home/away indicator)?


Part D: Calibration Analysis (Exercises 19-24)

Exercise 19 (Computational). Implement calibration_analysis from scratch (without using the chapter code). Your implementation should: (a) bin predictions into equal-width bins, (b) compute the average predicted probability and observed frequency per bin, (c) compute ECE and MCE, and (d) print a text-based reliability diagram. Test on a model that is systematically overconfident by 5 percentage points.
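
Step (a) is where boundary mistakes are most common; one way to assign equal-width bins with NumPy (a sketch, not the chapter's implementation):

    import numpy as np

    def assign_equal_width_bins(preds, n_bins=10):
        """Map each probability in [0, 1] to a bin index 0 .. n_bins-1 using equal-width bins."""
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        return np.clip(np.digitize(preds, edges[1:-1]), 0, n_bins - 1)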

Exercise 20 (Applied). Generate predictions from a well-calibrated model and an overconfident model (multiply the true probabilities by 1.3 and clip to [0.01, 0.99] so that no prediction is exactly 0 or 1). Compute the Brier score, ECE, and MCE for each. Then apply Platt scaling to the overconfident model and recompute all metrics. Report the percentage improvement in each metric from recalibration.
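
Platt scaling here needs nothing more than scikit-learn's LogisticRegression fitted on the logit of the raw probabilities; a sketch, with the clipping constant eps chosen arbitrarily.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def platt_scale(raw_cal, y_cal, raw_new, eps=1e-6):
        """Fit a one-feature logistic regression on logit(raw probability), then rescale new predictions."""
        logit = lambda p: np.log(np.clip(p, eps, 1 - eps) / (1 - np.clip(p, eps, 1 - eps)))
        lr = LogisticRegression().fit(logit(np.asarray(raw_cal)).reshape(-1, 1), np.asarray(y_cal))
        return lr.predict_proba(logit(np.asarray(raw_new)).reshape(-1, 1))[:, 1]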

Exercise 21 (Conceptual). Explain why isotonic regression is more flexible than Platt scaling for recalibration, but also more prone to overfitting. For an NBA prediction model with 1,230 games per season, would you recommend Platt scaling or isotonic regression? What is the minimum calibration set size you would recommend for each method?

Exercise 22 (Computational). Implement a ModelRecalibrator that supports both Platt scaling and isotonic regression. Use a three-way split: train the original model on 60% of data, fit the recalibrator on 20%, and evaluate on the final 20%. Compare the calibration (ECE) and discrimination (Brier score) before and after recalibration.

Exercise 23 (Applied). Generate predictions from three models with different calibration properties: (a) underconfident (true probabilities compressed toward 0.5), (b) well-calibrated, (c) overconfident with a bias (predictions shifted 3 percentage points toward 1). Produce reliability diagrams for all three models side by side and compute ECE for each.

Exercise 24 (Computational). Implement a function expected_calibration_error(predictions, outcomes, n_bins=10, strategy='quantile') that supports both equal-width and quantile-based binning strategies. Compare the ECE values from both strategies on the same model predictions. Explain why quantile binning may give different (and sometimes more informative) ECE values.


Part E: Model Comparison and Selection (Exercises 25-30)

Exercise 25 (Computational). Implement the Diebold-Mariano test from scratch, including the Newey-West variance estimator. Test your implementation by comparing two models on synthetic data: (a) two models with identical predictions (p-value should be 1.0), (b) two models where one is strictly better (p-value should be < 0.05), and (c) two models with similar but not identical performance (p-value between 0.05 and 0.50).
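
For reference while debugging, a compact sketch of the statistic follows. It uses a Bartlett-kernel (Newey-West) long-run variance and a normal reference distribution; the identical-predictions case is handled with an early return, and small-sample corrections are omitted.

    import numpy as np
    from scipy import stats

    def diebold_mariano(loss_a, loss_b, max_lag=None):
        """DM test on the loss differential d_t = loss_a[t] - loss_b[t]; returns (statistic, two-sided p-value)."""
        d = np.asarray(loss_a, float) - np.asarray(loss_b, float)
        n = len(d)
        if np.allclose(d, 0.0):                        # identical predictions: no evidence either way
            return 0.0, 1.0
        if max_lag is None:
            max_lag = int(np.floor(n ** (1 / 3)))
        d_bar = d.mean()
        lrv = np.mean((d - d_bar) ** 2)                # lag-0 autocovariance
        for lag in range(1, max_lag + 1):
            w = 1.0 - lag / (max_lag + 1)              # Bartlett weight
            lrv += 2.0 * w * np.sum((d[lag:] - d_bar) * (d[:-lag] - d_bar)) / n
        dm = d_bar / np.sqrt(lrv / n)
        return dm, 2.0 * (1.0 - stats.norm.cdf(abs(dm)))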

Exercise 26 (Applied). Compare four models on synthetic NBA data using walk-forward validation: logistic regression, random forest, gradient-boosted trees, and a feedforward neural network. For each pair of models, run a Diebold-Mariano test. Present the results in a matrix format. Which model would you select, and why?

Exercise 27 (Computational). Compute AIC, AICc, and BIC for each of the four models from Exercise 26. Rank the models by each criterion. Do the rankings agree with the Diebold-Mariano test results? When might information criteria and statistical tests give different recommendations?
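
The criteria themselves are one-liners once you have each model's maximized log-likelihood; for a binary classifier this is just minus n times the mean log loss. A sketch, with the parameter count k for each model left for you to define:

    import numpy as np

    def information_criteria(log_likelihood, k, n):
        """AIC, small-sample-corrected AICc, and BIC for a model with k parameters fit on n observations."""
        aic = 2 * k - 2 * log_likelihood
        aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
        bic = k * np.log(n) - 2 * log_likelihood
        return aic, aicc, bic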

Exercise 28 (Applied). Implement the ensemble_predictions function with three weighting strategies: (a) simple average, (b) inverse Brier score weighting, and (c) log loss weighting. Apply each to the four models from Exercise 26. Does the ensemble outperform the best individual model? Which weighting strategy works best?
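
Strategy (b) amounts to normalizing inverse Brier scores into weights and taking a weighted average of the per-model probability vectors; the sketch below covers that piece only, and its function name and array layout are illustrative rather than the chapter's.

    import numpy as np

    def inverse_brier_ensemble(pred_matrix, brier_scores):
        """Weighted average of model predictions; pred_matrix has shape (n_models, n_games)."""
        weights = 1.0 / np.asarray(brier_scores, float)
        weights /= weights.sum()
        return weights @ np.asarray(pred_matrix, float)    # (n_models,) @ (n_models, n_games) -> (n_games,)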

Exercise 29 (Research). Design a complete model selection pipeline for an NBA season prediction system. Your pipeline should: (a) train five candidate models with walk-forward validation, (b) compute Brier score, log loss, ECE, and AIC/BIC for each, (c) perform all pairwise Diebold-Mariano tests, (d) test an ensemble of the top models, (e) run a realistic backtest with fractional Kelly staking, and (f) produce a final recommendation with supporting evidence. Implement the pipeline and run it on synthetic data.

Exercise 30 (Research). Investigate the relationship between calibration quality and betting profitability. Generate predictions from three models with identical Brier scores but different calibration properties: (a) well-calibrated, (b) moderately overconfident, (c) severely overconfident. Run a backtest on each model with fractional Kelly staking. Does better calibration translate directly to higher ROI? Under what conditions does calibration quality matter most for profitability?


Solution Guidelines

Solutions for selected exercises (7, 14, 19, 22, 25, 28) are available in the code/exercise-solutions.py file. For maximum learning benefit, attempt each exercise independently before consulting the solutions. Pay particular attention to:

  • Exercises 4, 6: Verify decompositions sum correctly.
  • Exercises 8, 10: Use fixed random seeds for reproducible backtests.
  • Exercises 14, 16: Check that no temporal leakage occurs in your splits.
  • Exercises 25, 26: Ensure the DM test correctly handles the case where models have identical predictions.
  • Exercise 29: Document your model selection rationale as if presenting to a non-technical stakeholder.