Chapter 30 Quiz: Model Evaluation and Selection
Test your understanding of proper scoring rules, backtesting, walk-forward validation, calibration analysis, and model comparison techniques. Answer all questions without consulting the chapter text.
Section 1: Multiple Choice (10 questions, 3 points each = 30 points)
Question 1. The Brier score for a forecast of P = 0.80 when the outcome is 1 (event occurred) is:
- (a) 0.04
- (b) 0.16
- (c) 0.20
- (d) 0.64
Answer
**(a) 0.04** BS = (0.80 - 1)^2 = (-0.20)^2 = 0.04. The Brier score for a single prediction is the squared difference between the predicted probability and the binary outcome.

Question 2. Which property distinguishes a proper scoring rule from an improper one?
- (a) Proper scoring rules always produce values between 0 and 1
- (b) Proper scoring rules are minimized when the forecaster reports their true belief
- (c) Proper scoring rules penalize wrong predictions more than right ones
- (d) Proper scoring rules can only be applied to binary outcomes
Answer
**(b) Proper scoring rules are minimized when the forecaster reports their true belief** A strictly proper scoring rule has the property that the expected score is optimized when the forecaster reports their honest probability estimate. This ensures the forecaster cannot game the system by reporting a probability different from their true belief.
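A minimal numeric check of this property (not from the chapter; plain NumPy, with an assumed true belief of 0.60):

```python
import numpy as np

p_true = 0.60                             # the forecaster's honest belief
reports = np.linspace(0.0, 1.0, 101)      # candidate reported probabilities

# Expected Brier score of reporting q when the event occurs with probability p_true:
# E[BS] = p_true * (q - 1)^2 + (1 - p_true) * q^2
expected_brier = p_true * (reports - 1) ** 2 + (1 - p_true) * reports ** 2

print(reports[np.argmin(expected_brier)])  # 0.6 -- honest reporting minimizes the expected score
```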
Question 3. In the Brier score decomposition BS = Reliability - Resolution + Uncertainty, a model that always predicts the base rate would have:
- (a) High reliability, high resolution
- (b) Low reliability, low resolution
- (c) High reliability, low resolution
- (d) Low reliability, high resolution
Answer
**(b) Low reliability (zero), low resolution (zero)** A model that always predicts the base rate is perfectly calibrated (reliability = 0 because the average prediction equals the observed frequency in every bin), but has zero resolution (because it never distinguishes between likely and unlikely events). Its Brier score equals the uncertainty component, which is the base rate times (1 - base rate).

Question 4. A Brier Skill Score (BSS) of -0.05 relative to market-implied probabilities means:
- (a) The model is 5% better than the market
- (b) The model is 5% worse than the market at probability estimation
- (c) The model has a 5% edge for betting
- (d) The model loses $0.05 per dollar bet
Answer
**(b) The model is 5% worse than the market at probability estimation** BSS = 1 - BS_model / BS_reference. A negative BSS means the model's Brier score is higher (worse) than that of the reference forecast. BSS = -0.05 means the model's Brier score is 5% worse than the market's implied probabilities, indicating the model does not add predictive value beyond what the market already provides.

Question 5. What is the primary advantage of log loss over Brier score for evaluating sports betting models?
- (a) Log loss is bounded between 0 and 1
- (b) Log loss penalizes overconfident wrong predictions more severely
- (c) Log loss is easier to compute
- (d) Log loss does not require probability calibration
Answer
**(b) Log loss penalizes overconfident wrong predictions more severely** A prediction of 0.99 for an event that does not occur receives a Brier penalty of 0.98 but a log loss penalty of -ln(0.01) ≈ 4.61. This asymmetric penalty is valuable for betting, where overconfident predictions lead to large individual losses.
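A quick sketch of the penalty asymmetry, assuming natural-log loss and plain NumPy:

```python
import numpy as np

def brier(p, y):
    """Squared error between predicted probability p and binary outcome y."""
    return (p - y) ** 2

def log_loss(p, y, eps=1e-15):
    """Negative log-likelihood of the outcome (natural log), clipped for stability."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Overconfident wrong prediction: forecast 0.99, event does not occur.
print(brier(0.99, 0))     # 0.9801 -- bounded above by 1
print(log_loss(0.99, 0))  # ~4.61  -- grows without bound as the forecast approaches certainty
```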
Question 6. In walk-forward validation with an expanding window, what happens to the training set at each step?
- (a) It stays the same size, shifting forward in time
- (b) It grows by the step size, always starting from the beginning
- (c) It shrinks as more data is moved to the test set
- (d) It is randomly resampled at each step
Answer
**(b) It grows by the step size, always starting from the beginning** In expanding window walk-forward validation, the training set always starts from the first observation and extends further into the future at each step. This maximizes the training data available at each step but assumes that older data remains relevant.

Question 7. A purge gap in cross-validation is used to:
- (a) Speed up training by reducing the training set size
- (b) Prevent information leakage through overlapping feature windows
- (c) Ensure equal-sized training and test folds
- (d) Create a validation set for hyperparameter tuning
Answer
**(b) Prevent information leakage through overlapping feature windows** When features include rolling statistics (e.g., 20-game rolling averages), observations near the train/test boundary share overlapping data in their feature windows. The purge gap removes these boundary observations from training to prevent this form of information leakage.

Question 8. Expected Calibration Error (ECE) is computed by:
- (a) Taking the maximum absolute difference between predicted and observed frequencies across bins
- (b) Taking the weighted average of absolute differences between predicted and observed frequencies across bins
- (c) Computing the mean squared error of predicted probabilities
- (d) Taking the Kullback-Leibler divergence between predicted and observed distributions
Answer
**(b) Taking the weighted average of absolute differences between predicted and observed frequencies across bins** ECE = sum over bins of (n_k / n) * |avg_observed_k - avg_predicted_k|, where each bin is weighted by its size (n_k / n). This gives more influence to bins with more predictions.

Question 9. The Diebold-Mariano test accounts for autocorrelation in the loss differential series using:
- (a) The Durbin-Watson statistic
- (b) The Newey-West variance estimator
- (c) Augmented Dickey-Fuller test
- (d) White's heteroskedasticity-consistent estimator
Answer
**(b) The Newey-West variance estimator** The DM test uses the Newey-West estimator (with Bartlett kernel weights) to estimate the variance of the mean loss differential, accounting for autocorrelation at multiple lags. This is necessary because sports prediction errors are often temporally correlated.

Question 10. When comparing a logistic regression (Brier = 0.215) and a neural network (Brier = 0.210) using the DM test, and the test returns p = 0.18, you should:
- (a) Choose the neural network because it has a lower Brier score
- (b) Choose the logistic regression because the difference is not statistically significant
- (c) Run more experiments to reduce the p-value
- (d) Ensemble both models
Answer
**(b) Choose the logistic regression because the difference is not statistically significant** At p = 0.18, we cannot reject the null hypothesis that the models have equal predictive accuracy. Following the principle of parsimony, when models are not significantly different, prefer the simpler model (logistic regression) because it is less prone to overfitting, more interpretable, and easier to maintain.

Section 2: Short Answer (8 questions, 5 points each = 40 points)
Question 11. What are the three components of the Brier score decomposition? For each component, state whether higher or lower values are better and what it measures.
Answer
1. **Reliability** (lower is better): Measures how well the predicted probabilities match observed frequencies. A perfectly calibrated model has reliability = 0.
2. **Resolution** (higher is better): Measures how much the model's predictions vary from the base rate. A model that always predicts the base rate has resolution = 0.
3. **Uncertainty** (constant for a given dataset): Measures the inherent difficulty of the prediction problem. Equals base_rate * (1 - base_rate), maximized at 0.25 when the base rate is 50%.

The relationship is: BS = Reliability - Resolution + Uncertainty.
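A sketch of the decomposition, grouping by unique forecast values (exact for discrete forecasts; binning approximates this for continuous ones). The simulated data below are illustrative only:

```python
import numpy as np

def brier_decomposition(p, y):
    """Murphy decomposition of the Brier score: returns (reliability, resolution, uncertainty)."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    n, base_rate = len(y), y.mean()
    rel = res = 0.0
    for f in np.unique(p):
        mask = p == f
        o_k = y[mask].mean()               # observed frequency for this forecast value
        w = mask.sum() / n                 # weight of the group
        rel += w * (f - o_k) ** 2          # calibration gap (lower is better)
        res += w * (o_k - base_rate) ** 2  # departure from the base rate (higher is better)
    return rel, res, base_rate * (1 - base_rate)

rng = np.random.default_rng(0)
p = rng.choice([0.3, 0.5, 0.7], size=2000)      # discrete forecasts
y = rng.binomial(1, p)                          # outcomes drawn from those probabilities
rel, res, unc = brier_decomposition(p, y)
print(rel - res + unc, np.mean((p - y) ** 2))   # the two quantities agree
```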
Question 12. Explain the difference between a sliding window and an expanding window in walk-forward validation. When would you prefer each approach?
Answer
**Expanding window** starts training from the first observation and grows the training set at each step. It maximizes the training data available at each step. Prefer it when you believe older data remains relevant and more data always helps (e.g., when the data-generating process is relatively stable).

**Sliding window** maintains a fixed training set size, dropping the oldest observations as new ones are added. Prefer it when the data-generating process changes over time (non-stationarity), so that older data may actually hurt performance (e.g., NBA team quality changes year-to-year, so training on data from 5 seasons ago may introduce more noise than signal).
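A minimal index-generating sketch of both schemes (a hypothetical helper, not the chapter's code); the `gap` argument also illustrates the purge gap from Question 7:

```python
def walk_forward_folds(n_obs, initial_train, test_size, step, window="expanding", gap=0):
    """Yield (train_idx, test_idx) pairs for time-ordered walk-forward validation.

    window="expanding": training always starts at index 0 and grows each step.
    window="sliding":   training keeps roughly a fixed length, dropping the oldest rows.
    gap: rows purged between train and test to avoid rolling-feature leakage.
    """
    start = initial_train
    while start + test_size <= n_obs:
        train_end = start - gap
        train_begin = 0 if window == "expanding" else max(0, train_end - initial_train)
        yield range(train_begin, train_end), range(start, start + test_size)
        start += step

# Toy example: 500 time-ordered games, 100-game test blocks, 20-game purge gap.
for tr, te in walk_forward_folds(500, initial_train=200, test_size=100, step=100,
                                 window="sliding", gap=20):
    print(f"train [{tr[0]}-{tr[-1]}]  test [{te[0]}-{te[-1]}]")
```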
Question 13. What is look-ahead bias in backtesting? Give two specific examples relevant to sports betting models.
Answer
**Look-ahead bias** occurs when a backtest uses information that would not have been available at the time of each prediction, making historical results appear better than they would be in practice.

Example 1: **Feature leakage** -- using a team's full-season win percentage to predict mid-season games. At game 40, you would only know the win percentage through game 39, but including the full season's record introduces future information.

Example 2: **Model training leakage** -- training the prediction model on all 5 seasons of data, then "backtesting" on games from those same seasons. The model has already seen the outcomes it is predicting, so the backtest results are unrealistically good. Walk-forward validation prevents this by retraining at each step using only past data.
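A small pandas sketch of the fix for the first example: shift the series by one game before taking the rolling mean, so the feature for game t uses only games played before t (the column names are hypothetical):

```python
import pandas as pd

games = pd.DataFrame({
    "team": ["BOS"] * 6,
    "points_scored": [112, 98, 120, 105, 99, 118],   # chronological order
})

# Leaky: the rolling mean at game t includes game t's own result (look-ahead bias).
games["pts_roll_leaky"] = games.groupby("team")["points_scored"].transform(
    lambda s: s.rolling(3, min_periods=1).mean()
)

# Safe: shift by one game first, so only prior games enter the feature window.
games["pts_roll_safe"] = games.groupby("team")["points_scored"].transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)
print(games)
```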
Question 14. Explain why a model can have a good Brier score but be unprofitable for betting, and conversely, why a model with a mediocre Brier score could potentially be profitable.
Answer
**Good Brier but unprofitable:** A model can have a strong Brier score relative to the base rate but still be worse than the market. If the market's implied probabilities are more accurate than your model's, you will make systematically wrong bets despite having a good overall Brier score. The vigorish (typically 4.76% at standard -110 lines) further erodes any marginal edge.

**Mediocre Brier but profitable:** A model might have a mediocre overall Brier score but excel in specific situations where the market is inefficient. If the model accurately identifies a small subset of games where the market is wrong by > 5%, it can generate profits on those bets even though its average Brier score across all games is unimpressive. Bet selection (only betting when you have edge) turns a mediocre overall forecaster into a profitable bettor.

Question 15. Compare Platt scaling and isotonic regression for recalibration. State one advantage and one disadvantage of each.
Answer
**Platt scaling:**
- Advantage: Requires very few parameters (just a slope and intercept), making it robust with small calibration sets (as few as 200-300 observations).
- Disadvantage: Assumes the miscalibration is a monotonic sigmoid transformation, which may not capture more complex calibration errors (e.g., a model that is overconfident at high probabilities but underconfident at low probabilities).

**Isotonic regression:**
- Advantage: Non-parametric and can correct any monotonic miscalibration pattern, no matter how complex.
- Disadvantage: Requires more calibration data (at least 500-1,000 observations) to avoid overfitting. With small calibration sets, it can produce noisy, unreliable calibration mappings.
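A sketch of both approaches using scikit-learn on a synthetic, deliberately miscalibrated calibration set (the miscalibration pattern and sample sizes are invented for illustration):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Held-out calibration set: raw model probabilities p_raw and observed outcomes y_cal.
true_p = rng.uniform(0.30, 0.80, size=800)
y_cal = rng.binomial(1, true_p)
p_raw = np.clip(true_p + 0.10 * (true_p - 0.55), 0.01, 0.99)   # stretched (overconfident) probabilities

# Platt scaling: logistic regression on the log-odds of the raw probabilities.
logit = np.log(p_raw / (1 - p_raw)).reshape(-1, 1)
platt = LogisticRegression().fit(logit, y_cal)

# Isotonic regression: non-parametric monotonic mapping from raw to calibrated.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(p_raw, y_cal)

p_new = np.array([0.65, 0.75])                                  # raw probabilities to recalibrate
print(platt.predict_proba(np.log(p_new / (1 - p_new)).reshape(-1, 1))[:, 1])
print(iso.predict(p_new))
```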
Question 16. What does the Kelly criterion compute, and why do practitioners typically use fractional Kelly (e.g., 25%) instead of full Kelly for sports betting?
Answer
The **Kelly criterion** computes the optimal fraction of bankroll to wager on a bet to maximize the long-run growth rate of wealth: f* = (bp - q) / b, where b = decimal_odds - 1, p = estimated win probability, and q = 1 - p. Practitioners use **fractional Kelly** (e.g., 25% of the full Kelly stake) because:
1. The Kelly formula assumes the probability estimate is correct. In practice, probability estimates contain errors, and full Kelly with overestimated edge leads to systematic over-betting and faster ruin.
2. Full Kelly produces extreme variance in the bankroll trajectory, with drawdowns that most bettors find psychologically intolerable.
3. Fractional Kelly sacrifices some long-run growth rate for significantly reduced variance and drawdown risk, providing a better risk-adjusted return profile.
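A minimal helper implementing the formula with a fractional-Kelly multiplier (the function name and example line are illustrative):

```python
def kelly_fraction(p, decimal_odds, fraction=0.25):
    """Stake as a fraction of bankroll: fraction * (b*p - q) / b, floored at zero.

    p            -- estimated win probability
    decimal_odds -- payout per unit staked, including the stake (1.909 for -110)
    fraction     -- fractional Kelly multiplier (0.25 = quarter Kelly)
    """
    b = decimal_odds - 1.0
    q = 1.0 - p
    f_star = (b * p - q) / b
    return max(0.0, fraction * f_star)

print(kelly_fraction(0.55, 1.909))        # quarter Kelly: ~0.014 (1.4% of bankroll)
print(kelly_fraction(0.55, 1.909, 1.0))   # full Kelly:    ~0.055 (5.5% of bankroll)
```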
Question 17. Why does BIC penalize model complexity more heavily than AIC for large sample sizes? In the context of sports betting model selection, which criterion is generally preferred?
Answer
**AIC** has a complexity penalty of 2k (where k = number of parameters), which does not depend on sample size. **BIC** has a complexity penalty of k * ln(n), which grows with sample size. For n >= 8 (more precisely, whenever n > e^2, about 7.4), ln(n) > 2, so BIC penalizes complexity more heavily. For sports betting, **BIC is generally preferred** because:
1. Overfitting is the primary risk in sports prediction (the signal-to-noise ratio is low).
2. BIC's heavier complexity penalty selects simpler models that generalize better to new data.
3. BIC is consistent (it selects the true model as n approaches infinity, if it is among the candidates), while AIC tends to select overly complex models.
4. In practice, the extra in-sample fit from complex models rarely translates to better out-of-sample prediction or profitable betting.
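A small sketch of both criteria; the log-likelihoods and parameter counts below are invented to show how the two penalties can disagree:

```python
import numpy as np

def aic_bic(log_likelihood, n_params, n_obs):
    """Return (AIC, BIC); lower is better for both."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * np.log(n_obs) - 2 * log_likelihood
    return aic, bic

# Hypothetical fits on 3,690 games: the complex model gains 60 log-likelihood
# points at the cost of 48 extra parameters.
print(aic_bic(-2245.0, n_params=12, n_obs=3690))   # simple model
print(aic_bic(-2185.0, n_params=60, n_obs=3690))   # complex model
# AIC prefers the complex model here, but BIC's k*ln(n) penalty prefers the simple one.
```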
Question 18. Explain how model ensembling can improve predictions even when one model is clearly better than the others. Under what condition does ensembling fail to provide any benefit?
Answer
Ensembling improves predictions when **model errors are not perfectly correlated**. Even if Model A has a lower Brier score than Model B, there are individual games where Model B makes a better prediction than Model A. By averaging (or weighted-averaging) the predictions, the ensemble cancels out some of the individual errors, reducing overall error variance.

Ensembling **fails to provide benefit** when the models' errors are perfectly correlated -- that is, when they make the same mistakes on the same games. This typically happens when models use the same features, the same algorithm, and similar hyperparameters, producing near-identical predictions. In this case, averaging produces a prediction almost identical to either model individually, providing no improvement.
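A toy numeric illustration with invented predictions: Model B is better overall, but the two models make their large errors on different games, so the simple average beats both:

```python
import numpy as np

y   = np.array([1,    0,    1,    0   ])   # outcomes
p_a = np.array([0.90, 0.50, 0.85, 0.45])   # model A: big misses on games 2 and 4
p_b = np.array([0.55, 0.15, 0.60, 0.10])   # model B: big misses on games 1 and 3

brier = lambda p: np.mean((p - y) ** 2)
p_ens = 0.5 * (p_a + p_b)                  # simple average ensemble

print(brier(p_a), brier(p_b), brier(p_ens))   # ~0.121, ~0.099, ~0.083
```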
Section 3: Applied Problems (4 questions, 7.5 points each = 30 points)
Question 19. You have the following walk-forward validation results for three models predicting NBA game outcomes (Brier scores per fold):
| Fold | Logistic Regression | XGBoost | Neural Network |
|---|---|---|---|
| 1 | 0.228 | 0.221 | 0.235 |
| 2 | 0.222 | 0.218 | 0.219 |
| 3 | 0.225 | 0.215 | 0.212 |
| 4 | 0.220 | 0.219 | 0.216 |
| 5 | 0.223 | 0.220 | 0.218 |
Compute the mean and standard deviation of Brier scores for each model. Based on these results alone (without running a DM test), which model would you recommend, and why? What additional analysis would strengthen your recommendation?
Answer
**Mean and standard deviation** (sample std, ddof = 1; verified in the snippet below):
- Logistic Regression: mean = 0.2236, std = 0.0030
- XGBoost: mean = 0.2186, std = 0.0023
- Neural Network: mean = 0.2200, std = 0.0088

**Recommendation:** XGBoost, for two reasons: (1) it has the lowest mean Brier score, and (2) it has the lowest standard deviation, indicating stable performance across folds. The neural network has a competitive mean but much higher variability (std = 0.0088 vs. 0.0023), suggesting inconsistent generalization. The NN's poor performance in Fold 1 (0.235) is concerning.

**Additional analysis needed:**
1. Pairwise Diebold-Mariano tests to determine if the differences are statistically significant.
2. Calibration analysis (ECE) for each model to assess probability quality.
3. AIC/BIC comparison to account for model complexity differences.
4. A backtest with realistic vig to test actual profitability.
5. An ensemble of XGBoost and the neural network to see if combining them outperforms either alone.
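A quick NumPy check of the fold statistics (ddof=1 gives the sample standard deviation):

```python
import numpy as np

folds = {
    "logistic regression": [0.228, 0.222, 0.225, 0.220, 0.223],
    "xgboost":             [0.221, 0.218, 0.215, 0.219, 0.220],
    "neural network":      [0.235, 0.219, 0.212, 0.216, 0.218],
}

for name, scores in folds.items():
    s = np.array(scores)
    print(f"{name:20s} mean={s.mean():.4f}  std={s.std(ddof=1):.4f}")
```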
Question 20. A model's reliability diagram shows the following bin data:
| Bin (predicted range) | Avg Predicted | Avg Observed | Count |
|---|---|---|---|
| 0.30-0.40 | 0.35 | 0.38 | 45 |
| 0.40-0.50 | 0.45 | 0.43 | 120 |
| 0.50-0.60 | 0.55 | 0.52 | 180 |
| 0.60-0.70 | 0.65 | 0.58 | 110 |
| 0.70-0.80 | 0.75 | 0.64 | 45 |
Compute the ECE. Is this model overconfident or underconfident, and in what probability range is the miscalibration worst? Describe how recalibration would correct this pattern.
Answer
**ECE computation:** Total n = 45 + 120 + 180 + 110 + 45 = 500

ECE = (45/500)|0.38-0.35| + (120/500)|0.43-0.45| + (180/500)|0.52-0.55| + (110/500)|0.58-0.65| + (45/500)|0.64-0.75|
= 0.09 * 0.03 + 0.24 * 0.02 + 0.36 * 0.03 + 0.22 * 0.07 + 0.09 * 0.11
= 0.0027 + 0.0048 + 0.0108 + 0.0154 + 0.0099 = **0.0436** (checked in the script below)

**Assessment:** The model is **overconfident**, particularly at high predicted probabilities. In the 0.60-0.70 bin, it predicts 65% but outcomes occur only 58% of the time (7 percentage point gap). In the 0.70-0.80 bin, the gap widens to 11 percentage points (predicts 75%, observes 64%).

**Recalibration:** Isotonic regression or Platt scaling would compress the high-end predictions toward lower values. Platt scaling would fit a logistic function that maps 0.75 to approximately 0.64 and 0.65 to approximately 0.58, while preserving the relative ordering of predictions. This would reduce the ECE substantially while maintaining the model's discrimination ability.
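The same arithmetic as a short script, with the bin data hard-coded from the table above:

```python
# (avg predicted, avg observed, count) for each bin of the reliability diagram
bins = [
    (0.35, 0.38, 45),
    (0.45, 0.43, 120),
    (0.55, 0.52, 180),
    (0.65, 0.58, 110),
    (0.75, 0.64, 45),
]

n_total = sum(count for _, _, count in bins)
ece = sum((count / n_total) * abs(observed - predicted)
          for predicted, observed, count in bins)
print(round(ece, 4))   # 0.0436
```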
Question 21. You are comparing Model A and Model B using the Diebold-Mariano test on 1,000 NBA game predictions. The mean loss differential is d_bar = 0.003 (Model B is better), and the Newey-West standard error is 0.0018. Compute the DM test statistic and the two-sided p-value. Is the difference statistically significant at alpha = 0.05? What practical conclusion do you draw?
Answer
**DM test statistic:** DM = d_bar / SE = 0.003 / 0.0018 = 1.667

**P-value (two-sided):** Using the standard normal distribution: p = 2 * (1 - Phi(1.667)) = 2 * (1 - 0.9522) = 2 * 0.0478 = 0.0956

**Significance:** At alpha = 0.05, p = 0.096 > 0.05, so we **fail to reject** the null hypothesis of equal predictive accuracy.

**Practical conclusion:** Despite Model B having a 0.003 lower mean Brier score, the difference is not statistically significant at the 5% level. The observed improvement could be due to random variation. Following the principle of parsimony, if Model A is simpler (fewer parameters, easier to maintain), it should be preferred. If both models are similarly complex, you might consider ensembling them, or collecting more data to increase the test's power.
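A quick check of the statistic and p-value, assuming SciPy for the standard normal CDF:

```python
from scipy.stats import norm

d_bar, se = 0.003, 0.0018                    # mean loss differential and its Newey-West SE
dm_stat = d_bar / se
p_value = 2 * (1 - norm.cdf(abs(dm_stat)))   # two-sided, standard normal reference
print(round(dm_stat, 3), round(p_value, 3))  # 1.667 0.096
```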
Question 22. Design a complete model evaluation pipeline for the following scenario: You have 5 seasons of NBA data (1,230 games per season). You want to compare a logistic regression and a neural network, select the better model, and estimate its profitability. Describe each step of your pipeline, specifying:
(a) How you split the data (training, validation, calibration, test).
(b) What walk-forward scheme you use and why.
(c) What metrics you compute and in what order.
(d) How you make the final selection decision.
(e) How you assess profitability.
Answer
**(a) Data split:**
- Seasons 1-3 (3,690 games): Walk-forward training and validation.
- Season 4 (1,230 games): Model selection, calibration, and hyperparameter tuning.
- Season 5 (1,230 games): Final evaluation and backtest. Touched only once, after all decisions are made.

**(b) Walk-forward scheme:** Expanding window on seasons 1-3 with: initial training size = 1 season (1,230 games), test size = 1 month (~100 games), step size = 1 month, purge gap = 25 games (to account for rolling features with up to a 20-game lookback plus a small buffer). This produces roughly 20-25 folds.

**(c) Metrics, in order:**
1. Walk-forward Brier score (mean and std across folds) for both models.
2. Brier score decomposition (reliability, resolution) to diagnose each model.
3. Calibration analysis (ECE, reliability diagram) on season 4 for both models.
4. Recalibration of both models using Platt scaling on season 4.
5. AIC/BIC comparison on season 4 predictions.
6. Diebold-Mariano test on season 4 predictions.

**(d) Selection decision:** If the DM test shows a significant difference (p < 0.05), select the better model. If not significant, prefer the simpler model (logistic regression) or ensemble both models. Use AIC/BIC as a tiebreaker if needed.

**(e) Profitability assessment:** Run a backtest on season 5 using the selected (and recalibrated) model with: fractional Kelly (25%), minimum edge threshold of 3%, maximum bet fraction of 5%. Report ROI, Sharpe ratio, max drawdown, and total number of bets. Compute a bootstrap 95% confidence interval for ROI (see the sketch below).
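For part (e), a sketch of the bootstrap confidence interval for ROI (the betting record below is simulated purely for illustration):

```python
import numpy as np

def bootstrap_roi_ci(profits, stakes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for ROI = total profit / total amount staked."""
    profits, stakes = np.asarray(profits, float), np.asarray(stakes, float)
    rng = np.random.default_rng(seed)
    n = len(profits)
    rois = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)                 # resample bets with replacement
        rois[b] = profits[idx].sum() / stakes[idx].sum()
    return np.quantile(rois, [alpha / 2, 1 - alpha / 2])

# Simulated record: 300 flat $100 bets at -110 with a 54% hit rate.
rng = np.random.default_rng(42)
stakes = np.full(300, 100.0)
profits = np.where(rng.random(300) < 0.54, 90.9, -100.0)
print(bootstrap_roi_ci(profits, stakes))   # a wide interval -- ROI from a few hundred bets is noisy
```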
Section 4: Advanced Integration (3 questions, 10/3 points each = 10 points)
Question 23. Explain the relationship between calibration and profitability in sports betting. Specifically: if Model A has ECE = 0.01 and Brier score = 0.220, while Model B has ECE = 0.05 and Brier score = 0.215, which model is likely to be more profitable for betting, and why?
Answer
**Model B is likely more profitable despite worse calibration**, because profitability depends primarily on discrimination (the ability to identify games where the true probability differs from the market's implied probability) rather than raw calibration. Model B's lower Brier score (0.215 vs. 0.220) indicates better overall prediction quality, which is driven by higher resolution (better discrimination). Its higher ECE (0.05 vs. 0.01) means its raw probability estimates are less well-calibrated, but this is easily corrected through recalibration (Platt scaling or isotonic regression) without affecting discrimination. After recalibration, Model B would retain its superior discrimination while achieving calibration comparable to Model A, making it strictly better for betting.

The key insight is that calibration can be corrected post hoc, but discrimination cannot -- it is determined by the model's features and architecture. Therefore, when choosing between models for betting, prioritize Brier score and resolution over raw ECE, since ECE can be fixed after training.

However, this reasoning assumes recalibration is performed. If you deploy Model B without recalibration, its overconfident predictions could lead to over-betting on games where it falsely claims high edge, potentially losing money despite better underlying prediction quality.

Question 24. A colleague proposes using the Diebold-Mariano test to compare 10 models in a round-robin format (45 pairwise tests). At alpha = 0.05, they find 3 significant differences and conclude that 3 pairs of models are genuinely different. What is the problem with this approach? How would you correct it?
Answer
The problem is the **multiple comparisons problem** (also called the multiple testing problem). When conducting 45 tests at alpha = 0.05, you expect about 45 * 0.05 = 2.25 false positives by chance alone, even if all models are equally good. Finding 3 significant results is barely more than the expected number of false positives.

**Corrections:**
1. **Bonferroni correction:** Divide alpha by the number of tests: alpha_corrected = 0.05 / 45 = 0.0011. Only declare significance if p < 0.0011. This is conservative but simple.
2. **Holm-Bonferroni method:** A less conservative step-down procedure that orders p-values from smallest to largest and applies progressively less strict thresholds.
3. **False Discovery Rate (FDR) control** using the Benjamini-Hochberg procedure: Instead of controlling the family-wise error rate (the probability of any false positive), it controls the expected proportion of false positives among all declared significant results. Often more appropriate when you expect some true differences.
4. **A better approach:** Instead of pairwise testing, use a single omnibus test (e.g., the Superior Predictive Ability test of Hansen, 2005, or the Model Confidence Set of Hansen, Lunde, and Nason, 2011) that tests whether a set of models has equal predictive accuracy simultaneously, properly accounting for multiple comparisons.

Question 25. You build a model that achieves a walk-forward Brier score of 0.208, ECE of 0.015, and a backtested ROI of +3.2% over 3 seasons with fractional Kelly staking. However, the 95% bootstrap confidence interval for the ROI is [-1.1%, +7.5%]. Should you deploy this model for live betting? Justify your decision by discussing the statistical evidence, practical considerations, and risk management.