Chapter 27 Quiz: Advanced Regression and Classification
Instructions: Answer all 25 questions. Each question is worth 4 points (100 total). Select the best answer for multiple-choice questions. Click the answer toggle to check your work.
Question 1. In gradient boosting, each new tree is fit to:
(a) The original training labels (b) The residuals (errors) of the current ensemble (c) A random subset of the training data (d) The predictions of the previous tree only
Answer
**(b) The residuals (errors) of the current ensemble** Each new tree in gradient boosting is trained to predict the negative gradient of the loss function with respect to the current predictions, which for squared error loss equals the residuals. The tree thus learns to correct the errors that the existing ensemble is making.
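This mechanism can be sketched in a few lines of Python (a toy illustration of the residual-fitting loop, not how production libraries implement it):

```python
# Minimal gradient boosting sketch for squared-error loss: each new tree
# is fit to the residuals of the current ensemble's predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=100, eta=0.1, max_depth=3):
    pred = np.full(len(y), y.mean())         # start from the mean prediction
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                 # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred = pred + eta * tree.predict(X)  # shrink each tree's contribution
        trees.append(tree)
    return trees
```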
Question 2. XGBoost differs from standard gradient boosting primarily by:
(a) Using deeper trees and no regularization (b) Using second-order gradient information (Hessians) and built-in regularization (c) Training trees in parallel rather than sequentially (d) Requiring more data to train effectively
Answer
**(b) Using second-order gradient information (Hessians) and built-in regularization** XGBoost computes both first and second derivatives of the loss function when determining splits. It also adds regularization terms ($\gamma T + \frac{1}{2}\lambda\sum w_j^2$) to the objective function, controlling tree complexity and leaf weights. Trees are still built sequentially.
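A hypothetical parameter set showing where these regularization terms surface in the XGBoost API (values are placeholders, not tuned recommendations):

```python
# gamma maps to the per-leaf penalty (gamma * T); reg_lambda maps to the
# L2 penalty on leaf weights ((lambda/2) * sum of w_j^2).
import xgboost as xgb

model = xgb.XGBClassifier(
    objective="binary:logistic",
    n_estimators=500,
    learning_rate=0.05,
    max_depth=4,
    gamma=1.0,       # minimum loss reduction required to make a split
    reg_lambda=1.0,  # L2 regularization on leaf weights
)
```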
Question 3. A sports bettor should set XGBoost's max_depth parameter based on:
(a) Always use the deepest trees possible for maximum accuracy (b) The dataset size and complexity --- deeper trees for larger datasets, shallower for smaller (c) The number of features in the model (d) The sport being predicted
Answer
**(b) The dataset size and complexity --- deeper trees for larger datasets, shallower for smaller** `max_depth` controls the complexity of individual trees. Deeper trees can model more interactions but are prone to overfitting, especially with small sports datasets. For most sports prediction tasks (500-5000 games), `max_depth` of 3-6 works well. Only datasets with 10,000+ games warrant deeper trees.
Question 4. Which of the following is the correct way to validate a sports prediction model?
(a) Random k-fold cross-validation (b) Leave-one-out cross-validation (c) Time-series cross-validation (train on past, test on future) (d) Training on odd-numbered games, testing on even-numbered
Answer
**(c) Time-series cross-validation (train on past, test on future)** Sports data has a temporal structure: games happen in sequence. Training on future data to predict past data constitutes data leakage and produces artificially inflated performance metrics. Time-series CV (using `TimeSeriesSplit`) respects this temporal ordering by always training on earlier games and testing on later ones.
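A minimal sketch, assuming `X` and `y` are pandas objects sorted chronologically and `model` is any scikit-learn-style estimator:

```python
# Each fold trains only on games that occur before the games it is tested on.
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    print(model.score(X.iloc[test_idx], y.iloc[test_idx]))
```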
Question 5. The XGBoost learning rate (eta) is typically set between:
(a) 0.5 and 1.0 (b) 0.001 and 0.3 (c) 1.0 and 10.0 (d) It does not matter as long as enough trees are used
Answer
**(b) 0.001 and 0.3** Lower learning rates (0.01-0.1) require more trees but generally produce better generalization. The learning rate scales the contribution of each tree: $\hat{y}^{(t)} = \hat{y}^{(t-1)} + \eta f_t(x)$. Very high learning rates overfit quickly, while very low rates underfit (and waste computation) unless the number of trees is increased proportionally.
Question 6. In random forests, the max_features parameter (number of features considered at each split) serves to:
(a) Speed up training by using fewer features (b) Decorrelate trees, reducing ensemble variance (c) Reduce the total number of features in the model (d) Ensure each feature is used exactly once
Answer
**(b) Decorrelate trees, reducing ensemble variance** By randomly restricting which features each split can consider, random forests ensure that different trees use different features, reducing the correlation between trees. Since ensemble variance is $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$, reducing $\rho$ (correlation) directly reduces variance.
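For example (a sketch assuming training data `X_train`, `y_train`):

```python
# max_features="sqrt" lets each split consider only sqrt(p) randomly chosen
# features, which decorrelates the trees in the ensemble.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",
    random_state=42,
).fit(X_train, y_train)
```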
Question 7. A perfectly calibrated model produces predictions where:
(a) The accuracy exceeds 70% on all games (b) Predicted probabilities match observed frequencies: when it says 70%, teams win 70% of the time (c) All predictions are either very close to 0 or very close to 1 (d) The AUC-ROC equals 1.0
Answer
**(b) Predicted probabilities match observed frequencies: when it says 70%, teams win 70% of the time** Calibration is the property that $P(Y=1 | \hat{p}(X) = p) = p$ for all $p$. A well-calibrated model's reliability diagram lies on the 45-degree line. Calibration is independent of discrimination (AUC-ROC) --- a model can be well-calibrated but not very discriminating, or highly discriminating but poorly calibrated.
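A reliability diagram can be sketched with scikit-learn and matplotlib, assuming test labels `y_test` and predicted probabilities `p_test`:

```python
# Bin the predictions, compare mean predicted probability to the observed
# win rate in each bin, and plot against the 45-degree line.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

frac_pos, mean_pred = calibration_curve(y_test, p_test, n_bins=10)

plt.plot(mean_pred, frac_pos, "o-", label="model")
plt.plot([0, 1], [0, 1], "--", label="perfect calibration")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed frequency")
plt.legend()
plt.show()
```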
Question 8. An overconfident model's calibration curve will show:
(a) A curve above the diagonal for all probabilities (b) An S-shaped curve where extreme predictions are more extreme than warranted (c) A flat line at 0.5 (d) A curve below the diagonal for all probabilities
Answer
**(b) An S-shaped curve where extreme predictions are more extreme than warranted** Overconfident models push predictions toward 0 and 1 more than the data supports. On the calibration curve, this manifests as: predictions near 0.2 actually correspond to win rates around 0.30 (curve above diagonal on the left) and predictions near 0.8 correspond to win rates around 0.70 (curve below diagonal on the right), forming an S-shape.
Question 9. Platt scaling calibrates model probabilities by fitting:
(a) A polynomial regression to the prediction residuals (b) A logistic regression on the model's raw outputs (log-odds) (c) A decision tree to the prediction errors (d) A k-nearest-neighbors model to similar predictions
Answer
**(b) A logistic regression on the model's raw outputs (log-odds)** Platt scaling fits $P(y=1|f(x)) = 1/(1 + \exp(Af(x) + B))$ where $f(x)$ is the uncalibrated output. This two-parameter sigmoid transformation corrects systematic over- or under-confidence. It works well when the miscalibration is approximately sigmoid-shaped.
Question 10. Isotonic regression calibration is preferred over Platt scaling when:
(a) The calibration dataset is very small (< 100 samples) (b) The miscalibration pattern is non-sigmoid but still monotonic (c) The model is already well-calibrated (d) Speed is the primary concern
Answer
**(b) The miscalibration pattern is non-sigmoid but still monotonic** Isotonic regression is nonparametric and can fit any monotonically non-decreasing calibration function. Platt scaling assumes the calibration function is sigmoid, which may not hold. However, isotonic regression requires more data (500+ samples) to avoid overfitting, while Platt scaling works with as few as 100 samples.
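Both calibrators are available through the same scikit-learn wrapper; a sketch assuming an uncalibrated `base_model` and the usual train/test split:

```python
# method="sigmoid" is Platt scaling; method="isotonic" fits a monotone step
# function. cv=5 calibrates on held-out folds rather than on the training fit.
from sklearn.calibration import CalibratedClassifierCV

platt = CalibratedClassifierCV(base_model, method="sigmoid", cv=5).fit(X_train, y_train)
iso = CalibratedClassifierCV(base_model, method="isotonic", cv=5).fit(X_train, y_train)

p_platt = platt.predict_proba(X_test)[:, 1]
p_iso = iso.predict_proba(X_test)[:, 1]
```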
Question 11. For a sports bettor, which evaluation metric is MOST important?
(a) Accuracy (percentage of correct predictions) (b) AUC-ROC (area under the receiver operating characteristic curve) (c) Log-loss (penalizes poor probability estimates) (d) F1 score (harmonic mean of precision and recall)
Answer
**(c) Log-loss (penalizes poor probability estimates)** Bettors need accurate probability estimates, not just correct classifications. Log-loss directly measures the quality of probability predictions: it heavily penalizes confident wrong predictions and rewards well-calibrated confidence. Accuracy treats all predictions equally regardless of confidence, while F1 and AUC-ROC measure classification and ranking quality rather than probability accuracy.
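For example (a sketch assuming `y_test` and the predicted probabilities of two candidate models):

```python
# Log-loss compares probability quality; accuracy only checks the 0.5 threshold,
# so two models can tie on accuracy while differing substantially in log-loss.
from sklearn.metrics import accuracy_score, log_loss

print("log-loss A:", log_loss(y_test, p_model_a))
print("log-loss B:", log_loss(y_test, p_model_b))
print("accuracy A:", accuracy_score(y_test, p_model_a > 0.5))
print("accuracy B:", accuracy_score(y_test, p_model_b > 0.5))
```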
Question 12. The scale_pos_weight parameter in XGBoost is used to:
(a) Scale all features to unit variance (b) Handle class imbalance by upweighting the minority class in the loss function (c) Normalize the learning rate (d) Control the maximum weight of any leaf node
Answer
**(b) Handle class imbalance by upweighting the minority class in the loss function** Setting `scale_pos_weight` to the ratio of negative to positive examples (e.g., 4.0 for an 80/20 class split) effectively increases the penalty for misclassifying minority-class examples. This is computationally cheaper and often more effective than SMOTE or other resampling methods.
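A sketch of computing the ratio from the training labels (assumes `y_train` is a 0/1 integer array):

```python
# scale_pos_weight = (number of negative examples) / (number of positive examples)
import numpy as np
import xgboost as xgb

neg, pos = np.bincount(y_train)
model = xgb.XGBClassifier(
    objective="binary:logistic",
    scale_pos_weight=neg / pos,   # e.g. 4.0 for an 80/20 split
).fit(X_train, y_train)
```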
Question 13. SMOTE (Synthetic Minority Over-sampling Technique) creates new minority-class examples by:
(a) Duplicating existing minority examples (b) Interpolating between pairs of existing minority-class neighbors (c) Randomly generating new examples from a normal distribution (d) Removing majority-class examples to balance the dataset
Answer
**(b) Interpolating between pairs of existing minority-class neighbors** SMOTE selects a minority example, finds its k nearest minority-class neighbors, picks one neighbor randomly, and creates a new synthetic example along the line segment between them: $x_{\text{new}} = x_i + \lambda(x_{nn} - x_i)$ where $\lambda \sim \text{Uniform}(0,1)$. This produces more diverse synthetic examples than simple duplication.
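With the imbalanced-learn package (assumed installed), the usage is a one-liner; resample only the training split so the test set keeps the original class balance:

```python
# fit_resample returns the original rows plus synthetic minority examples
# interpolated between nearest minority-class neighbors.
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
```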
Question 14. After applying class rebalancing (weights or SMOTE), you should ALWAYS:
(a) Retrain the model from scratch (b) Recalibrate the probability outputs on the original distribution (c) Increase the learning rate to compensate (d) Remove the least important features
Answer
**(b) Recalibrate the probability outputs on the original distribution** Any rebalancing technique distorts the model's probability estimates. A model trained on balanced data will overestimate the probability of the originally rare class. Calibration using the original (unbalanced) distribution corrects these distorted probabilities, which is essential for sports betting where accurate probability estimates drive wagering decisions.
Question 15. SHAP values for a prediction always sum to:
(a) 1.0 (b) The predicted probability (c) The difference between the prediction and the average model output (base value) (d) Zero
Answer
**(c) The difference between the prediction and the average model output (base value)** By the local accuracy property: $f(x) = \phi_0 + \sum_{j=1}^p \phi_j$ where $\phi_0$ is the base value (average prediction across training data) and $\phi_j$ are the SHAP values. Therefore $\sum \phi_j = f(x) - \phi_0$. This ensures every prediction is fully explained by the feature contributions.
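This additivity can be checked directly with the shap package (a sketch assuming a fitted XGBoost classifier `model` and feature matrix `X`; for a binary model the values live in log-odds space):

```python
# Each row of SHAP values should sum to the model's raw margin output minus
# the explainer's base (expected) value.
import numpy as np
import shap

explainer = shap.TreeExplainer(model)
phi = explainer.shap_values(X)                  # shape (n_samples, n_features)
base = explainer.expected_value                 # average margin output

margin = model.predict(X, output_margin=True)   # raw log-odds predictions
print(np.allclose(phi.sum(axis=1), margin - base, atol=1e-3))
```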
Question 16. TreeSHAP computes exact SHAP values in polynomial time rather than exponential time because:
(a) It approximates SHAP values using sampling (b) It exploits the tree structure to avoid enumerating all feature subsets (c) It uses GPU acceleration (d) It computes SHAP values for a random subset of features only
Answer
**(b) It exploits the tree structure to avoid enumerating all feature subsets** TreeSHAP traces paths through the tree structure, using the tree's recursive partitioning to efficiently compute the exact contribution of each feature without enumerating all $2^p$ feature subsets. This runs in $O(TLD^2)$ time for an ensemble of $T$ trees with $L$ leaves and depth $D$.
Question 17. A SHAP dependence plot for "elo_diff" shows a clear positive trend (higher elo_diff leads to higher SHAP values) with substantial vertical spread at elo_diff = 100. The vertical spread indicates:
(a) Measurement error in the Elo ratings (b) Feature interactions --- other features modify the effect of elo_diff (c) The model is poorly trained on this feature range (d) SHAP values are unreliable for this feature
Answer
**(b) Feature interactions --- other features modify the effect of elo_diff** Vertical spread in a SHAP dependence plot means that the same feature value produces different SHAP contributions for different predictions. This happens when the feature interacts with other features --- for example, a +100 Elo advantage might matter more when the team is well-rested than when playing a back-to-back.
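Such a plot can be produced with `shap.dependence_plot` (a sketch reusing `phi` and the feature DataFrame `X` from the earlier SHAP sketch, and assuming `X` has an `elo_diff` column):

```python
# By default the points are colored by the feature with the strongest
# estimated interaction, which often explains the vertical spread.
import shap

shap.dependence_plot("elo_diff", phi, X)
```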
Question 18. Which statement about model stacking is FALSE?
(a) Base models should be diverse (different algorithms, not just different hyperparameters) (b) Out-of-fold predictions prevent the meta-learner from seeing base models' training predictions (c) The meta-learner should always be a complex model like XGBoost (d) Stacking can capture nonlinear relationships between base model predictions
Answer
**(c) The meta-learner should always be a complex model like XGBoost** The meta-learner should typically be simple (logistic regression or linear regression) to avoid overfitting at the meta-level. The base models provide the complexity; the meta-learner's job is to learn the optimal combination, which usually requires only a few parameters. A complex meta-learner on top of complex base models compounds the risk of overfitting.
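A sketch of this pattern with scikit-learn (diverse base models, simple logistic-regression meta-learner; hyperparameters are placeholders):

```python
# cv=5 generates out-of-fold base-model predictions for the meta-learner.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
import xgboost as xgb

stack = StackingClassifier(
    estimators=[
        ("xgb", xgb.XGBClassifier(max_depth=4, n_estimators=300)),
        ("rf", RandomForestClassifier(n_estimators=500)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),   # keep the meta-learner simple
    stack_method="predict_proba",
    cv=5,
).fit(X_train, y_train)
```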
Question 19. Early stopping in XGBoost:
(a) Stops training when the training loss reaches zero (b) Stops training when validation performance stops improving for a specified number of rounds (c) Limits the maximum depth of each tree (d) Prevents any feature from being used more than once
Answer
**(b) Stops training when validation performance stops improving for a specified number of rounds** Early stopping monitors performance on a validation set and halts training when the metric (e.g., log-loss) fails to improve for a specified number of consecutive rounds (`early_stopping_rounds`). This prevents the model from training too many trees and overfitting. It is the most practical way to determine the optimal number of boosting rounds.
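A sketch, assuming a chronological train/validation split and a recent xgboost release (where `early_stopping_rounds` is accepted in the constructor):

```python
# Training halts once validation log-loss fails to improve for 50 rounds;
# n_estimators is just an upper bound on the number of trees.
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=2000,
    learning_rate=0.05,
    eval_metric="logloss",
    early_stopping_rounds=50,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("best iteration:", model.best_iteration)
```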
Question 20. For predicting point spreads (regression) rather than win/loss (classification), XGBoost should use:
(a) objective='binary:logistic'
(b) objective='reg:squarederror'
(c) objective='multi:softmax'
(d) objective='rank:pairwise'
Answer
**(b) `objective='reg:squarederror'`** Point spread prediction is a regression problem (predicting a continuous value: the expected point differential). The `reg:squarederror` objective minimizes mean squared error, which is the natural loss function for continuous outcome prediction. The `binary:logistic` objective is for binary classification (win/loss).
Question 21. The Expected Calibration Error (ECE) is computed as:
(a) The maximum absolute difference between predicted and actual probabilities across bins (b) The weighted average of absolute calibration errors across bins, where weights are bin sizes (c) The variance of predicted probabilities (d) One minus the AUC-ROC score
Answer
**(b) The weighted average of absolute calibration errors across bins, where weights are bin sizes** $\text{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} |p_b - \hat{p}_b|$ where $n_b$ is the number of samples in bin $b$, $p_b$ is the observed frequency, and $\hat{p}_b$ is the mean predicted probability. The weighting ensures bins with more samples have more influence on the metric.
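A direct implementation of this formula (a sketch assuming numpy arrays `y_true` of 0/1 outcomes and `p_pred` of predicted probabilities):

```python
import numpy as np

def expected_calibration_error(y_true, p_pred, n_bins=10):
    # Assign each prediction to one of n_bins equal-width bins over [0, 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(p_pred, edges[1:-1])   # values in 0 .. n_bins - 1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            observed = y_true[mask].mean()    # empirical frequency in the bin
            predicted = p_pred[mask].mean()   # mean predicted probability
            ece += mask.mean() * abs(observed - predicted)  # weight = n_b / N
    return ece
```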
Question 22. A Partial Dependence Plot (PDP) shows:
(a) The correlation between two features (b) The marginal effect of a single feature on predictions, averaged over all other features (c) The distribution of a feature in the training data (d) The model's prediction error as a function of a feature
Answer
**(b) The marginal effect of a single feature on predictions, averaged over all other features** The PDP computes $\text{PDP}(x_j) = \frac{1}{n}\sum_{i=1}^n f(x_j, \mathbf{x}_{-j}^{(i)})$, varying feature $j$ while keeping all other features at their observed values and averaging. This reveals how changing one feature, on average, affects the prediction.
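scikit-learn can produce these plots directly (a sketch assuming a fitted `model` and a DataFrame `X_train` with an `elo_diff` column):

```python
# Varies elo_diff over a grid, averaging predictions over the other features
# at their observed values, then plots the average effect.
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(model, X_train, features=["elo_diff"])
```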
Question 23. When combining rating system outputs (Elo, Massey) as features in an XGBoost model, you should:
(a) Include only the single best rating system to avoid multicollinearity (b) Include multiple rating systems because XGBoost handles correlated features well and can learn which is most useful (c) First decorrelate the ratings using PCA before inputting them (d) Normalize all ratings to the same 0-1 scale
Answer
**(b) Include multiple rating systems because XGBoost handles correlated features well and can learn which is most useful** Tree-based models like XGBoost are robust to multicollinearity because they select one feature at a time for splits. Including multiple rating systems gives the model more information to work with, and the model will naturally learn which rating system is most informative in different contexts. This is unlike linear regression, where multicollinearity inflates standard errors.
Question 24. The Brier score is defined as $\frac{1}{N}\sum(y_i - \hat{p}_i)^2$. A naive model that always predicts 0.5 on a balanced dataset (50% home wins) achieves a Brier score of:
(a) 0.00 (b) 0.25 (c) 0.50 (d) 1.00
Answer
**(b) 0.25** For each game, the squared error is $(y - 0.5)^2$. When $y = 1$: $(1 - 0.5)^2 = 0.25$. When $y = 0$: $(0 - 0.5)^2 = 0.25$. So every prediction contributes 0.25, and the average is 0.25. Any useful model should achieve a Brier score below this baseline.
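The arithmetic is easy to verify numerically (a small self-contained sketch):

```python
# A coin-flip model on a balanced set of outcomes scores exactly 0.25.
import numpy as np
from sklearn.metrics import brier_score_loss

y = np.array([1, 0] * 500)        # 50% home wins
p = np.full(y.shape, 0.5)         # naive model: always predict 0.5
print(brier_score_loss(y, p))     # 0.25
```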
Question 25. You have built an XGBoost model with a test-set log-loss of 0.660. A simple Elo-based model achieves 0.675 on the same test set. Should you use the XGBoost model for betting?
(a) Yes, because it has a lower log-loss (b) Not necessarily --- the improvement may not be statistically significant or economically meaningful once calibrated (c) No, because simpler models are always better for betting (d) Yes, but only if the AUC-ROC also improves
Answer
**(b) Not necessarily --- the improvement may not be statistically significant or economically meaningful once calibrated** A 0.015 log-loss improvement on a single test set can easily fall within sampling noise, so it should be checked for statistical significance and for stability across seasons before being trusted. Even a genuine improvement only matters for betting if the calibrated probabilities identify enough of an edge to overcome the bookmaker's margin.