Chapter 23: Quiz
Questions
Question 1
Which of the following is NOT a valid reason to prefer ML over logistic regression for prediction market probability estimation?
A) The relationship between features and outcomes is highly nonlinear
B) You have hundreds of candidate features with unknown interactions
C) You want a model that always produces perfectly calibrated probabilities
D) You want to automatically discover complex feature interactions
Question 2
In a random forest, predicted probabilities tend to be:
A) Perfectly calibrated by construction
B) Pushed toward extreme values (0 or 1)
C) Pushed toward 0.5 due to the averaging of many trees
D) Uniformly distributed between 0 and 1
Question 3
What is the key difference between how random forests and gradient boosting build their ensemble of trees?
A) Random forests use deeper trees, gradient boosting uses shallow trees
B) Random forests build trees independently in parallel, gradient boosting builds them sequentially with each correcting the previous
C) Random forests use all features, gradient boosting uses random subsets
D) Random forests use binary splits, gradient boosting uses multiway splits
Question 4
In XGBoost's objective function, the regularization term $\Omega(h) = \gamma T + \frac{1}{2}\lambda \sum w_j^2$ includes two components. What does increasing $\gamma$ do?
A) Increases the learning rate
B) Penalizes large leaf weights, preventing extreme predictions
C) Requires a larger reduction in loss to justify a new split, resulting in simpler trees
D) Increases the number of boosting rounds needed for convergence
Question 5
Which data splitting strategy is correct for prediction market data?
A) Random 80/20 split
B) Stratified random split to maintain class proportions
C) Temporal split where training data precedes validation and test data in time
D) Leave-one-out cross-validation with random ordering
Question 6
A neural network for binary probability estimation should use which activation function on its output layer?
A) ReLU
B) Tanh
C) Sigmoid
D) Softmax
Question 7
What is the primary purpose of batch normalization in a neural network?
A) To reduce the size of mini-batches
B) To normalize activations within each mini-batch, stabilizing training
C) To eliminate the need for dropout
D) To ensure output probabilities are calibrated
Question 8
When training a neural network with binary cross-entropy loss, the gradient with respect to the pre-sigmoid logit $z_i$ is:
A) $y_i \log(\hat{p}_i)$
B) $\hat{p}_i - y_i$
C) $(y_i - \hat{p}_i)^2$
D) $\frac{y_i}{\hat{p}_i}$
Question 9
Bayesian optimization for hyperparameter tuning (e.g., Optuna) improves upon grid search because:
A) It evaluates fewer hyperparameter combinations but uses them more efficiently
B) It always finds the global optimum
C) It builds a probabilistic model of the hyperparameter-performance relationship and intelligently selects the next combination
D) It avoids the need for cross-validation
Question 10
Why is random search often more efficient than grid search for hyperparameter tuning?
A) Random search parallelizes better
B) Not all hyperparameters are equally important, and random search allocates more evaluations to important ones by chance
C) Random search always converges faster
D) Grid search cannot handle continuous hyperparameters
Question 11
A model produces the following reliability diagram: when it predicts 0.8, the observed frequency is 0.65; when it predicts 0.2, the observed frequency is 0.30. This model is:
A) Well calibrated
B) Overconfident (predictions too extreme)
C) Underconfident (predictions too moderate)
D) Impossible to determine from this information
Question 12
Platt scaling calibrates a model's output by fitting:
A) An isotonic regression on the model's raw output
B) A logistic regression where the model's raw output is the single feature
C) A temperature parameter that scales the logit
D) A polynomial regression on the model's output
Question 13
Which calibration method is most appropriate when you have only 200 calibration samples?
A) Isotonic regression, because it is nonparametric
B) Platt scaling, because it has only 2 parameters and is less prone to overfitting
C) Temperature scaling, because neural networks always need it
D) No calibration, because 200 samples is too few
Question 14
SHAP values satisfy the property that for any prediction $f(x)$:
A) The sum of SHAP values equals $f(x)$
B) The sum of SHAP values plus the base value equals $f(x)$
C) Each SHAP value is between -1 and 1
D) SHAP values are always positive for features that increase the prediction
Question 15
A SHAP dependence plot for "approval_rating" shows that the SHAP value is near zero for approval ratings below 40%, increases linearly from 40% to 55%, and levels off above 55%. This indicates:
A) Approval rating is not an important feature
B) The model has a linear relationship with approval rating
C) The model has learned a nonlinear effect where approval rating matters most in the 40-55% range
D) The model is poorly calibrated for candidates with high approval ratings
Question 16
Which type of SHAP explainer should be used for an XGBoost model?
A) KernelExplainer (model-agnostic, slower)
B) DeepExplainer (designed for deep learning)
C) TreeExplainer (fast, exact for tree-based models)
D) LinearExplainer (designed for linear models)
Question 17
A feature engineer creates a "momentum" feature: the 7-day change in polling margin. This feature captures:
A) The absolute level of polling support
B) The direction and rate of change in polling, which may predict future movement
C) The variance of polling data
D) The interaction between polling and time
Question 18
Data leakage in a prediction market model can be caused by:
A) Using time-series cross-validation
B) Standardizing features using only training set statistics
C) Including the prediction market contract price at resolution time as a feature
D) Using early stopping to prevent overfitting
Question 19
When training an ML model on a rare event (5% base rate), which approach is LEAST appropriate for probability estimation?
A) Using scale_pos_weight in XGBoost
B) Heavy oversampling (50/50 ratio) without post-hoc recalibration
C) Evaluating with log-loss instead of accuracy
D) Using class weights in the loss function
Question 20
Expected Calibration Error (ECE) is computed by:
A) The maximum difference between predicted and observed frequency in any bin
B) The weighted average of absolute differences between predicted and observed frequency across bins
C) The root mean squared error between predicted and observed frequencies
D) The KL divergence between predicted and observed distributions
Question 21
Concept drift in prediction markets refers to:
A) The market price moving toward the true probability over time
B) The relationship between features and outcomes changing over time
C) The model's predictions drifting away from 0.5
D) The gradual increase in the number of available features
Question 22
When deploying a new ML model to replace an existing one, the safest approach is:
A) Immediately replace the old model with the new one
B) Run the new model in shadow mode, then canary deployment, then full deployment
C) Use the new model only during weekdays
D) Average the old and new model predictions permanently
Question 23
A model monitor detects that the rolling Brier score has increased from 0.18 to 0.26 over the last month. The appropriate response is:
A) Ignore it; small fluctuations are normal
B) Immediately stop all trading
C) Investigate the cause (concept drift, data quality, feature distribution shift), and consider retraining
D) Increase the model's confidence by adjusting the temperature parameter
Question 24
For prediction market applications, the Brier Skill Score (BSS) measures:
A) How much better the model is compared to always predicting 0.5
B) How much better the model is compared to predicting the historical base rate
C) The correlation between predictions and outcomes
D) The average accuracy across different probability bins
Question 25
You have built four models (logistic regression, random forest, XGBoost, neural network) and want to determine if XGBoost is statistically significantly better than random forest. The appropriate test is:
A) An unpaired t-test on overall Brier scores
B) A paired t-test on per-instance losses, since both models predict on the same test instances
C) A chi-squared test on classification accuracy
D) No test is needed; the model with the lower Brier score is better
Answer Key
Question 1: C
ML models do not automatically produce perfectly calibrated probabilities. In fact, most ML models require post-hoc calibration (Platt scaling, isotonic regression, temperature scaling) to achieve good calibration. The other options (A, B, D) are valid reasons to prefer ML.
Question 2: C
Random forest probabilities are averages of many tree predictions. Since individual trees may disagree, the average is pulled toward 0.5. This is a form of variance reduction — beneficial for stability but requiring calibration for accurate extreme probabilities.
Question 3: B
Random forests build trees independently (on bootstrapped samples) and average their predictions. Gradient boosting builds trees sequentially, where each new tree is fitted to the residuals (negative gradient of the loss) of the current ensemble, correcting previous errors.
Question 4: C
The $\gamma T$ term penalizes the number of leaves $T$ in a tree. Increasing $\gamma$ means a split must achieve a larger reduction in loss to be worth the complexity penalty of adding a new leaf. This results in fewer splits and simpler trees.
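As a minimal sketch (assuming the xgboost scikit-learn wrapper; the other hyperparameter values are illustrative), both regularization terms map directly to constructor arguments:

```python
import xgboost as xgb

# gamma: minimum loss reduction required to make a further split (the gamma * T penalty).
# reg_lambda: L2 penalty on leaf weights (the (1/2) * lambda * sum(w_j^2) term).
model = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=4,
    learning_rate=0.05,
    gamma=1.0,        # larger gamma -> a split must "pay for itself" -> fewer, simpler trees
    reg_lambda=1.0,   # larger lambda -> smaller leaf weights -> less extreme predictions
    objective="binary:logistic",
)
```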
Question 5: C
Prediction market data is temporal. Training on future data to predict the past creates look-ahead bias (data leakage). Temporal splitting ensures the model only sees past data during training, mimicking real-world deployment.
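A minimal sketch of a temporal split; the DataFrame and cutoff dates are illustrative:

```python
import pandas as pd

# Hypothetical example frame: one row per market snapshot, with an observation date.
df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=1000, freq="D"),
    "feature": range(1000),
})

# All training data strictly precedes validation, which strictly precedes test,
# mimicking what the model would actually have known at prediction time.
train = df[df["date"] < "2022-06-01"]
valid = df[(df["date"] >= "2022-06-01") & (df["date"] < "2023-01-01")]
test  = df[df["date"] >= "2023-01-01"]
```

For cross-validation, scikit-learn's TimeSeriesSplit provides the same past-only guarantee across multiple folds.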
Question 6: C
Sigmoid maps the output to $[0, 1]$, which can be interpreted as a probability. ReLU outputs non-negative values without an upper bound. Tanh outputs $[-1, 1]$. Softmax is for multi-class problems, not binary.
Question 7: B
Batch normalization normalizes the activations (mean=0, variance=1) within each mini-batch. This stabilizes the distribution of inputs to each layer during training, allowing higher learning rates and faster convergence.
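A minimal PyTorch sketch tying the last two answers together (layer sizes are illustrative): batch normalization after each hidden layer, and a sigmoid on the single output unit.

```python
import torch
import torch.nn as nn

# Illustrative binary probability estimator: BatchNorm1d stabilizes each hidden
# layer's activations; the final Sigmoid maps the logit into [0, 1].
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Linear(32, 1),
    nn.Sigmoid(),     # output interpretable as P(y = 1 | x)
)

probs = model(torch.randn(8, 20))  # shape (8, 1), values in (0, 1)
```

In practice it is common to drop the final Sigmoid during training and use nn.BCEWithLogitsLoss for numerical stability, applying the sigmoid only at inference time.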
Question 8: B
The gradient of binary cross-entropy with respect to the pre-sigmoid logit is $\hat{p}_i - y_i$: simply the prediction error. Gradient descent therefore pushes each predicted probability directly toward its label, with larger updates where the error is larger.
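For completeness, the one-line derivation, with $\hat{p}_i = \sigma(z_i)$ and per-example loss $\ell_i = -[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)]$:

$$
\frac{\partial \ell_i}{\partial z_i}
= \frac{\partial \ell_i}{\partial \hat{p}_i}\cdot\frac{\partial \hat{p}_i}{\partial z_i}
= \left(-\frac{y_i}{\hat{p}_i} + \frac{1 - y_i}{1 - \hat{p}_i}\right)\hat{p}_i(1 - \hat{p}_i)
= \hat{p}_i - y_i .
$$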
Question 9: C
Bayesian optimization builds a surrogate model (typically a Gaussian Process or Tree-Structured Parzen Estimator) of the relationship between hyperparameters and performance. It uses this model to select the most promising combinations to evaluate next, making the search more sample-efficient than random or grid search.
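A minimal Optuna sketch; the search ranges are illustrative and `train_and_validate` is a hypothetical helper standing in for whatever cross-validated metric you actually compute:

```python
import optuna

def objective(trial):
    # Optuna suggests the next hyperparameters from its surrogate model (TPE by default).
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 20),
    }
    # Hypothetical helper: fit the model with these params, return validation log-loss.
    return train_and_validate(params)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```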
Question 10: B
Bergstra and Bengio (2012) showed that random search is more efficient because it explores each hyperparameter dimension more thoroughly. If one hyperparameter has a strong effect and another has little effect, grid search wastes evaluations on the unimportant one, while random search naturally allocates proportional coverage.
Question 11: B
When the model predicts 0.8 but the observed frequency is only 0.65, the model is overconfident (predicting more extreme values than reality). Similarly, predicting 0.2 when reality is 0.30 shows overconfidence on the low end.
Question 12: B
Platt scaling fits a logistic regression with the model's raw output as the single feature: $p_{cal} = \sigma(a \cdot f(x) + b)$. The two parameters $a$ and $b$ are learned on a calibration set to map raw outputs to calibrated probabilities.
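A minimal scikit-learn sketch; the calibration scores and outcomes below are illustrative stand-ins for a real held-out calibration set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical held-out calibration data: raw model scores and binary outcomes.
raw_val = np.array([0.92, 0.85, 0.40, 0.10, 0.70, 0.30, 0.95, 0.15])
y_val   = np.array([1,    1,    0,    0,    1,    0,    1,    0])

# Platt scaling: logistic regression with the raw score as the single feature,
# learning p_cal = sigmoid(a * f(x) + b) on the calibration set.
platt = LogisticRegression()
platt.fit(raw_val.reshape(-1, 1), y_val)

raw_new = np.array([[0.80]])
p_calibrated = platt.predict_proba(raw_new)[:, 1]
```

A common variant fits the logistic map to the model's logit (log-odds) rather than to the raw probability, which often behaves better near 0 and 1.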
Question 13: B
With only 200 samples, isotonic regression (nonparametric, many effective parameters) risks overfitting. Platt scaling has only 2 parameters and is much safer with limited data. Temperature scaling (1 parameter) is even simpler but less flexible.
Question 14: B
The fundamental SHAP property is: $f(x) = \phi_0 + \sum_{j=1}^{M} \phi_j$, where $\phi_0$ is the base value (average model prediction) and the SHAP values $\phi_j$ represent each feature's contribution beyond the baseline.
Question 15: C
The pattern of SHAP values — near zero below 40%, increasing 40-55%, leveling off above 55% — indicates a nonlinear relationship. The model learned that approval rating only significantly affects the prediction in the 40-55% range, with diminishing returns beyond that.
Question 16: C
TreeExplainer is specifically designed for tree-based models (XGBoost, LightGBM, random forests). It computes exact SHAP values in polynomial time, much faster than the model-agnostic KernelExplainer.
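A minimal sketch, assuming `model` is a fitted XGBoost binary classifier and `X` a feature matrix (both hypothetical here), that also checks the additivity property from Question 14. Note that for XGBoost, TreeExplainer's values are expressed in the model's raw margin (log-odds) space by default:

```python
import numpy as np
import shap

explainer = shap.TreeExplainer(model)        # 'model': fitted XGBoost classifier (assumed)
shap_values = explainer.shap_values(X)       # (n_samples, n_features) array of contributions

# Additivity (Question 14): base value + sum of a row's SHAP values reproduces
# that row's prediction, here in raw log-odds space.
reconstructed = explainer.expected_value + shap_values.sum(axis=1)
# Applying the sigmoid to 'reconstructed' should recover the predicted probabilities.
```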
Question 17: B
A 7-day change feature captures the direction (positive or negative) and magnitude of recent movement. This "momentum" can be predictive because trends in polling often continue for a period before reversing.
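A minimal pandas sketch, assuming daily polling data with a `polling_margin` column (names and values are illustrative):

```python
import pandas as pd

# Hypothetical daily polling data for a single market.
polls = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=30, freq="D"),
    "polling_margin": [1.0 + 0.1 * i for i in range(30)],
}).set_index("date")

# Momentum: the 7-day change in polling margin -- direction and rate of recent movement.
polls["momentum_7d"] = polls["polling_margin"].diff(7)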
Question 18: C
Using the contract price at resolution time is a severe form of leakage — this price reflects the outcome itself and would not be available at prediction time. Options A, B, and D are all proper practices that help prevent leakage.
Question 19: B
Heavy oversampling to 50/50 without recalibration will produce a model whose predicted probabilities are centered around 50%, dramatically overestimating the probability of the rare class. If you oversample, you must recalibrate to restore the correct probability scale.
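One standard correction (a sketch, assuming resampling changes only the class prior and not the feature distributions within each class): if the model was trained at an artificial base rate $\pi_s$ but the true base rate is $\pi$, rescale the predicted odds,

$$
\frac{p_{\text{corr}}}{1 - p_{\text{corr}}}
= \frac{\hat{p}_s}{1 - \hat{p}_s}\cdot
\frac{\pi/(1-\pi)}{\pi_s/(1-\pi_s)} .
$$

For a 5% event trained at 50/50, the multiplier is $(0.05/0.95)/(0.5/0.5) \approx 0.053$, so a raw prediction of 0.50 maps back to roughly $0.053/1.053 \approx 0.05$.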
Question 20: B
ECE = $\sum_{b=1}^{B} \frac{n_b}{N} |\bar{p}_b - \bar{y}_b|$, the weighted average of absolute differences between mean predicted probability and observed frequency in each bin, weighted by the fraction of samples in each bin.
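A minimal numpy sketch of this formula, using equal-width bins (10 bins is a common but arbitrary choice):

```python
import numpy as np

def expected_calibration_error(p_pred, y_true, n_bins=10):
    """Weighted average of |mean predicted prob - observed frequency| across bins."""
    p_pred = np.asarray(p_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin; clip so p == 1.0 stays in the top bin.
    bin_ids = np.clip(np.digitize(p_pred, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # (n_b / N) * |mean predicted - observed frequency| for bin b
            ece += mask.mean() * abs(p_pred[mask].mean() - y_true[mask].mean())
    return ece

# Illustrative usage
p = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
y = np.array([1,   1,   0,   0,   0,   1])
print(expected_calibration_error(p, y, n_bins=5))
```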
Question 21: B
Concept drift means the statistical relationship between input features and outcomes changes over time. In prediction markets, this can happen due to structural changes (new media landscape, economic regime shifts, market maturation).
Question 22: B
Shadow mode (compare predictions without acting on them) then canary deployment (small allocation) then full deployment is the safest approach. This gives time to detect problems before committing significant capital.
Question 23: C
A Brier score increase from 0.18 to 0.26 is a substantial degradation (44% worse). The right response is to investigate the cause — concept drift, data quality issues, or feature distribution shifts — and consider retraining the model.
Question 24: B
BSS = 1 - Brier_model / Brier_baseline, where the baseline is the climatological (historical base rate) prediction. A positive BSS means the model adds value beyond simply predicting the base rate for all events.
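A small numpy sketch of the same formula; the predictions, outcomes, and base rate below are illustrative:

```python
import numpy as np

def brier_skill_score(p_model, y, base_rate):
    brier_model = np.mean((p_model - y) ** 2)
    brier_base  = np.mean((base_rate - y) ** 2)   # always predict the historical base rate
    return 1.0 - brier_model / brier_base

# Illustrative example: positive BSS -> the model beats the base-rate forecast.
y = np.array([1, 0, 0, 1, 0])
p = np.array([0.7, 0.2, 0.1, 0.6, 0.3])
print(brier_skill_score(p, y, base_rate=0.4))
```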
Question 25: B
Since both models predict on the same test instances, the losses are paired (each instance has a loss from model A and a loss from model B). A paired t-test on these per-instance loss differences is the appropriate statistical test, as it accounts for the correlation between the two models' predictions on the same data.
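A minimal scipy sketch, assuming arrays of per-instance Brier (squared-error) losses from the two models on the same test instances; the numbers here are illustrative:

```python
import numpy as np
from scipy import stats

# Hypothetical per-instance losses on the SAME test instances.
loss_xgb = np.array([0.04, 0.10, 0.30, 0.02, 0.16, 0.09, 0.01, 0.25])
loss_rf  = np.array([0.06, 0.12, 0.28, 0.05, 0.20, 0.10, 0.04, 0.27])

# Paired t-test on the per-instance differences: accounts for the fact that both
# models are evaluated on the same data, so their losses are correlated.
t_stat, p_value = stats.ttest_rel(loss_xgb, loss_rf)
```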