Chapter 23: Exercises

Conceptual Exercises

Exercise 1: ML vs. Statistical Models

Explain three specific scenarios in prediction markets where machine learning models would be expected to outperform logistic regression. For each scenario, identify which property of ML (nonlinearity, high dimensionality, interaction effects) provides the advantage, and which ML algorithm you would try first.

Exercise 2: Random Forest Probability Estimation

A random forest with 100 trees produces a predicted probability of 0.73 for a political event. Explain mechanically how this probability is computed from the individual trees. Why do random forest probabilities tend to be "shrunk" toward 0.5 compared to the true probability? What is the name of the calibration method you would use to correct this, and how does it work?

Exercise 3: Gradient Boosting Intuition

In gradient boosting for binary classification with log-loss, explain what the "gradient" refers to. If the current ensemble predicts a probability of 0.40 for an observation whose true label is 1, what is the pseudo-residual that the next tree will try to fit? Show the calculation.

Exercise 4: XGBoost Regularization

XGBoost's objective function includes the term $\Omega(h) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$. Explain what each component does. A prediction market modeler increases $\gamma$ from 0 to 2 and $\lambda$ from 1 to 5. Describe qualitatively how these changes would affect the learned trees and the model's behavior.
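For reference, both penalties map directly onto constructor arguments in XGBoost's scikit-learn wrapper; a minimal sketch using the values from the exercise:

```python
from xgboost import XGBClassifier

# gamma penalizes each additional leaf (the gamma * T term), so a split must
# "pay for itself" with at least gamma of loss reduction.
# reg_lambda is the L2 penalty on leaf weights (the (1/2) * lambda * sum(w_j^2) term).
baseline = XGBClassifier(gamma=0.0, reg_lambda=1.0)
regularized = XGBClassifier(gamma=2.0, reg_lambda=5.0)
# Expect the regularized model to grow fewer splits and output smaller
# leaf weights, i.e. more conservative probability updates per tree.
```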

Exercise 5: Neural Network Architecture

Design a neural network architecture for predicting binary outcomes in a prediction market with 25 input features and a training set of 3,000 samples. Specify the number of layers, neurons per layer, activation functions, regularization strategy, and output activation. Justify each choice given the dataset size.

Exercise 6: Loss Functions

Compare binary cross-entropy and Brier score as loss functions for training a neural network on prediction market data. Which one more heavily penalizes a prediction of 0.95 when the true outcome is 0? Compute the loss contribution for this case under both metrics.

Exercise 7: Temporal Splitting

A researcher studying prediction markets randomly shuffles 5 years of data and uses an 80/20 train/test split. Explain why this is problematic. Describe two alternative splitting strategies and explain when you would prefer each one.

Exercise 8: Calibration Methods

Compare Platt scaling, isotonic regression, and temperature scaling along these dimensions: number of parameters, flexibility, data requirements, and risk of overfitting. For a prediction market dataset with 300 calibration samples, which method would you recommend and why?

Exercise 9: SHAP Values

A model predicts that a candidate has a 78% chance of winning. The base rate (average prediction) is 52%. The SHAP values are: approval_rating = +0.12, gdp_growth = +0.08, polling_margin = +0.15, challenger_quality = -0.07, other features = -0.02. Verify that these SHAP values are consistent with the prediction (expressed in probability space, the values should sum to the difference between the predicted probability and the base rate). Explain what each SHAP value tells us about this prediction.
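A quick numerical check of the additivity property, using the values from the exercise:

```python
base_rate = 0.52
shap_values = {
    "approval_rating": +0.12,
    "gdp_growth": +0.08,
    "polling_margin": +0.15,
    "challenger_quality": -0.07,
    "other_features": -0.02,
}
# Additivity: the base rate plus the sum of SHAP values recovers the prediction.
print(base_rate + sum(shap_values.values()))  # -> 0.78
```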

Exercise 10: Feature Leakage

Identify the data leakage problem in each scenario: (a) A model predicting election outcomes uses the final vote count margin as a feature. (b) A model uses the prediction market contract price at resolution time as a feature. (c) A model uses polling data with timestamps but does not filter for polls conducted before the prediction date. (d) Feature standardization (z-scoring) is applied to the entire dataset before splitting into train and test.

Coding Exercises

Exercise 11: Random Forest Baseline

Using scikit-learn, build a random forest classifier on the simulated dataset from Section 23.2.3. Train with n_estimators=200, max_depth=6, and min_samples_leaf=15. Report the Brier score and log-loss on the test set. Plot the feature importances as a horizontal bar chart.
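A starter sketch, assuming `X_train`, `y_train`, `X_test`, `y_test`, and `feature_names` come from the Section 23.2.3 dataset:

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, log_loss

rf = RandomForestClassifier(n_estimators=200, max_depth=6,
                            min_samples_leaf=15, random_state=42)
rf.fit(X_train, y_train)
p = rf.predict_proba(X_test)[:, 1]  # probability of the positive class
print("Brier score:", brier_score_loss(y_test, p))
print("Log-loss:   ", log_loss(y_test, p))

plt.barh(feature_names, rf.feature_importances_)
plt.xlabel("Impurity-based importance")
plt.tight_layout()
plt.show()
```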

Exercise 12: XGBoost Pipeline

Build an XGBoost model for the same dataset. Use early stopping with 50 rounds of patience. Report the optimal number of boosting rounds, and compare Brier score and log-loss against the random forest from Exercise 11. Which model performs better and by how much?
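One way to set up early stopping is XGBoost's native training API; the sketch below assumes a separate validation split (`X_val`, `y_val`) held out from the training data:

```python
import xgboost as xgb
from sklearn.metrics import brier_score_loss, log_loss

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
dtest = xgb.DMatrix(X_test)

params = {"objective": "binary:logistic", "eval_metric": "logloss",
          "max_depth": 4, "eta": 0.05}  # illustrative values, not tuned
booster = xgb.train(params, dtrain, num_boost_round=2000,
                    evals=[(dval, "val")], early_stopping_rounds=50,
                    verbose_eval=False)
print("Optimal boosting rounds:", booster.best_iteration)

p = booster.predict(dtest, iteration_range=(0, booster.best_iteration + 1))
print("Brier score:", brier_score_loss(y_test, p))
print("Log-loss:   ", log_loss(y_test, p))
```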

Exercise 13: LightGBM Comparison

Build a LightGBM model with parameters comparable to the XGBoost model from Exercise 12. Compare training time, memory usage, and predictive performance. Under what circumstances would you prefer LightGBM over XGBoost?
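A sketch of the LightGBM side of the comparison, with hyperparameters chosen to roughly mirror the XGBoost settings above:

```python
import time
import lightgbm as lgb

train_set = lgb.Dataset(X_train, label=y_train)
val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)
params = {"objective": "binary", "metric": "binary_logloss",
          "learning_rate": 0.05, "num_leaves": 15}  # ~comparable to depth-4 trees

t0 = time.perf_counter()
model = lgb.train(params, train_set, num_boost_round=2000,
                  valid_sets=[val_set],
                  callbacks=[lgb.early_stopping(50, verbose=False)])
print(f"Training time: {time.perf_counter() - t0:.2f}s")
p = model.predict(X_test, num_iteration=model.best_iteration)
```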

Exercise 14: Neural Network Training

Implement the PredictionMarketNet architecture from Section 23.4.6. Train it with Adam optimizer (lr=0.001) and early stopping (patience=20). Plot the training and validation loss curves. Does the model show signs of overfitting? At what epoch does overfitting begin?
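A minimal training-loop skeleton; the `nn.Sequential` stack below is only a stand-in for the actual PredictionMarketNet definition in Section 23.4.6, and `X_train_t`, `y_train_t`, `X_val_t`, `y_val_t` are assumed to be float tensors:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(25, 64), nn.ReLU(), nn.Dropout(0.3),
                      nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCEWithLogitsLoss()

best_val, patience, wait = float("inf"), 20, 0
train_hist, val_hist = [], []
for epoch in range(500):  # full-batch training for brevity
    model.train()
    opt.zero_grad()
    loss = loss_fn(model(X_train_t).squeeze(1), y_train_t)
    loss.backward()
    opt.step()
    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val_t).squeeze(1), y_val_t).item()
    train_hist.append(loss.item())
    val_hist.append(val_loss)
    if val_loss < best_val:  # simple patience-based early stopping
        best_val, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:
            break
```

Plotting `train_hist` against `val_hist` shows where the two curves diverge, which is where overfitting begins.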

Exercise 15: Hyperparameter Tuning with Optuna

Use Optuna to tune the XGBoost model from Exercise 12. Search over: max_depth (3-10), learning_rate (0.01-0.3), subsample (0.5-1.0), colsample_bytree (0.5-1.0), min_child_weight (1-20), and reg_lambda (0.01-10). Run 50 trials. Report the best parameters and the improvement over default parameters.
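The Optuna objective can be a thin wrapper around model training; a sketch assuming the train/validation split from Exercise 12:

```python
import optuna
import xgboost as xgb
from sklearn.metrics import log_loss

def objective(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 20),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.01, 10, log=True),
    }
    model = xgb.XGBClassifier(n_estimators=500, **params)
    model.fit(X_train, y_train)
    return log_loss(y_val, model.predict_proba(X_val)[:, 1])

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
print(study.best_value)
```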

Exercise 16: Cross-Validation

Implement time-series cross-validation with 5 folds for the XGBoost model. Use the TimeSeriesSplit from scikit-learn with a gap of 10 observations. Report the mean and standard deviation of log-loss across folds. Compare this with the result from an (incorrect) standard 5-fold random CV. How much does the random CV overestimate performance?
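A sketch of the time-aware loop, assuming `X` and `y` are NumPy arrays in chronological order:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import log_loss
from xgboost import XGBClassifier

tscv = TimeSeriesSplit(n_splits=5, gap=10)  # gap guards against leakage at the boundary
scores = []
for train_idx, test_idx in tscv.split(X):
    model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
    model.fit(X[train_idx], y[train_idx])
    p = model.predict_proba(X[test_idx])[:, 1]
    scores.append(log_loss(y[test_idx], p))
print(f"log-loss: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
```

Swapping `tscv` for `KFold(n_splits=5, shuffle=True)` gives the (incorrect) random-CV comparison.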

Exercise 17: Reliability Diagram

Plot reliability diagrams for the random forest, XGBoost, and neural network models from previous exercises. Use 10 bins. Which model appears most calibrated? Which appears least calibrated? Describe the calibration bias of each model.
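`sklearn.calibration.calibration_curve` does the binning; a sketch assuming `p_rf`, `p_xgb`, and `p_nn` hold each model's test-set probabilities:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
for name, probs in {"Random forest": p_rf, "XGBoost": p_xgb, "Neural net": p_nn}.items():
    frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
    plt.plot(mean_pred, frac_pos, marker="o", label=name)
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed frequency")
plt.legend()
plt.show()
```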

Exercise 18: Platt Scaling

Apply Platt scaling to the XGBoost model's validation set predictions, then evaluate on the test set. Report the Brier score and log-loss before and after calibration. Plot reliability diagrams for both. Did calibration improve performance?
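Platt scaling is a one-feature logistic regression fit on held-out predictions; a sketch assuming `p_val` and `p_test` are the XGBoost model's validation and test probabilities:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def to_logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)  # avoid log(0) at the extremes
    return np.log(p / (1 - p))

platt = LogisticRegression()
platt.fit(to_logit(p_val).reshape(-1, 1), y_val)  # fit on the validation set only
p_test_cal = platt.predict_proba(to_logit(p_test).reshape(-1, 1))[:, 1]
```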

Exercise 19: Isotonic Regression Calibration

Apply isotonic regression calibration to the random forest model. Compare the results with Platt scaling applied to the same model. Which calibration method works better for the random forest? Why might this be the case?
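A sketch of the isotonic step, assuming `p_val_rf` and `p_test_rf` are the random forest's validation and test probabilities:

```python
from sklearn.isotonic import IsotonicRegression

iso = IsotonicRegression(out_of_bounds="clip")  # clip rather than extrapolate
iso.fit(p_val_rf, y_val)
p_test_iso = iso.predict(p_test_rf)
```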

Exercise 20: Temperature Scaling

For the neural network model, implement temperature scaling. Find the optimal temperature on the validation set by minimizing log-loss. Report the optimal temperature and interpret it (is the model overconfident or underconfident?). Plot reliability diagrams before and after temperature scaling.
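Temperature scaling divides the logits by a single scalar before the sigmoid; a sketch assuming `val_logits` holds the network's pre-sigmoid validation outputs as a NumPy array:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(T):
    p = 1.0 / (1.0 + np.exp(-val_logits / T))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.mean(y_val * np.log(p) + (1 - y_val) * np.log(1 - p))

result = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
T_opt = result.x
print("Optimal temperature:", T_opt)  # T > 1 suggests overconfidence, T < 1 underconfidence
```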

Exercise 21: SHAP Summary Plot

Compute SHAP values for the XGBoost model using shap.TreeExplainer. Create a summary beeswarm plot. Identify the three most important features. For each, describe the direction and shape of the relationship with the prediction.
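A sketch, assuming `xgb_model` is the fitted classifier from Exercise 12:

```python
import shap

explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test)  # one row of attributions per instance
shap.summary_plot(shap_values, X_test, feature_names=feature_names)
```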

Exercise 22: SHAP Force Plot

Select two test instances: one where the model predicts a high probability (>0.8) and one where it predicts a low probability (<0.2). Create SHAP force plots for both. Explain why the model made each prediction in terms of the feature contributions.
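A sketch for locating and plotting the two instances, reusing `explainer` and `shap_values` from Exercise 21 and assuming `X_test` is a NumPy array with at least one instance in each probability range:

```python
import numpy as np
import shap

p = xgb_model.predict_proba(X_test)[:, 1]
hi = int(np.where(p > 0.8)[0][0])  # first high-probability instance
lo = int(np.where(p < 0.2)[0][0])  # first low-probability instance
for idx in (hi, lo):
    shap.force_plot(explainer.expected_value, shap_values[idx],
                    X_test[idx], matplotlib=True)
```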

Exercise 23: SHAP Dependence Plot

Create SHAP dependence plots for the top 3 features from Exercise 21. For each plot, identify any nonlinear relationships or interaction effects. How do these findings compare with the true data-generating process?
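The top features can be ranked by mean absolute SHAP value rather than hard-coded; a sketch reusing `shap_values` from Exercise 21:

```python
import numpy as np
import shap

top3 = np.argsort(np.abs(shap_values).mean(axis=0))[::-1][:3]
for i in top3:
    shap.dependence_plot(int(i), shap_values, X_test,
                         feature_names=feature_names)
```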

Exercise 24: Feature Engineering

Starting with the 8 raw features, create at least 20 additional features using the techniques from Section 23.8 (lags, rolling statistics, changes, interactions). Train an XGBoost model on the expanded feature set and compare performance with the raw-feature model. Does feature engineering improve the Brier score? By how much?
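A sketch of the expansion loop, assuming `df` is a DataFrame of the 8 raw features in chronological order:

```python
import pandas as pd

engineered = df.copy()
for col in df.columns:
    engineered[f"{col}_lag1"] = df[col].shift(1)        # previous value
    engineered[f"{col}_chg1"] = df[col].diff()          # one-step change
    engineered[f"{col}_roll5_mean"] = df[col].rolling(5).mean()
    engineered[f"{col}_roll5_std"] = df[col].rolling(5).std()

c0, c1 = df.columns[:2]                                 # one example interaction
engineered[f"{c0}_x_{c1}"] = df[c0] * df[c1]
engineered = engineered.dropna()                        # drop warm-up rows
```

With 8 raw features this already yields 32 lag/change/rolling features plus interactions; remember to realign `y` after the `dropna()`.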

Exercise 25: Feature Selection

Using the expanded feature set from Exercise 24, apply three feature selection methods: (a) correlation filter (threshold 0.95), (b) SHAP-based selection (top 15 features), and (c) recursive feature elimination. Compare the number of features selected and the resulting model performance for each method.
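A sketch of the correlation filter, the most mechanical of the three methods; `engineered` is the feature DataFrame from Exercise 24:

```python
import numpy as np

corr = engineered.corr().abs()
# keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
reduced = engineered.drop(columns=to_drop)
```

For (c), `sklearn.feature_selection.RFE` wraps an estimator and eliminates features recursively.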

Exercise 26: Class Imbalance

Modify the data-generating process to create a 10% positive rate (rare events). Train XGBoost with and without scale_pos_weight. Compare calibration plots and Brier scores. Then apply Platt scaling to both models. Does rebalancing help or hurt after calibration?
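The usual heuristic sets `scale_pos_weight` to the negative/positive ratio of the training labels:

```python
from xgboost import XGBClassifier

neg, pos = (y_train == 0).sum(), (y_train == 1).sum()
weighted = XGBClassifier(scale_pos_weight=neg / pos)  # ~9 for a 10% positive rate
unweighted = XGBClassifier()
```

Note that reweighting deliberately distorts the predicted probabilities, which is why the post-calibration comparison is the interesting one.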

Exercise 27: Model Comparison Framework

Implement the compare_models function from Section 23.10.2. Run it on all four models (logistic regression, random forest, XGBoost, neural network). Rank the models by each metric. Do the rankings agree across metrics? Apply the paired t-test from Section 23.10.3 to determine which differences are statistically significant.
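For the significance test, per-fold losses from the cross-validation of Exercise 16 can be compared pairwise; a sketch with hypothetical arrays `losses_xgb` and `losses_rf` holding one log-loss per fold:

```python
from scipy import stats

t_stat, p_value = stats.ttest_rel(losses_xgb, losses_rf)  # paired across folds
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```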

Exercise 28: Full Pipeline

Build a complete pipeline that:

1. Generates features (raw + engineered)
2. Trains XGBoost with Optuna tuning (20 trials)
3. Calibrates with Platt scaling
4. Evaluates with Brier score, log-loss, and ECE
5. Generates SHAP analysis
6. Saves the model and metadata

Report all evaluation metrics and include the SHAP summary plot.

Exercise 29: Concept Drift Simulation

Create a dataset where the relationship between features and outcomes changes at the midpoint (concept drift). Train an XGBoost model on the first half and evaluate on the second half. Compare performance before and after the drift point. Then implement a sliding window approach and show that it adapts better to the drift.
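One simple way to inject drift is to flip the sign of the coefficients at the midpoint; a sketch of such a data-generating process:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 2000, 8
X = rng.normal(size=(n, k))
beta_pre = rng.normal(size=k)
beta_post = -beta_pre  # the relationship reverses at the midpoint

logits = np.concatenate([X[: n // 2] @ beta_pre,
                         X[n // 2 :] @ beta_post])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)
```

A model trained on the first half should degrade sharply on the second; a sliding window that retrains on, say, the most recent 500 observations recovers much faster.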

Exercise 30: Model Monitoring

Implement the ModelMonitor class from Section 23.11.3. Simulate a deployment scenario where:

1. The model performs well for the first 200 predictions.
2. A drift occurs at prediction 201 (the true relationship changes).
3. The monitoring system detects the drift.

Show the rolling Brier score over time and demonstrate that the drift_alert function correctly identifies the performance degradation.
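The ModelMonitor class itself is defined in Section 23.11.3; as a stand-in, the rolling Brier score and a simple threshold alert can be sketched directly, assuming `preds` and `outcomes` are sequential NumPy arrays from the simulation:

```python
import numpy as np
import pandas as pd

brier = (preds - outcomes) ** 2            # per-prediction Brier contributions
rolling = pd.Series(brier).rolling(window=50).mean()

baseline_mean = rolling.iloc[:200].mean()  # the well-behaved period
baseline_std = rolling.iloc[:200].std()
alert = rolling > baseline_mean + 2 * baseline_std  # flag sustained degradation
print("First alert at prediction:", int(alert.idxmax()))
```

With this threshold rule, the alert should fire shortly after prediction 201, once the degraded window dominates the rolling average.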