Chapter 22 Exercises
Section A: Linear Regression (Exercises 1-5)
Exercise 1: OLS from Scratch
Implement ordinary least squares linear regression from scratch (without using sklearn or statsmodels) using the normal equation $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$. Apply it to the following prediction market dataset: predict the final vote margin from polling average, GDP growth rate, and presidential approval rating. Compare your results to statsmodels.OLS.
Data:

| Polling Avg | GDP Growth | Approval | Actual Margin |
|---|---|---|---|
| 52.3 | 2.1 | 51 | 3.8 |
| 47.8 | 1.5 | 44 | -2.5 |
| 55.1 | 3.0 | 56 | 7.1 |
| 49.2 | 0.8 | 46 | -1.2 |
| 51.5 | 2.4 | 50 | 2.0 |
| 46.3 | -0.5 | 39 | -5.8 |
| 53.7 | 2.8 | 54 | 5.3 |
| 48.5 | 1.2 | 43 | -3.1 |
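A minimal starting-point sketch of the normal-equation solution in NumPy, with the table above entered as arrays (using `np.linalg.solve` rather than an explicit matrix inverse for numerical stability):

```python
import numpy as np

# Predictors from the table: polling average, GDP growth, approval rating
X_raw = np.array([
    [52.3, 2.1, 51],
    [47.8, 1.5, 44],
    [55.1, 3.0, 56],
    [49.2, 0.8, 46],
    [51.5, 2.4, 50],
    [46.3, -0.5, 39],
    [53.7, 2.8, 54],
    [48.5, 1.2, 43],
])
y = np.array([3.8, -2.5, 7.1, -1.2, 2.0, -5.8, 5.3, -3.1])

# Add an intercept column, then solve (X'X) beta = X'y
X = np.column_stack([np.ones(len(y)), X_raw])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("Intercept and coefficients:", beta_hat)
```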
Exercise 2: Assumption Checking
Using the dataset from Exercise 1, perform a comprehensive diagnostic analysis of the OLS model. Check for:
- Linearity (residuals vs. fitted values plot)
- Normality of residuals (QQ plot and Shapiro-Wilk test)
- Homoscedasticity (Breusch-Pagan test)
- Multicollinearity (VIF for each predictor)
Interpret each diagnostic result and discuss implications for prediction market modeling.
Exercise 3: Heteroscedasticity-Robust Inference
Fit the model from Exercise 1 using three different standard error estimators: (a) classical OLS standard errors, (b) HC3 heteroscedasticity-robust standard errors, and (c) Newey-West HAC standard errors with 3 lags. Compare the coefficient confidence intervals under each approach and explain when each is appropriate for prediction market data.
Exercise 4: Prediction Intervals
Using the model from Exercise 1, generate both 95% confidence intervals and 95% prediction intervals for a new election with polling average 50.5, GDP growth 1.8, and approval rating 48. Explain the difference between the two interval types and discuss which is more relevant for a prediction market trader deciding whether a spread contract is mispriced.
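For the comparison step, statsmodels can produce both interval types directly; a short sketch, assuming the Exercise 1 model is refit with `statsmodels.api.OLS` (column order as in the data table):

```python
import numpy as np
import statsmodels.api as sm

# Refit the Exercise 1 model (same data as the table in Exercise 1)
X = sm.add_constant(np.array([
    [52.3, 2.1, 51], [47.8, 1.5, 44], [55.1, 3.0, 56], [49.2, 0.8, 46],
    [51.5, 2.4, 50], [46.3, -0.5, 39], [53.7, 2.8, 54], [48.5, 1.2, 43],
]))
y = np.array([3.8, -2.5, 7.1, -1.2, 2.0, -5.8, 5.3, -3.1])
results = sm.OLS(y, X).fit()

# New election: polling 50.5, GDP growth 1.8, approval 48
x_new = sm.add_constant(np.array([[50.5, 1.8, 48]]), has_constant='add')
frame = results.get_prediction(x_new).summary_frame(alpha=0.05)
# mean_ci_*: 95% confidence interval for the mean margin
# obs_ci_*:  95% prediction interval for this single election
print(frame[['mean', 'mean_ci_lower', 'mean_ci_upper', 'obs_ci_lower', 'obs_ci_upper']])
```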
Exercise 5: R-squared Decomposition
For the model in Exercise 1, compute and compare $R^2$, adjusted $R^2$, and out-of-sample $R^2$ using leave-one-out cross-validation. Discuss why in-sample $R^2$ can be misleading in the context of prediction markets with small sample sizes.
Section B: Logistic Regression (Exercises 6-12)
Exercise 6: Logistic Regression by Hand
Consider a simple logistic regression model with one predictor: $$\log\left(\frac{p}{1-p}\right) = -3.2 + 0.08 \times \text{polling\_lead}$$
(a) Calculate the predicted probability of winning for polling leads of -10, -5, 0, 5, 10, 15, and 20.
(b) Plot the resulting sigmoid curve.
(c) Compute the odds ratio for the polling_lead coefficient and interpret it in plain language.
(d) At what polling lead does the model predict exactly 50% probability?
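A brief sketch of parts (a), (c), and (d), assuming the probabilities are computed with the inverse-logit (sigmoid) function:

```python
import numpy as np

# Model from the exercise: log-odds = -3.2 + 0.08 * polling_lead
leads = np.array([-10, -5, 0, 5, 10, 15, 20])
log_odds = -3.2 + 0.08 * leads
probs = 1 / (1 + np.exp(-log_odds))          # inverse-logit (sigmoid)

for lead, p in zip(leads, probs):
    print(f"lead {int(lead):+3d}: P(win) = {p:.3f}")

# (c) odds ratio per one-point increase in polling lead
print("odds ratio:", np.exp(0.08))
# (d) 50% probability occurs where the log-odds equal zero
print("break-even lead:", 3.2 / 0.08)
```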
Exercise 7: Maximum Likelihood Estimation
Implement the log-likelihood function for logistic regression and use scipy.optimize.minimize to find the MLE parameters. Use the following dataset:
| Polling Lead | Incumbency | Won (0/1) |
|---|---|---|
| 5.2 | 1 | 1 |
| -1.3 | 0 | 0 |
| 8.1 | 1 | 1 |
| -3.5 | 0 | 0 |
| 2.0 | 1 | 1 |
| -0.8 | 0 | 1 |
| 4.5 | 1 | 1 |
| -6.2 | 0 | 0 |
| 0.5 | 1 | 0 |
| 3.0 | 0 | 1 |
Compare your results to sklearn.linear_model.LogisticRegression with regularization effectively disabled (e.g., penalty=None, or a very large C such as C=1e10).
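A starting-point sketch of the negative log-likelihood and its optimization; note that with only ten observations the likelihood can be nearly flat or even unbounded if the classes happen to be linearly separable, so watch for extreme coefficients and compare against the sklearn fit:

```python
import numpy as np
from scipy.optimize import minimize

# Data from the table above: (polling lead, incumbency) -> won
X = np.array([[5.2, 1], [-1.3, 0], [8.1, 1], [-3.5, 0], [2.0, 1],
              [-0.8, 0], [4.5, 1], [-6.2, 0], [0.5, 1], [3.0, 0]])
y = np.array([1, 0, 1, 0, 1, 1, 1, 0, 0, 1])
X_design = np.column_stack([np.ones(len(y)), X])   # add intercept

def neg_log_likelihood(beta, X, y):
    """Negative Bernoulli log-likelihood for logistic regression."""
    z = X @ beta
    # log-likelihood = sum_i [ y_i * z_i - log(1 + e^{z_i}) ], written stably
    return -(y @ z - np.logaddexp(0, z).sum())

result = minimize(neg_log_likelihood, x0=np.zeros(X_design.shape[1]),
                  args=(X_design, y), method='BFGS')
print("MLE coefficients (intercept, polling_lead, incumbency):", result.x)
```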
Exercise 8: Interpreting Logistic Regression Output
Using the statsmodels output below, interpret each coefficient, compute odds ratios, and assess which features are statistically significant:
| | coef | std err | z | P>\|z\| |
|---|---|---|---|---|
| const | -8.421 | 2.315 | -3.639 | 0.000 |
| polling_lead | 0.142 | 0.038 | 3.737 | 0.000 |
| gdp_growth | 0.385 | 0.201 | 1.915 | 0.056 |
| incumbency | 1.203 | 0.512 | 2.350 | 0.019 |
| approval_rating | 0.054 | 0.031 | 1.742 | 0.082 |
For each significant predictor, explain what the odds ratio means in the context of election prediction markets.
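A small sketch for the odds-ratio calculations, using the coefficients and standard errors from the output above (the 95% intervals assume the usual $\pm 1.96$ standard-error rule on the log-odds scale):

```python
import numpy as np

# (coefficient, standard error) pairs from the regression output
coefs = {'polling_lead': (0.142, 0.038), 'gdp_growth': (0.385, 0.201),
         'incumbency': (1.203, 0.512), 'approval_rating': (0.054, 0.031)}

for name, (b, se) in coefs.items():
    or_point = np.exp(b)
    or_lo, or_hi = np.exp(b - 1.96 * se), np.exp(b + 1.96 * se)
    print(f"{name:16s} OR = {or_point:.3f}  95% CI ({or_lo:.3f}, {or_hi:.3f})")
```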
Exercise 9: Logistic Regression for Sports Markets
Build a logistic regression model to predict home team wins in a sports market using the following features: home team winning percentage, away team winning percentage, home team recent form (last 5 games), and rest days advantage. Generate synthetic data for 200 games, fit the model, and evaluate using log-loss and a calibration plot. Compare the model's probabilities to simulated market prices.
Exercise 10: Multiclass Extension
A prediction market has three outcomes: Party A wins, Party B wins, and No Majority. Extend the binary logistic regression framework to a multinomial logistic regression. Using scikit-learn's LogisticRegression with multi_class='multinomial', fit a model on synthetic data with features including polling data for each party, economic indicators, and regional factors. Discuss how to convert the three-class probabilities into trading signals for the three corresponding market contracts.
Exercise 11: Marginal Effects
For the logistic regression in Exercise 8, compute the marginal effect of each predictor at the mean. Recall that the marginal effect in logistic regression is: $$\frac{\partial p}{\partial x_j} = \beta_j \cdot p(1-p)$$
Then calculate the marginal effects at several points along the probability curve ($p = 0.2, 0.5, 0.8$). Discuss why marginal effects vary with $p$ and what this means for prediction market analysis.
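A minimal sketch of the calculation, reusing the Exercise 8 coefficients; evaluating at the mean additionally requires the fitted probability at the mean of the predictors, which is not shown in the Exercise 8 output, and for the binary incumbency indicator a discrete change in probability is often more meaningful than this derivative-based approximation:

```python
# Coefficients from Exercise 8
betas = {'polling_lead': 0.142, 'gdp_growth': 0.385,
         'incumbency': 1.203, 'approval_rating': 0.054}

for p in (0.2, 0.5, 0.8):
    scale = p * (1 - p)                      # derivative of the sigmoid at probability p
    effects = {k: b * scale for k, b in betas.items()}
    print(f"p = {p}: " + ", ".join(f"{k}={v:.4f}" for k, v in effects.items()))
```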
Exercise 12: Separation and Regularization
Create a dataset where logistic regression encounters "complete separation," in which a linear combination of features perfectly predicts the outcome. Demonstrate that:
(a) Unregularized logistic regression fails or produces extreme coefficients.
(b) L2-regularized logistic regression produces finite, reasonable coefficients.
(c) The choice of regularization strength $C$ affects the predicted probabilities.
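One way to construct a separated dataset and compare the fits, assuming a recent scikit-learn (`penalty=None`; older versions spell it `penalty='none'`):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(40, 1))
y = (x[:, 0] > 0).astype(int)     # the sign of x perfectly determines the outcome

fits = {
    'no penalty': LogisticRegression(penalty=None, max_iter=5000),
    'L2, C=1.0':  LogisticRegression(C=1.0),
    'L2, C=100':  LogisticRegression(C=100.0),
}
for label, model in fits.items():
    model.fit(x, y)
    print(f"{label:12s} coef = {model.coef_.ravel()[0]:10.2f}")
```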
Section C: Feature Engineering (Exercises 13-16)
Exercise 13: Polling Feature Pipeline
Using the PredictionMarketFeatureEngineer class from Section 22.4, create a complete feature pipeline for election prediction. Generate 365 days of synthetic data with:
- Daily polling averages (with noise and trend)
- Market prices (partially responsive to polls)
- Trading volume (with weekly seasonality)
- Days until the election
Apply the feature engineer and visualize the resulting features. Identify which features have the highest correlation with next-day price changes.
Exercise 14: Cross-Market Features
Design and implement features that capture cross-market information. Create synthetic data for three related prediction market contracts (e.g., presidential, Senate, and House elections). Engineer features such as:
- Price spread between related contracts
- Correlation over rolling windows
- Lead-lag relationships
- Joint probability implications
Evaluate which cross-market features improve a logistic regression model's out-of-sample performance.
Exercise 15: Time-Decay Features
Prediction market contracts behave differently depending on time to expiry. Create features that capture time-decay dynamics:
- Price volatility scaled by time to expiry
- Speed of price convergence
- Theoretical time value (similar to options time value)
- Urgency indicator (measuring how quickly the market should price in remaining uncertainty)
Test these features on synthetic data where the true probability gradually reveals itself over time.
Exercise 16: Feature Selection Competition
Create 50 candidate features for a prediction market model (some informative, some noise). Use three feature selection methods:
(a) Lasso (L1 regularization)
(b) Mutual information
(c) Recursive feature elimination with cross-validation
Compare which features each method selects. Evaluate the resulting models on held-out data using walk-forward validation. Discuss which method best identifies truly informative features.
Section D: Time Series and ARIMA (Exercises 17-22)
Exercise 17: Stationarity Analysis
Generate three synthetic prediction market price series:
(a) A random walk: $P_t = P_{t-1} + \epsilon_t$ (non-stationary)
(b) A mean-reverting process: $P_t = 0.5 + 0.9(P_{t-1} - 0.5) + \epsilon_t$ (stationary)
(c) A trending series with a structural break
For each, perform the ADF test, plot the ACF and PACF, and determine the differencing order needed for stationarity.
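A sketch of the data generation and ADF tests (the ACF/PACF plots can be added with `statsmodels.graphics.tsaplots.plot_acf` and `plot_pacf`); the noise scale and series length here are arbitrary choices:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(42)
n = 300
eps = rng.normal(scale=0.02, size=n)

# (a) random walk
rw = np.cumsum(eps) + 0.5
# (b) mean reversion around 0.5 with AR(1) coefficient 0.9
mr = np.empty(n); mr[0] = 0.5
for t in range(1, n):
    mr[t] = 0.5 + 0.9 * (mr[t - 1] - 0.5) + eps[t]
# (c) trend with a structural break halfway through
trend = 0.3 + 0.001 * np.arange(n) + eps
trend[n // 2:] += 0.15

for name, series in [('random walk', rw), ('mean-reverting', mr), ('trend+break', trend)]:
    stat, pvalue, *_ = adfuller(series)
    print(f"{name:15s} ADF stat = {stat:6.2f}, p-value = {pvalue:.3f}")
```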
Exercise 18: ARIMA Model Identification
Using the ACF/PACF patterns from Exercise 17(b), manually identify an appropriate ARIMA model order. Then use the automated grid search from Section 22.7 to find the best model by AIC. Compare the manual and automated selections.
Exercise 19: ARIMA Forecasting
Fit an ARIMA model to 200 days of simulated prediction market price data. Produce 1-step-ahead, 5-step-ahead, and 20-step-ahead forecasts. Plot the forecasts with confidence intervals. Evaluate each horizon's accuracy using RMSE and MAE. Discuss why forecast accuracy degrades with horizon length.
Exercise 20: Log-Odds Transformation for ARIMA
Prediction market prices are bounded between 0 and 1, but ARIMA models produce unbounded forecasts. Implement a pipeline that:
(a) Transforms prices to log-odds: $z_t = \log(P_t / (1 - P_t))$
(b) Fits an ARIMA model to the log-odds series
(c) Forecasts in log-odds space
(d) Transforms forecasts back to probability space: $\hat{P}_t = 1 / (1 + e^{-\hat{z}_t})$
Compare forecasts from this pipeline to direct ARIMA on raw prices. Which produces more reasonable forecasts near the boundaries?
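A compact sketch of the log-odds pipeline on placeholder data; the ARIMA order shown is only an example, not the order a grid search would necessarily select:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Placeholder price series, clipped strictly inside (0, 1)
rng = np.random.default_rng(1)
prices = np.clip(0.6 + np.cumsum(rng.normal(scale=0.01, size=200)), 0.02, 0.98)

# (a) transform to log-odds
z = np.log(prices / (1 - prices))
# (b)-(c) fit and forecast in log-odds space
model = ARIMA(z, order=(1, 1, 0)).fit()
z_forecast = model.forecast(steps=5)
# (d) map back to probability space with the inverse logit
p_forecast = 1 / (1 + np.exp(-z_forecast))
print(p_forecast)
```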
Exercise 21: Seasonal ARIMA
Daily trading volume in prediction markets often shows weekly seasonality (lower volume on weekends). Generate synthetic volume data with a weekly seasonal pattern and fit a SARIMA(p,d,q)(P,D,Q,7) model. Compare the seasonal and non-seasonal model forecasts.
Exercise 22: GARCH Volatility Modeling
Simulate prediction market price data with time-varying volatility (e.g., higher volatility after surprise news events). Fit GARCH(1,1), GARCH(2,1), and EGARCH(1,1) models. Compare:
(a) Which model best captures the conditional volatility?
(b) How do the volatility forecasts differ?
(c) How would you use the GARCH volatility forecast to adjust position sizes?
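A starting-point sketch using the `arch` package (assumed installed); the mapping of GARCH(2,1) onto the package's `p`/`q` arguments is one reasonable reading of that notation:

```python
import numpy as np
from arch import arch_model   # pip install arch

# Synthetic returns with a volatility regime shift after a "news event" at t = 250
rng = np.random.default_rng(7)
sigma = np.where(np.arange(500) < 250, 0.01, 0.03)
returns = rng.normal(scale=sigma) * 100        # percent scale works well with arch

specs = {
    'GARCH(1,1)':  dict(vol='GARCH', p=1, q=1),
    'GARCH(2,1)':  dict(vol='GARCH', p=2, q=1),
    'EGARCH(1,1)': dict(vol='EGARCH', p=1, q=1),
}
for name, spec in specs.items():
    res = arch_model(returns, **spec).fit(disp='off')
    print(f"{name:12s} AIC = {res.aic:8.1f}")
```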
Section E: Walk-Forward Validation and Model Comparison (Exercises 23-27)
Exercise 23: Walk-Forward Implementation
Implement walk-forward validation from scratch (without using the WalkForwardValidator class). For a logistic regression model on synthetic prediction market data:
(a) Implement expanding window validation
(b) Implement rolling window validation (window size = 100)
(c) Plot the out-of-sample predictions over time for both approaches
(d) Compare log-loss between the two approaches
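A from-scratch sketch covering parts (a), (b), and (d) on synthetic data; the burn-in length, window size, and data-generating process are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))
true_p = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
y = rng.binomial(1, true_p)

def walk_forward(X, y, start=100, window=None):
    """Refit at each step on past data only; window=None means an expanding window."""
    preds = np.full(len(y), np.nan)
    for t in range(start, len(y)):
        lo = 0 if window is None else max(0, t - window)
        model = LogisticRegression(max_iter=1000).fit(X[lo:t], y[lo:t])
        preds[t] = model.predict_proba(X[t:t + 1])[0, 1]
    return preds

expanding = walk_forward(X, y)
rolling = walk_forward(X, y, window=100)
mask = ~np.isnan(expanding)
print("expanding log-loss:", log_loss(y[mask], expanding[mask]))
print("rolling   log-loss:", log_loss(y[mask], rolling[mask]))
```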
Exercise 24: Lookahead Bias Detection
Create a model pipeline that intentionally contains lookahead bias (e.g., standardizing features using the full dataset before walk-forward splitting). Compare its performance to a correctly implemented pipeline where scaling is done within each fold. Quantify the performance overstatement caused by the bias.
Exercise 25: Nested Walk-Forward
Implement a nested walk-forward validation scheme where the outer loop evaluates model performance and the inner loop tunes hyperparameters:
- Outer loop: walk-forward with expanding window
- Inner loop: for each outer fold, tune the regularization parameter $C$ using walk-forward on the training portion only
Apply this to a logistic regression model on synthetic data with 20 features and 500 observations.
Exercise 26: Model Horse Race
Build and compare four models on the same synthetic prediction market dataset:
(a) Logistic regression (no regularization)
(b) Ridge logistic regression (optimal $C$ from nested CV)
(c) Lasso logistic regression (optimal $C$ from nested CV)
(d) ARIMA-based probability forecasts (using the log-odds transform)
Compare using walk-forward validated log-loss, Brier score, calibration RMSE, and Sharpe ratio (if converted to trading signals). Present results in a summary table.
Exercise 27: Ensemble Model
Combine the four models from Exercise 26 into an ensemble:
(a) Simple average of probabilities
(b) Weighted average (weights proportional to recent performance)
(c) Stacking: use a logistic regression to combine the four models' outputs
Evaluate the ensemble approaches using walk-forward validation and compare to the individual models.
Section F: Signal Generation and Integration (Exercises 28-30)
Exercise 28: Threshold Optimization
Using the optimize_threshold function from Section 22.11, find the optimal trading threshold for a model with known calibration. Generate 1000 synthetic prediction/market-price/outcome triples where the model has a slight edge (model probabilities are 3% closer to the true probability on average). Plot the Sharpe ratio as a function of threshold. Discuss how transaction costs affect the optimal threshold.
Exercise 29: Kelly Position Sizing Simulation
Implement a full simulation of a prediction market trading strategy that uses:
- A logistic regression model for probability estimation
- Walk-forward validation for out-of-sample predictions
- Kelly criterion for position sizing
- A transaction cost of $0.02 per trade
Simulate 500 trading opportunities and track the portfolio value over time. Compare full Kelly, half Kelly, and quarter Kelly position sizing. Plot the equity curves and compute the Sharpe ratio, maximum drawdown, and final portfolio value for each.
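A minimal simulation sketch; here the $0.02 cost is interpreted as a per-contract price markup (one possible reading), and the Kelly fraction for a binary contract bought at price $q$ with believed win probability $p$ is $(p - q)/(1 - q)$:

```python
import numpy as np

rng = np.random.default_rng(3)
n_trades, cost = 500, 0.02     # cost treated as a markup on the contract price

# Synthetic opportunities: the model is (on average) closer to the truth than the market
true_p = rng.uniform(0.1, 0.9, n_trades)
market = np.clip(true_p + rng.normal(scale=0.05, size=n_trades), 0.02, 0.98)
model_p = np.clip(true_p + rng.normal(scale=0.03, size=n_trades), 0.02, 0.98)
outcome = rng.binomial(1, true_p)

def simulate(kelly_mult):
    bankroll, equity = 1.0, []
    for p, q, won in zip(model_p, market, outcome):
        if p > q + cost:                       # buy YES at q + cost
            price, p_win, win = q + cost, p, bool(won)
        elif p < q - cost:                     # buy NO at (1 - q) + cost
            price, p_win, win = (1 - q) + cost, 1 - p, not won
        else:
            equity.append(bankroll); continue
        f = kelly_mult * (p_win - price) / (1 - price)   # Kelly fraction of bankroll
        stake = max(f, 0.0) * bankroll
        if win:
            bankroll += stake * (1 - price) / price
        else:
            bankroll -= stake
        equity.append(bankroll)
    return np.array(equity)

for mult, label in [(1.0, 'full'), (0.5, 'half'), (0.25, 'quarter')]:
    print(f"{label:7s} Kelly: final bankroll = {simulate(mult)[-1]:.2f}")
```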
Exercise 30: End-to-End Pipeline
Build a complete end-to-end pipeline that:
1. Generates synthetic prediction market data (prices, features, outcomes)
2. Engineers features using the pipeline from Section 22.4
3. Fits a regularized logistic regression model
4. Validates using walk-forward methodology
5. Generates trading signals with confidence-based sizing
6. Simulates trading performance including transaction costs
7. Produces a comprehensive performance report with:
   - Calibration plot
   - Equity curve
   - Summary statistics (Sharpe, drawdown, log-loss, Brier score)
   - Feature importance analysis
   - Comparison to a "follow the market" baseline
This exercise integrates all concepts from the chapter into a single working system. The solution should be modular and reusable.