Chapter 23: Key Takeaways

1. Machine Learning Adds Value When Statistical Models Hit Their Limits

ML methods (random forests, XGBoost, neural networks) tend to outperform logistic regression when the data has high dimensionality, nonlinear relationships, or complex feature interactions. However, with very small datasets (fewer than 100-200 samples), simpler models often win because they are less prone to overfitting.
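
As a quick illustration of that tradeoff, the sketch below compares cross-validated log-loss for logistic regression and a random forest on a small synthetic dataset; the data generator, sample size, and hyperparameters are illustrative only, not taken from the chapter.

```python
# Illustrative comparison: on a small dataset, a simple linear model can
# match or beat a more flexible one once overfitting is accounted for.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=150, n_features=20, n_informative=5,
                           random_state=0)

models = [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
]
for name, model in models:
    # neg_log_loss: closer to zero is better
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_log_loss")
    print(f"{name}: mean log-loss = {-scores.mean():.3f}")
```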

2. Tree-Based Methods Are the Workhorse for Tabular Prediction Market Data

XGBoost and LightGBM consistently perform well on structured/tabular data — the most common data format in prediction markets. They handle mixed feature types, missing values, and nonlinear relationships naturally, without requiring feature scaling or extensive preprocessing.
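
A minimal sketch of this workflow, assuming the xgboost package and its scikit-learn wrapper; the synthetic data, missing-value rate, and hyperparameters are placeholders.

```python
# Fit XGBoost on tabular data containing missing values, with no feature
# scaling: NaN entries are routed through each tree's default direction.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # label driven by a feature interaction
X[rng.random(X.shape) < 0.05] = np.nan       # inject missing values after labeling

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                      eval_metric="logloss")
model.fit(X, y)
print(model.predict_proba(X[:5])[:, 1])      # class-1 probability estimates
```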

3. Neural Networks Shine in Specific Scenarios

Neural networks are competitive with tree-based methods on tabular data but rarely dominate on moderate-sized datasets. They become advantageous when combining multiple data types (text + tabular), when datasets are very large (tens of thousands of samples), or when multi-output prediction is needed.

4. Raw ML Outputs Are Not Calibrated Probabilities

Most ML models produce probability estimates that are systematically biased. Random forests push probabilities toward 0.5. Gradient boosting can be overconfident or underconfident depending on regularization. Neural networks can exhibit overconfidence. Post-hoc calibration (Platt scaling, isotonic regression, or temperature scaling) is essential for prediction market applications.
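
One way to check for this bias on held-out data is sketched below, using scikit-learn's calibration_curve plus a simple binned ECE estimate; the model, dataset, and bin count are placeholders.

```python
# Inspect raw model probabilities with a reliability curve and a simple
# expected calibration error (ECE) estimate.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

probs = (RandomForestClassifier(random_state=0)
         .fit(X_train, y_train)
         .predict_proba(X_test)[:, 1])

# Observed frequency vs. mean predicted probability in each of 10 bins
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
print(np.round(mean_pred, 2))
print(np.round(frac_pos, 2))

# ECE: bin-weighted gap between average confidence and observed frequency
bins = np.digitize(probs, np.linspace(0, 1, 11)[1:-1])
ece = sum(abs(probs[bins == b].mean() - y_test[bins == b].mean()) * (bins == b).mean()
          for b in np.unique(bins))
print(f"ECE ~ {ece:.3f}")
```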

5. Platt Scaling Is the Safe Default for Calibration

With limited calibration data (fewer than 500 samples), Platt scaling (fitting a two-parameter logistic regression on model outputs) is the safest calibration method. Isotonic regression is more flexible but can overfit with small samples. Temperature scaling is the simplest option (one parameter) but can only fix global over/underconfidence.
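
A minimal sketch of Platt scaling, assuming a base model that exposes raw scores through decision_function and a held-out calibration set; the model and split sizes are placeholders.

```python
# Platt scaling: fit a one-feature logistic regression (slope + intercept)
# on held-out raw scores, then map new scores through the fitted sigmoid.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=400,
                                                  random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

raw_cal = model.decision_function(X_cal).reshape(-1, 1)   # raw log-odds scores
platt = LogisticRegression().fit(raw_cal, y_cal)          # two-parameter sigmoid

def calibrated_proba(X_new):
    """Calibrated class-1 probabilities for new data."""
    return platt.predict_proba(model.decision_function(X_new).reshape(-1, 1))[:, 1]

print(calibrated_proba(X_cal[:5]))
```

scikit-learn's CalibratedClassifierCV with method="sigmoid" packages the same idea behind a single estimator.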

6. SHAP Provides Theoretically Grounded Interpretability

SHAP (SHapley Additive exPlanations) values decompose each prediction into per-feature contributions. They satisfy desirable properties (additivity, consistency, local accuracy) and work with any model type. For tree-based models, TreeSHAP provides exact and fast computation.
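
A minimal sketch using the shap package's TreeExplainer on an XGBoost model; the data and hyperparameters are placeholders.

```python
# Per-prediction feature contributions from TreeSHAP.
import numpy as np
import shap
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)

model = XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)            # shape: (n_samples, n_features)

# Local accuracy: per-feature contributions plus the base value sum to the
# model's raw margin for each prediction.
print(shap_values[0], explainer.expected_value)

# A simple global importance ranking: mean absolute SHAP value per feature.
print(np.abs(shap_values).mean(axis=0))
# shap.summary_plot(shap_values, X)               # beeswarm plot (needs matplotlib)
```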

7. Temporal Splitting Is Non-Negotiable

Random train/test splits leak future information into training, creating an illusion of model performance that will not hold in live trading. Always split data by time: train on the past, validate on the near future, test on the far future.
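
A small sketch of a chronological three-way split, assuming a pandas DataFrame with a timestamp column; the column name and cut dates are hypothetical.

```python
# Split strictly by time: train on the past, validate on the near future,
# test on the far future.
import pandas as pd

def temporal_split(df, time_col, train_end, valid_end):
    df = df.sort_values(time_col)
    train = df[df[time_col] < train_end]
    valid = df[(df[time_col] >= train_end) & (df[time_col] < valid_end)]
    test = df[df[time_col] >= valid_end]
    return train, valid, test

# Hypothetical usage:
# train, valid, test = temporal_split(markets_df, "resolved_at",
#                                     pd.Timestamp("2023-01-01"),
#                                     pd.Timestamp("2023-07-01"))
```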

8. Hyperparameter Tuning Should Use Bayesian Optimization

Optuna and similar tools use probabilistic modeling to efficiently search the hyperparameter space, finding good configurations in fewer trials than grid or random search. Time-series cross-validation must be used within the tuning loop to avoid temporal leakage.
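
A compact sketch of this setup, assuming Optuna and XGBoost, with TimeSeriesSplit inside the objective so each fold validates on data that comes after its training window; the search ranges and trial count are placeholders.

```python
# Bayesian hyperparameter search with time-series cross-validation inside
# the objective to avoid temporal leakage.
import numpy as np
import optuna
from sklearn.metrics import log_loss
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=2000) > 0.5).astype(int)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 600),
        "max_depth": trial.suggest_int("max_depth", 2, 6),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    losses = []
    for train_idx, valid_idx in TimeSeriesSplit(n_splits=4).split(X):
        model = XGBClassifier(**params, eval_metric="logloss")
        model.fit(X[train_idx], y[train_idx])
        losses.append(log_loss(y[valid_idx], model.predict_proba(X[valid_idx])[:, 1]))
    return float(np.mean(losses))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```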

9. Feature Engineering Often Matters More Than Algorithm Choice

Momentum features, rolling statistics, interaction terms, and domain-specific transformations can unlock predictive signal that raw features obscure. However, more features also increase overfitting risk, so feature selection (via SHAP importance, correlation filtering, or wrapper methods) should follow feature generation.
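
A small sketch of the kinds of features meant here, assuming a pandas DataFrame with hypothetical market_id, timestamp, and price columns.

```python
# Momentum and rolling-statistic features computed per market, in time order.
import pandas as pd

def add_price_features(df):
    df = df.sort_values(["market_id", "timestamp"]).copy()
    g = df.groupby("market_id")["price"]
    df["momentum_5"] = g.diff(5)                                    # change over 5 observations
    df["rolling_mean_10"] = g.transform(lambda s: s.rolling(10).mean())
    df["rolling_std_10"] = g.transform(lambda s: s.rolling(10).std())
    df["dist_from_mean"] = df["price"] - df["rolling_mean_10"]      # deviation from recent average
    return df
```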

10. Prediction Markets Present Unique ML Challenges

Small datasets, class imbalance for rare events, concept drift, non-stationarity, and data leakage are all more acute in prediction markets than in typical ML applications. Defensive coding practices, rigorous validation, and continuous monitoring are essential.

11. Model Comparison Requires Statistical Testing

Observed differences in Brier score or log-loss between models may be due to chance. Paired t-tests or Diebold-Mariano tests on per-instance losses provide statistical rigor for model comparison decisions.
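
A minimal sketch of a paired t-test on per-instance Brier contributions, using scipy; the outcome and probability arrays are placeholders for real out-of-sample predictions. A Diebold-Mariano test follows the same idea but additionally accounts for autocorrelation in the loss differences.

```python
# Paired t-test on per-instance squared-error (Brier) contributions of two models.
import numpy as np
from scipy import stats

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])                      # placeholder outcomes
probs_a = np.array([0.8, 0.3, 0.7, 0.6, 0.2, 0.4, 0.9, 0.1, 0.5, 0.7])
probs_b = np.array([0.7, 0.4, 0.6, 0.5, 0.3, 0.5, 0.8, 0.2, 0.6, 0.6])

loss_a = (probs_a - y_true) ** 2
loss_b = (probs_b - y_true) ** 2

t_stat, p_value = stats.ttest_rel(loss_a, loss_b)
print(f"mean loss difference = {np.mean(loss_a - loss_b):+.4f}, p = {p_value:.3f}")
```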

12. Deployment Is Not the End — It Is the Beginning

A deployed model requires continuous monitoring (rolling Brier score, calibration checks, drift detection), systematic retraining triggered by evidence of degradation, and A/B testing before fully replacing existing models. The monitoring infrastructure is as important as the model itself.
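
A minimal sketch of one monitoring piece, a rolling Brier score with a degradation trigger; the window size and threshold are placeholders to be tuned per application.

```python
# Rolling Brier score over recently resolved predictions, with a simple
# retraining trigger when it drifts above a baseline by more than a threshold.
import numpy as np
import pandas as pd

def rolling_brier(y_true, probs, window=200):
    contrib = (np.asarray(probs, dtype=float) - np.asarray(y_true, dtype=float)) ** 2
    return pd.Series(contrib).rolling(window).mean()

def needs_retraining(y_true, probs, baseline_brier, threshold=0.02, window=200):
    recent = rolling_brier(y_true, probs, window).iloc[-1]
    return bool(recent - baseline_brier > threshold)
```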

Summary Decision Framework

| Question | Recommendation |
| --- | --- |
| Which algorithm first? | XGBoost with well-tuned hyperparameters |
| How to tune? | Optuna with time-series cross-validation |
| How to calibrate? | Platt scaling (small data) or isotonic regression (large data) |
| How to interpret? | SHAP values with summary and dependence plots |
| How to evaluate? | Brier score + log-loss + ECE + paired statistical tests |
| How to deploy? | Serialize model + calibrator + metadata; monitor continuously |
| When to retrain? | When monitoring detects Brier degradation > threshold |