In This Chapter
- 23.1 Why Machine Learning?
- 23.2 Decision Trees and Random Forests
- 23.3 Gradient Boosting: XGBoost and LightGBM
- 23.4 Neural Networks for Prediction Markets
- 23.5 Hyperparameter Tuning
- 23.6 Calibrating ML Probability Outputs
- 23.7 Model Interpretability with SHAP
- 23.8 Feature Engineering for ML Models
- 23.9 Handling Prediction Market Specific Challenges
- 23.10 Model Comparison and Selection
- 23.11 Deploying ML Models for Live Trading
- 23.12 Chapter Summary
- What's Next
Chapter 23: Machine Learning for Probability Estimation
"All models are wrong, but some are useful." — George Box
In previous chapters, we built probability estimates using logistic regression, historical base rates, and hand-crafted statistical models. Those tools are powerful, but they have limits. When the number of features grows large, when relationships among variables are deeply nonlinear, and when subtle interactions drive outcomes, classical statistical models can leave predictive power on the table. Machine learning (ML) methods are designed to capture exactly this kind of complexity.
This chapter is a practitioner's guide to applying modern ML techniques to prediction market probability estimation. We will build models using random forests, XGBoost, LightGBM, and neural networks. We will calibrate their probability outputs, interpret them with SHAP, and assemble a complete pipeline from raw features to deployed trading signals. Every concept is grounded in prediction market applications, and every code example uses realistic data structures.
By the end of this chapter, you will be able to:
- Select the right ML algorithm for a given prediction market problem.
- Build, tune, and evaluate tree-based and neural network models.
- Calibrate raw model outputs into reliable probabilities.
- Interpret model predictions using SHAP values.
- Engineer features specific to prediction market data.
- Deploy and monitor models in a live trading environment.
23.1 Why Machine Learning?
23.1.1 When ML Beats Statistical Models
Logistic regression is a remarkable tool. It is interpretable, fast to train, and produces well-calibrated probabilities when its linear-in-log-odds assumption approximately holds. For many prediction market problems, especially those with a handful of well-understood predictors, logistic regression is hard to beat.
Machine learning becomes valuable when one or more of the following conditions hold:
High dimensionality. When you have dozens or hundreds of candidate features — polling data, economic indicators, social media sentiment, historical contract prices, weather data, player statistics — a logistic regression model either requires careful manual feature selection or risks overfitting with too many terms. Tree-based models and neural networks can handle high-dimensional inputs more gracefully.
Nonlinear relationships. Logistic regression models a linear relationship between features and the log-odds of the outcome. If the true relationship is nonlinear — for example, if a candidate's probability of winning is flat below 40% polling average and then rises sharply — logistic regression will miss this pattern unless you manually engineer the right transformation. ML models discover nonlinearities automatically.
Complex interactions. Suppose that the effect of unemployment on an incumbent's re-election probability depends on whether the economy is growing or shrinking, which in turn depends on the incumbent's approval rating. Such three-way interactions are combinatorially expensive to specify by hand but are naturally captured by decision trees.
Heterogeneous data. Prediction markets span elections, sports, financial events, weather, and entertainment. Each domain has idiosyncratic features. ML models can learn domain-specific patterns without requiring the modeler to specify a parametric form in advance.
23.1.2 The ML Workflow for Prediction Markets
The standard ML workflow adapts to prediction markets as follows:
1. Define the prediction task
- Binary outcome: Will event X occur? (Yes/No)
- Target variable: y ∈ {0, 1}
- Goal: estimate P(y = 1 | features)
2. Collect and prepare data
- Historical outcomes with associated features
- Market prices at various time points
- External data (polls, statistics, economic indicators)
3. Engineer features
- Raw features, transformations, lags, interactions
- Domain-specific feature construction
4. Split data
- Temporal split (not random!) to respect time ordering
- Training / validation / test sets
5. Train models
- Multiple algorithms: RF, XGBoost, neural net
- Hyperparameter tuning via cross-validation
6. Calibrate probabilities
- Platt scaling, isotonic regression, or temperature scaling
- Evaluated on held-out calibration set
7. Evaluate and compare
- Log-loss, Brier score, calibration error
- Statistical significance tests
8. Interpret
- SHAP values, feature importance
- Sanity checks on learned patterns
9. Deploy
- Model serialization, prediction serving
- Monitoring and retraining schedule
A critical difference from textbook ML is step 4: temporal splitting. In prediction markets, the future is what matters. If you randomly shuffle data and split, you will leak future information into training. Always split by time: train on the past, validate on the near future, test on the far future.
23.1.3 A Word of Caution
Machine learning is not a magic wand. In prediction markets:
- Data is scarce. A presidential election happens every four years. You cannot train a deep neural network on four data points.
- Markets are adaptive. As more participants use ML, the edge from ML shrinks. This is unlike image classification, where the underlying patterns do not change because someone else also classified images.
- Overfitting is the primary enemy. With small datasets and many features, ML models can memorize noise rather than learn signal.
The discipline of this chapter is to use ML where it genuinely helps, calibrate its outputs carefully, interpret its decisions, and monitor its performance rigorously.
23.2 Decision Trees and Random Forests
23.2.1 Tree-Based Intuition
A decision tree recursively partitions the feature space into regions, making predictions based on the majority class (or average outcome) in each region. At each split, the tree selects the feature and threshold that best separates the two classes according to a criterion like Gini impurity or entropy.
For a binary classification task with classes 0 and 1, the Gini impurity of a node is:
$$ G = 1 - p_1^2 - p_0^2 = 2 p_1 (1 - p_1) $$
where $p_1$ is the fraction of class-1 samples in the node. A pure node (all one class) has $G = 0$. The tree algorithm picks the split that maximally reduces the weighted average Gini impurity across the two child nodes.
Example intuition for prediction markets. Consider predicting whether an incumbent wins re-election. A decision tree might first split on approval rating (above or below 50%), then split the high-approval group on GDP growth, and split the low-approval group on challenger quality. This produces interpretable decision rules, but a single tree is prone to overfitting.
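To make the split criterion concrete, here is a minimal sketch (with illustrative toy numbers, not drawn from any real election data) that computes the Gini impurity of a node and the impurity reduction of a candidate split, which is the quantity a tree greedily maximizes.
import numpy as np

def gini(y):
    """Gini impurity of a set of binary labels."""
    if len(y) == 0:
        return 0.0
    p1 = np.mean(y)
    return 2 * p1 * (1 - p1)

def split_gain(x, y, threshold):
    """Reduction in weighted Gini impurity from splitting feature x at threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    w_left, w_right = len(left) / len(y), len(right) / len(y)
    return gini(y) - (w_left * gini(left) + w_right * gini(right))

# Toy example: split incumbents by approval rating at 50%
approval = np.array([38, 42, 47, 51, 55, 61, 44, 58])
won = np.array([0, 0, 0, 1, 1, 1, 0, 1])
print(split_gain(approval, won, threshold=50))  # 0.5: the split separates the classes perfectly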
23.2.2 Random Forests for Probability Estimation
A random forest (RF) addresses overfitting by training many trees, each on a bootstrapped sample of the data, and each considering only a random subset of features at each split. The predicted probability is the average of the individual tree predictions:
$$ \hat{p}_{\text{RF}}(x) = \frac{1}{T} \sum_{t=1}^{T} \hat{p}_t(x) $$
where $T$ is the number of trees and $\hat{p}_t(x)$ is the probability estimate from tree $t$.
Key properties for prediction market applications:
- Variance reduction. By averaging many decorrelated trees, RF reduces the variance of predictions without substantially increasing bias.
- Natural probability estimates. The fraction of trees voting for class 1 is a probability estimate. While not perfectly calibrated, RF probabilities tend to be closer to calibrated than those from single trees or some boosting methods.
- Feature importance. RF provides a natural measure of feature importance based on the average decrease in impurity across all splits using that feature.
- Robustness to outliers. Tree-based methods are invariant to monotonic transformations of features, so outliers in polling data or economic indicators do not distort predictions.
23.2.3 Python Implementation with scikit-learn
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt
# Simulated prediction market dataset
np.random.seed(42)
n_samples = 2000
# Features relevant to political prediction markets
data = pd.DataFrame({
'approval_rating': np.random.normal(48, 10, n_samples),
'gdp_growth': np.random.normal(2.5, 1.5, n_samples),
'unemployment': np.random.normal(5.5, 1.5, n_samples),
'challenger_quality': np.random.uniform(0, 10, n_samples),
'months_to_election': np.random.randint(1, 24, n_samples),
'polling_margin': np.random.normal(0, 5, n_samples),
'fundraising_ratio': np.random.lognormal(0, 0.5, n_samples),
'sentiment_score': np.random.normal(0, 1, n_samples),
})
# Nonlinear outcome generation
logit = (
0.05 * data['approval_rating']
+ 0.3 * data['gdp_growth']
- 0.2 * data['unemployment']
- 0.1 * data['challenger_quality']
+ 0.15 * data['polling_margin']
+ 0.1 * np.log(data['fundraising_ratio'] + 0.01)
+ 0.05 * data['approval_rating'] * data['gdp_growth'] / 10
- 3.0
)
prob = 1 / (1 + np.exp(-logit))
data['outcome'] = np.random.binomial(1, prob)
# Temporal split
train = data.iloc[:1400]
val = data.iloc[1400:1700]
test = data.iloc[1700:]
feature_cols = [c for c in data.columns if c != 'outcome']
X_train, y_train = train[feature_cols], train['outcome']
X_val, y_val = val[feature_cols], val['outcome']
X_test, y_test = test[feature_cols], test['outcome']
# Train random forest
rf = RandomForestClassifier(
n_estimators=500,
max_depth=8,
min_samples_leaf=20,
max_features='sqrt',
random_state=42,
n_jobs=-1
)
rf.fit(X_train, y_train)
# Predict probabilities
rf_probs_val = rf.predict_proba(X_val)[:, 1]
rf_probs_test = rf.predict_proba(X_test)[:, 1]
# Evaluate
print(f"Validation Brier Score: {brier_score_loss(y_val, rf_probs_val):.4f}")
print(f"Validation Log Loss: {log_loss(y_val, rf_probs_val):.4f}")
# Feature importance
importances = pd.Series(
rf.feature_importances_, index=feature_cols
).sort_values(ascending=False)
print("\nFeature Importances:")
print(importances.to_string())
23.2.4 Handling Binary Outcomes
Prediction markets inherently produce binary outcomes (event happens or does not). When training tree-based models:
- Use predict_proba, not predict. The hard classification threshold of 0.5 destroys probability information. Always extract the probability column.
- Set min_samples_leaf generously. With small datasets, each leaf should contain enough samples to provide a meaningful probability estimate. A leaf with 2 samples yielding 50% is far less reliable than a leaf with 50 samples yielding 50%.
- Limit tree depth. Deep trees overfit, especially with noisy prediction market data.
- Consider class weights. If the outcome is rare (e.g., predicting a market crash), use class_weight='balanced' or set custom weights.
23.3 Gradient Boosting: XGBoost and LightGBM
23.3.1 Boosting Intuition
While random forests build trees independently and average their predictions, gradient boosting builds trees sequentially. Each new tree focuses on correcting the errors of the ensemble built so far. Formally, at step $m$, the model is:
$$ F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x) $$
where $F_{m-1}$ is the current ensemble, $h_m$ is a new tree fitted to the negative gradient of the loss function (the "residuals" in a generalized sense), and $\eta$ is the learning rate (shrinkage parameter).
For binary classification with log-loss, the negative gradient for observation $i$ is:
$$ -\frac{\partial L}{\partial F(x_i)} = y_i - \sigma(F(x_i)) $$
where $\sigma$ is the sigmoid function. This means each new tree is fitted to the gap between the true labels and the current predicted probabilities.
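To make the boosting recursion concrete, the following simplified sketch fits each new tree to the pseudo-residuals y − σ(F) and adds its shrunken prediction to the ensemble. Real implementations also optimize leaf values with second-order information, so treat this as an illustration on toy data rather than a faithful re-implementation of XGBoost.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def boosting_step(F, X, y, learning_rate=0.1, max_depth=3):
    """One gradient-boosting step for log-loss: fit a tree to y - sigmoid(F)."""
    residuals = y - sigmoid(F)              # negative gradient of the log-loss
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X, residuals)
    return F + learning_rate * tree.predict(X)

# Toy usage: start from the base-rate log-odds and take a few boosting steps
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(float)
F = np.full(len(y_toy), np.log(y_toy.mean() / (1 - y_toy.mean())))
for _ in range(20):
    F = boosting_step(F, X_toy, y_toy)
p = sigmoid(F)
print(f"Training log-loss after 20 rounds: "
      f"{-np.mean(y_toy * np.log(p) + (1 - y_toy) * np.log(1 - p)):.3f}")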
23.3.2 XGBoost for Probability Estimation
XGBoost (eXtreme Gradient Boosting) adds several innovations:
- Regularization. L1 and L2 penalties on leaf weights prevent overfitting.
- Second-order approximation. Uses both gradient and Hessian information for more accurate tree construction.
- Column subsampling. Like random forests, XGBoost can randomly sample features at each tree or split.
- Handling missing values. XGBoost learns the optimal direction for missing values at each split.
The objective function at step $m$ is:
$$ \mathcal{L}^{(m)} = \sum_{i=1}^{n} \left[ g_i h_m(x_i) + \frac{1}{2} H_i h_m(x_i)^2 \right] + \Omega(h_m) $$
where $g_i$ and $H_i$ are the first and second derivatives of the loss with respect to $F(x_i)$, and $\Omega$ is a regularization term penalizing tree complexity:
$$ \Omega(h) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 $$
Here $T$ is the number of leaves, $w_j$ is the weight of leaf $j$, $\gamma$ controls the minimum loss reduction for a split, and $\lambda$ is the L2 regularization strength.
23.3.3 LightGBM Differences
LightGBM, developed by Microsoft, offers several computational improvements:
- Gradient-based One-Side Sampling (GOSS). Keeps all instances with large gradients and randomly samples from instances with small gradients, speeding up training.
- Exclusive Feature Bundling (EFB). Bundles mutually exclusive features to reduce dimensionality.
- Leaf-wise tree growth. Unlike XGBoost's default level-wise growth, LightGBM grows trees leaf-wise, choosing the leaf with the maximum delta loss. This can lead to deeper, more accurate trees but increases overfitting risk.
- Histogram-based splitting. Bins continuous features into discrete histograms for faster split finding.
For prediction market applications, the practical differences are:
| Aspect | XGBoost | LightGBM |
|---|---|---|
| Training speed | Slower on large datasets | Faster, especially with many features |
| Memory usage | Higher | Lower |
| Overfitting risk | Lower (level-wise) | Higher (leaf-wise) |
| Small datasets | Often better | May overfit more |
| API maturity | More established | Rapidly improving |
23.3.4 Handling Imbalanced Data
Many prediction market outcomes are imbalanced. A market asking "Will there be a Category 5 hurricane this season?" might have a base rate of 15%. Strategies include:
- The scale_pos_weight parameter. In XGBoost, set scale_pos_weight = count(negative) / count(positive) to rebalance the loss function (see the sketch after this list).
- Custom sample weights. Assign higher weights to the minority class.
- Stratified sampling. Ensure each boosting iteration sees a balanced sample.
- Focal loss. Down-weight easy examples and focus on hard ones. This is especially useful when some outcomes are obvious (99% probability markets) and others are genuinely uncertain.
- Threshold optimization. Adjust the classification threshold post-training. However, for probability estimation (our primary goal), threshold optimization is less relevant than calibration.
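As a small illustration of the first strategy, the snippet below computes scale_pos_weight from the training labels defined earlier in the chapter and places it in an XGBoost parameter dictionary. The value simply reflects the simulated base rate and is purely illustrative; remember that reweighting shifts the predicted probabilities, so recalibrate afterwards.
# Illustrative only: derive scale_pos_weight from the training labels and
# merge it into the parameter dictionary used in the next section.
n_pos = int(y_train.sum())
n_neg = int(len(y_train) - n_pos)
imbalance_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'scale_pos_weight': n_neg / n_pos,  # upweights the (minority) positive class
}
print(f"scale_pos_weight = {n_neg / n_pos:.2f}")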
23.3.5 Python XGBoost Pipeline for Prediction Markets
import xgboost as xgb
from sklearn.metrics import brier_score_loss, log_loss
# Prepare DMatrix objects for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_cols)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=feature_cols)
dtest = xgb.DMatrix(X_test, label=y_test, feature_names=feature_cols)
# XGBoost parameters tuned for probability estimation
params = {
'objective': 'binary:logistic',
'eval_metric': 'logloss',
'max_depth': 5,
'learning_rate': 0.05,
'subsample': 0.8,
'colsample_bytree': 0.8,
'min_child_weight': 10,
'reg_alpha': 0.1, # L1 regularization
'reg_lambda': 1.0, # L2 regularization
'gamma': 0.1, # Minimum loss reduction for split
'seed': 42,
}
# Train with early stopping
model = xgb.train(
params,
dtrain,
num_boost_round=1000,
evals=[(dtrain, 'train'), (dval, 'val')],
early_stopping_rounds=50,
verbose_eval=100
)
# Predict probabilities
xgb_probs_val = model.predict(dval)
xgb_probs_test = model.predict(dtest)
print(f"\nXGBoost Validation Brier Score: {brier_score_loss(y_val, xgb_probs_val):.4f}")
print(f"XGBoost Validation Log Loss: {log_loss(y_val, xgb_probs_val):.4f}")
# Feature importance (gain-based)
importance = model.get_score(importance_type='gain')
importance_df = pd.DataFrame.from_dict(
importance, orient='index', columns=['gain']
).sort_values('gain', ascending=False)
print("\nXGBoost Feature Importance (Gain):")
print(importance_df.to_string())
23.3.6 LightGBM Example
import lightgbm as lgb
lgb_train = lgb.Dataset(X_train, label=y_train)
lgb_val = lgb.Dataset(X_val, label=y_val, reference=lgb_train)
lgb_params = {
'objective': 'binary',
'metric': 'binary_logloss',
'num_leaves': 31,
'learning_rate': 0.05,
'feature_fraction': 0.8,
'bagging_fraction': 0.8,
'bagging_freq': 5,
'min_child_samples': 20,
'lambda_l1': 0.1,
'lambda_l2': 1.0,
'verbose': -1,
'seed': 42,
}
lgb_model = lgb.train(
lgb_params,
lgb_train,
num_boost_round=1000,
valid_sets=[lgb_val],
callbacks=[
lgb.early_stopping(stopping_rounds=50),
lgb.log_evaluation(period=100)
]
)
lgb_probs_val = lgb_model.predict(X_val)
print(f"\nLightGBM Validation Brier Score: {brier_score_loss(y_val, lgb_probs_val):.4f}")
print(f"LightGBM Validation Log Loss: {log_loss(y_val, lgb_probs_val):.4f}")
23.4 Neural Networks for Prediction Markets
23.4.1 Feedforward Networks for Tabular Data
Neural networks have revolutionized image recognition, natural language processing, and game playing. For tabular data — the kind most common in prediction markets — they are competitive but not always dominant. Recent research (e.g., the "Tabular Data: Deep Learning is Not All You Need" paper by Shwartz-Ziv and Armon, 2022) shows that well-tuned gradient boosting often matches or beats neural networks on tabular tasks.
That said, neural networks offer unique advantages:
- End-to-end learning. Feature engineering can be partially automated.
- Embedding categorical variables. Entity embeddings can capture relationships between categories (e.g., teams, candidates, markets).
- Multimodal input. Neural nets can combine tabular data with text (news articles) or images (satellite data).
- Flexible output. Multi-output prediction (predicting multiple related markets simultaneously) is natural.
23.4.2 Architecture Choices
For a binary prediction market outcome, a typical feedforward architecture is:
Input (d features)
→ Dense(128) → BatchNorm → ReLU → Dropout(0.3)
→ Dense(64) → BatchNorm → ReLU → Dropout(0.3)
→ Dense(32) → BatchNorm → ReLU → Dropout(0.2)
→ Dense(1) → Sigmoid
→ Output: P(event = 1)
Key design decisions:
Width vs. depth. For tabular data with 10-100 features, 2-4 hidden layers with 32-256 neurons each is a good starting range. Deeper networks risk overfitting on small prediction market datasets.
Batch normalization. Normalizes activations within each mini-batch, stabilizing training and allowing higher learning rates.
Dropout. Randomly zeroes a fraction of neurons during training, providing regularization. Rates of 0.2-0.5 are typical; use higher dropout for smaller datasets.
Residual connections. For deeper networks, skip connections (adding the input of a block to its output) can help gradient flow.
23.4.3 Activation Functions
- ReLU ($\max(0, x)$): Standard choice for hidden layers. Simple, effective, but can suffer from "dead neurons" (neurons that always output zero).
- LeakyReLU ($\max(\alpha x, x)$ with small $\alpha$): Fixes the dead neuron problem.
- GELU (used in transformers): Smooth approximation of ReLU, sometimes performs better.
- Sigmoid ($\sigma(x) = 1/(1+e^{-x})$): Used for the output layer to produce values in $[0, 1]$, interpretable as probabilities.
23.4.4 Loss Functions for Probability Output
The standard loss for binary probability estimation is binary cross-entropy (equivalent to log-loss):
$$ \mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right] $$
This directly incentivizes the model to produce well-calibrated probabilities. The gradient with respect to the pre-sigmoid output $z_i$ is:
$$ \frac{\partial \mathcal{L}}{\partial z_i} = \hat{p}_i - y_i $$
which is elegantly simple: the gradient pushes the prediction toward the true label.
Alternatives for prediction markets:
- Brier score loss: $\frac{1}{N} \sum (y_i - \hat{p}_i)^2$. Sometimes used for calibration-focused training.
- Focal loss: $-\alpha_t (1-\hat{p}_t)^\gamma \log(\hat{p}_t)$. Down-weights easy examples, focusing on hard-to-classify events. Useful for imbalanced prediction market outcomes.
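Since the network in Section 23.4.6 outputs probabilities through a final sigmoid, a focal loss can be written directly on those probabilities. The sketch below follows the standard formulation; the alpha and gamma defaults are common choices, not values tuned for this dataset, and the class can be swapped in for nn.BCELoss in the training loop.
import torch
import torch.nn as nn

class BinaryFocalLoss(nn.Module):
    """Focal loss for binary probability outputs (a sketch of the standard formulation)."""
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, p, y):
        p = torch.clamp(p, 1e-7, 1 - 1e-7)
        # p_t is the predicted probability of the true class
        p_t = torch.where(y == 1, p, 1 - p)
        alpha_t = torch.where(y == 1, torch.full_like(p, self.alpha),
                              torch.full_like(p, 1 - self.alpha))
        return (-alpha_t * (1 - p_t) ** self.gamma * torch.log(p_t)).mean()

# Usage: drop-in replacement for nn.BCELoss() in the training loop of Section 23.4.6
# criterion = BinaryFocalLoss(alpha=0.25, gamma=2.0)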
23.4.5 Training Tips
- Learning rate scheduling. Start with a moderate learning rate (e.g., 0.001) and reduce it during training. Cosine annealing and reduce-on-plateau are both effective.
- Weight decay. L2 regularization on parameters prevents overfitting. Values of 1e-4 to 1e-2 are typical.
- Early stopping. Monitor validation loss and stop training when it starts increasing.
- Data normalization. Standardize all input features to zero mean and unit variance. Neural networks are sensitive to feature scales, unlike tree-based methods.
- Batch size. Smaller batches (32-128) provide implicit regularization through gradient noise.
23.4.6 Python Implementation with PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.preprocessing import StandardScaler
# Prepare data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
# Convert to tensors
X_train_t = torch.FloatTensor(X_train_scaled)
y_train_t = torch.FloatTensor(y_train.values).unsqueeze(1)
X_val_t = torch.FloatTensor(X_val_scaled)
y_val_t = torch.FloatTensor(y_val.values).unsqueeze(1)
train_dataset = TensorDataset(X_train_t, y_train_t)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
# Define model
class PredictionMarketNet(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.network(x)
model = PredictionMarketNet(X_train_scaled.shape[1])
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
optimizer, mode='min', factor=0.5, patience=10
)
# Training loop with early stopping
best_val_loss = float('inf')
patience_counter = 0
patience = 20
for epoch in range(200):
    model.train()
    train_loss = 0.0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        output = model(X_batch)
        loss = criterion(output, y_batch)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * len(X_batch)
    train_loss /= len(train_dataset)
    # Validation
    model.eval()
    with torch.no_grad():
        val_output = model(X_val_t)
        val_loss = criterion(val_output, y_val_t).item()
    scheduler.step(val_loss)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pt')
        patience_counter = 0
    else:
        patience_counter += 1
    if patience_counter >= patience:
        print(f"Early stopping at epoch {epoch}")
        break
    if epoch % 20 == 0:
        print(f"Epoch {epoch}: Train Loss={train_loss:.4f}, Val Loss={val_loss:.4f}")
# Load best model and predict
model.load_state_dict(torch.load('best_model.pt'))
model.eval()
with torch.no_grad():
    nn_probs_val = model(X_val_t).numpy().flatten()
    X_test_t = torch.FloatTensor(X_test_scaled)
    nn_probs_test = model(X_test_t).numpy().flatten()
print(f"\nNeural Net Validation Brier Score: {brier_score_loss(y_val, nn_probs_val):.4f}")
print(f"Neural Net Validation Log Loss: {log_loss(y_val, nn_probs_val):.4f}")
23.5 Hyperparameter Tuning
23.5.1 The Importance of Tuning
The performance gap between a default-parameter model and a well-tuned model can be substantial. In prediction market applications, where edges are thin, this gap can be the difference between profitable and unprofitable trading.
Key hyperparameters by model type:
Random Forest:
- n_estimators: number of trees (100-1000)
- max_depth: maximum tree depth (3-15)
- min_samples_leaf: minimum samples per leaf (5-50)
- max_features: features per split ('sqrt', 'log2', or fraction)
XGBoost:
- learning_rate: shrinkage (0.01-0.3)
- max_depth: tree depth (3-10)
- n_estimators: boosting rounds (100-2000, with early stopping)
- subsample: row sampling (0.5-1.0)
- colsample_bytree: column sampling (0.5-1.0)
- min_child_weight: minimum child weight (1-20)
- reg_alpha, reg_lambda: L1/L2 regularization (0-10)
- gamma: minimum loss reduction (0-5)
Neural Network:
- Hidden layer sizes, number of layers
- Learning rate, weight decay
- Dropout rate, batch size
- Activation function
23.5.2 Grid Search
Grid search evaluates all combinations of specified hyperparameter values. It is exhaustive but computationally expensive.
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

param_grid = {
    'max_depth': [3, 5, 7],
    'n_estimators': [100, 300, 500],
    'min_samples_leaf': [10, 20, 50],
    'max_features': ['sqrt', 0.5],
}
# With 3 x 3 x 3 x 2 = 54 combinations and 5-split time-series CV = 270 fits.
# This is feasible for random forests but gets expensive fast.
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid,
    scoring='neg_log_loss',          # optimize probability quality, not accuracy
    cv=TimeSeriesSplit(n_splits=5),  # respect temporal ordering
)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
For XGBoost with 8+ hyperparameters, grid search is impractical: even 3 values per parameter gives $3^8 = 6{,}561$ combinations.
23.5.3 Random Search
Random search samples hyperparameter combinations randomly from specified distributions. Bergstra and Bengio (2012) showed that random search is more efficient than grid search because not all hyperparameters are equally important — random search allocates more evaluations to the important ones by chance.
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from scipy.stats import uniform, randint

param_distributions = {
    'max_depth': randint(3, 10),
    'n_estimators': randint(100, 1000),
    'min_samples_leaf': randint(5, 50),
    'max_features': uniform(0.3, 0.7),
    'min_samples_split': randint(2, 20),
}
# 100 random combinations with 5-split time-series CV = 500 fits.
# Much more efficient than grid search.
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_distributions,
    n_iter=100,
    scoring='neg_log_loss',
    cv=TimeSeriesSplit(n_splits=5),
    random_state=42,
)
random_search.fit(X_train, y_train)
print(random_search.best_params_)
23.5.4 Bayesian Optimization with Optuna
Bayesian optimization builds a probabilistic model of the hyperparameter-performance relationship and intelligently selects the next combination to evaluate. Optuna is a modern, efficient implementation.
import optuna
from sklearn.model_selection import cross_val_score
def objective(trial):
    params = {
        'objective': 'binary:logistic',
        'eval_metric': 'logloss',
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 20),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
        'gamma': trial.suggest_float('gamma', 0, 5.0),
        'seed': 42,
    }
    n_estimators = trial.suggest_int('n_estimators', 100, 1000)
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)
    model = xgb.train(
        params, dtrain,
        num_boost_round=n_estimators,
        evals=[(dval, 'val')],
        early_stopping_rounds=50,
        verbose_eval=False
    )
    preds = model.predict(dval)
    return log_loss(y_val, preds)
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100, show_progress_bar=True)
print(f"Best Log Loss: {study.best_value:.4f}")
print(f"Best Parameters: {study.best_params}")
23.5.5 Cross-Validation Strategies for Time-Series Data
Standard k-fold cross-validation randomly partitions data, violating temporal ordering. For prediction market data, use:
Time-Series Split (Expanding Window):
Fold 1: Train [===] Val [=]
Fold 2: Train [=====] Val [=]
Fold 3: Train [=======] Val [=]
Fold 4: Train [=========] Val [=]
Each fold trains on all data up to a point and validates on the next block. This mimics real-world deployment where you always train on all available history.
Sliding Window:
Fold 1: Train [===] Val [=]
Fold 2: Train [===] Val [=]
Fold 3: Train [===] Val [=]
Fold 4: Train [===] Val [=]
Uses a fixed-size training window, which can be better when older data is less relevant (concept drift).
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5, gap=10)  # gap prevents leakage between folds
for fold, (train_idx, val_idx) in enumerate(tscv.split(X_train)):
    X_fold_train = X_train.iloc[train_idx]
    y_fold_train = y_train.iloc[train_idx]
    X_fold_val = X_train.iloc[val_idx]
    y_fold_val = y_train.iloc[val_idx]
    # Train and evaluate a model on this fold
    fold_model = RandomForestClassifier(
        n_estimators=300, max_depth=8, min_samples_leaf=20,
        random_state=42, n_jobs=-1
    )
    fold_model.fit(X_fold_train, y_fold_train)
    fold_probs = fold_model.predict_proba(X_fold_val)[:, 1]
    print(f"Fold {fold}: log-loss = {log_loss(y_fold_val, fold_probs):.4f}")
23.5.6 Avoiding Overfitting
Overfitting is the central risk in ML for prediction markets. Defensive strategies:
- Regularization. Always use L1/L2 penalties, dropout, or early stopping.
- Hold-out test set. Never tune on the test set. Ever. Use train/validation for tuning, test for final evaluation.
- Occam's razor. If a simpler model (fewer trees, shallower depth) performs nearly as well, prefer it.
- Monitor the gap. If training loss is 0.30 but validation loss is 0.55, you are severely overfitting. Increase regularization.
- Cross-validation. Use all the training data for validation through time-series CV, not just a single split.
- Feature pruning. Remove features that do not improve validation performance. Fewer features = less overfitting risk.
23.6 Calibrating ML Probability Outputs
23.6.1 Why ML Models Produce Uncalibrated Probabilities
A model is calibrated if, among all predictions of probability $p$, the true frequency of the positive outcome is approximately $p$. For instance, among all events where the model says 70% chance, approximately 70% should actually occur.
Different ML models have characteristic calibration biases:
- Random forests tend to produce probabilities pushed toward 0.5 because they average predictions from many trees, and the averaging shrinks extreme values. A true 90% probability might be estimated as 75%.
- Gradient boosting (XGBoost, LightGBM) can produce reasonably calibrated probabilities when trained with log-loss, but heavy regularization or early stopping can distort calibration.
- Neural networks trained with binary cross-entropy are generally better calibrated than those trained with other losses, but they can still exhibit overconfidence (predictions too close to 0 or 1) or underconfidence.
- SVMs produce decision function values that are not probabilities at all without post-hoc calibration.
In prediction markets, calibration is critical. An uncalibrated model that says 80% when the true probability is 65% will systematically overpay for contracts, leading to trading losses even if the model's discrimination (ranking of events by probability) is excellent.
23.6.2 Reliability Diagrams for ML Models
A reliability diagram (calibration plot) bins predicted probabilities and plots the mean predicted probability against the observed frequency in each bin.
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt
def plot_reliability_diagram(y_true, probs_dict, n_bins=10):
    """Plot reliability diagrams for multiple models."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    # Reliability diagram
    ax = axes[0]
    ax.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
    for name, probs in probs_dict.items():
        fraction_pos, mean_pred = calibration_curve(
            y_true, probs, n_bins=n_bins, strategy='uniform'
        )
        ax.plot(mean_pred, fraction_pos, 's-', label=name)
    ax.set_xlabel('Mean Predicted Probability')
    ax.set_ylabel('Observed Frequency')
    ax.set_title('Reliability Diagram')
    ax.legend()
    ax.grid(True, alpha=0.3)
    # Histogram of predictions
    ax = axes[1]
    for name, probs in probs_dict.items():
        ax.hist(probs, bins=30, alpha=0.5, label=name)
    ax.set_xlabel('Predicted Probability')
    ax.set_ylabel('Count')
    ax.set_title('Distribution of Predictions')
    ax.legend()
    plt.tight_layout()
    plt.savefig('reliability_diagram.png', dpi=150)
    plt.show()
# Compare models
plot_reliability_diagram(y_val, {
'Random Forest': rf_probs_val,
'XGBoost': xgb_probs_val,
'Neural Network': nn_probs_val,
})
23.6.3 Platt Scaling
Platt scaling fits a logistic regression on the model's raw output to produce calibrated probabilities:
$$ \hat{p}_{\text{calibrated}} = \frac{1}{1 + \exp(-(a \cdot f(x) + b))} $$
where $f(x)$ is the model's uncalibrated output and $a$, $b$ are parameters learned on a held-out calibration set.
Properties:
- Simple, with only two parameters.
- Works well when the calibration mapping is roughly sigmoidal.
- Can be fit on a small calibration set (a few hundred samples).
- Assumes a monotonic, parametric relationship between raw output and true probability.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
# Platt scaling: fit logistic regression on model outputs
def platt_scaling(probs_train, y_train, probs_test):
    """Apply Platt scaling to probability outputs."""
    lr = LogisticRegression(C=1e10, solver='lbfgs')  # No regularization
    lr.fit(probs_train.reshape(-1, 1), y_train)
    return lr.predict_proba(probs_test.reshape(-1, 1))[:, 1]
# Calibrate XGBoost on validation set, apply to test set
xgb_probs_calibrated = platt_scaling(xgb_probs_val, y_val, xgb_probs_test)
23.6.4 Isotonic Regression
Isotonic regression fits a non-decreasing step function to the calibration mapping, making no parametric assumptions:
$$ \hat{p}_{\text{calibrated}} = g(f(x)) $$
where $g$ is a non-decreasing function learned to minimize squared error on the calibration set.
Properties:
- Nonparametric: can capture any monotonic calibration mapping.
- More flexible than Platt scaling.
- Requires more calibration data (hundreds to thousands of samples).
- Can overfit with small calibration sets, producing jagged step functions.
from sklearn.isotonic import IsotonicRegression
def isotonic_calibration(probs_train, y_train, probs_test):
    """Apply isotonic regression calibration."""
    ir = IsotonicRegression(out_of_bounds='clip')
    ir.fit(probs_train, y_train)
    return ir.predict(probs_test)
xgb_probs_isotonic = isotonic_calibration(xgb_probs_val, y_val, xgb_probs_test)
23.6.5 Temperature Scaling
Temperature scaling, popular for neural networks, divides the pre-sigmoid logit by a learned temperature parameter $T$:
$$ \hat{p}_{\text{calibrated}} = \sigma\left(\frac{z}{T}\right) $$
where $z$ is the logit (pre-sigmoid output) and $T > 0$ is the temperature.
- $T > 1$: softens predictions (moves them toward 0.5), correcting overconfidence.
- $T < 1$: sharpens predictions (moves them toward 0 or 1), correcting underconfidence.
- $T = 1$: no change.
Temperature scaling has only one parameter, making it very resistant to overfitting, but it applies a single global rescaling of the logits; it cannot fix calibration errors that differ across the probability range.
from scipy.optimize import minimize_scalar
def temperature_scaling(logits_val, y_val, logits_test):
    """Learn temperature on validation set, apply to test set."""
    def nll(T):
        scaled = 1 / (1 + np.exp(-logits_val / T))
        scaled = np.clip(scaled, 1e-7, 1 - 1e-7)
        return -np.mean(y_val * np.log(scaled) + (1 - y_val) * np.log(1 - scaled))
    result = minimize_scalar(nll, bounds=(0.1, 10.0), method='bounded')
    T_opt = result.x
    print(f"Optimal temperature: {T_opt:.3f}")
    return 1 / (1 + np.exp(-logits_test / T_opt))
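The network in Section 23.4.6 ends in a sigmoid and therefore returns probabilities, not logits. One workable approach, sketched below, is to recover logits with the inverse sigmoid before applying temperature_scaling; the clipping constant is an arbitrary small value to avoid infinities.
def probs_to_logits(p, eps=1e-7):
    """Inverse sigmoid: convert probabilities back to logits."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

nn_logits_val = probs_to_logits(nn_probs_val)
nn_logits_test = probs_to_logits(nn_probs_test)
nn_probs_temp = temperature_scaling(nn_logits_val, y_val.values, nn_logits_test)
print(f"Temperature-scaled Test Log Loss: {log_loss(y_test, nn_probs_temp):.4f}")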
23.6.6 Python Calibration Pipeline
def full_calibration_pipeline(model_name, probs_val, probs_test, y_val, y_test):
    """Complete calibration pipeline with evaluation."""
    results = {}
    # Uncalibrated
    results['Uncalibrated'] = {
        'brier': brier_score_loss(y_test, probs_test),
        'logloss': log_loss(y_test, probs_test),
    }
    # Platt scaling
    probs_platt = platt_scaling(probs_val, y_val, probs_test)
    results['Platt'] = {
        'brier': brier_score_loss(y_test, probs_platt),
        'logloss': log_loss(y_test, probs_platt),
    }
    # Isotonic regression
    probs_iso = isotonic_calibration(probs_val, y_val, probs_test)
    results['Isotonic'] = {
        'brier': brier_score_loss(y_test, probs_iso),
        'logloss': log_loss(y_test, probs_iso),
    }
    results_df = pd.DataFrame(results).T
    print(f"\n{model_name} Calibration Results:")
    print(results_df.to_string())
    return results_df
# Run calibration for each model
full_calibration_pipeline('XGBoost', xgb_probs_val, xgb_probs_test, y_val, y_test)
full_calibration_pipeline('Random Forest', rf_probs_val, rf_probs_test, y_val, y_test)
23.6.7 Practical Advice on Calibration
- Always calibrate. Even if the model seems calibrated on the training set, check on held-out data.
- Use a separate calibration set. Do not calibrate on the same data used for hyperparameter tuning. The pipeline is: train set → model training; validation set → hyperparameter tuning; calibration set → calibration fitting; test set → final evaluation.
- Platt scaling is safer with small data. With fewer than 500 calibration samples, prefer Platt scaling over isotonic regression.
- Temperature scaling for neural nets. If your model is a neural network, try temperature scaling first — it is simple and effective.
- Recalibrate periodically. As the data distribution shifts (concept drift), calibration can degrade. Recalibrate on recent data.
23.7 Model Interpretability with SHAP
23.7.1 Why Interpretability Matters for Trading
In prediction markets, model interpretability serves multiple purposes:
- Trust. Before risking money on a model's prediction, you want to understand why it made that prediction. A model that predicts 85% on a political outcome should have interpretable reasons — high approval rating, strong economy, weak challenger.
- Debugging. If a model makes a surprising prediction, SHAP values can reveal whether it is capturing a genuine signal or exploiting a data artifact.
- Regulatory and ethical considerations. In some prediction market contexts, understanding why a model makes predictions can be important for compliance.
- Feature engineering feedback. SHAP analysis can reveal which features are most valuable, guiding further feature engineering.
- Communicating to stakeholders. If you manage a fund that trades prediction markets, explaining model decisions to investors requires interpretability.
23.7.2 SHAP Values Explained Intuitively
SHAP (SHapley Additive exPlanations) values are based on Shapley values from cooperative game theory. The idea is to fairly distribute the model's prediction among the input features.
For a prediction $f(x)$, the SHAP value $\phi_j$ for feature $j$ satisfies:
$$ f(x) = \phi_0 + \sum_{j=1}^{M} \phi_j $$
where $\phi_0$ is the base value (the average prediction across all data) and $\phi_j$ is the contribution of feature $j$ to the prediction for this specific instance.
Intuition. Consider all possible subsets of features. For each subset, measure how much adding feature $j$ changes the prediction. The SHAP value is the weighted average of these marginal contributions across all possible subsets.
Formally:
$$ \phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|! \; (M - |S| - 1)!}{M!} \left[ f(S \cup \{j\}) - f(S) \right] $$
where $N$ is the set of all features, $M = |N|$, and $f(S)$ is the model's prediction using only features in $S$ (with other features marginalized out).
Key properties:
- Additivity. SHAP values sum to the model's prediction minus the base value.
- Consistency. If a feature contributes more in model A than model B for every possible subset, its SHAP value is higher in model A.
- Local accuracy. The explanation exactly matches the model's output for each individual prediction.
23.7.3 Types of SHAP Plots
Force Plot. Shows how features push a single prediction away from the base value. Features pushing the prediction higher are shown in one color (typically red), and features pushing it lower in another (typically blue). This is ideal for explaining a specific prediction: "Why does the model predict 85% for this election?"
Summary Plot (Beeswarm). Shows the distribution of SHAP values for each feature across all predictions. Each dot is a single prediction, positioned by its SHAP value, and colored by the feature value. This reveals global patterns: "Across all elections, higher approval ratings consistently push predictions upward."
Dependence Plot. Shows the SHAP value for a single feature as a function of the feature's value, colored by an interacting feature. This reveals nonlinear effects and interactions: "Approval rating has a linear positive effect on prediction, but the effect is stronger when GDP growth is high."
Bar Plot. Shows mean absolute SHAP values for each feature, providing a global ranking of feature importance that is theoretically grounded and more informative than impurity-based importance.
23.7.4 Python SHAP Analysis for Prediction Market Models
import shap
# TreeExplainer for XGBoost (fast, exact)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Summary plot: global feature importance and direction
shap.summary_plot(shap_values, X_test, feature_names=feature_cols)
# Bar plot: mean absolute SHAP values
shap.summary_plot(shap_values, X_test, feature_names=feature_cols,
plot_type='bar')
# Force plot for a single prediction
idx = 0 # Explain the first test prediction
shap.force_plot(
explainer.expected_value,
shap_values[idx],
X_test.iloc[idx],
feature_names=feature_cols
)
# Dependence plot: approval_rating effect with interaction
shap.dependence_plot(
'approval_rating', shap_values, X_test,
feature_names=feature_cols,
interaction_index='gdp_growth'
)
# Waterfall plot for detailed single-instance explanation
shap.waterfall_plot(
shap.Explanation(
values=shap_values[idx],
base_values=explainer.expected_value,
data=X_test.iloc[idx].values,
feature_names=feature_cols
)
)
23.7.5 SHAP for Neural Networks
For neural networks, TreeExplainer is not available. Use DeepExplainer or KernelExplainer:
# DeepExplainer for PyTorch models
# Requires a background dataset for reference
background = X_train_t[:100] # Use a subset as background
deep_explainer = shap.DeepExplainer(model, background)
nn_shap_values = deep_explainer.shap_values(X_test_t[:50])
# KernelExplainer: model-agnostic but slower
def nn_predict(X):
    with torch.no_grad():
        return model(torch.FloatTensor(X)).numpy()
kernel_explainer = shap.KernelExplainer(nn_predict, X_train_scaled[:100])
kernel_shap_values = kernel_explainer.shap_values(X_test_scaled[:50])
23.7.6 Interpreting SHAP Results for Trading Decisions
When using SHAP to inform trading:
- Sanity check feature effects. If SHAP shows that higher unemployment increases the incumbent's re-election probability, something is wrong — either the feature is mislabeled, there is data leakage, or the model is overfitting to noise.
- Identify the key drivers. If three features account for 80% of the SHAP importance, focus your data collection and monitoring on those features.
- Watch for interaction effects. SHAP dependence plots can reveal that a feature matters only in certain contexts. This can inform conditional trading strategies.
- Track SHAP stability over time. If the SHAP importance ranking changes dramatically when you retrain on new data, the model may be unstable.
23.8 Feature Engineering for ML Models
23.8.1 The Importance of Feature Engineering
Even with powerful ML algorithms, the quality of input features often determines model performance. In prediction markets, raw data (poll numbers, economic indicators, contract prices) is just the starting point. Thoughtful feature engineering can expose signals that raw data obscures.
23.8.2 Types of Features for Prediction Markets
Raw features. Direct measurements: current poll numbers, GDP growth rate, unemployment rate, contract price, days until resolution.
Transformation features. Nonlinear transformations of raw features:
- Log transforms for skewed distributions (e.g., fundraising amounts)
- Polynomial features for capturing quadratic relationships
- Rank transforms for robustness to outliers
Lag features. Past values of time-varying features:
- Poll numbers 7, 14, 30 days ago
- Price of the contract 1 hour, 1 day, 1 week ago
- Economic indicators from previous quarters
Change and momentum features. Rates of change:
- Change in poll numbers over the last week
- Momentum of contract price (is it trending up or down?)
- Acceleration: is the change itself increasing or decreasing?
Rolling statistics. Aggregations over windows:
- 7-day moving average of contract price
- 30-day volatility of poll numbers
- Rolling maximum and minimum
Interaction features. Products or ratios of features:
- Approval rating times GDP growth
- Polling margin divided by days until election
- Sentiment score times fundraising ratio
External data features. Data from outside the prediction market:
- News sentiment (NLP-derived)
- Social media volume and sentiment
- Expert forecasts and aggregate predictions
- Betting odds from other platforms
23.8.3 Automated Feature Generation
from itertools import combinations
def generate_features(df, feature_cols, lags=[1, 3, 7, 14]):
    """Generate engineered features for prediction market models."""
    df_feat = df.copy()
    # Lag features (assuming temporal ordering)
    for col in feature_cols:
        for lag in lags:
            df_feat[f'{col}_lag{lag}'] = df_feat[col].shift(lag)
    # Rolling statistics
    for col in feature_cols:
        df_feat[f'{col}_roll7_mean'] = df_feat[col].rolling(7).mean()
        df_feat[f'{col}_roll7_std'] = df_feat[col].rolling(7).std()
        df_feat[f'{col}_roll14_mean'] = df_feat[col].rolling(14).mean()
    # Change features
    for col in feature_cols:
        df_feat[f'{col}_change1'] = df_feat[col].diff(1)
        df_feat[f'{col}_change7'] = df_feat[col].diff(7)
        df_feat[f'{col}_pct_change1'] = df_feat[col].pct_change(1)
    # Interaction features (selected pairs)
    interaction_pairs = [
        ('approval_rating', 'gdp_growth'),
        ('polling_margin', 'months_to_election'),
        ('sentiment_score', 'approval_rating'),
    ]
    for col1, col2 in interaction_pairs:
        if col1 in df_feat.columns and col2 in df_feat.columns:
            df_feat[f'{col1}_x_{col2}'] = df_feat[col1] * df_feat[col2]
            if (df_feat[col2] != 0).all():
                df_feat[f'{col1}_div_{col2}'] = df_feat[col1] / df_feat[col2]
    # Log transforms for positive features
    for col in ['fundraising_ratio']:
        if col in df_feat.columns:
            df_feat[f'{col}_log'] = np.log1p(df_feat[col])
    # Drop rows with NaN from lag/rolling features
    df_feat = df_feat.dropna()
    return df_feat
23.8.4 Feature Selection Techniques
With many engineered features, some will be noise. Feature selection improves model performance and interpretability.
Variance threshold. Remove features with near-zero variance.
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X_train)
Correlation filter. Remove highly correlated features (e.g., correlation > 0.95), keeping the one with higher SHAP importance.
def remove_correlated_features(X, threshold=0.95):
    """Remove features with correlation above threshold."""
    corr_matrix = X.corr().abs()
    upper = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    to_drop = [col for col in upper.columns if any(upper[col] > threshold)]
    return X.drop(columns=to_drop), to_drop
Recursive Feature Elimination (RFE). Iteratively removes the least important feature.
SHAP-based selection. Use SHAP values from Section 23.7 to rank features and select the top $k$.
# Select top features based on mean absolute SHAP values
mean_abs_shap = np.abs(shap_values).mean(axis=0)
feature_importance_shap = pd.Series(
mean_abs_shap, index=feature_cols
).sort_values(ascending=False)
# Keep top 15 features
top_features = feature_importance_shap.head(15).index.tolist()
Boruta. A wrapper method that creates "shadow features" (random permutations of real features) and retains only features that consistently outperform their shadows.
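The shadow-feature idea can be illustrated in a few lines. The sketch below is a single-pass simplification of Boruta (the real algorithm iterates and applies statistical tests): it permutes each feature to create shadows, fits a random forest on real plus shadow features, and keeps only the features whose importance beats the best shadow.
def shadow_feature_screen(X, y, random_state=42):
    """Single-pass sketch of the Boruta idea: keep features whose RF importance
    exceeds the best importance achieved by any permuted 'shadow' feature."""
    rng = np.random.default_rng(random_state)
    shadows = pd.DataFrame(
        {f'shadow_{c}': rng.permutation(X[c].values) for c in X.columns},
        index=X.index
    )
    X_aug = pd.concat([X, shadows], axis=1)
    rf_screen = RandomForestClassifier(n_estimators=500, max_depth=8,
                                       min_samples_leaf=20,
                                       random_state=random_state, n_jobs=-1)
    rf_screen.fit(X_aug, y)
    imp = pd.Series(rf_screen.feature_importances_, index=X_aug.columns)
    best_shadow = imp[shadows.columns].max()
    real_imp = imp[X.columns]
    return real_imp[real_imp > best_shadow].index.tolist()

kept = shadow_feature_screen(X_train, y_train)
print("Features beating their shadows:", kept)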
23.9 Handling Prediction Market Specific Challenges
23.9.1 Class Imbalance in Rare Events
Many prediction market questions concern rare events: "Will there be a government shutdown?" (base rate maybe 20%), "Will this team win the championship?" (base rate 3%), "Will a recession start this quarter?" (base rate 5%).
With imbalanced classes, standard ML models tend to predict the majority class and achieve high accuracy while being useless for probability estimation.
Strategies:
- Use proper scoring rules, not accuracy. Evaluate with log-loss and Brier score, which penalize poor probability estimates regardless of class balance.
- Class weighting. Set class_weight='balanced' (scikit-learn) or scale_pos_weight (XGBoost) to make the model pay more attention to the minority class.
- Oversampling and undersampling. SMOTE (Synthetic Minority Oversampling Technique) creates synthetic minority examples. Undersampling removes majority examples. Both have tradeoffs: SMOTE can create unrealistic examples, and undersampling discards data.
- Threshold-independent evaluation. Use AUC-ROC and AUC-PR curves to evaluate discrimination independent of the decision threshold.
Important caveat. For probability estimation (as opposed to classification), aggressive resampling can distort the learned probability distribution. A model trained on 50/50 resampled data will predict probabilities centered around 50%, not around the true base rate. If you resample, you must recalibrate afterward.
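If you do train on undersampled data, a standard analytic correction, which follows from Bayes' rule under the assumption that the negative class was kept with a known fraction beta, maps the resampled model's probabilities back to the original class distribution:
def correct_undersampled_probs(p_sampled, beta):
    """Map probabilities learned on data where only a fraction `beta` of the
    negative class was kept back to the original class distribution."""
    return beta * p_sampled / (beta * p_sampled + 1 - p_sampled)

# Example: a model trained on 1:1 resampled data from a 15% base-rate problem
# (beta = 0.15 / 0.85) that predicts 0.50 maps back to the original base rate.
print(correct_undersampled_probs(0.50, beta=0.15 / 0.85))  # approximately 0.15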
23.9.2 Small Datasets
Prediction markets for elections, major geopolitical events, and rare occurrences suffer from extremely small datasets. There have been only about 20 U.S. presidential elections in the modern polling era. Strategies:
- Transfer learning. Train on related tasks with more data (e.g., all elections worldwide, including congressional, gubernatorial) and fine-tune on the specific task.
- Bayesian approaches. Use informative priors to incorporate domain knowledge. Bayesian neural networks or Bayesian XGBoost (via priors on hyperparameters) can help.
- Simpler models. With very small datasets (fewer than 100 samples), logistic regression or even simple heuristics may outperform complex ML models. The bias-variance tradeoff strongly favors high-bias, low-variance models here.
- Data augmentation. For certain types of prediction market data, you can generate synthetic examples. For example, bootstrap resampling of historical elections with slight perturbations.
- Leave-one-out cross-validation. When data is precious, LOOCV maximizes the training set size for each fold.
23.9.3 Concept Drift
Prediction markets operate in non-stationary environments. The relationship between features and outcomes can change over time:
- Structural shifts. The rise of social media changed how polling relates to election outcomes.
- Market evolution. As prediction markets mature, the informational efficiency of prices changes.
- Economic regime changes. The relationship between economic indicators and election outcomes may differ in recessions versus expansions.
Detection:
- Monitor model performance over time. A sudden increase in log-loss signals potential drift.
- Track the distribution of input features. If the feature distribution shifts significantly from training data, predictions may be unreliable.
- Use statistical tests (Page-Hinkley, ADWIN) to detect change points; a Page-Hinkley sketch appears below.
Mitigation:
- Retrain models regularly on recent data.
- Use sliding window training instead of expanding window.
- Weight recent observations more heavily.
- Ensemble models trained on different time periods.
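As a concrete example of change-point detection, here is a minimal sketch of the Page-Hinkley test applied to a stream of per-prediction log-losses. The delta tolerance and detection threshold are tuning choices; the values below are illustrative.
import numpy as np

def page_hinkley(losses, delta=0.005, threshold=0.5):
    """Sketch of the Page-Hinkley test: flag the first index at which the
    cumulative deviation of the loss stream above its running mean exceeds
    `threshold` (allowing a tolerance `delta` per step)."""
    mean, cum, min_cum = 0.0, 0.0, 0.0
    for t, x in enumerate(losses, start=1):
        mean += (x - mean) / t          # running mean of the loss
        cum += x - mean - delta         # cumulative positive deviation
        min_cum = min(min_cum, cum)
        if cum - min_cum > threshold:
            return t                    # drift flagged at observation t
    return None

# Example: per-instance log-losses that jump upward after observation 300
losses = np.concatenate([np.random.uniform(0.3, 0.5, 300),
                         np.random.uniform(0.7, 0.9, 200)])
print(page_hinkley(losses))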
23.9.4 Non-Stationarity
Beyond concept drift, prediction market data can exhibit non-stationarity in the statistical sense:
- Contract prices are often close to random walks (non-stationary time series).
- The variance of poll numbers changes as elections approach.
- The relevance of different features changes over the resolution timeline.
Practical responses:
- Use stationary transformations as features: returns instead of prices, changes instead of levels (see the sketch below).
- Include time-to-resolution as a feature, allowing the model to learn that different features matter at different time horizons.
- Separate models for different time horizons (e.g., one model for predictions 6+ months out, another for the final month).
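A brief illustration of these transformations, using a small hypothetical market_df with price, timestamp, and resolution_date columns (this frame is invented for the example and is not part of the chapter's simulated dataset):
# Hypothetical example frame (not part of the chapter's simulated dataset)
market_df = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=30, freq='D'),
    'price': np.linspace(0.40, 0.55, 30) + np.random.normal(0, 0.01, 30),
    'resolution_date': pd.Timestamp('2024-11-05'),
})
market_df['log_return_1d'] = np.log(market_df['price']).diff(1)   # stationary transform of price
market_df['price_change_7d'] = market_df['price'].diff(7)         # change instead of level
market_df['days_to_resolution'] = (
    market_df['resolution_date'] - market_df['timestamp']
).dt.days                                                          # lets the model condition on horizon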
23.9.5 Leakage Pitfalls
Data leakage occurs when information from the future or from the target variable inadvertently enters the training features. Common leakage scenarios in prediction markets:
- Future data in features. Using poll numbers from after the prediction date. Always verify that each feature was available at the time of prediction.
- Market price as feature. Using the current market price (which aggregates information from all participants, including potentially the model's own predictions) can create circular logic.
- Look-ahead bias in rolling features. Centering rolling windows (using future data on both sides) instead of trailing windows.
- Target encoding without proper CV. Using target-encoded features (mean outcome by category) computed on the full dataset rather than only the training fold.
- Feature engineering after splitting. Computing feature statistics (mean, variance for standardization) on the full dataset including test data.
Prevention:
- Organize data with strict timestamps.
- Implement a FeatureStore that only returns features available at a given timestamp (a minimal sketch follows this list).
- Run leakage detection: if a feature has suspiciously high importance (e.g., a single feature yields near-perfect predictions), investigate.
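A minimal sketch of such a point-in-time feature store is shown below; the class and column names are illustrative, not a reference implementation.
import pandas as pd

class PointInTimeFeatureStore:
    """Minimal sketch: stores timestamped feature observations and only serves
    values that were already known at the requested prediction time."""
    def __init__(self, feature_log: pd.DataFrame):
        # feature_log columns: 'feature', 'timestamp', 'value'
        self.log = feature_log.sort_values('timestamp')

    def get_features(self, as_of, features):
        """Return the latest value of each feature observed at or before `as_of`."""
        visible = self.log[self.log['timestamp'] <= as_of]
        latest = visible.groupby('feature')['value'].last()
        return latest.reindex(features)  # NaN if a feature was not yet observed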
23.10 Model Comparison and Selection
23.10.1 Evaluation Metrics for Probability Estimation
For prediction markets, we care about the quality of probability estimates, not just classification accuracy. The key metrics are:
Log-loss (Cross-entropy):
$$ \text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{p}_i) + (1-y_i) \log(1-\hat{p}_i) \right] $$
Log-loss heavily penalizes confident wrong predictions. A prediction of 0.99 for an event that does not occur contributes $-\log(0.01) = 4.61$ to the loss. This makes log-loss the harshest metric for overconfident models.
Brier score:
$$ \text{Brier} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{p}_i)^2 $$
The Brier score is the mean squared error between predictions and outcomes. It ranges from 0 (perfect) to 1 (worst). It is less sensitive to extreme predictions than log-loss.
Expected Calibration Error (ECE):
$$ \text{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} |\bar{p}_b - \bar{y}_b| $$
where $\bar{p}_b$ is the mean predicted probability in bin $b$ and $\bar{y}_b$ is the observed frequency. ECE directly measures calibration.
Brier Skill Score:
$$ \text{BSS} = 1 - \frac{\text{Brier}_{\text{model}}}{\text{Brier}_{\text{baseline}}} $$
where the baseline is the climatological probability (overall base rate). BSS > 0 means the model beats the baseline.
23.10.2 Comparing Models
from sklearn.metrics import brier_score_loss, log_loss
def compare_models(y_true, model_probs, model_names):
    """Comprehensive model comparison."""
    results = []
    for name, probs in zip(model_names, model_probs):
        # Clip probabilities to avoid log(0)
        probs_clipped = np.clip(probs, 1e-7, 1 - 1e-7)
        brier = brier_score_loss(y_true, probs_clipped)
        ll = log_loss(y_true, probs_clipped)
        # Expected Calibration Error
        n_bins = 10
        bin_edges = np.linspace(0, 1, n_bins + 1)
        ece = 0.0
        for j in range(n_bins):
            mask = (probs_clipped >= bin_edges[j]) & (probs_clipped < bin_edges[j+1])
            if mask.sum() > 0:
                bin_acc = y_true[mask].mean()
                bin_conf = probs_clipped[mask].mean()
                ece += mask.sum() / len(y_true) * abs(bin_acc - bin_conf)
        # Brier Skill Score
        base_rate = y_true.mean()
        brier_baseline = brier_score_loss(y_true, np.full_like(probs, base_rate))
        bss = 1 - brier / brier_baseline
        results.append({
            'Model': name,
            'Log Loss': ll,
            'Brier Score': brier,
            'ECE': ece,
            'Brier Skill Score': bss,
        })
    results_df = pd.DataFrame(results).set_index('Model')
    print(results_df.to_string())
    return results_df
compare_models(
y_test.values,
[rf_probs_test, xgb_probs_test, nn_probs_test],
['Random Forest', 'XGBoost', 'Neural Network']
)
23.10.3 Statistical Tests for Model Comparison
When comparing models, observed differences in log-loss or Brier score may be due to chance. Use statistical tests to assess significance.
Paired t-test on per-instance losses:
from scipy import stats
def paired_comparison(y_true, probs_a, probs_b, name_a, name_b):
    """Statistical comparison of two models using paired test."""
    # Per-instance log-loss
    eps = 1e-7
    loss_a = -(y_true * np.log(np.clip(probs_a, eps, 1-eps)) +
               (1-y_true) * np.log(np.clip(1-probs_a, eps, 1-eps)))
    loss_b = -(y_true * np.log(np.clip(probs_b, eps, 1-eps)) +
               (1-y_true) * np.log(np.clip(1-probs_b, eps, 1-eps)))
    diff = loss_a - loss_b
    t_stat, p_value = stats.ttest_rel(loss_a, loss_b)
    print(f"\n{name_a} vs {name_b}:")
    print(f"  Mean loss difference: {diff.mean():.6f}")
    print(f"  t-statistic: {t_stat:.3f}")
    print(f"  p-value: {p_value:.4f}")
    if p_value < 0.05:
        better = name_b if diff.mean() > 0 else name_a
        print(f"  Significant difference: {better} is better")
    else:
        print(f"  No significant difference at alpha=0.05")
paired_comparison(y_test.values, rf_probs_test, xgb_probs_test,
'Random Forest', 'XGBoost')
Diebold-Mariano test. More appropriate for time-series data, it accounts for serial correlation in the loss differences:
def diebold_mariano_test(loss_a, loss_b, h=1):
    """Diebold-Mariano test for equal predictive accuracy.

    h: forecast horizon (for HAC variance estimation)
    """
    d = loss_a - loss_b
    T = len(d)
    d_mean = d.mean()
    # HAC variance estimator
    gamma_0 = np.var(d, ddof=0)
    gamma_sum = 0
    for k in range(1, h):
        gamma_k = np.mean((d[k:] - d_mean) * (d[:-k] - d_mean))
        gamma_sum += gamma_k
    var_d = (gamma_0 + 2 * gamma_sum) / T
    if var_d <= 0:
        return np.nan, np.nan
    dm_stat = d_mean / np.sqrt(var_d)
    p_value = 2 * stats.norm.sf(abs(dm_stat))
    return dm_stat, p_value
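A brief usage sketch, reusing the test-set arrays from the comparison above and per-instance log-losses computed the same way as in paired_comparison; the loss_rf and loss_xgb names are illustrative.

eps = 1e-7
y = y_test.values
loss_rf = -(y * np.log(np.clip(rf_probs_test, eps, 1 - eps)) +
            (1 - y) * np.log(np.clip(1 - rf_probs_test, eps, 1 - eps)))
loss_xgb = -(y * np.log(np.clip(xgb_probs_test, eps, 1 - eps)) +
             (1 - y) * np.log(np.clip(1 - xgb_probs_test, eps, 1 - eps)))
dm_stat, p_value = diebold_mariano_test(loss_rf, loss_xgb, h=1)
print(f"DM statistic: {dm_stat:.3f}, p-value: {p_value:.4f}")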
23.10.4 When to Use Which Model
Based on the analysis above and common practitioner experience, the following guidelines are a reasonable starting point:
| Scenario | Recommended Model |
|---|---|
| Small dataset (< 200 samples) | Logistic regression or simple RF |
| Medium dataset (200-5000 samples) | XGBoost or LightGBM |
| Large dataset (> 5000 samples) | XGBoost, LightGBM, or neural network |
| Many categorical features | LightGBM (native categorical support) |
| Need interpretability | XGBoost + SHAP |
| Multi-modal data (text + tabular) | Neural network |
| Real-time prediction | Small RF or LightGBM (fast inference) |
| Highest possible accuracy | Ensemble of multiple model types |
23.11 Deploying ML Models for Live Trading
23.11.1 Model Serialization
Once a model is trained and validated, serialize it for deployment:
import joblib
import json
from datetime import datetime

# Save XGBoost model
model.save_model('models/xgb_election_v1.json')

# Save scikit-learn models
joblib.dump(rf, 'models/rf_election_v1.joblib')
joblib.dump(scaler, 'models/scaler_v1.joblib')

# Save the calibrator fitted in Section 23.6 (assumes a Platt-scaling model named `calibrator`);
# the serving layer in 23.11.2 expects this file
joblib.dump(calibrator, 'models/calibrator_v1.joblib')

# Save PyTorch model (here `model` refers to the trained PredictionMarketNet
# from Section 23.4, not the XGBoost model above)
torch.save({
    'model_state_dict': model.state_dict(),
    'model_architecture': 'PredictionMarketNet',
    'input_dim': X_train_scaled.shape[1],
    'scaler_mean': scaler.mean_.tolist(),
    'scaler_scale': scaler.scale_.tolist(),
}, 'models/nn_election_v1.pt')

# Save metadata
metadata = {
    'model_version': 'v1',
    'training_date': datetime.now().isoformat(),
    'training_samples': len(X_train),
    'features': feature_cols,
    'validation_brier': float(brier_score_loss(y_val, xgb_probs_val)),
    'validation_logloss': float(log_loss(y_val, xgb_probs_val)),
    'calibration_method': 'platt_scaling',
}
with open('models/metadata_v1.json', 'w') as f:
    json.dump(metadata, f, indent=2)
23.11.2 Prediction Serving
A prediction serving system should:
- Load the model once. Do not reload the model for every prediction.
- Validate inputs. Check that features are within expected ranges.
- Apply preprocessing. The same scaler and feature engineering used during training.
- Apply calibration. The calibration model (Platt scaling or isotonic regression) must be applied post-prediction.
- Return confidence metadata. Along with the prediction, return information about model uncertainty.
class PredictionServer:
    """Serves predictions from a trained ML pipeline."""

    def __init__(self, model_dir):
        self.model = xgb.Booster()
        self.model.load_model(f'{model_dir}/xgb_election_v1.json')
        # The scaler is loaded for pipelines that need it; the tree-based model
        # served here is scale-invariant, so it is not applied in predict()
        self.scaler = joblib.load(f'{model_dir}/scaler_v1.joblib')
        self.calibrator = joblib.load(f'{model_dir}/calibrator_v1.joblib')
        with open(f'{model_dir}/metadata_v1.json') as f:
            self.metadata = json.load(f)
        self.feature_names = self.metadata['features']

    def predict(self, features_dict):
        """Generate a calibrated probability prediction."""
        # Validate inputs
        missing = set(self.feature_names) - set(features_dict.keys())
        if missing:
            raise ValueError(f"Missing features: {missing}")
        # Prepare features
        X = pd.DataFrame([features_dict])[self.feature_names]
        # Predict
        dmatrix = xgb.DMatrix(X, feature_names=self.feature_names)
        raw_prob = self.model.predict(dmatrix)[0]
        # Calibrate
        calibrated_prob = self.calibrator.predict_proba(
            np.array([[raw_prob]])
        )[0, 1]
        return {
            'probability': float(calibrated_prob),
            'raw_probability': float(raw_prob),
            'model_version': self.metadata['model_version'],
            'features_used': self.feature_names,
        }
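A minimal usage sketch, assuming the artifacts from Section 23.11.1 live in a models/ directory; the placeholder feature values are illustrative and would come from your live feature pipeline in practice.

server = PredictionServer('models')
# Placeholder values for every feature the model expects;
# real values come from the live feature pipeline
features = {name: 0.0 for name in server.feature_names}
result = server.predict(features)
print(f"Calibrated probability: {result['probability']:.3f} "
      f"(model {result['model_version']})")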
23.11.3 Monitoring Model Performance
In live trading, continuous monitoring is essential:
Performance metrics over time. Track rolling Brier score and log-loss as outcomes are resolved. A deterioration signals concept drift or data quality issues.
class ModelMonitor:
    """Monitor deployed model performance."""

    def __init__(self, window_size=100):
        self.predictions = []
        self.outcomes = []
        self.timestamps = []
        self.window_size = window_size

    def log_prediction(self, timestamp, probability, outcome=None):
        self.predictions.append(probability)
        self.outcomes.append(outcome)
        self.timestamps.append(timestamp)

    def update_outcome(self, index, outcome):
        self.outcomes[index] = outcome

    def rolling_brier_score(self):
        """Compute rolling Brier score on resolved predictions."""
        resolved = [
            (p, o) for p, o in zip(self.predictions, self.outcomes)
            if o is not None
        ]
        if len(resolved) < self.window_size:
            return None
        recent = resolved[-self.window_size:]
        preds = np.array([p for p, o in recent])
        outcomes = np.array([o for p, o in recent])
        return brier_score_loss(outcomes, preds)

    def calibration_check(self, n_bins=5):
        """Check if model is still calibrated."""
        resolved = [
            (p, o) for p, o in zip(self.predictions, self.outcomes)
            if o is not None
        ]
        if len(resolved) < 50:
            return "Insufficient data for calibration check"
        preds = np.array([p for p, o in resolved])
        outcomes = np.array([o for p, o in resolved])
        bins = np.linspace(0, 1, n_bins + 1)
        report = []
        for i in range(n_bins):
            mask = (preds >= bins[i]) & (preds < bins[i+1])
            if i == n_bins - 1:
                # Include predictions of exactly 1.0 in the final bin
                mask |= (preds == bins[i+1])
            if mask.sum() > 0:
                expected = preds[mask].mean()
                observed = outcomes[mask].mean()
                report.append({
                    'bin': f'{bins[i]:.1f}-{bins[i+1]:.1f}',
                    'count': int(mask.sum()),
                    'expected': expected,
                    'observed': observed,
                    'gap': abs(expected - observed),
                })
        return pd.DataFrame(report)

    def drift_alert(self, threshold=0.05):
        """Alert if recent performance degrades significantly."""
        resolved = [
            (p, o) for p, o in zip(self.predictions, self.outcomes)
            if o is not None
        ]
        # Require enough *resolved* predictions before comparing halves
        if len(resolved) < 2 * self.window_size:
            return False, "Insufficient resolved predictions for drift check"
        n = len(resolved)
        mid = n // 2
        old_preds = np.array([p for p, o in resolved[:mid]])
        old_outcomes = np.array([o for p, o in resolved[:mid]])
        new_preds = np.array([p for p, o in resolved[mid:]])
        new_outcomes = np.array([o for p, o in resolved[mid:]])
        old_brier = brier_score_loss(old_outcomes, old_preds)
        new_brier = brier_score_loss(new_outcomes, new_preds)
        if new_brier - old_brier > threshold:
            return True, f"Brier score degraded: {old_brier:.4f} -> {new_brier:.4f}"
        return False, f"Performance stable: {old_brier:.4f} -> {new_brier:.4f}"
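A brief usage sketch; the timestamps, probabilities, and outcomes below are illustrative.

monitor = ModelMonitor(window_size=100)
monitor.log_prediction('2024-10-01T12:00:00', probability=0.62)   # outcome not yet known
monitor.log_prediction('2024-10-01T13:00:00', probability=0.41)
monitor.update_outcome(0, 1)            # first market resolved YES
print(monitor.rolling_brier_score())    # None until window_size predictions are resolved
print(monitor.calibration_check())      # needs at least 50 resolved predictions
alerted, message = monitor.drift_alert()
print(message)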
23.11.4 Retraining Schedules
Models should be retrained periodically. The frequency depends on:
- Market type. Election models may be retrained weekly during campaign season, while sports models may be retrained daily.
- Data availability. Retrain when significant new data becomes available (new polls, new economic data releases).
- Drift detection. Trigger retraining when the monitoring system detects performance degradation.
- Scheduled retraining. As a baseline, retrain at least monthly with fresh data.
Retraining pipeline (a minimal sketch of the deployment gate in step 7 follows the list):
1. Collect new data since last training
2. Run feature engineering pipeline
3. Retrain model on expanded dataset
4. Evaluate on recent held-out data
5. Recalibrate
6. Compare with current production model
7. If better (and not just by noise), deploy
8. Archive old model with metadata
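To make step 7 concrete, here is a minimal sketch of a deployment gate, assuming the production and candidate models have both been scored on the same recent held-out set; the function name and the 0.005 improvement margin are illustrative choices, not fixed recommendations.

def should_deploy(prod_probs, candidate_probs, y_recent, min_improvement=0.005):
    """Deploy the candidate only if it beats production by more than a noise margin."""
    prod_brier = brier_score_loss(y_recent, prod_probs)
    cand_brier = brier_score_loss(y_recent, candidate_probs)
    improvement = prod_brier - cand_brier
    return improvement > min_improvement, prod_brier, cand_brier

# Hypothetical arrays: recent predictions from both models and the resolved outcomes
deploy, prod_brier, cand_brier = should_deploy(prod_probs_recent, cand_probs_recent, y_recent)
print(f"Production Brier: {prod_brier:.4f}, candidate Brier: {cand_brier:.4f}, deploy: {deploy}")

In practice, combine this margin with the statistical tests from Section 23.10.3 before switching models, so that a lucky run of recent outcomes does not trigger a deployment.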
23.11.5 A/B Testing Models
When deploying a new model, run it in parallel with the existing model:
- Shadow mode. The new model makes predictions but they are not acted upon. Compare its predictions against the old model on the same events.
- Canary deployment. Allocate a small fraction (e.g., 10%) of trading capital to the new model's predictions.
- Full deployment. Once the new model demonstrates superior performance over a statistically significant number of predictions, switch fully.
class ABTestRunner:
    """Run A/B tests between model versions."""

    def __init__(self, model_a, model_b, allocation_b=0.1):
        self.model_a = model_a
        self.model_b = model_b
        self.allocation_b = allocation_b
        self.results_a = []
        self.results_b = []

    def get_prediction(self, features):
        """Get predictions from both models."""
        pred_a = self.model_a.predict(features)
        pred_b = self.model_b.predict(features)
        # Use model B for a fraction of decisions
        use_b = np.random.random() < self.allocation_b
        return {
            'active_prediction': pred_b if use_b else pred_a,
            'active_model': 'B' if use_b else 'A',
            'model_a_prediction': pred_a,
            'model_b_prediction': pred_b,
        }

    def log_outcome(self, prediction_record, outcome):
        """Record the resolved outcome against both models' predictions
        so that evaluate() has data to compare."""
        self.results_a.append((prediction_record['model_a_prediction'], outcome))
        self.results_b.append((prediction_record['model_b_prediction'], outcome))

    def evaluate(self):
        """Compare model performance."""
        if len(self.results_a) < 30 or len(self.results_b) < 30:
            return "Insufficient data for comparison"
        brier_a = np.mean([(p - o)**2 for p, o in self.results_a])
        brier_b = np.mean([(p - o)**2 for p, o in self.results_b])
        # Statistical test
        losses_a = np.array([(p - o)**2 for p, o in self.results_a])
        losses_b = np.array([(p - o)**2 for p, o in self.results_b])
        t_stat, p_value = stats.ttest_ind(losses_a, losses_b)
        return {
            'brier_a': brier_a,
            'brier_b': brier_b,
            'improvement': (brier_a - brier_b) / brier_a * 100,
            'p_value': p_value,
            'recommendation': 'Deploy B' if p_value < 0.05 and brier_b < brier_a else 'Keep A'
        }
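A minimal usage sketch; it assumes prod_model and candidate_model are objects whose predict(features) method returns a probability, and that the outcome is logged once the market resolves. The object and variable names are illustrative.

runner = ABTestRunner(model_a=prod_model, model_b=candidate_model, allocation_b=0.1)
record = runner.get_prediction(features)   # trade on record['active_prediction']
# ... once the market resolves (1 = YES, 0 = NO):
runner.log_outcome(record, outcome=1)
print(runner.evaluate())                   # "Insufficient data..." until 30+ resolved outcomes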
23.12 Chapter Summary
This chapter covered the complete machine learning pipeline for prediction market probability estimation:
- Algorithm selection. Random forests provide robust baselines with natural probability estimates. XGBoost and LightGBM offer superior predictive performance through gradient boosting with regularization. Neural networks excel when data is abundant or when combining multiple data types.
- The calibration imperative. Raw ML outputs are rarely well-calibrated. Platt scaling, isotonic regression, and temperature scaling transform raw outputs into reliable probabilities. In prediction markets, where you trade based on probability estimates, calibration is not optional; it is essential for profitability.
- Interpretability through SHAP. SHAP values provide a theoretically grounded, model-agnostic framework for understanding individual predictions and global feature importance. For trading applications, SHAP enables trust, debugging, and strategic insight.
- Feature engineering. The quality of input features often matters more than the choice of algorithm. Lag features, momentum, rolling statistics, and domain-specific interactions can expose signals hidden in raw data.
- Prediction market challenges. Small datasets, class imbalance, concept drift, and leakage are ever-present threats. Temporal splitting, regular recalibration, and rigorous monitoring are essential defenses.
- Deployment and monitoring. A deployed model requires serialization, a serving layer, continuous performance monitoring, and a retraining pipeline. A/B testing ensures that new models actually improve upon existing ones.
The overarching theme is disciplined application. Machine learning is powerful, but in the thin-margin world of prediction markets, undisciplined ML will lose money. Careful validation, calibration, and monitoring transform ML from a source of overconfident predictions into a genuine competitive advantage.
What's Next
In Chapter 24: Ensemble Methods and Model Stacking, we will explore how to combine multiple ML models into ensembles that outperform any individual model. We will cover simple averaging, weighted ensembles, stacking with a meta-learner, and blending strategies specifically designed for prediction market probability estimation. The techniques from this chapter — building diverse models with different algorithms, calibrating their outputs, and evaluating them rigorously — form the foundation for effective ensembling.