Chapter 22: Statistical Modeling — Regression and Time Series

In Chapter 21, we explored how to gather, clean, and structure data for prediction market analysis. Now we turn that prepared data into actionable models. Statistical modeling sits at the heart of quantitative prediction market trading: it lets us estimate event probabilities from observable features, forecast how market prices will evolve over time, and — most importantly — identify when a market's quoted price diverges from our model's estimate of fair value.

This chapter covers three pillars of statistical modeling as they apply to prediction markets. First, regression models — both linear and logistic — that translate features like polling data, economic indicators, and historical base rates into probability estimates. Second, time series models — ARIMA, GARCH, and state space approaches — that capture the temporal dynamics of market prices themselves. Third, validation frameworks — particularly walk-forward validation — that ensure our models generalize to unseen data without the subtle data leakage that plagues naive approaches.

By the end of this chapter, you will be able to build, validate, and deploy statistical models that produce calibrated probability estimates, and you will understand how to convert those estimates into trading signals.


22.1 Statistical Modeling for Prediction Markets

Why Statistical Models?

Prediction markets aggregate the beliefs of many participants into a single price that reflects the crowd's implied probability of an event. But crowds are not always right. Prices can be distorted by liquidity constraints, behavioral biases, information asymmetries, or simply by the slow diffusion of new information. A well-built statistical model provides an independent estimate of the true probability, giving us a benchmark against which to evaluate market prices.

Consider a prediction market contract on whether a particular candidate will win an election. The market price sits at $0.62, implying a 62% probability. Your logistic regression model, trained on polling data, economic indicators, and historical base rates, estimates the probability at 71%. That 9-percentage-point gap is your edge — a potential trading opportunity, provided the model is well-calibrated and the gap exceeds transaction costs.

Statistical models serve several roles in prediction market analysis:

  1. Probability estimation: Mapping observable features to event probabilities.
  2. Price forecasting: Predicting how market prices will move in the short term.
  3. Mispricing detection: Comparing model-derived probabilities to market-implied probabilities.
  4. Risk quantification: Understanding the uncertainty around our estimates.
  5. Signal generation: Converting model outputs into concrete trading decisions.

The Model as Probability Estimator

A statistical model for prediction markets is fundamentally a function that maps a vector of input features $\mathbf{x}$ to a probability $p$:

$$p = f(\mathbf{x}; \boldsymbol{\theta})$$

where $\boldsymbol{\theta}$ represents the model parameters learned from historical data. The quality of this mapping depends on three things: the choice of model family $f$, the quality and relevance of features $\mathbf{x}$, and the estimation procedure for $\boldsymbol{\theta}$.

For binary prediction markets (will an event happen or not?), the output must be a well-calibrated probability between 0 and 1. "Well-calibrated" means that among all events to which the model assigns probability 0.7, roughly 70% should actually occur. This calibration requirement shapes our model choices — it is why logistic regression, with its natural probabilistic output, is so well-suited to this domain.

Connecting to Market Pricing

The connection between statistical models and market pricing runs through the concept of implied probability. If a binary contract trades at price $P$, the market-implied probability is approximately $P$ (abstracting away from transaction costs and the bid-ask spread). Our model produces an independent probability estimate $\hat{p}$. The difference $\hat{p} - P$ represents our model's view of mispricing.

This is not merely academic. The entire framework of the Kelly criterion (Chapter 15) and position sizing depends on having a probability estimate that differs from the market's. Without a model, you are trading on intuition. With a model, you are trading on quantified information.

The Modeling Workflow

The statistical modeling workflow for prediction markets follows a disciplined sequence:

  1. Problem formulation: Define exactly what you are predicting (binary outcome, continuous spread, price direction).
  2. Feature engineering: Create informative predictors from raw data (Section 22.4).
  3. Model selection: Choose the appropriate model family (regression, time series, or hybrid).
  4. Training: Estimate model parameters from historical data.
  5. Validation: Test the model on held-out data using walk-forward validation (Section 22.9).
  6. Calibration: Ensure the model's probability outputs match observed frequencies.
  7. Deployment: Convert model outputs to trading signals (Section 22.11).
  8. Monitoring: Track model performance over time and retrain as needed.

Each step introduces opportunities for error. The most common mistakes — overfitting, lookahead bias, and miscalibration — can each destroy the profitability of a trading strategy. This chapter addresses each explicitly.


22.2 Linear Regression Review

Before diving into logistic regression, let us briefly review linear regression. While binary prediction markets call for logistic regression, linear regression remains useful for modeling continuous outcomes: point spreads in sports markets, trading volumes, implied volatility, and other quantities that do not have a natural 0-to-1 constraint.

OLS Basics

Ordinary least squares (OLS) linear regression models the relationship between a dependent variable $y$ and a vector of independent variables $\mathbf{x}$ as:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$$

where $\beta_0$ is the intercept, $\beta_1, \ldots, \beta_k$ are the slope coefficients, and $\epsilon$ is the error term. OLS finds the parameter values that minimize the sum of squared residuals:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2$$

The closed-form solution is $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$, though in practice we use numerically stable algorithms.
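As a quick sketch of that point, NumPy's least-squares solver recovers the coefficients from synthetic data without ever forming $(\mathbf{X}^T \mathbf{X})^{-1}$ explicitly:

import numpy as np

# Synthetic data: recover known coefficients with a stable solver
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
beta_true = np.array([1.0, 0.5, -0.3])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to [1.0, 0.5, -0.3]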

Interpreting Coefficients

Each coefficient $\beta_j$ tells us the expected change in $y$ for a one-unit increase in $x_j$, holding all other predictors constant. For prediction markets, this has direct economic meaning. If we are modeling the final vote share of a candidate and $\beta_{\text{polling}} = 0.85$, then a one-percentage-point increase in polling average is associated with a 0.85-percentage-point increase in the predicted vote share.

R-squared and Model Fit

The coefficient of determination $R^2$ measures the proportion of variance in $y$ explained by the model:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

In prediction market applications, do not expect high $R^2$ values. Markets are competitive and much information is already priced in. An $R^2$ of 0.05 to 0.15 on out-of-sample data can still be economically significant if it identifies systematic mispricings.

The adjusted R-squared penalizes for the number of predictors:

$$R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

This is a simple guard against overfitting: adding irrelevant predictors increases $R^2$ but decreases $R^2_{\text{adj}}$.

Key Assumptions

OLS relies on several assumptions:

  • Linearity: The relationship between $\mathbf{x}$ and $y$ is linear.
  • Independence: Observations are independent (often violated with time series data).
  • Homoscedasticity: Constant variance of residuals.
  • Normality: Residuals are normally distributed (for inference, not prediction).
  • No perfect multicollinearity: No predictor is a perfect linear combination of others.

In prediction market contexts, the independence assumption deserves special attention. Sequential market prices are autocorrelated, which violates independence and can produce misleadingly small standard errors. Newey-West standard errors or other heteroscedasticity-and-autocorrelation-consistent (HAC) estimators address this.

Python Implementation with statsmodels

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Example: modeling final margin from polling and economic data
data = pd.DataFrame({
    'polling_avg': [52.1, 48.3, 55.7, 47.2, 51.0, 49.8, 53.2, 46.5, 50.5, 54.1],
    'gdp_growth': [2.1, 1.8, 3.2, 0.5, 2.5, 1.2, 2.8, -0.3, 1.9, 3.5],
    'incumbency': [1, 0, 1, 0, 1, 0, 1, 0, 0, 1],
    'actual_margin': [3.5, -2.1, 7.2, -4.8, 1.5, -0.5, 4.2, -5.5, 0.8, 5.8]
})

X = data[['polling_avg', 'gdp_growth', 'incumbency']]
X = sm.add_constant(X)
y = data['actual_margin']

model = sm.OLS(y, X).fit(cov_type='HC3')  # Heteroscedasticity-robust SEs
print(model.summary())

# Prediction with confidence interval
new_data = pd.DataFrame({
    'const': [1.0],
    'polling_avg': [51.5],
    'gdp_growth': [2.0],
    'incumbency': [1]
})
prediction = model.get_prediction(new_data)
print(f"Predicted margin: {prediction.predicted_mean[0]:.2f}")
print(f"95% CI: {prediction.conf_int(alpha=0.05)[0]}")

When to Use Linear Regression in Prediction Markets

Linear regression is appropriate when the outcome variable is continuous and unbounded (or at least not constrained to [0, 1]). Suitable applications include:

  • Predicting the final vote share or margin in an election.
  • Modeling trading volume or liquidity in a prediction market.
  • Estimating point spreads in sports markets.
  • Forecasting economic indicators that are the subject of prediction markets.

For binary outcomes — the most common prediction market contract type — we need logistic regression.


22.3 Logistic Regression for Binary Outcomes

Logistic regression is the workhorse model for binary prediction markets. It naturally produces probability estimates between 0 and 1, has a solid theoretical foundation, is interpretable, and is relatively resistant to overfitting compared to more complex models.

The Problem with Linear Regression for Binary Outcomes

If we naively use linear regression to predict a binary outcome (0 or 1), several things go wrong. The predicted values can fall outside [0, 1], making them uninterpretable as probabilities. The error term cannot be normally distributed (the outcome is either 0 or 1). And the relationship between continuous predictors and a binary outcome is typically not linear — it follows an S-shaped curve.

Logistic regression solves these problems by modeling the log-odds of the outcome as a linear function of the predictors:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

The left side is the logit function — the natural logarithm of the odds. The odds $p / (1-p)$ transform a probability into a ratio, and the logarithm maps that ratio to the entire real line $(-\infty, +\infty)$. This means the right side — a linear combination of features — can take any value, and we can always invert it to get a valid probability.

Solving for $p$, we get the sigmoid function:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}} = \sigma(\mathbf{x}^T \boldsymbol{\beta})$$

The sigmoid function is the S-shaped curve that maps any real number to the interval $(0, 1)$.
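A short sketch makes this concrete; the coefficients here are hypothetical:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta = np.array([0.2, 0.12])   # hypothetical intercept and polling coefficient
x = np.array([1.0, 5.0])       # constant term plus a 5-point polling lead
print(sigmoid(x @ beta))       # sigmoid(0.8) is roughly 0.69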

Maximum Likelihood Estimation

Unlike linear regression, logistic regression cannot be solved in closed form. Instead, we use maximum likelihood estimation (MLE). For $n$ observations with binary outcomes $y_i \in \{0, 1\}$ and predicted probabilities $p_i$, the log-likelihood is:

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

This is the negative of the binary cross-entropy loss (also called log-loss). MLE finds the parameters $\hat{\boldsymbol{\beta}}$ that maximize this expression, typically using iteratively reweighted least squares (IRLS) or gradient-based optimization.

The log-loss metric is particularly meaningful for prediction markets because it directly penalizes poorly calibrated probabilities. A model that assigns probability 0.99 to an event that does not occur incurs a large penalty — exactly the kind of overconfidence that leads to catastrophic losses in trading.

Interpreting Coefficients as Log-Odds

In logistic regression, each coefficient $\beta_j$ represents the change in log-odds for a one-unit increase in $x_j$:

$$\Delta \log\left(\frac{p}{1-p}\right) = \beta_j \cdot \Delta x_j$$

Equivalently, $e^{\beta_j}$ is the odds ratio — the factor by which the odds are multiplied for a one-unit increase in $x_j$. If $\beta_j = 0.5$, then $e^{0.5} \approx 1.65$, meaning a one-unit increase in $x_j$ multiplies the odds by 1.65 (a 65% increase in odds).

For prediction market applications, this interpretation is directly useful. If $x_j$ is the polling average and $\beta_j = 0.12$, then each percentage-point increase in polling multiplies the odds of winning by $e^{0.12} \approx 1.13$, or a 13% increase in odds.

The Sigmoid Function in Detail

The sigmoid function $\sigma(z) = 1 / (1 + e^{-z})$ has several important properties:

  • $\sigma(0) = 0.5$ (when the linear predictor is zero, the probability is 50%)
  • $\sigma(z) \to 1$ as $z \to +\infty$
  • $\sigma(z) \to 0$ as $z \to -\infty$
  • $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ (the derivative, important for gradient calculations)
  • $\sigma(-z) = 1 - \sigma(z)$ (symmetry)

The steepness of the sigmoid at the midpoint determines how "decisive" the model is. A model with large coefficient magnitudes produces sharp transitions — it is very confident in its predictions. A model with small coefficients produces gentle transitions — it assigns moderate probabilities over a wide range of inputs.

Predicting Event Probabilities from Features

The core workflow for using logistic regression in prediction markets is:

  1. Collect historical data on events and their outcomes (win/lose, above/below threshold, etc.).
  2. Engineer features that are available before the outcome is determined.
  3. Fit the logistic regression model.
  4. For a new event, compute feature values and pass them through the model to get a probability.
  5. Compare that probability to the current market price.

Python Implementation

Using scikit-learn:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss, brier_score_loss

# Features: polling lead, GDP growth, incumbency, approval rating
X_train = np.array([
    [5.2, 2.1, 1, 52],
    [-1.3, 1.8, 0, 45],
    [8.1, 3.2, 1, 58],
    [-3.5, 0.5, 0, 41],
    [2.0, 2.5, 1, 49],
    [-0.8, 1.2, 0, 47],
    [4.5, 2.8, 1, 55],
    [-6.2, -0.3, 0, 38],
])
y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs')
model.fit(X_train_scaled, y_train)

# Predict probability for a new scenario
X_new = np.array([[3.0, 2.0, 1, 51]])
X_new_scaled = scaler.transform(X_new)
prob = model.predict_proba(X_new_scaled)[0, 1]
print(f"Predicted probability of winning: {prob:.3f}")

Using statsmodels (for detailed inference):

import statsmodels.api as sm

X_with_const = sm.add_constant(X_train)
logit_model = sm.Logit(y_train, X_with_const).fit(disp=0)
print(logit_model.summary())

# Marginal effects at the mean
marginal_effects = logit_model.get_margeff()
print(marginal_effects.summary())

The statsmodels implementation provides standard errors, p-values, confidence intervals, and marginal effects — all useful for understanding which features drive the predictions and how reliable the model is.


22.4 Feature Engineering for Prediction Markets

The quality of a statistical model depends at least as much on the quality of its features as on the sophistication of the algorithm. Feature engineering — the art and science of transforming raw data into informative predictors — is where domain knowledge about prediction markets translates into model performance.

Principles of Feature Engineering

Good features for prediction market models are:

  • Predictive: They carry information about the outcome.
  • Available in advance: They can be computed before the outcome is determined (no lookahead bias).
  • Timely: They reflect the most recent information.
  • Low-noise: They have high signal-to-noise ratio.
  • Orthogonal: They provide information not already captured by other features.

Polling-Based Features

For election prediction markets, polling data is typically the most informative feature source:

  • Polling average: Weighted average of recent polls (e.g., RealClearPolitics or FiveThirtyEight-style).
  • Polling trend: Change in polling average over the last 7, 14, or 30 days.
  • Polling volatility: Standard deviation of recent polls.
  • Poll quality: Weighted by pollster rating or historical accuracy.
  • Likely voter vs. registered voter gap: The difference between likely and registered voter polls.

Momentum Indicators

Momentum features capture the direction and speed of change:

  • Price momentum: Change in market price over a defined lookback window.
  • Volume momentum: Change in trading volume over time.
  • Moving average crossovers: When a short-term moving average crosses a longer-term one.
  • Rate of change: Percentage change in price over $n$ periods.

Volume and Liquidity Features

Market microstructure features can be informative:

  • Trading volume: Total contracts traded in a time window.
  • Volume ratio: Current volume relative to historical average.
  • Bid-ask spread: A proxy for liquidity and information asymmetry.
  • Order book imbalance: Ratio of buy to sell orders near the current price.

Time-Based Features

The passage of time affects prediction markets in systematic ways:

  • Time to expiry: Days or hours until the contract resolves.
  • Time decay rate: How quickly uncertainty resolves (faster near expiry).
  • Calendar effects: Day of week, month, proximity to key events.
  • Event schedule: Distance to the next information release (debate, earnings, report).

Cross-Market Features

Information from related markets can improve predictions:

  • Correlated contract prices: Prices of related prediction market contracts.
  • Financial market indicators: VIX, sector ETF returns, interest rates.
  • Sentiment indicators: Social media sentiment scores, Google Trends data.
  • Betting market odds: Odds from traditional bookmakers for sports markets.

Historical Base Rates

Historical base rates provide essential prior information:

  • Incumbency advantage: Historical win rate of incumbents.
  • Home team advantage: Win rate of home teams in specific sports.
  • Seasonal patterns: Base rates that vary by season or cycle.
  • Reversion rates: How often extreme market prices revert to the mean.

Python Feature Engineering Pipeline

import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class PredictionMarketFeatureEngineer(BaseEstimator, TransformerMixin):
    """Feature engineering for prediction market models."""

    def __init__(self, lookback_windows=[7, 14, 30]):
        self.lookback_windows = lookback_windows

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        df = X.copy()
        features = pd.DataFrame(index=df.index)

        # Polling features
        if 'polling_avg' in df.columns:
            features['polling_avg'] = df['polling_avg']
            for window in self.lookback_windows:
                col_name = f'polling_trend_{window}d'
                features[col_name] = df['polling_avg'].diff(window)
            features['polling_volatility'] = (
                df['polling_avg'].rolling(14).std()
            )

        # Market price features
        if 'market_price' in df.columns:
            for window in self.lookback_windows:
                features[f'price_momentum_{window}d'] = (
                    df['market_price'].diff(window)
                )
                features[f'price_ma_{window}d'] = (
                    df['market_price'].rolling(window).mean()
                )
            features['price_volatility_14d'] = (
                df['market_price'].rolling(14).std()
            )
            # Mean reversion signal
            features['price_zscore_30d'] = (
                (df['market_price'] - df['market_price'].rolling(30).mean())
                / df['market_price'].rolling(30).std()
            )

        # Volume features
        if 'volume' in df.columns:
            features['log_volume'] = np.log1p(df['volume'])
            features['volume_ratio_14d'] = (
                df['volume'] / df['volume'].rolling(14).mean()
            )
            features['volume_trend_7d'] = df['volume'].diff(7)

        # Time features
        if 'days_to_expiry' in df.columns:
            features['days_to_expiry'] = df['days_to_expiry']
            features['log_days_to_expiry'] = np.log1p(df['days_to_expiry'])
            features['inv_days_to_expiry'] = 1.0 / (df['days_to_expiry'] + 1)

        # Cross-market features
        if 'related_price' in df.columns:
            features['price_spread'] = (
                df['market_price'] - df['related_price']
            )
            features['price_ratio'] = (
                df['market_price'] / (df['related_price'] + 0.001)
            )

        # Drop rows with NaN from rolling calculations
        features = features.dropna()
        return features


# Usage example
pipeline = Pipeline([
    ('feature_engineer', PredictionMarketFeatureEngineer()),
    # Further steps: scaling, model, etc.
])

Handling Missing Data

Prediction market data often has gaps — missing polls, halted trading, or incomplete historical records. Strategies include:

  • Forward fill: Use the last known value (appropriate for slow-moving features like polling).
  • Interpolation: Linear or spline interpolation for smoothly varying features.
  • Indicator variables: Create a binary flag indicating missingness, then impute with a neutral value.
  • Model-based imputation: Use the relationships among features to impute missing values.

The choice of imputation method matters. Forward-filling a volatile feature like market price introduces stale information. Model-based imputation is more principled but adds complexity and potential for information leakage during cross-validation.
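A minimal sketch of the forward-fill and indicator-variable strategies (column names are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'polling_avg': [51.0, np.nan, np.nan, 49.5, 50.2]})

# Flag missingness first, then carry the last known value forward
df['polling_missing'] = df['polling_avg'].isna().astype(int)
df['polling_avg'] = df['polling_avg'].ffill()
print(df)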


22.5 Regularization: Ridge, Lasso, and Elastic Net

As we add more features to our logistic regression model, the risk of overfitting increases. Regularization adds a penalty term to the loss function that discourages large coefficient values, effectively trading a small increase in bias for a large decrease in variance.

The Overfitting Problem

Overfitting occurs when a model learns the noise in the training data rather than the underlying signal. Signs of overfitting include:

  • Excellent performance on training data but poor performance on validation data.
  • Large coefficient magnitudes.
  • Highly variable predictions when small changes are made to the training data.
  • The model "memorizes" individual training examples rather than learning general patterns.

In prediction markets, overfitting is especially dangerous because:

  • Sample sizes are often small (limited number of historical elections, seasons, etc.).
  • The number of potential features is large (many polls, indicators, derived features).
  • The cost of false confidence is real money lost in trading.

L2 Regularization (Ridge)

Ridge regression adds an L2 penalty — the sum of squared coefficients — to the loss function:

$$\mathcal{L}_{\text{Ridge}} = -\ell(\boldsymbol{\beta}) + \lambda \sum_{j=1}^{k} \beta_j^2$$

The hyperparameter $\lambda$ (or equivalently $C = 1/\lambda$ in scikit-learn) controls the strength of regularization. Larger $\lambda$ shrinks coefficients more aggressively toward zero.

Ridge regression has several desirable properties:

  • It keeps all features in the model but reduces their influence.
  • It handles multicollinearity gracefully.
  • The solution is always unique (unlike unregularized logistic regression with separable data).

L1 Regularization (Lasso)

Lasso adds an L1 penalty — the sum of absolute coefficient values:

$$\mathcal{L}_{\text{Lasso}} = -\ell(\boldsymbol{\beta}) + \lambda \sum_{j=1}^{k} |\beta_j|$$

The critical difference from Ridge is that Lasso can set coefficients exactly to zero, performing automatic feature selection. This is valuable when you suspect many features are irrelevant and want the model to identify the important ones.

Elastic Net

Elastic Net combines L1 and L2 penalties:

$$\mathcal{L}_{\text{EN}} = -\ell(\boldsymbol{\beta}) + \lambda \left[ \alpha \sum_{j=1}^{k} |\beta_j| + \frac{1-\alpha}{2} \sum_{j=1}^{k} \beta_j^2 \right]$$

The parameter $\alpha$ controls the mix: $\alpha = 1$ is pure Lasso, $\alpha = 0$ is pure Ridge. Elastic Net inherits the feature selection property of Lasso while handling correlated features more gracefully (Lasso tends to arbitrarily select one among correlated features; Elastic Net can keep groups).

Hyperparameter Tuning

The regularization strength $\lambda$ (or $C$) must be chosen carefully. The standard approach is cross-validation, but for time series data we must use time-respecting cross-validation (see Section 22.9).

import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit

# IMPORTANT: Scale features before regularization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Time-series-aware cross-validation for hyperparameter tuning
tscv = TimeSeriesSplit(n_splits=5)

model_ridge = LogisticRegressionCV(
    penalty='l2',
    Cs=np.logspace(-4, 4, 20),
    cv=tscv,
    scoring='neg_log_loss',
    solver='lbfgs',
    max_iter=1000
)
model_ridge.fit(X_scaled, y_train)
print(f"Best C (Ridge): {model_ridge.C_[0]:.4f}")

model_lasso = LogisticRegressionCV(
    penalty='l1',
    Cs=np.logspace(-4, 4, 20),
    cv=tscv,
    scoring='neg_log_loss',
    solver='saga',
    max_iter=5000
)
model_lasso.fit(X_scaled, y_train)
print(f"Best C (Lasso): {model_lasso.C_[0]:.4f}")
print(f"Non-zero coefficients: {np.sum(model_lasso.coef_ != 0)}")

# Elastic Net
from sklearn.linear_model import SGDClassifier
model_en = SGDClassifier(
    loss='log_loss',
    penalty='elasticnet',
    l1_ratio=0.5,
    alpha=0.001,
    max_iter=1000
)
model_en.fit(X_scaled, y_train)

Practical Guidance

For prediction market models:

  • Start with Ridge if you believe most features are relevant and want to reduce variance without dropping features.
  • Use Lasso if you have many features and want automatic feature selection.
  • Use Elastic Net if features are correlated (common with multiple polling sources or related market prices).
  • Always standardize features before applying regularization — otherwise, the penalty treats features on different scales unequally.
  • Use time-series cross-validation for hyperparameter tuning, not random k-fold.

22.6 Time Series Fundamentals

While logistic regression treats each observation as independent, prediction market prices form a time series — a sequence of values indexed by time. Time series models exploit the temporal structure to forecast future values. Before building time series models, we need to understand several foundational concepts.

Stationarity

A time series is stationary if its statistical properties — mean, variance, and autocovariance structure — do not change over time. Formally, a strictly stationary process has the property that the joint distribution of $(X_{t_1}, \ldots, X_{t_k})$ is the same as $(X_{t_1+h}, \ldots, X_{t_k+h})$ for any shift $h$.

In practice, we work with weak stationarity (or covariance stationarity), which requires:

  1. Constant mean: $E[X_t] = \mu$ for all $t$.
  2. Constant variance: $\text{Var}(X_t) = \sigma^2$ for all $t$.
  3. Autocovariance depends only on lag: $\text{Cov}(X_t, X_{t+h}) = \gamma(h)$ for all $t$.

Stationarity matters because most time series models assume it. A model fit to a stationary series will produce stable, reliable forecasts. A model fit to a non-stationary series may produce forecasts that drift or explode.

Prediction market prices are typically non-stationary: they trend toward 0 or 1 as the event resolution date approaches. However, price changes (returns) or appropriately differenced prices may be approximately stationary.

Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF)

The autocorrelation function measures the correlation between a time series and its lagged values:

$$\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \text{Corr}(X_t, X_{t+h})$$

The ACF at lag $h$ includes both the direct effect of lag $h$ and the indirect effects through intermediate lags.

The partial autocorrelation function isolates the direct effect of lag $h$ after removing the influence of intermediate lags. Formally, $\text{PACF}(h)$ is the correlation between $X_t$ and $X_{t+h}$ after removing the linear effect of $X_{t+1}, \ldots, X_{t+h-1}$.

The ACF and PACF are our primary diagnostic tools for identifying appropriate time series models:

  ACF Pattern              PACF Pattern             Suggested Model
  Tails off (decays)       Cuts off after lag $p$   AR($p$)
  Cuts off after lag $q$   Tails off (decays)       MA($q$)
  Tails off                Tails off                ARMA($p$, $q$)

Differencing

If a time series is non-stationary due to a trend, we can often achieve stationarity by differencing — computing the change between consecutive values:

$$\Delta X_t = X_t - X_{t-1}$$

A series that becomes stationary after one round of differencing is said to be integrated of order 1, written $I(1)$. If two rounds are needed, it is $I(2)$. Most financial and prediction market time series require at most one or two rounds of differencing.

For prediction market prices, first differencing $\Delta P_t = P_t - P_{t-1}$ gives us the price change, which is often approximately stationary (at least during periods away from expiry).

The Augmented Dickey-Fuller Test

The Augmented Dickey-Fuller (ADF) test is a formal statistical test for stationarity. The null hypothesis is that the series has a unit root (is non-stationary). A low p-value (typically $< 0.05$) leads us to reject the null and conclude the series is stationary.

import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller, acf, pacf
import matplotlib.pyplot as plt

# Simulate prediction market price data
np.random.seed(42)
n = 200
price = 0.5 + np.cumsum(np.random.normal(0, 0.01, n))
price = np.clip(price, 0.01, 0.99)  # Keep in valid range

# ADF test on levels
result_level = adfuller(price, autolag='AIC')
print(f"ADF Statistic (levels): {result_level[0]:.4f}")
print(f"p-value (levels): {result_level[1]:.4f}")

# ADF test on first differences
price_diff = np.diff(price)
result_diff = adfuller(price_diff, autolag='AIC')
print(f"ADF Statistic (diff): {result_diff[0]:.4f}")
print(f"p-value (diff): {result_diff[1]:.4f}")

# Plot ACF and PACF
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

axes[0, 0].plot(price)
axes[0, 0].set_title('Market Price (Levels)')

axes[0, 1].plot(price_diff)
axes[0, 1].set_title('Price Changes (First Difference)')

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(price_diff, ax=axes[1, 0], lags=20)
axes[1, 0].set_title('ACF of Price Changes')

plot_pacf(price_diff, ax=axes[1, 1], lags=20)
axes[1, 1].set_title('PACF of Price Changes')

plt.tight_layout()
plt.savefig('time_series_diagnostics.png', dpi=150)
plt.show()

Trend and Seasonality

In traditional time series analysis, we decompose a series into trend, seasonal, and residual components. For prediction markets, the "seasonality" concept maps differently:

  • Trend: The gradual resolution of uncertainty as the event approaches — prices trend toward 0 or 1.
  • Cyclical patterns: Day-of-week effects in trading activity, response to regularly scheduled information releases (e.g., weekly polling releases).
  • Residual: The unpredictable component that contains potential trading opportunities.

Decomposition can be additive ($X_t = T_t + S_t + R_t$) or multiplicative ($X_t = T_t \times S_t \times R_t$). For prediction market prices bounded between 0 and 1, the additive model is more appropriate, though working with log-odds transforms can make the multiplicative model relevant.
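A minimal sketch of the log-odds transform and its inverse, which also reappears in Section 22.7 as a fix for ARIMA's bounded-price problem:

import numpy as np

def to_log_odds(p, eps=1e-4):
    """Map prices in (0, 1) to the real line; eps guards against 0 and 1."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def from_log_odds(z):
    """Invert the transform back to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))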


22.7 ARIMA Models

The ARIMA (Autoregressive Integrated Moving Average) model is the foundational time series model. It combines three components: autoregression (AR), differencing (I), and moving average (MA).

AR (Autoregressive) Component

An AR($p$) model expresses the current value as a linear combination of $p$ past values:

$$X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + \epsilon_t$$

where $c$ is a constant, $\phi_1, \ldots, \phi_p$ are the autoregressive coefficients, and $\epsilon_t$ is white noise. The model is stationary when the roots of the characteristic polynomial $1 - \phi_1 z - \cdots - \phi_p z^p$ lie outside the unit circle.

For prediction markets, an AR(1) model on price changes says that today's price change depends on yesterday's:

$$\Delta P_t = c + \phi_1 \Delta P_{t-1} + \epsilon_t$$

If $\phi_1 > 0$, there is positive momentum (price changes tend to continue). If $\phi_1 < 0$, there is mean reversion (price changes tend to reverse). In efficient markets, we would expect $\phi_1 \approx 0$, but prediction markets often exhibit both momentum and mean reversion at different frequencies.
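A short sketch of that diagnostic, assuming prices is a market price series as in the surrounding examples:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

price_changes = np.diff(prices)
ar1 = ARIMA(price_changes, order=(1, 0, 0)).fit()

phi1 = ar1.arparams[0]  # the estimated AR(1) coefficient
print(f"phi_1 = {phi1:.3f} ({'momentum' if phi1 > 0 else 'mean reversion'})")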

MA (Moving Average) Component

An MA($q$) model expresses the current value as a linear combination of $q$ past error terms:

$$X_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \cdots + \theta_q \epsilon_{t-q}$$

The MA component captures short-term dependencies and the effect of past "shocks" on the current value. An MA(1) model on prediction market price changes means that today's change is influenced by the unexpected component of yesterday's change.

ARMA and ARIMA

An ARMA($p$, $q$) model combines both:

$$X_t = c + \sum_{i=1}^{p} \phi_i X_{t-i} + \epsilon_t + \sum_{j=1}^{q} \theta_j \epsilon_{t-j}$$

An ARIMA($p$, $d$, $q$) model adds $d$ rounds of differencing. If the original series $X_t$ is non-stationary but $\Delta^d X_t$ is stationary, we fit an ARMA($p$, $q$) model to the differenced series. ARIMA(1, 1, 0) is an AR(1) model on the first differences.

Model Identification with ACF/PACF

The Box-Jenkins methodology for ARIMA model identification follows these steps:

  1. Check stationarity: Use the ADF test and visual inspection. Difference if necessary (this determines $d$).
  2. Examine ACF and PACF of the stationary series:
     • ACF cuts off after lag $q$, PACF tails off: use MA($q$).
     • PACF cuts off after lag $p$, ACF tails off: use AR($p$).
     • Both tail off: use ARMA($p$, $q$) with small $p$ and $q$.
  3. Fit the model and check residual diagnostics.
  4. Compare models using information criteria (AIC, BIC).

Fitting and Diagnostics

After fitting an ARIMA model, we check:

  • Residual ACF: Residuals should be approximately white noise (no significant autocorrelation).
  • Ljung-Box test: Formal test for autocorrelation in residuals.
  • Residual normality: QQ plot or Jarque-Bera test.
  • Residual heteroscedasticity: If variance changes over time, consider GARCH (Section 22.8).

Python ARIMA Pipeline

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.stats.diagnostic import acorr_ljungbox
import warnings
warnings.filterwarnings('ignore')

def fit_arima_pipeline(prices, max_p=5, max_q=5):
    """
    Complete ARIMA pipeline for prediction market prices.

    Parameters
    ----------
    prices : array-like
        Time series of prediction market prices.
    max_p : int
        Maximum AR order to consider.
    max_q : int
        Maximum MA order to consider.

    Returns
    -------
    best_model : fitted ARIMA model
    results : dict with model details
    """
    prices = np.asarray(prices, dtype=float)

    # Step 1: Determine differencing order
    d = 0
    series = prices.copy()
    for diff_order in range(3):
        result = adfuller(series, autolag='AIC')
        if result[1] < 0.05:
            d = diff_order
            break
        series = np.diff(series)
        d = diff_order + 1

    print(f"Differencing order d = {d}")

    # Step 2: Grid search over p and q using AIC
    best_aic = np.inf
    best_order = (0, d, 0)
    best_model = None
    results_table = []

    for p in range(max_p + 1):
        for q in range(max_q + 1):
            if p == 0 and q == 0:
                continue
            try:
                model = ARIMA(prices, order=(p, d, q))
                fitted = model.fit()
                results_table.append({
                    'order': (p, d, q),
                    'aic': fitted.aic,
                    'bic': fitted.bic
                })
                if fitted.aic < best_aic:
                    best_aic = fitted.aic
                    best_order = (p, d, q)
                    best_model = fitted
            except Exception:
                continue

    print(f"Best order: ARIMA{best_order} (AIC={best_aic:.2f})")

    # Step 3: Diagnostics
    residuals = best_model.resid
    lb_test = acorr_ljungbox(residuals, lags=[10, 20], return_df=True)
    print(f"\nLjung-Box test p-values:\n{lb_test}")

    # Step 4: Forecast
    forecast = best_model.forecast(steps=5)
    forecast_ci = best_model.get_forecast(steps=5).conf_int()

    return best_model, {
        'order': best_order,
        'aic': best_aic,
        'forecast': forecast,
        'forecast_ci': forecast_ci,
        'residuals': residuals,
        'results_table': pd.DataFrame(results_table)
    }

Limitations of ARIMA for Prediction Markets

ARIMA models have several limitations when applied to prediction markets:

  1. Bounded prices: ARIMA forecasts can exceed [0, 1]. A log-odds transform before modeling helps.
  2. Non-constant volatility: Volatility changes as expiry approaches. GARCH models address this.
  3. Event-driven jumps: Sudden information shocks (e.g., a debate gaffe) cause discontinuities that ARIMA cannot anticipate.
  4. Linear structure: ARIMA is a linear model. Non-linear dynamics require extensions or different model families.
  5. Short horizons: ARIMA forecasts degrade quickly. They are most useful for short-term (1 to 5 step) forecasts.

Despite these limitations, ARIMA provides a principled baseline for time series forecasting in prediction markets, and its residuals reveal patterns that more sophisticated models might exploit.


22.8 Advanced Time Series: GARCH and State Space Models

Beyond ARIMA, two families of models are particularly useful for prediction markets: GARCH models for volatility dynamics and state space models for regime switching.

Volatility Modeling with GARCH

The GARCH (Generalized Autoregressive Conditional Heteroscedasticity) model captures the phenomenon that volatility clusters — periods of high volatility tend to be followed by more high volatility, and calm periods by more calm periods.

In a GARCH(1,1) model, the conditional variance of the error term follows:

$$\sigma_t^2 = \omega + \alpha \epsilon_{t-1}^2 + \beta \sigma_{t-1}^2$$

where $\omega > 0$, $\alpha \geq 0$, $\beta \geq 0$, and $\alpha + \beta < 1$ for stationarity. The parameters have intuitive interpretations:

  • $\omega$: The long-run variance baseline.
  • $\alpha$: The reaction to recent shocks (higher $\alpha$ = more responsive to surprises).
  • $\beta$: The persistence of volatility (higher $\beta$ = slower decay of volatility).
  • $\alpha + \beta$: The persistence of the process (closer to 1 = longer memory).

For prediction markets, GARCH is valuable because:

  • It quantifies the current level of uncertainty, which affects position sizing.
  • It detects volatility regimes (high-information vs. low-information periods).
  • It provides better confidence intervals for forecasts than constant-variance models.

Python implementation:

from arch import arch_model
import numpy as np

# `prices` is a series of market prices in (0, 1); differencing the
# log-odds gives an unbounded return series suitable for GARCH
returns = np.diff(np.log(prices / (1 - prices)))

# Fit GARCH(1,1)
garch = arch_model(returns * 100, vol='Garch', p=1, q=1,
                    mean='AR', lags=1, dist='t')
result = garch.fit(disp='off')
print(result.summary())

# Extract conditional volatility
cond_vol = result.conditional_volatility

# Forecast volatility
forecast = result.forecast(horizon=5)
print("Forecasted variance (next 5 steps):")
print(forecast.variance.iloc[-1])

State Space Models and Regime Switching

State space models represent a time series as a system with an observable component and a hidden (latent) state that evolves over time. The general form is:

Observation equation: $y_t = Z_t \alpha_t + \epsilon_t$

State equation: $\alpha_{t+1} = T_t \alpha_t + R_t \eta_t$

where $\alpha_t$ is the latent state, $Z_t$ and $T_t$ are system matrices, and $\epsilon_t$ and $\eta_t$ are noise terms. The Kalman filter provides optimal estimates of the hidden state.

For prediction markets, state space models are useful for:

  • Local level models: Tracking the "true" probability as it evolves, filtering out market noise (see the sketch after this list).
  • Regime switching: Detecting when the market enters different regimes (e.g., trending vs. mean-reverting).
  • Time-varying parameters: Allowing model parameters to change over time.
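A minimal local level sketch with statsmodels, assuming prices is a market price series; the Kalman smoother estimates the latent probability path:

import statsmodels.api as sm

uc = sm.tsa.UnobservedComponents(prices, level='local level')
uc_fit = uc.fit(disp=False)

# Kalman-smoothed estimate of the latent "true" probability
smoothed_level = uc_fit.smoothed_state[0]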

Regime-switching model example:

import pandas as pd
from statsmodels.tsa.regime_switching.markov_regression import MarkovRegression

# `price_changes`: first differences of a market price series, kept as a
# pandas Series so the smoothed probabilities below support .iloc indexing
price_changes = pd.Series(prices).diff().dropna()

# Fit a 2-regime model to price changes
model = MarkovRegression(
    price_changes,
    k_regimes=2,
    trend='c',
    switching_variance=True
)
result = model.fit()
print(result.summary())

# Smoothed regime probabilities
regime_probs = result.smoothed_marginal_probabilities
print("Current regime 0 probability:", regime_probs[0].iloc[-1])
print("Current regime 1 probability:", regime_probs[1].iloc[-1])

A common finding is that prediction markets alternate between a "quiet" regime (small price changes, low volatility) and an "active" regime (large price changes, high volatility). The regime-switching model identifies which regime we are currently in, which is valuable for adjusting trading strategies and position sizes.

When to Use Each Model

  Model              Use When                             Prediction Market Application
  ARIMA              Price changes show autocorrelation   Short-term price forecasting
  GARCH              Volatility is time-varying           Risk management, position sizing
  State Space        True probability changes smoothly    Filtering noise from market prices
  Markov Switching   Market has distinct regimes          Strategy selection, regime detection

22.9 Walk-Forward Validation

Validation is the most critical step in the modeling workflow. A model that looks good on training data but fails on unseen data is worse than useless — it gives false confidence that leads to losses. For time series data, standard cross-validation is inappropriate and can produce dangerously optimistic results. Walk-forward validation is the correct approach.

Why Standard Cross-Validation Fails for Time Series

Standard k-fold cross-validation randomly splits data into training and validation sets. For time series, this creates a fundamental problem: lookahead bias. If a random fold puts January data in the test set and March data in the training set, the model is being trained on future information to predict the past. This leakage inflates performance estimates.

Consider a prediction market model trained on 2020 data and tested on a random subset of 2018-2022 data. The model effectively "knows the future" — 2020 training data helps predict 2019 test data. In production, this never happens. The model only sees the past when making real predictions.

Walk-Forward (Rolling Window) Validation

Walk-forward validation respects the temporal ordering of data. At each step:

  1. Train the model on data up to time $t$.
  2. Predict for time $t + 1$ (or the next $h$ steps).
  3. Record the prediction and the actual outcome.
  4. Advance the window and repeat.

There are several variants:

Expanding window: The training set grows with each step. Train on $[1, t]$, predict $t+1$. Then train on $[1, t+1]$, predict $t+2$. This uses all available historical data at each step.

Rolling window: The training set has a fixed size $W$. Train on $[t-W+1, t]$, predict $t+1$. This prevents the model from being influenced by very old data that may no longer be relevant.

Anchored walk-forward: A compromise — the start is fixed but the end advances. Train on $[1, t]$ but only use data from a fixed point onward.

Avoiding Lookahead Bias

Lookahead bias is the most insidious error in time series modeling. It can creep in at every stage:

  • Feature computation: Using future data to compute a moving average or normalization statistic.
  • Feature selection: Selecting features based on their correlation with the full dataset (including future data).
  • Hyperparameter tuning: Tuning hyperparameters on data that overlaps with the test period.
  • Data preprocessing: Fitting scalers or imputers on the full dataset before splitting.

To avoid lookahead bias:

  1. Compute all features using only past data at each point in time.
  2. Fit preprocessing steps (scalers, imputers) only on the training portion of each walk-forward fold.
  3. Perform hyperparameter tuning within the training portion, using an inner walk-forward loop.
  4. Never use the test set for any decision — not feature selection, not model selection, not hyperparameter tuning.
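One practical guard for step 2 is to bundle preprocessing and model into a single pipeline, so that each fold's scaler is fit on that fold's training data only. A sketch:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Refit per fold: the scaler only ever sees that fold's training data
fold_model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(penalty='l2', C=1.0, max_iter=1000)),
])
# Inside each walk-forward fold:
#   fold_model.fit(X_train, y_train)
#   probs = fold_model.predict_proba(X_test)[:, 1]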

Python Walk-Forward Framework

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.metrics import log_loss, brier_score_loss
from typing import List, Tuple, Optional

class WalkForwardValidator:
    """
    Walk-forward validation framework for prediction market models.

    Supports expanding window, rolling window, and anchored approaches.
    """

    def __init__(
        self,
        min_train_size: int = 100,
        step_size: int = 1,
        window_type: str = 'expanding',
        max_train_size: Optional[int] = None
    ):
        """
        Parameters
        ----------
        min_train_size : int
            Minimum number of observations in the training set.
        step_size : int
            Number of steps to advance the window at each iteration.
        window_type : str
            'expanding' (growing window) or 'rolling' (fixed window).
        max_train_size : int, optional
            Maximum training set size (for rolling window).
        """
        self.min_train_size = min_train_size
        self.step_size = step_size
        self.window_type = window_type
        self.max_train_size = max_train_size

    def split(self, n: int) -> List[Tuple[np.ndarray, np.ndarray]]:
        """Generate train/test index splits."""
        splits = []
        for test_start in range(self.min_train_size, n, self.step_size):
            test_end = min(test_start + self.step_size, n)

            if self.window_type == 'rolling' and self.max_train_size:
                train_start = max(0, test_start - self.max_train_size)
            else:
                train_start = 0

            train_idx = np.arange(train_start, test_start)
            test_idx = np.arange(test_start, test_end)
            splits.append((train_idx, test_idx))

        return splits

    def validate(
        self,
        model: BaseEstimator,
        X: np.ndarray,
        y: np.ndarray,
        fit_params: dict = None
    ) -> dict:
        """
        Run walk-forward validation.

        Returns
        -------
        dict with predictions, actuals, and metrics.
        """
        if fit_params is None:
            fit_params = {}

        n = len(X)
        splits = self.split(n)

        all_preds = []
        all_actuals = []
        all_test_indices = []

        for train_idx, test_idx in splits:
            X_train = X[train_idx]
            y_train = y[train_idx]
            X_test = X[test_idx]
            y_test = y[test_idx]

            # Fit model on training data only
            model.fit(X_train, y_train, **fit_params)

            # Predict probabilities
            if hasattr(model, 'predict_proba'):
                probs = model.predict_proba(X_test)[:, 1]
            else:
                probs = model.predict(X_test)

            all_preds.extend(probs)
            all_actuals.extend(y_test)
            all_test_indices.extend(test_idx)

        preds = np.array(all_preds)
        actuals = np.array(all_actuals)

        # Clip predictions to avoid log(0)
        preds_clipped = np.clip(preds, 1e-7, 1 - 1e-7)

        metrics = {
            'log_loss': log_loss(actuals, preds_clipped),
            'brier_score': brier_score_loss(actuals, preds_clipped),
            'accuracy': np.mean((preds > 0.5) == actuals),
            'n_predictions': len(preds),
            'n_splits': len(splits),
        }

        return {
            'predictions': preds,
            'actuals': actuals,
            'test_indices': np.array(all_test_indices),
            'metrics': metrics
        }

    def validate_with_refit_schedule(
        self,
        model: BaseEstimator,
        X: np.ndarray,
        y: np.ndarray,
        refit_every: int = 10
    ) -> dict:
        """
        Walk-forward validation with periodic refitting.
        More realistic: models are not refit on every new observation.
        """
        n = len(X)
        splits = self.split(n)

        all_preds = []
        all_actuals = []
        fitted_model = None
        last_fit_idx = -1

        for i, (train_idx, test_idx) in enumerate(splits):
            if i % refit_every == 0 or fitted_model is None:
                X_train = X[train_idx]
                y_train = y[train_idx]
                model.fit(X_train, y_train)
                fitted_model = model
                last_fit_idx = i

            X_test = X[test_idx]
            y_test = y[test_idx]

            if hasattr(fitted_model, 'predict_proba'):
                probs = fitted_model.predict_proba(X_test)[:, 1]
            else:
                probs = fitted_model.predict(X_test)

            all_preds.extend(probs)
            all_actuals.extend(y_test)

        preds = np.array(all_preds)
        actuals = np.array(all_actuals)
        preds_clipped = np.clip(preds, 1e-7, 1 - 1e-7)

        return {
            'predictions': preds,
            'actuals': actuals,
            'metrics': {
                'log_loss': log_loss(actuals, preds_clipped),
                'brier_score': brier_score_loss(actuals, preds_clipped),
                'accuracy': np.mean((preds > 0.5) == actuals),
                'n_refits': len(splits) // refit_every + 1,
            }
        }

Choosing Window Parameters

The choice of window type and size involves trade-offs:

  • Expanding window maximizes the amount of training data at each step. This is best when the underlying relationship is stable over time.
  • Rolling window limits the training data to recent observations. This is best when the relationship changes over time (concept drift).
  • Minimum training size must be large enough for the model to learn reliably. For logistic regression with $k$ features, a rough guideline is $10k$ to $20k$ observations, i.e., 10 to 20 observations per feature.
  • Step size determines the granularity of validation. Step size 1 gives the most detailed picture but is computationally expensive.

For prediction markets, concept drift is common — the features that predict outcomes in one election cycle may not work in the next. A rolling window of 2 to 4 years of data is often a good starting point for political markets, while sports markets may use a single season or a rolling window of 50 to 100 games.


22.10 Model Selection and Comparison

With multiple candidate models — different feature sets, regularization strengths, ARIMA orders, or model families — we need principled methods for selecting the best one.

Information Criteria: AIC and BIC

The Akaike Information Criterion (AIC) balances model fit and complexity:

$$\text{AIC} = -2\ell(\hat{\boldsymbol{\theta}}) + 2k$$

where $\ell(\hat{\boldsymbol{\theta}})$ is the maximized log-likelihood and $k$ is the number of parameters. Lower AIC is better. AIC penalizes complexity (more parameters) to discourage overfitting.

The Bayesian Information Criterion (BIC) applies a stronger penalty:

$$\text{BIC} = -2\ell(\hat{\boldsymbol{\theta}}) + k \ln(n)$$

where $n$ is the sample size. BIC penalizes complexity more heavily than AIC (for $n \geq 8$), favoring simpler models. In practice:

  • AIC tends to select slightly more complex models.
  • BIC tends to select simpler models.
  • For prediction (our primary goal), AIC is often preferred.
  • For identifying the "true" model, BIC is consistent (selects the true model as $n \to \infty$).

Cross-Validated Log-Loss

For prediction market models, the most directly relevant metric is out-of-sample log-loss (also called cross-entropy):

$$\text{Log-loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1-y_i) \log(1-\hat{p}_i) \right]$$

This measures the quality of probability estimates, not just classification accuracy. A model with log-loss 0.65 is better calibrated than one with log-loss 0.72, even if they have the same classification accuracy.

The Brier score is another proper scoring rule:

$$\text{Brier} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2$$

Brier score ranges from 0 (perfect) to 1 (worst possible), and it rewards both calibration and resolution (the ability to discriminate between events and non-events).
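A toy comparison with hypothetical numbers: both prediction sets are 75% accurate at a 0.5 threshold, but the set with a confident miss scores much worse on both proper scoring rules:

import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

y_true = np.array([1, 0, 1, 0])
p_moderate = np.array([0.7, 0.3, 0.7, 0.6])        # one moderate miss
p_overconfident = np.array([0.9, 0.1, 0.9, 0.95])  # one confident miss

print(log_loss(y_true, p_moderate), log_loss(y_true, p_overconfident))
print(brier_score_loss(y_true, p_moderate),
      brier_score_loss(y_true, p_overconfident))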

Comparing Multiple Models

A systematic model comparison framework:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss, brier_score_loss

def compare_models(models, model_names, X, y, validator):
    """
    Compare multiple models using walk-forward validation.

    Parameters
    ----------
    models : list of sklearn-compatible models
    model_names : list of str
    X : feature matrix
    y : target vector
    validator : WalkForwardValidator instance

    Returns
    -------
    DataFrame with comparison metrics.
    """
    results = []

    for model, name in zip(models, model_names):
        print(f"Validating {name}...")
        val_result = validator.validate(model, X, y)
        metrics = val_result['metrics']
        metrics['model'] = name

        # Add calibration metrics
        preds = val_result['predictions']
        actuals = val_result['actuals']

        # Calibration: bin predictions and compare to observed frequency
        bins = np.linspace(0, 1, 11)
        bin_indices = np.digitize(preds, bins) - 1
        bin_indices = np.clip(bin_indices, 0, len(bins) - 2)

        calibration_error = 0
        for b in range(len(bins) - 1):
            mask = bin_indices == b
            if mask.sum() > 0:
                predicted_avg = preds[mask].mean()
                observed_avg = actuals[mask].mean()
                calibration_error += mask.sum() * (predicted_avg - observed_avg)**2
        calibration_error = np.sqrt(calibration_error / len(preds))
        metrics['calibration_rmse'] = calibration_error

        results.append(metrics)

    return pd.DataFrame(results).set_index('model')

Residual Analysis

After selecting a model, residual analysis confirms that the model's assumptions are satisfied and reveals any remaining patterns:

  • Residual time series plot: Look for trends, heteroscedasticity, or structural breaks.
  • Residual ACF: Check for remaining autocorrelation.
  • Residual distribution: Compare to the assumed distribution.
  • Residuals vs. fitted values: Check for non-linear patterns or heteroscedasticity.
  • Residuals vs. each predictor: Check for non-linear relationships that the model missed.

For logistic regression, "residuals" are typically deviance residuals or Pearson residuals:

$$r_i^{\text{deviance}} = \text{sign}(y_i - \hat{p}_i) \sqrt{-2[y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)]}$$
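
A direct translation of this formula, with clipping to avoid taking the log of zero:

import numpy as np

def deviance_residuals(y, p_hat, eps=1e-15):
    """Deviance residuals for binary outcomes y and predicted probabilities p_hat."""
    p = np.clip(p_hat, eps, 1 - eps)  # guard against log(0)
    dev = -2 * (y * np.log(p) + (1 - y) * np.log(1 - p))
    return np.sign(y - p) * np.sqrt(dev)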

Calibration Assessment

A model's probability estimates are well-calibrated if, among all predictions of probability $p$, approximately $p \times 100\%$ of the events actually occur. Calibration is assessed visually with a reliability diagram (plotting predicted probability bins against observed frequency) and quantitatively with metrics like the Expected Calibration Error (ECE).

import numpy as np
import matplotlib.pyplot as plt

def plot_calibration(predictions, actuals, n_bins=10, title='Calibration Plot'):
    """Plot a reliability diagram."""
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_centers = []
    observed_freqs = []
    bin_counts = []

    for i in range(n_bins):
        if i < n_bins - 1:
            mask = (predictions >= bin_edges[i]) & (predictions < bin_edges[i+1])
        else:
            # Close the last bin on the right so predictions of exactly 1.0 are counted
            mask = (predictions >= bin_edges[i]) & (predictions <= bin_edges[i+1])
        if mask.sum() > 0:
            bin_centers.append(predictions[mask].mean())
            observed_freqs.append(actuals[mask].mean())
            bin_counts.append(mask.sum())

    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 8),
                                     gridspec_kw={'height_ratios': [3, 1]})

    # Reliability diagram
    ax1.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
    ax1.plot(bin_centers, observed_freqs, 'bo-', label='Model')
    ax1.set_xlabel('Predicted Probability')
    ax1.set_ylabel('Observed Frequency')
    ax1.set_title(title)
    ax1.legend()
    ax1.set_xlim(0, 1)
    ax1.set_ylim(0, 1)

    # Histogram of predictions
    ax2.hist(predictions, bins=bin_edges, edgecolor='black', alpha=0.7)
    ax2.set_xlabel('Predicted Probability')
    ax2.set_ylabel('Count')

    plt.tight_layout()
    return fig

def expected_calibration_error(predictions, actuals, n_bins=10):
    """Compute the Expected Calibration Error."""
    bin_edges = np.linspace(0, 1, n_bins + 1)
    ece = 0
    n = len(predictions)

    for i in range(n_bins):
        if i < n_bins - 1:
            mask = (predictions >= bin_edges[i]) & (predictions < bin_edges[i+1])
        else:
            # Close the last bin on the right so predictions of exactly 1.0 are counted
            mask = (predictions >= bin_edges[i]) & (predictions <= bin_edges[i+1])
        if mask.sum() > 0:
            predicted_avg = predictions[mask].mean()
            observed_avg = actuals[mask].mean()
            ece += mask.sum() / n * abs(predicted_avg - observed_avg)

    return ece
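
Applying both tools to out-of-sample predictions (oos_predictions and oos_outcomes stand in for arrays collected during walk-forward validation):

fig = plot_calibration(oos_predictions, oos_outcomes,
                       title='Walk-Forward Calibration')
print(f"ECE: {expected_calibration_error(oos_predictions, oos_outcomes):.4f}")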

22.11 From Model Output to Trading Signal

A model that produces well-calibrated probabilities is valuable, but it is not yet a trading strategy. Converting model outputs to trading decisions requires several additional steps: defining when to trade, how much to trade, and how to manage the resulting positions.

Converting Model Probabilities to Trading Decisions

The simplest approach compares the model's probability $\hat{p}$ to the market price $P$:

  • If $\hat{p} > P + \text{threshold}$: Buy (the contract is underpriced).
  • If $\hat{p} < P - \text{threshold}$: Sell (the contract is overpriced).
  • Otherwise: No trade.

The threshold accounts for transaction costs (bid-ask spread, fees) and model uncertainty. A common starting point is to set the threshold at twice the half-spread plus any per-trade fees: with a half-spread of 0.01 and a per-trade fee of 0.005, for example, the starting threshold would be 0.025.

Threshold Selection

Choosing the threshold involves a trade-off between the frequency and the quality of trades:

  • Low threshold: More trades, smaller average edge per trade, higher transaction costs, more diversification.
  • High threshold: Fewer trades, larger average edge per trade, lower transaction costs, less diversification.

We can optimize the threshold using walk-forward validation:

import numpy as np
import pandas as pd

def optimize_threshold(predictions, market_prices, actuals,
                        thresholds=np.arange(0.01, 0.20, 0.01),
                        transaction_cost=0.02):
    """
    Find the optimal threshold for converting model predictions to trades.

    Parameters
    ----------
    predictions : array of model probability estimates
    market_prices : array of market prices at prediction time
    actuals : array of binary outcomes (0 or 1)
    thresholds : array of thresholds to test
    transaction_cost : per-trade cost (as a fraction)

    Returns
    -------
    dict with optimal threshold and profit metrics
    """
    results = []

    for thresh in thresholds:
        edges = predictions - market_prices

        # Buy signals
        buy_mask = edges > thresh
        buy_profit = actuals[buy_mask] - market_prices[buy_mask] - transaction_cost

        # Sell signals
        sell_mask = edges < -thresh
        sell_profit = market_prices[sell_mask] - actuals[sell_mask] - transaction_cost

        if len(buy_profit) > 0 or len(sell_profit) > 0:
            all_profits = np.concatenate([buy_profit, sell_profit])
        else:
            all_profits = np.array([0])

        results.append({
            'threshold': thresh,
            'n_trades': len(buy_profit) + len(sell_profit),
            'total_profit': all_profits.sum(),
            'avg_profit_per_trade': all_profits.mean() if len(all_profits) > 0 else 0,
            'win_rate': (all_profits > 0).mean() if len(all_profits) > 0 else 0,
            # Per-trade Sharpe ratio, annualized under a rough one-trade-per-day assumption
            'sharpe': all_profits.mean() / (all_profits.std() + 1e-10) * np.sqrt(252)
        })

    results_df = pd.DataFrame(results)
    best_idx = results_df['sharpe'].idxmax()

    return {
        'optimal_threshold': results_df.loc[best_idx, 'threshold'],
        'results': results_df
    }
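
A usage sketch, with the oos_* arrays standing in for predictions, market prices, and outcomes collected out of sample:

result = optimize_threshold(oos_predictions, oos_prices, oos_outcomes)
print(f"Optimal threshold: {result['optimal_threshold']:.2f}")
print(result['results'][['threshold', 'n_trades', 'avg_profit_per_trade']])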

Confidence-Based Position Sizing

Rather than using a fixed position size, scale positions based on the model's confidence — the magnitude of the perceived mispricing:

$$\text{Position size} = f(\hat{p} - P) \times \text{max\_position}$$

Where $f$ could be:

  • Linear: $f(x) = |x| / x_{\max}$ — position size proportional to edge.
  • Kelly: $f(x) = \frac{\hat{p} - P}{1 - P}$ for buys — the Kelly-optimal fraction (Chapter 15).
  • Half-Kelly: $f(x) = 0.5 \times \text{Kelly}$ — a more conservative approach.
  • Sigmoid: $f(x) = \sigma(k \cdot |x|)$ — smooth transition from small to large positions.

A fractional-Kelly implementation for binary contracts:

import numpy as np

def kelly_position_size(model_prob, market_price, max_position=100,
                         kelly_fraction=0.5):
    """
    Calculate position size using fractional Kelly criterion.

    Parameters
    ----------
    model_prob : float
        Model's estimated probability.
    market_price : float
        Current market price (implied probability).
    max_position : float
        Maximum position size (in contracts or dollars).
    kelly_fraction : float
        Fraction of full Kelly to use (default 0.5 = half-Kelly).

    Returns
    -------
    float : Signed position size (positive = buy, negative = sell).
    """
    edge = model_prob - market_price

    if abs(edge) < 0.01:  # Minimum edge threshold
        return 0.0

    if edge > 0:
        # Buy: Kelly fraction for a binary bet
        odds = (1 - market_price) / market_price  # Payout odds
        kelly = model_prob - (1 - model_prob) / odds
    else:
        # Sell: Equivalent Kelly for the opposite position
        odds = market_price / (1 - market_price)
        kelly = (1 - model_prob) - model_prob / odds
        kelly = -kelly  # Negative for sell

    kelly = kelly * kelly_fraction
    kelly = np.clip(kelly, -1, 1)  # Cap at full position

    return kelly * max_position
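
A quick worked example: with a model probability of 0.68 against a market price of 0.60, the full-Kelly fraction is $(0.68 - 0.60)/(1 - 0.60) = 0.20$; half-Kelly scales this to 0.10, which at the default max_position of 100 gives a 10-contract buy:

size = kelly_position_size(model_prob=0.68, market_price=0.60)
print(f"{size:.1f}")  # 10.0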

Integrating with a Trading System

A complete signal generation pipeline connects the model to a trading system:

class StatisticalSignalGenerator:
    """
    Generates trading signals from statistical model outputs.

    Integrates with the trading system framework from Chapter 19.
    """

    def __init__(self, model, scaler, feature_engineer,
                 threshold=0.05, kelly_fraction=0.25,
                 max_position=100, transaction_cost=0.02):
        self.model = model
        self.scaler = scaler
        self.feature_engineer = feature_engineer
        self.threshold = threshold
        self.kelly_fraction = kelly_fraction
        self.max_position = max_position
        self.transaction_cost = transaction_cost

    def generate_signal(self, current_data, market_price):
        """
        Generate a trading signal from current data and market price.

        Parameters
        ----------
        current_data : pd.DataFrame
            Current feature data (single row or recent history).
        market_price : float
            Current market price.

        Returns
        -------
        dict with signal details.
        """
        # Engineer features
        features = self.feature_engineer.transform(current_data)
        if len(features) == 0:
            return {'action': 'hold', 'size': 0, 'reason': 'insufficient data'}

        # Get the most recent feature vector
        X = features.iloc[[-1]].values
        X_scaled = self.scaler.transform(X)

        # Predict probability
        model_prob = self.model.predict_proba(X_scaled)[0, 1]

        # Calculate edge
        edge = model_prob - market_price

        # Determine action
        if edge > self.threshold + self.transaction_cost:
            action = 'buy'
        elif edge < -(self.threshold + self.transaction_cost):
            action = 'sell'
        else:
            action = 'hold'

        # Calculate position size
        if action != 'hold':
            size = kelly_position_size(
                model_prob, market_price,
                self.max_position, self.kelly_fraction
            )
        else:
            size = 0

        return {
            'action': action,
            'size': abs(size),
            'direction': 1 if size > 0 else (-1 if size < 0 else 0),
            'model_prob': model_prob,
            'market_price': market_price,
            'edge': edge,
            'confidence': abs(edge),
        }
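
Wiring the generator into a live loop is then a matter of supplying fitted components. The names below (fitted_model, fitted_scaler, feature_pipeline, latest_data) are placeholders for objects built earlier in the pipeline:

signal_gen = StatisticalSignalGenerator(
    model=fitted_model,                 # fitted classifier with predict_proba
    scaler=fitted_scaler,               # fitted scaler matching training
    feature_engineer=feature_pipeline,  # object exposing .transform()
    threshold=0.05, kelly_fraction=0.25,
)
signal = signal_gen.generate_signal(latest_data, market_price=0.58)
if signal['action'] != 'hold':
    print(f"{signal['action']} {signal['size']:.0f} contracts "
          f"(model {signal['model_prob']:.2f}, edge {signal['edge']:+.2f})")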

Practical Considerations

Several practical issues arise when deploying statistical models for trading:

  1. Latency: The time between computing a signal and executing the trade. Stale signals lose value.
  2. Model staleness: Models trained on old data may degrade. Establish a retraining schedule.
  3. Regime changes: The statistical relationships that the model has learned may break down during unusual events.
  4. Correlated positions: Multiple models may generate correlated signals, concentrating risk.
  5. Slippage: The difference between the intended trade price and the actual execution price.
  6. Model confidence vs. market efficiency: Markets for major events are often more efficient than niche markets. The same model may work well for obscure contracts but poorly for headline events.

22.12 Chapter Summary

This chapter covered the statistical modeling toolkit for prediction market analysis:

Regression models provide the foundation for estimating event probabilities from observable features. Linear regression suits continuous outcomes (spreads, volumes), while logistic regression is the workhorse for binary prediction market contracts. Logistic regression coefficients are interpretable as changes in log-odds, and the model outputs genuine probabilities that tend to be reasonably well calibrated.

Feature engineering transforms raw data — polls, prices, volumes, time variables, and cross-market information — into informative predictors. Good features are predictive, available in advance, timely, and complementary. A well-designed feature pipeline is often more important than the choice of model.

Regularization (Ridge, Lasso, Elastic Net) prevents overfitting by penalizing large coefficients. Lasso performs automatic feature selection. Elastic Net handles correlated features. The regularization strength must be tuned using time-respecting validation.

Time series models — ARIMA, GARCH, and state space models — capture the temporal dynamics of prediction market prices. ARIMA models autocorrelation in price changes, GARCH models time-varying volatility, and regime-switching models detect structural changes in market behavior.

Walk-forward validation is the correct validation methodology for time series models. It respects temporal ordering, avoids lookahead bias, and provides realistic estimates of out-of-sample performance. Standard cross-validation must not be used for time-ordered data.

Model selection and comparison use information criteria (AIC, BIC) and proper scoring rules (log-loss, Brier score). Calibration assessment ensures that model probabilities match observed frequencies.

Signal generation converts model outputs to trading decisions through threshold selection and confidence-based position sizing. The Kelly criterion provides a principled framework for scaling positions based on estimated edge.

The key insight of this chapter is that statistical modeling for prediction markets is not about achieving the best possible in-sample fit. It is about producing well-calibrated, out-of-sample probability estimates that systematically differ from market prices by more than the cost of trading.


What's Next

In Chapter 23, we move from classical statistical models to machine learning approaches. We will explore decision trees, random forests, gradient boosting, and neural networks — models that can capture non-linear relationships and complex interactions that logistic regression and ARIMA miss. We will also address the unique challenges of applying machine learning to prediction markets: limited data, non-stationarity, and the risk of overfitting to a small number of historical events.

The statistical models covered in this chapter will serve as important baselines. Any machine learning model should be compared against a well-tuned logistic regression or ARIMA model. If the more complex model does not beat the simpler baseline on out-of-sample data, the added complexity is not justified. This principle — favoring simplicity unless complexity demonstrably helps — will guide our approach to machine learning in the chapters ahead.