Chapter 25: Ensemble Methods and Model Combination

"The whole is greater than the sum of its parts." — Aristotle (paraphrased)

In the world of prediction markets, no single model is perfect. A polling model might capture short-term dynamics but miss structural fundamentals. A machine learning algorithm might detect subtle patterns but overfit to noise. A prediction market price incorporates the wisdom of crowds but can be distorted by thin liquidity or cognitive biases. The insight that drives this chapter is profound yet simple: combining multiple imperfect forecasts almost always produces a forecast superior to any individual component.

This chapter is your comprehensive guide to ensemble methods and model combination, applied specifically to the domain of prediction markets and probabilistic forecasting. We will cover everything from the humblest simple average to sophisticated Bayesian model averaging and meta-learning stacking architectures. Along the way, we will explore the deep theoretical reasons why ensembles work, the practical pitfalls that can sabotage them, and the specialized technique of extremizing that addresses a unique challenge in probability forecasting.

By the end of this chapter, you will be able to build, calibrate, and deploy robust ensemble systems that combine market prices, statistical models, expert judgment, and machine learning predictions into forecasts that outperform any single source.


25.1 Why Combine Models?

The Diversity of Errors

Every forecasting model makes errors. The critical insight behind ensemble methods is that different models tend to make different errors. A polling-based election model might underestimate turnout effects, while a fundamentals-based model might miss late-breaking news. When their errors are uncorrelated, combining them cancels out individual mistakes, leaving behind a more accurate signal.

Consider two forecasters predicting the probability of an event. Forecaster A estimates 0.70 and Forecaster B estimates 0.60. If the true probability is 0.65, the simple average of 0.65 is exactly right, even though both individual forecasters were wrong. This is not a coincidence; it reflects a deep mathematical regularity.

The Bias-Variance Tradeoff in Forecasting

Recall from Chapter 23 the bias-variance decomposition of prediction error:

$$\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$

When we combine $K$ models by simple averaging, the bias of the ensemble is the average of the individual biases (unchanged if all models share the same bias), but the variance is reduced. Specifically, if models have equal variance $\sigma^2$ and pairwise correlation $\rho$, the ensemble variance is:

$$\sigma_{\text{ensemble}}^2 = \frac{\sigma^2}{K}\left[1 + (K-1)\rho\right]$$

When $\rho = 0$ (uncorrelated errors), variance drops as $\sigma^2 / K$, an enormous reduction. Even when $\rho > 0$, variance still decreases as long as $\rho < 1$. This is the mathematical foundation of ensemble methods: combining diverse models reduces variance without increasing bias.
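
The formula is easy to verify by simulation. The sketch below (with illustrative values for $\sigma$, $\rho$, and $K$) draws correlated model errors and compares the empirical variance of their average with the formula:

import numpy as np

# A minimal check of the ensemble-variance formula (illustrative parameters).
rng = np.random.default_rng(0)
K, sigma, rho, n_trials = 5, 0.10, 0.3, 200_000

# Covariance matrix with equal variances and equal pairwise correlation rho
cov = sigma**2 * (rho * np.ones((K, K)) + (1 - rho) * np.eye(K))
errors = rng.multivariate_normal(np.zeros(K), cov, size=n_trials)

empirical = errors.mean(axis=1).var()
theoretical = sigma**2 / K * (1 + (K - 1) * rho)
print(f"Empirical ensemble variance:   {empirical:.6f}")
print(f"Theoretical ensemble variance: {theoretical:.6f}")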

Empirical Evidence: Ensembles Almost Always Win

The empirical track record of ensemble methods is extraordinary:

  • The Netflix Prize (2009): The winning solution combined over 800 individual models. The top teams all used ensembles; no single-model approach came close.
  • Kaggle competitions: In a survey of over 500 Kaggle competitions, the winning solution used an ensemble in more than 95% of cases.
  • The Good Judgment Project: Superforecasters who combined multiple mental models outperformed those who relied on a single framework.
  • Forecasting competitions (M-competitions): Across the M1 through M5 competitions spanning decades, simple combinations of methods consistently outperformed individual methods.
  • Poll aggregation: FiveThirtyEight and The Economist combine multiple data sources into ensemble-like models that outperform any single poll.

The evidence is so overwhelming that statistician Rob Hyndman has called the superiority of combinations "the most important finding in forecasting."

The Forecast Combination Puzzle

Here is a puzzle that has intrigued researchers since Bates and Granger's seminal 1969 paper: simple averages often perform as well as or better than sophisticated optimal combination methods. This is counterintuitive. If we know that Model A has historically been twice as accurate as Model B, shouldn't we weight Model A more heavily?

The answer involves a subtle interplay of estimation error and non-stationarity:

  1. Estimation error in weights: Optimal weights must be estimated from historical data. With limited data and many models, the estimation error in the weights can overwhelm the theoretical advantage of optimal weighting.

  2. Non-stationarity: The relative performance of models changes over time. Weights optimized for the past may not apply to the future.

  3. Robustness: Equal weights are maximally robust. They cannot be "wrong" in the way that extreme weights can be.

This puzzle does not mean we should always use simple averaging. It means we should be humble about our ability to improve on it, and that any departure from equal weights should be justified by strong evidence.
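
The puzzle is easy to reproduce. The sketch below (with an illustrative data-generating process, not taken from any real forecasting record) fits unconstrained least-squares weights on a short history and compares them with equal weights on fresh data; with few observations and similar models, the estimated weights frequently lose out of sample:

import numpy as np

# Sketch: in-sample "optimal" weights vs. equal weights out of sample.
rng = np.random.default_rng(1)
K, T_train, T_test = 5, 30, 2000

def simulate(T):
    p = rng.beta(2, 2, T)
    y = (rng.random(T) < p).astype(float)
    # K similar models: true probability plus independent noise
    F = np.clip(p[:, None] + rng.normal(0, 0.15, (T, K)), 0.01, 0.99)
    return F, y

F_tr, y_tr = simulate(T_train)
F_te, y_te = simulate(T_test)

# Weights estimated by unconstrained least squares on the short history
w_opt, *_ = np.linalg.lstsq(F_tr, y_tr, rcond=None)
w_eq = np.ones(K) / K

mse = lambda F, y, w: float(np.mean((F @ w - y) ** 2))
print(f"Out-of-sample MSE, equal weights:     {mse(F_te, y_te, w_eq):.4f}")
print(f"Out-of-sample MSE, estimated weights: {mse(F_te, y_te, w_opt):.4f}")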


25.2 Simple Averaging

Equal-Weight Combination

The simplest ensemble method is the unweighted arithmetic mean:

$$\hat{p}_{\text{ensemble}} = \frac{1}{K} \sum_{i=1}^{K} \hat{p}_i$$

where $\hat{p}_i$ is the probability forecast from model $i$ and $K$ is the number of models. Despite its simplicity, this method has remarkable properties.

When Simple Averaging Works Surprisingly Well

Simple averaging excels in the following conditions:

  1. You have few historical observations relative to the number of models: With limited data, you cannot reliably estimate optimal weights, so equal weights are safer.
  2. Model performance is roughly comparable: When no single model is vastly superior, the gains from weighting are small.
  3. The environment is non-stationary: When the data-generating process shifts over time, historically optimal weights become stale.
  4. Models are diverse: When models make different kinds of errors, averaging cancels them out effectively.

In prediction market contexts, simple averaging is a strong baseline for combining multiple independent probability estimates, whether from different models, different market platforms, or different forecasters.

Theoretical Justification: Variance Reduction

Suppose we have $K$ models, each producing an estimate $\hat{p}_i = p + \epsilon_i$, where $p$ is the true probability and $\epsilon_i$ is the error of model $i$. If errors have zero mean (unbiased models) and variance $\sigma_i^2$, and errors are independent, then:

$$\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K} \hat{p}_i\right) = \frac{1}{K^2}\sum_{i=1}^{K}\sigma_i^2 = \frac{\bar{\sigma}^2}{K}$$

where $\bar{\sigma}^2$ is the average individual variance. The ensemble variance shrinks at rate $1/K$, a powerful result.

Even when models are correlated, averaging helps. If all pairwise correlations equal $\rho$:

$$\text{Var}(\text{ensemble}) = \frac{\bar{\sigma}^2}{K} + \frac{K-1}{K}\rho\bar{\sigma}^2$$

As $K \to \infty$, this converges to $\rho\bar{\sigma}^2$. The irreducible floor is determined by the average correlation. This highlights why diversity (low $\rho$) is so valuable.

Python Simple Averager

import numpy as np
from typing import List

def simple_average(forecasts: List[float]) -> float:
    """
    Combine probability forecasts by simple averaging.

    Parameters
    ----------
    forecasts : list of float
        Individual probability forecasts, each in [0, 1].

    Returns
    -------
    float
        The ensemble forecast (arithmetic mean).
    """
    forecasts = np.array(forecasts)
    if np.any((forecasts < 0) | (forecasts > 1)):
        raise ValueError("All forecasts must be in [0, 1].")
    return float(np.mean(forecasts))


# Example usage
models = [0.72, 0.68, 0.75, 0.61, 0.70]
ensemble = simple_average(models)
print(f"Individual forecasts: {models}")
print(f"Simple average: {ensemble:.4f}")
# Output: Simple average: 0.6920

The simplicity is the point. This is your baseline, and it is surprisingly hard to beat.


25.3 Weighted Averaging

Performance-Based Weighting

When we have evidence that some models are consistently better than others, we can assign them higher weights:

$$\hat{p}_{\text{ensemble}} = \sum_{i=1}^{K} w_i \hat{p}_i, \quad \text{where } \sum_{i=1}^{K} w_i = 1, \quad w_i \geq 0$$

The challenge is determining the weights $w_i$.

Inverse-Error Weighting

A natural approach weights each model inversely proportional to its historical error:

$$w_i = \frac{1/\text{MSE}_i}{\sum_{j=1}^{K} 1/\text{MSE}_j}$$

where $\text{MSE}_i$ is the mean squared error of model $i$ over a validation period. Models with lower error get higher weight. This approach is intuitive and often effective, but it requires enough historical data to estimate MSE reliably.

A variant uses the Brier score for probability forecasts:

$$w_i = \frac{1/\text{BS}_i}{\sum_{j=1}^{K} 1/\text{BS}_j}$$

Recency-Weighted Averaging

In non-stationary environments, recent performance is more informative than distant performance. We can use exponentially decaying weights on the historical errors:

$$\text{MSE}_i^{(\lambda)} = \frac{\sum_{t=1}^{T} \lambda^{T-t}(p_i^{(t)} - y^{(t)})^2}{\sum_{t=1}^{T} \lambda^{T-t}}$$

where $\lambda \in (0, 1)$ is the decay rate. Smaller $\lambda$ gives more weight to recent observations.
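
A minimal helper for the decayed error (the function name and default decay are ours):

import numpy as np

def recency_weighted_mse(forecasts, outcomes, decay=0.95):
    """Exponentially decayed MSE; the most recent observation gets weight 1."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    T = len(outcomes)
    weights = decay ** np.arange(T - 1, -1, -1)   # lambda^(T - t) for t = 1..T
    sq_err = (forecasts - outcomes) ** 2
    return float(np.sum(weights * sq_err) / np.sum(weights))

The decayed errors can then be plugged directly into the inverse-error weighting formula above.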

Optimal Weights with Historical Data

The theoretically optimal weights minimize the expected loss of the combined forecast. For mean squared error, this leads to a constrained regression problem:

$$\min_{w} \sum_{t=1}^{T}\left(y^{(t)} - \sum_{i=1}^{K} w_i \hat{p}_i^{(t)}\right)^2 \quad \text{s.t.} \quad \sum_i w_i = 1, \quad w_i \geq 0$$

This is a constrained least squares problem that can be solved using quadratic programming. The solution accounts for correlations between model errors; if two models make similar errors, the optimal weights reduce their combined influence.

The Overfitting Danger

Optimized weights carry a serious risk: overfitting. With $K$ models and limited historical data, the optimized weights may fit the noise in the training set rather than genuine performance differences. This is especially dangerous when:

  • $K$ is large relative to the number of historical observations $T$
  • Model performance is measured over a non-representative period
  • The models themselves were tuned on overlapping data

Strategies to mitigate overfitting:

  1. Shrinkage toward equal weights: Blend optimal weights with equal weights: $$w_i^{\text{shrunk}} = \alpha \cdot w_i^{\text{optimal}} + (1 - \alpha) \cdot \frac{1}{K}$$

  2. Cross-validation: Estimate weights using cross-validated performance.

  3. Regularization: Add a penalty for weights deviating from $1/K$.

  4. Bayesian estimation: Place a Dirichlet prior on the weights centered at the uniform distribution.

Python Weighted Combiner

import numpy as np
from scipy.optimize import minimize
from typing import List

def weighted_average(forecasts: List[float], weights: List[float]) -> float:
    """Combine forecasts using specified weights."""
    forecasts = np.array(forecasts)
    weights = np.array(weights)
    weights = weights / weights.sum()  # Ensure normalization
    return float(np.dot(weights, forecasts))


def inverse_error_weights(errors: List[float]) -> np.ndarray:
    """Compute inverse-error weights from historical MSE values."""
    errors = np.array(errors)
    inv_errors = 1.0 / errors
    return inv_errors / inv_errors.sum()


def optimize_weights(forecasts_matrix: np.ndarray,
                     outcomes: np.ndarray,
                     shrinkage: float = 0.0) -> np.ndarray:
    """
    Find optimal combination weights via constrained optimization.

    Parameters
    ----------
    forecasts_matrix : np.ndarray, shape (T, K)
        Historical forecasts from K models over T periods.
    outcomes : np.ndarray, shape (T,)
        Binary outcomes (0 or 1).
    shrinkage : float
        Shrinkage toward equal weights (0 = no shrinkage, 1 = equal weights).

    Returns
    -------
    np.ndarray
        Optimal weights, shape (K,).
    """
    T, K = forecasts_matrix.shape
    equal_weights = np.ones(K) / K

    def objective(w):
        combined = forecasts_matrix @ w
        mse = np.mean((outcomes - combined) ** 2)
        # Regularization toward equal weights
        reg = shrinkage * np.sum((w - equal_weights) ** 2)
        return mse + reg

    # Constraints: weights sum to 1
    constraints = {'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0}
    # Bounds: weights in [0, 1]
    bounds = [(0.0, 1.0)] * K

    result = minimize(objective, equal_weights, method='SLSQP',
                      bounds=bounds, constraints=constraints)
    return result.x


# Example
np.random.seed(42)
T, K = 100, 4
true_probs = np.random.beta(2, 2, T)
outcomes = (np.random.random(T) < true_probs).astype(float)

# Simulate 4 models with different accuracies
forecasts = np.column_stack([
    true_probs + np.random.normal(0, 0.10, T),  # Good model
    true_probs + np.random.normal(0, 0.15, T),  # Medium model
    true_probs + np.random.normal(0, 0.20, T),  # Weak model
    true_probs + np.random.normal(0, 0.12, T),  # Good model
])
forecasts = np.clip(forecasts, 0.01, 0.99)

weights = optimize_weights(forecasts, outcomes, shrinkage=0.1)
print(f"Optimal weights: {weights.round(3)}")
print(f"Equal-weight MSE: {np.mean((outcomes - forecasts.mean(axis=1))**2):.4f}")
combined = forecasts @ weights
print(f"Optimal-weight MSE: {np.mean((outcomes - combined)**2):.4f}")

25.4 Linear and Logarithmic Pooling

When combining probability forecasts specifically, we enter the domain of opinion pooling, which has a rich theory in decision science.

Linear Opinion Pool

The linear opinion pool is simply the weighted average of probabilities:

$$p_{\text{LP}} = \sum_{i=1}^{K} w_i \cdot p_i$$

Properties of the linear pool:

  • It preserves marginal calibration: if each model is calibrated, the linear pool is calibrated.
  • It is not externally Bayesian: pooling and then updating on new evidence is not the same as updating each model and then pooling, and when the models share a common prior the linear pool can overcount that shared information.
  • It can never produce a probability more extreme than the most extreme individual forecast (it is bounded by the range of individual forecasts).

Logarithmic Opinion Pool

The logarithmic opinion pool combines log-odds (or equivalently, takes the geometric mean of probability ratios):

$$p_{\text{LogP}} = \frac{\prod_{i=1}^{K} p_i^{w_i}}{\prod_{i=1}^{K} p_i^{w_i} + \prod_{i=1}^{K} (1 - p_i)^{w_i}}$$

Equivalently, in log-odds space:

$$\log\frac{p_{\text{LogP}}}{1 - p_{\text{LogP}}} = \sum_{i=1}^{K} w_i \cdot \log\frac{p_i}{1 - p_i}$$

Properties of the logarithmic pool:

  • It is externally Bayesian: it correctly combines independent pieces of evidence (under certain conditions).
  • It can produce probabilities more extreme than any individual forecast.
  • It handles shared information more naturally than the linear pool because it operates in log-odds space, where independent evidence adds linearly.
  • It is proper: if all component forecasts are proper (elicited to maximize expected score under a logarithmic scoring rule), the logarithmic pool is also proper.

The Supra-Bayesian Approach

A "supra-Bayesian" decision maker treats each forecaster's probability as data and updates their own prior accordingly. Under certain assumptions (conditional independence of forecasters given the true state), the supra-Bayesian combination resembles the logarithmic pool.

The supra-Bayesian framework provides a principled foundation:

$$p_{\text{SB}}(E | p_1, \ldots, p_K) \propto p_0(E) \cdot \prod_{i=1}^{K} \frac{f(p_i | E)}{f(p_i)}$$

where $p_0(E)$ is the prior, $f(p_i | E)$ is the likelihood of observing forecast $p_i$ given event $E$, and $f(p_i)$ is the marginal likelihood of forecast $p_i$.

When to Use Each

| Criterion | Linear Pool | Logarithmic Pool |
|---|---|---|
| Shared information | Overcounts it | Handles it naturally |
| Extreme forecasts | Moderates them | Can amplify them |
| Calibration | Preserves marginal calibration | May miscalibrate |
| Independence assumption | Not required | Conditional independence |
| Interpretation | Democratic vote | Evidence combination |

Use the linear pool when:

  • Models may share significant information
  • You want to preserve calibration guarantees
  • You want a conservative combination

Use the logarithmic pool when:

  • Models process largely independent information
  • You want to capture the full strength of agreement
  • The individual forecasts are well-calibrated

Python Pool Implementations

import numpy as np

def linear_pool(forecasts, weights=None):
    """
    Combine probability forecasts using the linear opinion pool.

    Parameters
    ----------
    forecasts : array-like of float
        Probability forecasts in (0, 1).
    weights : array-like of float, optional
        Combination weights (default: equal weights).

    Returns
    -------
    float
        Combined probability.
    """
    forecasts = np.array(forecasts, dtype=float)
    if weights is None:
        weights = np.ones(len(forecasts)) / len(forecasts)
    else:
        weights = np.array(weights, dtype=float)
        weights /= weights.sum()
    return float(np.dot(weights, forecasts))


def logarithmic_pool(forecasts, weights=None, eps=1e-10):
    """
    Combine probability forecasts using the logarithmic opinion pool.

    Parameters
    ----------
    forecasts : array-like of float
        Probability forecasts in (0, 1).
    weights : array-like of float, optional
        Combination weights (default: equal weights).
    eps : float
        Small constant to avoid log(0).

    Returns
    -------
    float
        Combined probability.
    """
    forecasts = np.array(forecasts, dtype=float)
    forecasts = np.clip(forecasts, eps, 1 - eps)

    if weights is None:
        weights = np.ones(len(forecasts)) / len(forecasts)
    else:
        weights = np.array(weights, dtype=float)
        weights /= weights.sum()

    # Combine in log-odds space
    log_odds = np.log(forecasts / (1 - forecasts))
    combined_log_odds = np.dot(weights, log_odds)

    # Convert back to probability
    return float(1.0 / (1.0 + np.exp(-combined_log_odds)))


# Example: Three forecasters predict an election outcome
forecasters = [0.65, 0.70, 0.75]
print(f"Linear pool:      {linear_pool(forecasters):.4f}")
print(f"Logarithmic pool: {logarithmic_pool(forecasters):.4f}")

# Notice: with agreement, the log pool is slightly more extreme
# Linear pool:      0.7000
# Logarithmic pool: 0.7016

# With more extreme disagreement
forecasters_extreme = [0.90, 0.50, 0.80]
print(f"\nLinear pool:      {linear_pool(forecasters_extreme):.4f}")
print(f"Logarithmic pool: {logarithmic_pool(forecasters_extreme):.4f}")

25.5 Extremizing

Why Average Probabilities Are Often Too Moderate

One of the most consistent findings in forecast aggregation research is that simple averages of probability forecasts tend to be insufficiently extreme. If five well-informed forecasters each give 80% probability to an event, the average is 80%, but the true probability may be closer to 95%.

Why? Because the forecasters share common background information. If each forecaster independently reaches 80% using partly overlapping evidence, the full weight of all evidence combined should push the probability further from 50% than any individual forecast.

This insight was formalized by Baron et al. (2014) and operationalized in the Good Judgment Project, where extremizing was shown to significantly improve the calibration of aggregated forecasts.

The Extremizing Transformation

The extremizing transformation pushes a combined forecast away from 50%:

$$p_{\text{extremized}} = \frac{p_{\text{avg}}^d}{p_{\text{avg}}^d + (1 - p_{\text{avg}})^d}$$

where $d > 1$ is the extremizing parameter. When $d = 1$, no transformation occurs. When $d > 1$, probabilities are pushed toward 0 or 1. When $d < 1$, probabilities are pushed toward 50% (anti-extremizing).

Equivalently, in log-odds space, extremizing simply scales the log-odds:

$$\text{logit}(p_{\text{extremized}}) = d \cdot \text{logit}(p_{\text{avg}})$$

This is elegant: extremizing is just multiplication in log-odds space.

Theoretical Justification from Shared Information

Suppose $K$ forecasters each observe $n$ independent signals plus $m$ shared signals. Each forecaster's log-odds reflects $n + m$ pieces of evidence, but the combined evidence across all forecasters is $Kn + m$ signals (the shared information is counted once, not $K$ times).

If we naively sum the log-odds, treating every forecaster's evidence as independent (a unit-weight logarithmic pool), we get:

$$\text{logit}_{\text{naive}} = K(n + m) \cdot \text{signal strength}$$

But the correct answer counts the shared signals only once:

$$\text{logit}_{\text{correct}} = (Kn + m) \cdot \text{signal strength}$$

The ratio gives us the correction factor for the summed log-odds:

$$d = \frac{Kn + m}{K(n + m)} = \frac{Kn + m}{Kn + Km}$$

When $m = 0$ (no shared information), $d = 1$: no correction is needed. When $m > 0$, $d < 1$: the naive sum overcounts the shared information and must be shrunk. The equal-weight average of log-odds has the opposite problem: it reflects only $n + m$ effective signals, so it must be extremized by $d = (Kn + m)/(n + m) > 1$. The linear pool is similarly too moderate, so its effective extremizing factor is typically greater than 1 as well.

For practical purposes with a linear pool aggregate, the extremizing factor typically falls in the range $d \in [1.5, 3.0]$, depending on the number of forecasters and the degree of information sharing.
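
A tiny worked example (the signal counts below are assumed for illustration) makes the two corrections concrete:

# Illustrative counts (assumed, not from the text): 5 forecasters,
# each seeing 4 private signals and 4 shared signals of equal strength.
K, n, m = 5, 4, 4

d_sum = (K * n + m) / (K * (n + m))   # shrink factor for summed log-odds
d_avg = (K * n + m) / (n + m)         # extremizing factor for averaged log-odds
print(f"Summed log-odds correction:   d = {d_sum:.2f}")   # 24/40 = 0.60
print(f"Averaged log-odds correction: d = {d_avg:.2f}")   # 24/8  = 3.00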

Calibrating the Extremizing Factor

The optimal extremizing factor can be estimated empirically:

  1. Grid search: Try different values of $d$ and select the one that minimizes Brier score on a calibration dataset.
  2. Logistic recalibration: Fit a logistic regression of outcomes on the logit of average forecasts. The coefficient on logit is the optimal extremizing factor.
  3. Cross-validation: Use time-series cross-validation to select $d$ out-of-sample.

The logistic recalibration approach is particularly elegant:

$$\text{logit}(P(Y=1)) = \beta_0 + \beta_1 \cdot \text{logit}(\bar{p})$$

Here, $\beta_1$ is the extremizing factor $d$, and $\beta_0$ corrects for any bias. If $\beta_0 = 0$ and $\beta_1 > 1$, extremizing is beneficial.

Python Extremizer

import numpy as np
from scipy.special import logit, expit
from sklearn.linear_model import LogisticRegression

def extremize(p, d=1.5):
    """
    Apply extremizing transformation to a probability forecast.

    Parameters
    ----------
    p : float or array
        Probability forecast(s) in (0, 1).
    d : float
        Extremizing factor. d > 1 pushes toward 0/1,
        d < 1 pushes toward 0.5.

    Returns
    -------
    float or array
        Extremized probability.
    """
    p = np.asarray(p, dtype=float)
    p = np.clip(p, 1e-6, 1 - 1e-6)
    log_odds = logit(p)
    return expit(d * log_odds)


def calibrate_extremizing(avg_forecasts, outcomes):
    """
    Find the optimal extremizing factor using logistic recalibration.

    Parameters
    ----------
    avg_forecasts : array-like
        Averaged probability forecasts.
    outcomes : array-like
        Binary outcomes (0 or 1).

    Returns
    -------
    dict
        Contains 'extremizing_factor' (d), 'intercept' (bias correction),
        and 'brier_before' and 'brier_after'.
    """
    avg_forecasts = np.array(avg_forecasts)
    outcomes = np.array(outcomes)

    # Compute logit of average forecasts
    log_odds = logit(np.clip(avg_forecasts, 1e-6, 1 - 1e-6))

    # Fit logistic regression
    X = log_odds.reshape(-1, 1)
    lr = LogisticRegression(fit_intercept=True, C=1e6)
    lr.fit(X, outcomes)

    d = lr.coef_[0, 0]
    intercept = lr.intercept_[0]

    # Compute Brier scores
    brier_before = np.mean((avg_forecasts - outcomes) ** 2)
    extremized = expit(intercept + d * log_odds)
    brier_after = np.mean((extremized - outcomes) ** 2)

    return {
        'extremizing_factor': d,
        'intercept': intercept,
        'brier_before': brier_before,
        'brier_after': brier_after,
    }


# Example
np.random.seed(42)
n = 500
true_p = np.random.beta(2, 2, n)
outcomes = (np.random.random(n) < true_p).astype(float)

# Simulate "moderate" average forecasts (pulled toward 0.5)
avg_forecasts = 0.6 * true_p + 0.2  # Shrunk toward 0.5

result = calibrate_extremizing(avg_forecasts, outcomes)
print(f"Optimal extremizing factor: {result['extremizing_factor']:.3f}")
print(f"Intercept (bias correction): {result['intercept']:.3f}")
print(f"Brier score before: {result['brier_before']:.4f}")
print(f"Brier score after:  {result['brier_after']:.4f}")

25.6 Stacking and Meta-Learning

Using a Meta-Model to Combine Base Models

Stacking (stacked generalization) goes beyond weighted averaging by training a meta-learner that learns how to optimally combine the predictions of base models. Instead of fixed weights, the meta-learner can learn non-linear combination functions and context-dependent weighting.

The stacking architecture has two levels:

  1. Level 0 (Base models): Multiple diverse models each generate predictions.
  2. Level 1 (Meta-learner): A single model takes the base predictions as inputs and produces the final forecast.

The meta-learner can be any model, but common choices include:

  • Logistic regression: simple, interpretable, acts like learned weighted averaging
  • Regularized logistic regression: adds an L2 (ridge) penalty to prevent overfitting
  • Gradient boosting: can learn non-linear combinations
  • Neural network: maximum flexibility, requires the most data

Stacking with Logistic Regression

The simplest and most common stacking approach uses logistic regression as the meta-learner:

$$\text{logit}(p_{\text{stack}}) = \beta_0 + \sum_{i=1}^{K} \beta_i \cdot \hat{p}_i$$

This learns a bias correction ($\beta_0$) and importance weights ($\beta_i$) for each base model. The logistic link function ensures the output is a valid probability.

Cross-Validated Stacking

A critical pitfall in stacking is information leakage. If the meta-learner is trained on the same data used to train the base models, it will learn to trust overfit base models. The solution is cross-validated stacking:

  1. Split the training data into $F$ folds.
  2. For each fold $f$:
     a. Train all base models on the data excluding fold $f$.
     b. Generate predictions for fold $f$ using these models.
  3. Collect all out-of-fold predictions to form the meta-training set.
  4. Train the meta-learner on this out-of-fold prediction matrix.
  5. Retrain all base models on the full training data for final predictions.

This ensures the meta-learner sees predictions made by base models that did not see the corresponding outcomes, preventing information leakage.

Feature-Dependent Combination Weights

A powerful extension allows the combination weights to depend on features of the prediction problem. For example, in an election forecasting ensemble, you might weight the polling model more heavily close to the election and the fundamentals model more heavily far from the election.

This can be implemented by including interaction terms in the meta-learner:

$$\text{logit}(p_{\text{stack}}) = \beta_0 + \sum_{i=1}^{K} (\beta_i + \gamma_i \cdot x) \cdot \hat{p}_i$$

where $x$ is a context feature (e.g., days until the election).
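
A minimal sketch of this idea (the data and the context feature `days_out` are illustrative): stack each base forecast together with its interaction with the context feature, and let a logistic meta-learner estimate $\beta_i$ and $\gamma_i$.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch (illustrative data): meta-features are each base forecast p_i and
# its interaction p_i * x, where x is a context feature such as days to the
# election. The logistic meta-learner then learns context-dependent weights
# beta_i + gamma_i * x.
rng = np.random.default_rng(0)
T, K = 400, 3
base_preds = rng.uniform(0.05, 0.95, (T, K))   # stand-in base forecasts
days_out = rng.uniform(0, 120, T)              # context feature x
outcomes = (rng.random(T) < base_preds.mean(axis=1)).astype(int)

X_meta = np.hstack([base_preds, base_preds * days_out[:, None]])
meta = LogisticRegression(max_iter=1000).fit(X_meta, outcomes)
print("beta (direct weights):       ", meta.coef_[0, :K].round(3))
print("gamma (context interactions):", meta.coef_[0, K:].round(3))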

Python Stacking Pipeline

import copy
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

class StackingEnsemble:
    """
    A stacking ensemble for probability forecasting.

    Uses cross-validated out-of-fold predictions to train
    a meta-learner that combines base model forecasts.
    """

    def __init__(self, base_models, meta_learner=None, n_folds=5):
        """
        Parameters
        ----------
        base_models : list of sklearn estimators
            The level-0 models.
        meta_learner : sklearn estimator, optional
            The level-1 model. Defaults to LogisticRegression.
        n_folds : int
            Number of folds for cross-validated stacking.
        """
        self.base_models = base_models
        self.meta_learner = meta_learner or LogisticRegression(C=1.0)
        self.n_folds = n_folds
        self.fitted_base_models = []

    def fit(self, X, y):
        """
        Fit the stacking ensemble using cross-validated stacking.

        Parameters
        ----------
        X : np.ndarray, shape (n_samples, n_features)
        y : np.ndarray, shape (n_samples,)
        """
        n_samples = len(y)
        n_models = len(self.base_models)
        meta_features = np.zeros((n_samples, n_models))

        kf = KFold(n_splits=self.n_folds, shuffle=True, random_state=42)

        # Generate out-of-fold predictions
        for fold_idx, (train_idx, val_idx) in enumerate(kf.split(X)):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train = y[train_idx]

            for model_idx, model in enumerate(self.base_models):
                model_clone = copy.deepcopy(model)
                model_clone.fit(X_train, y_train)

                if hasattr(model_clone, 'predict_proba'):
                    preds = model_clone.predict_proba(X_val)[:, 1]
                else:
                    preds = model_clone.predict(X_val)

                meta_features[val_idx, model_idx] = preds

        # Train meta-learner on out-of-fold predictions
        self.meta_learner.fit(meta_features, y)

        # Retrain base models on full data
        self.fitted_base_models = []
        for model in self.base_models:
            model_clone = copy.deepcopy(model)
            model_clone.fit(X, y)
            self.fitted_base_models.append(model_clone)

        return self

    def predict_proba(self, X):
        """
        Generate ensemble probability predictions.

        Parameters
        ----------
        X : np.ndarray, shape (n_samples, n_features)

        Returns
        -------
        np.ndarray, shape (n_samples,)
            Ensemble probability of class 1.
        """
        meta_features = np.column_stack([
            model.predict_proba(X)[:, 1]
            if hasattr(model, 'predict_proba')
            else model.predict(X)
            for model in self.fitted_base_models
        ])

        if hasattr(self.meta_learner, 'predict_proba'):
            return self.meta_learner.predict_proba(meta_features)[:, 1]
        return self.meta_learner.predict(meta_features)

    def get_meta_weights(self):
        """Return meta-learner coefficients (for logistic regression)."""
        if hasattr(self.meta_learner, 'coef_'):
            return {
                'intercept': self.meta_learner.intercept_[0],
                'weights': self.meta_learner.coef_[0],
            }
        return None


# Example usage
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                            random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

base_models = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    GradientBoostingClassifier(n_estimators=100, random_state=42),
    SVC(probability=True, random_state=42),
    GaussianNB(),
]

ensemble = StackingEnsemble(base_models, n_folds=5)
ensemble.fit(X_train, y_train)

ensemble_probs = ensemble.predict_proba(X_test)
from sklearn.metrics import brier_score_loss
print(f"Ensemble Brier score: "
      f"{brier_score_loss(y_test, ensemble_probs):.4f}")

# Compare with individual models
for model in ensemble.fitted_base_models:
    if hasattr(model, 'predict_proba'):
        probs = model.predict_proba(X_test)[:, 1]
        bs = brier_score_loss(y_test, probs)
        print(f"{model.__class__.__name__}: Brier = {bs:.4f}")

25.7 Bayesian Model Averaging

Treating Model Choice as Uncertain

In classical statistics, we select a single "best" model and proceed as if it were the true model. Bayesian Model Averaging (BMA) takes a fundamentally different approach: it acknowledges that the true model is unknown and averages over all candidate models, weighted by their posterior probability of being correct.

Posterior Model Probabilities

Given data $D$ and a set of candidate models $\{M_1, \ldots, M_K\}$, the posterior probability of model $M_k$ is:

$$P(M_k | D) = \frac{P(D | M_k) \cdot P(M_k)}{\sum_{j=1}^{K} P(D | M_j) \cdot P(M_j)}$$

where $P(D | M_k)$ is the marginal likelihood of the data under model $M_k$:

$$P(D | M_k) = \int P(D | \theta_k, M_k) \cdot P(\theta_k | M_k) \, d\theta_k$$

This integral averages the likelihood over the prior distribution of parameters, automatically penalizing overly complex models (Occam's razor).

The BMA Formula

The BMA prediction for a quantity of interest $\Delta$ (e.g., a probability forecast) is:

$$P(\Delta | D) = \sum_{k=1}^{K} P(\Delta | M_k, D) \cdot P(M_k | D)$$

This weighted average uses the posterior model probabilities as weights. Each model's forecast is weighted by how well that model explains the observed data, with a built-in complexity penalty.

BIC Approximation

Computing the marginal likelihood exactly is often intractable. A widely used approximation employs the Bayesian Information Criterion (BIC):

$$\log P(D | M_k) \approx -\frac{1}{2} \text{BIC}_k$$

where:

$$\text{BIC}_k = -2 \log \hat{L}_k + d_k \log n$$

with $\hat{L}_k$ being the maximized likelihood, $d_k$ the number of parameters, and $n$ the sample size. The BIC approximation makes BMA computationally tractable for a wide range of models.

Using the BIC approximation, the posterior model weights become:

$$P(M_k | D) \propto P(M_k) \cdot \exp\left(-\frac{1}{2}\text{BIC}_k\right)$$

With equal priors $P(M_k) = 1/K$:

$$P(M_k | D) = \frac{\exp(-\frac{1}{2}\text{BIC}_k)}{\sum_{j=1}^{K}\exp(-\frac{1}{2}\text{BIC}_j)}$$

Advantages and Limitations

Advantages of BMA:

  • Principled framework grounded in probability theory
  • Automatically balances model fit and complexity
  • Provides honest uncertainty by averaging over model uncertainty
  • Well-calibrated prediction intervals

Limitations of BMA:

  • Assumes the true model is in the candidate set (M-closed assumption)
  • Marginal likelihood is sensitive to prior specification
  • With large data, tends to concentrate weight on a single model
  • Computationally expensive for complex models

Python BMA Implementation

import numpy as np

class BayesianModelAverager:
    """
    Bayesian Model Averaging using BIC approximation.

    Combines predictions from multiple models, weighted by
    their approximate posterior probabilities.
    """

    def __init__(self, models, prior_weights=None):
        """
        Parameters
        ----------
        models : list of dicts
            Each dict has:
            - 'name': str
            - 'model': fitted sklearn model with predict_proba
            - 'n_params': int, number of parameters
            - 'log_likelihood': float, maximized log-likelihood
        prior_weights : array-like, optional
            Prior model probabilities (default: uniform).
        """
        self.models = models
        self.n_models = len(models)
        if prior_weights is None:
            self.prior_weights = np.ones(self.n_models) / self.n_models
        else:
            self.prior_weights = np.array(prior_weights)
            self.prior_weights /= self.prior_weights.sum()

        self.posterior_weights = None

    def compute_weights(self, n_samples):
        """
        Compute posterior model weights using BIC approximation.

        Parameters
        ----------
        n_samples : int
            Number of training samples.
        """
        bic_values = np.array([
            -2 * m['log_likelihood'] + m['n_params'] * np.log(n_samples)
            for m in self.models
        ])

        # Log posterior (unnormalized): log_prior - BIC/2
        log_posterior = np.log(self.prior_weights) - 0.5 * bic_values

        # Normalize using log-sum-exp for numerical stability
        log_posterior -= np.max(log_posterior)
        self.posterior_weights = np.exp(log_posterior)
        self.posterior_weights /= self.posterior_weights.sum()

        return self.posterior_weights

    def predict(self, X):
        """
        Generate BMA probability predictions.

        Parameters
        ----------
        X : np.ndarray
            Feature matrix.

        Returns
        -------
        np.ndarray
            BMA-averaged probability predictions.
        """
        if self.posterior_weights is None:
            raise RuntimeError("Call compute_weights() first.")

        predictions = np.zeros(len(X))
        for model_info, weight in zip(self.models, self.posterior_weights):
            model = model_info['model']
            if hasattr(model, 'predict_proba'):
                pred = model.predict_proba(X)[:, 1]
            else:
                pred = model.predict(X)
            predictions += weight * pred

        return predictions

    def summary(self):
        """Print model weights and BIC values."""
        print("Bayesian Model Averaging Summary")
        print("-" * 50)
        for m, w in zip(self.models, self.posterior_weights):
            print(f"  {m['name']:30s}  Weight: {w:.4f}")
        print("-" * 50)


def compute_log_likelihood(model, X, y):
    """Compute log-likelihood for a fitted classifier."""
    probs = model.predict_proba(X)
    # Clip for numerical stability
    probs = np.clip(probs, 1e-10, 1 - 1e-10)
    ll = np.sum(y * np.log(probs[:, 1]) + (1 - y) * np.log(probs[:, 0]))
    return ll


# Example usage
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit models
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)
nb = GaussianNB().fit(X_train, y_train)

models = [
    {
        'name': 'Logistic Regression',
        'model': lr,
        'n_params': X_train.shape[1] + 1,
        'log_likelihood': compute_log_likelihood(lr, X_train, y_train),
    },
    {
        'name': 'Random Forest',
        'model': rf,
        'n_params': 50 * 10,  # Approximate
        'log_likelihood': compute_log_likelihood(rf, X_train, y_train),
    },
    {
        'name': 'Naive Bayes',
        'model': nb,
        'n_params': 2 * X_train.shape[1],
        'log_likelihood': compute_log_likelihood(nb, X_train, y_train),
    },
]

bma = BayesianModelAverager(models)
bma.compute_weights(n_samples=len(X_train))
bma.summary()

bma_preds = bma.predict(X_test)
from sklearn.metrics import brier_score_loss
print(f"\nBMA Brier score: {brier_score_loss(y_test, bma_preds):.4f}")

25.8 Combining Market Prices with Model Forecasts

Market Price as One "Model"

A prediction market price $p_{\text{market}}$ can be interpreted as the market's probability forecast. It aggregates the beliefs of all traders, weighted by their willingness to risk capital. This makes it a powerful "model" in its own right, but it is not infallible.

Market prices can be distorted by:

  • Thin liquidity: low trading volume can leave prices stale or manipulable.
  • Favorite-longshot bias: low-probability outcomes tend to be overpriced and near-certain outcomes underpriced.
  • Herding and information cascades: traders follow each other rather than trading on independent information.
  • Time premiums: early prices are less informative than late prices.

Your statistical or ML model may have advantages over the market:

  • It may process structured data more systematically.
  • It may be less susceptible to cognitive biases.
  • It may incorporate information not yet reflected in prices.

The question is: how should you combine your model's forecast with the market price?

When to Trust Your Model vs. the Market

The optimal balance depends on several factors:

| Factor | Trust Model More | Trust Market More |
|---|---|---|
| Market liquidity | Low liquidity | High liquidity |
| Information freshness | Model uses newer data | Market is up-to-date |
| Track record | Model has proven edge | Market has better calibration |
| Market maturity | New market | Established, active market |
| Domain | Specialized domain you know well | General knowledge domain |

Optimal Combination Weights

The optimal combination depends on the relative accuracy and correlation of the two sources. Using a simplified framework:

$$\hat{p}_{\text{combined}} = w \cdot \hat{p}_{\text{model}} + (1 - w) \cdot p_{\text{market}}$$

The optimal weight on the model is:

$$w^* = \frac{\sigma_{\text{market}}^2 - \sigma_{\text{model,market}}}{\sigma_{\text{model}}^2 + \sigma_{\text{market}}^2 - 2\sigma_{\text{model,market}}}$$

where $\sigma_{\text{model}}^2$ and $\sigma_{\text{market}}^2$ are the error variances, and $\sigma_{\text{model,market}}$ is the covariance of errors. In practice, these must be estimated from historical data.
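
A small helper that estimates $w^*$ from historical errors (a sketch; the function name is ours, and the result is clipped to $[0, 1]$ so the combination stays a convex mix):

import numpy as np

def optimal_model_weight(model_forecasts, market_prices, outcomes):
    """Estimate w* from historical forecast errors of the model and the market."""
    e_model = np.asarray(model_forecasts, dtype=float) - np.asarray(outcomes, dtype=float)
    e_market = np.asarray(market_prices, dtype=float) - np.asarray(outcomes, dtype=float)
    var_model = np.mean(e_model ** 2)       # mean squared error of the model
    var_market = np.mean(e_market ** 2)     # mean squared error of the market
    cov = np.mean(e_model * e_market)       # error covariance
    denom = var_model + var_market - 2 * cov
    if denom <= 0:
        return 0.5  # degenerate case: fall back to equal weighting
    return float(np.clip((var_market - cov) / denom, 0.0, 1.0))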

Updating Your Model with Market Information

A Bayesian approach treats the market price as a noisy signal about the true probability: start from a prior over $p_{\text{true}}$ built from the model's forecast, then multiply by the likelihood of the observed market price:

$$p(p_{\text{true}} \mid p_{\text{market}}) \propto \pi_{\text{model}}(p_{\text{true}}) \cdot f(p_{\text{market}} \mid p_{\text{true}})$$

If we model the market price as:

$$p_{\text{market}} | p_{\text{true}} \sim \text{Beta}\left(\kappa \cdot p_{\text{true}}, \kappa \cdot (1 - p_{\text{true}})\right)$$

where $\kappa$ measures market precision (higher $\kappa$ = more informative market), we can compute the posterior numerically.
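
A grid-based sketch of this update, under two assumptions of ours: the model's forecast is converted into a Beta prior over $p_{\text{true}}$ with pseudo-count strength $s$, and the posterior mean is reported as the combined forecast.

import numpy as np
from scipy.stats import beta

def market_bayes_update(p_model, p_market, kappa=20.0, s=10.0, grid_size=999):
    """Posterior mean of p_true given a Beta prior centered on the model
    forecast and a Beta-distributed market-price signal with precision kappa."""
    grid = np.linspace(0.001, 0.999, grid_size)
    prior = beta.pdf(grid, s * p_model, s * (1 - p_model))        # assumed prior
    like = beta.pdf(p_market, kappa * grid, kappa * (1 - grid))   # market likelihood
    post = prior * like
    post /= post.sum()                    # normalize on the uniform grid
    return float(np.sum(grid * post))     # posterior mean

print(f"Combined estimate: {market_bayes_update(0.70, 0.60):.3f}")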

A simpler approach works in log-odds space:

$$\text{logit}(p_{\text{combined}}) = w_1 \cdot \text{logit}(p_{\text{model}}) + w_2 \cdot \text{logit}(p_{\text{market}})$$

where $w_1 + w_2$ need not equal 1 (the sum controls the overall extremeness).

Python Market-Model Combiner

import numpy as np
from scipy.special import logit, expit

class MarketModelCombiner:
    """
    Combine model forecasts with prediction market prices.
    """

    def __init__(self, method='linear', model_weight=0.5):
        """
        Parameters
        ----------
        method : str
            'linear' for linear combination,
            'log_odds' for log-odds combination.
        model_weight : float
            Weight on the model forecast, in [0, 1]; the market price
            receives weight 1 - model_weight in both methods.
        """
        self.method = method
        self.model_weight = model_weight

    def combine(self, model_forecast, market_price):
        """
        Combine a model forecast with a market price.

        Parameters
        ----------
        model_forecast : float or array
            Model's probability forecast.
        market_price : float or array
            Market price / implied probability.

        Returns
        -------
        float or array
            Combined forecast.
        """
        model_forecast = np.clip(model_forecast, 1e-4, 1 - 1e-4)
        market_price = np.clip(market_price, 1e-4, 1 - 1e-4)

        if self.method == 'linear':
            w = self.model_weight
            return w * model_forecast + (1 - w) * market_price

        elif self.method == 'log_odds':
            w1 = self.model_weight
            w2 = 1.0 - w1
            combined_logit = (w1 * logit(model_forecast) +
                              w2 * logit(market_price))
            return expit(combined_logit)

        else:
            raise ValueError(f"Unknown method: {self.method}")

    def calibrate(self, model_forecasts, market_prices, outcomes):
        """
        Find optimal combination weight from historical data.

        Parameters
        ----------
        model_forecasts : array-like
            Historical model forecasts.
        market_prices : array-like
            Historical market prices.
        outcomes : array-like
            Binary outcomes.

        Returns
        -------
        float
            Optimal model weight.
        """
        from scipy.optimize import minimize_scalar

        model_forecasts = np.array(model_forecasts)
        market_prices = np.array(market_prices)
        outcomes = np.array(outcomes)

        def brier_score(w):
            self.model_weight = w
            combined = self.combine(model_forecasts, market_prices)
            return np.mean((combined - outcomes) ** 2)

        result = minimize_scalar(brier_score, bounds=(0, 1),
                                 method='bounded')
        self.model_weight = result.x
        return result.x


# Example
np.random.seed(42)
n = 200
true_p = np.random.beta(2, 3, n)
outcomes = (np.random.random(n) < true_p).astype(float)

# Model forecast: slightly noisy version of truth
model_f = np.clip(true_p + np.random.normal(0, 0.12, n), 0.01, 0.99)
# Market price: noisier, with some bias
market_p = np.clip(true_p + np.random.normal(0.02, 0.15, n), 0.01, 0.99)

combiner = MarketModelCombiner(method='linear')
optimal_w = combiner.calibrate(model_f, market_p, outcomes)
print(f"Optimal model weight: {optimal_w:.3f}")

combined = combiner.combine(model_f, market_p)
print(f"Model Brier:    {np.mean((model_f - outcomes)**2):.4f}")
print(f"Market Brier:   {np.mean((market_p - outcomes)**2):.4f}")
print(f"Combined Brier: {np.mean((combined - outcomes)**2):.4f}")

25.9 Forecast Aggregation at Scale

Combining Many Forecasters

When aggregating forecasts from dozens or hundreds of forecasters (as on platforms like Metaculus, Good Judgment Open, or PredictIt), special challenges arise:

  • Heterogeneous quality: Some forecasters are much better than others.
  • Herding: Many forecasters may simply copy the current consensus.
  • Sparse participation: Not all forecasters predict on all questions.
  • Strategic behavior: Some forecasters may report extreme views to move the consensus.

Robust Aggregation Methods

Trimmed Mean: Remove the most extreme forecasts before averaging.

$$\bar{p}_{\text{trimmed}} = \frac{1}{K - 2m}\sum_{i=m+1}^{K-m} p_{(i)}$$

where $p_{(i)}$ is the $i$-th smallest forecast and $m$ is the number of forecasts trimmed from each end.

Median: The 50th percentile of the forecast distribution. Maximally robust to outliers:

$$\tilde{p} = \text{median}(p_1, \ldots, p_K)$$

Winsorized Mean: Instead of removing extremes, clip them to the $m$-th and $(K-m)$-th values:

$$\bar{p}_{\text{Winsor}} = \frac{1}{K}\left[m \cdot p_{(m+1)} + \sum_{i=m+1}^{K-m} p_{(i)} + m \cdot p_{(K-m)}\right]$$

Performance-Weighted Aggregation

Track each forecaster's historical accuracy and weight accordingly:

$$\hat{p} = \frac{\sum_{i=1}^{K} s_i \cdot p_i}{\sum_{i=1}^{K} s_i}$$

where $s_i$ is forecaster $i$'s accuracy score (e.g., inverse Brier score).
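
A minimal sketch (inverse Brier score with a small floor is one reasonable choice of score transform among many):

import numpy as np

def performance_weighted(forecasts, brier_scores, floor=1e-3):
    """Weight each forecaster by the inverse of their historical Brier score."""
    forecasts = np.asarray(forecasts, dtype=float)
    scores = 1.0 / np.maximum(np.asarray(brier_scores, dtype=float), floor)
    return float(np.dot(scores, forecasts) / scores.sum())

# Forecasters with better track records (lower Brier) pull the aggregate harder
print(f"{performance_weighted([0.80, 0.70, 0.55], [0.10, 0.18, 0.25]):.3f}")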

The Metaculus Approach

Metaculus, a prominent forecasting platform, uses a sophisticated aggregation algorithm that:

  1. Weights forecasters by their track record.
  2. Recency-weights forecasts so that recent updates matter more.
  3. Applies a form of extremizing to the aggregate.
  4. Uses a mixture model to handle multi-modal forecast distributions.

The community prediction (CP) and the Metaculus prediction (MP) represent different aggregation strategies: the CP is closer to a simple average, while the MP uses more sophisticated techniques.

Python Forecast Aggregator

import numpy as np
from scipy import stats

class ForecastAggregator:
    """
    Aggregate multiple forecaster predictions using various methods.
    """

    def __init__(self):
        self.methods = {
            'mean': self._mean,
            'median': self._median,
            'trimmed_mean': self._trimmed_mean,
            'winsorized_mean': self._winsorized_mean,
            'weighted_mean': self._weighted_mean,
            'extremized_mean': self._extremized_mean,
        }

    def aggregate(self, forecasts, method='mean', **kwargs):
        """
        Aggregate forecasts using the specified method.

        Parameters
        ----------
        forecasts : array-like
            Individual probability forecasts.
        method : str
            Aggregation method name.
        **kwargs
            Method-specific parameters.

        Returns
        -------
        float
            Aggregated forecast.
        """
        forecasts = np.array(forecasts, dtype=float)
        if method not in self.methods:
            raise ValueError(f"Unknown method: {method}")
        return self.methods[method](forecasts, **kwargs)

    def _mean(self, forecasts, **kwargs):
        return float(np.mean(forecasts))

    def _median(self, forecasts, **kwargs):
        return float(np.median(forecasts))

    def _trimmed_mean(self, forecasts, trim_fraction=0.1, **kwargs):
        return float(stats.trim_mean(forecasts, trim_fraction))

    def _winsorized_mean(self, forecasts, limits=(0.1, 0.1), **kwargs):
        winsorized = stats.mstats.winsorize(forecasts, limits=limits)
        return float(np.mean(winsorized))

    def _weighted_mean(self, forecasts, weights=None, **kwargs):
        if weights is None:
            weights = np.ones(len(forecasts))
        weights = np.array(weights)
        weights /= weights.sum()
        return float(np.dot(weights, forecasts))

    def _extremized_mean(self, forecasts, d=1.5, **kwargs):
        from scipy.special import logit, expit
        avg = np.mean(forecasts)
        avg = np.clip(avg, 1e-6, 1 - 1e-6)
        return float(expit(d * logit(avg)))

    def compare_methods(self, forecasts, outcome=None, **kwargs):
        """
        Compare all aggregation methods on a set of forecasts.

        Parameters
        ----------
        forecasts : array-like
            Individual probability forecasts.
        outcome : float, optional
            Binary outcome (0 or 1) for computing Brier score.

        Returns
        -------
        dict
            Method name -> aggregated forecast (and Brier score if
            outcome provided).
        """
        results = {}
        for method_name in self.methods:
            try:
                agg = self.aggregate(forecasts, method=method_name, **kwargs)
                result = {'forecast': agg}
                if outcome is not None:
                    result['brier'] = (agg - outcome) ** 2
                results[method_name] = result
            except Exception as e:
                results[method_name] = {'error': str(e)}
        return results


# Example
np.random.seed(42)
true_p = 0.75

# Simulate 20 forecasters with varying skill
n_forecasters = 20
skill_levels = np.random.uniform(0.05, 0.25, n_forecasters)
forecasts = np.clip(
    true_p + np.random.normal(0, skill_levels),
    0.01, 0.99
)
# Add a few extreme outliers
forecasts[0] = 0.15  # Very wrong
forecasts[1] = 0.95  # Overly confident

outcome = 1.0  # Event occurred

agg = ForecastAggregator()
results = agg.compare_methods(forecasts, outcome=outcome)

print("Aggregation Method Comparison")
print("-" * 50)
for method, res in sorted(results.items(),
                           key=lambda x: x[1].get('brier', 99)):
    if 'error' not in res:
        bs = res.get('brier', None)
        bs_str = f"  Brier: {bs:.4f}" if bs is not None else ""
        print(f"  {method:20s}: {res['forecast']:.4f}{bs_str}")

25.10 Diversity and Correlation of Models

Why Diversity Matters

Diversity among ensemble members is the single most important factor determining ensemble quality. To see why, recall the decomposition of ensemble error.

For a simple average of $K$ models, the ensemble MSE can be decomposed as:

$$\text{MSE}_{\text{ensemble}} = \overline{\text{MSE}} - \overline{\text{Diversity}}$$

where $\overline{\text{MSE}}$ is the average MSE of the individual models and $\overline{\text{Diversity}}$ measures the average disagreement among models. This is the ambiguity decomposition (Krogh and Vedelsby, 1995):

$$\overline{\text{Diversity}} = \frac{1}{K}\sum_{i=1}^{K}(\hat{p}_i - \bar{p})^2$$

This tells us something profound: the ensemble error is always less than or equal to the average individual error, and the gap is exactly equal to the diversity. More diverse ensembles produce greater error reduction.

Measuring Model Diversity

Several metrics quantify diversity:

1. Pairwise Correlation of Errors:

$$\rho_{ij} = \text{Corr}(\epsilon_i, \epsilon_j)$$

where $\epsilon_i = \hat{p}_i - y$ is the error of model $i$. Lower average pairwise correlation indicates higher diversity.

2. Disagreement (Variance of Predictions):

$$D = \frac{1}{K}\sum_{i=1}^{K}(\hat{p}_i - \bar{p})^2$$

Higher disagreement indicates higher diversity.

3. Double Fault Measure:

The fraction of instances where both models $i$ and $j$ are wrong. Lower double fault = higher diversity.

4. Q-Statistic (Yule's Q):

$$Q_{ij} = \frac{N_{11}N_{00} - N_{01}N_{10}}{N_{11}N_{00} + N_{01}N_{10}}$$

where $N_{11}$ is the count of instances where both models are correct, $N_{00}$ where both are wrong, etc. $Q = 0$ indicates independence; $Q = 1$ indicates perfect positive dependence.

Why Correlated Models Add Less Value

If you already have three polling-based models in your ensemble, adding a fourth polling-based model provides much less improvement than adding a fundamentals-based model. The fourth polling model makes errors correlated with the existing three, so it contributes little new information.

Mathematically, the marginal value of adding model $K+1$ to an ensemble of $K$ models is approximately:

$$\Delta \text{MSE} \approx \frac{(1 - \bar{\rho}_{K+1})\sigma_{K+1}^2}{(K+1)^2}$$

where $\bar{\rho}_{K+1}$ is the average correlation of the new model's errors with the existing models. When $\bar{\rho}_{K+1}$ is high, the marginal improvement is tiny.

Strategies for Creating Diverse Models

  1. Different algorithms: Use fundamentally different model types (logistic regression, random forest, neural network, naive Bayes).

  2. Different feature sets: Train models on different subsets of features. A polling model and a fundamentals model naturally use different features.

  3. Different training data: Use different time periods, different subsets of events, or bootstrapped samples.

  4. Different preprocessing: Vary the feature engineering, normalization, or missing value handling.

  5. Different hyperparameters: Use different regularization strengths, tree depths, or learning rates.

  6. Different information sources: Combine model predictions with market prices, expert forecasts, and survey data.

  7. Temporal diversity: Combine models trained at different points in time.

Python Diversity Metrics

import numpy as np
from itertools import combinations

class DiversityAnalyzer:
    """
    Analyze the diversity of an ensemble of forecasting models.
    """

    def __init__(self, predictions_matrix, outcomes):
        """
        Parameters
        ----------
        predictions_matrix : np.ndarray, shape (T, K)
            Forecasts from K models over T instances.
        outcomes : np.ndarray, shape (T,)
            Binary outcomes.
        """
        self.predictions = np.array(predictions_matrix)
        self.outcomes = np.array(outcomes)
        self.T, self.K = self.predictions.shape
        self.errors = self.predictions - self.outcomes.reshape(-1, 1)

    def error_correlation_matrix(self):
        """Compute pairwise correlation of model errors."""
        return np.corrcoef(self.errors.T)

    def average_pairwise_correlation(self):
        """Average pairwise correlation of errors."""
        corr = self.error_correlation_matrix()
        mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
        return float(corr[mask].mean())

    def prediction_disagreement(self):
        """
        Average disagreement (variance of predictions) across instances.
        """
        return float(np.mean(np.var(self.predictions, axis=1)))

    def ambiguity_decomposition(self):
        """
        Compute the ambiguity decomposition:
        Ensemble MSE = Average Individual MSE - Diversity
        """
        ensemble_preds = self.predictions.mean(axis=1)
        ensemble_mse = np.mean((ensemble_preds - self.outcomes) ** 2)

        individual_mses = np.mean(self.errors ** 2, axis=0)
        avg_individual_mse = np.mean(individual_mses)

        diversity = avg_individual_mse - ensemble_mse

        return {
            'ensemble_mse': ensemble_mse,
            'avg_individual_mse': avg_individual_mse,
            'diversity': diversity,
            'diversity_ratio': diversity / avg_individual_mse
                if avg_individual_mse > 0 else 0,
        }

    def q_statistic(self, threshold=0.5):
        """
        Compute pairwise Q-statistics (for binary predictions).

        Parameters
        ----------
        threshold : float
            Threshold for converting probabilities to binary.

        Returns
        -------
        np.ndarray
            K x K matrix of Q-statistics.
        """
        binary_preds = (self.predictions >= threshold).astype(int)
        binary_correct = (binary_preds ==
                          self.outcomes.reshape(-1, 1).astype(int))

        Q = np.zeros((self.K, self.K))

        for i, j in combinations(range(self.K), 2):
            n11 = np.sum(binary_correct[:, i] & binary_correct[:, j])
            n00 = np.sum(~binary_correct[:, i] & ~binary_correct[:, j])
            n10 = np.sum(binary_correct[:, i] & ~binary_correct[:, j])
            n01 = np.sum(~binary_correct[:, i] & binary_correct[:, j])

            denom = n11 * n00 + n01 * n10
            if denom > 0:
                Q[i, j] = Q[j, i] = (n11 * n00 - n01 * n10) / denom

        np.fill_diagonal(Q, 1.0)
        return Q

    def marginal_value(self):
        """
        Compute the marginal value of each model to the ensemble.
        Compare ensemble MSE with and without each model.
        """
        full_ensemble_mse = np.mean(
            (self.predictions.mean(axis=1) - self.outcomes) ** 2
        )

        marginal_values = {}
        for k in range(self.K):
            # Ensemble without model k
            mask = [i for i in range(self.K) if i != k]
            reduced_preds = self.predictions[:, mask].mean(axis=1)
            reduced_mse = np.mean((reduced_preds - self.outcomes) ** 2)
            marginal_values[k] = reduced_mse - full_ensemble_mse

        return marginal_values

    def summary(self, model_names=None):
        """Print a comprehensive diversity report."""
        if model_names is None:
            model_names = [f"Model {i}" for i in range(self.K)]

        decomp = self.ambiguity_decomposition()

        print("=" * 60)
        print("ENSEMBLE DIVERSITY REPORT")
        print("=" * 60)

        print(f"\nNumber of models: {self.K}")
        print(f"Number of instances: {self.T}")

        print(f"\nAmbiguity Decomposition:")
        print(f"  Avg Individual MSE: {decomp['avg_individual_mse']:.4f}")
        print(f"  Ensemble MSE:       {decomp['ensemble_mse']:.4f}")
        print(f"  Diversity:          {decomp['diversity']:.4f}")
        print(f"  Diversity Ratio:    {decomp['diversity_ratio']:.1%}")

        print(f"\nAvg Pairwise Error Correlation: "
              f"{self.average_pairwise_correlation():.3f}")
        print(f"Prediction Disagreement:         "
              f"{self.prediction_disagreement():.4f}")

        print(f"\nMarginal Values (MSE increase if removed):")
        mv = self.marginal_value()
        for k in sorted(mv, key=mv.get, reverse=True):
            print(f"  {model_names[k]:20s}: +{mv[k]:.4f}")


# Example
np.random.seed(42)
T = 300
true_p = np.random.beta(2, 2, T)
outcomes = (np.random.random(T) < true_p).astype(float)

# Create models with varying diversity
preds = np.column_stack([
    np.clip(true_p + np.random.normal(0, 0.10, T), 0.01, 0.99),
    np.clip(true_p + np.random.normal(0, 0.12, T), 0.01, 0.99),
    np.clip(true_p + np.random.normal(0, 0.15, T), 0.01, 0.99),
    np.clip(true_p + np.random.normal(0.05, 0.18, T), 0.01, 0.99),
])

analyzer = DiversityAnalyzer(preds, outcomes)
analyzer.summary(
    model_names=['Polling Model', 'Fundamentals', 'ML Model', 'Market Price']
)

25.11 Practical Ensemble Building

End-to-End Ensemble Workflow

Building an effective ensemble for prediction markets follows a structured workflow:

Step 1: Define the Problem
  • What are you forecasting? (Binary outcome, multi-class, continuous?)
  • What is your evaluation metric? (Brier score, log loss, calibration?)
  • What is your time horizon?

Step 2: Build Diverse Base Models
  • Start with at least 3-5 models using different approaches:
      - A simple baseline (historical base rates)
      - A statistical model (logistic regression on structured features)
      - A machine learning model (gradient boosting, random forest)
      - Market price (if available)
      - Expert or crowd forecast (if available)
  • Ensure diversity by varying algorithms, features, and data sources.

Step 3: Evaluate Individual Models
  • Assess each model's calibration, discrimination (AUC), and overall accuracy (Brier score).
  • Use proper scoring rules (see Chapter 6).
  • Evaluate on a held-out test set or via cross-validation.

Step 4: Choose a Combination Method
  • Start with simple averaging as a baseline.
  • Try weighted averaging if you have enough historical data (at least 50-100 resolved predictions per model).
  • Try extremizing if your models share significant information.
  • Try stacking if you have abundant data and features.
  • Try BMA if you want principled model uncertainty quantification.

Step 5: Validate the Ensemble
  • Use time-series cross-validation (never peek at the future); a code sketch follows this workflow.
  • Compare the ensemble to each individual model.
  • Check calibration of the ensemble.
  • Analyze the diversity metrics to understand why the ensemble works.

Step 6: Monitor and Update
  • Track ensemble performance over time.
  • Periodically re-estimate combination weights.
  • Add new models or remove underperformers.
  • Watch for concept drift (systematic changes in the forecasting environment).
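
The sketch below illustrates Steps 4 and 5 under simple assumptions: a forecast matrix and resolved outcomes like the synthetic preds and outcomes from the diversity example above, inverse-Brier weights estimated on past data only, and an illustrative (hypothetical) helper walk_forward_ensemble_eval rather than a fixed prescription.

import numpy as np

def walk_forward_ensemble_eval(pred_matrix, outcomes, initial_train=100, step=25):
    """Walk-forward evaluation: estimate inverse-Brier weights on past data
    only, then score the weighted and simple-average ensembles on the next
    block of outcomes (no peeking at the future)."""
    P = np.asarray(pred_matrix, dtype=float)   # shape (T, K)
    y = np.asarray(outcomes, dtype=float)      # shape (T,)
    T, _ = P.shape

    simple_scores, weighted_scores = [], []
    for start in range(initial_train, T, step):
        end = min(start + step, T)

        # Combination weights from past performance only
        past_brier = np.mean((P[:start] - y[:start, None]) ** 2, axis=0)
        w = (1.0 / past_brier) / np.sum(1.0 / past_brier)

        block_P, block_y = P[start:end], y[start:end]
        simple_scores.append(np.mean((block_P.mean(axis=1) - block_y) ** 2))
        weighted_scores.append(np.mean((block_P @ w - block_y) ** 2))

    return float(np.mean(simple_scores)), float(np.mean(weighted_scores))

# Using the synthetic `preds` and `outcomes` from the diversity example above
simple_bs, weighted_bs = walk_forward_ensemble_eval(preds, outcomes)
print(f"Simple average Brier: {simple_bs:.4f}  Weighted Brier: {weighted_bs:.4f}")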

How Many Models to Combine

Empirical research suggests:

  • 3-5 models: The sweet spot for most applications. Most of the ensemble benefit comes from the first few diverse models.
  • 5-10 models: Marginally better, but diminishing returns set in quickly.
  • 10+ models: Usually not worth the added complexity unless models are very diverse.

The key insight is that the marginal value of adding a new model decreases rapidly as the ensemble grows, especially if the new model is correlated with existing members.

A useful heuristic: add a new model only if its errors have sufficiently low correlation with those of the existing ensemble members (e.g., average pairwise error correlation below 0.5).

When to Stop Adding Models

Stop adding models when:

  1. The marginal improvement in cross-validated performance is below a meaningful threshold (e.g., less than 0.001 Brier score improvement).

  2. The new model is highly correlated with existing models (average error correlation > 0.7; see the sketch below).

  3. The computational cost of maintaining the model outweighs the accuracy gains.

  4. You cannot reliably estimate the combination weights with your available data.
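
Criteria 1 and 2 can be checked mechanically. The following is a minimal sketch for an equally weighted ensemble; the function name should_add_model is hypothetical and the default thresholds simply mirror the illustrative values above.

import numpy as np

def should_add_model(existing_preds, candidate_preds, outcomes,
                     max_corr=0.7, min_brier_gain=0.001):
    """Return (decision, diagnostics) for adding a candidate model to an
    equally weighted ensemble, using the stopping criteria above."""
    existing = np.asarray(existing_preds, dtype=float)   # shape (T, K)
    cand = np.asarray(candidate_preds, dtype=float)      # shape (T,)
    y = np.asarray(outcomes, dtype=float)

    # Criterion 2: average error correlation with existing members
    cand_err = cand - y
    existing_err = existing - y[:, None]
    avg_corr = float(np.mean([
        np.corrcoef(cand_err, existing_err[:, k])[0, 1]
        for k in range(existing.shape[1])
    ]))

    # Criterion 1: marginal Brier improvement from adding the candidate
    current_bs = np.mean((existing.mean(axis=1) - y) ** 2)
    expanded_bs = np.mean((np.column_stack([existing, cand]).mean(axis=1) - y) ** 2)
    gain = float(current_bs - expanded_bs)

    decision = (avg_corr <= max_corr) and (gain >= min_brier_gain)
    return decision, {"avg_error_corr": avg_corr, "brier_gain": gain}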

Monitoring Ensemble Performance Over Time

Ensembles are not "set and forget." You must monitor:

  • Performance drift: Is the ensemble's Brier score trending upward?
  • Weight stability: Are the optimal weights changing dramatically over time?
  • Calibration drift: Is the ensemble becoming overconfident or underconfident?
  • Model degradation: Is any individual model suddenly performing much worse?

A rolling window evaluation (e.g., evaluating performance over the last 100 predictions) helps detect these issues early.

def rolling_brier_score(forecasts, outcomes, window=50):
    """Compute rolling Brier score to detect performance drift."""
    forecasts = np.array(forecasts)
    outcomes = np.array(outcomes)
    n = len(forecasts)

    rolling_scores = []
    for i in range(window, n + 1):
        window_forecasts = forecasts[i-window:i]
        window_outcomes = outcomes[i-window:i]
        bs = np.mean((window_forecasts - window_outcomes) ** 2)
        rolling_scores.append(bs)

    return np.array(rolling_scores)
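
Continuing the synthetic example from earlier in the chapter (the preds matrix and outcomes array), a simple drift check might compare the most recent window against the long-run average; the 0.02 margin below is an illustrative choice, not a recommendation.

# Rolling Brier score of the simple-average ensemble from the earlier example
rolling = rolling_brier_score(preds.mean(axis=1), outcomes, window=50)
print(f"Latest rolling Brier: {rolling[-1]:.4f} "
      f"(long-run mean {rolling.mean():.4f})")
if rolling[-1] > rolling.mean() + 0.02:  # illustrative drift margin
    print("Warning: possible performance drift in the ensemble")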

25.12 Chapter Summary

This chapter has covered the full spectrum of ensemble methods for prediction market forecasting:

  1. Simple averaging is a surprisingly strong baseline that reduces variance without increasing bias. It works best when models are diverse and you have limited historical data for estimating weights.

  2. Weighted averaging improves on simple averaging when you can reliably estimate model quality, but beware of overfitting. Shrinkage toward equal weights provides a safety net.

  3. Linear and logarithmic pooling offer principled frameworks for combining probability forecasts. The linear pool is conservative and preserves calibration; the logarithmic pool captures the strength of agreement but assumes independent evidence.

  4. Extremizing addresses the systematic moderation of averaged forecasts by pushing the aggregate away from 50%. It is theoretically justified when forecasters share information and empirically effective across many domains.

  5. Stacking uses a meta-learner to discover non-linear, context-dependent combination strategies. Cross-validated stacking prevents information leakage.

  6. Bayesian Model Averaging provides a principled framework that accounts for model uncertainty, automatically penalizing complexity through the marginal likelihood.

  7. Market-model combination treats the market price as one input to the ensemble, recognizing that both your models and the market have complementary strengths and weaknesses.

  8. Forecast aggregation at scale requires robust methods (trimmed means, medians) and performance-weighted approaches to handle heterogeneous forecasters.

  9. Diversity is the most important property of an effective ensemble. Models with uncorrelated errors contribute much more than models with correlated errors.

The overarching lesson is this: humility about any single model's accuracy, combined with disciplined combination methods, produces forecasts that are more accurate, better calibrated, and more robust than any individual approach.


What's Next

In Chapter 26, we will turn to Feature Engineering for Prediction Markets, where we explore how to construct powerful input features from raw data sources including market microstructure, historical prices, fundamental indicators, and textual data. Good features are the foundation on which all the models and ensembles we have discussed in this chapter are built, and the art of feature engineering often determines the difference between a mediocre forecaster and an exceptional one.


Key equations to remember:

| Method | Formula |
|---|---|
| Simple average | $\hat{p} = \frac{1}{K}\sum_i \hat{p}_i$ |
| Weighted average | $\hat{p} = \sum_i w_i \hat{p}_i$ |
| Log pool | $\text{logit}(\hat{p}) = \sum_i w_i \cdot \text{logit}(\hat{p}_i)$ |
| Extremizing | $\text{logit}(\hat{p}_{\text{ext}}) = d \cdot \text{logit}(\bar{p})$ |
| BMA | $P(\Delta \mid D) = \sum_k P(\Delta \mid M_k, D) \cdot P(M_k \mid D)$ |
| Ambiguity decomposition | $\text{MSE}_{\text{ens}} = \overline{\text{MSE}} - \overline{\text{Diversity}}$ |