34 min read

> "The ideal forecaster should, like a good poker player, be well calibrated. When they say there is a 70% chance of rain, it should rain about 70% of the time."

In This Chapter

12.1 What Is Calibration?
12.2 Measuring Calibration
12.3 Reliability Diagrams
12.4 Brier Score Decomposition Revisited
12.5 Brier Skill Score
12.6 Sharpness and Resolution
12.7 Calibration of Prediction Markets
12.8 Recalibration Techniques
12.9 Calibration of Individual Traders
12.10 Advanced: Calibration in Multi-Outcome Settings
12.11 Practical Tools and Workflows
12.12 Chapter Summary
What's Next
Key Equations Reference

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Chapter 12: Calibration — Measuring Forecast Quality

"The ideal forecaster should, like a good poker player, be well calibrated. When they say there is a 70% chance of rain, it should rain about 70% of the time." — Adapted from Allan Murphy, pioneer of forecast verification

You have been trading on prediction markets for several months. Your portfolio shows a modest profit, and you feel increasingly confident in your ability to assess probabilities. But a nagging question remains: when you assign a probability of 70% to an event, does it actually happen about 70% of the time? Are your stated beliefs meaningfully connected to reality, or are you systematically fooling yourself?

This question cuts to the very heart of probabilistic reasoning, and answering it rigorously is what calibration analysis is all about. Calibration is, without exaggeration, the single most important quality metric for anyone who makes probabilistic forecasts — whether you are a weather forecaster, a machine learning engineer, a superforecaster, or a prediction market trader.

In this chapter, we will develop a thorough understanding of calibration: what it means, how to measure it, how to visualize it, and how to improve it. We will decompose scoring rules into components that isolate calibration from other aspects of forecast quality. We will examine whether real prediction markets are well-calibrated and what systematic biases exist. Finally, we will build practical tools for tracking and improving your own calibration as a trader.

This chapter completes Part II of the book. By its end, you will have a full toolkit for evaluating forecast quality — the essential bridge between understanding market microstructure and developing profitable trading strategies in Part III.

12.1 What Is Calibration?

The Core Definition

Calibration is a deceptively simple concept with profound implications.

Definition. A forecaster is perfectly calibrated if, among all the events to which they assign a probability of p, the fraction that actually occur is exactly p. Formally, for all probability values $p \in [0, 1]$:

$$P(\text{event occurs} \mid \text{forecast} = p) = p$$

In plain language: when you say 70%, it should happen 70% of the time. When you say 20%, it should happen 20% of the time. When you say 95%, it should happen 95% of the time.

This sounds trivially obvious — of course your probabilities should match reality. But in practice, achieving good calibration is remarkably difficult. Human cognitive biases, limited feedback, overreliance on narratives, and the sheer difficulty of probabilistic thinking all conspire against it.

A Concrete Example

Imagine two weather forecasters, Alice and Bob, who each make 1,000 precipitation forecasts over a year.

Alice's track record: - Of the 100 days she said "10% chance of rain," it rained on 10 of them. - Of the 200 days she said "30% chance of rain," it rained on 60 of them. - Of the 150 days she said "50% chance of rain," it rained on 75 of them. - Of the 250 days she said "70% chance of rain," it rained on 175 of them. - Of the 300 days she said "90% chance of rain," it rained on 270 of them.

Alice is perfectly calibrated. Every probability she stated matches the actual frequency exactly.

Bob's track record: - Of the 50 days he said "10% chance of rain," it rained on 25 of them (50% actual). - Of the 200 days he said "30% chance of rain," it rained on 100 of them (50% actual). - Of the 300 days he said "50% chance of rain," it rained on 150 of them (50% actual). - Of the 250 days he said "70% chance of rain," it rained on 125 of them (50% actual). - Of the 200 days he said "90% chance of rain," it rained on 100 of them (50% actual).

Bob is terribly calibrated. No matter what probability he states, the actual frequency is always about 50%. His forecasts carry no useful information about rain probability — he might as well be flipping a coin and then dressing it up in different numbers.

Why Calibration Matters for Prediction Markets

In prediction markets, calibration has a direct financial interpretation. If a market consistently prices events at 70% and those events occur 70% of the time, the market is well-calibrated. This means:

Traders can trust the prices. A price of $0.70 genuinely means a 70% probability, not something systematically higher or lower.
Mispricing is identifiable. If you discover that the market is overconfident (events priced at 70% happen only 55% of the time), you have found a systematic edge to exploit.
Decision-making is sound. Policymakers, journalists, and individuals who use prediction market probabilities to make decisions can rely on those numbers being meaningful.
Personal skill is measurable. As a trader, your calibration track record is the most honest assessment of your probabilistic reasoning ability.

Calibration vs. Other Forecast Qualities

Calibration is necessary but not sufficient for good forecasting. It is important to distinguish it from related concepts:

Accuracy (or discrimination) measures whether your forecasts distinguish events that happen from events that do not. A forecaster who always says "50%" for everything is perfectly calibrated in a world where the base rate is 50% — but utterly useless. They have zero discrimination.

Sharpness measures how extreme your forecasts are. Sharp forecasts are close to 0% or 100%. A sharp, well-calibrated forecaster is highly informative — they confidently identify high-probability and low-probability events, and they are right to be confident.

Resolution measures how much your forecasts vary with the actual outcome. High resolution means you assign high probabilities to events that happen and low probabilities to events that do not.

The ideal forecaster is simultaneously well-calibrated, sharp, and has high resolution. Calibration alone can be trivially achieved — but calibration combined with sharpness and resolution is the hallmark of genuine forecasting skill.

12.2 Measuring Calibration

The Binning Approach

Since we rarely have enough forecasts at any single probability value, we need to group forecasts into bins to assess calibration. The standard approach:

Sort all forecasts by their predicted probability.
Divide them into $K$ bins (typically 10 bins: [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]).
For each bin, compute the mean predicted probability and the observed frequency (fraction of events that actually occurred).
Compare the two: if they match closely, the forecaster is well-calibrated.

Expected Calibration Error (ECE)

The Expected Calibration Error is the most widely used scalar summary of calibration quality. It computes the weighted average absolute difference between predicted probabilities and observed frequencies across bins:

$$\text{ECE} = \sum_{k=1}^{K} \frac{n_k}{N} \left| \bar{p}_k - \bar{o}_k \right|$$

where: - $K$ is the number of bins - $n_k$ is the number of forecasts in bin $k$ - $N$ is the total number of forecasts - $\bar{p}_k$ is the mean predicted probability in bin $k$ - $\bar{o}_k$ is the observed frequency (fraction of positive outcomes) in bin $k$

A perfectly calibrated forecaster has $\text{ECE} = 0$. In practice, values below 0.02 indicate excellent calibration, 0.02-0.05 is good, 0.05-0.10 is mediocre, and above 0.10 suggests serious calibration problems.

Maximum Calibration Error (MCE)

While ECE gives the average calibration error, we sometimes care about the worst-case error across bins:

$$\text{MCE} = \max_{k \in \{1, \ldots, K\}} \left| \bar{p}_k - \bar{o}_k \right|$$

MCE is particularly useful when you need to ensure no single probability range is badly miscalibrated. A forecaster might have a low ECE but a high MCE if they are well-calibrated everywhere except for one problematic probability range.

Choosing the Number of Bins

The choice of $K$ involves a bias-variance tradeoff:

Too few bins (e.g., $K = 3$): Averages away important calibration details. A forecaster who is overconfident at 60% and underconfident at 80% might appear calibrated if both land in the same bin.
Too many bins (e.g., $K = 50$): Too few forecasts per bin, leading to noisy estimates. The observed frequency in a bin with only 5 forecasts is unreliable.
Common choice: $K = 10$ (decile bins) or $K = 20$ (vingtile bins). With 1,000+ forecasts, $K = 10$ works well. With 10,000+, you can use $K = 20$ or even finer binning.

An alternative to fixed-width bins is adaptive binning (equal-frequency bins), where each bin contains the same number of forecasts. This avoids empty or near-empty bins but makes the bin boundaries data-dependent.

Python Implementation

import numpy as np

def compute_calibration_metrics(predictions, outcomes, n_bins=10, strategy='uniform'):
    """
    Compute calibration metrics for binary forecasts.

    Parameters
    ----------
    predictions : array-like
        Predicted probabilities in [0, 1].
    outcomes : array-like
        Binary outcomes (0 or 1).
    n_bins : int
        Number of calibration bins.
    strategy : str
        'uniform' for equal-width bins, 'quantile' for equal-frequency bins.

    Returns
    -------
    dict with keys: 'ece', 'mce', 'bin_edges', 'bin_means', 'bin_freqs', 'bin_counts'
    """
    predictions = np.array(predictions, dtype=float)
    outcomes = np.array(outcomes, dtype=float)

    if strategy == 'uniform':
        bin_edges = np.linspace(0, 1, n_bins + 1)
    elif strategy == 'quantile':
        quantiles = np.linspace(0, 100, n_bins + 1)
        bin_edges = np.percentile(predictions, quantiles)
        bin_edges[0] = 0.0
        bin_edges[-1] = 1.0
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

    bin_means = []
    bin_freqs = []
    bin_counts = []

    for i in range(n_bins):
        if i < n_bins - 1:
            mask = (predictions >= bin_edges[i]) & (predictions < bin_edges[i + 1])
        else:
            mask = (predictions >= bin_edges[i]) & (predictions <= bin_edges[i + 1])

        count = mask.sum()
        bin_counts.append(count)

        if count > 0:
            bin_means.append(predictions[mask].mean())
            bin_freqs.append(outcomes[mask].mean())
        else:
            bin_means.append((bin_edges[i] + bin_edges[i + 1]) / 2)
            bin_freqs.append(0.0)

    bin_means = np.array(bin_means)
    bin_freqs = np.array(bin_freqs)
    bin_counts = np.array(bin_counts)

    # ECE: weighted average absolute difference
    weights = bin_counts / bin_counts.sum()
    ece = np.sum(weights * np.abs(bin_means - bin_freqs))

    # MCE: maximum absolute difference (over non-empty bins)
    non_empty = bin_counts > 0
    if non_empty.any():
        mce = np.max(np.abs(bin_means[non_empty] - bin_freqs[non_empty]))
    else:
        mce = 0.0

    return {
        'ece': ece,
        'mce': mce,
        'bin_edges': bin_edges,
        'bin_means': bin_means,
        'bin_freqs': bin_freqs,
        'bin_counts': bin_counts,
    }

This function forms the backbone of our calibration analysis toolkit. We will extend it throughout the chapter with visualization and decomposition capabilities.

Confidence Intervals for Calibration

A critical but often overlooked aspect of calibration measurement is statistical uncertainty. With finite samples, the observed frequency in each bin is a noisy estimate of the true conditional probability. We can compute confidence intervals using the binomial distribution:

For a bin with $n_k$ forecasts and observed frequency $\hat{o}_k$, the standard error is:

$$\text{SE}_k = \sqrt{\frac{\hat{o}_k (1 - \hat{o}_k)}{n_k}}$$

A 95% confidence interval is approximately $\hat{o}_k \pm 1.96 \times \text{SE}_k$. When assessing whether a forecaster is miscalibrated in a particular bin, we should check whether the perfect calibration value $\bar{p}_k$ falls within this confidence interval.

12.3 Reliability Diagrams

What Is a Reliability Diagram?

A reliability diagram (also called a calibration plot) is the primary visual tool for assessing calibration. It plots the observed frequency of outcomes against the predicted probability for each bin.

The key elements of a reliability diagram:

The diagonal line ($y = x$): This represents perfect calibration. Any point on this line means "what you predicted is exactly what happened."
The calibration curve: The actual relationship between predicted probabilities and observed frequencies. Points above the diagonal indicate underconfidence (things happen more often than predicted); points below indicate overconfidence (things happen less often than predicted).
Histogram of forecast counts: Often shown as a secondary panel below the main plot, this shows how many forecasts fall in each bin. This is essential context — a wildly miscalibrated bin with only 3 forecasts is not concerning, while a slightly miscalibrated bin with 500 forecasts is.
Confidence bands: Shaded regions around the diagonal showing the range of observed frequencies consistent with perfect calibration at a given sample size.

Reading Reliability Diagrams

Let us examine common patterns you will encounter:

Pattern 1: Overconfidence. The calibration curve falls below the diagonal for high predicted probabilities and above it for low predicted probabilities. Events predicted at 90% occur only 75% of the time; events predicted at 10% occur 25% of the time. The forecaster's probabilities are too extreme — they are more confident than they should be.

This is the most common human bias. Studies of expert judgment consistently find overconfidence, particularly for extreme probabilities. In prediction markets, overconfidence manifests as the favorite-longshot bias (which we will explore in Chapter 15).

Pattern 2: Underconfidence. The calibration curve falls above the diagonal for high probabilities and below it for low probabilities. Events predicted at 90% occur 97% of the time; events predicted at 10% occur 3% of the time. The forecaster is not confident enough — their true beliefs are more extreme than what they state.

Underconfidence is less common than overconfidence but can arise from excessive hedging, anchoring to base rates, or institutional incentives (weather forecasters, for example, sometimes hedge to avoid blame for "missed" events).

Pattern 3: S-shaped curve. A sigmoidal calibration curve that is overconfident in some ranges and underconfident in others. This often results from a miscalibrated probability model that maps raw scores to probabilities incorrectly.

Pattern 4: Step function. The calibration curve has flat regions followed by jumps. This occurs when a forecaster uses only a few distinct probability values (e.g., 20%, 50%, 80%) rather than the full probability scale.

Python Reliability Diagram Generator

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

def plot_reliability_diagram(predictions, outcomes, n_bins=10, strategy='uniform',
                              title='Reliability Diagram', figsize=(8, 8)):
    """
    Generate a publication-quality reliability diagram with confidence bands.

    Parameters
    ----------
    predictions : array-like
        Predicted probabilities in [0, 1].
    outcomes : array-like
        Binary outcomes (0 or 1).
    n_bins : int
        Number of bins.
    strategy : str
        'uniform' or 'quantile'.
    title : str
        Plot title.
    figsize : tuple
        Figure size.

    Returns
    -------
    fig : matplotlib Figure
    metrics : dict with ECE, MCE
    """
    predictions = np.array(predictions, dtype=float)
    outcomes = np.array(outcomes, dtype=float)

    # Compute calibration metrics
    from calibration_utils import compute_calibration_metrics  # or inline
    metrics = compute_calibration_metrics(predictions, outcomes, n_bins, strategy)

    fig = plt.figure(figsize=figsize)
    gs = GridSpec(2, 1, height_ratios=[3, 1], hspace=0.05)

    ax1 = fig.add_subplot(gs[0])
    ax2 = fig.add_subplot(gs[1])

    # --- Main calibration plot ---
    # Perfect calibration line
    ax1.plot([0, 1], [0, 1], 'k--', linewidth=1.5, label='Perfect calibration')

    # Confidence band (based on binomial uncertainty)
    x_smooth = np.linspace(0.01, 0.99, 200)
    n_per_bin = len(predictions) / n_bins  # approximate
    se = np.sqrt(x_smooth * (1 - x_smooth) / max(n_per_bin, 1))
    ax1.fill_between(x_smooth, x_smooth - 1.96 * se, x_smooth + 1.96 * se,
                     alpha=0.15, color='gray', label='95% confidence band')

    # Calibration curve
    non_empty = metrics['bin_counts'] > 0
    ax1.plot(metrics['bin_means'][non_empty], metrics['bin_freqs'][non_empty],
             's-', color='#1f77b4', markersize=8, linewidth=2,
             label=f"Forecaster (ECE={metrics['ece']:.4f})")

    # Error bars on observed frequencies
    for i in range(n_bins):
        if metrics['bin_counts'][i] > 0:
            se_i = np.sqrt(metrics['bin_freqs'][i] * (1 - metrics['bin_freqs'][i])
                           / metrics['bin_counts'][i])
            ax1.errorbar(metrics['bin_means'][i], metrics['bin_freqs'][i],
                        yerr=1.96 * se_i, color='#1f77b4', capsize=3,
                        linewidth=1, alpha=0.6)

    ax1.set_xlim(-0.02, 1.02)
    ax1.set_ylim(-0.02, 1.02)
    ax1.set_ylabel('Observed Frequency', fontsize=12)
    ax1.set_title(title, fontsize=14)
    ax1.legend(loc='upper left', fontsize=10)
    ax1.set_xticklabels([])
    ax1.grid(True, alpha=0.3)

    # --- Histogram of predictions ---
    ax2.bar(metrics['bin_means'][non_empty], metrics['bin_counts'][non_empty],
            width=1.0 / n_bins * 0.8, color='#1f77b4', alpha=0.6, edgecolor='white')
    ax2.set_xlim(-0.02, 1.02)
    ax2.set_xlabel('Predicted Probability', fontsize=12)
    ax2.set_ylabel('Count', fontsize=12)
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    return fig, metrics

This function produces a two-panel figure: the calibration curve with confidence bands on top, and the forecast distribution histogram on the bottom. The confidence bands are critical for interpretation — they show the range of observed frequencies that would be consistent with perfect calibration given the sample size in each bin.

12.4 Brier Score Decomposition Revisited

The Murphy Decomposition

In Chapter 9, we introduced the Brier score as a proper scoring rule for binary outcomes. Now we can connect it directly to calibration through the Murphy decomposition (Murphy, 1973), which splits the Brier score into three interpretable components:

$$\text{BS} = \underbrace{\frac{1}{N} \sum_{k=1}^{K} n_k (\bar{p}_k - \bar{o}_k)^2}_{\text{Reliability (REL)}} - \underbrace{\frac{1}{N} \sum_{k=1}^{K} n_k (\bar{o}_k - \bar{o})^2}_{\text{Resolution (RES)}} + \underbrace{\bar{o}(1 - \bar{o})}_{\text{Uncertainty (UNC)}}$$

In compact form:

$$\text{BS} = \text{REL} - \text{RES} + \text{UNC}$$

Let us understand each component.

Reliability (REL)

$$\text{REL} = \frac{1}{N} \sum_{k=1}^{K} n_k (\bar{p}_k - \bar{o}_k)^2$$

Reliability measures the weighted squared difference between predicted probabilities and observed frequencies. It is essentially the squared version of ECE. A perfectly calibrated forecaster has $\text{REL} = 0$.

Key insight: Reliability is the only component that measures calibration directly. It is always non-negative, and lower is better (since it contributes positively to the Brier score).

Resolution (RES)

$$\text{RES} = \frac{1}{N} \sum_{k=1}^{K} n_k (\bar{o}_k - \bar{o})^2$$

Resolution measures how much the observed frequency varies across bins. A forecaster whose bins all have the same observed frequency (regardless of the predicted probability) has zero resolution — their forecasts provide no information about which events are more likely.

Resolution is always non-negative, and higher is better (since it is subtracted from the Brier score). A forecaster with high resolution successfully separates events into groups with different outcome rates.

Uncertainty (UNC)

$$\text{UNC} = \bar{o}(1 - \bar{o})$$

Uncertainty depends only on the base rate of the events being predicted. It measures the inherent difficulty of the prediction problem. When the base rate is 50%, uncertainty is maximized at 0.25. When the base rate is close to 0% or 100%, uncertainty is low because the events are intrinsically easy to predict.

Uncertainty is a property of the dataset, not the forecaster. It cannot be changed by improving your forecasting skill.

Interpreting the Decomposition

The Brier score equation tells us:

$$\text{BS} = \text{REL} - \text{RES} + \text{UNC}$$

To minimize the Brier score, you want: - Low reliability (good calibration: your probabilities match reality) - High resolution (good discrimination: your forecasts separate events well) - Uncertainty is fixed, so it does not affect relative comparisons

This decomposition reveals a fundamental truth: a well-calibrated forecaster can still have a poor Brier score if they lack resolution. The forecaster who always predicts the base rate is perfectly calibrated (REL = 0) but has zero resolution (RES = 0), resulting in BS = UNC, which is the worst possible score for a calibrated forecaster.

Conversely, a forecaster with poor calibration (high REL) can still achieve a decent Brier score if their resolution is high enough to compensate. This is somewhat surprising — it means a forecaster who is "wrong in the right direction" (systematically biased but still discriminating) can outperform a calibrated but uninformative forecaster.

However, the key insight is that for any level of resolution, improving calibration always improves the Brier score. Calibration is a "free lunch" in the sense that you can always improve your score by recalibrating without losing any resolution.

Python Decomposition

import numpy as np

def murphy_decomposition(predictions, outcomes, n_bins=10):
    """
    Compute the Murphy decomposition of the Brier score.

    Returns
    -------
    dict with keys: 'brier_score', 'reliability', 'resolution', 'uncertainty',
                    'bin_means', 'bin_freqs', 'bin_counts'
    """
    predictions = np.array(predictions, dtype=float)
    outcomes = np.array(outcomes, dtype=float)
    N = len(predictions)
    base_rate = outcomes.mean()

    # Bin the predictions
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_means = []
    bin_freqs = []
    bin_counts = []

    for i in range(n_bins):
        if i < n_bins - 1:
            mask = (predictions >= bin_edges[i]) & (predictions < bin_edges[i + 1])
        else:
            mask = (predictions >= bin_edges[i]) & (predictions <= bin_edges[i + 1])

        count = mask.sum()
        bin_counts.append(count)

        if count > 0:
            bin_means.append(predictions[mask].mean())
            bin_freqs.append(outcomes[mask].mean())
        else:
            bin_means.append(0)
            bin_freqs.append(0)

    bin_means = np.array(bin_means)
    bin_freqs = np.array(bin_freqs)
    bin_counts = np.array(bin_counts)

    # Brier score
    brier_score = np.mean((predictions - outcomes) ** 2)

    # Reliability
    reliability = np.sum(bin_counts * (bin_means - bin_freqs) ** 2) / N

    # Resolution
    resolution = np.sum(bin_counts * (bin_freqs - base_rate) ** 2) / N

    # Uncertainty
    uncertainty = base_rate * (1 - base_rate)

    return {
        'brier_score': brier_score,
        'reliability': reliability,
        'resolution': resolution,
        'uncertainty': uncertainty,
        'base_rate': base_rate,
        'bin_means': bin_means,
        'bin_freqs': bin_freqs,
        'bin_counts': bin_counts,
        'check': reliability - resolution + uncertainty,  # should equal brier_score
    }

Note the check field: the sum $\text{REL} - \text{RES} + \text{UNC}$ should closely match the directly computed Brier score. Small discrepancies arise from binning (forecasts within a bin are approximated by the bin mean), but they should be negligible with a reasonable number of bins.

Visualization of the Decomposition

A powerful way to communicate the decomposition is a stacked bar chart showing the three components. Consider comparing two forecasters:

Component	Forecaster A	Forecaster B
Reliability	0.003	0.025
Resolution	0.060	0.085
Uncertainty	0.250	0.250
Brier Score	0.193	0.190

Forecaster B has a slightly better Brier score despite worse calibration, because their much higher resolution more than compensates. However, if Forecaster B were recalibrated (reducing REL to near zero), their Brier score would drop to approximately $0 - 0.085 + 0.250 = 0.165$, a substantial improvement.

12.5 Brier Skill Score

Definition

The Brier Skill Score (BSS) measures forecast performance relative to a reference forecast. It answers the question: "How much better is my forecasting system than some baseline?"

$$\text{BSS} = 1 - \frac{\text{BS}_{\text{forecast}}}{\text{BS}_{\text{reference}}}$$

where $\text{BS}_{\text{forecast}}$ is the Brier score of the forecast being evaluated and $\text{BS}_{\text{reference}}$ is the Brier score of the reference forecast.

Choosing the Reference

The most common reference forecast is the climatological baseline (also called the "constant forecaster" or "base rate forecaster"), which always predicts the overall base rate $\bar{o}$:

$$\text{BS}_{\text{climatological}} = \bar{o}(1 - \bar{o}) = \text{UNC}$$

With this reference:

$$\text{BSS} = 1 - \frac{\text{BS}}{\text{UNC}} = 1 - \frac{\text{REL} - \text{RES} + \text{UNC}}{\text{UNC}} = \frac{\text{RES} - \text{REL}}{\text{UNC}}$$

This elegant formula shows that BSS is the ratio of "useful signal" (resolution minus reliability) to "total uncertainty." A BSS of 0 means you are no better than always guessing the base rate. A BSS of 1 means perfect forecasting. Negative BSS means you are worse than the base rate — an embarrassing outcome that signals serious problems.

Interpreting BSS Values

BSS Range	Interpretation
BSS = 1.0	Perfect forecast
BSS > 0.3	Excellent skill (rare in most domains)
0.1 < BSS < 0.3	Good skill
0.0 < BSS < 0.1	Marginal skill
BSS = 0.0	No skill (equivalent to base rate)
BSS < 0.0	Negative skill (worse than base rate)

For context, weather forecasts for precipitation typically achieve BSS of 0.2-0.4 at short lead times. Expert geopolitical forecasters (like those in the Good Judgment Project) achieve BSS of around 0.1-0.2. Prediction markets for political events often achieve BSS of 0.15-0.35.

Other Reference Forecasts

While the climatological baseline is most common, other references can be informative:

Persistence forecast: The last observed outcome. Useful for time-series predictions.
Random forecast: Uniform random probabilities. Useful as a sanity check.
Simple model: A basic predictive model. Useful for measuring the marginal value of a more complex system.
Market consensus: The prediction market price. Useful for measuring whether an individual trader adds value beyond the market.

Python BSS Calculator

def brier_skill_score(predictions, outcomes, reference=None):
    """
    Compute the Brier Skill Score relative to a reference forecast.

    Parameters
    ----------
    predictions : array-like
        Predicted probabilities.
    outcomes : array-like
        Binary outcomes.
    reference : array-like or None
        Reference forecast probabilities. If None, uses the base rate.

    Returns
    -------
    dict with 'bss', 'bs_forecast', 'bs_reference'
    """
    predictions = np.array(predictions, dtype=float)
    outcomes = np.array(outcomes, dtype=float)

    bs_forecast = np.mean((predictions - outcomes) ** 2)

    if reference is None:
        base_rate = outcomes.mean()
        bs_reference = base_rate * (1 - base_rate)
    else:
        reference = np.array(reference, dtype=float)
        bs_reference = np.mean((reference - outcomes) ** 2)

    if bs_reference == 0:
        bss = 1.0 if bs_forecast == 0 else float('-inf')
    else:
        bss = 1 - bs_forecast / bs_reference

    return {
        'bss': bss,
        'bs_forecast': bs_forecast,
        'bs_reference': bs_reference,
    }

BSS for Comparing Platforms

One of the most practical applications of BSS is comparing prediction platforms. If Polymarket and Metaculus both make forecasts on the same set of events, we can compute BSS for each (against the base rate) to determine which platform produces more skillful forecasts.

We can also use one platform as the reference for another. For example, computing BSS for Metaculus with Polymarket as the reference tells us whether Metaculus adds value beyond what Polymarket already captures.

12.6 Sharpness and Resolution

Sharpness

Sharpness measures how extreme (confident) a forecaster's predictions are. A forecaster who always predicts near 50% is not sharp; a forecaster who frequently predicts near 0% or 100% is very sharp.

A simple measure of sharpness is the variance of the predicted probabilities:

$$\text{Sharpness} = \frac{1}{N} \sum_{i=1}^{N} (p_i - \bar{p})^2$$

An alternative, more interpretable measure is the mean deviation from 50%:

$$\text{Sharpness}_{\text{alt}} = \frac{1}{N} \sum_{i=1}^{N} |p_i - 0.5|$$

This ranges from 0 (all predictions are 50%) to 0.5 (all predictions are 0% or 100%).

Why Sharpness Matters

The forecasting community has a maxim attributed to Tilmann Gneiting:

"Maximize sharpness, subject to calibration."

This encapsulates the ideal forecasting strategy. Among all calibrated forecasters, the sharpest one is the best. Sharpness without calibration is dangerous (overconfident), but sharpness with calibration is maximally informative.

For prediction market traders, this has a direct financial implication. A trader who is well-calibrated but always predicts near 50% will rarely find profitable trading opportunities — the market is usually close to 50% for genuinely uncertain events. A trader who is well-calibrated AND sharp will identify many situations where they confidently disagree with the market, generating more trading opportunities and larger expected profits.

Resolution Revisited

Resolution, as defined in the Murphy decomposition, is:

$$\text{RES} = \frac{1}{N} \sum_{k=1}^{K} n_k (\bar{o}_k - \bar{o})^2$$

Resolution captures how much the observed outcomes vary across your forecast bins. If you assign 80% to events that actually occur 80% of the time and 20% to events that occur 20% of the time, you have high resolution — your forecasts genuinely differentiate between different classes of events.

Resolution is related to but distinct from sharpness. A forecaster can be sharp (issuing extreme probabilities) but have low resolution (events predicted at 90% happen no more frequently than events predicted at 10%). Sharpness is about the distribution of forecasts; resolution is about the relationship between forecasts and outcomes.

The Calibration-Sharpness Tradeoff

In practice, there is often a tension between calibration and sharpness. A simple way to improve calibration is to "shrink" your forecasts toward the base rate — predict 55% instead of 70%, or 40% instead of 20%. This typically reduces reliability error but also reduces sharpness and resolution.

The optimal tradeoff is achieved when you recalibrate without destroying information. Recalibration techniques (Section 12.8) are designed to do exactly this — improve calibration while preserving as much resolution and sharpness as possible.

Python Metrics

def forecast_quality_metrics(predictions, outcomes, n_bins=10):
    """
    Compute a comprehensive set of forecast quality metrics.

    Returns
    -------
    dict with all major forecast quality metrics
    """
    predictions = np.array(predictions, dtype=float)
    outcomes = np.array(outcomes, dtype=float)

    # Basic
    brier_score = np.mean((predictions - outcomes) ** 2)
    base_rate = outcomes.mean()

    # Sharpness measures
    sharpness_var = np.var(predictions)
    sharpness_mad = np.mean(np.abs(predictions - 0.5))

    # Murphy decomposition
    decomp = murphy_decomposition(predictions, outcomes, n_bins)

    # BSS
    bss_result = brier_skill_score(predictions, outcomes)

    # ECE and MCE
    cal_metrics = compute_calibration_metrics(predictions, outcomes, n_bins)

    return {
        'brier_score': brier_score,
        'base_rate': base_rate,
        'reliability': decomp['reliability'],
        'resolution': decomp['resolution'],
        'uncertainty': decomp['uncertainty'],
        'bss': bss_result['bss'],
        'ece': cal_metrics['ece'],
        'mce': cal_metrics['mce'],
        'sharpness_variance': sharpness_var,
        'sharpness_mad': sharpness_mad,
        'n_forecasts': len(predictions),
    }

12.7 Calibration of Prediction Markets

Are Prediction Markets Well-Calibrated?

This is one of the most important empirical questions in forecasting research, and the answer is nuanced: prediction markets are generally well-calibrated, but they exhibit systematic biases that create exploitable opportunities.

Empirical Evidence

Several large-scale studies have examined prediction market calibration:

Political prediction markets (PredictIt, Polymarket, Iowa Electronic Markets): - Events priced between 0.40 and 0.60 are generally well-calibrated. - Events priced above 0.80 tend to be slightly overpriced (overconfidence bias). - Events priced below 0.20 tend to be slightly underpriced (the favorite-longshot bias in reverse, or more precisely, the longshot end of the favorite-longshot bias). - Overall ECE values typically range from 0.02 to 0.06.

Forecasting tournaments (Good Judgment Project, Metaculus): - Aggregated crowd forecasts achieve ECE of 0.01 to 0.03 — among the best-calibrated sources available. - Individual forecasters range from ECE of 0.03 (superforecasters) to ECE of 0.15 (average participants). - Superforecasters tend to be slightly underconfident rather than overconfident, a distinguishing characteristic of the best human forecasters.

Sports betting markets: - Extremely well-calibrated for major events (NFL, Premier League) with ECE often below 0.02. - Less well-calibrated for minor events where liquidity is lower. - Exhibit the classic favorite-longshot bias: favorites are slightly underpriced, longshots are slightly overpriced.

Systematic Biases

Several systematic calibration biases have been documented in prediction markets:

1. Favorite-Longshot Bias. This is the most well-documented and robust finding across prediction markets, sports betting, and horse racing. Events with very high probabilities (favorites) tend to be underpriced (probability is slightly too low), while events with very low probabilities (longshots) tend to be overpriced (probability is slightly too high).

In terms of calibration: if you buy contracts priced at $0.95 (implied 95% probability), they pay off more than 95% of the time. If you buy contracts priced at $0.05 (implied 5% probability), they pay off less than 5% of the time.

We will explore the favorite-longshot bias in detail in Chapter 15.

2. Temporal Miscalibration. Markets tend to be less well-calibrated far in advance of event resolution. A contract trading at $0.60 one year before resolution has a wider calibration error than a contract trading at $0.60 one day before resolution. This makes intuitive sense — more information arrives closer to resolution, and the market has less time to correct errors.

3. Category-Dependent Calibration. Markets tend to be better calibrated for events they have experience with (recurring elections, regular sporting events) than for novel events (pandemic outcomes, geopolitical crises). This reflects the role of historical base rates in calibrating expectations.

4. Liquidity-Dependent Calibration. Markets with higher trading volume and more sophisticated participants tend to be better calibrated. Thin markets (few traders, low volume) can exhibit substantial miscalibration because there is not enough "wisdom of crowds" to correct individual errors.

Comparison: Polymarket vs. Metaculus vs. Polls

Let us sketch a comparison framework (with representative calibration profiles):

Feature	Polymarket	Metaculus	Polls
Mechanism	Financial market	Crowd forecast	Survey sampling
Incentive	Profit motive	Reputation/scoring	N/A
Typical ECE	0.03-0.05	0.01-0.03	0.05-0.10
Main bias	Favorite-longshot	Slight underconfidence	Systematic lean
Best domain	High-profile events	Science/tech	Elections
Worst domain	Low-liquidity markets	Short-term events	Non-election politics

These comparisons must be taken with appropriate caveats. The calibration of any platform depends on the specific events analyzed, the time period, and the methodology used for binning and evaluation.

Implications for Traders

If prediction markets are generally well-calibrated but exhibit systematic biases, what does this mean for traders?

Don't assume prices are wrong. On average, market prices are close to the "true" probability. The naive strategy of betting against every market price will lose money.
Look for systematic biases. The favorite-longshot bias and liquidity effects create consistent, exploitable patterns. A strategy that buys underpriced favorites and sells overpriced longshots can generate modest but reliable returns.
Exploit calibration gaps. If you can identify specific categories or time periods where the market is miscalibrated (e.g., low-liquidity markets for novel events), you have a potential edge.
Benchmark yourself against the market. The market's calibration is your benchmark. If your personal calibration is no better than the market, you are unlikely to profit from disagreeing with it.

12.8 Recalibration Techniques

Why Recalibrate?

Suppose you have a forecasting model (or your own human judgment) that has good resolution but poor calibration. Perhaps it tends to be overconfident, assigning 80% when the true probability is closer to 65%. Rather than redesigning the model, you can apply a recalibration function that maps the original forecast $p$ to a better-calibrated forecast $q = g(p)$.

Recalibration preserves the rank-ordering of forecasts (if $g$ is monotonically increasing) while adjusting the probability scale to better match observed frequencies. This is why we said calibration is a "free lunch" — recalibration improves reliability without losing resolution, at least in theory.

Method 1: Platt Scaling

Platt scaling fits a logistic regression model to map original forecasts to calibrated probabilities:

$$q = \sigma(a \cdot \text{logit}(p) + b)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function, $\text{logit}(p) = \ln(p/(1-p))$, and $a$ and $b$ are parameters learned from data.

Platt scaling has two parameters, making it robust with small datasets. It works best when the miscalibration has a simple parametric form (e.g., overall overconfidence or a constant bias on the log-odds scale).

When to use Platt scaling: - You have a modest amount of calibration data (50-500 forecasts). - The miscalibration pattern is approximately monotonic and smooth. - You want a simple, interpretable recalibration function.

Method 2: Isotonic Regression

Isotonic regression fits a non-parametric, monotonically increasing step function to the data. It finds the monotone function that minimizes the mean squared error between calibrated forecasts and outcomes.

Unlike Platt scaling, isotonic regression makes no parametric assumptions about the shape of the miscalibration. It can capture complex non-linear patterns (S-curves, local biases, etc.). However, it requires more data to fit reliably and can overfit with small datasets.

When to use isotonic regression: - You have a large amount of calibration data (500+ forecasts). - The miscalibration pattern is complex or non-monotonic in probability space. - You want maximum flexibility in the recalibration function.

Method 3: Beta Calibration

Beta calibration fits a parametric model based on the beta distribution:

$$q = \frac{p^a}{p^a + ((1-p)^b / c)}$$

where $a$, $b$, and $c$ are learned parameters. This three-parameter model is more flexible than Platt scaling but more constrained than isotonic regression, often striking a good balance.

Method 4: Histogram Binning

The simplest recalibration approach: divide the forecast range into bins, and replace each forecast with the observed frequency in its bin. This is the most direct approach but can produce discontinuities at bin boundaries and requires enough data per bin.

Python Recalibration Pipeline

from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
import numpy as np

class RecalibrationPipeline:
    """A pipeline for recalibrating probability forecasts."""

    def __init__(self, method='isotonic'):
        """
        Parameters
        ----------
        method : str
            'platt', 'isotonic', or 'histogram'
        """
        self.method = method
        self.model = None

    def fit(self, predictions, outcomes):
        """Fit the recalibration model."""
        predictions = np.array(predictions, dtype=float)
        outcomes = np.array(outcomes, dtype=float)

        if self.method == 'platt':
            # Platt scaling via logistic regression on logit(p)
            logits = np.log(np.clip(predictions, 1e-10, 1 - 1e-10) /
                          (1 - np.clip(predictions, 1e-10, 1 - 1e-10)))
            self.model = LogisticRegression(C=1e10, solver='lbfgs')
            self.model.fit(logits.reshape(-1, 1), outcomes)

        elif self.method == 'isotonic':
            self.model = IsotonicRegression(
                y_min=0, y_max=1, out_of_bounds='clip'
            )
            self.model.fit(predictions, outcomes)

        elif self.method == 'histogram':
            n_bins = 10
            bin_edges = np.linspace(0, 1, n_bins + 1)
            self.bin_edges = bin_edges
            self.bin_values = []
            for i in range(n_bins):
                if i < n_bins - 1:
                    mask = (predictions >= bin_edges[i]) & (predictions < bin_edges[i + 1])
                else:
                    mask = (predictions >= bin_edges[i]) & (predictions <= bin_edges[i + 1])
                if mask.sum() > 0:
                    self.bin_values.append(outcomes[mask].mean())
                else:
                    self.bin_values.append((bin_edges[i] + bin_edges[i + 1]) / 2)
            self.bin_values = np.array(self.bin_values)
            self.model = 'histogram'

        return self

    def transform(self, predictions):
        """Apply recalibration to new predictions."""
        predictions = np.array(predictions, dtype=float)

        if self.method == 'platt':
            logits = np.log(np.clip(predictions, 1e-10, 1 - 1e-10) /
                          (1 - np.clip(predictions, 1e-10, 1 - 1e-10)))
            return self.model.predict_proba(logits.reshape(-1, 1))[:, 1]

        elif self.method == 'isotonic':
            return self.model.transform(predictions)

        elif self.method == 'histogram':
            indices = np.digitize(predictions, self.bin_edges) - 1
            indices = np.clip(indices, 0, len(self.bin_values) - 1)
            return self.bin_values[indices]

    def fit_transform(self, predictions, outcomes):
        """Fit and transform in one step."""
        self.fit(predictions, outcomes)
        return self.transform(predictions)

Avoiding Overfitting in Recalibration

A critical warning: recalibration models can overfit. If you fit a recalibration model on the same data you use to evaluate it, you will overestimate the improvement. Always use one of:

Train/test split: Fit recalibration on one portion of the data, evaluate on another.
Cross-validation: Use k-fold cross-validation to get out-of-sample recalibrated predictions.
Temporal split: For time-series data, fit on historical data and evaluate on future data.

The following pattern uses cross-validation:

from sklearn.model_selection import KFold

def cross_validated_recalibration(predictions, outcomes, method='isotonic', n_folds=5):
    """
    Perform cross-validated recalibration to avoid overfitting.

    Returns recalibrated predictions (out-of-fold).
    """
    predictions = np.array(predictions, dtype=float)
    outcomes = np.array(outcomes, dtype=float)
    recalibrated = np.zeros_like(predictions)

    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

    for train_idx, test_idx in kf.split(predictions):
        pipeline = RecalibrationPipeline(method=method)
        pipeline.fit(predictions[train_idx], outcomes[train_idx])
        recalibrated[test_idx] = pipeline.transform(predictions[test_idx])

    return recalibrated

12.9 Calibration of Individual Traders

Why Personal Calibration Tracking Matters

As a prediction market trader, your profitability ultimately depends on the quality of your probabilistic judgments. Calibration tracking provides the most rigorous feedback loop for improving those judgments.

Without tracking, you are flying blind. You might feel like you have good intuition, but cognitive biases (hindsight bias, confirmation bias, motivated reasoning) will systematically distort your self-assessment. You might remember the time you "knew" a 30% event would happen (and it did) while conveniently forgetting the other three times you were equally confident and wrong.

Building a Personal Calibration Tracker

A basic calibration tracker needs to record:

The event description: What exactly are you forecasting?
Your predicted probability: Your honest assessment, ideally before seeing the market price.
The market price at the time: For comparison.
The date of the forecast: When did you make this prediction?
The resolution date and outcome: What actually happened?
Category tags: (politics, sports, technology, etc.) for analyzing calibration by domain.

Here is a minimal Python implementation:

import json
import os
from datetime import datetime

class CalibrationTracker:
    """Personal calibration tracking tool for prediction market traders."""

    def __init__(self, filepath='calibration_log.json'):
        self.filepath = filepath
        if os.path.exists(filepath):
            with open(filepath, 'r') as f:
                self.log = json.load(f)
        else:
            self.log = []

    def add_forecast(self, event, probability, market_price=None,
                     category='general', notes=''):
        """Record a new forecast."""
        entry = {
            'id': len(self.log) + 1,
            'event': event,
            'probability': probability,
            'market_price': market_price,
            'category': category,
            'notes': notes,
            'forecast_date': datetime.now().isoformat(),
            'outcome': None,
            'resolution_date': None,
        }
        self.log.append(entry)
        self._save()
        return entry['id']

    def resolve(self, event_id, outcome):
        """Record the outcome of a forecasted event."""
        for entry in self.log:
            if entry['id'] == event_id:
                entry['outcome'] = int(outcome)
                entry['resolution_date'] = datetime.now().isoformat()
                self._save()
                return True
        return False

    def get_resolved(self, category=None):
        """Get all resolved forecasts, optionally filtered by category."""
        resolved = [e for e in self.log if e['outcome'] is not None]
        if category:
            resolved = [e for e in resolved if e['category'] == category]
        return resolved

    def calibration_report(self, category=None, n_bins=5):
        """Generate a calibration report."""
        resolved = self.get_resolved(category)
        if len(resolved) < 10:
            return "Not enough resolved forecasts for a meaningful report."

        predictions = np.array([e['probability'] for e in resolved])
        outcomes = np.array([e['outcome'] for e in resolved])

        metrics = forecast_quality_metrics(predictions, outcomes, n_bins)

        report = []
        report.append(f"=== Calibration Report ===")
        report.append(f"Category: {category or 'All'}")
        report.append(f"Resolved forecasts: {len(resolved)}")
        report.append(f"Brier Score: {metrics['brier_score']:.4f}")
        report.append(f"BSS: {metrics['bss']:.4f}")
        report.append(f"ECE: {metrics['ece']:.4f}")
        report.append(f"MCE: {metrics['mce']:.4f}")
        report.append(f"Reliability: {metrics['reliability']:.4f}")
        report.append(f"Resolution: {metrics['resolution']:.4f}")
        report.append(f"Sharpness (MAD): {metrics['sharpness_mad']:.4f}")
        report.append(f"Base rate: {metrics['base_rate']:.4f}")

        return '\n'.join(report)

    def _save(self):
        with open(self.filepath, 'w') as f:
            json.dump(self.log, f, indent=2)

Learning from Calibration Feedback

Once you have accumulated enough resolved forecasts (at least 50-100), patterns will emerge. Here is how to interpret them:

If you are overconfident (ECE is high, reliability diagram below diagonal for high probabilities): - Practice assigning lower probabilities to events you feel "sure" about. - Before finalizing a forecast, explicitly consider reasons the event might not occur. - Use the "outside view" — what is the base rate for this type of event? - Try reducing all extreme forecasts by 5-10 percentage points and see if calibration improves.

If you are underconfident (reliability diagram above diagonal for high probabilities): - You are hedging too much. Practice committing to your beliefs. - When you believe an event is very likely, push your probability toward the extremes. - Underconfidence often comes from excessive anchoring to the base rate. Force yourself to update more aggressively on new evidence.

If you are miscalibrated in one category but not others: - You may have domain-specific blind spots. Are you overconfident about politics but well-calibrated for sports? - Consider spending more time researching your weak categories, or avoid trading in them until your calibration improves.

If your resolution is low: - Your forecasts are not differentiating between events that happen and events that do not. - Focus on identifying the most informative signals for each event type. - Try making more extreme predictions — you may be hedging away useful information.

Deliberate Practice for Calibration

Improving calibration is a skill that responds to deliberate practice. Research on superforecasters has identified several effective practices:

Granularity. Practice using fine-grained probabilities (not just 25/50/75). The act of distinguishing between 65% and 70% forces more careful reasoning.
Feedback speed. Seek events that resolve quickly (days to weeks rather than months to years). Faster feedback accelerates learning.
Base rate awareness. Before forecasting any event, explicitly estimate the base rate for that category of events. Then adjust from the base rate based on specific evidence.
Pre-mortem analysis. Before committing to a forecast, imagine the event has already resolved in the opposite direction of your expectation. What would explain that outcome? This counteracts confirmation bias.
Regular review. Review your calibration report weekly or monthly. Look for patterns, celebrate improvements, and identify persistent weaknesses.

12.10 Advanced: Calibration in Multi-Outcome Settings

Beyond Binary Events

So far we have focused on binary events (yes/no, happens/does not happen). But many prediction market questions have more than two outcomes:

"Who will win the 2028 presidential election?" (multiple candidates)
"What will the Fed funds rate be in December?" (range of values)
"When will AGI be achieved?" (probability distribution over time)

Calibration concepts extend naturally to these settings, though the mathematics becomes more involved.

Calibration for Categorical Outcomes

For a $C$-class prediction problem (e.g., predicting which of $C$ candidates wins), the forecaster provides a probability vector $\mathbf{p} = (p_1, p_2, \ldots, p_C)$ for each event, where $\sum_{c=1}^{C} p_c = 1$.

The calibration condition extends as follows: among all events where you assign probability $p_c$ to class $c$, class $c$ should occur with frequency $p_c$. This is assessed class-by-class, treating each class as a binary "one vs. rest" prediction.

Multi-class ECE:

$$\text{ECE}_{\text{multi}} = \frac{1}{C} \sum_{c=1}^{C} \text{ECE}_c$$

where $\text{ECE}_c$ is the binary ECE computed for class $c$ using the marginal probability $p_c$ and the indicator $\mathbb{1}[\text{outcome} = c]$.

Calibration of Probability Distributions

For continuous outcomes (e.g., "What will the S&P 500 be on Dec 31?"), the forecaster provides a full probability distribution $F(x)$. Calibration is assessed using the Probability Integral Transform (PIT):

If the forecaster is well-calibrated, the values $u_i = F(x_i)$ (where $x_i$ is the realized outcome) should be uniformly distributed on $[0, 1]$. A histogram of PIT values that deviates from uniformity indicates miscalibration.

Specific patterns in the PIT histogram are diagnostic: - U-shaped: Forecaster is overconfident (distribution is too narrow). - Inverse-U-shaped: Forecaster is underconfident (distribution is too wide). - Skewed: Forecaster has a systematic location bias.

CRPS Calibration

The Continuous Ranked Probability Score (CRPS) is the analog of the Brier score for distributional forecasts:

$$\text{CRPS}(F, x) = \int_{-\infty}^{\infty} (F(y) - \mathbb{1}[y \geq x])^2 \, dy$$

Like the Brier score, CRPS can be decomposed into reliability and resolution components. The reliability component measures distributional calibration — whether the CDF probabilities match observed frequencies across all threshold values.

The CRPS decomposition is:

$$\text{CRPS} = \text{Reliability}_{\text{CRPS}} - \text{Resolution}_{\text{CRPS}} + \text{Uncertainty}_{\text{CRPS}}$$

The mathematics are more involved than the binary case, but the interpretation is analogous: you want low reliability (good calibration) and high resolution (informative forecasts).

Python Multi-Outcome Calibration

def multiclass_calibration(probability_matrix, outcomes, n_bins=10):
    """
    Compute calibration metrics for multi-class predictions.

    Parameters
    ----------
    probability_matrix : array of shape (n_samples, n_classes)
        Predicted probabilities for each class.
    outcomes : array of shape (n_samples,)
        Integer class labels (0, 1, ..., n_classes-1).
    n_bins : int
        Number of bins per class.

    Returns
    -------
    dict with per-class ECE, MCE, and overall multi-class ECE.
    """
    probability_matrix = np.array(probability_matrix, dtype=float)
    outcomes = np.array(outcomes, dtype=int)
    n_classes = probability_matrix.shape[1]

    class_ece = []
    class_mce = []

    for c in range(n_classes):
        binary_pred = probability_matrix[:, c]
        binary_outcome = (outcomes == c).astype(float)
        metrics = compute_calibration_metrics(binary_pred, binary_outcome, n_bins)
        class_ece.append(metrics['ece'])
        class_mce.append(metrics['mce'])

    return {
        'class_ece': class_ece,
        'class_mce': class_mce,
        'mean_ece': np.mean(class_ece),
        'max_mce': np.max(class_mce),
    }

Confidence-Based Multi-Class Calibration

An alternative to per-class calibration is confidence calibration, which bins predictions by their maximum probability (confidence):

$$\text{ECE}_{\text{conf}} = \sum_{k=1}^{K} \frac{n_k}{N} | \text{acc}_k - \text{conf}_k |$$

where $\text{acc}_k$ is the accuracy (fraction correctly predicted) in bin $k$, and $\text{conf}_k$ is the mean confidence (maximum predicted probability) in bin $k$.

This measures whether the model's confidence level matches its accuracy level. A model that says "I'm 90% confident" and is correct 90% of the time is well-calibrated in this sense.

12.11 Practical Tools and Workflows

End-to-End Calibration Analysis Workflow

Here is a complete workflow for performing calibration analysis on a prediction market or forecasting system:

Step 1: Data Collection

Gather matched pairs of (prediction, outcome). For prediction markets, this means: - Extracting historical market prices at specific time points (e.g., closing price on each day). - Recording the eventual resolution of each market. - Ensuring proper alignment (the price snapshot should precede the resolution).

def prepare_calibration_data(market_data):
    """
    Prepare raw market data for calibration analysis.

    Parameters
    ----------
    market_data : list of dict
        Each dict has 'price' (float), 'resolved' (bool), 'outcome' (0 or 1).

    Returns
    -------
    predictions, outcomes : numpy arrays
    """
    resolved = [m for m in market_data if m['resolved']]
    predictions = np.array([m['price'] for m in resolved])
    outcomes = np.array([m['outcome'] for m in resolved])
    return predictions, outcomes

Step 2: Compute Metrics

Run the full suite of calibration and forecast quality metrics.

predictions, outcomes = prepare_calibration_data(market_data)
metrics = forecast_quality_metrics(predictions, outcomes, n_bins=10)
decomp = murphy_decomposition(predictions, outcomes, n_bins=10)

Step 3: Generate Visualizations

Produce reliability diagrams, decomposition bar charts, and forecast distribution histograms.

Step 4: Diagnose Problems

Interpret the results: - Is ECE acceptable (< 0.05)? - Is there a systematic pattern in the reliability diagram (overconfidence, underconfidence)? - Is resolution high enough to be useful? - Are there specific probability ranges with poor calibration (high MCE)?

Step 5: Recalibrate (if needed)

If calibration is poor, apply recalibration using cross-validated isotonic regression or Platt scaling.

Step 6: Monitor Over Time

Track calibration metrics over rolling windows to detect drift. A forecasting system that was well-calibrated in 2024 might not be well-calibrated in 2025 if the underlying environment has changed.

Automated Calibration Reports

For systematic trading, you want automated reports that run periodically:

def generate_calibration_report(predictions, outcomes, categories=None,
                                  output_dir='reports'):
    """
    Generate a comprehensive calibration report with metrics and plots.

    Parameters
    ----------
    predictions : array-like
        Predicted probabilities.
    outcomes : array-like
        Binary outcomes.
    categories : array-like or None
        Category labels for each forecast.
    output_dir : str
        Directory to save report files.
    """
    os.makedirs(output_dir, exist_ok=True)

    # Overall metrics
    overall = forecast_quality_metrics(predictions, outcomes)
    decomp = murphy_decomposition(predictions, outcomes)

    # Write summary
    report_lines = [
        "CALIBRATION ANALYSIS REPORT",
        "=" * 50,
        f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}",
        f"Total forecasts: {len(predictions)}",
        "",
        "OVERALL METRICS",
        "-" * 30,
        f"Brier Score:  {overall['brier_score']:.4f}",
        f"BSS:          {overall['bss']:.4f}",
        f"ECE:          {overall['ece']:.4f}",
        f"MCE:          {overall['mce']:.4f}",
        f"Reliability:  {decomp['reliability']:.6f}",
        f"Resolution:   {decomp['resolution']:.6f}",
        f"Uncertainty:  {decomp['uncertainty']:.6f}",
        f"Sharpness:    {overall['sharpness_mad']:.4f}",
        f"Base rate:    {overall['base_rate']:.4f}",
    ]

    # Category breakdown
    if categories is not None:
        categories = np.array(categories)
        unique_cats = np.unique(categories)
        report_lines.append("")
        report_lines.append("CATEGORY BREAKDOWN")
        report_lines.append("-" * 30)

        for cat in unique_cats:
            mask = categories == cat
            if mask.sum() >= 10:
                cat_metrics = forecast_quality_metrics(
                    np.array(predictions)[mask], np.array(outcomes)[mask]
                )
                report_lines.append(
                    f"  {cat}: n={mask.sum()}, ECE={cat_metrics['ece']:.4f}, "
                    f"BSS={cat_metrics['bss']:.4f}"
                )

    report_text = '\n'.join(report_lines)

    with open(os.path.join(output_dir, 'calibration_report.txt'), 'w') as f:
        f.write(report_text)

    return report_text

Integrating Calibration Checks into Trading Systems

For algorithmic traders, calibration monitoring should be integrated into the trading pipeline:

Pre-trade calibration check. Before placing a trade based on a model's probability estimate, check whether the model is currently well-calibrated in the relevant probability range and category.
Real-time calibration monitoring. Maintain a rolling window of recent resolved forecasts and compute ECE in real time. If calibration degrades beyond a threshold, reduce position sizes or halt trading.
Post-trade analysis. After each batch of trades resolves, update calibration metrics and generate reports. Look for emerging biases.
Recalibration triggers. Automatically recalibrate the model when ECE exceeds a threshold (e.g., 0.05), using the most recent resolved forecasts as calibration data.

class CalibrationMonitor:
    """Real-time calibration monitoring for trading systems."""

    def __init__(self, window_size=200, ece_threshold=0.05):
        self.window_size = window_size
        self.ece_threshold = ece_threshold
        self.predictions = []
        self.outcomes = []

    def update(self, prediction, outcome):
        """Add a new resolved forecast."""
        self.predictions.append(prediction)
        self.outcomes.append(outcome)

        # Keep only the most recent window
        if len(self.predictions) > self.window_size:
            self.predictions = self.predictions[-self.window_size:]
            self.outcomes = self.outcomes[-self.window_size:]

    def check_calibration(self):
        """Check if calibration is within acceptable bounds."""
        if len(self.predictions) < 50:
            return {'status': 'insufficient_data', 'n': len(self.predictions)}

        metrics = compute_calibration_metrics(
            self.predictions, self.outcomes, n_bins=5
        )

        status = 'ok' if metrics['ece'] < self.ece_threshold else 'warning'

        return {
            'status': status,
            'ece': metrics['ece'],
            'mce': metrics['mce'],
            'n': len(self.predictions),
        }

    def should_recalibrate(self):
        """Determine if recalibration is recommended."""
        check = self.check_calibration()
        return check.get('status') == 'warning'

12.12 Chapter Summary

This chapter has provided a comprehensive treatment of calibration — the most fundamental measure of forecast quality. Let us recapitulate the key ideas.

Core Concepts

Calibration means your stated probabilities match observed frequencies. When you say 70%, events should happen about 70% of the time. This is the cornerstone of probabilistic integrity.

Expected Calibration Error (ECE) quantifies calibration as the weighted average absolute gap between predicted probabilities and observed frequencies across bins. Maximum Calibration Error (MCE) captures the worst-case gap.

Reliability diagrams visualize calibration by plotting observed frequency against predicted probability. Points above the diagonal indicate underconfidence; points below indicate overconfidence.

Brier Score Decomposition

The Murphy decomposition splits the Brier score into three components:

$$\text{BS} = \text{REL} - \text{RES} + \text{UNC}$$

Reliability (REL): Calibration error (lower is better).
Resolution (RES): Discriminative power (higher is better).
Uncertainty (UNC): Inherent difficulty (fixed for a given dataset).

The Brier Skill Score measures improvement over a baseline forecast, with the elegant formula $\text{BSS} = (\text{RES} - \text{REL}) / \text{UNC}$.

Sharpness and Resolution

The ideal forecaster is sharp (uses extreme probabilities) and well-calibrated (those extreme probabilities are accurate). The maxim "maximize sharpness, subject to calibration" captures the optimal strategy.

Prediction Market Calibration

Prediction markets are generally well-calibrated (ECE of 0.02-0.06) but exhibit systematic biases, notably the favorite-longshot bias. Markets with higher liquidity and more participants tend to be better calibrated. Calibration quality varies by category and time horizon.

Recalibration

When a forecasting system has good resolution but poor calibration, recalibration techniques (Platt scaling, isotonic regression, histogram binning) can improve calibration without losing resolution. Always use cross-validation to avoid overfitting the recalibration model.

Personal Improvement

Building a personal calibration tracker is the most valuable investment you can make as a prediction market trader. Regular review of calibration reports, combined with deliberate practice techniques, can substantially improve your probabilistic reasoning over time.

Completing Part II

This chapter concludes Part II of the book: Market Microstructure and Pricing. Over the past several chapters, you have learned how prediction markets work mechanically (order books, market makers), how prices relate to probabilities, how to measure forecast quality (scoring rules, calibration), and what systematic biases exist in real markets.

You now have the foundational knowledge needed to move from understanding markets to actively trading them.

What's Next

Part III: Trading Strategies begins with Chapter 13, where we transition from measurement to action. Armed with your understanding of calibration, scoring rules, and market microstructure, you are ready to develop systematic strategies for finding and exploiting mispricings in prediction markets.

In Chapter 13, we will explore portfolio construction for prediction markets — how to size positions, manage risk across multiple correlated markets, and build a diversified portfolio of probability bets. You will learn that trading prediction markets successfully is not just about being right; it is about being right in the right amounts, at the right times, with appropriate risk management.

The transition from Part II to Part III marks a shift from "how do markets work?" to "how do I make money in markets?" The calibration tools you built in this chapter will be your constant companions — they are the compass that tells you whether your trading edge is real or imagined.

Key Equations Reference

Concept	Equation
Perfect calibration	$P(\text{outcome} \mid \text{forecast} = p) = p$
ECE	$\sum_{k} \frac{n_k}{N} \\|\bar{p}_k - \bar{o}_k\\|$
MCE	$\max_{k} \\|\bar{p}_k - \bar{o}_k\\|$
Murphy decomposition	$\text{BS} = \text{REL} - \text{RES} + \text{UNC}$
Reliability	$\frac{1}{N} \sum_{k} n_k (\bar{p}_k - \bar{o}_k)^2$
Resolution	$\frac{1}{N} \sum_{k} n_k (\bar{o}_k - \bar{o})^2$
Uncertainty	$\bar{o}(1 - \bar{o})$
BSS	$1 - \text{BS}/\text{BS}_{\text{ref}}$
BSS (decomposed)	$(\text{RES} - \text{REL})/\text{UNC}$
Sharpness	$\frac{1}{N} \sum_i \\|p_i - 0.5\\|$
Platt scaling	$q = \sigma(a \cdot \text{logit}(p) + b)$

Next chapter: Chapter 13 — Portfolio Construction for Prediction Markets

In This Chapter

Chapter 12: Calibration — Measuring Forecast Quality

12.1 What Is Calibration?

The Core Definition

A Concrete Example

Why Calibration Matters for Prediction Markets

Calibration vs. Other Forecast Qualities

12.2 Measuring Calibration

The Binning Approach

Expected Calibration Error (ECE)

Maximum Calibration Error (MCE)

Choosing the Number of Bins

Python Implementation

Confidence Intervals for Calibration

12.3 Reliability Diagrams

What Is a Reliability Diagram?

Reading Reliability Diagrams

Python Reliability Diagram Generator

12.4 Brier Score Decomposition Revisited

The Murphy Decomposition

Reliability (REL)

Resolution (RES)

Uncertainty (UNC)

Interpreting the Decomposition

Python Decomposition

Visualization of the Decomposition

12.5 Brier Skill Score

Definition

Choosing the Reference

Interpreting BSS Values

Other Reference Forecasts

Python BSS Calculator

BSS for Comparing Platforms

12.6 Sharpness and Resolution

Sharpness

Why Sharpness Matters

Resolution Revisited

The Calibration-Sharpness Tradeoff

Python Metrics

12.7 Calibration of Prediction Markets

Are Prediction Markets Well-Calibrated?

Empirical Evidence

Systematic Biases

Comparison: Polymarket vs. Metaculus vs. Polls

Implications for Traders

12.8 Recalibration Techniques

Why Recalibrate?

Method 1: Platt Scaling

Method 2: Isotonic Regression

Method 3: Beta Calibration

Method 4: Histogram Binning

Python Recalibration Pipeline

Avoiding Overfitting in Recalibration

12.9 Calibration of Individual Traders

Why Personal Calibration Tracking Matters

Building a Personal Calibration Tracker

Learning from Calibration Feedback

Deliberate Practice for Calibration

12.10 Advanced: Calibration in Multi-Outcome Settings

Beyond Binary Events

Calibration for Categorical Outcomes

Calibration of Probability Distributions

CRPS Calibration

Python Multi-Outcome Calibration

Confidence-Based Multi-Class Calibration

12.11 Practical Tools and Workflows

End-to-End Calibration Analysis Workflow

Automated Calibration Reports

Integrating Calibration Checks into Trading Systems

12.12 Chapter Summary

Core Concepts

Brier Score Decomposition

Sharpness and Resolution

Prediction Market Calibration

Recalibration

Personal Improvement

Completing Part II

What's Next

Key Equations Reference

Related Reading