Case Study 1: Evaluating Forecaster Performance in a Prediction Tournament

Background

The Global Forecasting Challenge (GFC) is an annual prediction tournament where 50 forecasters make probabilistic predictions on 100 binary events. Events range from geopolitics ("Will Country X hold elections in 2025?") to economics ("Will inflation exceed 3%?") to technology ("Will Company Y release Product Z?"). Each event has a known resolution date and a clear resolution criterion.

At the end of the tournament, organizers need to:

  1. Rank all 50 forecasters by accuracy
  2. Identify the most skilled forecasters (distinguishing skill from luck)
  3. Diagnose strengths and weaknesses (calibration vs. resolution)
  4. Evaluate whether different scoring rules produce different rankings
  5. Design a fair award structure

This case study walks through the complete analysis using Python.

The Data

We will simulate realistic tournament data. The simulation includes:

  • 50 forecasters with varying skill levels
  • 100 binary events with true probabilities drawn from a Beta(2, 2) distribution (so mostly between 20% and 80%)
  • Forecasters with different strengths: some are well-calibrated but lack resolution, others have high resolution but poor calibration, and a few are excellent at both

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)

# --- Simulation Parameters ---
N_FORECASTERS = 50
N_QUESTIONS = 100

# Generate true probabilities for each question (unknown to forecasters)
true_probs = np.random.beta(2, 2, size=N_QUESTIONS)  # Most between 0.2-0.8

# Generate outcomes based on true probabilities
outcomes = (np.random.random(N_QUESTIONS) < true_probs).astype(int)
base_rate = outcomes.mean()
print(f"Base rate: {base_rate:.2f} ({outcomes.sum()} events occurred out of {N_QUESTIONS})")


def simulate_forecaster(true_probs, outcomes, skill_level, calibration_bias=0.0,
                        resolution_factor=1.0):
    """
    Simulate a forecaster with given characteristics.

    Parameters
    ----------
    true_probs : np.ndarray
        The true underlying probabilities.
    outcomes : np.ndarray
        Realized outcomes (not used directly; accepted for interface consistency).
    skill_level : float
        How much signal the forecaster has (0 = none, 1 = perfect).
    calibration_bias : float
        Systematic bias toward overconfidence (>0) or underconfidence (<0).
    resolution_factor : float
        How much the forecaster differentiates between questions (0 = constant, 1 = normal).
    """
    # Start with noisy perception of true probabilities
    noise = np.random.normal(0, 0.3 * (1 - skill_level), size=len(true_probs))
    forecasts = true_probs * skill_level + (1 - skill_level) * 0.5 + noise

    # Apply resolution factor (shrink toward 0.5)
    forecasts = 0.5 + resolution_factor * (forecasts - 0.5)

    # Apply calibration bias
    forecasts = forecasts + calibration_bias

    # Clip to valid range
    forecasts = np.clip(forecasts, 0.01, 0.99)

    return forecasts


# Create diverse forecasters
forecaster_names = [f"Forecaster_{i+1:02d}" for i in range(N_FORECASTERS)]
all_forecasts = {}

for i in range(N_FORECASTERS):
    # Assign random characteristics
    if i < 5:  # Top tier: high skill
        skill = np.random.uniform(0.6, 0.8)
        bias = np.random.normal(0, 0.02)
        resolution = np.random.uniform(0.8, 1.2)
    elif i < 15:  # Good forecasters
        skill = np.random.uniform(0.4, 0.6)
        bias = np.random.normal(0, 0.05)
        resolution = np.random.uniform(0.6, 1.0)
    elif i < 35:  # Average forecasters
        skill = np.random.uniform(0.2, 0.4)
        bias = np.random.normal(0, 0.08)
        resolution = np.random.uniform(0.4, 0.8)
    else:  # Below average
        skill = np.random.uniform(0.0, 0.2)
        bias = np.random.normal(0, 0.12)
        resolution = np.random.uniform(0.2, 0.5)

    all_forecasts[forecaster_names[i]] = simulate_forecaster(
        true_probs, outcomes, skill, bias, resolution
    )

print(f"Simulated {N_FORECASTERS} forecasters on {N_QUESTIONS} questions")

Step 1: Computing Scores

We compute three scoring rules for every forecaster on every question.

def brier_score(p, y):
    """Brier score: (p - y)^2. Lower is better."""
    return (p - y) ** 2

def log_score(p, y, eps=1e-10):
    """Log score: y*ln(p) + (1-y)*ln(1-p). Higher (less negative) is better."""
    p = np.clip(p, eps, 1 - eps)
    return y * np.log(p) + (1 - y) * np.log(1 - p)

def spherical_score(p, y):
    """Spherical score. Higher is better."""
    norm = np.sqrt(p**2 + (1-p)**2)
    if y == 1:
        return p / norm
    else:
        return (1 - p) / norm

# Vectorized versions
def mean_brier(forecasts, outcomes):
    return np.mean((forecasts - outcomes) ** 2)

def mean_log(forecasts, outcomes, eps=1e-10):
    f = np.clip(forecasts, eps, 1 - eps)
    return np.mean(outcomes * np.log(f) + (1 - outcomes) * np.log(1 - f))

def mean_spherical(forecasts, outcomes):
    norms = np.sqrt(forecasts**2 + (1 - forecasts)**2)
    scores = np.where(outcomes == 1, forecasts / norms, (1 - forecasts) / norms)
    return np.mean(scores)


# Compute all scores
results = {}
for name, forecasts in all_forecasts.items():
    results[name] = {
        'brier': mean_brier(forecasts, outcomes),
        'log': mean_log(forecasts, outcomes),
        'spherical': mean_spherical(forecasts, outcomes),
        'forecasts': forecasts,
    }

# Display top 10 by Brier score
print("\n--- Top 10 by Brier Score (lower is better) ---")
sorted_brier = sorted(results.items(), key=lambda x: x[1]['brier'])
for rank, (name, scores) in enumerate(sorted_brier[:10], 1):
    print(f"  {rank}. {name}: Brier={scores['brier']:.4f}  "
          f"Log={scores['log']:.4f}  Spherical={scores['spherical']:.4f}")

# Display top 10 by Log score
print("\n--- Top 10 by Log Score (higher is better) ---")
sorted_log = sorted(results.items(), key=lambda x: x[1]['log'], reverse=True)
for rank, (name, scores) in enumerate(sorted_log[:10], 1):
    print(f"  {rank}. {name}: Log={scores['log']:.4f}  "
          f"Brier={scores['brier']:.4f}  Spherical={scores['spherical']:.4f}")

Step 2: Comparing Rankings Across Scoring Rules

Do different scoring rules produce different rankings? Let us find out.

# Compute ranks for each scoring rule
names = list(results.keys())
brier_ranks = {name: rank for rank, (name, _) in
               enumerate(sorted(results.items(), key=lambda x: x[1]['brier']), 1)}
log_ranks = {name: rank for rank, (name, _) in
             enumerate(sorted(results.items(), key=lambda x: x[1]['log'], reverse=True), 1)}
spherical_ranks = {name: rank for rank, (name, _) in
                   enumerate(sorted(results.items(), key=lambda x: x[1]['spherical'],
                                    reverse=True), 1)}

# Compute rank correlations
from scipy.stats import spearmanr

brier_r = [brier_ranks[n] for n in names]
log_r = [log_ranks[n] for n in names]
sph_r = [spherical_ranks[n] for n in names]

corr_bl, _ = spearmanr(brier_r, log_r)
corr_bs, _ = spearmanr(brier_r, sph_r)
corr_ls, _ = spearmanr(log_r, sph_r)

print(f"\nSpearman Rank Correlations:")
print(f"  Brier vs Log:       {corr_bl:.4f}")
print(f"  Brier vs Spherical: {corr_bs:.4f}")
print(f"  Log vs Spherical:   {corr_ls:.4f}")

# Identify biggest rank disagreements
print("\n--- Largest Rank Disagreements (Brier vs Log) ---")
disagreements = [(name, brier_ranks[name], log_ranks[name],
                  abs(brier_ranks[name] - log_ranks[name]))
                 for name in names]
disagreements.sort(key=lambda x: x[3], reverse=True)

for name, br, lr, diff in disagreements[:5]:
    print(f"  {name}: Brier rank={br}, Log rank={lr}, Difference={diff}")

# Visualization: rank comparison scatter plot
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

axes[0].scatter(brier_r, log_r, alpha=0.6)
axes[0].plot([1, 50], [1, 50], 'k--', alpha=0.3)
axes[0].set_xlabel('Brier Rank')
axes[0].set_ylabel('Log Rank')
axes[0].set_title(f'Brier vs Log (r={corr_bl:.3f})')

axes[1].scatter(brier_r, sph_r, alpha=0.6)
axes[1].plot([1, 50], [1, 50], 'k--', alpha=0.3)
axes[1].set_xlabel('Brier Rank')
axes[1].set_ylabel('Spherical Rank')
axes[1].set_title(f'Brier vs Spherical (r={corr_bs:.3f})')

axes[2].scatter(log_r, sph_r, alpha=0.6)
axes[2].plot([1, 50], [1, 50], 'k--', alpha=0.3)
axes[2].set_xlabel('Log Rank')
axes[2].set_ylabel('Spherical Rank')
axes[2].set_title(f'Log vs Spherical (r={corr_ls:.3f})')

plt.tight_layout()
plt.savefig('rank_comparison.png', dpi=150)
plt.show()

Step 3: Brier Score Decomposition

We now decompose each forecaster's Brier score using the Murphy decomposition: Brier = Calibration - Resolution + Uncertainty. A lower calibration (reliability) term and a higher resolution term are better; the uncertainty term depends only on the outcomes, not on the forecaster.

def brier_decomposition(forecasts, outcomes, n_bins=10):
    """
    Decompose the Brier score into calibration, resolution, and uncertainty.

    Returns
    -------
    dict with keys: 'brier', 'calibration', 'resolution', 'uncertainty',
                    'bin_forecasts', 'bin_outcomes', 'bin_counts'
    """
    N = len(forecasts)
    base_rate = outcomes.mean()

    # Bin forecasts
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_indices = np.digitize(forecasts, bin_edges) - 1
    bin_indices = np.clip(bin_indices, 0, n_bins - 1)

    calibration = 0.0
    resolution = 0.0
    bin_forecasts_avg = []
    bin_outcomes_avg = []
    bin_counts = []

    for k in range(n_bins):
        mask = bin_indices == k
        n_k = mask.sum()
        if n_k == 0:
            bin_forecasts_avg.append(np.nan)
            bin_outcomes_avg.append(np.nan)
            bin_counts.append(0)
            continue

        p_bar_k = forecasts[mask].mean()
        y_bar_k = outcomes[mask].mean()

        calibration += n_k * (p_bar_k - y_bar_k) ** 2
        resolution += n_k * (y_bar_k - base_rate) ** 2

        bin_forecasts_avg.append(p_bar_k)
        bin_outcomes_avg.append(y_bar_k)
        bin_counts.append(n_k)

    calibration /= N
    resolution /= N
    uncertainty = base_rate * (1 - base_rate)
    brier = np.mean((forecasts - outcomes) ** 2)

    return {
        'brier': brier,
        'calibration': calibration,
        'resolution': resolution,
        'uncertainty': uncertainty,
        'reconstructed': calibration - resolution + uncertainty,
        'bin_forecasts': bin_forecasts_avg,
        'bin_outcomes': bin_outcomes_avg,
        'bin_counts': bin_counts,
    }


# Decompose for all forecasters
decompositions = {}
for name, forecasts in all_forecasts.items():
    decompositions[name] = brier_decomposition(forecasts, outcomes)

# Display decomposition for top 5 and bottom 5
print("--- Brier Decomposition: Top 5 Forecasters ---")
print(f"{'Name':<18} {'Brier':>8} {'Calib':>8} {'Resol':>8} {'Uncert':>8}")
for name, _ in sorted_brier[:5]:
    d = decompositions[name]
    print(f"{name:<18} {d['brier']:8.4f} {d['calibration']:8.4f} "
          f"{d['resolution']:8.4f} {d['uncertainty']:8.4f}")

print("\n--- Brier Decomposition: Bottom 5 Forecasters ---")
print(f"{'Name':<18} {'Brier':>8} {'Calib':>8} {'Resol':>8} {'Uncert':>8}")
for name, _ in sorted_brier[-5:]:
    d = decompositions[name]
    print(f"{name:<18} {d['brier']:8.4f} {d['calibration']:8.4f} "
          f"{d['resolution']:8.4f} {d['uncertainty']:8.4f}")

Step 4: Calibration Plots

Visualize how well each forecaster's stated probabilities match reality.

def plot_calibration(name, decomp, ax):
    """Plot calibration diagram for a single forecaster."""
    bin_f = [x for x in decomp['bin_forecasts'] if not np.isnan(x)]
    bin_o = [x for x in decomp['bin_outcomes'] if not np.isnan(x)]
    bin_c = [c for c, f in zip(decomp['bin_counts'], decomp['bin_forecasts'])
             if not np.isnan(f)]

    # Size proportional to count
    sizes = [max(20, c * 3) for c in bin_c]

    ax.scatter(bin_f, bin_o, s=sizes, alpha=0.7, zorder=3)
    ax.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Perfect calibration')
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_xlabel('Forecast Probability')
    ax.set_ylabel('Observed Frequency')
    ax.set_title(f'{name}\nCal={decomp["calibration"]:.4f} '
                 f'Res={decomp["resolution"]:.4f}')
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)


# Plot calibration for 6 diverse forecasters
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Pick top 2, middle 2, bottom 2
selected = ([sorted_brier[0][0], sorted_brier[1][0]] +
            [sorted_brier[24][0], sorted_brier[25][0]] +
            [sorted_brier[-2][0], sorted_brier[-1][0]])

for ax, name in zip(axes.flatten(), selected):
    plot_calibration(name, decompositions[name], ax)

plt.suptitle('Calibration Diagrams: Top, Middle, and Bottom Forecasters',
             fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig('calibration_plots.png', dpi=150, bbox_inches='tight')
plt.show()

Step 5: Identifying Calibration vs. Resolution Differences

Some forecasters are well-calibrated but lack resolution (they say "50-50" for everything), while others have high resolution but poor calibration (they differentiate well between events but are systematically biased).

# Scatter plot: calibration vs resolution
fig, ax = plt.subplots(figsize=(10, 8))

cals = [decompositions[n]['calibration'] for n in names]
ress = [decompositions[n]['resolution'] for n in names]
briers = [results[n]['brier'] for n in names]

scatter = ax.scatter(cals, ress, c=briers, cmap='RdYlGn_r', s=80, alpha=0.7)
plt.colorbar(scatter, label='Brier Score (lower = better)')

# Label notable forecasters
for name in [sorted_brier[0][0], sorted_brier[-1][0]]:
    cal = decompositions[name]['calibration']
    res = decompositions[name]['resolution']
    ax.annotate(name, (cal, res), fontsize=8,
                xytext=(5, 5), textcoords='offset points')

ax.set_xlabel('Calibration (lower = better calibrated)')
ax.set_ylabel('Resolution (higher = better discrimination)')
ax.set_title('Calibration vs Resolution for All Forecasters')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('cal_vs_res.png', dpi=150)
plt.show()

# Categorize forecasters
print("\n--- Forecaster Categories ---")
median_cal = np.median(cals)
median_res = np.median(ress)

categories = {
    'Stars (low cal, high res)': [],
    'Calibrated but Flat (low cal, low res)': [],
    'Sharp but Biased (high cal, high res)': [],
    'Poor (high cal, low res)': [],
}

for name in names:
    cal = decompositions[name]['calibration']
    res = decompositions[name]['resolution']
    if cal <= median_cal and res >= median_res:
        categories['Stars (low cal, high res)'].append(name)
    elif cal <= median_cal and res < median_res:
        categories['Calibrated but Flat (low cal, low res)'].append(name)
    elif cal > median_cal and res >= median_res:
        categories['Sharp but Biased (high cal, high res)'].append(name)
    else:
        categories['Poor (high cal, low res)'].append(name)

for cat, members in categories.items():
    print(f"\n  {cat}: {len(members)} forecasters")
    if members:  # guard against an empty category
        avg_brier = np.mean([results[n]['brier'] for n in members])
        print(f"    Average Brier: {avg_brier:.4f}")

Step 6: Statistical Significance via Bootstrap

With only 100 questions, luck plays a significant role. We use bootstrap resampling to estimate confidence intervals on forecaster rankings.

def bootstrap_scores(all_forecasts, outcomes, n_bootstrap=5000):
    """
    Bootstrap resample to get confidence intervals on scores.

    Returns dict mapping forecaster name to (mean, lower_95, upper_95).
    """
    N = len(outcomes)
    names = list(all_forecasts.keys())
    bootstrap_means = {name: [] for name in names}

    for _ in range(n_bootstrap):
        # Resample question indices
        idx = np.random.choice(N, size=N, replace=True)
        resampled_outcomes = outcomes[idx]

        for name in names:
            resampled_forecasts = all_forecasts[name][idx]
            bs = np.mean((resampled_forecasts - resampled_outcomes) ** 2)
            bootstrap_means[name].append(bs)

    results = {}
    for name in names:
        scores = np.array(bootstrap_means[name])
        results[name] = {
            'mean': np.mean(scores),
            'lower_95': np.percentile(scores, 2.5),
            'upper_95': np.percentile(scores, 97.5),
            'std': np.std(scores),
        }
    return results


bootstrap_results = bootstrap_scores(all_forecasts, outcomes, n_bootstrap=5000)

# Plot confidence intervals for top 20 forecasters
top_20_names = [name for name, _ in sorted_brier[:20]]
means = [bootstrap_results[n]['mean'] for n in top_20_names]
lowers = [bootstrap_results[n]['lower_95'] for n in top_20_names]
uppers = [bootstrap_results[n]['upper_95'] for n in top_20_names]

fig, ax = plt.subplots(figsize=(12, 8))
y_pos = range(len(top_20_names))
ax.barh(y_pos, means, xerr=[[m - l for m, l in zip(means, lowers)],
                              [u - m for m, u in zip(means, uppers)]],
        capsize=3, color='steelblue', alpha=0.7)
ax.set_yticks(y_pos)
ax.set_yticklabels(top_20_names)
ax.invert_yaxis()
ax.set_xlabel('Brier Score (95% CI)')
ax.set_title('Top 20 Forecasters with Bootstrap Confidence Intervals')
ax.grid(True, axis='x', alpha=0.3)

plt.tight_layout()
plt.savefig('bootstrap_ci.png', dpi=150)
plt.show()

# Identify significantly better-than-average forecasters
avg_brier = np.mean([results[n]['brier'] for n in names])
print(f"\nOverall average Brier score: {avg_brier:.4f}")
print(f"\nForecasters significantly better than average (95% CI upper < average):")
for name in names:
    if bootstrap_results[name]['upper_95'] < avg_brier:
        print(f"  {name}: {bootstrap_results[name]['mean']:.4f} "
              f"[{bootstrap_results[name]['lower_95']:.4f}, "
              f"{bootstrap_results[name]['upper_95']:.4f}]")

Step 7: Award Structure Design

Based on the analysis, we design a fair award structure.

# Final leaderboard with combined scoring
print("\n" + "=" * 80)
print("GLOBAL FORECASTING CHALLENGE - FINAL RESULTS")
print("=" * 80)

# Normalize scores to [0, 1] range for comparison
brier_scores = {n: results[n]['brier'] for n in names}
log_scores = {n: results[n]['log'] for n in names}

# For Brier: lower is better -> min-max normalize over the observed range
brier_min = min(brier_scores.values())
brier_max = max(brier_scores.values())

# For log: higher (less negative) is better -> min-max normalize over the observed range
log_max_val = max(log_scores.values())  # least negative = best
log_min_val = min(log_scores.values())  # most negative = worst

combined = {}
for name in names:
    # Normalize Brier (best observed score -> 1.0, worst observed -> 0.0)
    norm_brier = 1 - (brier_scores[name] - brier_min) / (brier_max - brier_min)
    # Normalize log score (best observed -> 1.0, worst observed -> 0.0)
    norm_log = (log_scores[name] - log_min_val) / (log_max_val - log_min_val)
    # Combined score (equal weight)
    combined[name] = 0.5 * norm_brier + 0.5 * norm_log

final_ranking = sorted(combined.items(), key=lambda x: x[1], reverse=True)

print(f"\n{'Rank':<6} {'Name':<18} {'Combined':>10} {'Brier':>10} "
      f"{'Log':>10} {'Cal':>10} {'Res':>10}")
print("-" * 80)
for rank, (name, score) in enumerate(final_ranking[:20], 1):
    d = decompositions[name]
    print(f"{rank:<6} {name:<18} {score:10.4f} {results[name]['brier']:10.4f} "
          f"{results[name]['log']:10.4f} {d['calibration']:10.4f} "
          f"{d['resolution']:10.4f}")

# Awards
print("\n--- AWARDS ---")
print(f"  Grand Champion: {final_ranking[0][0]}")
print(f"  Runner-Up:      {final_ranking[1][0]}")
print(f"  Third Place:    {final_ranking[2][0]}")

# Best calibration award
best_cal = min(names, key=lambda n: decompositions[n]['calibration'])
print(f"  Best Calibration: {best_cal} "
      f"(calibration = {decompositions[best_cal]['calibration']:.4f})")

# Best resolution award
best_res = max(names, key=lambda n: decompositions[n]['resolution'])
print(f"  Best Resolution: {best_res} "
      f"(resolution = {decompositions[best_res]['resolution']:.4f})")

# Most improved (in a real tournament, compared to previous year)
print(f"\n  Note: 'Most Improved' award requires previous year's data.")

Key Findings and Lessons

1. Scoring Rules Mostly Agree on Rankings

The Spearman rank correlation between the Brier and log score rankings is typically very high (above 0.90). The scoring rules tend to agree on who the best and worst forecasters are, but may disagree on specific orderings in the middle of the pack.

2. The Biggest Disagreements Involve Extreme Forecasts

Forecasters who frequently use extreme probabilities (near 0 or 1) tend to rank differently under Brier vs. log scoring. The log score heavily penalizes extreme forecasts that are wrong, so forecasters who are confidently wrong on even a few questions can plummet in the log score rankings while remaining respectable under Brier scoring.
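
To see the asymmetry concretely, here is a minimal sketch (self-contained, using the same scoring formulas as Step 1) comparing the penalty a single confidently wrong forecast incurs under each rule against a hedged 50% forecast.

import numpy as np

# A single confidently wrong forecast: p = 0.95 on an event that does not occur (y = 0)
p, y = 0.95, 0
brier_confident = (p - y) ** 2                            # 0.9025 (bounded above by 1)
log_confident = y * np.log(p) + (1 - y) * np.log(1 - p)   # ln(0.05) ~ -3.00

# The same question answered with a hedged 50% forecast
p_hedged = 0.5
brier_hedged = (p_hedged - y) ** 2                         # 0.25
log_hedged = np.log(1 - p_hedged)                          # ln(0.5) ~ -0.69

print(f"Brier: confident wrong = {brier_confident:.4f}, hedged = {brier_hedged:.4f}")
print(f"Log:   confident wrong = {log_confident:.4f},  hedged = {log_hedged:.4f}")
# The confident miss costs ~3.6x the hedged forecast under Brier but ~4.3x under
# the log score, and the log penalty grows without bound as p approaches 1.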

3. Calibration and Resolution Are Distinct Skills

The decomposition analysis reveals that calibration and resolution are largely independent. Some forecasters are well-calibrated but lack resolution (they are "wishy-washy" -- always near 50%), while others have excellent resolution but are systematically biased (e.g., consistently overconfident).
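
As a quick check on this claim in the simulated data, the sketch below (assuming decompositions and names from Step 3 are still in scope) correlates the two decomposition terms across forecasters.

from scipy.stats import spearmanr

# Calibration and resolution terms for every forecaster (computed in Step 3)
cal_terms = [decompositions[n]['calibration'] for n in names]
res_terms = [decompositions[n]['resolution'] for n in names]

rho, pval = spearmanr(cal_terms, res_terms)
print(f"Rank correlation between calibration and resolution terms: rho={rho:.3f} (p={pval:.3f})")
# A weak correlation is consistent with calibration and resolution being distinct skills;
# note that in this simulation both terms are partly driven by the same skill parameter.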

4. Luck Matters Over 100 Questions

The bootstrap analysis shows wide confidence intervals, especially for middle-ranked forecasters. With 100 questions, distinguishing the 10th-best from the 20th-best forecaster is often not possible with statistical confidence. Tournaments need either more questions or multiple years of data to reliably identify the most skilled forecasters.
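
One way to see this directly is a paired bootstrap of the Brier-score difference between two mid-ranked forecasters; the sketch below (assuming sorted_brier, all_forecasts, and outcomes from the steps above are in scope) checks whether the 95% interval for the difference excludes zero.

import numpy as np

# Paired bootstrap: 10th-ranked vs 20th-ranked forecaster by Brier score
name_a = sorted_brier[9][0]
name_b = sorted_brier[19][0]
f_a, f_b = all_forecasts[name_a], all_forecasts[name_b]

n = len(outcomes)
diffs = []
for _ in range(5000):
    idx = np.random.choice(n, size=n, replace=True)
    diffs.append(np.mean((f_a[idx] - outcomes[idx]) ** 2) -
                 np.mean((f_b[idx] - outcomes[idx]) ** 2))

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"{name_a} vs {name_b}: 95% CI for Brier difference = [{lo:.4f}, {hi:.4f}]")
# If the interval contains zero, 100 questions are not enough to separate the two with confidence.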

5. Combined Scoring Is Robust

Using a combination of Brier and log scores reduces the impact of any single scoring rule's idiosyncrasies and produces rankings that capture both calibration and discrimination ability.

Recommendations for Tournament Organizers

  1. Report multiple scoring rules to provide a comprehensive picture of forecaster performance.
  2. Include the Brier decomposition to help forecasters understand their strengths and weaknesses.
  3. Use bootstrap confidence intervals to distinguish genuine skill from luck.
  4. Consider combined scoring (equally-weighted normalized Brier and log scores) for the final ranking.
  5. Run multi-year tournaments to accumulate enough data for reliable rankings.
  6. Award calibration and resolution separately in addition to overall accuracy, to incentivize both skills.