Case Study 2: The Extremizing Experiment
Overview
A persistent finding in forecasting research is that aggregated probability forecasts tend to be insufficiently extreme. The simple average of multiple well-informed forecasters often produces probabilities that are too close to 50%, even when the true probability is much higher or lower. Extremizing -- pushing the aggregate away from 50% -- has been proposed as a correction.
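Concretely, the correction maps the aggregate probability into log-odds space and scales it by a factor d before mapping back. The short sketch below (illustrative values only) previews the transform implemented in Step 3:
from scipy.special import logit, expit

p_crowd = 0.65                      # an illustrative crowd average
for d in (1.0, 1.5, 2.0):
    print(d, round(expit(d * logit(p_crowd)), 3))
# d=1.0 leaves 0.650 unchanged; d=1.5 gives ~0.717; d=2.0 gives ~0.775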
In this case study, we will:
- Simulate 500 resolved prediction markets with crowd forecasts
- Test whether extremizing improves calibration and Brier score
- Find the optimal extremizing parameter across different conditions
- Analyze when extremizing helps versus hurts
- Compare different extremizing strategies
The Setup
We simulate a prediction platform where multiple forecasters contribute probability estimates to questions that eventually resolve as true or false.
"""
The Extremizing Experiment
Case Study 2 - Chapter 25
Test whether extremizing improves prediction market calibration
across 500 resolved markets.
"""
import numpy as np
import pandas as pd
from scipy.special import logit, expit
from scipy.optimize import minimize_scalar
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
np.random.seed(2024)
# ============================================================
# STEP 1: Simulate 500 resolved prediction markets
# ============================================================
def simulate_forecasting_platform(n_questions=500,
n_forecasters_range=(5, 50),
shared_info_fraction=0.4):
"""
Simulate a forecasting platform with multiple questions
and varying numbers of forecasters.
Parameters
----------
n_questions : int
Number of questions to simulate.
n_forecasters_range : tuple
(min, max) number of forecasters per question.
shared_info_fraction : float
Fraction of each forecaster's information that is shared
with all other forecasters (0 to 1).
Returns
-------
list of dicts
Each dict contains question data including individual
forecasts, aggregate, and outcome.
"""
questions = []
for q in range(n_questions):
        # True probability: draw from a realistic distribution
        # Mix of confident and uncertain questions; a single draw keeps
        # the mixture weights explicit (30% / 35% / 35%)
        u = np.random.random()
        if u < 0.30:
            true_p = np.random.beta(1, 1)  # Uniform: uncertain
        elif u < 0.65:
            true_p = np.random.beta(8, 2)  # Likely true
        else:
            true_p = np.random.beta(2, 8)  # Likely false
# Number of forecasters for this question
n_f = np.random.randint(*n_forecasters_range)
# Generate individual forecasts
# Model: each forecaster observes shared signal + private signal
shared_signal = logit(true_p)
# Shared noise (common to all forecasters)
shared_noise = np.random.normal(0, 0.3)
shared_component = shared_signal + shared_noise
forecasts = []
for f in range(n_f):
# Private signal
private_noise = np.random.normal(0, 0.5)
private_component = shared_signal + private_noise
# Forecaster's belief: weighted mix of shared and private
belief_logit = (shared_info_fraction * shared_component +
(1 - shared_info_fraction) * private_component)
# Convert to probability with some additional noise
belief = expit(belief_logit + np.random.normal(0, 0.1))
belief = np.clip(belief, 0.01, 0.99)
forecasts.append(belief)
forecasts = np.array(forecasts)
outcome = 1 if np.random.random() < true_p else 0
# Compute aggregates
simple_avg = np.mean(forecasts)
median_f = np.median(forecasts)
log_odds_avg = expit(np.mean(logit(np.clip(forecasts, 0.01, 0.99))))
questions.append({
'question_id': q,
'true_prob': true_p,
'outcome': outcome,
'n_forecasters': n_f,
'forecasts': forecasts,
'simple_avg': simple_avg,
'median': median_f,
'log_odds_avg': log_odds_avg,
'forecast_std': np.std(forecasts),
'shared_info_fraction': shared_info_fraction,
})
return questions
# Generate data with different levels of shared information
print("Generating simulated prediction markets...")
print("=" * 60)
questions_low_shared = simulate_forecasting_platform(
n_questions=500, shared_info_fraction=0.2)
questions_med_shared = simulate_forecasting_platform(
n_questions=500, shared_info_fraction=0.4)
questions_high_shared = simulate_forecasting_platform(
n_questions=500, shared_info_fraction=0.7)
# Primary analysis uses medium shared info
questions = questions_med_shared
df = pd.DataFrame([{k: v for k, v in q.items() if k != 'forecasts'}
for q in questions])
print(f"Generated {len(df)} questions")
print(f"Average forecasters per question: {df['n_forecasters'].mean():.1f}")
print(f"Outcome base rate: {df['outcome'].mean():.2%}")
print(f"Average forecast std: {df['forecast_std'].mean():.3f}")
Step 2: The Moderation Problem
# ============================================================
# STEP 2: Demonstrate the moderation problem
# ============================================================
print("\n" + "=" * 60)
print("THE MODERATION PROBLEM")
print("=" * 60)
# Calibration analysis of the simple average
def reliability_diagram(forecasts, outcomes, n_bins=10, label=''):
"""Compute reliability diagram data."""
bins = np.linspace(0, 1, n_bins + 1)
bin_data = []
for i in range(n_bins):
mask = (forecasts >= bins[i]) & (forecasts < bins[i+1])
if mask.sum() >= 5: # Minimum sample requirement
bin_data.append({
'bin_center': (bins[i] + bins[i+1]) / 2,
'mean_forecast': np.mean(forecasts[mask]),
'mean_outcome': np.mean(outcomes[mask]),
'count': mask.sum(),
})
return bin_data
forecasts = df['simple_avg'].values
outcomes = df['outcome'].values
cal_data = reliability_diagram(forecasts, outcomes, n_bins=10)
print("\nCalibration of Simple Average:")
print(f"{'Bin Center':>12s} {'Avg Forecast':>13s} {'Avg Outcome':>12s} "
f"{'Count':>6s} {'Gap':>8s}")
print("-" * 55)
for bd in cal_data:
gap = bd['mean_forecast'] - bd['mean_outcome']
print(f" {bd['bin_center']:>8.2f} {bd['mean_forecast']:>10.3f} "
f"{bd['mean_outcome']:>9.3f} {bd['count']:>5d} "
f"{gap:>+7.3f}")
# Key diagnostic: are the aggregate forecasts too moderate?
# Flip forecasts below 0.5 so each forecast points at its predicted
# outcome, then compare stated confidence with the realized hit rate.
# A hit rate above the stated confidence signals under-extremity.
directional_f = np.where(forecasts >= 0.5, forecasts, 1 - forecasts)
directional_o = np.where(forecasts >= 0.5, outcomes, 1 - outcomes)
extreme_mask = (forecasts > 0.75) | (forecasts < 0.25)
moderate_mask = (forecasts >= 0.40) & (forecasts <= 0.60)
print(f"\nExtreme forecasts (p < 0.25 or p > 0.75):")
print(f"  Mean confidence in predicted outcome: "
      f"{np.mean(directional_f[extreme_mask]):.3f}")
print(f"  Realized hit rate:                    "
      f"{np.mean(directional_o[extreme_mask]):.3f}")
print(f"\nModerate forecasts (0.40 <= p <= 0.60):")
print(f"  Mean confidence in predicted outcome: "
      f"{np.mean(directional_f[moderate_mask]):.3f}")
print(f"  Realized hit rate:                    "
      f"{np.mean(directional_o[moderate_mask]):.3f}")
Step 3: Testing Extremizing
# ============================================================
# STEP 3: Test extremizing across a range of parameters
# ============================================================
print("\n" + "=" * 60)
print("EXTREMIZING EXPERIMENT")
print("=" * 60)
def extremize(p, d):
"""Apply extremizing transformation."""
p = np.clip(p, 1e-4, 1 - 1e-4)
return expit(d * logit(p))
# Test a range of extremizing factors
d_values = np.arange(0.5, 4.1, 0.1)
brier_scores = []
log_losses = []
for d in d_values:
ext = extremize(forecasts, d)
bs = brier_score_loss(outcomes, ext)
ll = -np.mean(outcomes * np.log(np.clip(ext, 1e-10, 1)) +
(1 - outcomes) * np.log(np.clip(1 - ext, 1e-10, 1)))
brier_scores.append(bs)
log_losses.append(ll)
brier_scores = np.array(brier_scores)
log_losses = np.array(log_losses)
# Find optimal d
optimal_d_brier = d_values[np.argmin(brier_scores)]
optimal_d_logloss = d_values[np.argmin(log_losses)]
print(f"\nOptimal extremizing factor (Brier): d = {optimal_d_brier:.1f}")
print(f"Optimal extremizing factor (Log Loss): d = {optimal_d_logloss:.1f}")
print(f"\nBrier score at d=1.0 (no extremizing): {brier_scores[d_values == 1.0][0]:.4f}")
print(f"Brier score at optimal d={optimal_d_brier:.1f}: "
f"{np.min(brier_scores):.4f}")
print(f"Improvement: {(brier_scores[d_values == 1.0][0] - np.min(brier_scores)):.4f}")
# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
ax1.plot(d_values, brier_scores, 'b-', linewidth=2)
ax1.axvline(x=1.0, color='gray', linestyle='--', alpha=0.5,
label='No extremizing')
ax1.axvline(x=optimal_d_brier, color='red', linestyle='--',
label=f'Optimal d={optimal_d_brier:.1f}')
ax1.set_xlabel('Extremizing Factor (d)')
ax1.set_ylabel('Brier Score')
ax1.set_title('Brier Score vs. Extremizing Factor')
ax1.legend()
ax2.plot(d_values, log_losses, 'g-', linewidth=2)
ax2.axvline(x=1.0, color='gray', linestyle='--', alpha=0.5,
label='No extremizing')
ax2.axvline(x=optimal_d_logloss, color='red', linestyle='--',
label=f'Optimal d={optimal_d_logloss:.1f}')
ax2.set_xlabel('Extremizing Factor (d)')
ax2.set_ylabel('Log Loss')
ax2.set_title('Log Loss vs. Extremizing Factor')
ax2.legend()
plt.tight_layout()
plt.savefig('extremizing_vs_d.png', dpi=150, bbox_inches='tight')
plt.close()
print("\nPlot saved to extremizing_vs_d.png")
Step 4: Logistic Recalibration
# ============================================================
# STEP 4: Logistic recalibration to find optimal d
# ============================================================
print("\n" + "=" * 60)
print("LOGISTIC RECALIBRATION")
print("=" * 60)
# Split into train/test (first 300 for training, last 200 for testing)
train_idx = np.arange(300)
test_idx = np.arange(300, 500)
train_forecasts = forecasts[train_idx]
train_outcomes = outcomes[train_idx]
test_forecasts = forecasts[test_idx]
test_outcomes = outcomes[test_idx]
# Fit logistic regression on logit of average forecast
X_train = logit(np.clip(train_forecasts, 1e-4, 1 - 1e-4)).reshape(-1, 1)
lr = LogisticRegression(C=1e6, fit_intercept=True, max_iter=1000)
lr.fit(X_train, train_outcomes)
d_learned = lr.coef_[0, 0]
b_learned = lr.intercept_[0]
print(f"Learned extremizing factor: d = {d_learned:.3f}")
print(f"Learned bias correction: b = {b_learned:.3f}")
# Apply to test set
X_test = logit(np.clip(test_forecasts, 1e-4, 1 - 1e-4)).reshape(-1, 1)
recalibrated_test = lr.predict_proba(X_test)[:, 1]
# Compare on test set
print(f"\nTest Set Results:")
print(f" Raw average Brier: "
f"{brier_score_loss(test_outcomes, test_forecasts):.4f}")
print(f" Fixed d=1.5 Brier: "
f"{brier_score_loss(test_outcomes, extremize(test_forecasts, 1.5)):.4f}")
print(f" Fixed d=2.0 Brier: "
f"{brier_score_loss(test_outcomes, extremize(test_forecasts, 2.0)):.4f}")
print(f" Recalibrated Brier: "
f"{brier_score_loss(test_outcomes, recalibrated_test):.4f}")
# Calibration comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, (label, preds) in zip(axes, [
('Raw Average', test_forecasts),
(f'Extremized (d={optimal_d_brier:.1f})',
extremize(test_forecasts, optimal_d_brier)),
('Logistic Recalibration', recalibrated_test),
]):
cal = reliability_diagram(preds, test_outcomes, n_bins=8)
centers = [c['mean_forecast'] for c in cal]
means = [c['mean_outcome'] for c in cal]
counts = [c['count'] for c in cal]
ax.plot([0, 1], [0, 1], 'k--', alpha=0.5)
ax.scatter(centers, means, s=[c*3 for c in counts], alpha=0.7)
bs = brier_score_loss(test_outcomes, preds)
ax.set_title(f'{label}\nBrier = {bs:.4f}')
ax.set_xlabel('Forecast Probability')
ax.set_ylabel('Observed Frequency')
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
plt.tight_layout()
plt.savefig('extremizing_calibration.png', dpi=150, bbox_inches='tight')
plt.close()
print("\nCalibration plot saved to extremizing_calibration.png")
Step 5: When Does Extremizing Help vs. Hurt?
# ============================================================
# STEP 5: Analyze when extremizing helps vs. hurts
# ============================================================
print("\n" + "=" * 60)
print("WHEN DOES EXTREMIZING HELP VS. HURT?")
print("=" * 60)
# Factor 1: Number of forecasters
print("\n--- By Number of Forecasters ---")
n_forecaster_bins = [(5, 15), (15, 30), (30, 50)]
for lo, hi in n_forecaster_bins:
mask = (df['n_forecasters'] >= lo) & (df['n_forecasters'] < hi)
sub_f = forecasts[mask]
sub_o = outcomes[mask]
n = mask.sum()
if n < 20:
continue
bs_raw = brier_score_loss(sub_o, sub_f)
# Find optimal d for this subset
best_d = 1.0
best_bs = bs_raw
for d in np.arange(0.8, 3.5, 0.1):
bs = brier_score_loss(sub_o, extremize(sub_f, d))
if bs < best_bs:
best_bs = bs
best_d = d
improvement = bs_raw - best_bs
print(f" {lo}-{hi} forecasters (n={n}): "
f"raw={bs_raw:.4f}, best d={best_d:.1f}, "
f"extremized={best_bs:.4f}, gain={improvement:+.4f}")
# Factor 2: Forecast certainty (distance from 0.5)
print("\n--- By Forecast Certainty ---")
certainty_bins = [(0.0, 0.1), (0.1, 0.2), (0.2, 0.3), (0.3, 0.5)]
for lo, hi in certainty_bins:
certainty = np.abs(forecasts - 0.5)
mask = (certainty >= lo) & (certainty < hi)
sub_f = forecasts[mask]
sub_o = outcomes[mask]
n = mask.sum()
if n < 20:
continue
bs_raw = brier_score_loss(sub_o, sub_f)
best_d = 1.0
best_bs = bs_raw
for d in np.arange(0.8, 3.5, 0.1):
bs = brier_score_loss(sub_o, extremize(sub_f, d))
if bs < best_bs:
best_bs = bs
best_d = d
improvement = bs_raw - best_bs
print(f" Certainty [{lo:.1f}, {hi:.1f}) (n={n}): "
f"raw={bs_raw:.4f}, best d={best_d:.1f}, "
f"extremized={best_bs:.4f}, gain={improvement:+.4f}")
# Factor 3: Forecaster agreement (low vs. high std)
print("\n--- By Forecaster Agreement ---")
std_median = df['forecast_std'].median()
for label, mask in [('High agreement (low std)',
df['forecast_std'] < std_median),
('Low agreement (high std)',
df['forecast_std'] >= std_median)]:
sub_f = forecasts[mask]
sub_o = outcomes[mask]
n = mask.sum()
bs_raw = brier_score_loss(sub_o, sub_f)
best_d = 1.0
best_bs = bs_raw
for d in np.arange(0.8, 3.5, 0.1):
bs = brier_score_loss(sub_o, extremize(sub_f, d))
if bs < best_bs:
best_bs = bs
best_d = d
improvement = bs_raw - best_bs
print(f" {label} (n={n}): "
f"raw={bs_raw:.4f}, best d={best_d:.1f}, "
f"extremized={best_bs:.4f}, gain={improvement:+.4f}")
Step 6: Effect of Shared Information
# ============================================================
# STEP 6: How shared information affects optimal extremizing
# ============================================================
print("\n" + "=" * 60)
print("SHARED INFORMATION AND OPTIMAL EXTREMIZING")
print("=" * 60)
shared_fractions = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
optimal_ds = []
improvements = []
for sf in shared_fractions:
qs = simulate_forecasting_platform(
n_questions=500, shared_info_fraction=sf)
q_df = pd.DataFrame([{k: v for k, v in q.items() if k != 'forecasts'}
for q in qs])
f = q_df['simple_avg'].values
o = q_df['outcome'].values
bs_raw = brier_score_loss(o, f)
best_d = 1.0
best_bs = bs_raw
for d in np.arange(0.5, 4.0, 0.05):
bs = brier_score_loss(o, extremize(f, d))
if bs < best_bs:
best_bs = bs
best_d = d
optimal_ds.append(best_d)
improvements.append(bs_raw - best_bs)
print(f" Shared info = {sf:.1f}: optimal d = {best_d:.2f}, "
f"Brier improvement = {bs_raw - best_bs:.4f}")
# Theoretical prediction
# With K forecasters, each holding private information n and shared
# information m, the optimal factor is roughly d ~ K*(n + m) / (K*n + m).
# With our parameterization, shared_info_fraction ~ m/(n+m),
# so d ~ 1 / (1 - shared_frac + shared_frac/K),
# which approaches 1 / (1 - shared_frac) for large K.
print("\n Theoretical prediction (approximation):")
for sf in shared_fractions:
d_theory = 1.0 / (1.0 - sf + sf / 25) # avg K ~ 25
print(f" Shared info = {sf:.1f}: d_theory ~ {d_theory:.2f}")
# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.plot(shared_fractions, optimal_ds, 'bo-', linewidth=2, markersize=8)
ax1.set_xlabel('Shared Information Fraction')
ax1.set_ylabel('Optimal Extremizing Factor (d)')
ax1.set_title('Optimal d vs. Shared Information')
ax1.grid(True, alpha=0.3)
ax2.plot(shared_fractions, improvements, 'ro-', linewidth=2, markersize=8)
ax2.set_xlabel('Shared Information Fraction')
ax2.set_ylabel('Brier Score Improvement')
ax2.set_title('Benefit of Extremizing vs. Shared Information')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('shared_info_extremizing.png', dpi=150, bbox_inches='tight')
plt.close()
print("\nPlot saved to shared_info_extremizing.png")
Step 7: Comparing Extremizing Strategies
# ============================================================
# STEP 7: Compare different extremizing strategies
# ============================================================
print("\n" + "=" * 60)
print("COMPARING EXTREMIZING STRATEGIES")
print("=" * 60)
# Use cross-validation
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
strategies = {
'No extremizing': lambda f, o, tr, te: f[te],
'Fixed d=1.5': lambda f, o, tr, te: extremize(f[te], 1.5),
'Fixed d=2.0': lambda f, o, tr, te: extremize(f[te], 2.0),
'Grid search d': None, # Will implement below
'Logistic recal.': None, # Will implement below
'Log-odds avg': lambda f, o, tr, te: df['log_odds_avg'].values[te],
}
cv_results = {s: [] for s in strategies}
for fold, (train_idx, test_idx) in enumerate(kf.split(forecasts)):
train_f, test_f = forecasts[train_idx], forecasts[test_idx]
train_o, test_o = outcomes[train_idx], outcomes[test_idx]
# Fixed strategies
for name in ['No extremizing', 'Fixed d=1.5', 'Fixed d=2.0',
'Log-odds avg']:
preds = strategies[name](forecasts, outcomes, train_idx, test_idx)
bs = brier_score_loss(test_o, preds)
cv_results[name].append(bs)
# Grid search d on training set
best_d_gs = 1.0
best_bs_gs = float('inf')
for d in np.arange(0.5, 4.0, 0.1):
bs = brier_score_loss(train_o, extremize(train_f, d))
if bs < best_bs_gs:
best_bs_gs = bs
best_d_gs = d
preds_gs = extremize(test_f, best_d_gs)
cv_results['Grid search d'].append(
brier_score_loss(test_o, preds_gs))
# Logistic recalibration on training set
X_lr = logit(np.clip(train_f, 1e-4, 1 - 1e-4)).reshape(-1, 1)
lr_cv = LogisticRegression(C=1e6, max_iter=1000)
lr_cv.fit(X_lr, train_o)
X_test_lr = logit(np.clip(test_f, 1e-4, 1 - 1e-4)).reshape(-1, 1)
preds_lr = lr_cv.predict_proba(X_test_lr)[:, 1]
cv_results['Logistic recal.'].append(
brier_score_loss(test_o, preds_lr))
print(f"\n{'Strategy':<25s} {'Mean Brier':>10s} {'Std':>8s} {'Rank':>6s}")
print("-" * 55)
ranked = sorted(cv_results.items(), key=lambda x: np.mean(x[1]))
for rank, (name, scores) in enumerate(ranked, 1):
mean = np.mean(scores)
std = np.std(scores)
print(f" {name:<23s} {mean:>10.4f} {std:>8.4f} {rank:>5d}")
Summary of Findings
# ============================================================
# STEP 8: Summary and conclusions
# ============================================================
print("\n" + "=" * 60)
print("SUMMARY OF FINDINGS")
print("=" * 60)
print("""
1. EXTREMIZING IMPROVES AGGREGATE FORECASTS
The simple average of crowd forecasts is systematically too
moderate. Extremizing with an appropriate factor consistently
improves both Brier score and calibration.
2. OPTIMAL EXTREMIZING FACTOR DEPENDS ON SHARED INFORMATION
When forecasters share more common information (higher shared
information fraction), the optimal extremizing factor increases.
This matches the theoretical prediction: shared information
causes over-counting in simple averaging.
3. EXTREMIZING HELPS MORE FOR:
- Questions with more forecasters (more shared information
to correct for)
- Moderate forecasts (near 0.5) where there is room to move
- High forecaster agreement (agreement amplifies the
shared-information problem)
4. EXTREMIZING CAN HURT WHEN:
- Forecasters have genuinely independent information
(low shared info fraction)
- The aggregate is already extreme (close to 0 or 1)
- The extremizing factor is poorly calibrated
5. LOGISTIC RECALIBRATION IS THE SAFEST APPROACH
Rather than guessing an extremizing factor, fitting a logistic
regression to calibration data learns both the optimal
extremizing factor AND a bias correction. This data-driven
approach is more robust than fixed extremizing.
6. PRACTICAL RECOMMENDATION
Start with d = 1.5 as a reasonable default. If you have
calibration data (50+ resolved questions), use logistic
recalibration to learn the optimal factor. Re-estimate
periodically as the platform evolves.
""")
Discussion Questions
- Why does the optimal extremizing factor increase with the number of forecasters? Think about what happens to the ratio of shared to unique information as more people contribute.
- If you were building a real forecasting platform, would you extremize the aggregate before or after displaying it to users? Consider the incentive effects: if users see an extremized aggregate, they might anchor to it differently.
- How would you handle the case where different questions have different levels of shared information? Could you predict the optimal extremizing factor from question characteristics?
- The Good Judgment Project found optimal extremizing factors around d = 2.5 for geopolitical forecasting questions. Why might this domain have particularly high shared information among forecasters?
- What are the risks of over-extremizing? Consider what happens when you push a 60% forecast to 90% and the event does not occur.