Case Study 2: The Extremizing Experiment
Overview
A persistent finding in forecasting research is that aggregated probability forecasts tend to be insufficiently extreme. The simple average of multiple well-informed forecasters often produces probabilities that are too close to 50%, even when the true probability is much higher or lower. Extremizing -- pushing the aggregate away from 50% -- has been proposed as a correction.
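Concretely, the correction maps the aggregate probability into log-odds space and scales it by a factor d before mapping back. The short sketch below (illustrative values only) previews the transform implemented in Step 3:
from scipy.special import logit, expit

p_crowd = 0.65                      # an illustrative crowd average
for d in (1.0, 1.5, 2.0):
    print(d, round(expit(d * logit(p_crowd)), 3))
# d=1.0 leaves 0.650 unchanged; d=1.5 gives ~0.717; d=2.0 gives ~0.775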
In this case study, we will:
- Simulate 500 resolved prediction markets with crowd forecasts
- Test whether extremizing improves calibration and Brier score
- Find the optimal extremizing parameter across different conditions
- Analyze when extremizing helps versus hurts
- Compare different extremizing strategies
The Setup
We simulate a prediction platform where multiple forecasters contribute probability estimates to questions that eventually resolve as true or false.
"""
The Extremizing Experiment
Case Study 2 - Chapter 25
Test whether extremizing improves prediction market calibration
across 500 resolved markets.
"""
import numpy as np
import pandas as pd
from scipy.special import logit, expit
from scipy.optimize import minimize_scalar
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
np.random.seed(2024)
# ============================================================
# STEP 1: Simulate 500 resolved prediction markets
# ============================================================
def simulate_forecasting_platform(n_questions=500,
n_forecasters_range=(5, 50),
shared_info_fraction=0.4):
"""
Simulate a forecasting platform with multiple questions
and varying numbers of forecasters.
Parameters
----------
n_questions : int
Number of questions to simulate.
n_forecasters_range : tuple
(min, max) number of forecasters per question.
shared_info_fraction : float
Fraction of each forecaster's information that is shared
with all other forecasters (0 to 1).
Returns
-------
list of dicts
Each dict contains question data including individual
forecasts, aggregate, and outcome.
"""
questions = []
for q in range(n_questions):
        # True probability: draw from a realistic distribution
        # Mix of confident and uncertain questions; a single draw keeps
        # the mixture weights explicit (30% / 35% / 35%)
        u = np.random.random()
        if u < 0.30:
            true_p = np.random.beta(1, 1)  # Uniform: uncertain
        elif u < 0.65:
            true_p = np.random.beta(8, 2)  # Likely true
        else:
            true_p = np.random.beta(2, 8)  # Likely false
# Number of forecasters for this question
n_f = np.random.randint(*n_forecasters_range)
# Generate individual forecasts
# Model: each forecaster observes shared signal + private signal
shared_signal = logit(true_p)
# Shared noise (common to all forecasters)
shared_noise = np.random.normal(0, 0.3)
shared_component = shared_signal + shared_noise
forecasts = []
for f in range(n_f):
# Private signal
private_noise = np.random.normal(0, 0.5)
private_component = shared_signal + private_noise
# Forecaster's belief: weighted mix of shared and private
belief_logit = (shared_info_fraction * shared_component +
(1 - shared_info_fraction) * private_component)
# Convert to probability with some additional noise
belief = expit(belief_logit + np.random.normal(0, 0.1))
belief = np.clip(belief, 0.01, 0.99)
forecasts.append(belief)
forecasts = np.array(forecasts)
outcome = 1 if np.random.random() < true_p else 0
# Compute aggregates
simple_avg = np.mean(forecasts)
median_f = np.median(forecasts)
log_odds_avg = expit(np.mean(logit(np.clip(forecasts, 0.01, 0.99))))
questions.append({
'question_id': q,
'true_prob': true_p,
'outcome': outcome,
'n_forecasters': n_f,
'forecasts': forecasts,
'simple_avg': simple_avg,
'median': median_f,
'log_odds_avg': log_odds_avg,
'forecast_std': np.std(forecasts),
'shared_info_fraction': shared_info_fraction,
})
return questions
# Generate data with different levels of shared information
print("Generating simulated prediction markets...")
print("=" * 60)
questions_low_shared = simulate_forecasting_platform(
n_questions=500, shared_info_fraction=0.2)
questions_med_shared = simulate_forecasting_platform(
n_questions=500, shared_info_fraction=0.4)
questions_high_shared = simulate_forecasting_platform(
n_questions=500, shared_info_fraction=0.7)
# Primary analysis uses medium shared info
questions = questions_med_shared
df = pd.DataFrame([{k: v for k, v in q.items() if k != 'forecasts'}
for q in questions])
print(f"Generated {len(df)} questions")
print(f"Average forecasters per question: {df['n_forecasters'].mean():.1f}")
print(f"Outcome base rate: {df['outcome'].mean():.2%}")
print(f"Average forecast std: {df['forecast_std'].mean():.3f}")
Step 2: The Moderation Problem
# ============================================================
# STEP 2: Demonstrate the moderation problem
# ============================================================
print("\n" + "=" * 60)
print("THE MODERATION PROBLEM")
print("=" * 60)
# Calibration analysis of the simple average
def reliability_diagram(forecasts, outcomes, n_bins=10, label=''):
"""Compute reliability diagram data."""
bins = np.linspace(0, 1, n_bins + 1)
bin_data = []
for i in range(n_bins):
mask = (forecasts >= bins[i]) & (forecasts < bins[i+1])
if mask.sum() >= 5: # Minimum sample requirement
bin_data.append({
'bin_center': (bins[i] + bins[i+1]) / 2,
'mean_forecast': np.mean(forecasts[mask]),
'mean_outcome': np.mean(outcomes[mask]),
'count': mask.sum(),
})
return bin_data
forecasts = df['simple_avg'].values
outcomes = df['outcome'].values
cal_data = reliability_diagram(forecasts, outcomes, n_bins=10)
print("\nCalibration of Simple Average:")
print(f"{'Bin Center':>12s} {'Avg Forecast':>13s} {'Avg Outcome':>12s} "
f"{'Count':>6s} {'Gap':>8s}")
print("-" * 55)
for bd in cal_data:
gap = bd['mean_forecast'] - bd['mean_outcome']
print(f" {bd['bin_center']:>8.2f} {bd['mean_forecast']:>10.3f} "
f"{bd['mean_outcome']:>9.3f} {bd['count']:>5d} "
f"{gap:>+7.3f}")
# Key diagnostic: are the aggregate forecasts too moderate?
# Flip forecasts below 0.5 so each forecast points at its predicted
# outcome, then compare stated confidence with the realized hit rate.
# A hit rate above the stated confidence signals under-extremity.
directional_f = np.where(forecasts >= 0.5, forecasts, 1 - forecasts)
directional_o = np.where(forecasts >= 0.5, outcomes, 1 - outcomes)
extreme_mask = (forecasts > 0.75) | (forecasts < 0.25)
moderate_mask = (forecasts >= 0.40) & (forecasts <= 0.60)
print(f"\nExtreme forecasts (p < 0.25 or p > 0.75):")
print(f"  Mean confidence in predicted outcome: "
      f"{np.mean(directional_f[extreme_mask]):.3f}")
print(f"  Realized hit rate:                    "
      f"{np.mean(directional_o[extreme_mask]):.3f}")
print(f"\nModerate forecasts (0.40 <= p <= 0.60):")
print(f"  Mean confidence in predicted outcome: "
      f"{np.mean(directional_f[moderate_mask]):.3f}")
print(f"  Realized hit rate:                    "
      f"{np.mean(directional_o[moderate_mask]):.3f}")
Step 3: Testing Extremizing
# ============================================================
# STEP 3: Test extremizing across a range of parameters
# ============================================================
print("\n" + "=" * 60)
print("EXTREMIZING EXPERIMENT")
print("=" * 60)
def extremize(p, d):
"""Apply extremizing transformation."""
p = np.clip(p, 1e-4, 1 - 1e-4)
return expit(d * logit(p))
# Test a range of extremizing factors
d_values = np.arange(0.5, 4.1, 0.1)
brier_scores = []
log_losses = []
for d in d_values:
ext = extremize(forecasts, d)
bs = brier_score_loss(outcomes, ext)
ll = -np.mean(outcomes * np.log(np.clip(ext, 1e-10, 1)) +
(1 - outcomes) * np.log(np.clip(1 - ext, 1e-10, 1)))
brier_scores.append(bs)
log_losses.append(ll)
brier_scores = np.array(brier_scores)
log_losses = np.array(log_losses)
# Find optimal d
optimal_d_brier = d_values[np.argmin(brier_scores)]
optimal_d_logloss = d_values[np.argmin(log_losses)]
print(f"\nOptimal extremizing factor (Brier): d = {optimal_d_brier:.1f}")
print(f"Optimal extremizing factor (Log Loss): d = {optimal_d_logloss:.1f}")
print(f"\nBrier score at d=1.0 (no extremizing): {brier_scores[d_values == 1.0][0]:.4f}")
print(f"Brier score at optimal d={optimal_d_brier:.1f}: "
f"{np.min(brier_scores):.4f}")
print(f"Improvement: {(brier_scores[d_values == 1.0][0] - np.min(brier_scores)):.4f}")
# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
ax1.plot(d_values, brier_scores, 'b-', linewidth=2)
ax1.axvline(x=1.0, color='gray', linestyle='--', alpha=0.5,
label='No extremizing')
ax1.axvline(x=optimal_d_brier, color='red', linestyle='--',
label=f'Optimal d={optimal_d_brier:.1f}')
ax1.set_xlabel('Extremizing Factor (d)')
ax1.set_ylabel('Brier Score')
ax1.set_title('Brier Score vs. Extremizing Factor')
ax1.legend()
ax2.plot(d_values, log_losses, 'g-', linewidth=2)
ax2.axvline(x=1.0, color='gray', linestyle='--', alpha=0.5,
label='No extremizing')
ax2.axvline(x=optimal_d_logloss, color='red', linestyle='--',
label=f'Optimal d={optimal_d_logloss:.1f}')
ax2.set_xlabel('Extremizing Factor (d)')
ax2.set_ylabel('Log Loss')
ax2.set_title('Log Loss vs. Extremizing Factor')
ax2.legend()
plt.tight_layout()
plt.savefig('extremizing_vs_d.png', dpi=150, bbox_inches='tight')
plt.close()
print("\nPlot saved to extremizing_vs_d.png")
Step 4: Logistic Recalibration
# ============================================================
# STEP 4: Logistic recalibration to find optimal d
# ============================================================
print("\n" + "=" * 60)
print("LOGISTIC RECALIBRATION")
print("=" * 60)
# Split into train/test (first 300 for training, last 200 for testing)
train_idx = np.arange(300)
test_idx = np.arange(300, 500)
train_forecasts = forecasts[train_idx]
train_outcomes = outcomes[train_idx]
test_forecasts = forecasts[test_idx]
test_outcomes = outcomes[test_idx]
# Fit logistic regression on logit of average forecast
X_train = logit(np.clip(train_forecasts, 1e-4, 1 - 1e-4)).reshape(-1, 1)
lr = LogisticRegression(C=1e6, fit_intercept=True, max_iter=1000)
lr.fit(X_train, train_outcomes)
d_learned = lr.coef_[0, 0]
b_learned = lr.intercept_[0]
print(f"Learned extremizing factor: d = {d_learned:.3f}")
print(f"Learned bias correction: b = {b_learned:.3f}")
# Apply to test set
X_test = logit(np.clip(test_forecasts, 1e-4, 1 - 1e-4)).reshape(-1, 1)
recalibrated_test = lr.predict_proba(X_test)[:, 1]
# Compare on test set
print(f"\nTest Set Results:")
print(f" Raw average Brier: "
f"{brier_score_loss(test_outcomes, test_forecasts):.4f}")
print(f" Fixed d=1.5 Brier: "
f"{brier_score_loss(test_outcomes, extremize(test_forecasts, 1.5)):.4f}")
print(f" Fixed d=2.0 Brier: "
f"{brier_score_loss(test_outcomes, extremize(test_forecasts, 2.0)):.4f}")
print(f" Recalibrated Brier: "
f"{brier_score_loss(test_outcomes, recalibrated_test):.4f}")
# Calibration comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, (label, preds) in zip(axes, [
('Raw Average', test_forecasts),
(f'Extremized (d={optimal_d_brier:.1f})',
extremize(test_forecasts, optimal_d_brier)),
('Logistic Recalibration', recalibrated_test),
]):
cal = reliability_diagram(preds, test_outcomes, n_bins=8)
centers = [c['mean_forecast'] for c in cal]
means = [c['mean_outcome'] for c in cal]
counts = [c['count'] for c in cal]
ax.plot([0, 1], [0, 1], 'k--', alpha=0.5)
ax.scatter(centers, means, s=[c*3 for c in counts], alpha=0.7)
bs = brier_score_loss(test_outcomes, preds)
ax.set_title(f'{label}\nBrier = {bs:.4f}')
ax.set_xlabel('Forecast Probability')
ax.set_ylabel('Observed Frequency')
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
plt.tight_layout()
plt.savefig('extremizing_calibration.png', dpi=150, bbox_inches='tight')
plt.close()
print("\nCalibration plot saved to extremizing_calibration.png")
Step 5: When Does Extremizing Help vs. Hurt?
# ============================================================
# STEP 5: Analyze when extremizing helps vs. hurts
# ============================================================
print("\n" + "=" * 60)
print("WHEN DOES EXTREMIZING HELP VS. HURT?")
print("=" * 60)
# Factor 1: Number of forecasters
print("\n--- By Number of Forecasters ---")
n_forecaster_bins = [(5, 15), (15, 30), (30, 50)]
for lo, hi in n_forecaster_bins:
mask = (df['n_forecasters'] >= lo) & (df['n_forecasters'] < hi)
sub_f = forecasts[mask]
sub_o = outcomes[mask]
n = mask.sum()
if n < 20:
continue
bs_raw = brier_score_loss(sub_o, sub_f)
# Find optimal d for this subset
best_d = 1.0
best_bs = bs_raw
for d in np.arange(0.8, 3.5, 0.1):
bs = brier_score_loss(sub_o, extremize(sub_f, d))
if bs < best_bs:
best_bs = bs
best_d = d
improvement = bs_raw - best_bs
print(f" {lo}-{hi} forecasters (n={n}): "
f"raw={bs_raw:.4f}, best d={best_d:.1f}, "
f"extremized={best_bs:.4f}, gain={improvement:+.4f}")
# Factor 2: Forecast certainty (distance from 0.5)
print("\n--- By Forecast Certainty ---")
certainty_bins = [(0.0, 0.1), (0.1, 0.2), (0.2, 0.3), (0.3, 0.5)]
for lo, hi in certainty_bins:
certainty = np.abs(forecasts - 0.5)
mask = (certainty >= lo) & (certainty < hi)
sub_f = forecasts[mask]
sub_o = outcomes[mask]
n = mask.sum()
if n < 20:
continue
bs_raw = brier_score_loss(sub_o, sub_f)
best_d = 1.0
best_bs = bs_raw
for d in np.arange(0.8, 3.5, 0.1):
bs = brier_score_loss(sub_o, extremize(sub_f, d))
if bs < best_bs:
best_bs = bs
best_d = d
improvement = bs_raw - best_bs
print(f" Certainty [{lo:.1f}, {hi:.1f}) (n={n}): "
f"raw={bs_raw:.4f}, best d={best_d:.1f}, "
f"extremized={best_bs:.4f}, gain={improvement:+.4f}")
# Factor 3: Forecaster agreement (low vs. high std)
print("\n--- By Forecaster Agreement ---")
std_median = df['forecast_std'].median()
for label, mask in [('High agreement (low std)',
df['forecast_std'] < std_median),
('Low agreement (high std)',
df['forecast_std'] >= std_median)]:
sub_f = forecasts[mask]
sub_o = outcomes[mask]
n = mask.sum()
bs_raw = brier_score_loss(sub_o, sub_f)
best_d = 1.0
best_bs = bs_raw
for d in np.arange(0.8, 3.5, 0.1):
bs = brier_score_loss(sub_o, extremize(sub_f, d))
if bs < best_bs:
best_bs = bs
best_d = d
improvement = bs_raw - best_bs
print(f" {label} (n={n}): "
f"raw={bs_raw:.4f}, best d={best_d:.1f}, "
f"extremized={best_bs:.4f}, gain={improvement:+.4f}")
Step 6: Effect of Shared Information
# ============================================================
# STEP 6: How shared information affects optimal extremizing
# ============================================================
print("\n" + "=" * 60)
print("SHARED INFORMATION AND OPTIMAL EXTREMIZING")
print("=" * 60)
shared_fractions = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
optimal_ds = []
improvements = []
for sf in shared_fractions:
qs = simulate_forecasting_platform(
n_questions=500, shared_info_fraction=sf)
q_df = pd.DataFrame([{k: v for k, v in q.items() if k != 'forecasts'}
for q in qs])
f = q_df['simple_avg'].values
o = q_df['outcome'].values
bs_raw = brier_score_loss(o, f)
best_d = 1.0
best_bs = bs_raw
for d in np.arange(0.5, 4.0, 0.05):
bs = brier_score_loss(o, extremize(f, d))
if bs < best_bs:
best_bs = bs
best_d = d
optimal_ds.append(best_d)
improvements.append(bs_raw - best_bs)
print(f" Shared info = {sf:.1f}: optimal d = {best_d:.2f}, "
f"Brier improvement = {bs_raw - best_bs:.4f}")
# Theoretical prediction
# With K forecasters, each holding private information n and shared
# information m, the optimal factor is roughly d ~ K*(n + m) / (K*n + m).
# With our parameterization, shared_info_fraction ~ m/(n+m),
# so d ~ 1 / (1 - shared_frac + shared_frac/K),
# which approaches 1 / (1 - shared_frac) for large K.
print("\n Theoretical prediction (approximation):")
for sf in shared_fractions:
d_theory = 1.0 / (1.0 - sf + sf / 25) # avg K ~ 25
print(f" Shared info = {sf:.1f}: d_theory ~ {d_theory:.2f}")
# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.plot(shared_fractions, optimal_ds, 'bo-', linewidth=2, markersize=8)
ax1.set_xlabel('Shared Information Fraction')
ax1.set_ylabel('Optimal Extremizing Factor (d)')
ax1.set_title('Optimal d vs. Shared Information')
ax1.grid(True, alpha=0.3)
ax2.plot(shared_fractions, improvements, 'ro-', linewidth=2, markersize=8)
ax2.set_xlabel('Shared Information Fraction')
ax2.set_ylabel('Brier Score Improvement')
ax2.set_title('Benefit of Extremizing vs. Shared Information')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('shared_info_extremizing.png', dpi=150, bbox_inches='tight')
plt.close()
print("\nPlot saved to shared_info_extremizing.png")
Step 7: Comparing Extremizing Strategies
# ============================================================
# STEP 7: Compare different extremizing strategies
# ============================================================
print("\n" + "=" * 60)
print("COMPARING EXTREMIZING STRATEGIES")
print("=" * 60)
# Use cross-validation
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
strategies = {
'No extremizing': lambda f, o, tr, te: f[te],
'Fixed d=1.5': lambda f, o, tr, te: extremize(f[te], 1.5),
'Fixed d=2.0': lambda f, o, tr, te: extremize(f[te], 2.0),
'Grid search d': None, # Will implement below
'Logistic recal.': None, # Will implement below
'Log-odds avg': lambda f, o, tr, te: df['log_odds_avg'].values[te],
}
cv_results = {s: [] for s in strategies}
for fold, (train_idx, test_idx) in enumerate(kf.split(forecasts)):
train_f, test_f = forecasts[train_idx], forecasts[test_idx]
train_o, test_o = outcomes[train_idx], outcomes[test_idx]
# Fixed strategies
for name in ['No extremizing', 'Fixed d=1.5', 'Fixed d=2.0',
'Log-odds avg']:
preds = strategies[name](forecasts, outcomes, train_idx, test_idx)
bs = brier_score_loss(test_o, preds)
cv_results[name].append(bs)
# Grid search d on training set
best_d_gs = 1.0
best_bs_gs = float('inf')
for d in np.arange(0.5, 4.0, 0.1):
bs = brier_score_loss(train_o, extremize(train_f, d))
if bs < best_bs_gs:
best_bs_gs = bs
best_d_gs = d
preds_gs = extremize(test_f, best_d_gs)
cv_results['Grid search d'].append(
brier_score_loss(test_o, preds_gs))
# Logistic recalibration on training set
X_lr = logit(np.clip(train_f, 1e-4, 1 - 1e-4)).reshape(-1, 1)
lr_cv = LogisticRegression(C=1e6, max_iter=1000)
lr_cv.fit(X_lr, train_o)
X_test_lr = logit(np.clip(test_f, 1e-4, 1 - 1e-4)).reshape(-1, 1)
preds_lr = lr_cv.predict_proba(X_test_lr)[:, 1]
cv_results['Logistic recal.'].append(
brier_score_loss(test_o, preds_lr))
print(f"\n{'Strategy':<25s} {'Mean Brier':>10s} {'Std':>8s} {'Rank':>6s}")
print("-" * 55)
ranked = sorted(cv_results.items(), key=lambda x: np.mean(x[1]))
for rank, (name, scores) in enumerate(ranked, 1):
mean = np.mean(scores)
std = np.std(scores)
print(f" {name:<23s} {mean:>10.4f} {std:>8.4f} {rank:>5d}")
Summary of Findings
# ============================================================
# STEP 8: Summary and conclusions
# ============================================================
print("\n" + "=" * 60)
print("SUMMARY OF FINDINGS")
print("=" * 60)
print("""
1. EXTREMIZING IMPROVES AGGREGATE FORECASTS
The simple average of crowd forecasts is systematically too
moderate. Extremizing with an appropriate factor consistently
improves both Brier score and calibration.
2. OPTIMAL EXTREMIZING FACTOR DEPENDS ON SHARED INFORMATION
When forecasters share more common information (higher shared
information fraction), the optimal extremizing factor increases.
This matches the theoretical prediction: shared information
causes over-counting in simple averaging.
3. EXTREMIZING HELPS MORE FOR:
- Questions with more forecasters (more shared information
to correct for)
- Moderate forecasts (near 0.5) where there is room to move
- High forecaster agreement (agreement amplifies the
shared-information problem)
4. EXTREMIZING CAN HURT WHEN:
- Forecasters have genuinely independent information
(low shared info fraction)
- The aggregate is already extreme (close to 0 or 1)
- The extremizing factor is poorly calibrated
5. LOGISTIC RECALIBRATION IS THE SAFEST APPROACH
Rather than guessing an extremizing factor, fitting a logistic
regression to calibration data learns both the optimal
extremizing factor AND a bias correction. This data-driven
approach is more robust than fixed extremizing.
6. PRACTICAL RECOMMENDATION
Start with d = 1.5 as a reasonable default. If you have
calibration data (50+ resolved questions), use logistic
recalibration to learn the optimal factor. Re-estimate
periodically as the platform evolves.
""")
Discussion Questions
- Why does the optimal extremizing factor increase with the number of forecasters? Think about what happens to the ratio of shared to unique information as more people contribute.
- If you were building a real forecasting platform, would you extremize the aggregate before or after displaying it to users? Consider the incentive effects: if users see an extremized aggregate, they might anchor to it differently.
- How would you handle the case where different questions have different levels of shared information? Could you predict the optimal extremizing factor from question characteristics?
- The Good Judgment Project found optimal extremizing factors around d = 2.5 for geopolitical forecasting questions. Why might this domain have particularly high shared information among forecasters?
- What are the risks of over-extremizing? Consider what happens when you push a 60% forecast to 90% and the event does not occur.