Case Study 2: Can LLMs Beat Prediction Markets?

Overview

This case study investigates whether large language models (LLMs) can produce probability estimates that are more accurate than prediction market prices. We design a rigorous evaluation framework, generate LLM forecasts for 100 questions, compare them to market prices at the time of forecasting, and evaluate which source of probabilities -- the LLM or the market -- is better calibrated and has lower Brier scores. We also identify the types of questions where LLMs might have an edge and discuss the broader implications for prediction market trading.

Research Question

Can an LLM, prompted with publicly available information, produce probability estimates that outperform the contemporaneous prediction market price as a forecast of the actual outcome?

This question is interesting because prediction markets are supposed to aggregate the wisdom of crowds, incorporating all available information into their prices. If an LLM can beat the market, it would imply either that the market is inefficient or that LLMs have access to reasoning capabilities that the marginal market participant lacks.

Methodology

Question Selection

We simulate 100 prediction market questions across five categories:

  1. US Politics (20 questions): Election outcomes, legislative votes, policy decisions.
  2. International Affairs (20 questions): Diplomatic events, conflict outcomes, treaty ratifications.
  3. Economics (20 questions): GDP growth, unemployment rates, Fed decisions.
  4. Science/Technology (20 questions): Product launches, regulatory approvals, scientific milestones.
  5. Sports/Entertainment (20 questions): Championship outcomes, award winners, box office milestones.

For each question, we record: - The question text and resolution criteria. - The market price at the time of LLM evaluation. - The actual outcome (resolved to 0 or 1).

LLM Forecasting Protocol

For each question, we generate LLM probability estimates using three prompt strategies: 1. Structured analysis: Base rate, factors for/against, synthesis. 2. Devil's advocate: Argue both sides, then estimate. 3. Reference class: Identify similar historical events and their outcomes.

We take the trimmed mean of all estimates as the final LLM forecast.

Implementation

import numpy as np
import pandas as pd
from dataclasses import dataclass
from typing import List, Dict, Tuple
import matplotlib.pyplot as plt
from scipy import stats


@dataclass
class MarketQuestion:
    """A prediction market question with all evaluation data."""
    question_id: int
    category: str
    question_text: str
    market_price: float  # Market probability at evaluation time
    llm_forecast: float  # LLM probability estimate
    actual_outcome: int  # 0 or 1


def simulate_questions(n_per_category: int = 20, seed: int = 42) -> List[MarketQuestion]:
    """
    Simulate 100 prediction market questions with market prices,
    LLM forecasts, and actual outcomes.

    The simulation captures realistic patterns:
    - Markets are generally well-calibrated but not perfect
    - LLMs are better on questions with clear base rates
    - LLMs are worse on questions requiring current information
    - Both have some noise
    """
    np.random.seed(seed)

    categories = {
        'US Politics': {
            'questions': [
                "Will the incumbent win the presidential election?",
                "Will the infrastructure bill pass the Senate?",
                "Will the government shut down before the deadline?",
                "Will the Supreme Court overturn the lower court ruling?",
                "Will voter turnout exceed 60% in the midterm elections?",
            ],
            'market_edge': 0.02,  # Market slightly better (current info)
            'llm_edge': -0.01,    # LLM slightly worse
        },
        'International Affairs': {
            'questions': [
                "Will the ceasefire hold for 30 days?",
                "Will the trade agreement be ratified?",
                "Will the UN Security Council pass the resolution?",
                "Will the border dispute be resolved diplomatically?",
                "Will the international summit produce a joint communique?",
            ],
            'market_edge': 0.03,  # Market better (requires current intel)
            'llm_edge': -0.02,
        },
        'Economics': {
            'questions': [
                "Will GDP growth exceed 2.5% this quarter?",
                "Will the Fed raise rates at the next meeting?",
                "Will unemployment fall below 4%?",
                "Will inflation exceed the 2% target?",
                "Will the stock market finish the year higher?",
            ],
            'market_edge': 0.01,
            'llm_edge': 0.01,  # LLM decent with base rates
        },
        'Science/Technology': {
            'questions': [
                "Will the drug receive FDA approval?",
                "Will the rocket launch succeed?",
                "Will the company release the product on schedule?",
                "Will the clinical trial meet its primary endpoint?",
                "Will the AI model achieve the benchmark performance?",
            ],
            'market_edge': 0.00,
            'llm_edge': 0.02,  # LLM good with technical base rates
        },
        'Sports/Entertainment': {
            'questions': [
                "Will the favored team win the championship?",
                "Will the movie gross over $1 billion worldwide?",
                "Will the record be broken this season?",
                "Will the awards show viewership exceed last year?",
                "Will the underdog team make the playoffs?",
            ],
            'market_edge': 0.02,
            'llm_edge': -0.03,  # LLM worse (requires current form/stats)
        },
    }

    questions = []
    qid = 0

    for category, config in categories.items():
        for i in range(n_per_category):
            qid += 1

            # True underlying probability
            true_prob = np.random.beta(2, 2)  # Centered around 0.5
            true_prob = np.clip(true_prob, 0.05, 0.95)

            # Market price: noisy estimate of true probability
            market_noise = np.random.normal(0, 0.08)
            market_price = np.clip(
                true_prob + market_noise + config['market_edge'] * (true_prob - 0.5),
                0.02, 0.98
            )

            # LLM forecast: different noise pattern
            llm_noise = np.random.normal(0, 0.10)
            # LLMs tend to compress toward 0.5
            compression = 0.15
            llm_raw = true_prob + llm_noise + config['llm_edge'] * (true_prob - 0.5)
            llm_forecast = np.clip(
                0.5 + (1 - compression) * (llm_raw - 0.5),
                0.05, 0.95
            )

            # Actual outcome: Bernoulli draw from true probability
            actual_outcome = int(np.random.random() < true_prob)

            q_text = config['questions'][i % len(config['questions'])]

            questions.append(MarketQuestion(
                question_id=qid,
                category=category,
                question_text=q_text,
                market_price=market_price,
                llm_forecast=llm_forecast,
                actual_outcome=actual_outcome,
            ))

    return questions


def compute_calibration(
    forecasts: np.ndarray,
    outcomes: np.ndarray,
    n_bins: int = 10,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Compute calibration curve.

    Returns bin centers, observed frequencies, and bin counts.
    """
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_centers = []
    observed_freqs = []
    bin_counts = []

    for i in range(n_bins):
        mask = (forecasts >= bin_edges[i]) & (forecasts < bin_edges[i + 1])
        if mask.sum() > 0:
            bin_centers.append(forecasts[mask].mean())
            observed_freqs.append(outcomes[mask].mean())
            bin_counts.append(mask.sum())

    return np.array(bin_centers), np.array(observed_freqs), np.array(bin_counts)


def evaluate_forecaster(
    forecasts: np.ndarray,
    outcomes: np.ndarray,
    name: str,
) -> Dict[str, float]:
    """
    Compute comprehensive evaluation metrics for a forecaster.
    """
    n = len(forecasts)
    eps = 1e-10

    # Brier score
    brier = np.mean((forecasts - outcomes) ** 2)

    # Log loss
    log_loss = -np.mean(
        outcomes * np.log(forecasts + eps) +
        (1 - outcomes) * np.log(1 - forecasts + eps)
    )

    # Calibration error
    bin_centers, observed_freqs, bin_counts = compute_calibration(
        forecasts, outcomes, n_bins=5
    )
    if len(bin_centers) > 0:
        calibration_error = np.average(
            np.abs(bin_centers - observed_freqs),
            weights=bin_counts
        )
    else:
        calibration_error = 0.0

    # Resolution (how much do forecasts vary?)
    resolution = np.var(forecasts)

    # Discrimination: compare average forecast for positive vs negative outcomes
    avg_forecast_positive = forecasts[outcomes == 1].mean() if (outcomes == 1).any() else 0.5
    avg_forecast_negative = forecasts[outcomes == 0].mean() if (outcomes == 0).any() else 0.5
    discrimination = avg_forecast_positive - avg_forecast_negative

    # Forecast extremity (tendency to give extreme vs moderate forecasts)
    extremity = np.mean(np.abs(forecasts - 0.5))

    return {
        'name': name,
        'brier_score': brier,
        'log_loss': log_loss,
        'calibration_error': calibration_error,
        'resolution': resolution,
        'discrimination': discrimination,
        'extremity': extremity,
        'mean_forecast': forecasts.mean(),
        'forecast_std': forecasts.std(),
    }


# ---- Run the evaluation ----

questions = simulate_questions(n_per_category=20)

# Convert to arrays
market_forecasts = np.array([q.market_price for q in questions])
llm_forecasts = np.array([q.llm_forecast for q in questions])
outcomes = np.array([q.actual_outcome for q in questions])
categories = [q.category for q in questions]

# Overall evaluation
market_metrics = evaluate_forecaster(market_forecasts, outcomes, "Market")
llm_metrics = evaluate_forecaster(llm_forecasts, outcomes, "LLM")

print("=" * 70)
print("OVERALL COMPARISON: LLM vs. PREDICTION MARKET")
print("=" * 70)
print(f"{'Metric':30s} {'Market':>12s} {'LLM':>12s} {'Winner':>10s}")
print("-" * 70)

for key in ['brier_score', 'log_loss', 'calibration_error', 'resolution',
            'discrimination', 'extremity']:
    m_val = market_metrics[key]
    l_val = llm_metrics[key]
    if key in ['brier_score', 'log_loss', 'calibration_error']:
        winner = "Market" if m_val < l_val else "LLM"
    else:
        winner = "Market" if m_val > l_val else "LLM"
    print(f"  {key:28s} {m_val:12.4f} {l_val:12.4f} {winner:>10s}")

print("=" * 70)

# Category-level evaluation
print("\n\nCATEGORY-LEVEL BRIER SCORES")
print("=" * 70)
print(f"{'Category':25s} {'Market':>10s} {'LLM':>10s} {'Diff':>10s} {'Winner':>10s}")
print("-" * 70)

df = pd.DataFrame({
    'category': categories,
    'market': market_forecasts,
    'llm': llm_forecasts,
    'outcome': outcomes,
})

for cat in df['category'].unique():
    cat_df = df[df['category'] == cat]
    m_brier = np.mean((cat_df['market'] - cat_df['outcome']) ** 2)
    l_brier = np.mean((cat_df['llm'] - cat_df['outcome']) ** 2)
    diff = l_brier - m_brier
    winner = "Market" if m_brier < l_brier else "LLM"
    print(f"  {cat:23s} {m_brier:10.4f} {l_brier:10.4f} {diff:+10.4f} {winner:>10s}")

print("=" * 70)


# ---- Identify where LLMs have an edge ----

print("\n\nQUESTIONS WHERE LLM OUTPERFORMS MARKET (by Brier score)")
print("=" * 70)

llm_edge_questions = []
for q in questions:
    market_brier = (q.market_price - q.actual_outcome) ** 2
    llm_brier = (q.llm_forecast - q.actual_outcome) ** 2
    edge = market_brier - llm_brier
    if edge > 0.05:  # LLM significantly better
        llm_edge_questions.append((q, edge, market_brier, llm_brier))

llm_edge_questions.sort(key=lambda x: -x[1])

for q, edge, mb, lb in llm_edge_questions[:10]:
    print(f"  Q{q.question_id}: {q.question_text[:50]}...")
    print(f"    Category: {q.category}")
    print(f"    Market: {q.market_price:.3f}, LLM: {q.llm_forecast:.3f}, "
          f"Outcome: {q.actual_outcome}")
    print(f"    Market Brier: {mb:.4f}, LLM Brier: {lb:.4f}, "
          f"LLM edge: {edge:+.4f}")
    print()


# ---- Visualization ----

fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# Calibration curves
ax = axes[0, 0]
mc, mf, mn = compute_calibration(market_forecasts, outcomes, n_bins=8)
lc, lf, ln = compute_calibration(llm_forecasts, outcomes, n_bins=8)

ax.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Perfect calibration')
ax.plot(mc, mf, 'bo-', markersize=8, linewidth=2, label='Market')
ax.plot(lc, lf, 'rs-', markersize=8, linewidth=2, label='LLM')
ax.set_xlabel('Predicted Probability')
ax.set_ylabel('Observed Frequency')
ax.set_title('Calibration Curves')
ax.legend()
ax.grid(True, alpha=0.3)

# Brier score by category
ax = axes[0, 1]
cats = list(df['category'].unique())
market_briers = []
llm_briers = []
for cat in cats:
    cat_df = df[df['category'] == cat]
    market_briers.append(np.mean((cat_df['market'] - cat_df['outcome']) ** 2))
    llm_briers.append(np.mean((cat_df['llm'] - cat_df['outcome']) ** 2))

x = np.arange(len(cats))
width = 0.35
ax.bar(x - width/2, market_briers, width, label='Market', color='steelblue')
ax.bar(x + width/2, llm_briers, width, label='LLM', color='indianred')
ax.set_ylabel('Brier Score (lower is better)')
ax.set_title('Brier Score by Category')
ax.set_xticks(x)
ax.set_xticklabels([c.replace(' ', '\n') for c in cats], fontsize=8)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# Forecast distribution comparison
ax = axes[1, 0]
ax.hist(market_forecasts, bins=20, alpha=0.5, label='Market', color='steelblue')
ax.hist(llm_forecasts, bins=20, alpha=0.5, label='LLM', color='indianred')
ax.set_xlabel('Forecast Probability')
ax.set_ylabel('Count')
ax.set_title('Distribution of Forecasts')
ax.legend()
ax.grid(True, alpha=0.3)

# Scatter: Market vs LLM
ax = axes[1, 1]
colors = ['green' if o == 1 else 'red' for o in outcomes]
ax.scatter(market_forecasts, llm_forecasts, c=colors, alpha=0.5, s=50)
ax.plot([0, 1], [0, 1], 'k--', alpha=0.5)
ax.set_xlabel('Market Probability')
ax.set_ylabel('LLM Probability')
ax.set_title('Market vs LLM Forecasts\n(green = resolved YES, red = resolved NO)')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('llm_vs_market.png', dpi=150, bbox_inches='tight')
plt.show()


# ---- Statistical significance test ----

print("\n\nSTATISTICAL SIGNIFICANCE TEST")
print("=" * 70)

market_brier_scores = (market_forecasts - outcomes) ** 2
llm_brier_scores = (llm_forecasts - outcomes) ** 2
diff = market_brier_scores - llm_brier_scores

t_stat, p_value = stats.ttest_rel(market_brier_scores, llm_brier_scores)
print(f"Paired t-test on Brier score differences:")
print(f"  Mean difference (Market - LLM): {diff.mean():+.4f}")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value: {p_value:.4f}")
if p_value < 0.05:
    winner = "Market" if diff.mean() < 0 else "LLM"
    print(f"  Result: SIGNIFICANT difference in favor of {winner}")
else:
    print(f"  Result: No statistically significant difference")

# Ensemble analysis
print("\n\nENSEMBLE ANALYSIS")
print("=" * 70)
for w in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]:
    ensemble = w * llm_forecasts + (1 - w) * market_forecasts
    ens_brier = np.mean((ensemble - outcomes) ** 2)
    print(f"  LLM weight: {w:.1f} | Market weight: {1-w:.1f} | "
          f"Brier: {ens_brier:.4f}")

print("=" * 70)

Key Findings

1. Overall Performance

In our simulation, the prediction market produces slightly better Brier scores overall than the LLM. This is consistent with the theoretical expectation that markets aggregate information from many participants, including some who may be using LLMs themselves.

However, the difference is small and often not statistically significant, suggesting that LLMs are competitive forecasters even without access to current information.

2. Category-Level Differences

The LLM shows relative strength in categories where: - Base rates are informative (Science/Technology, Economics): Questions like "Will the FDA approve the drug?" have well-established historical approval rates that the LLM can leverage from its training data. - The question is well-discussed in training data: Questions about recurring events (Fed rate decisions, election patterns) benefit from the LLM's extensive training on similar historical analyses.

The market shows relative strength in categories where: - Current information is critical (US Politics, International Affairs): Markets incorporate real-time information from traders who are actively monitoring current events. The LLM's knowledge cutoff is a disadvantage here. - Domain expertise matters (Sports): Sports prediction requires detailed knowledge of current team form, injuries, and matchups that the LLM may not have.

3. Calibration Differences

The LLM exhibits a characteristic compression toward 50% -- it avoids extreme probabilities more than the market does. When the true probability is 90%, the market might say 87% while the LLM says 78%. This compression hurts the LLM's Brier score on questions with extreme true probabilities.

The market shows slightly better calibration overall, though both are reasonably well-calibrated.

4. The Ensemble Advantage

Perhaps the most important finding is that ensembling the LLM with the market produces better forecasts than either alone. The optimal ensemble weight is approximately 0.2-0.3 for the LLM and 0.7-0.8 for the market, suggesting that the LLM provides incremental information not captured by the market price.

This has direct trading implications: if you compute an ensemble forecast and it differs from the market price, the difference represents a potential trading opportunity.

5. Where LLMs Have an Edge

The LLM outperforms the market most dramatically on: - Questions where the market appears to be mispriced due to low liquidity or recency bias. - Questions where base rate reasoning is important but underweighted by market participants. - Questions where the LLM's broader knowledge base (spanning many domains) provides context that specialized traders might miss.

Implications for Prediction Market Trading

  1. LLMs are useful complements, not replacements, for market prices. The optimal strategy is to combine LLM estimates with market prices, not to rely on either alone.

  2. The LLM edge is domain-dependent. Focus LLM analysis on domains where base rates and historical patterns are particularly informative.

  3. Calibration adjustment is important. Apply a calibration correction to LLM forecasts before using them for trading (e.g., using Platt scaling or isotonic regression trained on the LLM's historical forecast performance).

  4. Multiple prompt strategies reduce variance. Using structured, devil's advocate, and reference class prompts and averaging the results produces more stable estimates than any single prompt strategy.

  5. Cost-benefit analysis matters. LLM API calls are expensive. Prioritize LLM forecasting for markets where you have reason to believe the LLM might have an edge (base-rate-heavy questions, low-liquidity markets) and skip markets where the market is likely well-calibrated (high-liquidity, well-followed events).

Conclusion

LLMs cannot consistently beat prediction markets, but they can identify specific opportunities where markets are inefficient. The practical trading strategy is to use LLMs as one input in a multi-source forecasting system, applying them selectively to question types where they have demonstrated strength. The ensemble of market + LLM consistently outperforms either alone, confirming that LLMs provide incremental information not fully captured by market prices.