Case Study 1: GPT-4 as a Forecaster — Benchmarking LLM Predictions Against Human Superforecasters
Overview
Since 2023, researchers have systematically tested whether large language models (LLMs) can produce calibrated probability forecasts that rival or exceed those of skilled human forecasters. This case study examines the most rigorous attempts to benchmark LLM forecasting performance, focusing on GPT-4 and its successors, and draws lessons for the design of hybrid human-AI prediction markets.
We draw primarily on the work of Halawi et al. (2024), who conducted the first large-scale head-to-head comparison of LLM forecasts against the Metaculus community and a panel of superforecasters. We also incorporate findings from Anthropic's internal forecasting experiments, the ForecastBench initiative, and independent replications.
Background
The Promise of LLM Forecasting
The appeal of LLM-based forecasting is straightforward: if an LLM can produce forecasts that are even moderately well-calibrated, the implications for prediction markets are transformative. A single LLM instance can:
- Scale to thousands of questions. Human forecasters are scarce and expensive. An LLM can generate forecasts for any well-defined question in seconds.
- Process enormous context. LLMs can ingest entire Wikipedia articles, news corpora, and research papers before forming a judgment — a breadth of information processing that no individual human can match.
- Operate without ego. LLMs do not suffer from motivated reasoning, career concerns, or social pressure — though they have their own systematic biases.
- Update continuously. Given access to recent information (via retrieval-augmented generation or tool use), LLMs can update forecasts as frequently as new data arrives.
Experimental Design
The Halawi et al. (2024) study used the following protocol:
Question set. 1,000 binary forecasting questions from Metaculus, spanning politics, science, technology, economics, and geopolitics. Questions were selected to have clear resolution criteria and resolution dates between 1 month and 2 years from the forecast date.
LLM forecasting pipeline. GPT-4 (gpt-4-1106-preview) was prompted using three strategies:
- Base rate prompting: The model was asked to identify the relevant reference class and base rate before estimating the probability.
- Adversarial prompting: The model was asked to argue both for and against the event before synthesizing a final estimate.
- Decomposition prompting: The model was asked to break the question into sub-questions, estimate each sub-question, and combine them.
The final LLM forecast was the geometric mean of the three strategy outputs, a method chosen because it naturally handles multiplicative probability aggregation and is less sensitive to outlier estimates than the arithmetic mean.
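The aggregation step is easy to reproduce. Below is a minimal sketch of the geometric-mean combination of the three strategy outputs; the example probabilities and the clipping bounds are illustrative assumptions, not part of the published protocol.

```python
import numpy as np

def aggregate_strategies(p_base_rate, p_adversarial, p_decomposition, eps=1e-4):
    """Combine three strategy forecasts with a geometric mean.

    Probabilities are clipped away from 0 and 1 (an assumption, not part of
    the study's protocol) so a single near-zero estimate cannot force the
    aggregate to zero.
    """
    probs = np.clip([p_base_rate, p_adversarial, p_decomposition], eps, 1 - eps)
    return float(np.exp(np.log(probs).mean()))

# Example: three strategy outputs for one question (illustrative values).
print(aggregate_strategies(0.30, 0.45, 0.38))  # ~0.37
```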
Human baselines:
- Metaculus community median (typically 100–500 forecasters per question)
- A panel of 15 superforecasters from the Good Judgment Project
- Individual non-expert human forecasters (recruited via Prolific)
Evaluation metric. Brier scores computed at question resolution.
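For reference, the Brier score is the mean squared error between the probability forecast and the binary outcome (lower is better). A minimal helper, with illustrative inputs:

```python
import numpy as np

def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and binary outcomes.

    forecasts: probabilities in [0, 1]
    outcomes:  resolutions (1 if the event occurred, else 0)
    """
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((forecasts - outcomes) ** 2))

# Illustrative: two questions forecast at 0.7 and 0.2, resolving YES and NO.
print(brier_score([0.7, 0.2], [1, 0]))  # 0.065
```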
Key Findings
Overall Performance
Table: Brier Score Comparison Across Question Categories
| Category | GPT-4 | Metaculus Community | Superforecasters | Non-experts |
|---|---|---|---|---|
| Politics/Elections | 0.189 | 0.158 | 0.142 | 0.241 |
| Science/Technology | 0.176 | 0.182 | 0.155 | 0.268 |
| Economics | 0.201 | 0.175 | 0.149 | 0.252 |
| Geopolitics | 0.215 | 0.192 | 0.161 | 0.275 |
| Overall | 0.195 | 0.177 | 0.152 | 0.259 |
Headline finding: GPT-4 outperformed non-expert humans substantially but fell short of both the Metaculus community and superforecasters on aggregate. However, the performance gap varied significantly by question category.
Where LLMs Excelled
GPT-4 performed comparably to or better than the Metaculus community on:
- Science and technology questions where the answer depended on interpreting published research, trend data, or technical feasibility. The LLM's ability to process large volumes of technical text gave it an advantage over forecasters who relied on intuition or incomplete reading.
- Questions with abundant base rate data. For questions like "Will there be a Category 5 hurricane in the Atlantic in 2024?", the LLM reliably identified the relevant base rate (historically ~35% of years) and used it as an anchor, achieving excellent calibration.
- Questions with short time horizons. For questions resolving within 1–3 months, GPT-4's Brier score (0.170) was statistically indistinguishable from the Metaculus community's (0.168).
Where LLMs Struggled
GPT-4 was significantly worse than skilled humans on:
- Geopolitical questions requiring situational judgment. Questions about specific diplomatic negotiations, military escalation, or leadership decisions required a kind of "political intuition" that the LLM lacked. For example, the LLM consistently underestimated the probability of surprising diplomatic breakthroughs and overestimated the probability of military escalation.
- Questions with rapidly changing information. The LLM's knowledge cutoff was a significant limitation. Even with retrieval augmentation, the model sometimes failed to identify the most decision-relevant recent developments.
- Tail risk questions. For low-probability events (base rate < 10%), GPT-4 showed a systematic overestimation bias. Events that the model assessed at 15–20% actually occurred at only 5–8% frequency. This overestimation of tail risks is consistent with the model's training on text that disproportionately discusses unusual events.
Calibration Analysis
The calibration curve for GPT-4 revealed a characteristic pattern:
| Predicted Probability Bin | GPT-4 Realized Frequency | Metaculus Realized Frequency | Superforecaster Realized Frequency |
|---|---|---|---|
| 0.00–0.10 | 0.07 | 0.05 | 0.04 |
| 0.10–0.20 | 0.18 | 0.14 | 0.13 |
| 0.20–0.40 | 0.33 | 0.30 | 0.29 |
| 0.40–0.60 | 0.52 | 0.50 | 0.50 |
| 0.60–0.80 | 0.68 | 0.71 | 0.72 |
| 0.80–0.90 | 0.82 | 0.86 | 0.87 |
| 0.90–1.00 | 0.88 | 0.93 | 0.95 |
GPT-4 was well-calibrated in the middle range (0.40–0.60) but overconfident at the extremes: events predicted at 90%+ happened only 88% of the time, and events predicted at 0–10% happened 7% of the time, so in both tails its forecasts were more extreme than the realized frequencies. Superforecasters, by contrast, showed near-perfect calibration across the entire range.
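A table like the one above can be reproduced by binning forecasts on predicted probability and comparing each bin's mean forecast with its realized frequency. A minimal sketch; the bin edges mirror the table, and the function name is illustrative:

```python
import numpy as np

def calibration_table(forecasts, outcomes,
                      edges=(0.0, 0.10, 0.20, 0.40, 0.60, 0.80, 0.90, 1.0)):
    """Per bin: (bin range, mean forecast, realized frequency, question count)."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so forecasts of exactly 1.0 are counted.
        upper = forecasts <= hi if hi == edges[-1] else forecasts < hi
        in_bin = (forecasts >= lo) & upper
        if in_bin.any():
            rows.append(((lo, hi), float(forecasts[in_bin].mean()),
                         float(outcomes[in_bin].mean()), int(in_bin.sum())))
    return rows

# Illustrative check on a handful of forecasts.
print(calibration_table([0.05, 0.15, 0.55, 0.95], [0, 0, 1, 1]))
```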
Prompt Sensitivity
A significant concern is the sensitivity of LLM forecasts to prompt formulation:
- Rewording the same question produced forecast variations of 5–15 percentage points.
- Including or excluding specific context paragraphs shifted forecasts by up to 20 percentage points.
- The order in which background information was presented affected the forecast (primacy/recency effects).
This prompt sensitivity is a fundamental challenge for using LLMs as prediction market participants: two differently prompted instances of the same model produce meaningfully different forecasts, raising questions about which forecast is "the model's true belief."
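One way to quantify this sensitivity for a given question is to elicit the same forecast under several rewordings and examine the spread. A small sketch, with placeholder numbers standing in for separate model calls:

```python
import numpy as np

# Hypothetical forecasts for one question under five prompt rewordings
# (placeholder values; in practice each would come from a separate model call).
variant_forecasts = np.array([0.42, 0.35, 0.48, 0.51, 0.39])

spread = variant_forecasts.max() - variant_forecasts.min()  # 0.16 -> 16 percentage points
median_forecast = np.median(variant_forecasts)              # 0.42
print(f"spread: {spread:.2f}, median across variants: {median_forecast:.2f}")
```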
The Hybrid Approach
The most accurate forecasting system was a hybrid combining LLM and human forecasts:
| Method | Brier Score |
|---|---|
| Superforecasters alone | 0.152 |
| GPT-4 alone | 0.195 |
| Metaculus community alone | 0.177 |
| Equal-weight hybrid (GPT-4 + superforecasters) | 0.141 |
| Optimally weighted hybrid | 0.136 |
The hybrid outperformed both components individually, confirming that LLMs and humans make different types of errors. The optimal weighting gave approximately 35% weight to the LLM forecast and 65% to the superforecaster consensus, though the optimal weights varied by question category.
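The hybrid is a simple convex combination of the two forecasts, with the weight chosen to minimize the Brier score on resolved questions. The sketch below illustrates such a weight search on synthetic data; the noise levels, sample size, and grid are illustrative assumptions, not the study's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic resolved questions: true probabilities, outcomes, two noisy forecast streams.
true_p = rng.uniform(0.05, 0.95, size=2000)
outcomes = rng.binomial(1, true_p)
superforecasters = np.clip(true_p + rng.normal(0, 0.08, true_p.size), 0.01, 0.99)
llm = np.clip(true_p + rng.normal(0, 0.12, true_p.size), 0.01, 0.99)

def brier(p, y):
    """Mean squared error between forecasts p and binary outcomes y."""
    return float(np.mean((p - y) ** 2))

# Grid-search the LLM weight w in the hybrid forecast w * llm + (1 - w) * superforecasters.
weights = np.linspace(0.0, 1.0, 101)
scores = [brier(w * llm + (1 - w) * superforecasters, outcomes) for w in weights]
best_w = weights[int(np.argmin(scores))]
print(f"best LLM weight: {best_w:.2f}, hybrid Brier: {min(scores):.3f}")
```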
Implications for Prediction Markets
LLMs as Liquidity Providers
One promising application is using LLMs to provide initial liquidity in new prediction markets. Currently, new markets often suffer from the "cold start problem" — no one wants to be the first trader, so prices remain at the uninformative initial level. An LLM could provide an informed initial estimate, attracting human traders who believe they have superior information.
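As a concrete sketch of seeding, assume an automated market maker based on the logarithmic market scoring rule (LMSR), which by default opens at an uninformative 0.5. The share quantities below are chosen so the opening quote equals the LLM's forecast; the liquidity parameter b and the 0.72 prior are illustrative assumptions.

```python
import math

def lmsr_price_yes(q_yes, q_no, b=100.0):
    """LMSR quoted probability of YES given outstanding share quantities."""
    return math.exp(q_yes / b) / (math.exp(q_yes / b) + math.exp(q_no / b))

def seed_from_forecast(p_llm, b=100.0):
    """Share quantities (q_yes, q_no) whose LMSR quote equals p_llm instead of 0.5."""
    q_yes = b * math.log(p_llm / (1.0 - p_llm))
    return q_yes, 0.0

q_yes, q_no = seed_from_forecast(0.72)   # illustrative LLM prior
print(lmsr_price_yes(q_yes, q_no))       # ~0.72
```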
LLMs as Calibration Benchmarks
LLM forecasts could serve as a baseline against which human forecasters are measured. If a human consistently outperforms the LLM, their excess accuracy is attributable to genuinely private information rather than publicly available reasoning.
Risks of LLM Participation
If LLMs participate directly in prediction markets (as AI agents), several risks emerge:
- Homogeneity. If many participants use the same underlying model, the "diversity" condition for wisdom-of-crowds breaks down. Correlated errors would not cancel in aggregation (a small simulation after this list illustrates the effect).
- Prompt manipulation. If market participants can influence the LLM's input (e.g., by publishing misleading articles that the LLM's retrieval system ingests), they can indirectly manipulate the LLM's trading behavior.
- Feedback loops. An LLM that reads current market prices as an input may amplify existing price movements rather than correcting them, creating positive feedback loops.
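The homogeneity risk can be made concrete with a small simulation: averaging many forecasters only cancels the component of error that is independent across them. In the sketch below, the shared-error fraction stands in for "everyone queries the same model"; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_questions, n_forecasters = 2000, 50
true_p = rng.uniform(0.1, 0.9, size=(n_questions, 1))
outcomes = rng.binomial(1, true_p[:, 0])

def crowd_brier(shared_frac):
    """Brier score of the crowd mean when shared_frac of error variance is common to all."""
    shared = rng.normal(0, 0.15, size=(n_questions, 1))                  # same-model error
    idiosyncratic = rng.normal(0, 0.15, size=(n_questions, n_forecasters))  # individual error
    forecasts = np.clip(true_p + np.sqrt(shared_frac) * shared
                        + np.sqrt(1 - shared_frac) * idiosyncratic, 0.01, 0.99)
    crowd_mean = forecasts.mean(axis=1)
    return float(np.mean((crowd_mean - outcomes) ** 2))

print("independent errors:", round(crowd_brier(0.0), 3))
print("90% shared errors :", round(crowd_brier(0.9), 3))
```

With independent errors the crowd mean washes out most of the individual noise; with mostly shared errors the aggregate inherits the common bias, and adding more forecasters does not help.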
Computational Exercise
The chapter's code directory includes an implementation (code/case-study-code.py) that simulates the LLM benchmarking experiment. The simulation:
- Generates a set of binary questions with known true probabilities
- Simulates LLM forecasts using three prompting strategies with configurable noise and bias
- Simulates human forecaster populations (superforecasters, crowd, non-experts)
- Computes Brier scores and calibration metrics for each method
- Tests hybrid aggregation with varying weights
Experiment with the simulation to determine: (a) under what conditions the LLM outperforms the crowd, (b) the optimal LLM-to-human weight as a function of question difficulty, and (c) the impact of LLM prompt diversity on aggregate accuracy.
Discussion Questions
- If LLM forecasts become highly accurate, will prediction markets become obsolete? Or do markets serve purposes beyond pure accuracy?
- How should a prediction market platform handle the possibility that a significant fraction of its "human" traders are actually using LLM-generated forecasts?
- The LLM's tail-risk overestimation bias could be useful in some contexts (early warning) and harmful in others (false alarms). How should this bias be managed?
- If an LLM reads the current market price before generating its forecast, is this analogous to an "informed trader" or an "uninformed momentum trader"? Does the answer depend on the market's efficiency?
- Should LLM-generated forecasts be labeled as such on prediction market platforms, or does anonymity matter for the wisdom-of-crowds effect?