Case Study 2: COVID-19 Pandemic Prediction Markets — Real-Time Tracking Performance

Overview

The COVID-19 pandemic, beginning in early 2020, provided the most significant stress test for prediction markets in the domain of public health. For the first time, multiple prediction platforms simultaneously tracked the same evolving crisis in real time, generating a rich dataset for evaluating forecast quality. This case study examines how prediction markets performed during the pandemic — their strengths, their failures, and what the experience reveals about the role of markets in public health forecasting.

We focus on three platforms: Metaculus (a community forecasting platform), Good Judgment Open (an outgrowth of the IARPA-funded Good Judgment Project), and Polymarket (a blockchain-based prediction market that launched pandemic-related contracts in 2020). We also draw on data from smaller markets and forecasting aggregators that operated during the crisis.

Background

The Forecasting Challenge

COVID-19 presented a uniquely difficult forecasting environment:

  1. Novel pathogen. SARS-CoV-2 had no historical precedent in living memory. Base rates from seasonal influenza or even the 2009 H1N1 pandemic were only loosely applicable. Forecasters had to reason about a genuinely new phenomenon.

  2. Rapidly evolving information. The scientific understanding of COVID-19 changed week by week. Estimates of the infection fatality rate (IFR), the role of asymptomatic transmission, the effectiveness of masks, and the timeline for vaccine development shifted dramatically between January and December 2020.

  3. Entanglement with policy. Case counts and death tolls depended critically on government interventions (lockdowns, mask mandates, travel bans) that were themselves uncertain. Forecasting the pandemic required forecasting human behavior and policy responses simultaneously.

  4. Politicization. Public health data became a political battleground. Reported case counts varied in accuracy across jurisdictions. Testing capacity and reporting protocols changed frequently, complicating the interpretation of raw statistics.

  5. Emotional intensity. Forecasters were personally affected by the events they were predicting. The psychological burden of assigning probabilities to mass-casualty scenarios was nontrivial.

Market Structure

Metaculus operated as a continuous prediction platform using its proprietary scoring system. Questions ranged from "When will the first COVID-19 vaccine receive emergency use authorization?" to "How many cumulative COVID-19 deaths will the US report by December 31, 2020?" Metaculus attracted epidemiologists, data scientists, and experienced forecasters. By mid-2020, over 300 pandemic-related questions were active.

Good Judgment Open used a survey-based aggregation model with performance-weighted averaging. Forecasters submitted updated probabilities at their discretion. The platform attracted many alumni of the original Good Judgment Project, including some "superforecasters" — individuals who had demonstrated exceptional calibration in prior IARPA tournaments.

Polymarket launched COVID-19-related contracts using real-money binary options on the Polygon blockchain. Contracts included questions about US case milestones, vaccine approval dates, and lockdown policies. Liquidity was initially thin but grew as the pandemic drew public attention to prediction markets.

Performance Analysis

Early Warning (January–March 2020)

Prediction markets provided mixed early warning signals:

What markets got right:

  - By late January 2020, Metaculus community forecasts assigned a 60–70% probability that COVID-19 would cause more than 100 deaths outside China. The median expert estimate at the time was considerably lower.
  - Good Judgment superforecasters, drawing on analogies to SARS and MERS but noting critical differences (higher transmissibility, presymptomatic spread), updated their forecasts more rapidly than institutional analysts.
  - Market prices on pandemic severity rose steadily through February, preceding the stock market crash of late February by approximately 10 days.

What markets got wrong:

  - Almost all forecasters, market participants included, underestimated the eventual scale of the pandemic. In February 2020, the median Metaculus forecast for US deaths by year-end was approximately 15,000–30,000. The actual figure exceeded 340,000.
  - Markets were slow to price in the possibility of nationwide lockdowns. Even in early March, when Italy had already locked down, US-focused markets priced nationwide school closures at only 25–30%.
  - The anchoring effect was severe: initial estimates were anchored to prior coronavirus outbreaks (SARS: ~770 deaths; MERS: ~860 deaths), and forecasters updated insufficiently.

Vaccine Timeline (April–December 2020)

The vaccine forecasting record provides one of the most instructive examples of prediction market performance:

Metaculus median forecasts for first EUA approval date:

Date of Forecast   Median Prediction   Actual Date
April 2020         June 2021           December 11, 2020
June 2020          March 2021          December 11, 2020
August 2020        January 2021        December 11, 2020
October 2020       December 2020       December 11, 2020

This trajectory reveals a systematic pessimism bias. Forecasters anchored heavily on historical vaccine development timelines (typically 5–15 years) and were slow to update on the unprecedented speed of mRNA vaccine development. The community forecast converged on the actual date only in October 2020, after interim Phase 3 trial results began to leak.

However, prediction markets still outperformed most expert commentary. In April 2020, many prominent epidemiologists publicly stated that a vaccine within 18 months was "optimistic" or "unlikely." The prediction market median of June 2021 was actually more optimistic than the modal expert view, which clustered around late 2021 or 2022.

Death Toll Forecasting (Ongoing)

For cumulative US death toll forecasts, we can compare prediction market accuracy to two baselines: the CDC's ensemble model and the Institute for Health Metrics and Evaluation (IHME) model.

Brier scores for weekly US death toll direction (will deaths increase or decrease this week?):

Source                Brier Score   Notes
Metaculus community   0.19          Median of active forecasters
Good Judgment Open    0.17          Performance-weighted aggregate
CDC Ensemble          0.15          Multi-model statistical ensemble
IHME                  0.22          Single model, revised frequently
Naive persistence     0.26          "Same as last week"

The CDC ensemble — which aggregated dozens of computational epidemiological models — outperformed human forecasters, but the gap was modest. Good Judgment Open's superforecasters came remarkably close. The IHME model, despite its computational sophistication, was less accurate, partly because its early versions made aggressive assumptions about social distancing compliance.
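
All of the scores above are for binary directional forecasts, so they can be computed in a few lines. The sketch below shows the calculation, including the naive persistence baseline; the forecast and resolution values are illustrative, not the actual platform data.

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((probs - outcomes) ** 2))

# Illustrative weekly forecasts of "will US deaths increase this week?"
forecasts   = [0.70, 0.60, 0.80, 0.40, 0.30]
resolutions = [1,    1,    1,    0,    1]
print(f"Forecaster Brier score: {brier_score(forecasts, resolutions):.3f}")

# Naive persistence baseline: predict last week's direction with certainty,
# so its Brier score is the fraction of weeks in which the direction reversed.
directions = [1, 1, 1, 1, 0, 0, 1]  # hypothetical weekly up/down history
print(f"Persistence Brier score: "
      f"{brier_score(directions[:-1], directions[1:]):.3f}")
```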

Case Milestone Markets (Polymarket)

Polymarket's COVID-19 contracts provide data on how real-money markets performed:

"Will the US exceed 30 million confirmed COVID-19 cases by April 1, 2021?" - Contract launched: November 2020, initial price $0.35 - Price trajectory: rose to $0.55 by December, $0.78 by February 2021 - Resolution: YES (30 million exceeded on March 23, 2021) - Market efficiency: The contract was underpriced relative to epidemiological models that were publicly available. Thin liquidity (average daily volume under $5,000) likely contributed to price inefficiency.

"Will a COVID-19 vaccine receive EUA by December 31, 2020?" - Contract launched: September 2020, initial price $0.62 - Price trajectory: rose to $0.72 by October, $0.88 after Pfizer interim results (November 9), $0.97 by late November - Resolution: YES (Pfizer EUA on December 11, 2020) - Market efficiency: This contract tracked news closely and was reasonably efficient after the Phase 3 results. The September launch price of $0.62 was arguably too low given publicly available information about the speed of mRNA trials.

Quantitative Analysis

Calibration Assessment

We can assess the calibration of Metaculus community forecasts across 127 resolved COVID-19 binary questions:

Predicted Probability Bin   Number of Questions   Actual Resolution Rate
0.00–0.20                   22                    0.14
0.20–0.40                   31                    0.29
0.40–0.60                   18                    0.50
0.60–0.80                   29                    0.62
0.80–1.00                   27                    0.85

The overall calibration was reasonable, with a slight overconfidence pattern in the 0.60–0.80 bin (predicted 70% average, realized 62%). The 0.80–1.00 bin was well-calibrated. This pattern — slight overconfidence in the middle-to-high range — is consistent with the general tendency of forecasters to insufficiently distinguish between "probable" and "highly probable" events.
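
A table like the one above can be reproduced by binning forecasts and comparing the mean predicted probability in each bin to the realized resolution rate. A minimal sketch, run here on synthetic, perfectly calibrated data rather than the actual 127 questions:

```python
import numpy as np

def calibration_table(probs, outcomes, edges=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Compare mean predicted probability to realized rate within each bin."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        # the top bin is closed on the right so p = 1.0 is included
        mask = (probs >= lo) & ((probs < hi) | (hi == 1.0))
        if mask.any():
            print(f"{lo:.2f}-{hi:.2f}  n={mask.sum():3d}  "
                  f"predicted={probs[mask].mean():.2f}  "
                  f"realized={outcomes[mask].mean():.2f}")

# Synthetic forecasts whose outcomes match the predicted probabilities
rng = np.random.default_rng(0)
p = rng.uniform(size=500)
y = (rng.uniform(size=500) < p).astype(float)
calibration_table(p, y)
```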

Information Aggregation Speed

One of the key advantages of prediction markets is rapid information aggregation. We can measure this by comparing the time at which market prices reflected key developments versus the time at which official guidance changed:

Development                        Market Recognized   Official Recognition               Lead Time
Airborne transmission risk         March 2020          WHO acknowledged: July 2020        ~4 months
IFR lower than initial estimates   April 2020          CDC revised estimates: June 2020   ~2 months
Vaccines likely by early 2021      August 2020         Fauci: October 2020                ~2 months
Omicron milder severity            December 2021       WHO assessment: January 2022       ~3 weeks

In each case, prediction market prices incorporated emerging evidence faster than official institutional assessments, consistent with the information aggregation hypothesis.

Forecast Error Decomposition

Decomposing forecast errors into bias and variance components reveals different failure modes:

Bias (systematic error): Markets were systematically optimistic about containment in early 2020 and systematically pessimistic about vaccine timelines. Both biases reflected anchoring on prior experience that was not applicable to the novel situation.

Variance (noise): Individual forecasters showed high variance, but aggregation reduced this substantially. The standard deviation of individual Metaculus forecasts was 2–3 times the standard deviation of the community median, confirming the "wisdom of crowds" effect.

Calibration error: Approximately 15% of the total Brier score was attributable to calibration error (systematic over- or under-confidence), with the remainder accounted for by the resolution and uncertainty components of the standard Murphy decomposition.
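
The decomposition referenced here is the Murphy decomposition of the Brier score, BS = reliability - resolution + uncertainty, where the reliability term is the calibration error. A minimal sketch on synthetic data:

```python
import numpy as np

def murphy_decomposition(probs, outcomes, n_bins=10):
    """Brier score = reliability - resolution + uncertainty (Murphy, 1973)."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    base_rate = outcomes.mean()
    uncertainty = base_rate * (1.0 - base_rate)
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    reliability = resolution = 0.0
    for k in range(n_bins):
        mask = bins == k
        if mask.any():
            weight = mask.mean()
            p_bar, o_bar = probs[mask].mean(), outcomes[mask].mean()
            reliability += weight * (p_bar - o_bar) ** 2   # calibration error
            resolution  += weight * (o_bar - base_rate) ** 2
    return reliability, resolution, uncertainty

# The three terms approximately recover the Brier score (exactly so
# when every forecast equals its bin mean).
rng = np.random.default_rng(42)
p = rng.uniform(size=2000)
y = (rng.uniform(size=2000) < p).astype(float)
rel, res, unc = murphy_decomposition(p, y)
print(f"reliability={rel:.4f}  resolution={res:.4f}  uncertainty={unc:.4f}")
print(f"reconstructed Brier: {rel - res + unc:.4f}  "
      f"direct Brier: {np.mean((p - y) ** 2):.4f}")
```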

Lessons Learned

Lesson 1: Markets Outperform on Speed, Not Necessarily on Accuracy

The most consistent advantage of prediction markets was speed of information incorporation, not necessarily accuracy of final forecasts. Markets reflected emerging scientific consensus roughly three weeks to four months before official institutions (see the lead-time table above). For a policymaker, this lead time is immensely valuable even if the forecast itself is imperfect.

Lesson 2: Anchoring on Precedent Is the Dominant Failure Mode

The most significant errors — underestimating the pandemic's scale, overestimating vaccine timelines — both resulted from anchoring on historical analogues that turned out to be inapplicable. This suggests that pandemic prediction markets should explicitly incorporate "reference class forecasting" with multiple reference classes and should weight unusual precedents (like the speed of mRNA development) more heavily.

Lesson 3: Thin Liquidity Limits Real-Money Market Quality

Polymarket's COVID-19 contracts, while pioneering, suffered from low liquidity that produced visible price inefficiencies. The play-money and survey-based platforms (Metaculus, Good Judgment Open) actually produced more accurate forecasts on many questions, suggesting that for public health applications real money is not necessary, and may even be counterproductive to the extent that it limits who can participate.

Lesson 4: Emotional Involvement Degrades Calibration

Forecasters who were personally affected by the pandemic (e.g., living in hard-hit areas, working in healthcare) showed systematic biases relative to more detached forecasters. This has design implications: pandemic prediction markets should actively recruit forecasters from geographically and professionally diverse backgrounds.

Lesson 5: Continuous Questions Are More Informative Than Binary

The most useful pandemic forecasts were continuous (e.g., "How many deaths by date X?") rather than binary (e.g., "Will deaths exceed N by date X?"). Binary questions lose information through discretization. Platforms that supported continuous probability distributions (Metaculus) generated more actionable forecasts than those limited to binary contracts (Polymarket).
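
A toy example (all numbers hypothetical) makes the discretization point concrete: two forecast distributions with very different policy implications can give an identical answer to the same binary question.

```python
import numpy as np

rng = np.random.default_rng(1)
threshold = 300_000  # hypothetical cutoff for "will deaths exceed N?"

# A confident forecaster and a highly uncertain one...
narrow = rng.normal(310_000, 20_000, size=200_000)
wide   = rng.normal(360_000, 120_000, size=200_000)

# ...assign nearly the same probability to the binary event (~0.69),
# yet imply very different plans for hospital or vaccine capacity.
print(f"P(narrow > N) = {(narrow > threshold).mean():.2f}")
print(f"P(wide   > N) = {(wide   > threshold).mean():.2f}")
print(f"90th percentiles: {np.percentile(narrow, 90):,.0f} "
      f"vs {np.percentile(wide, 90):,.0f}")
```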

Lesson 6: Integration with Epidemiological Models Is the Frontier

The CDC ensemble model's slight accuracy advantage over human forecasters suggests that the optimal approach combines computational models with human judgment. Future pandemic preparedness systems should include prediction markets as one input into a multi-model ensemble, not as a standalone forecasting tool.
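
One simple way to realize such a combination is to weight each source inversely to its historical Brier score. This heuristic is purely illustrative (the CDC ensemble used its own weighting scheme); the scores come from the death-toll table earlier in this case study, and the current-week probabilities are hypothetical.

```python
import numpy as np

def inverse_brier_weights(brier_scores):
    """Weight each forecast source inversely to its historical Brier score."""
    w = 1.0 / np.asarray(brier_scores, float)
    return w / w.sum()

# Historical Brier scores from the death-toll comparison above
sources = {"Metaculus": 0.19, "Good Judgment Open": 0.17, "CDC Ensemble": 0.15}
weights = inverse_brier_weights(list(sources.values()))

# Hypothetical current-week probabilities that deaths will increase
current = np.array([0.60, 0.55, 0.70])
for name, w in zip(sources, weights):
    print(f"{name}: weight {w:.2f}")
print(f"Combined forecast: {current @ weights:.3f}")
```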

Computational Exercise

The chapter's code directory includes a simulation (code/case-study-code.py) that replicates the key dynamics of pandemic forecasting markets. The simulation includes:

  1. A simplified SIR epidemic model generating the "true" trajectory
  2. Noisy observations distributed across multiple forecasters with varying expertise
  3. A prediction market that aggregates forecaster beliefs using LMSR
  4. Comparison against naive baselines and a simple statistical model
  5. Analysis of how forecaster diversity affects accuracy

Experiment with the simulation parameters to identify: (a) how quickly the market converges to the true trajectory after a parameter shock (e.g., a new variant), and (b) the optimal mix of specialists (epidemiologists) and generalists for pandemic forecasting.
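
For readers who want to understand the aggregation step before opening the simulation, the sketch below implements a minimal LMSR market maker for a binary question. It is a simplified stand-in for the market in code/case-study-code.py, not a reproduction of it; the liquidity parameter b controls how much a trade moves the price.

```python
import numpy as np

class LMSRMarket:
    """Minimal Logarithmic Market Scoring Rule maker for a binary question."""

    def __init__(self, liquidity=50.0):
        self.b = liquidity    # larger b: deeper market, slower price moves
        self.q = np.zeros(2)  # outstanding shares for [NO, YES]

    def _cost(self, q):
        # LMSR cost function: C(q) = b * log(sum_i exp(q_i / b))
        return self.b * np.log(np.exp(q / self.b).sum())

    def price(self):
        """Current implied probability of YES."""
        e = np.exp(self.q / self.b)
        return float(e[1] / e.sum())

    def buy(self, outcome, shares):
        """Buy shares of an outcome (0 = NO, 1 = YES); returns the cost."""
        new_q = self.q.copy()
        new_q[outcome] += shares
        cost = self._cost(new_q) - self._cost(self.q)
        self.q = new_q
        return cost

market = LMSRMarket(liquidity=50.0)
print(f"Initial P(YES) = {market.price():.2f}")   # 0.50
paid = market.buy(1, 40)                           # a trader buys YES shares
print(f"Paid ${paid:.2f}; P(YES) now {market.price():.2f}")
```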

Discussion Questions

  1. Should governments operate official pandemic prediction markets? What are the risks if market prices contradict official public health messaging?

  2. How should a prediction market handle questions where the resolution criteria depend on measurement infrastructure that is itself unreliable (e.g., reported case counts in countries with limited testing)?

  3. The prediction markets were systematically wrong about vaccine timelines. Could this error have been reduced by including pharmaceutical industry insiders as participants? What insider trading concerns would this raise?

  4. If prediction markets had existed during the 1918 influenza pandemic, would they have performed better or worse than during COVID-19? Consider the information environment of 1918 versus 2020.

  5. How should pandemic prediction market forecasts be communicated to the public without causing panic or complacency?