Case Study 28-2: Superforecasting and Political Uncertainty — What Tetlock's Research Teaches About Expert Prediction
Overview
In 2011, Philip Tetlock and his team launched the Good Judgment Project (GJP) as part of a United States intelligence community initiative (the IARPA Aggregative Contingent Estimation program) to improve geopolitical forecasting. The project recruited ordinary volunteers — people without special access to classified intelligence, without advanced training in geopolitics — and trained them using insights from cognitive psychology and decision science to make better probabilistic predictions about real-world events.
The results were striking. The Good Judgment Project's superforecasters outperformed CIA analysts with access to classified information by an average margin of approximately 30%. They outperformed prediction markets. They outperformed a control group of ordinary forecasters using identical question sets. And they dramatically outperformed the "expert pundits" who appear on television and are regularly consulted by media organizations about political events.
This case study examines the GJP's methodology, its findings about expert prediction, the characteristics of superforecasters, and the implications for how we evaluate expert claims in the broader information environment.
Section 1: The Expert Prediction Problem
Tetlock's Earlier Research
Long before the Good Judgment Project, Tetlock spent two decades testing the accuracy of expert political judgment. His 2005 book Expert Political Judgment (Princeton University Press) presented findings from a 20-year longitudinal study in which nearly 300 political experts (academics, intelligence analysts, and journalists) made thousands of specific, verifiable predictions.
The core finding was sobering: expert predictions were barely better than chance overall. For long-range forecasts (beyond five years), experts performed no better than chance, and only marginally better for short-range forecasts. Crucially, experts' confidence in their predictions was uncorrelated with their accuracy — the most confident experts were not the most accurate.
Tetlock identified a systematic difference between two types of expert:
- Hedgehogs — experts who explained everything through one big idea or theoretical framework, and who gave confident, definitive predictions. These experts were frequently sought by media precisely because their certainty made for compelling content. They were systematically less accurate.
- Foxes — experts who drew on many frameworks, expressed more uncertainty, updated more readily, and were less comfortable with grand narratives. These experts were less mediagenic but more accurate.
The hedgehog/fox distinction predicted accuracy better than domain expertise, years of experience, or credentials.
Why This Matters for Media Literacy
The media incentive structure strongly selects for hedgehogs. A pundit who says "I believe, with about 60-65% confidence, that the election outcome depends on a complex interaction of economic indicators, turnout models, and late-breaking news in several swing states, and I would not be surprised by either result" will rarely be booked for the second segment. A pundit who says "This election is a referendum on the incumbent, and the incumbent's numbers guarantee a loss" is more likely to be invited back — even though the latter has worse predictive accuracy on average.
Media consumers who understand this selection pressure can discount pundit confidence appropriately: confident, simple predictions are media-optimized, not accuracy-optimized.
Section 2: The Good Judgment Project Methodology
The Tournament Format
The IARPA forecasting tournaments ran from 2011 to 2015. Participants (roughly 3,000 forecasters across multiple teams) were presented with specific, resolvable questions about geopolitical events. Examples of question types:
- "Will Greece exit the Eurozone by December 31, 2012?" (binary, resolved: No)
- "Will North Korea conduct a nuclear weapons test before April 1, 2013?" (binary, resolved: Yes)
- "What will the unemployment rate in the United States be on December 31, 2013?" (continuous range)
- "Will the Syrian government lose control of Damascus before the end of 2013?" (binary, resolved: No)
Each question had a defined resolution date and criteria, ensuring that forecasts could be objectively evaluated. Forecasters provided probability estimates (not just directional guesses) and could update their forecasts as new information emerged.
Scoring
Forecasters were scored using Brier scores (see Chapter 28.7): a squared-error measure that rewards calibrated probability estimates. A forecaster who said "95% probability" for an event that did not occur receives a far worse Brier score than one who said "80%" — confident misses are penalized heavily. And a forecaster who said "50% probability" (essentially no information) for every event would still outperform anyone who confidently predicted the wrong outcome.
This scoring system is crucial: it makes overconfidence costly. Unlike the public punditry environment (where confidence is rewarded whether predictions come true or not), the tournament created genuine incentives for calibrated uncertainty expression.
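The asymmetry described above can be made concrete with a minimal sketch of the binary (squared-error) Brier score; the specific probabilities below are illustrative, not taken from the tournament data:

```python
def brier_score(forecast, outcome):
    # Squared-error Brier score for a single binary event.
    # forecast: stated probability that the event occurs (0.0-1.0)
    # outcome:  1 if the event occurred, 0 if it did not
    # 0.0 is perfect; 0.25 is a constant "50%" forecast; 1.0 is maximally wrong.
    return (forecast - outcome) ** 2

# The event did NOT occur (outcome = 0): the overconfident forecaster
# is punished far more than the moderately wrong one.
print(brier_score(0.95, 0))  # ≈ 0.9025
print(brier_score(0.80, 0))  # ≈ 0.64
# The uninformative 50% forecast still beats any confidently wrong one.
print(brier_score(0.50, 0))  # ≈ 0.25
```

Over many questions, the average Brier score drives confident-but-wrong forecasters to the bottom of the leaderboard, which is exactly the incentive the tournament format creates and public punditry lacks.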
The Discovery of Superforecasters
Within the forecasting population, Tetlock's team noticed a subset of forecasters who consistently performed in the top 2% year after year. These "superforecasters" were not a fixed group — new members joined the elite tier each year, and performance was not perfectly persistent — but the upper tail of the distribution showed notable stability.
Superforecasters outperformed:
- The median GJP forecaster (by a large margin)
- Prediction markets (by approximately 15-30%)
- Intelligence analysts with access to classified information (by approximately 30%)
- Formal mathematical models in many domains
The finding that non-expert volunteers, trained in probabilistic thinking and working without classified access, could outperform professional intelligence analysts was sufficiently striking that the intelligence community took notice.
Section 3: Characteristics of Superforecasters
Tetlock's team conducted extensive analysis of what distinguished superforecasters from ordinary forecasters. The characteristics clustered into several categories.
Epistemic Characteristics
Probabilistic thinking: Superforecasters naturally think in probabilities rather than binary categories. When asked what they thought about a geopolitical outcome, they would not say "X will happen" but rather "I think there's about a 65% chance of X." This is not hedging — it is a genuine probability assessment that they would then defend and refine.
Calibration consciousness: Superforecasters care deeply about whether their stated probabilities match their actual accuracy rates. They maintain mental track records. They know from experience how often they have been right when they said "70%" — and they use this knowledge to calibrate future estimates.
Granularity: Superforecasters make fine-grained probability distinctions that non-experts round away. Where an ordinary forecaster might say "50/50" or "70/30," a superforecaster might say "57%" or "73%." This is not false precision — it reflects a genuine analytical distinction, and the research shows that the fine-grained estimates have better predictive validity than the rounded ones.
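The "mental track record" behind calibration consciousness can be approximated with a simple table: bucket your past forecasts by stated probability and compare each bucket's stated level with its observed hit rate. This is a minimal sketch, not the GJP's analysis pipeline; the bucketing choice (nearest 10%) is an assumption:

```python
def calibration_table(records):
    # records: list of (stated_probability, outcome) pairs, outcome in {0, 1}.
    # Buckets forecasts by probability rounded to the nearest 10% and
    # reports (observed hit rate, count) for each bucket.
    buckets = {}
    for p, outcome in records:
        buckets.setdefault(round(p, 1), []).append(outcome)
    return {
        level: (sum(hits) / len(hits), len(hits))
        for level, hits in sorted(buckets.items())
    }

history = [(0.7, 1), (0.72, 1), (0.68, 0), (0.7, 1), (0.3, 0), (0.3, 1)]
print(calibration_table(history))
```

A well-calibrated forecaster's observed hit rate in each bucket tracks the stated level: events called at "70%" should occur roughly 70% of the time. Large gaps in a bucket are the signal to adjust future estimates up or down.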
Cognitive Style Characteristics
Active open-mindedness: Superforecasters actively seek out information that might disconfirm their current hypothesis. This is the deliberate override of confirmation bias — not just avoiding the selection of confirming information, but specifically hunting for disconfirming information.
Intellectual humility: Superforecasters acknowledge what they don't know. When they encounter a question where they have little relevant background, they say so explicitly and weight their estimate toward the base rate rather than inventing false precision.
Comfort with uncertainty: Superforecasters are genuinely comfortable holding probability estimates between 30% and 70% for extended periods. Non-experts are often uncomfortable with uncertainty and pressure themselves to commit to a more definitive view. Superforecasters resist this pressure.
Behavioral Characteristics
Frequent updating: Superforecasters update their forecasts readily when new information arrives. This is the forecasting analog of Bayesian belief updating: new evidence moves the probability by an appropriate amount, neither ignored nor overweighted. Less skilled forecasters tend either to update too little (anchoring) or too much (overreacting to new information).
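The Bayesian updating analogy can be made explicit. The sketch below applies Bayes' rule to one piece of evidence; the scenario and all numbers are invented for illustration:

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    # Posterior probability of hypothesis H after observing evidence E.
    # prior:              P(H)
    # likelihood_if_true: P(E | H)
    # likelihood_if_false: P(E | not H)
    numer = likelihood_if_true * prior
    return numer / (numer + likelihood_if_false * (1 - prior))

# Hypothetical example: prior of 0.30 that a government falls this year;
# a cabinet resignation is three times as likely if it will fall (0.60)
# as if it won't (0.20). The evidence moves 30% up to about 56% --
# a substantial but not total revision.
posterior = bayes_update(0.30, 0.60, 0.20)
```

Anchoring corresponds to leaving the estimate near 30% despite the evidence; overreaction corresponds to jumping to near-certainty. The likelihood ratio determines how far the probability should actually move.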
Deliberate decomposition: When faced with a complex question, superforecasters break it into sub-questions that can be independently estimated and then combined. "Will Country X default on its debt?" becomes: "What is the current debt-to-GDP ratio? What are the terms of the outstanding obligations? How have similar countries with similar ratios fared historically? What political dynamics affect the likelihood of austerity measures?"
Outside view before inside view: Superforecasters begin with reference class forecasting — "what is the base rate for events like this?" — before adding inside-view information about the specific situation. This prevents anchoring on inside-view details that seem important but have low predictive validity.
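One simple way to operationalize "outside view before inside view" is to anchor on the reference-class base rate and then shift part of the way toward the case-specific estimate. This weighted blend is a common heuristic sketch, not a method the GJP prescribed, and the weight and numbers are illustrative:

```python
def blended_estimate(base_rate, inside_view, weight_on_base=0.5):
    # Start from the reference-class base rate, then move partway toward
    # the case-specific (inside-view) estimate. The weight is a judgment
    # call; keeping substantial weight on the base rate guards against
    # anchoring on vivid but low-validity case details.
    return weight_on_base * base_rate + (1 - weight_on_base) * inside_view

# Hypothetical: sovereign defaults in this reference class run ~15%,
# but case-specific analysis suggests 60%. With 70% weight on the base
# rate, the blended estimate lands near 29% rather than 60%.
estimate = blended_estimate(0.15, 0.60, weight_on_base=0.7)
```

The point is not the exact weight but the ordering: the base rate sets the anchor, and inside-view detail earns only a partial adjustment away from it.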
Section 4: Team Forecasting and the Wisdom of Crowds
One of the GJP's most important findings was that teams of superforecasters outperformed individual superforecasters. The mechanism is straightforward: different people's errors are partially independent, and when you aggregate many probability estimates, the independent errors partially cancel out, leaving the "signal" more prominent.
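The error-cancellation mechanism can be sketched as simple probability pooling. The GJP's aggregation algorithms additionally "extremized" the pooled probability (pushed it away from 0.5) to compensate for information the forecasters share but do not fully express; the transform below is a common form of that idea, with an illustrative exponent:

```python
from statistics import mean

def aggregate(forecasts, extremize=1.0):
    # Pool independent probability forecasts by averaging, then optionally
    # extremize. extremize=1.0 is the plain mean; values > 1 push the
    # pooled probability away from 0.5.
    p = mean(forecasts)
    num = p ** extremize
    return num / (num + (1 - p) ** extremize)

team = [0.6, 0.7, 0.75, 0.65]
pooled = aggregate(team)                  # plain mean: 0.675
sharpened = aggregate(team, extremize=2.5)  # pushed above the mean
```

Averaging cancels the independent component of each member's error; extremizing addresses the correlated component. Both steps assume the inputs were formed independently, which is why the team protocols below insist on independent initial forecasts.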
Effective Team Dynamics
Not all team aggregation works equally well. The GJP found that teams with poor dynamics could actually perform worse than their best individual members, because social dynamics pressure members toward premature consensus. Effective forecasting teams exhibit:
- Explicit disagreement norms: Members are explicitly expected to challenge each other's estimates with specific reasoning
- Genuine consideration of minority views: Dissenting positions are explored before being discarded, not immediately overruled
- Separation of forecast rounds: Initial forecasts are made independently before discussion begins, preventing anchoring on the first speaker's estimate
- Explicit probability expression: Discussion centers on "why is your estimate 65% rather than 75%?" not "do you think this will happen?"
Policy Implications: The Aggregation Lesson
The aggregation finding has important implications beyond professional forecasting. For news consumers:
- The consensus of multiple independent expert assessments is a stronger signal than any single authoritative source, however confident
- When multiple credentialed experts in a field reach similar probability assessments through independent analysis, this is stronger evidence than a single expert's confident claim
- Consensus formation processes matter: a consensus achieved through genuine independent analysis is more reliable than one achieved through social pressure within a tight-knit professional community
Section 5: The Failure of Punditry and Its Media Logic
The Good Judgment Project's implicit comparison with media punditry is striking. The pundits who appear on television and in major newspapers as authoritative commentators on political events systematically perform below the superforecasters — often below the level of even ordinary probability-aware thinking.
Why Punditry Is Epistemically Inefficient
Pundits are not selected for accuracy. They are selected for:
- Clarity and decisiveness of communication (favoring binary, confident predictions)
- Consistency with a recognizable intellectual brand (favoring hedgehog-style single-framework explanation)
- Entertainment and engagement value (favoring dramatic narratives over calibrated uncertainty)
- Status and credentialing (favoring famous names and institutional affiliations over track record)
None of these selection criteria are correlated with accuracy. The result is a system that consistently elevates epistemically inefficient voices — confident, brand-consistent pundits — at the expense of the more accurate but less mediagenic probabilistic reasoners.
The Accountability Gap
Punditry has no accountability mechanism comparable to the Brier score. When a pundit confidently predicts an election outcome that doesn't materialize, the standard response is:
- Reinterpretation ("the outcome was essentially what I predicted")
- Contextual excuse ("unforeseen circumstances intervened")
- Moving the goalposts ("my prediction was about the longer-term dynamic")
- Simply moving on without acknowledgment
This accountability gap allows overconfident pundits to maintain their media presence despite poor track records. Tetlock proposed "probabilistic forecasting accountability" as a reform: requiring public commentators to make specific, probability-quantified predictions that can be evaluated after resolution.
Section 6: Lessons for Media Literacy
Evaluating Expert Claims
The superforecasting research gives us specific, empirically grounded heuristics for evaluating expert claims in the information environment:
Prefer hedgers over confident predictors (in expectation). The research consistently shows that calibrated uncertainty expression predicts better accuracy than confident assertion. A source that expresses appropriate uncertainty about complex predictions is more likely to be tracking the evidence carefully.
Ask for the probability estimate. When an expert says X will happen, ask: "What probability do you assign to this? At what probability would you say X is more likely than not-X?" Resistance to probability quantification is often a sign of unearned confidence.
Check for track records. Have this expert's previous predictions been verified? Most pundits do not maintain public track records. Those who do (on forecasting platforms like Metaculus, PredictionBook, and others) provide the foundation for evidence-based credibility assessment.
Prefer the outside view. When an expert gives a highly specific, case-based explanation for why a prediction is certain, ask about the base rate: how often do situations like this turn out the way the expert predicts? Ignoring the base rate is a signature of inside-view overconfidence.
Weight diverse expert opinions before consensus forms. Multiple independent experts reaching similar conclusions is much stronger evidence than multiple experts who have read each other's work and been influenced by social consensus dynamics.
Forecasting and Misinformation
The superforecasting literature connects to misinformation in a specific way: confident false predictions are a form of misinformation. When a pundit confidently predicts a market crash, a political outcome, or a health event that doesn't materialize, this is epistemically harmful even if no deliberate deception was intended.
The harm is compounded by the media ecology: confident false predictions receive wide distribution; quiet acknowledgment of error receives far less. The Bayesian update for "this source confidently predicted X and X did not happen" is very seldom delivered to the audiences who received the original confident prediction.
This suggests that developing habits of forecasting accountability — personally tracking the predictions of sources you follow, noting when confident predictions fail, and appropriately updating your trust in those sources — is a legitimate media literacy practice.
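The prediction-journal habit described above can be as lightweight as a few lines of bookkeeping. This is a minimal personal sketch (the class name and structure are invented for illustration), tying the practice back to the Brier scoring introduced in Section 2:

```python
class PredictionJournal:
    # Minimal personal prediction journal: record a probability for a
    # named question, resolve it when the outcome is known, and report
    # your mean Brier score across resolved questions.
    def __init__(self):
        self.entries = {}  # question -> [stated probability, outcome or None]

    def predict(self, question, probability):
        self.entries[question] = [probability, None]

    def resolve(self, question, occurred):
        self.entries[question][1] = 1 if occurred else 0

    def mean_brier(self):
        scored = [(p - o) ** 2 for p, o in self.entries.values() if o is not None]
        return sum(scored) / len(scored) if scored else None

journal = PredictionJournal()
journal.predict("Candidate A wins the primary", 0.7)
journal.resolve("Candidate A wins the primary", True)
```

The same structure works for tracking a pundit's predictions rather than your own: log the confident claims as probabilities, resolve them at the stated deadline, and let the running score inform how much trust the source has earned.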
Questions for Discussion
- If superforecasters consistently outperform domain experts, should intelligence agencies restructure their analytical processes to incorporate superforecasting methodology? What organizational and institutional barriers would this face?
- Tetlock's research suggests that media incentives systematically select for epistemically poor forecasters (confident hedgehogs) over more accurate ones (calibrated foxes). Is this a market failure that could be corrected, or is it an inherent feature of entertainment-driven media? What would epistemically responsible political journalism look like?
- The Good Judgment Project's forecasters were volunteers who worked without compensation or classified access. What does this tell us about the value added by professional status, credentials, and classified access in political analysis?
- Forecasting tournaments like the GJP cover questions with binary or near-binary resolutions that can be evaluated in the short to medium term (months to years). How well does the superforecasting methodology generalize to complex, long-range, or multi-dimensional questions (e.g., "Will climate change be a net positive or negative for global GDP by 2100?")?
- Keeping an explicit prediction journal — writing down your predictions, assigning probabilities, and tracking outcomes — is a form of personal calibration training that anyone can adopt. What are the psychological barriers to this practice? Why do you think so few people do it?
- The accountability gap in punditry means that inaccurate predictors maintain their media presence without consequence. Design a practical accountability mechanism for public political commentary that could be adopted by media organizations. What objections would this face, and how would you respond?