Chapter 20: When Models Fail: 2016, 2020, and Beyond

The conference room on the fourteenth floor of Meridian Research Group's downtown office was usually a place of controlled professionalism — whiteboard diagrams, coffee rings on printed cross-tabs, the quiet hum of people who had long ago made their peace with uncertainty. But on the Wednesday morning after the November election, three weeks before Thanksgiving, it had the atmosphere of a medical debriefing after a patient unexpectedly died on the table.

Dr. Vivian Park stood at the head of the table. She had been in political research for thirty years, had built Meridian from a two-person consultancy into one of the most respected survey firms in the country, and had survived enough election nights to understand that misses happened. But she had also spent enough time on the receiving end of client disappointment to know that understanding a miss and accepting one were different things.

"Let's be exact about what we're doing here," Vivian said, uncapping a marker. "We are not here to explain why the candidate lost. We're here to understand why our model of the electorate was wrong. Those are different questions."

Carlos Mendez, twenty-four and three months into his first real job after graduate school, had his laptop open and a legal pad covered in notes. He had been the one who ran the final weighted averages the week before the election. The numbers had looked solid. The fundamentals had looked solid. And then the results had come in, and the margin had been three and a half points in the wrong direction.

Trish McGovern, sitting to Vivian's right with her arms crossed and her reading glasses on the table, had managed field operations for seventeen cycles. She had a theory already. She had had it the night of the election, watching returns. She was waiting to see if Vivian's process arrived at the same place.

What follows in this chapter is the kind of honest accounting that Vivian was conducting that morning — applied not just to Meridian's particular miss but to the broader pattern of polling and forecasting failures that have defined American and international elections over the past decade. Understanding what went wrong is not merely an academic exercise. The errors were not random. They were systematic, predictable in retrospect, and in several cases warned about well in advance by researchers who were largely ignored.


20.1 The 2016 Autopsy: More Than a Black Swan

In the hours after the 2016 presidential election, the dominant media narrative settled quickly on surprise. The polls had been wrong. The models had been wrong. Nobody had seen it coming.

None of those statements were entirely true.

The polling averages in the final week of the 2016 campaign showed Hillary Clinton leading nationally by roughly three percentage points. The final result was Clinton plus 2.1 in the popular vote — a polling error of about one point in the national margin, which by historical standards is entirely ordinary. The models that gave Donald Trump low win probabilities were driven largely by state-level polls, and it was the state-level polls where the real damage happened.

In the three decisive Rust Belt states — Pennsylvania, Michigan, and Wisconsin — late polls systematically underestimated Trump. The average polling error in those states exceeded five points in Trump's direction. Wisconsin had not been surveyed seriously for weeks before the election; most major pollsters had pulled back, assuming it was safely in Clinton's column. This was not a random scatter of errors. The errors were correlated: they ran in the same direction, in the same kinds of states, by similar amounts.

Correlated state errors are the nightmare scenario for probabilistic forecasters. A model can survive random error — if some states go wrong in each direction, the overall forecast converges on something reasonable. But when errors cluster geographically and demographically, a model that treats states as semi-independent can grossly overstate confidence. FiveThirtyEight's final model gave Trump roughly a 29 percent chance of winning. The New York Times Upshot gave him 15 percent. The Princeton Election Consortium, at various points, put his chances below 5 percent. These were not equivalent assessments, and the variation between them reflects different assumptions about state error correlation — but all of them underweighted the possibility of a correlated miss.

💡 What is correlated state error? If Pennsylvania polls overestimate Democrats by 4 points and Wisconsin polls overestimate Democrats by 4 points for the same underlying reason (say, non-response from working-class whites without college degrees), those errors are correlated. A model that treats the two states as independent will dramatically understate uncertainty. In 2016, the correlated structure of state polling errors was the primary mechanism by which low-probability outcomes became reality.

20.1.1 Systematic Polling Error in 2016

The 2016 AAPOR (American Association for Public Opinion Research) post-election report, released in May 2017, identified several contributing factors to the polling miss:

1. Underrepresentation of non-college whites. Voters without four-year college degrees, particularly white voters without four-year college degrees, were underrepresented in polling samples. Education has become increasingly correlated with presidential vote preference — in 2016, its correlation with Republican support among whites was historically high. Polls that weighted on race and age but not education would systematically underestimate Trump.

2. Late-deciding voters broke toward Trump. A substantial portion of voters who made up their minds in the final week of the campaign — roughly 13 percent of the electorate, according to exit polls — broke toward Trump by double digits. The mechanism is debated: some analysts pointed to the FBI director's announcement about Clinton's emails; others argued that late deciders in 2016 were the voters who disliked both candidates and resolved their discomfort at the last moment. Whatever the cause, a polling snapshot from two weeks before the election captured a very different electorate than the one that actually voted.

3. Differential nonresponse. This is the subtlest and most important factor. It is not that Trump supporters refused to answer polls. It is that in 2016, the kinds of people likely to support Trump were, for reasons unrelated to the election itself, less likely to answer polls at all. Polling response rates have fallen dramatically since the 1990s. In 1997, the Pew Research Center achieved a 36 percent response rate on a standard telephone survey. By 2018, that number had fallen below 6 percent. When only 6 out of every 100 people you call participate in your survey, the question of who those 6 people are matters enormously.

4. Overconfidence and the assumption of mean reversion. Many forecasters and commentators assumed that any unusual polling patterns would revert toward historical norms before Election Day. This assumption — that 2016 would look like 2012 or 2008 in its fundamental structure — was not a crazy prior. But it was wrong, and building it into a model added a bias that the data did not support.

📊 The Education Weighting Gap: A Pew Research analysis found that in 2016, national polls that weighted by education produced estimates an average of 1.5 points more favorable to Trump than polls that did not weight by education. In high-stakes swing states, the gap was larger. This was a known methodological issue before 2016 — the Cooperative Congressional Election Study had flagged education-related nonresponse for years. The failure was not that the problem was unknown; it was that practice had not caught up with knowledge.

20.1.2 Was 2016 a Black Swan?

A "black swan," in Nassim Taleb's formulation, is an event that lies outside the realm of regular expectations — one that carries extreme impact and that, in retrospect, humans concoct explanations for in order to make it seem more predictable than it was. The question of whether 2016 was a genuine black swan or a predictable failure has real consequences for how forecasters should respond.

The case for "black swan": The combination of late deciders, correlated state errors, and a media environment that had never been more fragmented and volatile produced an outcome that had no precise historical parallel. No model could reasonably have been expected to capture it.

The case for "predictable failure": The specific failure modes — differential nonresponse, education weighting, correlated state errors — were known and documented before the election. Some forecasters, including the analytics team at FiveThirtyEight, had written publicly about all of them. The failure was not epistemic in the sense of being unknowable; it was institutional, in the sense that the industry lacked the incentive structure or the methodological discipline to act on what was known.

🔵 Debate: Can We Learn from Unique Events? If 2016 was genuinely unprecedented, what does "learning" from it mean? Some statisticians argue that low-frequency, high-impact events are fundamentally resistant to calibration — you cannot update a probability model based on one data point. Others argue that the structural features of 2016 (education polarization, differential nonresponse) are ongoing and require ongoing methodological response regardless of whether the precise combination recurs. Both positions have merit.

Vivian Park, in her postmortem, came down firmly on the second side. "I don't care if the exact sequence never repeats," she told Carlos. "The mechanisms that produced the error are still present. If we don't fix how we weight, we will get hit again — maybe not the same way, but by the same family of errors."


20.2 The 2020 Election: Better Models, Same Systematic Error

If 2016 prompted forecasters to rebuild their methodologies, 2020 provided an immediate test of how far they had come.

The short answer: not far enough.

The headline miss in 2020 was in some ways larger than in 2016, even though the outcome was less surprising. In Wisconsin, the polling average overestimated Biden by about 8 points. In Florida, where Trump won decisively, polls had shown a near-tie. In Ohio and Iowa, states Trump won by 8 and 9 points respectively, polls had shown 1-2 point margins. Nationally, the polling average overestimated Biden by about 3.9 points — the largest national polling error since 1980.

And yet, because Biden won, the miss attracted far less public scrutiny than 2016.

20.2.1 Demographic Differentials in Response Rates

The 2020 AAPOR task force report identified partisan nonresponse bias as the central mechanism: Trump supporters were systematically less likely to participate in surveys than Biden supporters, and this disparity had grown since 2016 — possibly, the report suggested, because of Trump's own rhetoric about polling and the media.

The key insight here is the distinction between differential nonresponse by partisanship and differential nonresponse by demographic proxy. In 2016, analysts had focused on education and geography as the mechanisms through which polling was going wrong. By 2020, it had become clearer that something more direct was happening: Republican-leaning voters were simply less likely to take surveys. Weighting on demographics helps if the problem is demographic. It helps much less if the problem is that Republicans and Democrats with identical demographic profiles have different probabilities of answering your survey.

This is a profound challenge because it resists standard methodological fixes. You can weight on education. You can weight on race. You can even weight on previous vote choice (though this introduces its own problems). But if the fundamental issue is that a substantial portion of one party's voters have an aversion to survey-taking as a cultural behavior, the normal tools of survey adjustment reach their limits.

⚠️ The Weighting Trap: Weighting on recalled vote choice — asking respondents which candidate they voted for in the previous election and adjusting the sample to match the actual result — can partially correct for partisan nonresponse. But it introduces several new problems: vote recall is imperfect (people misremember or refuse to say), and current party preferences may not match the distribution implied by past votes. Like many corrections in survey methodology, it treats the symptoms rather than the disease.

20.2.2 What Forecasters Changed — and What They Missed

After 2016, several major forecasting operations made substantive methodological adjustments:

  • FiveThirtyEight increased the assumed correlation between state errors, widening confidence intervals.
  • The Economist's model (which launched in 2020) built in explicit fundamentals-based adjustments to reduce poll dependency.
  • Several individual pollsters began weighting by education in their topline numbers, a practice that had been inconsistent before 2016.

These adjustments helped at the margins. FiveThirtyEight's final model gave Biden roughly an 89 percent chance of winning, wide enough to accommodate a Trump win without appearing overconfident in the way the 2016 models had. But the underlying polling error persisted and, by most measures, was worse than 2016. The adjustments had widened confidence intervals but had not corrected the directional bias.

🔗 Connection to Chapter 19: Recall from the probabilistic forecasting discussion that a model can be well-calibrated — its stated probabilities can be accurate — even if its point estimates are biased. FiveThirtyEight's 89 percent Biden forecast was arguably reasonably calibrated (Biden won a race in which he was the 89 percent favorite). But the industry's state-level point estimates — Meridian's included — were badly biased, and that matters enormously for clients who need to make resource allocation decisions based on which states are close.


20.3 The 2022 Midterms: A Smaller Miss, But a Miss Nonetheless

If 2020 tested whether 2016 adjustments had worked, 2022 provided another data point.

The 2022 midterm environment featured a near-universal expectation of a significant Republican wave. The president's approval ratings were in the low 40s. The historical pattern of midterm elections strongly favors the out-party. Inflation was running at 40-year highs. Forecasters, having been burned twice by underestimating Republicans, had recalibrated — some perhaps too aggressively.

The "red wave" did not materialize. Democrats outperformed the polling averages across the board, losing the House by a smaller margin than predicted and holding the Senate. Several high-profile Republican candidates in key Senate races — in Pennsylvania, Georgia, Arizona, and Nevada — lost despite being competitive or even ahead in the polls.

The 2022 error was structurally different from 2016 and 2020. In those cycles, the polling error ran in a consistent Republican direction. In 2022, the error ran in the Democratic direction in many contested races. This is important: it suggests the error was not simply a persistent Trump-specific nonresponse problem but something more complicated involving candidate quality, issue salience (particularly abortion, following the Dobbs decision), and late-breaking patterns in early and mail voting that polling had not fully captured.

📊 The 2022 Senate Polling Average Error: According to FiveThirtyEight's post-election analysis, the average error in competitive Senate races in 2022 was 3.0 points toward Republicans — meaning Democrats outperformed polls by 3.0 points on average. This was the largest Democratic overperformance in a midterm since at least 1998.

20.3.1 The Dobbs Effect and Late Information

One compelling theory about 2022: the Supreme Court's Dobbs decision in June, which overturned Roe v. Wade, fundamentally altered the issue environment of the election in a way that was difficult for fundamentals-based models to fully capture. Abortion had been a historically Republican-mobilizing issue; Dobbs inverted that dynamic, at least for the 2022 cycle. Models calibrated to historical relationships between presidential approval, economic indicators, and seat change had no mechanism to incorporate this structural shift.

This is a broader lesson: fundamentals models implicitly assume that the structural relationship between inputs and outputs is stable. When a major structural change occurs — a landmark Supreme Court decision, a pandemic, a major third-party candidacy — the model's historical correlations may not hold.


20.4 International Failures: This Is Not Uniquely American

American analysts sometimes treat the polling failures of 2016 and beyond as uniquely American pathologies — products of polarization, Trump, or some feature of the U.S. political system. The international evidence does not support this comfortable parochialism.

20.4.1 UK 2015: The Worst Polling Failure in British History

In the 2015 United Kingdom general election, all major polling firms predicted a hung parliament — a result so close between the Conservative and Labour parties that no single party would win a majority. The actual result was a decisive Conservative majority, with David Cameron's party winning 331 seats compared to Labour's 232. The Conservatives finished roughly 6.5 points ahead of Labour in vote share (36.9 percent to 30.4 percent), where the average of published final polls had put the two parties at virtual parity.

The subsequent inquiry by the British Polling Council, led by Patrick Sturgis, found several contributing factors. First, the raw sampling had a clear pro-Labour skew: the people who responded to polls were systematically more likely to intend to vote Labour than the actual electorate. Even after applying standard demographic weights, the Labour-leaning skew persisted, suggesting that the nonresponse problem was not fully addressed by weighting on observable characteristics.

Second, there was a late swing: voters who were genuinely undecided until the final days broke disproportionately toward the Conservatives, possibly because of economic anxiety about what a Labour government would mean for living standards. Cameron's campaign had hammered this economic anxiety message in the final week; the polling snapshots from earlier in the campaign did not capture the dynamics this late push produced.

Third, and most consequentially for the profession's self-understanding, the Sturgis inquiry documented clear evidence of herding. When individual polling firms compared their raw numbers against competitors and found they were outliers, many adjusted their results toward the consensus before publication. The industry-wide convergence on "hung parliament" was not entirely a product of independent measurement pointing in the same direction — it was partly a product of mutual adjustment that suppressed genuine signals in individual firms' data.

🌍 UK 2015 and the Spiral of Silence: Some analysts invoked Elisabeth Noelle-Neumann's "spiral of silence" theory — the idea that people who hold minority views become less willing to express them publicly. In 2015, Conservative voting may have been a "shy Tory" phenomenon, in which potential Conservative voters were reluctant to identify themselves as such to pollsters. Whether this fully explains the 2015 miss is debated, but the concept of social desirability bias in politically charged contexts is a recurring theme in international polling failure.

The UK 2015 failure is particularly instructive because British polling had previously been considered among the most reliable in the world, and because the inquiry process was more rigorous and publicly transparent than most post-election reviews in other countries. The finding of herding, in particular, was a frank admission by the profession that competitive market dynamics were distorting published results. The British Polling Council subsequently introduced new disclosure requirements for weighting methodology and raw sample statistics — a direct institutional response to the identified failure mode.

20.4.2 Brexit: A Referendum That Polling Got (Almost) Right

The Brexit referendum of June 2016 is often cited alongside the 2016 U.S. election as an example of polling failure. This characterization is only partially accurate. The final polling average before the referendum showed Remain at roughly 50 percent and Leave at 48 percent — a result that fell within normal margins of error of the actual outcome (52 percent Leave, 48 percent Remain). The forecasting failure was less about polling and more about the near-universal expectation among political commentators and betting markets that Remain would win even when the polling was essentially tied.

The lesson from Brexit is somewhat different from 2015 or 2016: it is a case where the polling was reasonably accurate, but the interpretation of that polling failed — the prior that Remain was favored, firmly lodged in forecasters' and commentators' minds, led to systematic underweighting of the genuine uncertainty the data showed.

20.4.3 Australia 2019: "The Unlosable Election"

The 2019 Australian federal election was widely expected to produce a Labor government. The incumbent Liberal-National coalition under Scott Morrison had been polling behind Labor in virtually every published survey for nearly two years. The actual result was a comfortable Morrison victory — the Liberal-National coalition won 77 seats to Labor's 68, with the coalition's two-party preferred vote at approximately 51.5 percent, compared to a polling average that had shown Labor leading by 2-3 points.

The framing in Australian media — "the unlosable election" — captured how completely the political community had absorbed the polling consensus as reality. Labor had been campaigning as if victory were certain, presenting detailed and politically costly policy proposals (on negative gearing of investment housing, capital gains tax, and climate) on the assumption that voters were ready for them. The coalition campaign had focused relentlessly on the economic risk of Labor's plans — a message that, as subsequent analysis showed, was resonating more than the polls suggested.

The subsequent review by the Australian Market and Social Research Society identified many of the same mechanisms seen in the UK and U.S. Sampling biases had developed as the industry shifted from probability-based telephone sampling to online panel-based methods. Online panels — recruited samples of willing respondents who participate in surveys repeatedly — are not random samples of the population; they tend to recruit more politically engaged, more educated, and more urban respondents. The political center of these panels drifted left of the actual electorate.

The review also noted a methodological transition that had happened without adequate validation: the shift from telephone to online was motivated by declining response rates and cost pressures, not by evidence that online panels produced more accurate political estimates. Australian pollsters had essentially made a bet that online panels would be close enough to probability samples — a bet they did not test rigorously against past elections before relying on it in 2019.

Herding was present in Australia as well. The consistent Labor-leading consensus over two years led individual pollsters to be skeptical of results showing different patterns and to adjust accordingly. In at least two documented cases, polling firms' raw data showed a closer race than their published numbers, with the publication decision reflecting conscious adjustment toward the industry consensus.

🌍 The International Pattern: The consistency of polling failures across countries — the UK in 2015, Australia in 2019, Brazil in 2022, Israel across multiple elections — suggests that the problem is not specific to Donald Trump or American polarization. The common threads are: increasing survey nonresponse rates globally; the shift from probability sampling to online convenience panels; and the herding incentives that exist for polling firms in competitive commercial markets. These are structural features of the modern polling industry, not products of any particular political context.


20.5 Systematic vs. Random Error: The Key Distinction

This chapter has used the word "systematic" repeatedly. It is worth being precise about what that means and why it matters.

Random error is noise — the kind of variation that arises from the process of sampling a finite number of people from a larger population. If you are trying to estimate public opinion among 150 million likely voters by surveying 800 of them, your estimate will deviate from the true value by some amount due purely to chance. This deviation has no consistent direction; if you ran the same survey 1,000 times, the errors would average to approximately zero. Random error is captured by the margin of error in published polling.

Systematic error (also called bias) is directional deviation that does not cancel out over repeated measurements. If your sampling method consistently underrepresents Republican voters, your estimate of Republican support will be too low not just once but every time you run that survey. Systematic error is not addressed by larger sample sizes — a biased survey of 10,000 people is still biased.

This distinction has an important implication: the margin of error reported in published polls addresses only random error. A poll with a margin of error of ±3 points can still be systematically off by 5 points in the same direction. The stated confidence interval is accurate only if the survey has no systematic bias — an assumption that, as this chapter has documented, is frequently violated.

⚠️ The Margin of Error Misconception: Journalists and consumers of polling frequently interpret the margin of error as capturing the full range of possible polling error. It does not. The margin of error reflects sampling variance only — the variability you'd expect if you drew repeated random samples from the same population using the same perfect methodology. Systematic biases from differential nonresponse, weighting errors, or question framing are invisible in the stated margin of error. This is one of the most consequential misunderstandings in political communication.
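
A minimal sketch makes the distinction concrete. The code below (the response-rate numbers are hypothetical) computes the textbook margin of error, which shrinks as samples grow, and then simulates a survey with differential nonresponse — a bias that no sample size can fix.

```python
import math
import random

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """95% margin of error for a proportion — sampling variance only."""
    return z * math.sqrt(p * (1 - p) / n)

# Random error shrinks as n grows...
for n in (800, 3200, 10000):
    print(f"n={n:>6}: MoE = ±{100 * margin_of_error(0.5, n):.1f} pts")

# ...but systematic bias does not. Hypothetical: true support is 50/50, but
# candidate A's supporters are 30% less likely to answer the survey.
def biased_poll(n: int, true_p: float = 0.50, resp_ratio: float = 0.7) -> float:
    responders = []
    while len(responders) < n:
        supports_a = random.random() < true_p
        respond_prob = resp_ratio if supports_a else 1.0
        if random.random() < respond_prob:
            responders.append(supports_a)
    return sum(responders) / n

# Expected estimate: 0.35 / 0.85 ≈ 41% for A — at any sample size.
for n in (800, 10000):
    avg = sum(biased_poll(n) for _ in range(200)) / 200
    print(f"n={n:>6}: average estimate for A = {100 * avg:.1f}% (truth: 50.0%)")
```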


20.6 Correlated Errors: A Worked Numerical Example

The concept of correlated state errors is central to understanding why probabilistic forecasts can be so badly miscalibrated. A numerical example makes the mechanism concrete.

Setting Up the Problem

Suppose a forecaster is modeling a three-state election. The Democratic candidate needs to carry at least two of three states — let's call them Alpha, Beta, and Gamma — to win the electoral college. Based on state-level polling, the forecaster's model estimates:

  • State Alpha: Democratic candidate leads by 3 points; forecast as 70% probability of Democratic win
  • State Beta: Democratic candidate leads by 2 points; forecast as 65% probability
  • State Gamma: Democratic candidate leads by 1 point; forecast as 55% probability

Model A: Independent Errors

If the errors in each state are independent — if they have nothing to do with one another — the forecaster can calculate the win probability by multiplying the probabilities of winning the required combinations.

Probability of winning all three: 0.70 × 0.65 × 0.55 = 0.250

Probability of winning exactly Alpha and Beta (losing Gamma): 0.70 × 0.65 × 0.45 = 0.205

Probability of winning exactly Alpha and Gamma (losing Beta): 0.70 × 0.35 × 0.55 = 0.135

Probability of winning exactly Beta and Gamma (losing Alpha): 0.30 × 0.65 × 0.55 = 0.107

Total probability of winning at least two states (any path to victory): 0.250 + 0.205 + 0.135 + 0.107 = 0.697, or approximately 70%.
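
This arithmetic is easy to verify by enumerating all eight win/loss combinations. A minimal sketch in Python, using the three forecast probabilities above:

```python
from itertools import product

p = {"Alpha": 0.70, "Beta": 0.65, "Gamma": 0.55}  # P(Democratic win) per state

win_prob = 0.0
for outcome in product([True, False], repeat=3):      # all 8 win/loss combos
    prob = 1.0
    for won, state in zip(outcome, p):
        prob *= p[state] if won else 1 - p[state]
    if sum(outcome) >= 2:                             # needs 2 of 3 states
        win_prob += prob

print(f"Win probability under independence: {win_prob:.3f}")  # 0.697
```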

Model B: Correlated Errors

Now suppose the errors are correlated — specifically, suppose there is a common factor (say, underrepresentation of non-college white voters across all three states) that shifts all three polls by the same amount in the same direction. When we model correlated errors, we add a shared systematic component to the individual-state error distributions.

Under a correlated error model, the probability of losing all three states simultaneously is substantially higher than under the independent model, because a single systematic error that hurts in one state hurts in all of them at once.

Concretely: suppose there is a 50% chance that a common factor shifts all three state polls by 3+ points in the Republican direction (perfectly correlated — they're off for the same reason), and a 50% chance the polls are roughly accurate. Then:

  • Half the time, the polling is roughly accurate, and the Democrat wins the election (at least two of three states) with the independent-model probability of about 0.697
  • Half the time, the polls are all 3+ points wrong, the "true" margins are essentially tied in each state, and each state becomes roughly a coin flip — making the probability of winning at least two of three about 0.50

The combined probability of losing the election under the correlated model:

(0.5 × (1 − 0.697)) + (0.5 × 0.50) ≈ 0.15 + 0.25 = 0.40

Even under these relatively gentle assumptions, the win probability drops from the model's stated 70% to roughly 60% — and with larger or more probable correlated shifts, it falls into the 50-55% range.
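
The same comparison can be run by simulation. The sketch below uses hypothetical error scales chosen so that the total error variance is identical in both runs (5² = 4² + 3² = 25); only the split between shared and state-specific error changes.

```python
import random

STATE_LEADS = {"Alpha": 3.0, "Beta": 2.0, "Gamma": 1.0}   # Dem polling leads, pts

def win_prob(shared_sd: float, state_sd: float, trials: int = 200_000) -> float:
    """Share of simulations in which the Democrat carries at least 2 of 3 states."""
    wins = 0
    for _ in range(trials):
        shared = random.gauss(0, shared_sd)         # one error hits every state
        carried = sum(
            lead + shared + random.gauss(0, state_sd) > 0
            for lead in STATE_LEADS.values()
        )
        wins += carried >= 2
    return wins / trials

print(f"independent errors : {win_prob(shared_sd=0.0, state_sd=5.0):.3f}")  # ~0.72
print(f"correlated errors  : {win_prob(shared_sd=4.0, state_sd=3.0):.3f}")  # lower
```

No polling input changes between the two runs; the gap in the output is entirely a product of the correlation assumption.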

Why This Matters for Real Forecasts

In 2016, the Princeton Election Consortium's model assumed near-zero state error correlation. Its stated win probability for Clinton was around 95%. The actual result was a Trump victory through exactly the correlated-error pathway: all three critical Rust Belt states went wrong in the same direction by similar amounts.

FiveThirtyEight's model, which built in higher state error correlation, gave Trump a 29 percent win probability. A single outcome cannot prove that number too low — 29 percent events happen routinely — but the figure came far closer to acknowledging the genuine uncertainty. The difference between 5 percent and 29 percent was almost entirely explained by different assumptions about correlated versus independent state errors.

📊 The Practical Lesson: When a forecasting model shows extreme confidence — 90%+ win probability — the first question to ask is: what assumption is the model making about correlated errors? High confidence is only warranted if either the polling lead is enormous (so even a large correlated error won't flip the result) or there is good reason to believe state polls are truly independent. In an environment of persistent differential nonresponse affecting similarly-educated and similarly-partisan voters across similar states, independence is almost never a safe assumption.


20.7 Partisan Nonresponse Bias: The Mechanism

The most significant and durable source of systematic polling error in contemporary American elections is partisan nonresponse bias: the tendency for members of one party to be less willing to participate in surveys than members of the other party.

This is distinct from, though related to, the education-weighting problem identified after 2016. Partisan nonresponse bias can exist even after controlling for standard demographic variables. It is driven by differential trust in institutions (including polling organizations), political alienation, and — in some periods — direct signals from political leaders that polling is illegitimate or an instrument of the "fake news" media.

The challenge for pollsters is distinguishing partisan nonresponse from demographic nonresponse. If Republican voters are underrepresented in your sample because they are disproportionately non-college-educated and non-college voters respond to surveys at lower rates, you can partially correct for this by weighting on education. But if Republican voters are underrepresented because of specifically partisan aversion to surveys — independent of education, age, gender, and geography — demographic weighting will not fix it.

One attempted solution is weighting on recalled vote choice: adjusting the sample so that the proportion who report having voted for Trump (or for any Republican) in the previous election matches the actual vote share. This approach has intuitive appeal but practical limitations. First, vote recall is subject to "winner's memory effect" — people systematically overstate their support for whoever won. Second, the relationship between past vote choice and current preference is not fixed; a sample that perfectly reproduced 2020 behavior might still be wrong about 2024 behavior if there has been net party switching. Third, and most fundamentally, the recalled vote approach is circular: it calibrates the current sample to match past behavior, which is fine for stable electorates but assumes that the structural composition of the electorate does not change.
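
Mechanically, recalled-vote weighting is a simple post-stratification adjustment. A minimal sketch, with an entirely hypothetical sample and previous-election result, shows the computation — and why it corrects composition without touching recall error or party switching:

```python
from collections import Counter

# Hypothetical raw sample: respondents' self-reported previous vote.
sample = ["D"] * 520 + ["R"] * 400 + ["other"] * 80

# Hypothetical actual previous-election vote shares to calibrate against.
actual_share = {"D": 0.49, "R": 0.47, "other": 0.04}

counts = Counter(sample)
n = len(sample)

# Weight = target share / observed share, per recalled-vote group.
weights = {grp: actual_share[grp] / (counts[grp] / n) for grp in actual_share}

for grp, w in weights.items():
    print(f"{grp:>5}: sample share {counts[grp] / n:.2f} -> weight {w:.2f}")
# R respondents are up-weighted (0.40 observed vs. 0.47 target) — but the
# correction is only as good as respondents' recall and the assumption that
# the past electorate's composition still describes the current one.
```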


20.8 The Herding Problem

Beyond nonresponse bias, another structural failure mode in the polling industry is herding: the tendency for pollsters to adjust their results toward the existing polling consensus rather than publishing what their raw data shows.

Herding can be rational from the perspective of an individual pollster. If your poll shows a result very different from all other published polls, there are two explanations: either your poll is capturing something real that other polls are missing, or your poll has a methodological problem. Given the base rate, the second explanation is more likely. A pollster who publishes a major outlier and turns out to be wrong is embarrassed. A pollster who adjusts toward the consensus and turns out to be wrong has cover — "everyone else was wrong too."

This reasoning, applied by enough pollsters, creates a perverse dynamic: the published polling average becomes a self-reinforcing consensus that is less informative than the underlying data. When one major pollster has a genuine outlier reading, the adjustment toward consensus suppresses that signal. If the outlier was picking up something real — an unmeasured shift in sentiment, an unusual distribution of late deciders — the herding behavior means the final polling average fails to capture it.

🔴 Critical Thinking: The Game Theory of Herding. Herding is individually rational but collectively destructive. It is analogous to the conformity experiments of Solomon Asch, where individuals suppress accurate perceptions to match the group consensus. The polling industry, facing acute reputational pressure after 2016, arguably increased herding behavior: more firms began explicitly comparing their results to the polling average before publication and adjusting accordingly. This means that the stated diversity of the polling landscape is somewhat illusory — many "independent" polls are not independent in the statistical sense because they have been partially adjusted toward each other.

Evidence for herding in practice: Nate Silver and others have documented that the distribution of polling results is systematically "too tight" — it shows less variance than would be expected if polls were truly independent measurements of the same underlying quantity. The standard deviation of final poll results is consistently smaller than statistical theory predicts it should be. The most parsimonious explanation is widespread adjustment toward consensus.
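
The "too tight" test is straightforward to run. A minimal sketch, with hypothetical final-poll margins: compare the observed standard deviation of published results against the standard deviation that sampling theory alone predicts for independent polls of the stated size.

```python
import math
import statistics

# Hypothetical final published margins (points), each from a poll of n = 1,000.
margins = [2.1, 2.4, 2.0, 2.3, 2.2, 2.5, 2.1, 2.3]
n = 1000

observed_sd = statistics.stdev(margins)

# For a near-50/50 race, the margin (p_a - p_b) has a sampling standard
# deviation of about 2 * sqrt(p * (1 - p) / n) — in percentage points:
expected_sd = 2 * math.sqrt(0.5 * 0.5 / n) * 100

print(f"observed SD : {observed_sd:.2f} pts")   # ~0.17
print(f"expected SD : {expected_sd:.2f} pts")   # ~3.16
# Published polls clustered far more tightly than sampling theory allows —
# the statistical signature of herding.
```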


20.9 The Sociology of Forecasting: Why Forecasters Herd

Herding is a technical problem, but it is rooted in a social one. Understanding why the polling and forecasting community tends toward consensus requires looking at the sociology of the industry — its incentive structures, its professional culture, its relationship to media and public attention.

Reputational Incentives Under Asymmetric Scrutiny

Polling organizations face a deeply asymmetric reputational environment. When a pollster's results are close to the consensus and the consensus is wrong, the blame is diffuse: "all the polls were wrong." When a pollster's results are distant from the consensus and the outlier is wrong, the blame is concentrated: "that firm had a methodology problem." The individual cost of being a correct outlier is modest (you may be praised retrospectively, but elections happen infrequently and attention spans are short). The individual cost of being a wrong outlier is high and immediate.

This asymmetry creates a persistent pull toward the center of the distribution of published polls, independent of what any individual firm's raw data actually shows. Rational actors in a reputationally asymmetric environment will shade their published results toward the consensus, and knowing that others are doing the same provides further justification for the same behavior.

The "Sophisticated" Consensus Effect

There is a second sociological dynamic that reinforces herding at a higher level of self-awareness. Political forecasters are not naive; they read each other's work, attend the same conferences, and are aware of the methodological arguments. When every major forecaster who has thought carefully about the problem reaches roughly the same conclusion, it is natural to treat that convergence as evidence of reliability. If FiveThirtyEight, The Economist, and two academic forecasting teams all show the same candidate with a substantial lead, any individual who sees different raw data faces a strong prior that the discrepancy is in their data, not in the consensus.

This is Bayesian updating in a social context — and it is entirely rational as long as the forecasters are genuinely independent. The problem is that they are not. They are consuming the same published polls (many of which have been herded), using similar methodologies developed in the same academic journals, and interpreting the same political fundamentals through similar theoretical frameworks. Their apparent independence is greater than their actual independence, and the convergent consensus understates genuine uncertainty.

Career Incentives in Journalism and Media

A third dimension of herding sociology involves the media ecosystem in which forecasters operate. Data journalists and public forecasters at major media organizations have professional incentives to be legible and engaging to general audiences. Extremely wide confidence intervals — "anything between a 40% and 70% win probability" — are harder to communicate and less compelling than a precise number. A model that consistently says "too close to call" will lose audience share to one that confidently assigns specific probabilities.

This creates selection pressure for confident-seeming forecasts, which in turn creates pressure to narrow stated uncertainty below what the data warrants. The fact that wide uncertainty intervals are usually more accurate than artificially precise ones does not override the communication pressure to appear to know something.

💡 Intuition: Forecasting as a Social Performance

It is useful to think of public election forecasting as a social performance as well as a technical exercise. The forecaster is not just calculating probabilities; they are producing a public artifact that shapes how other actors — campaigns, donors, journalists, voters — think about the election. Forecasters know this. The awareness that their output will have downstream effects — on campaign strategy, media coverage, voter turnout, donor behavior — creates incentives that can diverge from pure epistemic accuracy. A forecaster who understands this dynamic can partially resist it; one who doesn't is subject to it without knowing it.

Post-Failure Over-Correction

A final sociological dynamic worth noting: after a major polling failure, the industry tends to over-correct in the direction of the previous error before finding a new equilibrium. After 2016's Republican-direction miss, some forecasters adjusted their models to be more skeptical of Democratic-leaning polls — an adjustment that arguably contributed to the 2022 Republican-direction bias in some forecasts. After being badly wrong about a Republican wave that didn't materialize in 2022, some forecasters pendulum-swung back toward Democrats for 2024.

This sequential over-correction reflects a systematic confusion between "adjusting for the underlying bias mechanism" and "adjusting because we were wrong last time in this direction." The first is principled. The second is pattern-matching on insufficient data. Distinguishing them requires understanding the mechanism of past failures well enough to know when it is operative and when it isn't — exactly the kind of rigorous postmortem analysis that the field has been too inconsistent about conducting.


20.10 Meridian's Postmortem: The Debrief

Trish McGovern finally spoke about forty minutes into the meeting.

"I had a bad feeling in the last week," she said. "Response rates were down in the counties that went hardest for the Republican in the last cycle. We flagged it. You saw the memo."

Vivian nodded. She had seen the memo. She had considered it, had run a sensitivity analysis on the weighting scheme, had concluded that the normal education weighting was probably capturing most of the problem. She had been wrong.

"The memo was right," she said. "We should have taken it further. Carlos — what was our effective sample size after education weighting in the three counties Trish flagged?"

Carlos scrolled back through his files. "About 340 in the combined three-county area. But the nominal sample was 580."

"Effective sample 340, nominal 580. We gave clients a margin of error based on 580. That was wrong, and I signed off on it."

This kind of moment — a senior researcher explicitly taking responsibility for a methodological decision that contributed to an error — is the centerpiece of a genuine postmortem. It is also, in Vivian's experience, relatively rare in political polling. The incentive structure of the industry pushes toward attribution of error to unique circumstances rather than methodology. Nobody builds a business by advertising their systematic biases.
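
The arithmetic behind Carlos's numbers is the standard Kish effective sample size, which penalizes a sample for the variance of its weights. A minimal sketch, with a hypothetical weight distribution chosen to roughly reproduce the 580-nominal, 340-effective situation:

```python
import math

def kish_effective_n(weights: list[float]) -> float:
    """Kish effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

def moe_points(n: float, p: float = 0.5, z: float = 1.96) -> float:
    """95% margin of error for a proportion, in percentage points."""
    return z * math.sqrt(p * (1 - p) / n) * 100

# Hypothetical: 580 respondents, with a hard-to-reach group up-weighted
# about 5x relative to everyone else.
weights = [0.5] * 400 + [2.6] * 180

n_eff = kish_effective_n(weights)
print(f"nominal n = {len(weights)}, effective n = {n_eff:.0f}")     # ~339
print(f"MoE at nominal n   : ±{moe_points(len(weights)):.1f} pts")  # ±4.1
print(f"MoE at effective n : ±{moe_points(n_eff):.1f} pts")         # ±5.3
```

Reporting the ±4.1 figure while the weighted data supported only ±5.3 is precisely the error Vivian signed off on.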

A genuine postmortem, in Vivian's formulation, asks four questions:

  1. What did we predict, precisely? Not "we predicted a close race" but a specific probabilistic statement: "We gave the Democratic candidate a 68 percent probability of winning with a point estimate of +2.3."

  2. What happened, precisely? The Republican won by 1.2 points.

  3. Was the error within expected statistical variance? A 3.5-point error in the margin when the 90-percent confidence interval was ±4 points is within expected bounds. A 3.5-point error when the 90-percent confidence interval was ±2.5 points requires explanation.

  4. Was the error directional and consistent with a known bias mechanism? If yes, the mechanism needs to be fixed. If no, it may be within the expected range of random variation.

In Meridian's case, the error was larger than the stated confidence interval suggested it should be, and it ran in the direction that a differential nonresponse problem would predict. The mechanism was identifiable. The fix was actionable: more aggressive weighting on prior voting behavior in low-response areas, combined with a revised approach to effective sample size calculation.
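
Question 3 can be made mechanical. A minimal sketch, assuming normally distributed errors and treating a stated 90 percent interval's half-width as 1.645 standard deviations:

```python
from statistics import NormalDist

def error_z(observed_error: float, ci90_halfwidth: float) -> float:
    """Standardize an observed polling error against a stated 90% interval."""
    sd = ci90_halfwidth / 1.645      # 90% half-width = 1.645 standard deviations
    return observed_error / sd

for halfwidth in (4.0, 2.5):         # the two cases from question 3 above
    z = error_z(3.5, halfwidth)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    print(f"±{halfwidth} stated: z = {z:.2f}, two-sided p = {p_two_sided:.2f}")
# ±4.0 points: p ≈ 0.15, within expected bounds.
# ±2.5 points: p ≈ 0.02, requires explanation.
```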

Best Practice: Genuine Postmortems. The most sophisticated forecasting organizations — including academic election forecasting teams and the best commercial survey firms — maintain public forecasting records that allow systematic assessment of calibration over time. A firm that claims 70-percent confidence in its predictions should be right about 70 percent of the time across a large sample of predictions. Tracking this systematically, rather than attributing each miss to unique circumstances, is the discipline that separates calibrated forecasting from sophisticated guessing.
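
A minimal sketch of that tracking discipline, using a toy (entirely hypothetical) forecast archive: group predictions into probability bins and compare stated confidence against realized frequency in each bin.

```python
import math
from collections import defaultdict

# Hypothetical archive of (stated win probability, outcome: 1 = happened).
record = [(0.68, 1), (0.65, 1), (0.62, 0),
          (0.72, 1), (0.75, 0), (0.78, 1),
          (0.90, 1), (0.92, 1), (0.95, 1), (0.91, 0)]

buckets = defaultdict(list)
for prob, happened in record:
    buckets[math.floor(prob * 10) / 10].append(happened)   # 10-point bins

for lo in sorted(buckets):
    outcomes = buckets[lo]
    hit_rate = sum(outcomes) / len(outcomes)
    print(f"stated {lo:.0%}-{lo + 0.1:.0%}: actual {hit_rate:.0%} "
          f"over {len(outcomes)} forecasts")
# Calibration means stated and actual frequencies match in every bin —
# judged over far more forecasts than this toy archive contains.
```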


20.11 How Forecasters Responded After Each Failure

The post-failure methodological responses in the polling and forecasting industry have been substantial, if uneven, and in some cases have introduced new problems while addressing old ones. Tracing these responses across three cycles illustrates both the field's capacity for learning and its structural limits.

After 2016: The Education Weighting Revolution

The clearest and most immediate methodological response to 2016 was the widespread adoption of education weighting in political surveys. Before 2016, roughly half of publicly released polls weighted on education; after, the proportion exceeded 80 percent within two years. This change was straightforward to implement and addressed a documented, concrete problem.

The response in probabilistic forecasting was more complex. Several major operations substantially increased the assumed correlation between state polling errors, widening their confidence intervals. This was the right response to the right problem: the core failure in 2016 was not that individual state polls were wrong by amounts outside normal ranges, but that they were wrong in the same direction and by similar amounts simultaneously.

FiveThirtyEight published an extended methodological explainer in 2018 describing how they had revised their model to account for correlated errors — including the specific technical decision to build a random national effect and correlated regional effects into their state-level error distribution. This kind of public methodological transparency is relatively rare in commercial forecasting and deserves recognition as a professional standard.

After 2020: Confronting Partisan Nonresponse

The more intractable problem identified in 2020 — specifically partisan, not merely demographic, nonresponse — has not been solved, partly because it cannot be fully solved within the standard survey toolkit.

Several responses have been proposed and tested. Weighting on self-reported 2020 vote choice has become more common, though it carries the limitations discussed earlier. Some polling organizations have experimented with "engaged respondent" weights — adjusting for the recency and frequency of survey participation on the theory that very frequent survey-takers are less representative than occasional participants. Others have developed explicit "Trump voter" imputation methods that attempt to identify and statistically represent the hard-to-reach Republican-leaning population.

None of these solutions has proved robust. The 2022 and 2024 cycles produced errors in different directions, suggesting that the bias mechanism is not simply "Republicans won't answer surveys" but something more dynamic and context-dependent — possibly related to the cultural salience of a particular election, the enthusiasm differential between the parties, and the degree to which either party's leader is actively hostile to polling.

After 2022: The Danger of Over-Correction

The 2022 experience — in which forecasters who had overcorrected for Republican nonresponse produced a polling average that overestimated Republicans — demonstrated a different failure mode. Some polling organizations, burned by the 2016 and 2020 misses, had adjusted their models in ways that were calibrated to past Republican underrepresentation rather than to the underlying mechanisms. When those mechanisms shifted in 2022, the adjustments produced bias in the opposite direction.

This sequential over-correction is a recognizable pattern in empirical forecasting. The appropriate response is not to correct toward past error but to identify and model the mechanism of past error, adjusting only to the extent that the mechanism is currently operative. In a post-Dobbs environment where abortion mobilized Democratic voters in ways not captured by historical patterns, the lesson was not "correct for 2020 Republican nonresponse" but rather "build more flexibility into the model to accommodate issue-saliency shocks."

🔗 Connection to Chapter 19: Recall from the discussion of model calibration that a well-calibrated model is right the right amount of the time across many different predictions. Sequential bias in one direction followed by sequential bias in the opposite direction is evidence of a calibration failure — the model is not accounting for genuine uncertainty about which direction the error will run, even if it is accounting for the magnitude of expected error. Genuine calibration requires uncertainty about the direction of forecasting error, not just its size.


20.12 How to Consume Forecasts Critically

Understanding the failure modes documented in this chapter equips you to be a more informed consumer of electoral forecasts. This matters whether you are an analyst communicating findings to clients, a campaign staffer using forecasts for resource allocation, or a citizen trying to assess the information environment. The following practical framework addresses the most consequential misunderstandings.

Ask About the State Error Correlation Assumption

When a forecasting model gives a candidate a very high (>85%) or very low (<15%) win probability, the key question is: what assumption is the model making about correlated state errors? High confidence is defensible only if the lead is large enough to survive a large correlated polling miss, or if there is genuine evidence that state errors are likely to be independent.

When reading a forecast, look for explicit discussion of the error correlation assumption. Models that do not discuss it have likely either ignored it or assumed independence — both are red flags. Models that explicitly estimate correlated error distributions and explain their assumptions are more trustworthy.

Distinguish Point Estimates from Win Probabilities

Campaign staff and political journalists routinely use win probabilities when they should be paying attention to point estimates, and vice versa. Win probabilities answer the question "who is likely to win." Point estimates answer the question "by how much is each candidate expected to win or lose." These are different questions with different decision implications.

For resource allocation — deciding whether to invest more heavily in Pennsylvania versus Florida, for example — the relevant quantity is the point estimate and its uncertainty range, not the win probability. A candidate ahead in a state by 3 points with a ±4-point error distribution and one ahead by 1.5 points with a ±2-point error distribution can carry nearly identical win probabilities, yet their strategic situations differ meaningfully: the first lead is larger but can be erased by an ordinary correlated polling miss, while the second race is closer on the margin but more tightly measured.
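
The conversion between the two quantities is simple under a normality assumption, which is exactly why the compression of information is easy to miss. A minimal sketch (the leads and error scales are hypothetical):

```python
from statistics import NormalDist

def win_prob(lead: float, sd: float) -> float:
    """P(final margin > 0) when the forecast margin is Normal(lead, sd)."""
    return 1 - NormalDist(lead, sd).cdf(0)

# Nearly identical win probabilities, meaningfully different situations:
print(f"+3.0 lead, sd 4.0 pts: {win_prob(3.0, 4.0):.1%}")   # ~77%
print(f"+1.5 lead, sd 2.0 pts: {win_prob(1.5, 2.0):.1%}")   # ~77%
```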

Watch for Sequential Published Polls from the Same Organization

When a polling organization releases multiple polls on the same race over several weeks, examine the trajectory. Is the organization's published estimate tracking the industry consensus more closely than its raw numbers might support? Large jumps in a polling organization's estimate within a short period — particularly if they move the organization from outlier to consensus — are a warning sign of herding. An organization that published a three-point lead for Candidate A, then revised to a two-point lead two weeks later to match the rest of the industry, may have been publishing its raw data the first time and its adjusted-toward-consensus estimate the second.

Apply the "Model Assumptions Under Stress" Check

The most fundamental question to ask about any forecast is: what assumptions does the model make, and under what circumstances would those assumptions be violated in ways that would change the forecast dramatically?

For fundamentals-based models, the key assumption is that the historical relationship between economic conditions, presidential approval, and election outcomes holds in the current cycle. When there is a structural break — a pandemic, a major third-party candidacy, an extraordinary issue like Dobbs — this assumption is under stress.

For polling-based models, the key assumption is that the nonresponse bias in current polls is similar in magnitude and direction to the nonresponse bias the model has been calibrated to handle. When something changes the partisan composition of survey respondents — a leader's active hostility to polling, a cultural shift in survey-taking norms, a major event that differentially energizes one party's supporters — this assumption is under stress.

Identifying these stress points before the election — and maintaining genuinely wider uncertainty when assumptions are clearly under stress — is the practical embodiment of epistemic humility in forecasting.

⚠️ The Forecast Is Not the Race. One of the most consequential misunderstandings in modern political analytics is the confusion between the forecast and the race itself. When a model shows Candidate A with an 87 percent chance of winning, some campaign staff hear "we don't need to worry about this race." The correct interpretation is "based on current polling with its acknowledged limitations, Candidate A is likely to win, but the model carries genuine uncertainty about whether the polling is itself biased." The forecast is a statement about uncertainty. It is not a result.


20.13 What Forecasters Learned and How They Adjusted

The post-2016 methodological response in the polling and forecasting industry has been substantial, if uneven:

1. More aggressive state error correlation. The major forecasting models now build in explicit covariance between state polling errors, widening the tails of the probability distribution and reducing the tendency to assign very high or very low win probabilities based on poll-of-polls averages.

2. Fundamentals-based adjustments. Models that blend polling with fundamentals (presidential approval, economic indicators, incumbency) are more resistant to a sudden polling miss because they have a second source of signal that is not subject to the same nonresponse biases as polls.

3. Education weighting. The proportion of polling operations that now weight on education has increased substantially since 2016. The 2020 and 2022 failures suggest this adjustment is necessary but not sufficient — it handles demographic nonresponse without addressing partisan nonresponse.

4. Transparency about methodology. The best firms now publish detailed methodological descriptions that allow clients and observers to evaluate specific design choices. Transparency does not prevent errors, but it allows for better postmortem analysis.

5. The limits of retrospective calibration. Several methodological innovations introduced after 2016 were, at least implicitly, calibrated to the 2016 error — to making the model that would have done better in 2016 or 2020. This carries a risk: the next failure will likely come from a mechanism that the corrections did not anticipate. Overfitting to past failures is itself a form of the gap between map and territory.


20.14 The Limits of Calibration: Genuine Unpredictability

Having traced the systematic roots of polling and forecasting failures, it is important to acknowledge the genuine limits of what calibration can achieve.

Some events are genuinely difficult to forecast not because of methodological failures but because they involve rare combinations of circumstances, high sensitivity to late-breaking events, and the chaotic dynamics of multi-candidate competition. The 2000 presidential election was decided by 537 votes in Florida — no model, however carefully constructed, should be expected to forecast a margin that small. The impact of James Comey's letter about Clinton's emails eleven days before the 2016 election is nearly impossible to model because it was an idiosyncratic event whose magnitude was not knowable in advance.

The honest position for a forecaster is to maintain calibrated uncertainty not just about the outcomes they are modeling but about the reliability of their models themselves. A model that has worked well in the past may have been working for reasons that are no longer present. The structural relationship between presidential approval and vote share may be weaker in a high-polarization environment than it was in the 1950s. The utility of party registration data for voter targeting may decline as voters split tickets more frequently.

🔴 Critical Thinking: When Does a Model Stop Working? Statistical models are always built on historical data, which means they are built on the assumption that the future will resemble the past in the relevant respects. When this assumption breaks down — in structural shifts like the education-party realignment, or in novel events like a pandemic — the model continues to produce confident-looking outputs even as its underlying assumptions are violated. Recognizing when a model's assumptions are under stress is among the most important and most difficult skills in applied forecasting.


20.15 The Gap Between Map and Territory

The recurring theme of this chapter — and of this section of the textbook — is the gap between the map and the territory. A model is always a simplified representation of reality, not reality itself. The danger is not the simplification, which is necessary and inevitable, but the mistake of treating the map as more accurate than the evidence warrants.

When a model gives a candidate an 87 percent chance of winning and that candidate loses, two questions must be asked. First, was the stated probability calibrated to reality? Events given an 87 percent probability should fail about one time in eight, so a single loss proves nothing. Second, was the probability itself a product of systematic errors that made it look like 87 percent when the reality was closer to 60 percent?
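
The first question can be answered from a track record with a little arithmetic. A sketch with invented numbers: if a model has made twenty calls at roughly 87 percent, a calibrated model should lose each of them with probability 0.13, so the chance of six or more losses can be computed exactly from the binomial distribution.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): exact tail sum."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical track record: 20 calls at 87%, of which 6 went the other way.
print(round(prob_at_least(6, 20, 0.13), 3))  # ~0.037: evidence of miscalibration
```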

The 2016 and 2020 failures involved primarily the second kind of problem — not bad luck but bad measurement that produced overconfident probabilities. The remedy is not to abandon modeling or to retreat to "too close to call" assessments that convey no information. The remedy is the discipline that Vivian Park was practicing that morning in her conference room: rigorous, specific, directional examination of the mechanisms by which a model failed, followed by targeted methodological correction, followed by honest acknowledgment of which failures were correctable and which reflected the irreducible uncertainty of social systems that respond to being measured.

Carlos, looking up from his laptop, asked the question that had been on his mind since the election. "Do we tell the clients exactly what we found? That the error was bigger than our stated margin, and why?"

Vivian replaced the cap on her marker. "We tell them everything. That's what we're paid for."

Trish put her glasses back on. "They're not going to like it."

"They're not going to like the next election night either, if we don't fix it," Vivian said. "That's the choice."


20.16 Conclusion: The Discipline of Being Wrong Well

The examination of forecasting failures in this chapter leads not to nihilism about the possibility of useful prediction but to a more precise understanding of its limits. The errors of 2016, 2020, and 2022 were real, consequential, and partially preventable. They reflected specific, identifiable methodological weaknesses — differential nonresponse, insufficient modeling of correlated state errors, herding, and a sociology of forecasting that reinforces groupthink — that practitioners had the knowledge to address.

The international failures in the UK (2015) and Australia (2019) demonstrate that these problems are not American peculiarities. They are structural features of the modern polling industry, driven by declining response rates, the shift to non-probability online panels, and competitive market dynamics that reward consensus over accuracy.

At the same time, the limits of calibration are genuine. Late-breaking events, structural realignments, and the fundamental uncertainty of close elections cannot be engineered away by better methodology. A 50.1 percent popular vote share cannot be reliably detected by any survey methodology: the 0.2-point margin it implies is an order of magnitude smaller than the sampling noise of even a large poll, before nonsampling error is considered.
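
The arithmetic behind that claim, for an idealized simple random sample (real surveys carry design effects and nonsampling error on top of this, so the sketch understates the problem):

```python
from math import sqrt

n = 2000        # a large national poll
p = 0.501       # true vote share of the narrowly leading candidate
se_share = sqrt(p * (1 - p) / n)   # SE of the estimated share
se_margin = 2 * se_share           # margin = 2p - 1, so its SE is doubled
signal = 2 * p - 1
print(f"signal: {100*signal:.1f} pts, noise (1 SE): {100*se_margin:.1f} pts")
# signal: 0.2 pts, noise (1 SE): 2.2 pts: the noise dwarfs the signal
```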

The appropriate response to this situation is not to claim more precision than the tools warrant, nor to refuse to forecast because forecasting is hard. It is to forecast honestly, to document uncertainty explicitly, to resist the herding incentives that push toward consensus at the expense of accuracy, and to conduct rigorous postmortems when models fail.

In the language of Vivian's four-question framework: know precisely what you predicted, know precisely what happened, know whether the error exceeded expected variance, and know whether the error had a directional structure that points to a fixable mechanism. This discipline does not guarantee that the next model will be right. It does guarantee that the next model will be a little less wrong in the same direction as the last one — which is, in the long run, what improvement in forecasting looks like.


Summary

  • The 2016 polling failure resulted from correlated state errors, education-based nonresponse, late-deciding voters, and inadequate modeling of state-error correlation in probabilistic models.
  • The 2020 cycle featured an even larger raw polling error — the largest since 1980 — driven primarily by partisan nonresponse bias that grew more severe after 2016.
  • The 2022 cycle produced a smaller miss in the opposite direction, suggesting that the error mechanism is not simply persistent Republican underrepresentation but something more dynamic.
  • The UK 2015 failure — the worst in British polling history — was driven by pro-Labour sampling bias, a late swing to the Conservatives, and extensive herding; the subsequent inquiry produced transparency reforms that other countries have not fully replicated.
  • The Australia 2019 failure followed the same structural pattern: non-probability online panels drifting left of the actual electorate, herding around a Labor-leading consensus, and inadequate validation of the new sampling methodology.
  • International failures confirm that the problem is structural — declining response rates, online panel biases, herding incentives — and not specific to any particular political context.
  • The key analytical distinction is between systematic error (directional bias that does not cancel out) and random error (sampling variance addressed by the margin of error).
  • A worked numerical example shows why correlated state errors are so much more dangerous than independent errors: they eliminate the diversification that allows probabilistic models to survive individual state misses.
  • Partisan nonresponse bias resists standard demographic weighting and is partially addressed by weighting on prior vote choice, which introduces its own problems.
  • Herding — the convergence of polling results toward industry consensus — is individually rational but collectively degrades the information value of the polling environment; it is rooted in reputational asymmetries, the sophisticated consensus effect, and media incentives for confident-seeming forecasts.
  • The sociology of forecasting — including post-failure over-correction — creates its own bias dynamics that are distinct from the technical measurement problems.
  • Practical guidelines for consuming forecasts critically include: asking about state error correlation assumptions; distinguishing point estimates from win probabilities; watching for sequential convergence toward consensus; and applying the "model assumptions under stress" check.
  • Genuine postmortems ask specific directional questions about whether errors exceeded expected variance and whether they are consistent with known bias mechanisms.
  • Some forecast errors reflect genuine unpredictability, not methodological failure; maintaining honest uncertainty about model reliability is as important as maintaining uncertainty about electoral outcomes.

Key Terms

Correlated state error — The tendency for polling errors in different states to run in the same direction and by similar amounts due to shared underlying causes. Correlated errors eliminate the error-diversification that probabilistic models rely on, producing win probabilities that are far too extreme.

Systematic error — Bias that produces consistent directional deviation, not corrected by larger sample sizes.

Partisan nonresponse bias — The differential tendency of members of one party to decline survey participation, independent of demographic characteristics.

Herding — The adjustment of polling results toward the industry consensus to avoid publishing outlier estimates, reducing the effective independence of published polls.

Differential nonresponse — The general phenomenon in which certain subpopulations are less likely to respond to surveys, creating unrepresentative samples if weighting does not correct for the disparity.

Calibration — The match between stated probability estimates and actual frequency of outcomes over many predictions.

Effective sample size — The sample size that accounts for the statistical efficiency loss introduced by weighting; always less than or equal to the nominal sample size when weights are applied.
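
The standard approximation here is Kish's formula, n_eff = (sum of weights)² / (sum of squared weights). A minimal sketch, with invented weights:

```python
import numpy as np

def kish_effective_n(weights):
    """Kish's approximation: n_eff = (sum of w)**2 / sum(w**2)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# Invented example: 1,000 respondents, a hard-to-reach 100 weighted up 3x
weights = np.concatenate([np.ones(900), np.full(100, 3.0)])
print(f"{kish_effective_n(weights):.0f}")  # 800: weighting costs 20% of the sample
```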

Sociology of forecasting — The social dynamics — reputational incentives, professional culture, media pressures — that shape how forecasters produce and publish predictions, often introducing herding and over-correction biases independent of measurement quality.