Case Study 2: YouGov and the Non-Probability Revolution
A Radical Proposition
In the early 2000s, as the telephone polling industry was beginning to grapple with declining response rates and rising costs, a British political scientist named Stephan Shakespeare co-founded an internet-based polling company called YouGov. The firm's proposition was radical: you did not need probability sampling to produce accurate polls. You just needed a large enough online panel, good statistical models, and the right weighting techniques.
The idea was heretical to the survey methodology establishment. Probability sampling was the gold standard, the hard-won lesson of the 1936 and 1948 debacles. The entire statistical framework of margins of error and confidence intervals depended on the assumption that respondents were randomly selected. Abandoning probability sampling, critics argued, was a return to the dark ages of the Literary Digest---trading methodological rigor for convenience and cost savings.
Yet YouGov's approach worked better than its critics predicted. In the 2005 UK general election, YouGov's final poll was closer to the actual result than most traditional telephone polls. By the 2010s, YouGov had become one of the most prolific polling organizations in the world, conducting surveys in dozens of countries on topics ranging from elections to consumer preferences to social attitudes. In the United States, YouGov became the polling partner of choice for major academic surveys, including the Cooperative Election Study (CES), one of the largest and most important surveys of the American electorate.
How It Works
YouGov's methodology is built on three pillars: a massive opt-in panel, a statistical technique called "sample matching," and extensive post-survey weighting.
The Panel. YouGov maintains online panels of millions of members who have signed up to take surveys, typically in exchange for small payments or entries into prize drawings. These panelists are not randomly selected; they are volunteers who found YouGov through advertisements, social media, word of mouth, or other non-random channels.
Sample Matching. When YouGov conducts a poll, it does not simply invite a random subset of its panel. Instead, it uses a technique called sample matching. First, the firm obtains a probability-based sample of the target population (e.g., from the American Community Survey or the Current Population Survey). Then, for each person in the target sample, it finds the closest match in the YouGov panel---the panelist whose demographic characteristics most closely resemble those of the target respondent. This matched sample is then invited to complete the survey.
The logic is intuitive: if the target sample is representative of the population, and if the matched panelists closely resemble the target respondents, then the resulting survey sample should approximate a probability sample---without the cost and difficulty of actually conducting probability-based recruitment.
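The core of sample matching can be sketched in a few lines of code. The sketch below is purely illustrative, not YouGov's proprietary algorithm: it uses a greedy nearest-neighbor search over a handful of hypothetical demographic fields (`age_group`, `sex`, `education`), and scores closeness by simply counting matching attributes. Real implementations use many more variables and richer distance functions.

```python
def match_sample(target, panel, keys=("age_group", "sex", "education")):
    """Greedy nearest-neighbor matching without replacement.

    target: probability-based target sample (list of dicts of demographics)
    panel:  opt-in panelists (list of dicts with the same keys)
    Returns one panelist per target respondent, each used at most once.
    """
    used = set()
    matched = []
    for person in target:
        best_i, best_score = None, -1
        for i, panelist in enumerate(panel):
            if i in used:
                continue
            # Closeness score: number of demographic attributes that agree.
            score = sum(panelist[k] == person[k] for k in keys)
            if score > best_score:
                best_i, best_score = i, score
        used.add(best_i)
        matched.append(panel[best_i])
    return matched
```

The key design assumption, as the text notes, is that matching on *observed* demographics also balances the *unobserved* traits that drive survey responses; if panelists differ from the population in ways the matching variables do not capture, the matched sample inherits that bias.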
Weighting. After data collection, YouGov applies extensive statistical weights to adjust the sample to match known population parameters: age, sex, race, education, geography, past vote choice, and other variables. This weighting corrects for remaining imbalances between the sample and the target population.
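One common family of weighting techniques is raking (iterative proportional fitting), which repeatedly scales respondent weights so that the weighted sample matches known population margins one variable at a time. The minimal sketch below is a generic illustration of raking, not YouGov's specific procedure; the variable names and target shares are hypothetical.

```python
def rake(sample, targets, max_iter=100, tol=1e-6):
    """Iterative proportional fitting (raking) over categorical margins.

    sample:  list of respondent dicts, e.g. {"sex": "F", "age_group": "65+"}
    targets: {variable: {category: population share}}, shares summing to 1
    Returns one weight per respondent.
    """
    weights = [1.0] * len(sample)
    for _ in range(max_iter):
        max_change = 0.0
        for var, margins in targets.items():
            total = sum(weights)
            for cat, share in margins.items():
                idx = [i for i, r in enumerate(sample) if r[var] == cat]
                current = sum(weights[i] for i in idx) / total
                if current > 0:
                    # Scale this category so its weighted share hits the target.
                    factor = share / current
                    max_change = max(max_change, abs(factor - 1.0))
                    for i in idx:
                        weights[i] *= factor
        if max_change < tol:
            break
    return weights
```

For example, a sample that is 75 percent male against a 50/50 population target would have its male respondents down-weighted and its female respondents up-weighted until the weighted shares match. Note that weighting can only correct imbalances on variables the pollster measures and adjusts for; it cannot fix selection bias on unmeasured traits, which is precisely the critics' concern.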
The Debate
YouGov's success has fueled an intense and unresolved debate in the survey methodology community. The debate centers on a fundamental question: can statistical modeling substitute for random selection?
The case for YouGov's approach. Proponents argue that in an era of 3-to-5-percent response rates, traditional telephone polls are no longer truly "probability samples" in any meaningful sense. When 95 percent of the people you try to reach refuse to participate, the resulting sample is effectively self-selected, regardless of how the initial contacts were generated. If both telephone and online polls rely on self-selected respondents, the argument goes, then the relevant comparison is not between "probability" and "non-probability" methods but between different approaches to adjusting self-selected samples---and YouGov's approach, which includes explicit matching and extensive weighting, may be as good as or better than the implicit adjustments made by telephone pollsters.
Proponents also point to empirical evidence. In head-to-head comparisons across multiple election cycles, YouGov's polls have generally performed comparably to high-quality telephone polls---sometimes better, sometimes worse, but within the same range of accuracy. The firm's polls are not systematically biased in one political direction, which would be the expected signature of the kind of uncontrolled selection bias that plagued the Literary Digest.
The case against. Critics argue that the theoretical foundations of YouGov's approach are shaky. Probability sampling allows for the calculation of valid uncertainty estimates because the selection mechanism is known and random. With non-probability samples, the selection mechanism is unknown---you do not know why someone joined the YouGov panel, and you cannot quantify the ways in which panelists differ from non-panelists. This means that the margins of error reported for YouGov polls are not derived from probability theory but from modeling assumptions that may or may not be correct.
Critics also worry about the long-term reliability of the approach. YouGov's polls have performed well in recent elections, but so did the Literary Digest's polls---until they didn't. The biases in a non-probability sample may be latent, manifesting only when political conditions change in ways that correlate with the selection mechanism. For example, if YouGov panelists are systematically more politically engaged than non-panelists, this bias might not matter in elections with typical turnout patterns but could produce significant errors in elections with unusual mobilization dynamics.
Implications for Meridian Research Group
The YouGov debate is not abstract for Meridian Research Group. Vivian Park has been watching the rise of online polling with a mixture of professional respect and methodological anxiety.
On one hand, she recognizes that the economics of polling have shifted decisively. A telephone poll of 1,000 likely voters costs approximately $40,000 to $60,000 and takes seven to ten days to complete. An online panel poll of similar size costs $5,000 to $10,000 and can be completed in two to three days. For media clients with limited budgets, the cost differential is often decisive.
On the other hand, Vivian is unwilling to abandon the principles she learned from studying polling history. "The Literary Digest had millions of respondents," she reminds Carlos during a methodological discussion. "YouGov has millions of panelists. The numbers are different, but the underlying question is the same: are the people you can reach representative of the people you cannot?"
Meridian's compromise is the mixed-mode approach described in this chapter: combining probability-based telephone interviews, text-to-web surveys, and probability-based online panel respondents. This is more expensive than a pure online panel approach, but Vivian believes it provides a better foundation for accuracy.
"I am not saying YouGov is wrong," she tells her team. "I am saying we do not yet know the conditions under which their approach fails. And until we do, I want a backup plan."
The Broader Lesson
The YouGov case illustrates a recurring pattern in polling history: a new methodology challenges the established orthodoxy, achieves early successes, and forces the profession to rethink its assumptions. Whether the non-probability revolution will ultimately be vindicated---or will produce its own version of the Literary Digest disaster---remains to be seen.
What is certain is that the debate is not purely technical. It is about the values that underpin survey research: transparency, accountability, and the commitment to representing the full population, not just the people who are easiest to reach. These values are at the heart of Vivian Park's approach to polling, and they are at the heart of this book.
Discussion Questions
- The chapter argues that low response rates have made traditional telephone polls "effectively self-selected." Do you agree? Is there a meaningful difference between a 4-percent-response-rate telephone poll and a volunteer online panel, or are they both forms of convenience sampling?
- YouGov's sample matching technique attempts to create a pseudo-probability sample by matching panelists to a probability-based target. What assumptions does this technique rely on? Under what conditions might those assumptions fail?
- Vivian Park's concern is that "we do not yet know the conditions under which their approach fails." Is this a legitimate reason to prefer more expensive probability-based methods, or is it an overly cautious response to a methodology that has demonstrated empirical accuracy?
- The cost difference between telephone and online polling is roughly tenfold. What are the implications of this cost difference for the polling ecosystem? If only well-funded organizations can afford telephone polling, how might this affect the diversity and quality of public opinion data?
- If you were advising a news organization that wanted to commission polls of the Garza-Whitfield race on a limited budget, would you recommend a YouGov-style online panel, Meridian's mixed-mode approach, or some other methodology? Justify your recommendation.