
Learning Objectives

  • Distinguish between probability and nonprobability sampling and explain when each is appropriate
  • Describe simple random, systematic, stratified, and cluster sampling designs and their tradeoffs
  • Explain margin of error intuitively and interpret poll results with appropriate uncertainty
  • Identify sampling frame problems and coverage bias in common polling modes
  • Explain post-stratification weighting, raking, and the conceptual logic of MRP
  • Analyze the implications of declining survey response rates for polling validity
  • Apply sampling concepts to the practical challenge of polling a demographically diverse state
  • Evaluate sample quality in published polls using a systematic checklist

Chapter 8: Sampling: Who Speaks for the Public?

In the fall of 1936, the editors of The Literary Digest were confident. They had conducted the largest presidential preference poll in American history — 10 million postcards mailed, 2.4 million returned — and the results were unambiguous: Alf Landon would defeat Franklin Roosevelt by a comfortable margin. The Digest had done this kind of poll successfully in 1920, 1924, 1928, and 1932. They saw no reason to doubt it now.

On November 3, 1936, Roosevelt won 61% of the popular vote and carried 46 of 48 states. Landon carried two: Maine and Vermont.

The Literary Digest was out of business within a year.

This is the founding cautionary tale of American polling, and it turns on a single question that anyone who commissions or interprets political surveys must be able to answer: Who is in your sample? Because who is in your sample determines what your sample can tell you — and what it cannot.


The Literary Digest Disaster: What Went Wrong

The Digest didn't fail because it didn't have enough respondents. Ten million postcards is an enormous number — far larger than any contemporary political poll. The failure was something more fundamental: the sample was not drawn from the population it claimed to represent.

The Digest built its mailing list from automobile registrations and telephone directories. In 1936, in the depths of the Great Depression, automobile owners and telephone subscribers were disproportionately middle-class and wealthy — people who had survived the economic collapse relatively intact, and who disproportionately favored Landon's small-government platform. The urban poor, the unemployed, the rural working class who were Roosevelt's core supporters — they had neither cars nor phones, and they never received a postcard.

This is called coverage bias: systematic exclusion of population groups from the sample in ways that distort aggregate results. No amount of sample size can fix coverage bias, because the people who are excluded are not randomly distributed in the population. They are concentrated in specific demographic and political groups — and in 1936, those groups broke overwhelmingly for Roosevelt.

The lesson is not "bigger is better." It is "representative is better." A smaller, properly drawn probability sample will outperform a massive biased convenience sample every time.

Trish McGovern keeps a framed copy of the Literary Digest's final poll table on her office wall. When junior staff ask about it, she says: "This is a $2 million mistake. That's about what it cost to mail 10 million postcards in 1936, adjusted for inflation. Which is why I spend time on sampling design instead of trying to hit everyone in the state."


The Logic of Probability Sampling

The Literary Digest failed because it didn't use probability sampling. Let's be precise about what probability sampling means and why it matters.

Probability sampling is any sampling procedure in which every member of the target population has a known, nonzero probability of being included in the sample. The critical word is "known." When selection probabilities are known, we can:

  1. Make valid statistical inferences from sample to population
  2. Calculate the sampling variance (and thus the margin of error)
  3. Weight observations to correct for unequal selection probabilities

When selection probabilities are unknown — as in the Digest's convenience sample — none of these things are possible. We can describe the sample, but we cannot justify inference to any broader population.

Simple Random Sampling

The most conceptually basic form of probability sampling is simple random sampling (SRS): every possible sample of size n has an equal probability of being selected from the population of size N. In practice, this means assigning a random number to every member of the population and selecting the n members with the highest (or lowest) random numbers.
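
A minimal sketch of this procedure in Python, using a hypothetical frame of voter-file record IDs:

```python
import random

# Hypothetical sampling frame: a list of voter-file record IDs (N = 50,000).
frame = [f"VR{i:06d}" for i in range(50_000)]
n = 1_000  # desired sample size

# Simple random sample: every possible sample of size n is equally likely.
# random.sample draws without replacement, which is equivalent to assigning
# each record a random number and keeping the n records with the smallest values.
rng = random.Random(2024)  # fixed seed so the draw is reproducible
srs = rng.sample(frame, n)

print(len(srs), srs[:3])
```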

SRS is the theoretical gold standard that most sampling statistics are derived from. It has one serious practical limitation: it requires a complete, accurate list of the population — a sampling frame — and in most political polling contexts, such a list doesn't exist.

There is no complete list of all likely voters in a Sun Belt state. There is no database containing every registered voter's phone number, updated in real time. Any sampling frame a pollster can actually use will be imperfect, and the imperfections will introduce bias. This is the sampling frame problem, and we'll return to it shortly.

Systematic Sampling

Systematic sampling begins with a sampling frame and selects every k-th element, where k = N/n (population size divided by desired sample size). Start at a random point and proceed at even intervals. This is equivalent to SRS if the frame is randomly ordered, and it's far easier to implement in practice.
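
The same idea as a short code sketch, again with a hypothetical frame:

```python
import random

def systematic_sample(frame, n, rng=random):
    """Take every k-th element of the frame after a random starting point."""
    k = len(frame) // n          # sampling interval k = N / n
    start = rng.randrange(k)     # random start in [0, k)
    return frame[start::k][:n]   # step through the frame at interval k

frame = [f"VR{i:06d}" for i in range(50_000)]
sample = systematic_sample(frame, 1_000)
print(len(sample), sample[:3])
```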

Systematic sampling has a well-known failure mode: if there is a periodic pattern in the sampling frame that corresponds to the selection interval, you will over- or under-sample systematically. The classic example is sampling housing units in a residential survey: if you sample every 10th house and houses are arranged in blocks of 10 with the corner house always being larger and more expensive, you'll oversample large expensive homes or never sample them, depending on where you start.

Political applications of systematic sampling are less vulnerable to periodicity problems, but it's worth verifying that your sampling frame doesn't have hidden patterns that could introduce bias.

Stratified Sampling

Stratified sampling divides the population into subgroups (strata) defined by some characteristic — geography, demographics, political region — and draws independent simple random samples from each stratum.

Why bother? Because stratification reduces sampling variance when the strata are internally homogeneous on the outcome of interest. If you're polling a Sun Belt state and you know that urban, suburban, and rural areas have very different political profiles, stratifying by urban/suburban/rural ensures adequate representation of each type without relying on chance to give you enough rural respondents.

There are two variants:

Proportionate stratification: Sample from each stratum at the same rate, so the sample composition mirrors the population composition. Simple and produces an unbiased estimate without weighting.

Disproportionate stratification: Oversample some strata (typically smaller or more variable ones) to ensure sufficient sample size for subgroup analysis. The tradeoff is that you must weight back to population proportions when estimating overall population quantities.

For a poll of the Garza-Whitfield race, Meridian uses disproportionate stratification: they oversample the state's heavily Latino southern counties to ensure they can analyze Latino voters as a distinct subgroup. When reporting the overall horse-race number, they weight the southern counties back down to their actual share of the likely voter population.
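
A sketch of how disproportionate stratification and the associated design weights might look in code. The strata, frame sizes, and sampling rates below are invented for illustration, not Meridian's actual design:

```python
import random

rng = random.Random(7)

# Hypothetical strata with frame sizes (records per region).
strata = {"southern_metro": 11_000, "northern_cities": 15_500,
          "suburbs": 14_000, "rural": 9_500}

# Sampling rates: oversample the southern metro at twice the base rate.
base_rate = 0.02
rates = {s: (2 * base_rate if s == "southern_metro" else base_rate) for s in strata}

sample, weights = [], {}
for stratum, size in strata.items():
    members = [f"{stratum}-{i:05d}" for i in range(size)]   # stand-in record IDs
    n_s = round(size * rates[stratum])                       # stratum sample size
    for record in rng.sample(members, n_s):
        sample.append(record)
        # Design weight = 1 / selection probability, so the oversampled
        # stratum counts proportionally less in statewide estimates.
        weights[record] = 1.0 / rates[stratum]

print(len(sample), sorted(set(weights.values())))
```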

💡 Intuition: Why Stratification Works

Imagine you're trying to estimate the average height of students in a large university by taking a random sample of 100. If you sample entirely from men's sports teams (by accident), your estimate will be biased upward. If you stratify by men's and women's residence halls and take proportionate samples from each, your estimate will be more accurate with the same sample size — because you've guaranteed that the groups with very different average heights are appropriately represented. Stratification is insurance against the bad luck of getting an unrepresentative sample.

Cluster Sampling

Cluster sampling selects groups (clusters) rather than individuals as the primary sampling units, then samples individuals within selected clusters.

In a classic area probability sample for an in-person survey, you might randomly select counties, then randomly select neighborhoods within those counties, then randomly select households within those neighborhoods, then interview one resident in each selected household. This is a multi-stage cluster sample, and it dramatically reduces the cost of face-to-face interviewing by concentrating interviewers in selected areas.

The statistical cost is that cluster samples are less efficient than SRS or stratified samples. Members of the same cluster tend to be more similar to one another than to the general population — neighbors share demographic and political characteristics. This within-cluster similarity (measured by the intracluster correlation coefficient, or ICC) inflates the variance of your estimates compared to an SRS of the same size. The design effect (DEFF) quantifies this penalty: a DEFF of 1.5 means your cluster sample provides as much precision as an SRS of two-thirds its size.
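
A quick illustration of the penalty, using the standard DEFF formula for equal-sized clusters; the cluster size and ICC below are assumed values:

```python
def design_effect(avg_cluster_size, icc):
    """DEFF = 1 + (m - 1) * ICC for clusters of average size m."""
    return 1 + (avg_cluster_size - 1) * icc

n = 1_200                                     # nominal sample size
deff = design_effect(avg_cluster_size=10, icc=0.05)
effective_n = n / deff                        # precision-equivalent SRS size
print(round(deff, 2), round(effective_n))     # 1.45, 828
```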

For telephone and online surveys, where geographic concentration of interviewers is unnecessary, cluster sampling is less common. But address-based sampling — which draws from the Computerized Delivery Sequence File, a nearly complete list of U.S. residential addresses — often uses cluster elements at the PSU (primary sampling unit) stage.


The Sampling Frame Problem

Probability sampling requires a sampling frame — a list of the population from which the sample will be drawn. In political polling, this requirement creates practical problems that every analyst should understand.

What Frame Do You Use?

The ideal sampling frame for a likely voter poll would be a complete, up-to-date list of everyone who will vote in the upcoming election, with contact information. This doesn't exist. Pollsters must choose from imperfect alternatives:

Registered voter files: Most states maintain public databases of registered voters with names, addresses, and in many cases phone numbers and past voting history. These are excellent frames for registered voter samples — far better than random-digit dialing. They don't cover unregistered voters (a significant share in many states) and may be out of date (voters who moved, died, or were purged).

Random-digit dialing (RDD): Generate random telephone numbers and call them. In principle, this reaches any household with a phone number. In practice, cell phone RDD reaches different populations than landline RDD, and the rapid growth of VoIP numbers has fragmented what was once a relatively clean sampling frame.

Address-based sampling (ABS): Draw from residential mailing addresses. Coverage is excellent — nearly 97% of U.S. households have a postal address — and demographic coverage is much more even across race, education, and income than telephone-based frames. The major limitation is that you can mail questionnaires to an address, but you don't know the phone number or email address of the occupant without additional matching.

Online opt-in panels: Recruit respondents through online advertising, partnerships with websites, or standing opt-in communities. These are convenience samples: respondents self-select into the panel, and their selection probability is unknown and non-uniform. Technically speaking, they are not probability samples.

Coverage Bias in Practice

Here's where the Literary Digest lesson becomes contemporary. Each sampling frame covers the population unevenly:

  • Landline phone frames miss the roughly 60% of American adults in cell-phone-only households, who skew younger and more diverse
  • Cell phone frames have lower response rates among older Americans
  • Online panels miss respondents who lack internet access or are not comfortable online (still about 10-15% of adults), a group that skews older, rural, and lower-income
  • Registered voter files go out of date fastest in high-mobility areas (urban cores, college towns) where people move frequently
  • Spanish-language respondents may be underrepresented in English-only telephone frames

The irony is that the groups with the worst coverage — young people, rural residents, low-income households, non-English speakers — are often the groups whose political behavior is most uncertain and therefore most analytically interesting.

📊 Real-World Application: The 2016 and 2020 Polling Errors

Both the 2016 and 2020 elections produced significant polling errors, particularly in Midwestern states. Post-election analyses identified a common thread: respondents without a four-year college degree — a group that swung heavily toward Trump — were systematically underrepresented in pre-election polls. The reasons are debated: non-college voters may have lower response rates on surveys generally, they may have been concentrated in geographic areas with lower coverage, or they may have been subject to social desirability bias (the "shy Trump voter" hypothesis). The result was a systematic underestimate of Trump's support and overestimate of his Democratic opponents', particularly in the Rust Belt states. AAPOR's post-election reviews concluded that educational attainment was a key weighting variable that many polls were not adequately controlling for.


Margin of Error: What It Means and What It Doesn't

"The poll has a margin of error of plus or minus 3 percentage points." You've read this sentence a thousand times. Do you know what it means?

The Intuition

When you draw a random sample from a population, the sample statistic (e.g., the proportion who prefer Garza) will differ from the true population parameter by some amount due to random sampling variation — even with perfect sampling and perfect measurement. If you repeated the survey many times with different random samples, the distribution of your sample estimates would cluster around the true value, and the margin of error describes the width of that cluster.

More precisely: a 95% confidence interval of ±3 percentage points means that if you repeated this exact sampling procedure an infinite number of times, approximately 95% of the resulting confidence intervals would contain the true population proportion.

What it does not mean:

  • It is not a probability that the true value is within ±3 points of your estimate (a common misinterpretation)
  • It does not account for nonsampling errors (coverage bias, question wording effects, nonresponse bias, weighting errors)
  • It assumes probability sampling — it is technically invalid for nonprobability samples (though many polls report it anyway)

The Math (Kept Accessible)

For a simple proportion estimated from a simple random sample, the margin of error at 95% confidence is approximately:

MOE ≈ 1 / √n

Where n is the sample size. This is the rule of thumb. Let's see it work:

Sample Size     Approximate MOE (95%)
100             ±10%
400             ±5%
1,000           ±3.2%
1,600           ±2.5%
4,000           ±1.6%

The MOE shrinks with the square root of the sample size, not in proportion to it: to cut your MOE in half, you need to quadruple your sample size. This is the law of diminishing returns in sampling, and it explains why you rarely see political polls with samples larger than 2,000 — the precision gain from going from 2,000 to 4,000 respondents is modest relative to the doubled cost.

The formula gives the MOE for the overall sample. MOE for subgroups is larger — proportional to 1/√(subgroup n). If you have 1,000 respondents and want to analyze Latino voters who make up 15% of your sample, you have approximately 150 Latino respondents, for an MOE of roughly ±8%. This is why analysts frequently caution that subgroup findings should be interpreted with care.
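
A small helper that reproduces this rule of thumb for the full sample and for a subgroup (the sizes mirror the example above):

```python
import math

def moe(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from an SRS of size n."""
    return z * math.sqrt(p * (1 - p) / n)

n_total = 1_000
n_latino = round(0.15 * n_total)                   # Latino subgroup, 15% of sample

print(f"Full sample MOE: ±{moe(n_total):.1%}")     # ≈ ±3.1%
print(f"Latino subgroup: ±{moe(n_latino):.1%}")    # ≈ ±8.0%
```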

⚠️ Common Pitfall: Treating Small MOE as a Green Light

A reported MOE of ±3 points accounts only for random sampling error in a perfectly executed probability sample. It says nothing about coverage bias, nonresponse bias, question wording effects, or weighting errors — all of which can introduce systematic errors much larger than ±3 points. Systematic errors don't average out with repeated sampling; they consistently push your estimates in the same wrong direction. The MOE is the floor on your uncertainty, not the ceiling.

Statistical Power and Sample Size

Beyond the overall MOE, analysts care about whether a sample is large enough to detect the differences they care about. Statistical power is the probability that a test will correctly detect a true effect of a given size.

For political polling, typical power calculations concern:

  • Can you detect a 5-point lead for Garza over Whitfield as statistically significant at 95% confidence? (Only marginally with n=1,000 — the MOE on the lead between two candidates is roughly twice the poll's MOE, or about ±6 points)
  • Can you detect a 3-point change from your previous poll? (Not reliably, even with n=1,000 in each wave; the MOE on a change between two independent samples is roughly ±4.4 points)
  • Can you detect a 10-point difference between Latino and non-Latino respondents in Whitfield favorability? (Yes, but only if you have 200+ Latino respondents)

The key insight: the sample size you need depends on the comparison you want to make and the size of the true difference you expect to find. A tracking poll that only needs to report the top-line horse-race can get by with 600 respondents. A poll that needs to analyze three racial/ethnic groups, five regions, and voters by education will need 1,500-2,000 respondents to support those subgroup analyses.


The Square Root Law: Why More Data Yields Diminishing Returns

The 1/√n formula for margin of error is one of the most important — and most frequently misunderstood — relationships in all of applied statistics. Understanding why it takes the form it does is essential to making good sampling decisions.

Where the Square Root Comes From

The standard error of a proportion p estimated from a sample of size n is:

SE = √(p × (1-p) / n)

For a proportion near 0.5 (the maximum uncertainty case, which is what pollsters typically assume), p × (1-p) ≈ 0.25. So the standard error simplifies to:

SE ≈ √(0.25 / n) = 0.5 / √n

The 95% confidence margin of error is approximately 1.96 × SE ≈ 2 × 0.5 / √n = 1/√n.

The key is in the denominator: it's √n, not n itself. To reduce uncertainty by a factor of 2, you must increase sample size by a factor of 4. This is the square root law, and it has profound practical implications.

A Worked Example

Suppose Meridian's first poll of the Garza-Whitfield race had 400 respondents and showed Garza at 48%, Whitfield at 44%, with an MOE of ±5 points. The campaign asks: "If we pay for a bigger poll, how big do we need to get to ±3 points?"

Using MOE ≈ 1/√n:

  • Current MOE: 1/√400 = 1/20 = 5%
  • Target MOE: 3%
  • Required: 1/√n = 0.03, so √n = 33.3, so n = 33.3² ≈ 1,111

To cut MOE from ±5% to ±3%, the sample needs to grow from 400 to about 1,100 — nearly three times larger.

What if the campaign wants ±2%?

  • Required: 1/√n = 0.02, so n = 2,500

Going from ±3% to ±2% requires more than doubling the sample again. The return on investment diminishes rapidly. This is why sophisticated campaign analysts don't simply request "more" polling — they ask what precision they need for the specific comparison they care about, and they specify sample size accordingly.
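
The same arithmetic as a reusable function, using the 1/√n approximation (the small difference from the 1,111 above is just rounding up):

```python
import math

def required_n(target_moe):
    """Sample size needed for a target margin of error, using MOE ≈ 1/√n."""
    return math.ceil(1 / target_moe ** 2)

for target in (0.05, 0.03, 0.02):
    print(f"±{target:.0%} requires n ≈ {required_n(target):,}")
# ±5% requires n ≈ 400
# ±3% requires n ≈ 1,112
# ±2% requires n ≈ 2,500
```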

The Implication: Allocating Budget Across Subgroups

The square root law has a direct implication for how to allocate a fixed polling budget. Suppose Nadia Osei's analytics team needs to track:

  1. The overall horse-race (needs ±3% MOE)
  2. The Latino vote share for each candidate (needs ±6% MOE for that subgroup)
  3. The suburban women's vote (needs ±8% MOE for that subgroup)

If Latinos are 25% of likely voters, then in a 1,200-person poll, the expected Latino sample is 300 — giving a subgroup MOE of about 1/√300 ≈ ±5.8%. That's close enough to the 6% target. Suburban women at 22% of the electorate would yield about 264 respondents, MOE ±6.1%. The overall MOE would be 1/√1200 ≈ ±2.9%.

But what if Nadia also needs to track Black voters separately? At 10% of likely voters, a 1,200-person poll provides about 120 Black respondents — MOE of ±9%. If the team needs ±6% for the Black subgroup, they would need:

  • 1/√n = 0.06 → n = 278 Black respondents
  • At 10% of the sample, that requires a total sample of 2,780

The cost of subgroup precision can be substantial. This is why targeted oversampling of smaller demographic groups — as Meridian does for Latino voters in the southern metro — is a smart allocation of polling budget. You spend more money per interview on the southern metro oversample, but you get dramatically more useful subgroup estimates in return.
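
A sketch of this budget logic: given each subgroup's share of the electorate and its precision target, how large must the total sample be? The shares and targets follow the illustration above; minor differences from the figures in the text are rounding:

```python
import math

def total_n_for_subgroup(subgroup_share, subgroup_target_moe):
    """Total poll size needed so the subgroup alone meets its MOE target (MOE ≈ 1/√n)."""
    subgroup_n = 1 / subgroup_target_moe ** 2      # respondents needed in the subgroup
    return math.ceil(subgroup_n / subgroup_share)  # scale up by the subgroup's share

requirements = {
    "overall (±3%)":        total_n_for_subgroup(1.00, 0.03),  # ≈ 1,112
    "Latino (±6%)":         total_n_for_subgroup(0.25, 0.06),  # ≈ 1,112
    "suburban women (±8%)": total_n_for_subgroup(0.22, 0.08),  # ≈ 711
    "Black voters (±6%)":   total_n_for_subgroup(0.10, 0.06),  # ≈ 2,778
}
# Without targeted oversampling, the required total is set by the most demanding subgroup.
print(max(requirements.values()), requirements)
```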

📊 The Precision Plateau

The table below illustrates the diminishing returns of increasing sample size, using the formula MOE ≈ 1/√n:

Sample Size     MOE (±%)     Gain from doubling previous
100             ±10.0%       —
200             ±7.1%        −2.9 points
400             ±5.0%        −2.1 points
800             ±3.5%        −1.5 points
1,600           ±2.5%        −1.0 points
3,200           ±1.8%        −0.7 points
6,400           ±1.3%        −0.5 points

Each time you double the sample size, the improvement in precision shrinks. Beyond about 1,500 respondents, you are buying very little additional accuracy for a great deal of money — assuming your sampling design is sound. This is the practical reason why most high-quality political polls target 800–1,500 respondents rather than 4,000 or 10,000.


The Cell Phone Transition: What It Taught Us About Coverage

One of the most consequential methodological disruptions in the history of American polling was the rapid shift from landline to cell phone usage in the 2000s and 2010s. Understanding what happened — and what the industry learned from it — is essential context for evaluating contemporary polling methodology.

The Landline Era (Before 2007)

Through the early 2000s, random-digit dialing of landline phones was the workhorse of American survey research. Virtually every American household had a landline, and any landline could be reached by dialing a randomly generated number. Coverage was excellent. Response rates, while declining, were still above 30%.

The sampling logic was simple: generate random 10-digit phone numbers in the format used by residential landlines, call them, and interview any adult who answers. The sampling frame was imperfect but comprehensive enough for most purposes.

The Transition Period (2007–2015)

The problem emerged gradually, then suddenly. By 2007, the National Health Interview Survey estimated that roughly 16% of American adults were in cell-phone-only households. By 2011, that figure had crossed 30%. By 2015, it was approaching 50%.

This mattered because the Telephone Consumer Protection Act of 1991 restricts autodialed calls to cell phones, requiring cell numbers to be manually dialed — a significant cost increase. Many commercial pollsters were slow to incorporate cell phones into their samples because of this cost.

The demographic profile of cell-phone-only households was highly skewed: young adults, renters, lower-income households, and Hispanic households were dramatically overrepresented in cell-only status. A poll conducted only on landlines in 2012 was therefore missing a population whose Obama-Romney vote split differed substantially from that of the landline households it did reach.

Pew Research Center's tracking of the "cell phone problem" during this period is the most systematic documentation of the issue. Their studies consistently showed that cell-only respondents were more Democratic-leaning, more optimistic about the economy, and more ethnically diverse than landline respondents — and that failure to include cell phones introduced a Republican-leaning bias that sometimes offset and sometimes compounded other sources of error.

What the Industry Learned

The cell phone transition taught the polling industry several durable lessons:

Coverage gaps compound politically. The cell-phone-only population wasn't just demographically different from the landline population — it was politically systematically skewed. Coverage bias produces systematic directional error, not random noise.

Methodological laggards get punished. Organizations that were slow to incorporate cell phone sampling — due to cost concerns, institutional inertia, or complacency — produced systematically biased estimates during the transition period. The 2008 and 2012 elections, where cell-only respondents broke heavily Democratic, exposed the laggards.

New modes require new weighting targets. When cell phones were added to landline samples, the combined sample required new weighting approaches. Respondents reached on cell phones vs. landlines are now treated as coming from distinct sampling strata, and their representation is managed through the weighting scheme.

Coverage problems don't stay fixed. The lesson of the cell phone transition is not just "add cell phones to your sample." It is that any dominant sampling method will eventually face a coverage challenge as the population and its communication technology evolve. Today's comparable challenge may be the decline of any reliable phone-based contact frame as younger adults increasingly use messaging apps, internet calling, and other communications infrastructure that has no analogue in the voter file.

⚠️ The Next Coverage Crisis

As of the mid-2020s, cell phone response rates have themselves declined substantially — below 4% in some commercial polls. The next coverage transition may not be to a new telephone mode at all, but to text-to-web or passive data approaches that raise entirely different coverage concerns. The lesson of the cell phone transition is to watch for structural shifts in how target populations communicate, and to assume that any existing sampling frame is aging toward obsolescence.


Weighting: Correcting for What Went Wrong

Even the best-designed probability sample will produce a sample that doesn't perfectly match the target population in demographic composition. Some groups will be overrepresented (because they answer the phone more readily, because they're oversampled by design, because response rates differ); others will be underrepresented. Weighting is the process of adjusting for these imbalances.

Post-Stratification Weighting

Post-stratification is the simplest and most common weighting approach. You compare the demographic composition of your sample to the known composition of the target population (from the Census or voter file), and assign each respondent a weight that increases if their demographic group is underrepresented and decreases if it's overrepresented.

If women are 53% of your target population but 60% of your sample, women are overrepresented. You give each female respondent a weight of 53/60 ≈ 0.88. If men are 47% of the target but 40% of the sample, each male respondent gets a weight of 47/40 ≈ 1.18. After weighting, the sample's gender composition matches the population.

You can post-stratify on multiple variables simultaneously (age, race, education, party registration). The more variables you include, the more precisely your sample will match the population — but the more complex the weighting procedure becomes.
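
The gender example above, expressed in code (the shares are copied from the text; everything else is illustrative):

```python
# Known population shares vs. observed sample shares.
population = {"female": 0.53, "male": 0.47}
sample     = {"female": 0.60, "male": 0.40}

# Post-stratification weight = population share / sample share.
weights = {group: population[group] / sample[group] for group in population}
print(weights)  # {'female': 0.883..., 'male': 1.175}

# Each respondent carries the weight of their group, so weighted tallies
# match the population's gender composition.
```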

Raking (Iterative Proportional Fitting)

When you want to weight on multiple variables simultaneously, raking is the standard algorithm. Also called iterative proportional fitting, raking adjusts weights sequentially across variables — first matching on gender, then on age, then on race, then cycling back to gender — until all the marginal distributions match their population targets simultaneously.

Raking doesn't require you to specify the full joint distribution (e.g., the exact percentage of 35-49-year-old Black women). It only requires marginal distributions (the gender distribution, the age distribution, the race distribution). This makes it practical when the full joint distribution is unknown or when the sample is too small to estimate it reliably.
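
A minimal raking sketch on two margins (gender and age), with invented respondent counts. Production implementations handle many more variables, weight trimming, and convergence diagnostics:

```python
# Respondents cross-classified by gender x age band (counts are invented).
cells = {("F", "18-34"): 40, ("F", "35-64"): 180, ("F", "65+"): 140,
         ("M", "18-34"): 30, ("M", "35-64"): 130, ("M", "65+"): 80}
weights = {cell: 1.0 for cell in cells}

# Population margins the weighted sample should match (proportions).
gender_target = {"F": 0.53, "M": 0.47}
age_target = {"18-34": 0.28, "35-64": 0.48, "65+": 0.24}
total = sum(cells.values())

def adjust(margin_targets, index):
    """Scale cell weights so one margin matches its population targets."""
    for level, target in margin_targets.items():
        current = sum(cells[c] * weights[c] for c in cells if c[index] == level)
        factor = (target * total) / current
        for c in cells:
            if c[index] == level:
                weights[c] *= factor

for _ in range(20):          # cycle through the margins until they stabilize
    adjust(gender_target, 0)
    adjust(age_target, 1)

weighted_f_share = sum(cells[c] * weights[c] for c in cells if c[0] == "F") / total
print(round(weighted_f_share, 3))   # ≈ 0.53 once the margins have converged
```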

Multilevel Regression and Poststratification (MRP)

Multilevel Regression and Poststratification — universally abbreviated as MRP, and sometimes called "Mister P" — is a more sophisticated approach that has become increasingly influential in political polling, especially for estimating opinion in small geographies.

The conceptual logic of MRP has two steps:

Step 1: Build a multilevel regression model. Using survey data, estimate a regression model predicting the opinion of interest (candidate preference, issue support) as a function of individual-level demographic characteristics (age, race, education, gender) and geographic characteristics (state, region, urbanicity). Use multilevel ("hierarchical") modeling to allow the relationship between demographics and opinion to vary across geographic units, borrowing strength from similar states/regions when data is sparse.

Step 2: Poststratify. The U.S. Census provides counts of every combination of demographic characteristics in every geographic unit. Using your regression model, generate a predicted opinion for each demographic cell in each geography. Then weight these predictions by the actual population in each cell and sum up to get the population-level estimate.

The power of MRP comes from step 2. Even if you only interviewed 1,000 people nationally, your model can generate estimates for each of the 435 congressional districts — because the model borrows information from similar districts when any specific district has few respondents. This is how media organizations and academic researchers produce congressional-district-level opinion estimates from national surveys that would otherwise have far too few respondents per district.

For Meridian's Garza-Whitfield poll, Vivian explains MRP to Carlos as follows: "Imagine you only interviewed 20 Latino voters in the southern metro area. That's not enough for a reliable direct estimate. But if you build a model that uses what you know about Latino voters across the whole state — and what you know about how the southern metro area differs from other regions — you can produce an estimate for southern metro Latinos that's better than just averaging those 20 responses."

💡 Intuition: MRP as Informed Extrapolation

MRP is sometimes criticized for "making up" estimates where data is thin. A fairer description is that it's making informed estimates based on systematic patterns in the data — the same thing that any model does. A doctor who doesn't have a direct measurement of your blood pressure can estimate it from your age, weight, and other health indicators. That estimate isn't as good as a direct measurement, but it's far better than a random guess. MRP works similarly: it uses systematic demographic patterns to generate estimates in data-sparse areas, clearly flagged as model-based rather than direct measurements.

A Conceptual MRP Worked Example

To make the mechanics concrete, consider a simplified version of the kind of MRP model Meridian might build for the Garza-Whitfield poll. Suppose they want to estimate Garza's support among three demographic groups (young voters 18-34, middle-aged voters 35-64, and older voters 65+) across four regions (Southern Metro, Northern Cities, Suburbs, and Rural), for a total of 12 demographic-geographic cells.

After conducting their 1,200-person poll, some cells have substantial representation (300 suburban middle-aged respondents) and some are thin (only 30 rural young respondents). For the thin cells, direct estimates are unreliable.

The multilevel regression step estimates how candidate preference varies with age and region while allowing for regional variation. The model might find: Garza does better with young voters everywhere (a demographic effect), but the age gap is larger in the northern cities than in rural areas (an interaction effect captured by the multilevel structure). From these coefficients, the model generates a predicted Garza support level for each of the 12 cells.

The poststratification step takes Census data on how many likely voters fall into each of the 12 cells. If young rural voters are 4% of the electorate and the model predicts 38% Garza support for that group, that cell contributes 0.04 × 0.38 = 0.0152 to the overall Garza estimate. Summing across all 12 cells — weighted by their population shares — produces a demographically adjusted overall estimate.

The result is typically more stable than a simple weighted sample estimate, because the model borrows information across cells rather than treating each cell independently. The tradeoff is that the model introduces assumptions (about how demographic effects combine, about how much geographic variation to allow) that a pure weighting approach does not.
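
The poststratification step of this worked example, in code. In practice the cell predictions would come from the fitted multilevel model; here every number is an invented placeholder except the young-rural cell (4% of the electorate, 38% predicted Garza support) from the paragraph above:

```python
# Model-predicted Garza support and likely-voter share for each
# demographic-geographic cell (all values illustrative, not real estimates).
cells = [
    # (region, age group, predicted Garza support, share of electorate)
    ("rural",           "18-34", 0.38, 0.04),
    ("rural",           "35-64", 0.33, 0.09),
    ("rural",           "65+",   0.30, 0.06),
    ("suburbs",         "18-34", 0.52, 0.07),
    ("suburbs",         "35-64", 0.47, 0.14),
    ("suburbs",         "65+",   0.44, 0.07),
    ("southern_metro",  "18-34", 0.62, 0.06),
    ("southern_metro",  "35-64", 0.57, 0.10),
    ("southern_metro",  "65+",   0.51, 0.06),
    ("northern_cities", "18-34", 0.60, 0.10),
    ("northern_cities", "35-64", 0.53, 0.14),
    ("northern_cities", "65+",   0.46, 0.07),
]

# Poststratified estimate = sum of (cell prediction x cell population share).
shares_sum = sum(share for *_, share in cells)              # should equal 1.0
estimate = sum(pred * share for _, _, pred, share in cells)
print(round(shares_sum, 2), f"Garza statewide estimate ≈ {estimate:.1%}")
```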


Weighting in Practice: Meridian's Sun Belt Challenge

Trish McGovern spread the sample disposition report across the conference table like a battle map. Two days into field, and the sample profile was already showing the expected distortions.

"We're short on young respondents," she said. "18-34s are 19% of our likely voter target, but they're only 11% of our completes so far. We're long on 65-plus — they're 28% of completes but only 22% of target."

Carlos leaned over the table. "Does that affect the horse-race number significantly?"

"Depends on how much age drives candidate preference," Vivian said. "Pull up the last cycle data."

Carlos found the crosstabs from the previous Senate race: younger voters had preferred the Democrat by roughly 15 points more than older voters. "If we're undersampling young voters, we're probably underestimating Garza's support."

"Right," Vivian said. "That's why we weight. But here's the harder problem."

She pointed to the regional breakdown. The southern metro area — heavily Latino, the demographic group most central to Garza's base — was 14% of completes. Their target weight from the voter file was 22%. An 8-percentage-point gap.

"We're not reaching Latino respondents at the same rate as Anglo respondents," Trish said. "Partly it's the phone frame — Latino households are more likely to be cell-phone-only, and our cell rate is lower. Partly it's language — our interviewers are calling in English first, and some Latino households are hanging up before we get to the Spanish introduction."

This is the sampling challenge that defines modern Sun Belt polling: the electorate is diverse, differential response rates by race and ethnicity introduce systematic bias, and any weighting correction depends on accurate population benchmarks that may themselves be uncertain.

The Weighting Decision

Meridian's weighting scheme for the Garza-Whitfield poll includes the following targets, drawn from a combination of the voter file, recent Census data, and the state election authority's projected turnout estimates:

Variable                     Target (%)
Age: 18-34                   19
Age: 35-49                   25
Age: 50-64                   32
Age: 65+                     24
Gender: Female               53
Gender: Male                 47
Race: White non-Hispanic     58
Race: Latino                 25
Race: Black                  10
Race: Other                  7
Education: No college        38
Education: Some college      27
Education: College+          35
Region: Southern metro       22
Region: Northern cities      31
Region: Suburbs              28
Region: Rural                19

"The hardest one is education," Carlos says, running the raking algorithm. "Every time I add education weight, the other margins shift."

"That's why we rake," Vivian says. "Run it until convergence. Usually takes about 20 iterations."

After 23 iterations, the raked sample is within 0.3 percentage points of every target margin. The horse-race number shifts from 47% Garza / 45% Whitfield in the unweighted data to 49% Garza / 44% Whitfield after weighting — widening Garza's lead from 2 points to 5, driven primarily by reweighting upward for younger and Latino respondents.

"That's a material change," Carlos notes. "If we hadn't weighted, we'd have reported a tighter race."

"We'd have been wrong," Trish says simply.

⚖️ Ethical Analysis: Who You Weight For Is a Political Choice

Weighting targets are not neutral technical decisions. When Meridian decides to weight the sample to 25% Latino representation, they are making a claim about who constitutes the relevant public for this poll. Should they weight to the adult population? The registered voter population? The likely voter population? Each choice reflects a different theory of democratic inclusion.

If they weight to the likely voter population (which is whiter and older than the registered voter population), they are producing the most accurate election prediction — but they are also reporting "public opinion" in a way that systematically underweights groups whose voices are suppressed by structural barriers to turnout. If they weight to the adult population, they capture a more inclusive picture of public preferences — but they will likely overestimate Garza's support because her coalition includes many unlikely voters.

Neither choice is wrong. Both choices should be disclosed. And the tension between election prediction and democratic representation should be explicitly acknowledged.


International Sampling Challenges: Beyond the American Context

The sampling challenges described so far are real, but they are tractable by global standards. The United States has several structural advantages that make representative polling easier than in most other democracies: a well-maintained postal address database, relatively complete (if imperfect) state voter files, advanced telecommunications infrastructure, and a large commercial survey industry that has invested decades in methodology development.

Sampling Frame Availability in Comparative Context

The choice between voter file sampling, RDD, ABS, and online panels — which American pollsters navigate as a menu of imperfect but available options — doesn't exist in the same form in many countries.

Voter registration vs. census-based frames: In the United States, voter registration is opt-in, meaning the voter file is an incomplete record of the eligible electorate. In many European and Latin American countries, voter registration is mandatory and administered by the state, making the voter register a near-complete census of the eligible electorate. This is a significant advantage for sampling frame construction. The UK's electoral register, Germany's Einwohnermelderegister (resident registration), and Brazil's electoral justice database all provide starting points for sampling that are substantially more complete than U.S. state voter files.

The limitation is that mandatory registration databases are not always publicly accessible for sampling purposes. German law restricts commercial use of residence registration data. French electoral lists are maintained locally and not centralized in a nationally accessible database. Brazilian microdata from the electoral justice registry has been made available to researchers in structured form, but accessing it requires navigating a federal bureaucracy with its own procedural requirements.

Linguistic diversity: The United States has a dominant-language majority (English) with a substantial Spanish-speaking minority that commercial pollsters have largely learned to accommodate. Many other democracies have more complex linguistic environments that create qualitatively different sampling challenges. India has 22 constitutionally recognized languages and hundreds of regional dialects; constructing a nationally representative sample requires instruments in multiple languages and interviewers who can conduct interviews in each. Switzerland conducts federally mandated surveys in four national languages (German, French, Italian, and Romansh). Indonesia has more than 700 languages. In each context, the sampling design must account for linguistic access — which groups can be reached by an instrument in which languages — in ways that go beyond the English-Spanish consideration familiar to American pollsters.

Infrastructure gaps: Random-digit dialing of landlines, which was the American polling workhorse for thirty years, requires that target respondents have landline telephone service. In many Sub-Saharan African countries, landline penetration remains below 5%, and cellular coverage, while expanding, is uneven across geographic areas. Online survey methods require internet access that in many developing-country contexts is concentrated in urban areas and higher-income households. These infrastructure gaps create systematic coverage bias that cannot be corrected through weighting if the uncovered population has no equivalent-quality contact channel.

The Special Problem of Authoritarian and Semi-Authoritarian Contexts

In consolidated democracies, the main threats to valid sampling are technical: incomplete frames, differential response rates, coverage gaps. In electoral authoritarian or semi-authoritarian contexts — where competitive elections occur but the political environment is significantly less free — sampling faces a qualitatively different challenge: respondent self-censorship.

When citizens in a semi-authoritarian context believe that their survey responses might be identifiable or might have consequences for their safety, freedom, or employment, they may systematically distort their responses in the direction they believe the authorities prefer. This is not the social desirability bias familiar from Western survey research — it is genuine fear-driven misrepresentation.

The practical implications are severe: no amount of sampling design improvement can correct for self-censorship that the researcher cannot measure. Techniques like the list experiment (Chapter 7) may partially address this, but they require respondents to trust that even the indirect measurement is anonymous — a trust that may be difficult to establish in environments where surveillance is common.

Academic researchers working in such environments have developed specialized protocols: interviewer training that emphasizes anonymity protections, interview locations chosen to minimize visibility, informed consent procedures that explicitly address legal protections, and analysis approaches that attempt to bound the extent of self-censorship without being able to directly measure it.


Declining Response Rates: The Crisis in Modern Polling

The most persistent structural challenge facing contemporary political polling is the collapse in survey response rates. In the 1970s and 1980s, a well-conducted telephone survey could achieve response rates of 60-80%. By 2000, response rates had fallen to 35-40%. By 2020, AAPOR estimated that the response rate for a typical commercial telephone poll was below 6%.

What Does 6% Mean?

A 6% response rate means that for every 100 people a pollster attempts to contact, 94 hang up, screen their calls, or otherwise refuse to participate. The 6 who do respond are the basis for the published poll.

This would be fine if the 6% were a random subset of the 100% — if refusal to respond were randomly distributed across the population. But it is not. Response propensity — the likelihood of completing a survey — is correlated with a range of demographic and political characteristics:

  • Older respondents are more likely to complete telephone surveys than younger ones
  • Higher-income and higher-education respondents are more likely to complete surveys
  • Politically engaged respondents are more likely to complete political surveys
  • Some research suggests that partisans are differentially responsive during periods of high political salience for their side

If the respondents who complete a survey differ systematically from those who don't, the resulting sample is nonresponse biased — it misrepresents the population not because the sampling frame was wrong, but because the fraction of the frame that responded was unrepresentative.

Is Polling Dead?

This is a reasonable question, and it deserves a nuanced answer.

Polling is not dead, but it has changed fundamentally. The traditional model — random-digit-dial telephone surveys achieving representative response rates — is essentially defunct. What has replaced it is a more complex, and frankly more uncertain, combination of:

  • Online opt-in panels with sophisticated statistical adjustments (including MRP)
  • Text-to-web approaches that use cell phone numbers to recruit online survey completers
  • Address-based sampling that recruits through the mail, with web or phone completion options
  • Registered voter file matching that combines phone outreach with statistical adjustment using the voter file

Each approach has different coverage and nonresponse properties. None is clearly superior to the others across all contexts. The honest assessment is that contemporary polling has more uncertainty than the reported MOE implies, because nonsampling errors — particularly nonresponse bias — are not captured in the MOE calculation.

🔴 Critical Thinking: The Herding Problem

In the final weeks of an election cycle, you will often notice that polls from different organizations cluster around similar numbers — much more than independent probability samples should. This is "herding" or "house effects dampening," and it happens because pollsters adjust their methods (often unconsciously) to avoid being the outlier. An outlier who is right is merely vindicated; an outlier who is wrong is professionally damaged. This incentive to cluster creates a systematic bias toward consensus that can cause the entire polling industry to miss simultaneously. 2016 and 2020 are cases where substantial herding preceded a systematic industry-wide miss. When all polls show a race at 52-44, be appropriately skeptical — that level of consensus may reflect social dynamics among pollsters as much as the true state of the electorate.

The Nonresponse Bias Research

The good news from recent methodological research is that high nonresponse rates do not automatically produce high nonresponse bias. Whether bias is large depends on whether the characteristic that makes people respond is correlated with the outcome of interest.

For some outcomes — factual knowledge, consumer behavior — response propensity is relatively weakly correlated with the outcome, and even low-response surveys produce reasonable estimates. For other outcomes — political engagement, candidate preference in some contexts — response propensity is more strongly correlated, and low-response surveys can be substantially biased.

Pew Research Center's work on this question showed that phone surveys with a 9% response rate and a 36% response rate produced similar demographic profiles after weighting, and similar top-line political opinion distributions. This was taken as cautiously reassuring — but it applies to national surveys. District-level and state-level polls, with smaller sample sizes and more variable political geographies, may be more vulnerable to nonresponse bias even when national surveys are not.


Probability vs. Nonprobability Sampling: The Modern Debate

This chapter has presented probability sampling as the gold standard. But the contemporary reality is that most commercial political polling in the United States uses nonprobability samples — specifically, online opt-in panels. These are convenience samples: respondents self-select into the panel, and their selection probability is unknown and unequal.

How do we know this? Because no one tells them "you have a 1-in-1,000 chance of being selected." They see an ad, click on it, join a panel, and take surveys in exchange for small rewards or sweepstakes entries. They are disproportionately internet-engaged, survey-taking, politically attentive adults. They are not a random slice of the electorate.

Can Nonprobability Samples Be "Fixed"?

The argument for nonprobability panels is that aggressive weighting — particularly MRP — can adequately correct for their unrepresentativeness. If you weight to known population benchmarks and use a strong model, a nonprobability sample can approximate what a probability sample would show.

The counter-argument is that weighting can only correct for observed imbalances. If the sample is systematically different from the population in ways that are not captured by age, gender, race, education, and geography — if, say, online panelists are systematically more politically cynical or more engaged with electoral politics than the general population — weighting cannot correct for that, because you can't measure something you don't know to adjust for.

The methodological consensus, such as it is: well-weighted nonprobability samples can do a reasonable job for many research purposes, but they are more vulnerable to systematic error than well-executed probability samples. The appropriate response is not to refuse to use nonprobability data, but to be transparent about it and to report results with uncertainty ranges that reflect the broader range of methodological risk.

🌍 Global Perspective: Sampling in Non-Western Democracies

The sampling challenges described in this chapter are relatively tractable by global standards. The United States has a well-maintained postal address database, relatively complete voter files, advanced telecommunications infrastructure, and — despite low response rates — a large commercial survey industry. In many democracies, polling faces more fundamental challenges: incomplete or inaccurate voter rolls, high rates of mobile-only households without stable numbers, linguistic diversity that requires multilingual instruments and interviewers, and lower levels of trust in survey research that suppress response rates further. In some authoritarian-adjacent contexts, respondents may be genuinely afraid to express political opinions to interviewers. Global political researchers must adapt their sampling strategies to local infrastructure and institutional constraints — strategies developed for American polling may fail catastrophically in contexts with very different baseline conditions.


Evaluating Sample Quality in Published Polls

Given all of the above, how should an analyst evaluate the sampling quality of a poll they encounter in the wild — a poll released by a campaign, a media organization, or a polling firm they have not previously evaluated?

A systematic checklist helps. The following questions correspond to the major sources of sampling error described in this chapter.

The Sample Quality Checklist

Sampling frame:
- [ ] Is the sampling frame described? (voter file, RDD, ABS, online panel?)
- [ ] Is the frame appropriate for the target population? (A poll of likely voters should not use a general adult panel without adjustment)
- [ ] Are obvious coverage gaps acknowledged? (cell-phone-only households, non-English speakers)

Sample design:
- [ ] Is the sampling method described? (probability vs. nonprobability; stratification design if any)
- [ ] If the sample is nonprobability (online opt-in), is it disclosed as such?
- [ ] Is the sample size reported for both the full sample and the relevant subgroups?

Response rate and field period:
- [ ] Is the response rate reported, or at least approximated?
- [ ] Is the field period disclosed? (Multi-day polls provide more stability than single-day polls)
- [ ] Is the mode described? (phone, online, text-to-web, ABS mail)

Weighting:
- [ ] Are the weighting variables and targets described?
- [ ] Is the weighting target population appropriate? (Adult population vs. registered voters vs. likely voters)
- [ ] Are the weighting sources cited? (Census, voter file, election authority projections)

Uncertainty:
- [ ] Is the margin of error reported?
- [ ] Does the MOE appropriately account for the design effect (if stratified or clustered sampling was used)?
- [ ] Are subgroup results reported with their own (larger) margins of error?
- [ ] Does the methodology statement acknowledge nonsampling sources of uncertainty?

Red flags:

Several features of a poll's methodology statement should raise concern:

  • No description of the sampling frame or mode
  • Online opt-in panel disclosed but no description of weighting approach
  • MOE reported as ±3% for a sample of 500 respondents (should be ±4.5%)
  • Subgroup results from a sample of 60-80 respondents treated as reliable without caveat
  • Field period of one day (highly vulnerable to day-specific events distorting results)
  • Weighting targets not disclosed (or weighting to undisclosed proprietary targets)
  • No disclosure of who commissioned and paid for the poll

Best Practice: Read the Methodology Statement First

When evaluating a poll result, read the methodology statement before reading the topline numbers. Understanding how the sample was constructed should precede interpretation of what the sample found. Most reputable polling organizations now post methodology statements on their websites and include them in their topline data releases. If no methodology statement is available, treat the poll's results with substantial skepticism regardless of what the numbers show.


Practical Sampling Design: Putting It Together

When Meridian designs the sampling strategy for the Garza-Whitfield poll, here is the actual decision sequence:

Step 1: Define the target population. Likely voters in the state's November Senate election. "Likely voters" is defined as registered voters with a vote propensity score of 50 or higher (based on past voting history), or registered voters who self-report certain intent to vote and have voted in at least one of the last two general elections.

Step 2: Choose the sampling frame. The state's registered voter file, matched to phone numbers using commercial list-matching services. Approximately 70% of registered voter records can be matched to at least one phone number (cell or landline).

Step 3: Choose the sampling method. Disproportionate stratified sampling: oversample southern metro counties (where Latino voters are concentrated) at 2x the proportionate rate, to ensure sufficient Latino subgroup sample for analysis.

Step 4: Determine sample size. Target overall n=1,200 completes. At 2:1 oversample of southern metro, target 350 southern metro completes and 850 from the rest of the state. After weighting, effective n for key subgroups: approximately 200 Latino respondents (MOE ±7%), 120 18-34 year olds (MOE ±9%).

Step 5: Design weighting scheme. Rake on age, gender, race/ethnicity, education, and region to targets drawn from voter file characteristics and Census ACS estimates.

Step 6: Plan for nonresponse. Expect a completion rate of 4-6% from the phone frame. Build sufficient contact attempts (minimum 6 per number) to maximize response. Track response disposition carefully. Compare the profile of early and late responders to estimate nonresponse bias.

Step 7: Report sampling design transparently. The published methodology statement will include: the sampling frame (voter file, matched phone numbers), the sampling method (disproportionate stratified), the unweighted and weighted sample sizes, the weighting variables and targets, the field dates, the mode (telephone, live interviewer, bilingual), and the response rate.

Trish summarizes her philosophy: "There's no perfect sample. There's a well-documented sample and an under-documented sample. We document everything, we're honest about the limitations, and we let the results speak — with appropriate caveats."


Who Gets Counted, Who Gets Heard

The title of this chapter asks: who speaks for the public? After working through the theory and practice of sampling, we can give a more precise answer: the public, in any given poll, is the population whose selection probability was nonzero and whose response probability was high enough to produce usable data.

This is not a neutral, value-free technical determination. It is a political fact with political consequences.

Groups with lower response rates, less stable phone numbers, less internet access, and more geographic isolation are systematically underrepresented in the polling that shapes campaign strategy, media narratives, and political conversation. When politicians and journalists talk about "what the public thinks," they are almost always talking about what a particular, advantaged subset of the public says when asked in a particular way.

This doesn't mean we should stop polling. It means we should be persistently, actively honest about whose voices are most clearly heard in our data — and whose are muted. It means investing in sampling methods (ABS, Spanish-language instruments, multilingual interviewers, MRP) that reach historically undersampled populations. It means reporting subgroup results with appropriate uncertainty ranges, rather than pretending that 80 Latino respondents in a national survey give us precise knowledge of Latino political preferences.

And it means, above all, resisting the temptation to treat a poll result as a simple, unambiguous fact about public preferences — when it is, at best, a carefully constructed, inherently imperfect approximation of a diverse and dynamic reality.

🔴 Critical Thinking: The Self-Fulfilling Prophecy of Likely Voter Screens

Likely voter models, by definition, predict who will vote based on past voting behavior. This creates a feedback loop: past voters are predicted to vote again, campaigns target them, their turnout is reinforced. Non-voters are predicted not to vote, campaigns don't target them, their non-voting is reinforced. The "likely voter" construct — a methodological tool — becomes a self-fulfilling prophecy that can entrench the existing electorate's demographic composition rather than reflecting any fixed natural fact about who will vote. In a close race, where mobilizing infrequent voters could be the difference, likely voter models that rely heavily on past history may systematically undercount the very voters whose mobilization could matter most.


Summary

The story of the Literary Digest is more than historical curiosity. It is a permanent warning about the relationship between sampling method and inference quality. A large, biased sample is not better than a smaller, representative one. The origin of your data — who was included, with what probability, responding at what rate — determines the validity of every inference you draw from it.

Probability sampling provides the theoretical foundation for valid inference: when every member of the population has a known, nonzero probability of selection, we can calculate margins of error and make justified inferences from sample to population. Stratified and cluster designs extend this foundation to practical settings where cost constraints and subgroup analysis requirements shape the design.

The margin of error follows the square root law: cutting it in half requires quadrupling the sample size. This creates powerful diminishing returns that explain why most high-quality polls target 800–1,500 respondents rather than trying to achieve the precision of a census. The practical implication is that budget allocation should be driven by the specific precision requirements of the analysis — subgroup comparisons require oversampling of smaller groups, and that investment should be planned explicitly.

The sampling frame problem and the response rate crisis are the twin challenges of contemporary political polling. No frame perfectly covers the target population, and the fraction of people who respond to surveys has fallen so dramatically that nonsampling error now arguably exceeds sampling error as the primary threat to poll validity. The cell phone transition of the 2000s and 2010s illustrated how coverage gaps can compound politically — systematically biasing estimates in directions that mirror the demographic skew of the uncovered population.

The tools for managing these challenges — post-stratification weighting, raking, MRP — are sophisticated and improving. But they are not magic. They can correct for imbalances we can observe and measure; they cannot correct for unmeasured sources of systematic error.

International polling faces additional challenges that American analysts often underestimate: incomplete or politically restricted sampling frames, linguistic diversity that confounds instrument design, infrastructure gaps that exclude whole populations from phone- and internet-based contact, and in some contexts, genuine fear-driven self-censorship that no weighting approach can remedy. Sampling methodology must be adapted to local context, not imported wholesale from one electoral environment to another.

The fundamental question — who speaks for the public? — has no clean technical answer. It is ultimately a question about democratic theory as much as about statistics: whose voices should count in our assessments of public preference, and what methodological choices best serve that value? Honest pollsters hold both the technical and the democratic dimensions of that question simultaneously, designing the best sample they can while being transparent about its limitations.

Vivian Park, presenting the Garza-Whitfield methodology to Nadia Osei's analytics team, ended with this: "We're not telling you what the state thinks. We're telling you what a well-constructed sample of likely voters said when we asked them these questions, in these ways, over these three days. That's valuable — it's the best available approximation of reality we have. But it's an approximation. Don't treat it as a verdict."

Nadia looked up from the methodology statement. "So what do we do with the margin of error?"

"You treat it as the floor on your uncertainty," Vivian said. "Not the ceiling."