
> "A survey instrument is a promise. The field operation is how you keep it — or break it."

Learning Objectives

  • Compare the coverage, cost, and quality trade-offs of CATI, online, mail, IVR, SMS, and face-to-face modes
  • Calculate and interpret AAPOR Response Rates 1 through 6
  • Explain the causes and consequences of the long-term decline in response rates
  • Distinguish random nonresponse from systematic nonresponse bias and describe evidence from callback studies
  • Identify panel conditioning effects and interviewer effects and explain how each distorts data
  • Outline the logical workflow from raw survey data to a clean, weighted, codebook-documented dataset

Chapter 9: Fielding and Data Collection

"A survey instrument is a promise. The field operation is how you keep it — or break it." — Trish McGovern, Senior Field Director, Meridian Research Group

Trish McGovern has been running political surveys since before Carlos Mendez was born. She started her career in the 1990s, supervising phone banks out of a converted warehouse in suburban Columbus where interviewers sat at rows of beige cubicles dialing from paper sample sheets. Back then, if you called a household number, someone answered. Response rates of 50, 60, sometimes 70 percent were achievable. The data felt solid in a way that's difficult to recapture now.

Today Trish manages a dramatically different operation. When Meridian lands a major statewide poll — say, tracking the Garza-Whitfield Senate race in a competitive swing state — she coordinates a simultaneous multi-mode effort: an online panel sample, a random digit dialing (RDD) phone component, and sometimes an SMS push to cell-phone-only households. She watches response rate dashboards on two monitors. She makes calls at midnight checking whether quotas in rural counties are filling. She argues, gently but firmly, with clients who want a thousand-person sample delivered in four days because they saw a competitor turn one around in 72 hours.

"Speed is the enemy of accuracy," Trish tells Carlos on his first week. "Everyone wants their poll yesterday. My job is to make sure we still have something worth reading when they get it."

This chapter follows Trish's operation to understand what actually happens between "we have a questionnaire" and "we have a dataset." The journey is messier, more expensive, and more consequential for data quality than most consumers of poll results ever imagine.


9.1 The Survey Mode Landscape

Survey methodology is not a single technology. It is an ecosystem of modes — each with its own coverage properties, cost structure, speed characteristics, and sources of bias. Modern pollsters rarely deploy just one mode. Understanding why requires understanding each mode on its own terms.

9.1.1 Computer-Assisted Telephone Interviewing (CATI)

For most of the twentieth century, the telephone interview was the gold standard of survey research. CATI systems — software that displays questionnaires on interviewers' screens, routes branching logic automatically, and records responses directly into a database — transformed telephone interviewing from a labor-intensive paper process into a manageable, quality-controlled operation.

CATI's core strength is the live interviewer. A trained human caller can probe unclear answers, manage complex skip patterns conversationally, handle respondents who start to disengage, and build just enough rapport to push through a 25-minute questionnaire. The interviewer can also exercise judgment: if a respondent's audio quality is poor or they seem confused, the interviewer can clarify without invalidating the response.

The coverage logic of CATI has shifted dramatically. For decades, landline RDD — randomly dialing numbers within working area code/exchange combinations — provided near-universal coverage of American households. As of 2023, roughly three-quarters of U.S. adults live in households that are cell-phone-only, and much of the remaining landline population is effectively unreachable because those households rarely answer unfamiliar calls. Pew Research Center's surveys of cell-phone-only adults consistently show they are younger, more likely to be renters, more likely to be Hispanic, and more politically distinct than their landline-owning counterparts. Excluding them from a landline-only sample is not a sampling error — it is a coverage error, a systematic exclusion of a definable population segment.

Modern CATI operations therefore require dual-frame designs: separate sample frames for landlines and cell phones, with the two frames merged and weighted to represent the combined population. Cell phone interviewing is more expensive because federal regulations (the Telephone Consumer Protection Act, TCPA) require that cell phones be dialed manually rather than by auto-dialer, adding per-interview costs.

Cost structure: High. Live interviewers cost $10–$25 per completed interview in labor alone, depending on survey length and response rate. A 600-person statewide poll can run $20,000–$40,000 in data collection alone. Timelines are typically 3–7 field days for a standard political poll.

Quality signature: CATI data tends to have lower item nonresponse (respondents are less likely to skip individual questions) and better performance on long, complex questionnaires. But it is vulnerable to interviewer effects (Section 9.5) and social desirability bias (Section 9.5.2).

9.1.2 Interactive Voice Response (IVR)

IVR polls — sometimes called "robopolls" — replace the live interviewer with a recorded voice. Respondents listen to pre-recorded questions and enter their answers by pressing numbers on a telephone keypad. The system captures keystrokes and generates a dataset automatically.

IVR is dramatically cheaper than CATI: without interviewers, the per-interview marginal cost falls sharply, and a 1,000-person poll can be fielded for a few thousand dollars rather than tens of thousands. Speed is also an advantage — an IVR poll can sometimes produce results overnight.

These cost advantages come with serious methodological liabilities. First, IVR can only reach landline phones under TCPA rules (auto-dialing cell phones with a prerecorded message requires prior written consent). This means IVR surveys suffer from the same cell-phone-coverage problem as landline CATI, but without the option to add a cell phone component using manual dialing. Second, IVR questionnaires must be short (typically under 10 minutes) because respondents hang up at far higher rates than with live interviewers. Third, response rates are extremely low — often 1–3 percent — and the population of people who will sit through a robocall survey is likely nonrandom in ways difficult to correct through weighting.

Some IVR firms combine their phone sample with online panel recruitment to compensate for cell-phone coverage gaps, an approach that introduces its own complexities around blending independent probability and non-probability samples.

📊 Real-World Application: IVR in State Legislative Races

IVR polls play an outsized role in state legislative and local election polling precisely because live-interviewer CATI is prohibitively expensive for small-N races. A state house district poll of 400 respondents that would cost $15,000 by CATI might cost $1,500 by IVR. For campaigns with tight budgets and quick turnaround needs, IVR is often the only feasible option — but consumers of these polls should apply heightened scrutiny.

9.1.3 Online Surveys

Online polling has become the dominant mode for commercial polling, and it plays a major role in academic and campaign research as well. The operational logic is simple: send an invitation email (or display a panel recruitment ad) pointing to a web-hosted questionnaire. Respondents click through, complete the survey, and their data is automatically logged.

The critical distinction in online polling is between probability-based online panels and opt-in (non-probability) panels.

Probability-based online panels — exemplified by the AmeriSpeak panel at NORC/University of Chicago, Ipsos's KnowledgePanel, and similar designs — recruit respondents through probability methods (often address-based sampling, ABS, drawing from the USPS delivery sequence file). Households without internet access are provided tablets and connectivity. The goal is a panel whose composition mirrors the general population, allowing statistical inference with estimable margins of error. These panels are expensive to maintain but produce data that meets AAPOR's standards for probability-based inference.

Opt-in panels are assembled by recruiting volunteers through advertising, website sign-ups, and affiliated networks. Pollsters purchase access to these panels from vendors who maintain millions of registered respondents. Opt-in panels can turn around surveys extremely quickly and cheaply — $0.50 to $2.00 per completed survey — but they are fundamentally non-probability samples. Statistical margins of error calculated for them are, strictly speaking, not valid because selection probabilities are unknown. AAPOR guidelines require that opt-in polls disclose their non-probability nature; even so, the margin-of-error footnotes in many such polls contain misleading language.

💡 Intuition: Why Can't You Just Weight Your Way Out of Opt-In Bias?

When opt-in panelists differ from the general population in measurable ways (age, education, party registration), you can weight to correct for those differences. But opt-in panels likely differ in unmeasured ways too: people who volunteer for survey panels may be more politically engaged, more opinionated, more likely to hold extreme views, or just different in personality from those who don't. No amount of weighting on observed demographics corrects for selection on unobserved characteristics. This is the fundamental problem with opt-in samples that no methodological sophistication fully solves.

Mode effects in online surveys: Online polls consistently produce different results from telephone polls on the same substantive questions. Online respondents tend to give more extreme responses on scales (they're less likely to pile up in the middle), are more likely to endorse socially sensitive positions (such as racial prejudice items) presumably because of reduced social desirability pressure, and show different engagement with open-ended questions. These are mode effects, not measurement errors per se — but they complicate trend analysis when mode is changing over time.

9.1.4 Mail Surveys

Mail surveys — sending a paper questionnaire with a return envelope — fell out of favor in the 1980s as telephone coverage became universal. They have seen something of a revival in specific contexts, particularly for reaching populations with low phone response rates (elderly rural residents, for instance) or for sensitive topics where self-administered paper questionnaires reduce social desirability bias.

Address-based sampling from USPS files is the common mail survey frame, offering near-universal household coverage. The timeline is long: allow 2–3 weeks for initial delivery, response, and return, plus another week or two for a follow-up mailing and reminder. Mail surveys are not suitable for tracking rapidly changing opinion during a campaign.

Response rates for mail surveys average 10–30 percent in general population samples, though specific populations with strong motivation to participate (registered voters during a high-salience election) can hit 40–50 percent with well-designed follow-up protocols. Mail surveys are difficult to weight because they require assumptions about who returned the questionnaire versus who didn't, and the demographic characteristics of nonrespondents are typically estimated rather than known.

9.1.5 SMS Text Surveys

SMS polling pushes questions to respondents' cell phones as text messages. Respondents reply with a number or short text, and automated systems log responses. SMS surveys must be extremely short — typically 2–5 questions — because respondents abandon text conversations quickly. They also require opt-in recruitment (again, TCPA governs mass texting) or panel populations that have already consented to SMS contact.

SMS surveys are used primarily for tracking quick toplines (approve/disapprove, horse race) during campaigns, push polls (a different, ethically fraught use), and get-out-the-vote reminder studies. Their coverage overlaps largely with those who have opted into a panel, so coverage properties are similar to opt-in online panels.

9.1.6 Face-to-Face Interviewing

In-person interviewing — an interviewer knocking on a sampled address's door and administering the questionnaire in person — remains the gold standard for complex, long surveys and for reaching populations difficult to contact by any other mode. The American National Election Studies (ANES) and the General Social Survey (GSS) use face-to-face methods for their flagship surveys, accepting the high cost in exchange for richer data and better response rates.

Face-to-face survey costs are prohibitive for routine political polling: $100–$200+ per completed interview when you account for interviewer travel, training, and supervision. A 1,000-person national face-to-face poll costs as much as a large quantitative political science research project. For campaign polling, face-to-face is essentially never used. For academic and government surveys that form the backbone of political science research, it remains essential.


9.2 Multi-Mode Operations: Trish's Playbook

When Meridian received the contract to poll the Garza-Whitfield race in a competitive mid-Atlantic state, Trish convened a mode design meeting in the conference room where a whiteboard was quickly covered with boxes and arrows.

"Client wants 600 likely voters, credible margin of error, results in five days," she told the assembled team. "We are not doing this by CATI alone. Cost would kill us and timeline is impossible."

Meridian's standard approach for statewide political polling uses a blended design: roughly 50 percent CATI (split between landline and cell phone frames using dual-frame RDD), 30 percent online panel from a probability-based vendor, and 20 percent address-based mail with web-option response. The exact shares shift based on the state, the timeline, and whether special populations need oversampling.

The logic of blending: No single mode reaches everyone. CATI reaches people who answer their phones. Online panels reach regular internet users who have joined a panel. Mail reaches address-file households regardless of phone or internet usage. Each mode fills coverage gaps left by the others. The blend is not arbitrary — Meridian has calibrated its mode composition against known benchmarks (voter registration files, Census demographic distributions) to minimize the systematic gaps that any single mode creates.

Mode-specific protocols: Each mode requires its own operational protocols. For CATI, Trish works with a telephone interviewing facility that uses her customized CATI script, conducts interviewer training on the specific questionnaire, and returns raw data files daily. For online, she uploads the questionnaire to a panel vendor's system and monitors quota filling in real time through a dashboard. For mail-web, she works with a letter shop to print and mail invitations and monitors web-form completions daily.

⚠️ Common Pitfall: Mode Harmonization

When combining data from multiple modes, question wording, response scales, and answer category order must be identical across all modes. Even subtle differences — showing a 5-point approval scale with the positive end first online but the negative end first on paper — can introduce mode effects that masquerade as substantive attitude differences. Trish enforces a single master questionnaire that specifies mode-specific adaptations only where unavoidable (e.g., IVR cannot handle grid questions, so those must be broken into individual items).


9.3 Response Rates: Defining the Denominator

How many people you reach, attempt to interview, and successfully complete interviews with is not just an operational metric — it is a data quality indicator and an ethical reporting obligation. The American Association for Public Opinion Research (AAPOR) has codified a family of response rate definitions to create consistency across organizations and studies.

9.3.1 The AAPOR Response Rate Framework

AAPOR defines response rates based on what information is known or assumed about the sample. The general formula is:

Response Rate = Completed Interviews / (Completed + Partial + Refusals + Non-contacts + Unknown Eligibility × Estimated Proportion Eligible)

The six standard AAPOR response rates (RR1–RR6) differ primarily in how they treat partial completions and cases of unknown eligibility:

  • RR1 (Minimum Response Rate): Counts only fully completed interviews in the numerator. Assumes all cases of unknown eligibility are eligible (maximizes the denominator, producing the most conservative/lowest response rate).

  • RR2: Includes both completed and partially completed interviews in the numerator. Still assumes all unknowns are eligible.

  • RR3: Completed interviews only in the numerator, but applies an estimated eligibility rate to unknown cases (based on the known ratio of eligible to non-eligible cases in resolved cases).

  • RR4: Completed and partial interviews in the numerator, estimated eligibility rate for unknowns.

  • RR5: Completed interviews only in the numerator, treating all cases of unknown eligibility as ineligible (e = 0) and therefore excluding them from the denominator.

  • RR6: Completed and partial interviews in the numerator, with the same e = 0 assumption. RR6 represents the maximum response rate the disposition data can support.

For political polls targeting registered voters or likely voters, screening out ineligible respondents is standard practice, which leaves many sampled cases with eligibility unknown. In that situation RR3 or RR4 — which estimate eligibility among the unknown cases — are the most defensible choices, and most reputable polls report one of them; RR5 and RR6 are appropriate only when it is reasonable to assume the unknown cases are ineligible.
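In practice these rates are computed from the final disposition file rather than by hand. Below is a minimal Python sketch using simplified disposition categories and invented counts (the full AAPOR standard distinguishes more detailed sub-codes within each category):

```python
# A minimal sketch of the AAPOR response rate family.
# Disposition counts below are hypothetical, not from the Meridian example.

def aapor_response_rates(I, P, R, NC, O, U, e):
    """
    I: complete interviews        P: partial interviews
    R: refusals and break-offs    NC: non-contacts
    O: other eligible non-interviews
    U: cases of unknown eligibility
    e: estimated proportion of unknown-eligibility cases that are eligible
    """
    known_eligible = I + P + R + NC + O
    return {
        "RR1": I / (known_eligible + U),          # all unknowns treated as eligible
        "RR2": (I + P) / (known_eligible + U),
        "RR3": I / (known_eligible + e * U),      # unknowns discounted by e
        "RR4": (I + P) / (known_eligible + e * U),
        "RR5": I / known_eligible,                # unknowns treated as ineligible (e = 0)
        "RR6": (I + P) / known_eligible,          # the maximum response rate
    }

rates = aapor_response_rates(I=290, P=55, R=620, NC=5900, O=40, U=1200, e=0.30)
for name in sorted(rates):
    print(f"{name}: {rates[name]:.1%}")
```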

📊 Real-World Application: Trish's Response Rate Tracking

Midway through the Garza-Whitfield tracking poll, Trish pulls up the daily response rate dashboard and reads out the figures to Carlos:

"CATI landline component: we've dialed 8,400 numbers, reached 1,240, screened in 680 as registered voters, completed 290 interviews, 55 partials. Cell phone: dialed 4,100, reached 620, screened in 400, completed 185. The online panel filled 180 completes at a 34 percent response rate. Mail-web component still coming in."

Carlos does the arithmetic on the CATI landline: about 14.8 percent of all dialed numbers reached a respondent, and about 23.4 percent of reached respondents completed the interview. "So what's the actual response rate?" he asks.

"Depends which formula," Trish says. "RR1 on the landline component is around 3.5 percent. RR3, where we account for the fact that a lot of those non-contacts are probably disconnected lines and businesses, gets us closer to 9 percent. Neither number would have been considered acceptable forty years ago. Today, 9 percent is about what everyone is getting."

9.3.2 The Response Rate Collapse

The story of survey response rates over the past half-century is one of the most consequential methodological narratives in social science. In the 1970s, response rates of 70–80 percent were common for well-conducted telephone surveys. By 1997, Pew Research reported overall response rates around 36 percent. By 2012, the figure had fallen to 9 percent for a standard five-day telephone survey effort. By the early 2020s, Pew and other organizations were reporting rates of 4–6 percent for telephone surveys and lower still for many commercial operations.

What happened? The causes are multiple and interacting:

Caller ID and call screening: Widespread adoption of caller ID in the 1990s allowed households to screen unknown numbers before answering. The percentage of Americans who answer calls from unknown numbers has declined precipitously.

Cell phone transition: Cell phones introduced friction into polling. People carry their phones with them but do not answer unknown numbers on them. Cell phone respondents are harder to reach (calling back at the same time each day is often ineffective) and require manual dialing under TCPA.

Erosion of civic norms: Participation in surveys once carried a civic framing — you were helping researchers understand public opinion. That framing has weakened as survey fatigue, telemarketing abuse of telephone contact, and general distrust in institutions have grown.

Survey proliferation: The very success of survey research created a saturation problem. Americans are contacted by more surveys, legitimate and otherwise, than they were in 1970. The marginal value of any individual survey participation has decreased, raising the threshold for compliance.

The rise of marketing calls: As telemarketing exploded, the telephone became associated with commercial solicitation rather than civic participation. Respondents who might have answered a survey call in 1980 now assume any unfamiliar number is trying to sell them something.

9.3.3 Does Response Rate Decline Mean Bias?

This is the pivotal question, and the answer is counterintuitive: not necessarily, or at least not in simple proportion to the rate decline.

The critical issue is not response rate per se but nonresponse bias — whether those who respond are systematically different from those who don't in ways that affect survey estimates. If nonresponse is random with respect to the variables being measured, low response rates produce estimates with higher variance (more uncertainty) but not systematically wrong ones. If nonresponse is correlated with the outcome — if, say, partisans of one party systematically refuse to participate in polls — then the estimates are biased regardless of whether you're getting 70 percent or 7 percent response.

The empirical evidence is mixed but sobering. Studies comparing survey respondents to administrative records (voter files, Census data) find that:

  • Survey respondents are, on average, more educated, more politically engaged, and more likely to vote than their sampling frame counterparts.
  • These biases can sometimes be corrected through weighting if the observable characteristics of nonrespondents are known.
  • But if nonrespondents differ on unobservable dimensions — such as enthusiasm for a candidate or intensity of partisanship — weighting cannot correct the bias.

Callback studies — in which researchers track the demographic and attitudinal characteristics of respondents reached after varying numbers of contact attempts — provide important evidence. Hard-to-reach respondents (those who only participate after five or six callback attempts) tend to be younger, more likely to be employed full-time, and somewhat less politically engaged than easy-to-reach respondents. This suggests that stopping at early callbacks produces samples weighted toward more engaged, older respondents. Protocols requiring extensive callbacks (often 8–10 attempts at varying times of day and days of week) partially address this problem but add cost and time.

🔴 Critical Thinking: The 2020 Polling Error

The systematic overestimation of Democratic support in 2020 (polling averages showed Biden +8 nationally; he won by +4.5) has been attributed in part to differential nonresponse by party. A comprehensive post-election study suggested that Trump supporters were systematically less likely to respond to polls, and that this was not fully correctable through standard demographic weighting. The bias may have arisen from differential social trust — Trump supporters who were skeptical of media and institutions were also skeptical of researchers — and from the difficulty of constructing weights that capture political engagement rather than just party identification. This remains an active area of research with significant implications for how the industry thinks about nonresponse.


9.4 Nonresponse Bias: Mechanisms and Evidence

Nonresponse bias can arise from two distinct mechanisms, which have different implications for data quality and correction strategies.

9.4.1 Unit Nonresponse

Unit nonresponse occurs when a sampled individual provides no data at all — they don't answer the phone, they hang up before the screener, they delete the invitation email without clicking through. The entire record is missing.

Unit nonresponse becomes bias when those who fail to respond differ from respondents on the variables of interest. The magnitude of the bias depends on both the nonresponse rate and the difference between respondents and nonrespondents:

Nonresponse Bias = (1 - Response Rate) × (Ȳ_respondents - Ȳ_nonrespondents)

This formulation clarifies two pathways to minimizing bias: increasing the response rate (reducing 1 - RR) or ensuring that respondents and nonrespondents don't differ on key variables (reducing the second term). The industry has largely abandoned the first strategy as infeasible and concentrated on the second through weighting and validated measures of bias.
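A quick numeric sketch makes the identity concrete. The values below are hypothetical — in a real study the nonrespondent mean is, by definition, unobserved:

```python
# Hypothetical illustration of the nonresponse bias identity above.
response_rate = 0.09          # 9 percent of sampled cases responded
mean_respondents = 0.52       # e.g., share supporting Garza among respondents
mean_nonrespondents = 0.46    # same share among nonrespondents (never observed in practice)

bias = (1 - response_rate) * (mean_respondents - mean_nonrespondents)
print(f"Bias in the respondent estimate: {bias:+.3f} ({bias * 100:.1f} points)")
# With 91 percent of the sample missing, even a six-point respondent/nonrespondent
# gap inflates the estimate by roughly 5.5 points.
```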

Evidence from matched-sample validation: One powerful method for assessing unit nonresponse bias is matching survey respondents to administrative records (voter files, Census records) and comparing observed demographic distributions. Studies using this approach consistently find that survey respondents are more educated and more likely to be registered voters than the frame population — but these differences are partially correctable through raking and poststratification weights.

9.4.2 Item Nonresponse

Item nonresponse occurs when a respondent participates in the survey but declines to answer specific questions — leaving a candidate preference blank, skipping an income question, or refusing the open-ended opinion probe. Item nonresponse rates vary dramatically by question type: income questions routinely see 10–20 percent nonresponse even in otherwise cooperative samples; horse-race candidate preference questions in political polls typically see only 2–5 percent nonresponse.

For political polls, item nonresponse on the vote-intention question is managed through follow-up probing ("Which candidate are you leaning toward?") and sometimes through imputation for analysis. How undecided and leaning respondents are handled — reported separately, allocated proportionally, excluded from the topline — varies across pollsters and represents a source of inconsistency in published results.

9.4.3 Wave Nonresponse in Panel Studies

Panel surveys — those that follow the same respondents over time — face a specialized form of nonresponse called wave nonresponse or attrition. Respondents who participate in Wave 1 may be unavailable or unwilling to participate in Wave 2 or Wave 3. If attrition is correlated with opinion change (those who switch parties are less likely to re-participate, for instance), wave nonresponse produces biased estimates of change over time.

Political tracking polls that attempt to follow the same panel of respondents across the campaign cycle must carefully monitor attrition patterns and weight for wave nonresponse. This is one reason why most campaign tracking polls use fresh cross-sectional samples rather than true panels — attrition management is costly and the benefits of panel design are sometimes outweighed by the nonresponse complications.


9.5 Interviewer Effects and Social Desirability

9.5.1 Interviewer Effects

The presence of another human being changes how people answer questions. This is not a subtle or occasional phenomenon — it is a robust, well-documented source of measurement bias in telephone and face-to-face surveys, and one of the primary motivations for self-administered (online, mail, IVR) modes in sensitive research.

Interviewer effects operate through several mechanisms:

Demographic matching effects: Respondents give different answers when interviewed by someone of a different race, gender, or perceived political affiliation. In a classic study of race-of-interviewer effects, Black respondents reported higher levels of racial solidarity and more support for affirmative action when interviewed by Black interviewers than by white interviewers. The reverse pattern held for some questions: white respondents expressed less support for affirmative action when interviewed by Black interviewers. These are not reporting errors — they may reflect genuine differences in how respondents construct social situations — but they are measurement artifacts that distort population estimates.

Interviewer variance: Different interviewers reading the same script produce different answers. Some interviewers are more likely to probe ambiguous responses; some have different vocal tones that affect cooperation; some subtly emphasize different words in question text. Sophisticated survey organizations measure interviewer-level intraclass correlations (the degree to which responses collected by the same interviewer resemble one another after controlling for respondent demographics) as a quality indicator. High interviewer-level clustering signals a problem.

Expectancy effects: Interviewers may unconsciously communicate expected answers through tone, pacing, or emphasis. Training protocols at organizations like Meridian explicitly address this, requiring interviewers to read questions verbatim, pause equally after each response option, and probe non-directively ("Can you tell me more about that?" rather than "So you mean...").

9.5.2 Social Desirability Bias

Social desirability bias (SDB) is the tendency for respondents to report opinions or behaviors that they believe will be viewed favorably by the interviewer or by society generally, rather than their true opinions or behaviors. The effect is strongest for:

  • Questions about socially stigmatized behaviors (drug use, sexual behavior, criminal history)
  • Questions about racial attitudes or intergroup relations
  • Questions about vote choice when one candidate is more socially sanctioned than the other
  • Questions about politically incorrect views

In political polling, social desirability most famously appeared in the "Bradley Effect" hypothesis: the observation that Black candidates in several 1980s elections performed worse than pre-election polls suggested. The explanation offered was that white voters were unwilling to admit to human interviewers that they planned to vote against the Black candidate. While the empirical evidence for a consistent Bradley Effect in recent elections is mixed, the underlying mechanism — willingness to give politically comfortable rather than honest responses — is well-established.

The "Shy Tory" Effect: British polling has documented consistent underestimation of Conservative support in pre-election polls, attributed partly to social desirability among voters embarrassed to admit supporting a stigmatized party. This same dynamic has been proposed as an explanation for the polling miss on Trump support in 2016 and 2020, though evidence is contested.

Self-administered modes as a partial solution: Online and mail surveys reduce SDB because respondents believe (usually correctly) that their individual responses are not directly observed by another person. Comparisons of online vs. telephone responses to sensitive questions consistently show higher endorsement of socially undesirable attitudes in online modes. This is generally interpreted as online providing more accurate measurement of sensitive topics, not that online makes people more prejudiced.

⚠️ Common Pitfall: Assuming SDB Only Goes One Direction

Social desirability operates differently depending on the social context. In some communities, expressing extreme nationalist views or anti-establishment sentiments carries the same social approval that mainstream views carry in others. Survey researchers studying populist movements have noted that populist sentiment is sometimes understated in elite-coded survey environments and overstated in media environments that normalize that frame. The direction of desirability bias cannot be assumed without understanding the respondent's social context.


9.6 Panel Conditioning Effects

Online panels and longitudinal studies face a particular hazard: respondents who participate in many surveys over time change because of that participation. This is called panel conditioning or time-in-sample bias.

Conditioning effects include:

Learning effects: Repeated survey participation educates respondents about politics. Panel members who complete quarterly political surveys for two years know more about political institutions, more about candidates, and have better-formed opinions than comparable non-panel citizens. Surveys of these respondents then measure political sophistication that was created by the survey process itself.

Sensitization effects: Exposure to survey questions on a topic increases respondents' awareness of and attention to that topic in daily life. Panel members who are regularly asked about their news consumption habits become more conscious news consumers. Their subsequent reports about news consumption may be more accurate but also inflated by the attention the survey drew to their habits.

Attitude crystallization: Respondents who are repeatedly asked about their policy preferences develop more consistent and stable attitudes over time. The panelist asked fifteen times about immigration policy has a more crystallized, accessible immigration attitude than a fresh respondent. Survey-measured attitudes of long-term panelists may be more coherent than population attitudes — a bias toward extremity and stability.

Fatigue and satisficing: Long-term panelists who recognize survey patterns may answer more quickly and less thoughtfully over time, choosing the first plausible response rather than considering each option carefully. This "satisficing" behavior tends to push responses toward the middle of scales and toward expected answers.

Probability-based panels like KnowledgePanel and AmeriSpeak manage conditioning concerns by rotating panelists through surveys carefully, not surveying the same respondent on the same topic too frequently, and monitoring individual-level response pattern changes over time. Opt-in panels have less ability to manage conditioning because they have less visibility into individual panel members' complete survey history across vendors.


9.7 The Logistical Reality of a National Poll

Carlos comes to understand the operational scope of Meridian's work gradually, in the moments between data requests and methodology discussions.

Fielding a credible national poll — 1,000 adults or 800 likely voters, dual-frame CATI plus online — involves:

Sample procurement: Buying or generating sample is a market in its own right, with vendors, pricing, and quality variation. For RDD, sample vendors sell lists of working telephone numbers stratified by state and frame type. For online, panel vendors sell access to their registrants, priced by the expected complete rate and demographics. For ABS, the USPS delivery sequence file must be licensed and processed. Trish has vendor relationships built over two decades. She knows which sample vendors deliver cleaner cell-phone frames and which online panels tend to over-represent politically extreme respondents.

Questionnaire programming: The paper questionnaire must be translated into working CATI scripts, online survey platform code, and (for mail) print-ready PDFs. Skip patterns that are trivial to describe in prose ("If respondent says 'definitely vote,' ask Q17; otherwise, go to Q23") require careful conditional logic implementation. A single misrouted skip can invalidate hundreds of responses before the error is caught.
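A minimal sketch of how a routing rule like the one quoted above might be implemented — the function name and the Q16 screener identifier are hypothetical; only Q17 and Q23 come from the prose example:

```python
def next_question(current_q, answer):
    """Return the next question ID, given the current question and its coded answer."""
    if current_q == "Q16":   # hypothetical vote-likelihood item preceding the branch
        return "Q17" if answer == "definitely vote" else "Q23"
    # ...routing rules for the rest of the instrument...
    return None

# A misplaced condition here silently misroutes every affected respondent,
# which is why skip logic is tested against simulated interviews before fielding.
assert next_question("Q16", "definitely vote") == "Q17"
assert next_question("Q16", "probably will not vote") == "Q23"
```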

Interviewer training: For CATI components, interviewers must be trained on the specific questionnaire. This is not generic training — it covers how to pronounce candidates' names, how to handle specific probable objections ("I don't do political surveys"), how to code ambiguous responses on screening questions, and what notes to attach to unusual records.

Daily monitoring: Trish watches quota dashboards every day of the field period. If cell-phone Black respondents are filling too slowly, she increases the cell-phone sample allocation targeted at those area codes. If rural counties are under-represented in the online component, she may add a mail push to ABS addresses in those zip codes. Quota monitoring is not about distorting the sample — it is about ensuring that the deliberate sampling plan is being executed as designed.

Refusal conversion: For CATI surveys, initial refusals are not automatically coded as final. Respondents who decline during the first contact are often re-contacted by a senior interviewer (or by a more experienced caller at a different time of day) for a "conversion" attempt. Refusal conversion protocols must be disclosed as part of the methodology (AAPOR standards require reporting whether conversion was attempted). Converted refusals sometimes differ systematically from initial cooperators — they may be more skeptical of surveys generally, more distrustful of specific institutions, or more fatigued — and their inclusion should be monitored.

Field period management: The length of the field period matters for data quality. Too short (24–48 hours) and certain subgroups are systematically missed — people working multiple jobs who are never home on weekday evenings, shift workers who sleep during standard dialing hours. Too long and the political environment may shift during fielding, creating a synthetic "average" of opinion across different news cycles. For political tracking polls during an active campaign, 3–5 days is typically considered the right balance.

📊 Real-World Application: The Weekend Effect

Trish explains to Carlos why she always pushes clients to allow at least one weekend of fielding. "Weekday-only samples miss a ton of working-age adults who are simply unavailable Monday through Friday evening. Saturday and Sunday calling reaches people who never answer on weekdays. If you only field Tuesday through Thursday, your sample skews older and more retired. In a race as close as Garza-Whitfield, that could move the topline by a point or two."


9.8 Data Cleaning and Coding: From Raw Responses to Analyzable Data

Survey fielding ends when the last response comes in. Data quality work is only half-done.

9.8.1 Initial Data Cleaning

Raw survey data files contain problems that must be resolved before analysis:

Speeder flagging: Online and IVR surveys log completion times. Respondents who complete a 20-minute questionnaire in under 3 minutes ("speeders") likely did not read questions carefully. Standard practice is to flag and review speeders, removing records where completion time is implausibly fast (typically less than one-third of median completion time).

Straightlining detection: Grid questions (where respondents rate multiple items on the same scale) reveal "straightliners" who select the same response option for every item without reading the questions. Straightlining is detected by examining variance across items for individual respondents. Records with zero variance across grid items are flagged for review.

Open-end response review: Open-ended questions produce text that must be reviewed by a human coder. Bot-generated responses (strings of nonsense text, repeated characters, verbatim repetition of the question) must be removed. Genuine but short responses ("don't know," "no opinion," "none") must be distinguished from true bot activity.

Out-of-range value checks: Automatically coded numeric values (age, income, zip code) are checked against valid ranges. A reported age of 150 or an income of $9,999,999 is a data entry error requiring investigation.

Duplicate response detection: Online surveys must check for duplicate respondents — the same person completing the survey twice using different device identifiers, or the same panelist being sampled through multiple vendors.
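A minimal pandas sketch of the speeder, straightliner, and duplicate checks described above. The column names (duration_sec, grid_q1 through grid_q4, panelist_id) and the thresholds are hypothetical:

```python
import pandas as pd

def flag_quality_problems(df, grid_cols, min_duration_ratio=1/3):
    """Attach boolean quality flags to a raw online-survey data file."""
    out = df.copy()
    # Speeders: completion time under one-third of the median completion time
    out["flag_speeder"] = out["duration_sec"] < min_duration_ratio * out["duration_sec"].median()
    # Straightliners: the same answer to every item in a grid (no variation across items)
    out["flag_straightliner"] = out[grid_cols].nunique(axis=1) == 1
    # Duplicates: the same panelist ID appearing more than once
    out["flag_duplicate"] = out.duplicated(subset="panelist_id", keep="first")
    return out

raw = pd.DataFrame({
    "panelist_id":  [101, 102, 103, 103, 104],
    "duration_sec": [940, 1100, 210, 205, 870],
    "grid_q1": [2, 4, 3, 3, 5], "grid_q2": [3, 4, 3, 3, 1],
    "grid_q3": [2, 4, 3, 3, 4], "grid_q4": [4, 4, 3, 3, 2],
})
grid = ["grid_q1", "grid_q2", "grid_q3", "grid_q4"]
print(flag_quality_problems(raw, grid)[["panelist_id", "flag_speeder",
                                        "flag_straightliner", "flag_duplicate"]])
```

Flagged records are reviewed rather than dropped automatically; a fast but attentive respondent should not be discarded on timing alone.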

9.8.2 Coding Schemes and Codebooks

Closed-ended questions have predefined response codes. Open-ended questions and partially ambiguous closed-ended responses require human coding against a codebook — a document specifying what each response code means and how ambiguous cases should be resolved.

For vote intention questions, responses like "I'm leaning toward Garza" need to be coded consistently. Meridian's standard protocol creates separate variables for the primary vote-intention response ("which candidate would you vote for if the election were held today") and a follow-up lean variable for initially undecided respondents ("which way are you leaning"), allowing analysts to report both "definite" and "definite plus lean" preferences.
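A minimal sketch of how the two variables might be combined into "definite" and "definite plus lean" preference measures. The column names are hypothetical; the candidate labels follow the chapter's running example:

```python
import numpy as np
import pandas as pd

responses = pd.DataFrame({
    "vote_intent": ["Garza", "Whitfield", "Undecided", "Undecided", "Garza"],
    "lean":        [None,    None,        "Garza",     "Undecided", None],
})

# "Definite" preference: the primary vote-intention response only
responses["definite"] = responses["vote_intent"]

# "Definite plus lean": allocate initially undecided respondents who expressed a lean
leaned = (
    (responses["vote_intent"] == "Undecided")
    & responses["lean"].notna()
    & (responses["lean"] != "Undecided")
)
responses["definite_plus_lean"] = np.where(leaned, responses["lean"], responses["vote_intent"])

print(responses["definite"].value_counts(normalize=True))
print(responses["definite_plus_lean"].value_counts(normalize=True))
```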

Intercoder reliability: When human coders are assigning codes to open-ended responses, quality requires assessing consistency between coders. A standard approach is double-coding: two coders independently code 20 percent of responses, and agreement rates (Cohen's kappa) are calculated. Agreement below 0.70 suggests the coding scheme needs clarification or coders need additional training.
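A minimal sketch of the double-coding check, assuming scikit-learn is available; the open-end topic codes below are invented for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Codes assigned independently by two coders to the same subsample of open-ended responses
coder_a = ["economy", "economy", "health", "other", "health", "economy", "other", "health"]
coder_b = ["economy", "health",  "health", "other", "health", "economy", "other", "other"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa = {kappa:.2f}")   # a value below roughly 0.70 triggers codebook review
```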

9.8.3 Documentation: The Codebook and Data Dictionary

A well-documented dataset is reproducible and defensible. Meridian's standard deliverable includes:

  • Codebook: Variable names, question text, response options, and code values for all variables
  • Data dictionary: Data types, valid ranges, and notes on missing value conventions
  • Field period log: Dates and modes of data collection, daily sample sizes, daily response rates
  • Weighting documentation: The target margins used for weighting, the variables used in the rake, and diagnostic statistics (minimum and maximum weight values, design effect)
  • Disposition codes: Full accounting of all sample records by final disposition (complete, partial, refusal, non-contact, ineligible, etc.) per AAPOR standards

Carlos learns to check these documents before touching a dataset. "The dataset alone tells you what the data says," Vivian Park tells him at one point. "The documentation tells you whether to believe it."


9.9 Who Gets Counted: The Politics of Survey Participation

Every decision in survey fielding — which mode, which frame, which languages, which call hours — determines who is included in the population of potential respondents. These are not purely technical decisions. They are decisions with political implications.

Language exclusion: A survey conducted only in English excludes Spanish-speaking, Mandarin-speaking, Vietnamese-speaking, and other non-English-dominant respondents. In states with large immigrant communities, English-only surveys systematically exclude populations whose political attitudes may differ substantially from the English-speaking majority. AAPOR standards recommend multilingual field operations when the target population includes substantial non-English-speaking segments, but cost constraints lead many commercial pollsters to produce English-only surveys and note this limitation quietly in methodology disclosures.

Digital exclusion: Online-only surveys exclude households without reliable internet access. This exclusion falls disproportionately on rural residents, low-income households, elderly individuals, and populations in geographic areas with poor broadband infrastructure. Probability-based panels address this by providing internet access to offline-recruited respondents, but opt-in panels cannot do so and simply accept the coverage gap.

Literacy and disability access: Mail surveys require literacy; IVR and CATI surveys require hearing capacity; online surveys require visual capacity. Standard surveys rarely provide accessible alternatives for deaf, blind, or low-literacy respondents, systematically excluding these populations from the data that shapes policy.

⚖️ Ethical Analysis: When Methodology Becomes Policy

Survey-based measures of public opinion influence policy decisions. If methodological choices systematically exclude the political attitudes of marginalized populations — the undocumented, the incarcerated, the severely ill, the digitally disconnected — then "public opinion" as measured by surveys is actually the opinion of a subset of the public. Policy made in response to surveyed opinion may not serve, and may actively harm, populations whose voices were never in the data. This is not merely a statistical problem. It is a question about whose preferences democratic institutions are designed to aggregate.


9.10 Weighting the Final Dataset

Before data leaves Meridian's hands and enters analysis, it must be weighted. Weighting is the process of adjusting each respondent's contribution to aggregate statistics to compensate for unequal probabilities of selection and to bring the sample's demographic profile into alignment with known population benchmarks.

9.10.1 Why Weighting Is Necessary

Even a carefully designed probability sample will deviate from the target population due to differential response rates across demographic groups, the practical realities of multi-mode fieldwork, and random sampling variability. A sample of 600 likely voters drawn from a state where 52% of voters are women may yield only 44% women after fielding — not because anything went wrong, but because women happened to be less reachable during those field days or less willing to participate. Unweighted, this sample would undercount women's political preferences.

Weighting corrects for this by up-weighting women (giving each female respondent a weight greater than 1.0) and down-weighting men (giving each male respondent a weight less than 1.0) until the weighted sample distribution matches the population distribution. The same logic applies to age, race, education, geographic region, and — in political polls — often party registration.

9.10.2 Raking (Iterative Proportional Fitting)

The most common weighting method in political polling is raking, also called iterative proportional fitting (IPF). Raking adjusts a sample simultaneously on multiple dimensions without requiring a full joint distribution. The process:

  1. Define target marginal distributions for each weighting variable (e.g., age: 18–34 = 22%, 35–54 = 38%, 55+ = 40%; gender: male = 48%, female = 52%).
  2. Begin with equal weights (each respondent = 1.0).
  3. Adjust weights so the age distribution matches the target.
  4. Adjust the same weights so the gender distribution matches the target (this may partially disturb the age alignment).
  5. Re-adjust for age, then for gender, and so on through every weighting variable in sequence.
  6. Iterate until all marginal distributions simultaneously match their targets within a specified tolerance.

Raking typically converges in 10–20 iterations. The result is a set of weights where each respondent's weight reflects how much they should "count" in aggregate statistics to represent their demographic cell in the population.
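Below is a minimal implementation sketch of raking on two weighting variables, using a toy sample and hypothetical targets. Production weighting systems add cell-collapsing rules, convergence diagnostics, and trimming, but the core loop looks like this:

```python
import pandas as pd

# Toy sample of eight respondents (illustrative only)
sample = pd.DataFrame({
    "age":    ["18-34", "18-34", "35-54", "35-54", "55+", "55+", "55+", "55+"],
    "gender": ["M", "F", "M", "F", "M", "F", "M", "F"],
})
sample["weight"] = 1.0

# Hypothetical target marginal distributions (each set sums to 1.0)
targets = {
    "age":    {"18-34": 0.22, "35-54": 0.38, "55+": 0.40},
    "gender": {"M": 0.48, "F": 0.52},
}

def rake(df, targets, max_iter=50, tol=1e-6):
    """Iterative proportional fitting: adjust weights until all marginals match targets."""
    n = len(df)
    for _ in range(max_iter):
        max_adjustment = 0.0
        for var, margins in targets.items():
            # Current weighted share of each category of this variable
            current = df.groupby(var)["weight"].sum() / df["weight"].sum()
            # Scale each respondent's weight by target share / current share for their category
            factors = df[var].map(lambda cat: margins[cat] / current[cat])
            df["weight"] *= factors
            max_adjustment = max(max_adjustment, (factors - 1).abs().max())
        df["weight"] *= n / df["weight"].sum()   # keep the mean weight at 1.0
        if max_adjustment < tol:
            break
    return df

weighted = rake(sample, targets)
print(weighted.groupby("age")["weight"].sum() / weighted["weight"].sum())     # ~ age targets
print(weighted.groupby("gender")["weight"].sum() / weighted["weight"].sum())  # ~ gender targets
```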

Target benchmarks: Political polls typically rake to: age × gender (from Census or voter file), race/ethnicity, education, region (urban/suburban/rural), and sometimes to prior vote (from the actual vote distribution of the most recent comparable election). The choice of benchmarks involves judgment — weighting to too many variables on too small a sample can produce extreme weights that destabilize estimates.

9.10.3 Design Effects and Effective Sample Size

Weighting comes at a cost: it increases variance in the estimates. When some respondents receive high weights (because they are from underrepresented groups) and others receive low weights, the effective number of independent observations is reduced. This is quantified by the design effect (DEFF):

DEFF = Actual Variance / Simple Random Sample Variance

For a typical political poll with moderate weighting, DEFF ranges from 1.2 to 2.0. This means the effective sample size — the size of a simple random sample that would produce the same precision — is the actual sample size divided by DEFF. A poll of 800 respondents with DEFF = 1.6 has an effective N of 500, which should inform the reported margin of error.

Many published polls report margins of error calculated as if the sample were a simple random sample of the stated size, ignoring the design effect. This understates uncertainty, particularly for consumers who read the reported MOE as a complete account of sampling error. Best practice is to report the design-effect-adjusted MOE alongside the simple SRS-based MOE.
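The design effect attributable to weighting is commonly approximated directly from the weights with Kish's formula, DEFF ≈ n × Σw² / (Σw)². A minimal sketch, using simulated weights as a stand-in for a real weight vector:

```python
import numpy as np

def kish_deff(weights):
    """Kish's approximation of the design effect due to unequal weights."""
    w = np.asarray(weights, dtype=float)
    return len(w) * np.sum(w**2) / np.sum(w)**2

def moe(n_effective, p=0.5, z=1.96):
    """95% margin of error for a proportion, given an effective sample size."""
    return z * np.sqrt(p * (1 - p) / n_effective)

rng = np.random.default_rng(9)
weights = rng.gamma(shape=4.0, scale=0.25, size=800)   # simulated weights, mean near 1.0

deff = kish_deff(weights)
n_eff = len(weights) / deff
print(f"DEFF = {deff:.2f}, effective N = {n_eff:.0f}")
print(f"Naive SRS MOE:       +/- {100 * moe(len(weights)):.1f} points")
print(f"Design-adjusted MOE: +/- {100 * moe(n_eff):.1f} points")
```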

Minimum and maximum weights: Survey organizations monitor weight distributions closely. Extremely high weights (a single respondent "speaking for" many others) are a data quality risk: if that respondent misunderstood a question or gave an unrepresentative answer, their error is amplified proportionally. Most organizations cap weights at 4.0 or 5.0 times the minimum weight, using trimming procedures that pull extreme weights toward the center while preserving overall distributional alignment.
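A minimal sketch of a simple cap-and-rescale trimming rule. Real trimming procedures are usually iterative — after capping, weights are re-raked so the marginals stay aligned — but the basic operation is a cap expressed relative to the minimum weight:

```python
import numpy as np

def trim_weights(weights, cap_ratio=5.0):
    """Cap weights at cap_ratio times the minimum weight, then rescale to a mean of 1.0."""
    w = np.asarray(weights, dtype=float)
    capped = np.minimum(w, cap_ratio * w.min())
    return capped * len(capped) / capped.sum()

w = np.array([0.4, 0.7, 1.0, 1.1, 1.8, 6.5])   # one extreme weight
print(trim_weights(w).round(2))                 # the 6.5 is capped, then all weights rescale
```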

9.10.4 The Limits of Weighting

Weighting is powerful but not magical. It corrects for imbalances on observed, measurable variables — demographics that appear in both the survey and the benchmark files. It cannot correct for imbalances on unobserved variables — attitudes, behaviors, or personality traits that predict survey participation but are not captured in the benchmark data.

This is the fundamental limitation that the 2020 polling miss brought into sharp relief (see Section 9.3.3 and Case Study 9-2): if partisan enthusiasm, institutional trust, or political identity systematically drives survey participation in ways not captured by age/race/education demographics, weighting on those demographics cannot correct the resulting bias. The field is actively developing supplementary weighting approaches — using party registration files, recall voting items, and trust scales as additional weighting dimensions — but none of these fully solves the problem of nonresponse driven by unmeasured attitudinal characteristics.


9.11 The Weighting-Transparency Connection

Trish McGovern's long experience has made her firm about one principle that she enforces across every Meridian project: every weighting decision must be documented before the data is analyzed, and the documentation must be part of every client deliverable.

"Weighting is where you can hide a lot of sins," she tells Carlos during his first month. "You can weight your way to almost any result if you're willing to use crazy benchmarks and nobody checks your work. Our job is to make sure every choice we make is visible and defensible."

Meridian's weighting documentation template requires:

  1. Source and vintage of all benchmark data: Which Census year, which voter file vintage, which official election return for prior vote benchmarks.
  2. Exact target marginal distributions: Printed as a table, not just described in text.
  3. Number of raking iterations completed and convergence criterion.
  4. Final weight distribution statistics: Minimum weight, maximum weight, mean weight (always 1.0 if properly constructed), standard deviation of weights, and DEFF.
  5. Trimming decisions: Whether any weights were trimmed, the trimming threshold used, and the number and percentage of records affected.
  6. Sensitivity checks: How the topline estimates change under alternative reasonable weighting schemes (e.g., different age marginals, with and without party registration weighting).

This level of documentation serves multiple purposes. It creates a paper trail that allows any analyst — at Meridian, at the client, or in a future methodological review — to reproduce the weighting and verify the results. It forces the analyst who performed the weighting to think through every decision explicitly, reducing ad hoc choices. And it provides the basis for the transparency disclosure that AAPOR standards require in any public release.


9.12 The Survey Lifecycle: From Instrument to Archive

The data collection phase covered in this chapter is embedded in a larger survey lifecycle that begins with research design and ends with data archiving and secondary analysis. Understanding where fielding fits in the lifecycle helps place its contribution to data quality in proper context.

Phase 1 — Research Design: Defining the population, research questions, and hypotheses. Choosing the mode and mode mix. Determining the sampling strategy and target sample size. Setting the budget and timeline. This phase produces the sampling plan and the project protocol.

Phase 2 — Instrument Development: Writing the questionnaire. Conducting cognitive interviewing and pretesting. Finalizing question wording, response options, and skip logic. Translating for multilingual administration if required. This phase produces the master questionnaire.

Phase 3 — Sample Procurement: Purchasing or generating the sample frame. Drawing the initial sample. Stratifying for oversampling if planned. This phase produces the sample file with disposition tracking initialized.

Phase 4 — Fielding and Data Collection: The operational phase described in detail throughout this chapter. CATI programming, interviewer training, panel questionnaire upload, mail production, daily monitoring, refusal conversion. This phase produces the raw data file and the field period log.

Phase 5 — Data Processing: Data cleaning, coding, weighting. Producing the final analytic dataset with codebook and data dictionary. This phase produces the clean, weighted dataset ready for analysis.

Phase 6 — Analysis and Reporting: Statistical analysis, visualization, interpretation, and client reporting. For public polls, this phase includes the press release, topline tables, crosstabs, and methodology disclosure.

Phase 7 — Archiving: Depositing the data (often in anonymized, de-identified form) in an archive such as the Roper Center or the Inter-university Consortium for Political and Social Research (ICPSR). This phase ensures the data is available for secondary analysis, replication, and longitudinal research that will use today's poll as a data point in tomorrow's trend study.

Each phase introduces potential errors and biases that propagate forward. A flaw in the sampling frame (Phase 3) cannot be fully corrected by excellent data processing (Phase 5). A poorly worded question (Phase 2) will produce measurement error that no amount of weighting or cleaning can remove. The survey lifecycle is a chain whose data quality is limited by its weakest link — which is why professional survey organizations invest in quality control at every phase, not just at the fielding stage.


9.13 Cost, Speed, and Quality: The Eternal Triangle

Every survey research project operates within a constraint triangle: cost, speed, and data quality. Improvements in any one dimension typically require sacrifices in at least one other. Understanding this triangle is essential for analysts who must communicate methodological trade-offs to clients who are often most focused on cost and speed.

Cost: The dominant cost drivers in survey research are interviewer time (for CATI and face-to-face), sample procurement (for high-quality frames), and field time (longer fields require more management and monitoring). Online opt-in surveys are cheap because they eliminate interviewers and use inexpensive non-probability samples — but those savings come from sacrificing probability-sample quality. Mail surveys are moderately expensive due to printing and postage, but expensive in calendar time.

Speed: Campaign polls often have days or hours, not weeks, between the decision to poll and the need for results. This drives choices toward modes with fast turnaround — IVR, opt-in online, short CATI operations. But speed compresses the field window, which can introduce time-of-day sampling biases and prevent adequate callback protocols. The 72-hour poll is a commercial reality; whether it is good science is a separate question.

Quality: Data quality in this context means minimizing all sources of error — coverage error, nonresponse bias, measurement error, and processing error — simultaneously. High-quality designs typically require longer field periods (to enable callbacks and multilingual outreach), probability-based frames (which cost more), live interviewers (which add cost and introduce interviewer effects but also reduce item nonresponse), and comprehensive documentation (which adds analytical time).

🧪 Try This: The Trade-Off Matrix

Draw a three-column matrix with headers: "Design Choice," "Quality Impact," "Cost/Speed Impact." For each of the following design choices, fill in both impact columns: (1) reducing field period from 7 days to 3 days; (2) switching from CATI to IVR; (3) adding Spanish-language option to a CATI survey; (4) increasing sample size from 400 to 800; (5) weighting on party registration in addition to standard demographics.

Vivian's answer to every question about survey design begins with this matrix. It is not that trade-offs are bad — trade-offs are unavoidable. It is that good analysts make trade-offs explicitly and report them transparently, rather than accepting them silently and presenting results as if no sacrifice had been made.


9.14 International Perspectives on Survey Methodology

Survey research is a global practice, but the specific challenges of phone, online, and mail modes play out very differently across national contexts. Understanding how methodology travels — and where it breaks down — is increasingly important as political analytics becomes a global field.

Cell phone dominance: The United States is unusual in maintaining a substantial landline telephone infrastructure. In most of sub-Saharan Africa, South Asia, and much of Latin America, landlines are rare or nonexistent — virtually all telephone ownership is mobile. In these contexts, RDD of mobile numbers is the only feasible telephone approach, and TCPA-style restrictions on auto-dialing do not apply. This enables some survey research that is methodologically impossible in the U.S. but creates its own challenges (urban-rural differential in mobile ownership, cost of data for survey-taking in low-income contexts).

Internet access: Online surveys assume meaningful internet penetration. In countries where internet access is concentrated among urban, educated, higher-income populations, online surveys face coverage problems far more severe than in the United States. Probability-based online panels that provide internet access to offline households — the KnowledgePanel/AmeriSpeak model — are not financially viable in most developing country contexts.

Trust in survey institutions: Response rates depend partly on social trust in the institutions conducting surveys. Countries with strong traditions of civic participation and trust in academic institutions tend to have higher response rates than those with histories of authoritarian surveillance, where survey participation may be associated with government monitoring. Post-conflict societies may have particular sensitivities around questions that touch on ethnic or political identity.

🌍 Global Perspective: The Arab Barometer

The Arab Barometer is a network of scholars conducting public opinion surveys across the Arab world using face-to-face interviews — a mode chosen specifically because phone and online coverage are insufficient to reach representative samples in many Arab countries. The challenges of face-to-face interviewing in politically sensitive contexts (questions about trust in government, support for political change) make interviewer training, community trust-building, and safety protocols paramount. The substantive findings of Arab Barometer surveys — on democratization, religious identity, women's rights — have been influential in regional policy debates precisely because the methodology is rigorous enough to generate credible evidence in difficult research environments.


9.15 Measurement Shapes Reality: The Chapter's Organizing Theme

The chapter's themes converge on a principle that will recur throughout this textbook: measurement shapes the reality it purports to describe.

Survey methodology is not a neutral conduit between public opinion and its measurement. The choice of mode affects who participates, which affects what is measured. The interviewer's presence changes answers. The question's position in a questionnaire changes what respondents think about before answering. Response options define the universe of legitimate responses. The response rate and weighting scheme determine whose voices are amplified.

None of these effects are catastrophic when managed carefully. Good survey methodology minimizes distortions, documents them honestly, and interprets results with appropriate epistemic humility. But the student of political analytics must internalize a fundamental truth: when you read a poll, you are not reading a transparent window into public opinion. You are reading the output of a complex measurement process shaped by dozens of design decisions made by researchers operating under cost constraints, time pressure, and their own tacit assumptions about who the public is.

Trish puts it simply, cleaning up the whiteboard at the end of the mode design meeting: "Every choice we make is a trade-off. There's no pure data. There's only more or less carefully collected data, and more or less honest disclosure about the limitations."

Carlos writes this down. It becomes the heading on a document he'll return to throughout his career: "Survey methodology is applied epistemology."


Summary

This chapter traced the journey from survey instrument to clean dataset. We examined the full landscape of survey modes and their trade-offs; the mechanics and meaning of response rate decline; the mechanisms of nonresponse bias and how it is assessed; the distorting effects of interviewers and social desirability; the unique challenges of panel conditioning in longitudinal research; the logistical complexity of multi-mode field operations; and the data cleaning and documentation practices that separate research-grade data from noise. Throughout, we kept in view the political stakes of methodological choice: every operational decision shapes who gets counted and whose voice appears in the data that drives political analysis.

In Chapter 10, we turn from producing polls to evaluating them — learning to read methodology disclosures critically, detect systematic pollster biases, and use Python to analyze polling data with the rigor that good interpretation demands.


Key Terms

Computer-Assisted Telephone Interviewing (CATI): A telephone survey mode in which interviewers read questions displayed on a screen and record responses directly into a database, with automated skip-pattern routing.

Interactive Voice Response (IVR): A survey mode using pre-recorded questions and keypad responses, without live interviewers; in the United States, effectively limited to landlines because the TCPA restricts automated calls to cell phones.

Response Rate (RR1–RR6): AAPOR's family of standardized response rate formulas differing in treatment of partial interviews and cases of unknown eligibility.

Cooperation Rate: The proportion of eligible respondents who, once contacted, complete the interview; isolates the persuasion challenge from the contact challenge.

Contact Rate: The proportion of sampled cases for which contact with a household member is made; isolates the reach challenge.

Nonresponse Bias: Systematic difference between survey respondents and nonrespondents on variables of interest, distinct from mere low response rates.

Panel Conditioning: Changes in respondents' attitudes, awareness, or behavior attributable to repeated participation in a survey panel.

Interviewer Effect: Distortion of survey responses due to characteristics of or interactions with the interviewer, including demographic matching effects and expectancy effects.

Social Desirability Bias: The tendency to report opinions or behaviors believed to be viewed favorably by interviewers or society, rather than reporting honestly.

Refusal Conversion: Re-contacting initial refusals for a second attempt to obtain participation, typically by a more experienced interviewer or at a different contact time.

Data Cleaning: The process of identifying and resolving errors, inconsistencies, and implausible values in raw survey data before analysis.

Codebook: A document specifying variable names, question text, response options, code values, and protocols for handling ambiguous responses.

Multi-Mode Survey: A survey design that collects data through two or more distinct modes (e.g., CATI plus online) to improve coverage and reduce mode-specific biases.