
Learning Objectives

  • Identify and correct question wording problems including leading questions, loaded language, and double-barreled questions
  • Explain why question wording choices construct rather than merely discover public opinion
  • Design appropriate response scales for different types of political questions
  • Describe order effects and how to mitigate them through questionnaire design
  • Apply sensitive question techniques including list experiments and randomized response
  • Construct a complete questionnaire with appropriate screening, branching, and flow logic
  • Evaluate a survey instrument for validity threats using a systematic checklist
  • Apply cognitive interviewing and pretesting methods to improve instrument quality
  • Design longitudinal survey instruments that enable valid tracking of opinion change over time

Chapter 7: Survey Design: From Questions to Questionnaires

The conference room at Meridian Research Group was smaller than it looked in the company photos — four chairs, a whiteboard that had never been fully erased, and a coffee machine that Trish McGovern had personally purchased after the last one died. It was 9:00 a.m. on a Tuesday, and the team was staring at a blank Google Doc titled "Garza-Whitfield Poll — DRAFT QNR v1."

"So," said Carlos Mendez, looking at the empty document. "How do we start?"

Trish McGovern, who had been running field operations for sixteen years, poured herself a coffee without looking up. "We start by arguing about question two, which is always worse than question one, so we argue about question two first."

Dr. Vivian Park ignored this. "We start by deciding what we're trying to measure," she said, pulling up a chair. "Because everything else — every word choice, every response scale, every question order — flows from that decision."

What followed was three hours of exactly the kind of argument Trish had predicted. It was also, Carlos would later reflect, the most useful education he had received since arriving at Meridian. By the time they emerged with a draft questionnaire, he understood something he hadn't before: survey design is not a technical procedure. It is a series of consequential choices, each of which shapes the reality the survey will report.

This chapter takes you inside that process.


Why Question Wording Matters More Than You Think

Before we get to techniques, you need to internalize one foundational fact: the question is not a window onto an opinion that already exists in respondents' heads. The question is part of what produces the opinion you measure.

We established the theoretical basis for this in Chapter 6. Here we get practical. The wording, framing, and structure of a question systematically shape the responses you receive — not because respondents are being dishonest, but because the questions activate different considerations, different reference points, and different social norms.

The most famous demonstration involves two phrases for the same policy:

Version A: "Do you support government welfare programs?"

Version B: "Do you support government assistance to the poor?"

Across hundreds of replications, Version B consistently generates 10-20 percentage points more support than Version A. The policies being asked about are substantively identical. The populations being asked are identical. The only thing that changes is the word used to describe the policy. "Welfare" activates a cluster of associations — dependency, taxpayer burden, racial stereotypes in many respondents — that "assistance to the poor" does not. Version B activates a different cluster — charity, obligation, struggling families.

Neither version is "the truth." Both are partial framings of a complex policy reality. But the gap between them is enormous by any political standard. A candidate who ran on a platform "opposing welfare" and a candidate who ran on a platform "opposing assistance to the poor" would be running on very different political brands, even if their actual policy positions were identical.

📊 Real-World Application: The Tax Vocabulary Wars

Frank Luntz, the Republican pollster and messaging strategist, made a career out of the political power of word choice. His recommendation to use "death tax" instead of "estate tax" is one of the most-studied examples in political communication. Research found that "death tax" framing increased opposition to the estate tax by roughly 10-15 percentage points. Democrats who tried "inheritance tax" saw intermediate results. The same policy, the same respondents, three different vocabulary choices, three meaningfully different levels of measured opposition. This is not manipulation — it is the nature of measurement interacting with the nature of opinion. But it does mean that the choice of vocabulary in a survey question is a political act, even when the surveyor intends neutrality.


The Taxonomy of Bad Questions

Every survey methodology textbook has a taxonomy of question problems. Memorize this taxonomy, because you will encounter all of these problems in polls you read, in surveys you are given to review, and in your own draft questionnaires.

1. Leading Questions

A leading question signals to the respondent what the "correct" or expected answer is, thereby increasing social pressure to conform and inflating estimates of agreement with the implied position.

Bad: "Do you support Tom Whitfield's bold plan to cut wasteful government spending?"

Better: "Tom Whitfield has proposed reducing federal spending by 15%. Do you support or oppose this proposal?"

The word "bold" is a positive evaluative cue. "Wasteful" pre-judges the spending. A respondent who might otherwise be uncertain is nudged toward agreement. Even "do you support or oppose" is better than "do you support," because it explicitly legitimizes opposition as a response option.

Leading questions appear in push polls — phone calls that masquerade as polling while actually delivering opposition research messages. A push poll might ask: "If you knew that Candidate X had been cited for ethics violations, would you be more or less likely to vote for her?" This is not a measurement instrument; it is a persuasion instrument wearing a survey's clothing. Real polling organizations do not conduct push polls, and you should be suspicious of any poll result that came from a question with this structure.

2. Loaded Questions

A loaded question contains assumptions embedded in its phrasing that respondents may not share — and accepting the question at face value forces them to accept the assumption.

Bad: "When did you stop ignoring climate change?"

Bad: "How do you think the federal government's failed border policy has affected crime rates?"

The first assumes the respondent has been ignoring climate change. The second assumes the border policy has failed and that it has affected crime rates. Any answer to either question implicitly accepts these contested premises.

Loaded questions are common in poorly designed polls that are fishing for a predetermined result. They are also depressingly common in journalistic interviewing, where they often pass without challenge. Training yourself to hear the embedded assumption is a critical analytical skill.

3. Double-Barreled Questions

A double-barreled question asks about two things simultaneously, making it impossible to know which part of the question is driving the response.

Bad: "Do you think the Garza campaign has been honest and effective?"

Bad: "Do you support spending more on education and healthcare?"

The first question asks about two distinct attributes — honesty and effectiveness — that might not both apply. A respondent might believe Garza has been honest but ineffective, or effective but not honest, and have no way to express this distinction. The second question conflates two separate spending priorities; a respondent who supports healthcare spending but not education spending has no valid response option.

Better: Separate these into two questions.

"Do you think the Garza campaign has been honest?" "Do you think the Garza campaign has been effective?"

The extra question takes ten seconds. The clarity gained is worth far more.

4. Vague Questions

Vague questions use terms that different respondents may interpret in fundamentally different ways, producing a distribution of responses that reflects not opinion differences but interpretation differences.

Bad: "Do you think the government should do more about crime?"

"More" is vague. More than what? "Crime" is vague. Property crime? Violent crime? White-collar crime? "Do something about" is vague. Prosecution? Prevention? Treatment? A respondent who supports more police funding and a respondent who supports more mental health services might both answer "yes" — but they hold diametrically opposed policy views.

Better: Be specific. Ask about specific policies, specific programs, specific amounts. If you need to ask a general question (sometimes you do, for tracking purposes), acknowledge in your analysis that it is measuring a general sentiment rather than a policy preference.

5. Acquiescence Bias and "Agree/Disagree" Scales

Acquiescence bias — also called "yes-saying" — is the tendency of some respondents to agree with any statement they are asked about, regardless of content. This is particularly pronounced among respondents with lower educational attainment and in telephone interviews where social pressure to seem agreeable is high.

Agree/disagree questions ("Do you agree or disagree that...") are vulnerable to acquiescence bias because agreement is always the "positive" response. Research consistently shows that reversing the direction of the statement — asking both "A is better than B" and "B is better than A" in different randomized versions — produces a different distribution of responses than asking only one direction.

⚠️ Common Pitfall: The All-Agree Questionnaire

A questionnaire that presents a series of policy statements and asks respondents to agree or disagree will systematically overstate support for every statement, because some respondents will agree with all of them. If you need to use agree/disagree format, balance the questionnaire so that agreeing with some statements implies a liberal position and agreeing with others implies a conservative position. This helps identify true acquiescers and corrects the bias.
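
To make the correction concrete, here is a minimal analysis-stage sketch in Python, using hypothetical item names and toy data: items worded in the conservative direction are reverse-coded before averaging, so a pure acquiescer who agrees with everything lands at the scale midpoint rather than at an ideological extreme.

```python
# Balanced agree/disagree scoring: a minimal sketch (hypothetical items and data).
# Items are scored 1 (strongly disagree) to 5 (strongly agree). Half the items
# are worded so agreement implies a liberal position, half a conservative one.

REVERSE_CODED = {"govt_spends_too_much", "regulation_hurts_business"}

def balanced_score(responses: dict[str, int], scale_max: int = 5) -> float:
    """Average across items after reverse-coding conservative-direction items."""
    recoded = [
        (scale_max + 1 - value) if item in REVERSE_CODED else value
        for item, value in responses.items()
    ]
    return sum(recoded) / len(recoded)

# A pure acquiescer (all 5s) scores at the midpoint on a balanced battery:
acquiescer = {"expand_healthcare": 5, "fund_education": 5,
              "govt_spends_too_much": 5, "regulation_hurts_business": 5}
print(balanced_score(acquiescer))  # 3.0, not an ideological extreme
```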


Abstract discussion of question wording problems is less instructive than examining actual failure cases. The following ten examples — drawn from published polls, campaign surveys, and public opinion research — illustrate the taxonomy in practice. Each case presents the flawed original, the specific problem, and an improved version.

Failure 1: The Embedded Adjective

Original: "Do you support Mayor Chen's aggressive crackdown on downtown crime?"

Problem: "Aggressive" is evaluatively loaded — it can be either positive (decisive) or negative (excessive) depending on the respondent's priors. The respondent cannot separate their view of the policy from their view of the adjective.

Corrected: "Mayor Chen has increased police patrols in the downtown area and expanded penalties for certain crimes. Do you support or oppose these changes?"

Why it works: The corrected version describes the specific content of the policy without pre-evaluating it.

Failure 2: The Scale with No Anchor

Original: "How much do you support expanding healthcare coverage? [1 — 2 — 3 — 4 — 5]"

Problem: What do 1 and 5 mean? Without labeling, respondents impose their own interpretations. Some treat 1 as "most support" (like a ranking), others treat 5 as "most support" (like a Likert scale). This is not a measurement instrument — it is a random number generator.

Corrected: "Do you strongly support, somewhat support, somewhat oppose, or strongly oppose expanding healthcare coverage?"

Why it works: Every response option has a substantive label, eliminating interpretive ambiguity.

Failure 3: The Question That Answers Itself

Original: "Most experts agree that climate change poses a serious threat to the economy. Do you agree or disagree with this assessment?"

Problem: "Most experts agree" is a social proof cue that pushes respondents toward agreement, making this functionally a leading question despite the formal "agree or disagree" format.

Corrected: "Do you think climate change poses a serious threat to the U.S. economy, a moderate threat, a minor threat, or no threat at all?"

Why it works: The corrected version presents a range of responses without presupposing any expert consensus.

Failure 4: The Impossible Comparison

Original: "Compared to last year, do you feel that crime, immigration, and the economy have gotten better, stayed the same, or gotten worse?"

Problem: Three separate constructs — crime, immigration, and the economy — each with its own independent trend, crammed into a single response. A respondent who thinks crime is up, immigration is stable, and the economy is better has no valid response option.

Corrected: Three separate questions, each with "better / stayed the same / gotten worse" response options.

Why it works: Analytic clarity about which attitude is driving which response is only possible when questions are separated.

Failure 5: The False Balance

Original: "Some people say we should invest more in renewable energy; others say we should support the fossil fuel industry. Which do you agree with?"

Problem: This frames a complex policy landscape as a binary opposition. Many respondents support both (a reasonable "all-of-the-above" energy position) or think the question misrepresents the relevant policy choice entirely.

Corrected: "How important is each of the following as a national energy priority? [Renewable energy expansion] [Maintaining current fossil fuel production] [Reducing energy costs] [Energy independence from foreign sources]" — each rated on a 4-point importance scale.

Why it works: The corrected version respects the multi-dimensional nature of energy policy rather than forcing a false choice.

Failure 6: The Hypothetical Without Substance

Original: "If a candidate supported free college tuition, would you be more likely, less likely, or equally likely to vote for them?"

Problem: The question provides no information about what "free college tuition" means — who pays, who qualifies, what institutions are covered. Respondents are responding to a phrase, not a policy. The answers tell you about the appeal of the slogan, not the policy.

Corrected: "A proposal would make public community colleges and four-year universities free for students whose family income is below $125,000 per year, funded by a tax on financial transactions. Would you strongly support, somewhat support, somewhat oppose, or strongly oppose this proposal?"

Why it works: Respondents can form a genuine opinion about a described policy rather than a marketing phrase.

Failure 7: The Jargon Trap

Original: "Do you support or oppose the reconciliation bill currently before the Senate?"

Problem: "Reconciliation bill" is a legislative procedure term that most members of the public do not recognize. Responses will reflect (a) random noise from respondents guessing, and (b) partisan cueing for respondents who know only that their party's leadership supports or opposes it.

Corrected: Either describe the bill's content ("A bill that would increase spending on healthcare and reduce the deficit by raising taxes on corporations...") or ask about the policy rather than the legislation.

Why it works: Survey questions should be intelligible to respondents who have not been following legislative proceedings closely.

Failure 8: The Inverted Scale

Original (on a web form): "Rate your agreement with the following statements: [Statement 1: Government should spend more on education] [Statement 2: Government spends too much overall]" — both rated on 5-point scales where 5 = Strongly Agree.

Problem: A respondent who consistently agrees with both statements — giving 5s to each — holds logically inconsistent views. But the scale design invites this error. The second statement's "agreement" direction conflicts with the first's.

Corrected: Balance scales so that agreement with all items implies a consistent ideological position — or use forced-choice formats that make the trade-off explicit.

Why it works: Respondents should not be structurally invited to produce logically inconsistent response patterns.

Failure 9: The Temporal Confusion

Original: "Have you always supported Candidate Garza?"

Problem: "Always" is a recall question that asks respondents to accurately remember past attitudes. Political science research consistently shows that people misremember their prior opinions in the direction of their current views (a phenomenon called "attitude reconstruction"). This question produces inaccurate data about attitude change.

Corrected: "Thinking back to six months ago, did you support Candidate Garza, oppose her, or were you undecided?" — though even this improved version is subject to reconstruction bias. The only reliable way to track attitude change is with a panel survey that measures attitudes at multiple time points.

Why it works: The corrected version acknowledges the limitation of retrospective attitude questions; the best solution is panel design.

Failure 10: The Courtesy Bias Trap

Original (phone interview): "The interviewer you're speaking with today — was he or she professional and courteous?"

Problem: This question is asked by the interviewer who conducted the survey, creating an obvious social pressure to answer "yes." The question cannot produce valid data about interview quality from any in-person or phone administration.

Corrected: This question should appear in a separate follow-up email survey, not in the interview itself.

Why it works: Question administration context shapes social desirability pressure; sensitive questions require contexts that minimize that pressure.


Response Scale Design

After question wording, the most consequential design decision is the response scale. Different scale formats measure different aspects of opinion and are appropriate for different question types.

Likert Scales

The Likert scale, developed by Rensis Likert in 1932, asks respondents to indicate their degree of agreement or disagreement with a statement, typically on a 5-point or 7-point scale:

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

Likert scales are the workhorse of survey research. They have several important properties:

Midpoint inclusion. Including a "neither" or "don't know" midpoint gives respondents who genuinely have no opinion a valid response option, reducing the pressure to manufacture one. Some researchers prefer to omit the midpoint to force a direction, but this generally distorts estimates by forcing the non-opinionated into an artificial choice.

Labeling all points. Scales where only the endpoints are labeled ("1 = Strongly Agree, 5 = Strongly Disagree") can be misinterpreted. Labeling all points reduces ambiguity.

Five vs. seven points. Five-point scales are more common in political polling because they are easier for respondents to navigate by phone. Seven-point scales offer more resolution and are standard in academic survey research. The choice depends partly on mode and partly on how much variance you expect in the construct you're measuring.

When to Use 4-Point, 5-Point, or 7-Point Scales

The choice of scale length is not arbitrary — it interacts with the question type, the administration mode, and the analytical purpose.

4-point scales (no midpoint): Best when you need a forced direction. Asking "Would you say the state's economy is getting much better, somewhat better, somewhat worse, or much worse?" forces respondents to commit to a directional assessment. The absence of a neutral midpoint is appropriate here because the analytical question is whether conditions are improving or deteriorating, not whether respondents have an opinion. Appropriate for: economic assessments, favorability forced-choice questions, any context where "no opinion" would be analytically problematic.

5-point scales (with midpoint): The most common format in political polling. The midpoint ("neither / don't know") is appropriate when genuine neutrality or lack of opinion is a substantively meaningful response. A 5-point scale is manageable in telephone administration — respondents can hold five labeled options in working memory. Appropriate for: Likert-type attitude items, approval ratings, policy support/opposition, any question where genuine non-opinions are expected and analytically relevant.

7-point scales (high resolution): Standard in academic survey research, particularly in the ANES (American National Election Studies). The 7-point party identification scale, ideology scale, and issue preference scales provide finer-grained discrimination that enables more powerful statistical analyses. The tradeoff is cognitive load: 7-point scales are harder to administer by phone (respondents may struggle with "on a scale from 1 to 7, where 1 is..."), though they work well in web surveys where a visual slider or set of labeled radio buttons can be presented. Appropriate for: academic surveys, web-administered research, any context where fine-grained variation in the construct matters analytically.

💡 The Reliability-Validity Trade-off in Scale Length

Longer scales (more response categories) generally produce higher reliability (less measurement error from random guessing about which category applies) but may produce lower validity if respondents cannot meaningfully distinguish between adjacent categories. Research by Krosnick and Fabrigar (1997) suggests that fully labeled scales of roughly five to seven points maximize reliability, with the longer formats paying off mainly when respondents can see the options; shorter scales perform comparably in telephone administration, where visual anchors are unavailable. The practical recommendation: match scale length to mode, and validate your scale choice with cognitive interviewing.

📊 Real-World Application: The ANES Party Identification Scale

The American National Election Studies party identification question has used a seven-point scale since 1952: "Generally speaking, do you usually think of yourself as a Republican, a Democrat, an Independent, or what?" followed by probe questions that place respondents on a scale from Strong Democrat (1) to Strong Republican (7). This scale has been used continuously for over seventy years, making it one of the most-studied and validated scales in political science. Its longevity is a testament to the value of consistent measurement — you can track the distribution of partisan identification across seven decades because the question has remained constant.

Feeling Thermometers

The feeling thermometer is one of political science's most distinctive measurement tools. Respondents are asked to rate their feelings toward a person, group, or institution on a scale from 0 to 100 degrees, where 0 means "very cold or unfavorable" and 100 means "very warm or favorable," with 50 representing "no feeling."

Feeling thermometers have several advantages over agree/disagree scales for measuring affect toward political figures and groups:

  • The 0-100 range captures fine-grained variation that a 5-point scale would miss.
  • The temperature metaphor is accessible and understood across educational levels.
  • The 50-point neutral midpoint has intuitive meaning (neither warm nor cold).

The ANES has used feeling thermometers to measure attitudes toward candidates, parties, and social groups continuously since 1964. A critical finding: the gap between in-party and out-party feeling thermometer ratings has grown substantially since the 1980s, providing one of the cleanest empirical signatures of affective polarization — the increasing tendency of partisans to dislike the opposing party, independent of policy disagreement.
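
The affective polarization signature is simple arithmetic on thermometer scores: mean in-party rating minus mean out-party rating. A minimal sketch with hypothetical ratings:

```python
# Affective polarization gap from feeling thermometer data (hypothetical ratings).
import statistics as stats

def polarization_gap(in_party: list[float], out_party: list[float]) -> float:
    """Mean in-party thermometer minus mean out-party thermometer (0-100 scale)."""
    return stats.mean(in_party) - stats.mean(out_party)

# Four hypothetical partisans rating their own party and the opposing party:
print(polarization_gap([75, 80, 85, 70], [30, 25, 20, 35]))  # 50.0 degrees
```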

Branching (Follow-Up) Questions

Rather than using a 7-point scale in a single question, many political surveys use branching: ask a simple directional question first, then probe for intensity.

Main question: "Do you approve or disapprove of the job Maria Garza is doing as a candidate?"

Follow-up (if approve): "Is that strongly approve or just somewhat approve?"

Follow-up (if disapprove): "Is that strongly disapprove or just somewhat disapprove?"

This produces a five-point scale (Strongly Approve / Somewhat Approve / Neither or Don't Know / Somewhat Disapprove / Strongly Disapprove) while making the question easier to answer by phone, where respondents can't see a visual scale. The branching format is standard in most commercial political polling.
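
A minimal sketch of how the two-step phone responses collapse into the five-point scale; the response codes here are hypothetical, since every CATI system uses its own coding:

```python
# Collapsing a branched approval question into a 5-point scale (hypothetical codes).

def five_point_approval(main: str, intensity: str | None) -> str:
    """Map (directional answer, intensity probe) onto the 5-point approval scale."""
    if main == "approve":
        return "Strongly approve" if intensity == "strongly" else "Somewhat approve"
    if main == "disapprove":
        return "Strongly disapprove" if intensity == "strongly" else "Somewhat disapprove"
    return "Neither / Don't know"  # no intensity probe is asked in this case

print(five_point_approval("approve", "somewhat"))  # Somewhat approve
print(five_point_approval("neither", None))        # Neither / Don't know
```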

Forced Choice vs. Scale

For horse-race questions — who would you vote for if the election were held today — the standard format is forced choice between named candidates, not a scale:

"If the election for U.S. Senate were held today, and the candidates were Maria Garza, the Democrat, and Tom Whitfield, the Republican, for whom would you vote?"

Forced choice is appropriate here because the actual election is forced choice. Including a "both equally" option or a neutral midpoint would create a false equivalence that doesn't exist in the voting booth. For undecided respondents, follow-up probes ("Which way are you leaning?") capture soft preference without forcing an artificial commitment.


Order Effects: How What Comes Before Shapes What Comes After

Even perfectly worded questions with perfectly designed response scales can produce misleading results if they appear in the wrong order. Order effects — the influence of earlier questions on later responses — are among the most reliably documented phenomena in survey methodology.

Question Order Effects

Earlier questions prime later responses. If you ask respondents to rate their feelings about immigrants before asking about immigration policy, you will get different policy responses than if you ask the policy question first. The earlier questions make certain considerations more accessible, exactly as Zaller's RAS model predicts.

This is not always a bad thing. If your theoretical interest is in how affective attitudes toward a group shape policy preferences, you might deliberately order questions to measure affect before policy — not to distort responses, but to capture the causal process as it occurs in real political thought.

The problem arises when order effects are unintended and unacknowledged. A survey that asks "Is climate change a serious threat?" before asking "Do you support the Green New Deal?" will generate higher GND support than one that asks the questions in reverse order — even if the survey's sponsor believes both numbers represent independent public views.

Best practice: For high-stakes policy questions, consider experimenting with question order in split-sample designs, where half the sample receives questions in order A-B and the other half receives them in order B-A. The difference between the two versions is your estimate of the order effect.
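
A sketch of the analysis that design supports, with hypothetical counts: the gap in support between the two orderings is the order-effect estimate, and a conventional large-sample interval conveys its uncertainty.

```python
# Estimating an order effect from a split-sample design (hypothetical counts).
import math

def order_effect(yes_ab: int, n_ab: int, yes_ba: int, n_ba: int):
    """Difference in proportions between orderings, with an approximate 95% CI."""
    p1, p2 = yes_ab / n_ab, yes_ba / n_ba
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n_ab + p2 * (1 - p2) / n_ba)
    return diff, (diff - 1.96 * se, diff + 1.96 * se)

# e.g., 58% support when the climate-threat question comes first,
# 51% when the order is reversed (made-up numbers, 500 per half-sample):
diff, ci = order_effect(290, 500, 255, 500)
print(f"order effect = {diff:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```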

Response Order Effects: Primacy and Recency

Even when question wording and order are fixed, the order of response options can influence choices. Two competing effects operate in different contexts:

Primacy effect: In visual (web or paper) surveys, respondents sometimes disproportionately select the first option they see, particularly when they are tired or uncertain.

Recency effect: In telephone (audio) surveys, respondents sometimes disproportionately select the last option they heard, because it is most recently in working memory.

The conventional remedy is to randomize response option order across respondents when the options have no inherent ordering. For a candidate choice question, you might rotate which candidate is listed first. For a list of policy priorities, you might randomize the order of the list.

Note that some scales have inherent ordering (Strongly Agree through Strongly Disagree) and should not be randomized. The point is to randomize when order is arbitrary, not to introduce randomness for its own sake.
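
A sketch of the randomize-only-when-arbitrary rule. In production the generator would be seeded from the respondent ID so the version each respondent saw is reproducible and recorded:

```python
# Randomizing response order only where order is arbitrary.
import random

def presented_options(options: list[str], ordinal: bool, rng: random.Random) -> list[str]:
    """Return the option order shown to one respondent; record it for analysis."""
    if ordinal:
        return options  # inherent ordering (e.g., Likert): never randomize
    shuffled = options[:]
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(20471)  # in practice, seed per respondent ID
print(presented_options(["Maria Garza", "Tom Whitfield"], ordinal=False, rng=rng))
print(presented_options(["Strongly agree", "Agree", "Neither", "Disagree",
                         "Strongly disagree"], ordinal=True, rng=rng))
```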

🧪 Try This: Design a Split-Sample Test

Take any survey question you've seen recently and rewrite it three ways: (1) the original version, (2) with different wording that frames the issue from the opposite direction, (3) with response options in reverse order. Now think about how you would design a study to test whether these versions produce different responses, and what you would conclude if they did.


Measuring Sensitive Topics: When Respondents Won't Tell You the Truth

Social desirability bias — covered in Chapter 6 — is not just a theoretical problem. In political polling, it creates specific, practically important measurement failures. Respondents understate racial prejudice, overstate civic engagement, underreport support for stigmatized candidates, and overstate their agreement with dominant social norms.

For these situations, researchers have developed specialized techniques designed to give respondents "permission" to answer honestly by creating structures that protect their anonymity or provide plausible deniability.

The List Experiment (Item Count Technique)

The list experiment is the most widely used technique for measuring sensitive political attitudes. Its logic is elegant:

Respondents are randomly assigned to one of two conditions. The control group receives a list of N non-sensitive statements and is asked to indicate how many of them apply to them (not which ones — just the count). The treatment group receives the same list plus one additional sensitive item, and is similarly asked for a count.

Since respondents are not asked to identify which items they endorsed, they have no reason to conceal endorsement of the sensitive item. The average count in the treatment group minus the average count in the control group estimates the proportion of respondents who endorsed the sensitive item.

Control group receives: "Here is a list of four things that some people believe. How many of them do you believe? (Don't tell me which ones, just tell me how many.)"

  1. The government should be allowed to wiretap phone calls with suspected terrorists.
  2. Prayer should be allowed in public schools.
  3. Capital punishment is sometimes justified.
  4. The U.S. should have stricter trade policies.

Treatment group receives the same list, plus: 5. I would be uncomfortable if my child's teacher was openly gay.

If the control group averages 2.3 items and the treatment group averages 2.7 items, the estimate is that 40% endorse the sensitive item — even if only 10% would directly say so.
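
A sketch of the difference-in-means estimator behind that calculation, with toy data chosen to reproduce the 2.3 and 2.7 means above:

```python
# Difference-in-means estimator for a list experiment (toy data).
import math
import statistics as stats

def list_experiment_estimate(control: list[int], treatment: list[int]):
    """Estimated endorsement proportion for the sensitive item, with its SE."""
    estimate = stats.mean(treatment) - stats.mean(control)
    se = math.sqrt(stats.variance(treatment) / len(treatment)
                   + stats.variance(control) / len(control))
    return estimate, se

control = [2, 3, 2, 2, 3, 2, 3, 2, 2, 2]    # mean 2.3
treatment = [3, 3, 2, 3, 3, 2, 3, 3, 2, 3]  # mean 2.7
estimate, se = list_experiment_estimate(control, treatment)
print(f"estimated endorsement = {estimate:.2f} (SE = {se:.2f})")
```

With only ten respondents per group, the standard error swamps the 0.40 estimate, which previews the large-sample requirement discussed next.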

List experiments require large samples (because you are estimating a difference between group means, and that estimate has variance) and careful attention to design (the non-sensitive items should be politically diverse and roughly equally appealing across the spectrum, to avoid ceiling/floor effects). But they have been used successfully to measure racial resentment, support for extreme candidates, and other politically sensitive positions.

Randomized Response Technique

The randomized response technique (RRT), developed by Stanley Warner in 1965, uses a different strategy: introduce a random element that makes individual responses uninterpretable, while allowing aggregate estimation of sensitive behavior.

The most common version works like this: respondents are asked to flip a coin secretly. If it comes up heads, they answer "yes" regardless of the true answer. If it comes up tails, they answer truthfully. Since the interviewer doesn't know the coin flip result, a "yes" answer is ambiguous — it might be the honest answer or the forced "heads" response. This gives respondents genuine protection, because their true answer cannot be identified from their observed answer.

Aggregate population proportions can still be estimated because the probability structure is known. If we know that p(heads) = 0.5 and we observe a "yes" proportion of 0.65 in the sample, we can estimate the true "yes" proportion as approximately 0.30 [(0.65 - 0.5) / 0.5].
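
A sketch of that back-calculation, including the standard error, which the design inflates relative to a direct question (the sample size here is hypothetical):

```python
# Forced-response randomized response estimator.
# Observed yes-rate y satisfies y = p + (1 - p) * pi, where p is the
# probability of a forced "yes" and pi is the true prevalence.
import math

def rrt_estimate(yes_count: int, n: int, p_forced: float = 0.5):
    """Back out the sensitive-trait prevalence and its standard error."""
    y = yes_count / n
    pi_hat = (y - p_forced) / (1 - p_forced)
    # SE inflates by a factor of 1 / (1 - p_forced) versus a direct question:
    se = math.sqrt(y * (1 - y) / n) / (1 - p_forced)
    return pi_hat, se

# The worked example above: 65% observed "yes" under a fair coin:
pi_hat, se = rrt_estimate(650, 1000)
print(f"estimated prevalence = {pi_hat:.2f} (SE = {se:.3f})")  # 0.30
```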

RRT has been used to estimate prevalence of drug use, illegal behavior, and, in political contexts, unpopular political views. It is more methodologically complex than direct questioning and requires explanation to respondents, which introduces its own measurement challenges. But when social desirability bias is severe, it provides substantially better estimates than direct questions.

⚠️ Common Pitfall: Assuming Sensitive Question Techniques Are Always Better

List experiments and RRT produce less precise estimates than direct questions (because they introduce design noise). They are only worth the added complexity when social desirability bias is substantial. For questions where stigma is low, direct questions are preferable. Before deploying a sensitive question technique, ask: what is the evidence that respondents would lie in response to a direct question on this topic? If the answer is "I'm not sure they would," use direct questioning.


Cognitive Interviewing and Pretesting Methods

No matter how carefully you design a questionnaire, you will get things wrong. Questions you thought were clear will confuse respondents. Response options you thought were exhaustive will leave respondents searching for the option that matches their view. Order effects you didn't anticipate will distort your results.

Cognitive interviewing is the standard method for identifying these problems before they contaminate your data. In a cognitive interview, you administer the questionnaire to a small sample of respondents (8-12 is usually sufficient) who are asked to think aloud as they answer — to narrate what they understand the question to be asking, what information they are drawing on to answer it, and how they are mapping their response onto the scale.

The results are often humbling. Questions that seemed perfectly clear in the conference room generate confusion in cognitive interviews. Response options that seemed to cover all possibilities leave respondents without a valid choice. Complex skip logic confuses interviewers and respondents alike.

The Cognitive Interview Protocol in Practice

A well-run cognitive interview follows a structured protocol:

Concurrent verbal probing: The interviewer asks the respondent to think aloud while answering each question. "Before you give me your answer to Question 6, tell me what you understand the question to be asking." The interviewer records the respondent's interpretation and notes any discrepancies from the intended meaning.

Retrospective probing: After the respondent answers, the interviewer follows up with probes designed to surface latent confusion. Common probes include:

  • "In your own words, what did that question mean to you?"
  • "How did you decide on that answer?"
  • "When you heard the phrase [X], what came to mind?"
  • "Was there any part of that question that was unclear or confusing?"
  • "Were any of the response options difficult to choose between?"

The "surprised" test: The interviewer tracks any response that surprises them — cases where the respondent's answer or interpretation differs from what the question intended. Every surprise is a data point about questionnaire failure. If five of twelve cognitive interview respondents interpret "economic security" to mean something different from what the researcher intended, the question needs revision regardless of how clear it seemed in the conference room.

Meridian's standard practice is three stages of pretesting:

  1. Internal review: The full team reads the questionnaire aloud, challenging every wording choice.
  2. Cognitive interviews: 8-10 interviews with respondents who match the target population.
  3. Small pilot: A 50-person pilot fielding to check skip logic, timing, and item nonresponse.

"The hardest thing to teach junior analysts," Trish told Carlos at the end of the day, "is that a question you spent two hours writing can still be garbage. You have to be willing to kill your darlings."

Carlos looked at the draft questionnaire he'd spent the morning on. "Anything I should cut?"

"Q13," Trish said without hesitation. "It's double-barreled. You're asking about economic security and healthcare in the same question. Split them."

She was right. Carlos hadn't noticed it.

What Cognitive Interviews Cannot Fix

Cognitive interviews identify problems with question comprehension and response mapping. They are limited in several important ways:

Small sample size: 8-12 cognitive interviews are sufficient to identify clear problems, but they cannot reliably detect subtle issues that appear in only 5-10% of the population. Pilot testing with 50+ respondents is necessary to catch rarer problems.

Motivated respondents: People who agree to participate in cognitive interviews tend to be cooperative and linguistically fluent. They may not represent the segment of your target population that is most likely to misinterpret questions or struggle with response categories.

No social pressure replication: A face-to-face cognitive interview is a completely different social context from a telephone survey. Social desirability effects that would appear in the actual survey may not surface during cognitive testing.

Best Practice: The Cognitive Interview Protocol

Conduct cognitive interviews before finalizing any survey instrument that will be fielded broadly. Recruit participants who match your target population in terms of education, partisanship, and geography. Use a combination of verbal probing ("What do you mean when you say you 'somewhat agree'?") and concurrent think-aloud protocols. Document every response that surprises you — surprise is data about where your question design assumptions were wrong.


Questionnaire Architecture: Building the Full Survey

Individual questions are only part of the challenge. How you arrange those questions into a coherent questionnaire — how you open, flow, branch, and close — shapes data quality at least as much as individual question wording.

The Opening

The first question of a survey sets the respondent's expectations, establishes a cooperative frame, and can prime later responses. Conventional wisdom recommends opening with:

  • Simple, engaging, non-threatening questions. Asking about a local issue or a general mood question ("Would you say things in the country are generally headed in the right direction or wrong direction?") is easier than opening with a sensitive policy question.
  • Not demographics. Asking age, income, and race at the very start feels clinical and can cause early dropout.
  • A natural topic flow. If you're polling about a Senate race, start with general political mood, move to candidate familiarity, then to head-to-head choice, then to policy preferences, then to demographics.

The right-track/wrong-track question ("Generally speaking, do you think things in this country are headed in the right direction or wrong direction?") is so commonly used as an opener that its position has become part of its interpretation. Analysts expect it at the front; if you move it to the middle of a questionnaire, you should disclose that and consider whether earlier questions primed a different response than you would otherwise get.

Screening Questions

Many surveys need to screen respondents to ensure they qualify for subsequent questions. This is particularly important in political polling:

  • Are they registered to vote? (For registered voter samples)
  • Are they likely to vote? (For likely voter samples)
  • Have they heard of the candidates? (Before asking favorability)
  • Are they a resident of the relevant geography? (For district-level polls)

Screening questions should come early and be clearly tied to an obvious relevance criterion. Respondents who are screened out should be thanked and the interview ended — not subjected to a lengthy questionnaire that doesn't apply to them.

The construction of likely voter screens is one of the most consequential and contested decisions in political polling. Different organizations use different screens — some ask only about stated intent to vote, others combine stated intent with past voting behavior, others use a composite "likely voter score" from modeled data. The choice of screen can move the headline number by several percentage points.
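
A sketch of what a composite screen can look like. The weights and cutoff below are entirely hypothetical, which is the point: each organization's choices here move the topline.

```python
# A hypothetical composite likely-voter screen (weights and cutoff are made up).

def likely_voter_score(stated_intent: int, voted_last_general: int,
                       voted_last_primary: int) -> float:
    """stated_intent: 0-10 self-report; voted_*: 1 if voted, else 0."""
    return (0.5 * (stated_intent / 10)
            + 0.3 * voted_last_general
            + 0.2 * voted_last_primary)

def passes_screen(score: float, cutoff: float = 0.6) -> bool:
    return score >= cutoff

# A first-time voter who says "10 out of 10" fails this particular screen,
# while a moderately sure repeat voter passes; change the cutoff and both flip.
print(passes_screen(likely_voter_score(10, 0, 0)))  # False (score 0.50)
print(passes_screen(likely_voter_score(7, 1, 0)))   # True  (score 0.65)
```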

Branching and Skip Logic

Branching logic — directing respondents to different subsequent questions based on their answers to earlier ones — allows you to gather richer data without burdening every respondent with irrelevant questions.

Example from the Garza-Whitfield questionnaire:

Q4: "Thinking about the candidates for U.S. Senate, have you heard of Maria Garza?" [Yes/No] → If Yes: Q5a. "Do you have a favorable or unfavorable opinion of Maria Garza?" → If No: Skip to Q6.

This structure allows you to distinguish between two very different groups: those who have a considered view of Garza, and those who don't know who she is. Combining them in a single favorability measure — where "haven't heard of" collapses into "no opinion" — obscures a politically important distinction.

In phone surveys, branching is managed by the interviewer following a script. In online surveys, automated skip logic handles the routing. In either case, the survey designer must map out the complete branching tree before fielding, and the analysis must account for the different paths respondents took.
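
A sketch of the Q4 branching above encoded as an explicit routing table, in the spirit of web-survey skip logic (the data structure is illustrative, not any particular platform's format):

```python
# Skip logic for the Garza favorability branch as a routing table.
SKIP_LOGIC = {
    # (question_id, answer) -> next question_id; None matches any answer
    ("Q4", "Yes"): "Q5a",   # heard of Garza -> ask favorability
    ("Q4", "No"): "Q6",     # never heard of her -> skip favorability
    ("Q5a", None): "Q6",    # any favorability answer -> continue
}

def next_question(current: str, answer: str) -> str:
    """Route to the next question ID for a given answer."""
    return SKIP_LOGIC.get((current, answer),
                          SKIP_LOGIC.get((current, None), "END"))

print(next_question("Q4", "No"))          # Q6
print(next_question("Q4", "Yes"))         # Q5a
print(next_question("Q5a", "Favorable"))  # Q6
```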

Flow and Cognitive Load

Long questionnaires are cognitively fatiguing. Respondents who are tired give lower-quality responses — more "don't know" responses, more acquiescent responses, less careful reading of options. Several design principles help manage cognitive load:

Group related questions. Don't jump between topics. Cover everything about candidate favorability before moving to issue preferences; cover issue preferences before moving to demographics. Random ordering within topics can help with order effects, but random ordering across topic areas creates disorientation.

Vary question formats. A questionnaire that asks the same Likert-scale question forty times in a row is monotonous. Vary between scales, feel thermometers, forced-choice, and open-ended questions.

Keep it as short as possible. Every question costs respondent attention. Before adding a question, ask: will the data from this question meaningfully improve my analysis? If not, cut it.

Put sensitive or personal questions near the end. Demographics (age, income, race, education) and other personal questions feel less intrusive after respondents have established a cooperative relationship with the questionnaire. Ending with demographics also ensures that dropout — which increases with questionnaire length — doesn't systematically bias your demographic profile.


Mobile-First Survey Design

More than 60% of online survey completions now occur on mobile devices — smartphones and tablets. This figure is higher among younger respondents and has increased substantially each year. Designing surveys without thinking about mobile experience is an increasingly serious methodological error.

The Mobile Usability Problem

Surveys designed for desktop completion often fail on mobile in specific, predictable ways:

Long matrix questions. A survey question that shows ten items in rows and five response options in columns — a "grid" format — displays clearly on a large monitor. On a smartphone screen, the same grid either becomes too small to read without zooming, or the columns overflow the screen and require horizontal scrolling, which confuses respondents and increases item nonresponse.

Slider scales. Continuous slider scales (drag to indicate your position) are convenient on desktop, where mouse precision is high. On a touchscreen, the slider is difficult to control precisely, leading to inaccurate responses concentrated at whatever value the respondent happens to land near when they touch the slider.

Long answer choices. Response options that run to 40-50 characters can wrap awkwardly on a small screen, making it unclear which radio button corresponds to which label.

Multi-item pages. Pages that display multiple items at once often require scrolling on mobile. Respondents who scroll past items may miss them entirely or respond without reading.

Mobile-Optimized Design Principles

The following design choices improve data quality for mobile completions without degrading desktop quality:

Replace grid questions with sequential single-item questions. Instead of a 10-row grid asking about each issue's importance, present each issue as a separate question on its own screen. This is more cognitively manageable on mobile and also eliminates response patterns (like "straight-lining" across a grid) that degrade data quality even on desktop.

Use labeled radio buttons instead of sliders. A 5-point labeled scale presented as five radio buttons is consistently usable across desktop and mobile. If continuous measurement is important, a labeled 7-point scale preserves most of the resolution benefits of a continuous scale while maintaining mobile usability.

Limit response option character counts. Design response options to display legibly at a maximum of 40 characters including spaces. For longer options, use a two-step format: short option label followed by a parenthetical clarification only displayed if the respondent taps the option.

Test on actual devices. Before fielding, complete the survey on at least three different smartphone models spanning both iOS and Android. What looks perfectly designed in a desktop browser preview may display very differently on actual mobile hardware.
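
A sketch of a pre-fielding lint pass that enforces the guidelines above. The 40-character threshold comes from this section; the question format is a hypothetical internal one.

```python
# Mobile usability lint for a draft questionnaire (hypothetical question format).
MAX_OPTION_CHARS = 40

def lint_question(qid: str, options: list[str], is_grid: bool) -> list[str]:
    """Return warnings for grid formats and over-long response options."""
    warnings = []
    if is_grid:
        warnings.append(f"{qid}: grid format; split into sequential single items")
    for opt in options:
        if len(opt) > MAX_OPTION_CHARS:
            warnings.append(f"{qid}: option over {MAX_OPTION_CHARS} chars: {opt[:30]}...")
    return warnings

print(lint_question("Q11", ["Strongly support", "Somewhat support",
                            "Somewhat oppose", "Strongly oppose"], is_grid=False))
print(lint_question("Q12", ["Support expanding access to affordable healthcare coverage"],
                    is_grid=True))
```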

📊 The Mode-Device Interaction

The combination of survey mode (online vs. phone) and device (desktop vs. mobile vs. tablet) creates a matrix of completion contexts with genuinely different data quality implications. Online surveys completed on desktop by engaged respondents who chose to participate represent a very different data environment from online surveys completed on mobile by respondents who clicked through from a social media ad. Both are "online surveys," but their quality characteristics differ substantially. Methodology statements should disclose the device distribution of completions where that information is available.


Longitudinal Survey Design: Tracking Change Over Time

Many of the most important questions in political science require not cross-sectional measurement (what do people think now?) but longitudinal measurement (how has opinion changed?). Tracking change over time requires deliberate design choices that are easy to overlook when building a first instrument.

Why Longitudinal Design Is Different

Tracking change over time seems simple: ask the same questions at multiple time points and compare the results. But there are several ways this can go wrong that are specific to the longitudinal context.

Instrument change invalidates comparisons. If you change a question's wording, response scale, or placement in the questionnaire between waves, you cannot determine whether observed changes in responses reflect genuine opinion change or question change. This seems obvious, but it happens constantly in practice as analysts try to "improve" questions between waves.

Reference period confusion. Questions that ask about "the past year" or "recently" create ambiguity in tracking designs. If you field the same question every six months and always ask about "the past year," respondents in later waves are partially referring to periods covered by earlier waves, creating overlap and temporal confusion.

Panel fatigue in panel designs. Longitudinal surveys can follow the same individuals over time (panel design) or use fresh random samples from the same population (repeated cross-section design). Panel designs are statistically powerful because they control for individual-level heterogeneity — but respondents who complete many surveys over time become increasingly atypical. They learn what questions are likely to be asked, become more politically aware, and develop more stable opinions. This "panel conditioning" can make long-running panels increasingly unrepresentative of the broader population.

Cohort replacement. In repeated cross-section designs, the population being sampled changes over time as older cohorts exit (through death or moving out of the target area) and younger cohorts enter (through turning 18 or moving in). True opinion change is conflated with the compositional change in who is in the population.

Questionnaire Consistency Requirements

The fundamental rule of longitudinal survey design is: every question that will be used in a comparison across time must be identically worded, scaled, and positioned across all waves.

This means:

  • Tracking questions must be reviewed against previous waves before each new wave is finalized.
  • Changes to non-tracking questions must be checked to ensure they don't alter the priming context for tracking questions.
  • Response scales must be reproduced identically, including the exact label wording for each point.
  • Question position within the questionnaire must remain stable, because position effects on tracking questions will be confounded with time effects if position changes.

Meridian maintains a "frozen section" of the Garza tracking poll questionnaire — the set of questions that have been asked identically in every wave since the poll began. New questions can be added in a "fresh questions" section at the end, but the frozen section is treated as immutable until the tracking is no longer needed.
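
A sketch of how a frozen section can be enforced mechanically before each wave fields; the data structures are hypothetical.

```python
# Wave-over-wave consistency check for tracking questions (hypothetical format).
FROZEN = {
    "Q1": {"text": "Generally speaking, do you think things in this country are "
                   "headed in the right direction or wrong direction?",
           "scale": ["Right direction", "Wrong direction", "Don't know"],
           "position": 1},
}

def check_frozen_section(new_wave: dict) -> list[str]:
    """Return violations; an empty list means the wave is safe to field."""
    violations = []
    for qid, frozen_q in FROZEN.items():
        q = new_wave.get(qid)
        if q is None:
            violations.append(f"{qid}: missing from new wave")
        elif q != frozen_q:
            violations.append(f"{qid}: wording, scale, or position changed")
    return violations

print(check_frozen_section({"Q1": FROZEN["Q1"]}))  # [] -> safe to field
```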

The Repeated Cross-Section vs. Panel Trade-off

For most political campaign tracking purposes, repeated cross-sections — fresh random samples from the same target population, asked the same questions at each time point — are preferable to panel designs, for several reasons:

  • Fresh samples are not subject to panel conditioning
  • Repeated cross-sections are administratively simpler (no respondent tracking or re-contact)
  • Attrition is not a concern (panel surveys lose respondents over time, and the respondents who stay are typically more engaged and unrepresentative)

The limitation of repeated cross-sections is that they cannot separate true individual-level opinion change from cohort replacement or other compositional changes. If Garza's support among suburban women increased from 48% to 55% between two waves, a repeated cross-section cannot determine whether:

  • Individual suburban women changed their minds
  • The composition of "suburban women" in the likely voter pool shifted (e.g., because suburban women with college degrees — more pro-Garza — became more likely to vote)
  • Random sampling variation produced the observed difference

Panel designs, which follow the same individuals, can address the first possibility — but they introduce their own biases and are substantially more expensive to implement.

For the Garza-Whitfield campaign, Nadia Osei's team uses repeated cross-sections for the standard tracking poll and supplements with a small online panel (500 respondents followed for six weeks) during the final month of the campaign. The panel provides individual-level change data; the larger cross-section provides population-representative tracking.


Inside the Meridian Questionnaire: The Garza-Whitfield Survey

Let's watch how these principles played out in practice. Three hours into the Tuesday morning meeting, here is what the Meridian team had built.

The Design Challenge

The Garza-Whitfield race presents specific challenges. It's a competitive Senate race in a purple Sun Belt state with significant demographic diversity: a large Latino community in the southern metro areas, a growing suburban population, a conservative rural base, and a mid-sized college town. The team needs to:

  1. Measure the horse-race accurately (candidate preference)
  2. Understand why voters are where they are (policy priorities, candidate attributes)
  3. Identify persuadable voters (those uncertain or weakly committed)
  4. Provide actionable intelligence for the campaign analytics team

"We can't ask everything," Vivian said. "What's the most important question for this client right now?"

"Whether Garza can break through in the suburbs," said Carlos, who had spent the morning reading the demographic tables from the last cycle.

"That's a story, not a question," Trish said. "The question is: what do suburban voters care about, and does Garza's message reach them? Those are three questions."

Draft Structure

The team settled on a 22-question instrument (approximately 12 minutes by phone):

Section 1: Political Environment (Q1-Q3)
Q1: Right/wrong track (national)
Q2: Right/wrong track (state)
Q3: Governor job approval (warm-up, establishes partisan frame)

Section 2: Senate Race Head-to-Head (Q4-Q5)
Q4: Candidate preference, head-to-head (Garza vs. Whitfield)
Q4a: Lean probe for undecideds
Q5: Strength of preference

Section 3: Candidate Familiarity and Favorability (Q6-Q9)
Q6: Heard of Garza? → Q7: Garza favorable/unfavorable
Q8: Heard of Whitfield? → Q9: Whitfield favorable/unfavorable

Section 4: Issue Priority and Policy (Q10-Q14)
Q10: Most important issue (open-ended, coded)
Q11-Q14: Specific issue priorities and positions

Section 5: Candidate Attributes (Q15-Q17)
Q15-Q17: Candidate attribute ratings (trustworthy, shares your values, cares about people like you)

Section 6: Demographics (Q18-Q22)
Q18: Party identification
Q19: Ideology (5-point liberal-conservative scale)
Q20: Age
Q21: Education
Q22: Race/ethnicity

The Debate Over Q4: Question Wording for the Horse Race

The horse-race question seems simple, but it generated fifteen minutes of argument.

Carlos's first draft: "If you were voting today in the U.S. Senate race, who would you vote for?"

Trish shook her head. "You need to name the candidates. Unaided recall favors incumbents and more prominent candidates. We need to cue both names equally."

Revised version: "If the election for U.S. Senate were held today, and the candidates were Maria Garza, the Democrat, and Tom Whitfield, the Republican, for whom would you vote?"

"We need to rotate candidate order," Vivian said. "Half the sample hears Garza first, half hears Whitfield first. Record which version each respondent received."

"And we need a follow-up for undecideds," Trish added. "If they say 'neither' or 'unsure,' we probe: 'Which way are you leaning?'"

Final Q4:

"If the election for U.S. Senate were held today, and the candidates were [ROTATE: Maria Garza, the Democrat, / Tom Whitfield, the Republican] for whom would you vote?"

Response options [rotate with candidate order]: [CANDIDATE A] / [CANDIDATE B] / Undecided/Not sure / Would not vote / Refused

[If undecided/not sure:] "Which way are you leaning — toward [CANDIDATE A] or [CANDIDATE B]?"

The Debate Over Q14: A Sensitive Issue Question

The Garza campaign wanted data on immigration attitudes, a salient issue in the Sun Belt state. This is where the conversation got genuinely difficult.

Carlos drafted: "Do you support or oppose stricter enforcement of immigration laws at the southern border?"

Vivian pulled up the literature. "That wording is going to get us 65% support, because 'stricter enforcement' sounds responsible. We'll be measuring response to rhetoric, not policy preference."

"What's the alternative?" Trish asked. "The specific policies are complicated. We can't explain E-Verify in twelve minutes by phone."

"What does the campaign actually need to know?" Vivian pressed.

"Whether voters prioritize enforcement over a path to citizenship," Carlos said. "Or both."

They settled on a forced-choice between two policy frames:

"Thinking about immigration policy, which of the following comes closer to your own view? ... [A] The most important thing is stronger border security and stricter enforcement of immigration laws. [B] The most important thing is providing a path to legal status for immigrants who are already here. [C] Both are equally important."

"'Both equally' is going to dominate," Trish predicted.

"Yes," Vivian said. "And that's useful data. If 45% say 'both equally,' the campaign knows the either/or framing is a loser."

Trish was right: the final data showed 48% chose "both equally," 32% prioritized enforcement, and 20% prioritized a path to citizenship. The result told Garza's team something actionable: an enforcement-first frame would cost as much as it gained, while a comprehensive approach might hold together a fragile majority.

🔵 Debate: Should Campaign Polls Use More Neutral Wording?

There is a persistent tension in campaign polling between academic-style neutral wording (which maximizes validity) and campaign-relevant wording (which tests how messages actually land). Nadia Osei, Garza's analytics director, pushed Meridian to use the specific language Garza's ads were using, so the poll results would reflect how voters respond to those actual messages. Vivian resisted, arguing that using campaign-specific language would compromise the measurement. The compromise: include some questions with neutral wording (for baseline measurement) and some with campaign-language wording (for message testing). The key is being explicit about which is which when reporting results.
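One way to operationalize that compromise is a split-ballot experiment: randomly assign each respondent to the neutral or the campaign-language version of the same item, record the condition, and report the two distributions separately. The sketch below assumes hypothetical wordings (neither is the Garza campaign's actual language) and an illustrative data layout.

```python
import random

# Hypothetical wordings of the same underlying item.
NEUTRAL = "Do you support or oppose Proposal X?"
CAMPAIGN = "Do you support or oppose Proposal X, which supporters say will keep families safe?"

def assign_wording(respondent_id, rng=random):
    """Randomly assign a respondent to one wording condition and record
    the condition, so the split can be reported explicitly later."""
    condition = rng.choice(["neutral", "campaign"])
    return {
        "respondent_id": respondent_id,
        "condition": condition,
        "text": NEUTRAL if condition == "neutral" else CAMPAIGN,
    }

def wording_gap(responses):
    """Percentage-point difference in support between the two conditions.
    `responses`: list of dicts with 'condition' and boolean 'support'."""
    def pct(cond):
        group = [r["support"] for r in responses if r["condition"] == cond]
        return 100 * sum(group) / len(group)
    return pct("campaign") - pct("neutral")
```

The gap between conditions is itself the message-testing result: it estimates how much work the campaign's rhetoric is doing beyond the underlying policy preference.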

Example: Good vs. Bad Versions of Survey Questions

The following examples show specific improvements the Meridian team made during the questionnaire review:

| Topic | Bad Version | Better Version | Problem Fixed |
| --- | --- | --- | --- |
| Candidate quality | "Don't you think Tom Whitfield is too extreme?" | "Do you think Tom Whitfield's political views are too liberal, too conservative, or about right?" | Leading |
| Economic assessment | "Given the current economic crisis, how do you rate your financial situation?" | "How would you describe your current financial situation?" | Loaded ("crisis" presumes a judgment) |
| Education policy | "Do you support public education and teacher pay?" | Separate question on public education funding; separate question on teacher salaries | Double-barreled |
| Candidate trust | "Is Maria Garza honest?" | "How trustworthy do you think Maria Garza is?" with a 4-point scale | Yes/no for a complex attribute |
| Immigration | "Do you agree that the border should be secured?" | Forced choice between policy frames | Loaded / acquiescence-prone |

Mode Effects: How the Medium Shapes the Message

One dimension of questionnaire design that is easy to overlook: the mode in which the survey is administered — phone, web, mail, in-person — is not a neutral delivery mechanism. Mode affects which respondents participate, how questions are interpreted, and what social pressures shape responses.

Telephone (live interviewer): Highest social desirability pressure, because respondents are on their best behavior with a live human on the line: they overstate civic engagement (voter turnout, charitable giving) and understate stigmatized views. Declining response rates (below 6% in many commercial polls) mean coverage and nonresponse bias are substantial.

Online panel: Lower social desirability pressure. Cheaper and faster. But online panels are convenience samples (more on this in Chapter 8), with substantial demographic skew toward younger, more educated, more internet-comfortable respondents. Opt-in panelists are also "professional survey takers" who may interpret political questions differently than the general public.

Address-Based Sampling (ABS) / mail: Near-complete coverage of the residential population, because virtually all households have a postal address. Slower. Lower response rates to self-administered instruments. But for hard-to-reach populations (rural areas, elderly, non-internet users), ABS provides coverage that phone and online modes miss.

Text-to-web: Growing mode. Respondents receive a text message with a link to an online survey. Fast and cheap, but completion rates are low and the sampling frame (cell phone numbers) has coverage gaps.

The practical implication for analysts: when you compare polls conducted in different modes, you must consider mode effects before attributing differences to genuine opinion change. A shift from 52% to 58% support between a telephone poll and an online poll could reflect genuine change, sampling differences, mode differences, or all three. Methodological transparency about mode is essential for valid interpretation.
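Before reading a cross-mode shift as real, it helps to check whether it even clears sampling noise. A minimal sketch of that arithmetic, assuming simple random sampling and illustrative sample sizes of 800 per poll (which understates the error of weighted or opt-in samples):

```python
from math import sqrt

def two_poll_diff(p1, n1, p2, n2, z=1.96):
    """Difference between two independent poll proportions with an
    approximate 95% confidence interval (normal approximation)."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return diff, (diff - z * se, diff + z * se)

# Hypothetical numbers from the text: 52% by phone, 58% online, n = 800 each.
diff, (lo, hi) = two_poll_diff(0.52, 800, 0.58, 800)
print(f"shift = {diff:+.1%}, 95% CI ({lo:+.1%}, {hi:+.1%})")
# Even when the interval excludes zero, the shift may still be a mode
# effect rather than genuine opinion change; sampling error is only one
# of the three explanations above.
```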


Putting It Together: The Questionnaire Design Checklist

When reviewing any survey instrument — your own or someone else's — work through this checklist:

Question wording:
- [ ] Are there leading words or phrases that signal a preferred answer?
- [ ] Does the question contain loaded assumptions not all respondents share?
- [ ] Does each question ask about exactly one thing (not double-barreled)?
- [ ] Are key terms defined or specific enough to be unambiguous?
- [ ] Is the question phrased in accessible language for the target population?

Response scales:
- [ ] Does the scale have an appropriate number of points for the construct?
- [ ] Is there a midpoint option for respondents with no opinion?
- [ ] Is "don't know" or "refused" available where appropriate?
- [ ] Are response options exhaustive and mutually exclusive?
- [ ] Are response options presented in a randomized order where appropriate?

Order and flow:
- [ ] Does question order prime later responses in unintended ways?
- [ ] Are related questions grouped appropriately?
- [ ] Is skip/branch logic complete and correctly specified?
- [ ] Are sensitive or personal questions near the end?
- [ ] Is the instrument an appropriate length for the mode and population?

Sensitive topics:
- [ ] Have sensitive questions been designed to minimize social desirability bias?
- [ ] Have sensitive question techniques (list experiment, RRT) been considered where bias risk is high? (A list-experiment sketch follows this list.)
- [ ] Is there a "refused" option for highly personal questions?
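The list-experiment item flagged above is analyzed with a simple difference in means: the control group receives a list of J non-sensitive statements, the treatment group receives the same list plus the sensitive item, and respondents report only how many statements apply to them. A minimal sketch with made-up counts:

```python
from statistics import mean

def list_experiment_estimate(control_counts, treatment_counts):
    """Estimate the share endorsing the sensitive item as the difference
    in mean item counts between treatment (J+1 items) and control (J items)."""
    return mean(treatment_counts) - mean(control_counts)

# Hypothetical counts of "how many of these statements apply to you":
control = [2, 1, 3, 2, 2, 1, 2, 3]    # list of 3 non-sensitive items
treatment = [2, 2, 3, 2, 3, 1, 2, 3]  # same list plus the sensitive item
print(f"Estimated endorsement: {list_experiment_estimate(control, treatment):.0%}")
# -> Estimated endorsement: 25%
```

Because no individual ever reveals which items apply, the technique reduces the social desirability pressure that direct questions create, at the cost of statistical efficiency.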

Mobile and mode considerations:
- [ ] Has the instrument been tested on mobile devices?
- [ ] Have grid questions been replaced with sequential single-item questions for mobile?
- [ ] Are response option lengths appropriate for mobile display?

Longitudinal consistency (if applicable):
- [ ] Are tracking questions identically worded to prior waves?
- [ ] Are response scales reproduced identically from prior waves?
- [ ] Is question position within the instrument stable across waves? (A consistency-check sketch follows this checklist.)

Pretesting:
- [ ] Has the instrument been reviewed internally?
- [ ] Have cognitive interviews been conducted?
- [ ] Has a pilot been fielded to check logic, timing, and response distributions?
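Several of these items, especially the longitudinal ones, can be enforced mechanically rather than by eye. A minimal sketch that diffs tracking-question wording and scales across waves; the questionnaire representation here is hypothetical, not a real platform's format:

```python
def check_wave_consistency(prior_wave, current_wave):
    """Flag tracking questions whose wording or response scale changed
    between waves. Each wave maps question id -> {'text': ..., 'scale': [...]}."""
    problems = []
    for qid, prior in prior_wave.items():
        current = current_wave.get(qid)
        if current is None:
            problems.append(f"{qid}: dropped from current wave")
        elif current["text"] != prior["text"]:
            problems.append(f"{qid}: wording changed")
        elif current["scale"] != prior["scale"]:
            problems.append(f"{qid}: response scale changed")
    return problems

wave1 = {"Q3": {"text": "Do you approve or disapprove of ...?",
                "scale": ["Approve", "Disapprove", "Not sure"]}}
wave2 = {"Q3": {"text": "Do you approve or disapprove of ...?",
                "scale": ["Strongly approve", "Approve", "Disapprove", "Not sure"]}}
print(check_wave_consistency(wave1, wave2))  # ['Q3: response scale changed']
```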


Summary

Survey design is where the philosophical complexities of public opinion measurement become practical craft decisions. Every word, every response option, every sequencing choice shapes the distribution of responses you receive — and therefore shapes the reality you report about public opinion.

The major threats to questionnaire validity are well-documented: leading and loaded questions, double-barreled questions, acquiescence bias, vague terminology, order effects, social desirability bias, and mode effects. The tools for addressing these threats are equally well-established: neutral wording, appropriate scale design, randomization of order, branching logic, sensitive question techniques, and rigorous pretesting.

The gallery of real failure cases in this chapter illustrates that these threats are not theoretical — they appear in actual polls conducted by real organizations under time and budget pressure. Every failure case has a corrected version, and the correction is usually simple once the problem is identified. The challenge is developing the diagnostic habit: reading every question critically, asking "what does this assume?", "what does this leave unclear?", and "how might this nudge the respondent toward a particular answer?"

Scale design is not one-size-fits-all. The choice between 4-point, 5-point, and 7-point scales should be matched to the question type, the administration mode, and the analytical purpose. Forcing direction is sometimes appropriate; providing a midpoint is sometimes essential. The key is making the choice deliberately rather than defaulting to whatever scale format seems familiar.

Cognitive interviewing is the most underused tool in the questionnaire designer's toolkit. The discipline of watching a real respondent attempt to interpret your question — and recording the places where their interpretation diverges from your intent — produces more improvement per hour of work than any other pretesting method. It requires intellectual humility: the willingness to acknowledge that the question you spent an hour crafting is still confusing to the person it's meant to reach.

Mobile-first design is no longer optional. With the majority of online survey completions occurring on smartphones, designing for desktop and hoping for the best is a methodological error that generates real data quality problems. Sequential single-item questions, labeled radio buttons, and actual device testing are the minimum standard.

For longitudinal designs, question consistency is the non-negotiable requirement. Change the wording and you cannot claim to be tracking the same thing. This constraint feels limiting, but it is the price of valid tracking — and the data products it enables (time series of political opinion over months and years) are among the most valuable outputs of systematic survey research.

What distinguishes good survey designers from merely adequate ones is not just knowledge of these tools — it's the habit of questioning every assumption, the willingness to cut questions that don't earn their place, and the intellectual honesty to acknowledge what a given question design can and cannot measure.

Trish McGovern, walking out of the conference room with a revised draft under her arm, put it simply: "A good questionnaire doesn't look natural. It looks obvious. When you're done designing it, every choice should feel like it couldn't have been any other way. If you're still explaining why the question is worded the way it is, the question isn't done."

Three hours of argument, fifteen revised questions, and one cut question later, the Garza-Whitfield questionnaire was ready to field. Not perfect — no questionnaire ever is. But defensible, transparent, and designed to measure something real about a complicated political moment.

That's the best any of us can do.