Chapter 30: Field Experiments in Politics

On a crisp November morning in 1998, Alan Gerber and Donald Green sat in a Connecticut church parking lot, watching a team of canvassers prepare to knock on doors in New Haven. The researchers had spent the previous months constructing what no political scientist had previously attempted at this scale: a randomized experiment on voter turnout. Registered voters had been randomly assigned to receive different types of campaign contact — personal canvassing, phone calls, direct mail, or nothing. After Election Day, Gerber and Green would compare turnout rates across the groups to measure, for the first time with genuine causal rigor, what actually gets people to vote.

The results, published in the American Political Science Review in 2000, were startling. Personal canvassing — door-to-door outreach by trained volunteers — produced a turnout increase of roughly seven to nine percentage points among contacted households. Direct mail produced a smaller but measurable effect. Telephone calls, in contrast, produced effects close to zero. These findings challenged assumptions that had governed campaign practice for decades and launched what political scientists now call the "experimental revolution" in political behavior research.

More than two decades later, the field experiment is the central methodology of political behavior research — and one of the most important tools in the campaign analytics practitioner's toolkit. This chapter examines how field experiments work, what the political experiment literature has found, how researchers at organizations like the Meridian Research Group design and execute them in real electoral environments, and why causal evidence from experiments is both irreplaceable and genuinely difficult to obtain.

30.1 The Causal Inference Problem

Before understanding why field experiments matter, you need to understand the problem they solve: causal inference.

Political scientists care about causation. Not just "what happened" but "what caused what to happen." Did this canvassing program increase turnout, or did the precincts that got canvassed have higher turnout anyway? Does this messaging strategy move voters, or do the voters who respond favorably to the message have different underlying characteristics than those who don't? Did this voter contact program win the election, or would the candidate have won without it?

These causal questions are genuinely difficult to answer from observational data — data collected from the world as it actually happens, without experimental manipulation. The fundamental problem is confounding: the variables you're interested in (contact, messaging, mobilization effort) are systematically correlated with other variables (the underlying characteristics of the precincts or voters targeted, the campaign's strategic choices about where to invest) in ways that make it impossible to isolate the causal effect.

The Fundamental Problem of Causal Inference

Consider the simplest question in political behavior research: does voter contact increase turnout? You look at the data and observe that voters who were contacted by a campaign turned out at 68%, while voters who were not contacted turned out at 52%. Does this mean contact caused a 16-point increase in turnout?

Not necessarily. Campaigns don't contact voters randomly. They contact voters who are already more likely to turn out — high-propensity supporters in persuasion and GOTV universes. If you compare contacted voters to all non-contacted voters, you're comparing two groups that differ in dozens of systematic ways before any contact happened. The contact-turnout correlation you observe is a mixture of the causal effect of contact and the selection effect of who gets contacted.

This is the fundamental problem of causal inference: we cannot observe what would have happened to the contacted voters if they hadn't been contacted. We observe the outcome in the world where they were contacted, not the counterfactual world where they weren't. All causal inference methods are, in essence, attempts to construct credible counterfactuals — to estimate what would have happened in the world without the treatment.

💡 Intuition: Think of causal inference as asking a question that can never be directly answered: "What would this voter's turnout have been if we hadn't called?" No amount of observational data can answer that question for a specific voter. The best we can do is answer it for groups of similar voters — by comparing groups that differ systematically only in whether they received contact, holding everything else constant. The randomized experiment is the most powerful method for constructing that comparison.
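A small simulation makes the confounding problem concrete. This is a hypothetical sketch (all numbers are invented): voters with higher baseline turnout propensity are more likely to be contacted, so the naive contacted-versus-uncontacted comparison badly overstates the true effect of contact.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Baseline turnout propensity varies across voters; campaigns contact
# high-propensity voters more often (selection, not randomization).
propensity = rng.uniform(0.2, 0.8, size=n)
contacted = rng.random(n) < propensity

TRUE_EFFECT = 0.03  # assume contact truly adds 3 points of turnout probability
voted = rng.random(n) < np.clip(propensity + TRUE_EFFECT * contacted, 0, 1)

naive = voted[contacted].mean() - voted[~contacted].mean()
print(f"true causal effect:             {TRUE_EFFECT:.3f}")
print(f"naive observational difference: {naive:.3f}")  # roughly 5x too large here
```

In this simulated world the naive comparison shows a gap of about 15 points even though contact only moves turnout by 3, because most of the observed difference is selection.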

30.2 The Randomized Controlled Trial: Logic and Design

The randomized controlled trial (RCT) solves the causal inference problem through random assignment. If voters are randomly assigned to receive contact (the treatment group) or not to receive contact (the control group), then — by the logic of randomization — the two groups are, in expectation, identical in every way before the experiment begins. Any difference in outcome between the groups at the end can therefore be attributed to the treatment, not to pre-existing differences between groups.

This is the fundamental insight of experimental design. Random assignment creates the counterfactual that observational data cannot: the control group is what the treatment group would have looked like if they hadn't been treated. The difference in outcomes between groups is an unbiased estimate of the average treatment effect.

The Basic RCT Architecture

A basic voter contact experiment has several essential components:

Randomization unit: The unit at which randomization occurs — individual voters, households, or precincts. The choice matters enormously for the design.

Treatment: The specific intervention being tested — a personal canvass visit, a direct mail piece, a phone call, a digital advertisement, or some combination.

Control: The comparison condition — typically no contact, though some experiments compare different types of contact.

Outcome: The measured result — most commonly voter turnout, but also vote choice, candidate favorability, knowledge, or participation in other civic activities.

Intent-to-treat analysis: In most political experiments, not everyone in the treatment group actually receives the treatment. A canvasser is assigned to contact a voter, but the voter may not answer the door. The experiment compares the treatment-assigned group to the control group regardless of whether contact actually occurred. This is the intent-to-treat (ITT) analysis.

Compliance adjustment: The local average treatment effect (LATE) is an estimate of the effect on the voters who were actually contacted — the compliers — derived by adjusting the ITT for the contact rate. If 45% of treatment-group voters were actually reached, and the ITT effect is 3.5 percentage points, the LATE (effect on those actually contacted) is approximately 3.5 / 0.45 = 7.8 percentage points.
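The ITT-to-LATE adjustment is simple enough to express in a few lines of code. A minimal sketch, using the hypothetical numbers from the paragraph above:

```python
def itt_and_late(turnout_treat, turnout_control, contact_rate):
    """Return the intent-to-treat effect and the implied LATE.

    turnout_treat / turnout_control: turnout among those ASSIGNED to each
    condition; contact_rate: share of the treatment group actually reached.
    The division is valid under the standard noncompliance assumptions
    (no one in control is contacted, and assignment affects turnout
    only through contact).
    """
    itt = turnout_treat - turnout_control
    return itt, itt / contact_rate

itt, late = itt_and_late(turnout_treat=0.555, turnout_control=0.520, contact_rate=0.45)
print(f"ITT  = {itt:.1%}")   # 3.5 points
print(f"LATE = {late:.1%}")  # about 7.8 points
```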

The Logic of Statistical Inference in Field Experiments

Random assignment ensures that, in expectation, the treatment and control groups are identical before the experiment. But random assignment doesn't guarantee that any particular randomization will produce perfectly balanced groups. By chance, one group might have slightly more frequent past voters. Statistical inference in experiments quantifies this uncertainty: the p-value and confidence interval tell you how often you would observe an effect this large by chance if there were actually no treatment effect.

📊 Real-World Application: A typical voter contact experiment might have treatment and control groups of 10,000 voters each. If the control group turns out at 52% (5,200 voters) and the treatment group at 55% (5,500 voters), you observe a 3-percentage-point difference — but is it larger than what chance alone would produce? Statistical testing quantifies how often a difference this large would arise by chance if there were no true treatment effect. In a sample of 10,000 per group, a 3-percentage-point difference would typically be highly statistically significant, because the standard error is small relative to the observed effect.
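For readers who want to see the arithmetic, a two-proportion z-test reproduces this conclusion using only the Python standard library (same hypothetical numbers as above):

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z(p1, n1, p2, n2):
    """Two-sample z-test for a difference in proportions (unpooled SE)."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return se, z, p_value

se, z, p = two_prop_z(p1=0.52, n1=10_000, p2=0.55, n2=10_000)
print(f"SE = {se:.4f}, z = {z:.2f}, p = {p:.6f}")  # z ≈ 4.3, p < 0.0001
```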

30.3 Political Science's Experimental Revolution

The Gerber-Green 2000 study was not the first field experiment in political science — there were scattered earlier examples, and the methodological logic was well understood in other social sciences. But it was the catalyst for what has become a sustained research program that has transformed our understanding of political behavior.

Gerber and Green: The 1998 New Haven Study

The original study is worth understanding in some detail, both because of its findings and because of its methodological design. Gerber and Green worked with the New Haven, Connecticut, registrar of voters to identify participants for a nonpartisan get-out-the-vote experiment before the November 1998 general election.

Eligible households (those with at least one registered voter who had voted in at most one of the previous three elections — the target population for GOTV interventions) were randomly assigned to one of seven conditions: personal canvassing only, direct mail (one, two, or three pieces), phone call only, canvassing plus mail, or control (no contact). After the election, the researchers matched household records to the voter file to measure whether registered voters in each household had voted.

The key findings were:

- Personal canvassing increased turnout by approximately 8.7 percentage points per household contacted (LATE estimate).
- Direct mail produced a smaller but statistically significant effect of approximately 0.7 percentage points per piece of mail sent.
- Phone contact produced an effect statistically indistinguishable from zero.

The canvassing effect was striking — and it was the type of contact, not just contact itself, that mattered. The large difference between canvassing and phone contact suggested that the quality and personal nature of the interaction was driving the effect, not mere information transmission.

Subsequent Experimental Work: Major Findings

The Gerber-Green study opened the floodgates. By the early 2020s, hundreds of field experiments on political participation had been conducted, and a substantial literature of meta-analyses and replication studies had emerged. Several major findings have proven robust across contexts:

Personal canvassing by strangers: The original large effect has been somewhat revised downward by subsequent studies, particularly when canvassers are strangers (not neighbors). A meta-analysis by Green, McGrath, and Aronow (2013) found an average canvassing effect of approximately 2.5 percentage points per contact — substantial but more modest than the 1998 estimate. The variation across studies is large, suggesting context matters significantly.

Neighbors as canvassers: Several experiments have found that being canvassed by someone you actually know — a neighbor — produces larger effects than being canvassed by a stranger. The 2008 Obama campaign's "neighbor-to-neighbor" program was designed around this finding.

The social pressure mailer: One of the most influential — and controversial — experimental findings is the large effect of "social pressure" direct mail. Gerber, Green, and Larimer (2008) found that a mailer showing whether the recipient and their neighbors had voted in past elections increased turnout by approximately 8 percentage points — larger than the effect of personal canvassing. The mailer worked by invoking social norms around civic participation and accountability. It was enormously effective and enormously controversial, because many recipients found it invasive.

Telephone banking: Multiple studies find very small effects of phone banking on turnout — typically 0.5 to 1.5 percentage points per contact. The decline from the phone effects of an earlier era is attributed to the rise of caller ID, cell phone-only households, and declining answer rates.

Text banking: Recent experiments on SMS-based voter contact find modest positive effects, averaging around 1 to 2 percentage points per contact, with variation by population and message type. Effects tend to be larger among younger voters and first-time voters.

Direct mail: Effects of approximately 0.3 to 0.5 percentage points per piece, relatively consistent across contexts. Mail is cheap enough per contact that even small effects can make it cost-effective.

🔗 Connection: These findings — that personal contact is most effective, phone contact has minimal effect, and social pressure can be powerful but risky — are directly incorporated into the targeting and resource allocation decisions of campaigns like Nadia Osei's. The experimental literature is the empirical foundation on which the entire voter contact industry is built.

30.3a Reading an Experiment: The Critical Consumer's Checklist

Field experiments are the best tool we have for causal inference in political science, but individual experiments vary enormously in their design quality, execution rigor, and the extent to which their findings can be generalized. Learning to read an experimental study critically — to distinguish between findings that are credible and findings that appear rigorous but contain serious limitations — is an essential skill for campaign practitioners and political scientists alike.

Was randomization actually random? This sounds like a trivial question but isn't. Some studies use "quasi-random" assignment — canvassers are deployed to precincts based on logistical convenience with the claim that this approximates randomness. It doesn't. True randomization uses a random number generator or equivalent procedure to assign units, with the assignment recorded before treatment is delivered. Studies that describe their randomization vaguely ("precincts were randomly selected") deserve scrutiny about whether the selection was truly random or whether some systematic factor influenced which units ended up in which condition.

What was the contact rate, and is the ITT or LATE reported? A study that reports only its LATE estimate without clearly stating the contact rate is potentially misleading. A 6-point LATE derived from a 40% contact rate implies a 2.4-point ITT — a very different number for planning purposes than the 6-point figure alone implies. Both estimates should be reported, with the contact rate clearly stated.

How large was the control group, and was it truly untreated? Control group contamination — from the campaign's other outreach activities, from spillover between treatment and control areas, from canvassers who incidentally contacted control-group voters — is the most common threat to validity in political field experiments. A study that doesn't address this question credibly should be read with skepticism.

What is the target population, and does it match your context? A canvassing experiment conducted among high-propensity voters in a presidential election year may produce very different effects than the same program applied to low-propensity voters in an off-year election. Demographic composition, baseline turnout, electoral salience, and the competitive environment all affect how experiments generalize. The most careful studies describe their target population precisely; the most careful readers check whether that population resembles the one they care about.

Is the treatment implemented the way the researchers describe? Implementation fidelity problems — canvassers who deviated from script, mail that wasn't delivered on time, digital ads that reached different audiences than intended — can substantially change what a study is actually testing. Studies that include field monitoring data and report on implementation fidelity are more trustworthy than those that simply assume the treatment was delivered as designed.

Are the results pre-registered? Pre-registration — specifying the hypothesis, design, and primary analysis plan before looking at the data — protects against post-hoc specification searching, in which researchers test multiple analyses and report the significant ones. Experiments with pre-registered analysis plans provide substantially stronger evidence than those where the analysis approach was determined after data collection. The American Economic Association's registry (socialscienceregistry.org) and EGAP (egap.org) are the primary pre-registration platforms for political experiments.

📊 Real-World Application: The replication crisis in social science — the finding that a substantial fraction of high-profile psychology and economics experiments do not reproduce when rerun — has led to increased scrutiny of field experiments in political science as well. Several high-profile canvassing experiments have produced smaller effects in replication than the original study, and at least one well-known effect (certain types of "deep canvassing" on attitude change) has been substantially contested by replication attempts. Critical reading of experimental evidence — including the Gerber-Green canvassing literature — is therefore not skepticism of the entire enterprise but rather an appropriate scientific posture toward any specific finding.

30.4 Types of Field Experiments in Political Science

Field experiments in political contexts vary considerably in their design, setting, and outcome measures. Understanding the major types clarifies what each is designed to learn.

GOTV Experiments

Get-out-the-vote experiments test whether various forms of voter contact increase turnout among targeted populations. They are the most common type of political field experiment and the most directly applicable to campaign operations. Most of what we know about the effectiveness of canvassing, phone banking, mail, and other contact modes comes from GOTV experiments.

GOTV experiments are relatively straightforward to design because the outcome — voter turnout — is publicly observable in the voter file after Election Day. There's no need for follow-up surveys; you simply match experiment participants to the voter file and compare turnout rates.

Persuasion Experiments

Persuasion experiments test whether contact with political messages changes voters' candidate preferences or policy opinions. They are substantially harder to execute than GOTV experiments because vote choice is secret — you cannot observe how someone voted from the voter file, only whether they voted.

Persuasion experiments typically rely on post-contact surveys to measure opinion change. This introduces complications: survey response rates are lower than voter file match rates, survey responses may be affected by social desirability bias, and opinion measured in surveys may not translate into actual vote choice. The most rigorous persuasion experiments combine survey measurement with vote choice inference techniques (inferences about vote choice from precinct returns) and sometimes with field-based survey experiments.

Voter Registration Experiments

Voter registration experiments test interventions designed to increase registration rates — outreach programs, automatic registration, same-day registration, and similar policy interventions. These are particularly valuable for understanding how the administrative barriers to voting affect participation, a question with direct policy implications.

Registration experiments are straightforward in their outcome measurement (the voter file shows who registered) but require longer time horizons, as the electoral effects of registration can take multiple cycles to fully manifest.

Informational Experiments

Informational experiments test whether providing voters with specific types of political information — candidate policy positions, incumbents' voting records, comparisons to challengers, information about electoral competitiveness — affects their political behavior. These experiments sit at the intersection of political communication research and voting behavior and are particularly relevant for understanding how information environments shape political participation.

Institutional and Policy Experiments

Some political experiments test institutional or policy changes — the effect of changing polling place locations on turnout, the effect of early voting on participation, or the effect of voter ID requirements on registration and turnout. These experiments often require collaboration with election administrators and raise ethical questions about experimenting with the institutions of democracy itself.

⚠️ Common Pitfall: There is a common misunderstanding that field experiments are always about comparing "contact" versus "no contact." In fact, the most policy-relevant experiments are often factorial — comparing multiple treatments simultaneously. A well-designed factorial experiment can test whether canvassing is more effective than mail, whether one message is more effective than another, and whether the combination of canvassing and mail outperforms either alone — all within a single study, with appropriate statistical power for each comparison.

30.5 Experimental Design: Beyond the Basics

Real political field experiments confront design challenges that textbook treatments of randomized experiments often understate. Understanding these challenges is essential for evaluating experimental evidence critically and for designing experiments that will produce actionable findings.

Blocked Randomization

Simple random assignment — the theoretical ideal — produces, in expectation, balanced treatment and control groups. But in any specific experiment with finite sample size, pure randomization may produce accidental imbalance on important background characteristics. Blocked randomization addresses this by stratifying the sample on key variables before randomizing within strata.

For example, in a GOTV experiment, you might block on past turnout history, ensuring that high-turnout voters are equally distributed between treatment and control, and that low-turnout voters are similarly balanced. This increases the precision of the treatment effect estimate (by reducing variance due to baseline differences) and ensures that the experiment's sample looks like the target population on the most important confounding variables.

Trish McGovern, Meridian Research Group's field director, thinks about blocked randomization in explicitly operational terms: "I tell my team that blocking is like good sorting before you deal the cards. You make sure the decks are balanced before you start, so the deal is fair."
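In code, Trish's card-sorting metaphor amounts to shuffling within each block before dealing. A minimal sketch (the voter records and the blocking variable are hypothetical):

```python
import random

def blocked_assignment(units, block_key, treat_share=0.5, seed=42):
    """Randomize to treatment within blocks, so treatment and control are
    balanced by construction on the blocking variables."""
    rng = random.Random(seed)
    blocks = {}
    for unit in units:
        blocks.setdefault(block_key(unit), []).append(unit)
    for members in blocks.values():
        rng.shuffle(members)
        cutoff = round(len(members) * treat_share)
        for i, unit in enumerate(members):
            unit["treatment"] = i < cutoff
    return units

# Block on past turnout so each category splits evenly across conditions
voters = [{"id": i, "past_votes": i % 3} for i in range(1000)]
blocked_assignment(voters, block_key=lambda v: v["past_votes"])
```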

Cluster Randomization

In political experiments, you often cannot randomize at the individual level because of spillover — the risk that treatment in one unit contaminates the comparison in an adjacent unit. If households in the same apartment building are randomized to different treatment conditions, a resident who received a canvassing visit might tell their neighbor about it, contaminating the control group.

Cluster randomization addresses spillover by randomizing at a higher level of aggregation than the individual — assigning entire precincts, blocks, or geographic areas to treatment or control, then analyzing outcomes at the individual level within those areas. This eliminates the spillover problem but at a cost: cluster-randomized experiments require larger sample sizes than individual-randomized experiments to achieve the same statistical power, because units within the same cluster are correlated.

The cluster size choice involves a tradeoff. Smaller clusters (blocks) reduce variance but may not fully contain spillover. Larger clusters (precincts, zip codes) more reliably contain spillover but require many more clusters to achieve statistical power.
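The cost of clustering can be quantified with the standard design effect, 1 + (m - 1) * ICC, where m is the cluster size and ICC is the intracluster correlation. A quick illustration (the inputs are invented but realistic in scale):

```python
def effective_sample_size(n_individuals, cluster_size, icc):
    """Effective N under cluster randomization: outcomes correlated within
    clusters carry less information than independent observations."""
    design_effect = 1 + (cluster_size - 1) * icc
    return n_individuals / design_effect

# 20,000 voters in precinct-sized clusters of 400, with a modest ICC of 0.05
print(round(effective_sample_size(20_000, cluster_size=400, icc=0.05)))  # ~955
```

Twenty thousand voters randomized in precinct-sized clusters can carry the statistical information of fewer than a thousand independently randomized voters, which is why cluster designs need many clusters, not merely many voters.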

📊 Real-World Application: The famous "social pressure" mailer experiment (Gerber, Green, Larimer 2008) encountered a spillover challenge: a mailer that shows voters how often their neighbors voted creates social influence not just for the direct recipient but potentially for other household members and close social contacts. The experimenters addressed this by randomizing at the household level, which contained within-household effects but couldn't prevent across-household social transmission. Subsequent research has examined these social network spillovers as a phenomenon worth studying in their own right.

Statistical Power in Political Experiments

Statistical power is the probability that an experiment will detect a true effect if one exists. It depends on three factors: the sample size, the true effect size, and the variance of the outcome measure. Underpowered experiments are expensive — you run the study, find no significant effect, and cannot distinguish between "the treatment doesn't work" and "the experiment wasn't large enough to detect a real effect."

Political field experiments require large samples, for a fundamental reason: turnout is a binary outcome with high variance, and the effects of voter contact are typically small. If the true effect of canvassing is 2.5 percentage points and baseline turnout is 50%, you need roughly 6,300 voters per treatment-control arm to detect this effect with 80% power at a 5% significance level. If you want to detect a smaller effect — say, 1 percentage point — you need roughly 39,000 per arm.
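The standard sample size formula for a difference in proportions reproduces these figures. A sketch using only the standard library (two-sided test, equal-sized arms):

```python
from statistics import NormalDist

def n_per_arm(p_control, effect, alpha=0.05, power=0.80):
    """Voters needed per arm to detect `effect` (in proportion units)
    on top of baseline rate `p_control`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_treat = p_control + effect
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return (z_alpha + z_power) ** 2 * variance / effect ** 2

print(round(n_per_arm(0.50, 0.025)))  # ~6,300 per arm for a 2.5-point effect
print(round(n_per_arm(0.50, 0.010)))  # ~39,000 per arm for a 1-point effect
```

The same function, run under the three scenarios in the 🧪 Try This exercise later in the chapter, shows how sharply the requirement moves with the assumed effect size.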

This power requirement is one of the reasons that political field experiments are typically run by campaigns, parties, or well-funded research organizations rather than individual academic researchers working on small grants. The sample sizes required to detect realistic effect sizes are large enough that the operational costs of running the experiment — hiring canvassers, printing mail, conducting the randomization — are substantial.

💡 Intuition: The statistical power problem in political experiments is not a technical failure — it's a reflection of the fact that political behavior is genuinely hard to move. If canvassing raised turnout by 20 percentage points, you'd need tiny samples to detect it. The effect is actually around 2–3 percentage points, so you need large samples. The difficulty of detection is proportional to the smallness of the effect — which itself tells you something important about how hard it is to change political behavior through contact alone.

30.6 Compliance, Spillover, and ITT vs. LATE

Three technical concepts are essential for interpreting political field experiment results: compliance, spillover, and the distinction between ITT and LATE estimates.

Compliance and the Intent-to-Treat Framework

In a canvassing experiment, a voter assigned to the treatment group is supposed to receive a canvass visit. But the canvasser may not find the voter home; the voter may refuse to engage; the canvasser may skip the door. Contact rates in canvassing experiments are typically between 25% and 60%, depending on the population and canvasser skill.

The experiment randomizes assignment to treatment, not receipt of treatment. The intent-to-treat analysis compares the treatment-assigned group to the control group, regardless of whether contact occurred. This is the appropriate comparison for evaluating the effect of the targeting decision (deciding to send canvassers to these voters) as opposed to the effect of the contact itself.

The local average treatment effect (LATE) estimates the effect of actual contact, among the voters who would have been contacted if assigned to treatment (the "compliers"). The LATE is derived by dividing the ITT by the contact rate: LATE = ITT / compliance rate.

For a campaign manager, the ITT is the operationally relevant number: "If I assign canvassers to contact this universe, what turnout effect should I expect?" For a researcher trying to understand the mechanism of voter contact, the LATE is more relevant: "What does being actually contacted do to a voter's probability of turning out?"

Spillover and Its Implications

Spillover (also called "contamination" or "SUTVA violation") occurs when a voter in the control group is affected by the treatment through some mechanism other than direct assignment. Common spillover pathways in political experiments:

Household spillover: A mailer targeted at one registered voter in a household influences other household members who see it.

Social network spillover: A voter who receives a canvass visit tells their neighbor, who wasn't in the treatment group, about the campaign.

Geographic spillover: Campaign activity in a treatment precinct increases the visibility of the campaign in adjacent control precincts.

When spillover is present, the control group's outcomes are partially affected by the treatment, which means the ITT underestimates the true treatment effect (because the counterfactual "no treatment" comparison is partially contaminated). Detecting and accounting for spillover requires either geographic buffer zones between treatment and control areas, or explicit measurement of spillover pathways.

✅ Best Practice: Before finalizing the randomization for a political experiment, map out the plausible spillover pathways given the specific treatment and population. For canvassing experiments in residential neighborhoods, household-level randomization with no geographic buffer is often insufficient; precinct-level randomization may be necessary. For mail experiments, household-level randomization is typically adequate because mail goes to specific addresses.

30.7 The Meridian Research Group's Canvassing Experiment

Dr. Vivian Park founded Meridian Research Group eleven years ago after a career split between academic political science and consulting for advocacy organizations. Meridian's distinctive niche is academic-quality field experiments run in partnership with campaigns and outside groups — bringing the rigor of academic experimental design to the messiness of real electoral environments.

Carlos Mendez, Meridian's junior analyst, joined eighteen months ago fresh from a master's program in quantitative social science at a major research university. He's bright, technically skilled, and still learning the gap between experimental textbooks and operational realities. Trish McGovern, the field director, has been with Meridian for eight years; her background is in union organizing, not research, and her expertise is the operational side — logistics, relationships with campaigns, managing canvasser teams in the field.

In October of the Garza-Whitfield race's final weeks, Meridian is conducting a GOTV field experiment in partnership with an academic team and a progressive outside group that is running an independent canvassing operation in support of Garza. The outside group — legally prohibited from coordinating directly with the Garza campaign — is running voter contact in several counties, and they have agreed to randomize their canvassing assignments to enable Meridian's study.

The Experimental Design

The target population for the experiment is registered Democrats and unaffiliated voters in three suburban counties who have voted in at least one of the last three elections but fewer than all three — a "low-to-moderate propensity" population that the outside group's targeting model has identified as mobilization targets. There are approximately 85,000 such voters in the three counties.

Carlos's power analysis found that to detect a 2.5-percentage-point canvassing effect with 80% power, given a baseline turnout of around 55% for this population in comparable elections, the experiment needs approximately 6,500 voters per arm — 6,500 in treatment and 6,500 in control. To be conservative about the contact rate (expected around 40%), he wants enough treatment-group voters assigned to produce 6,500 actual contacts, which means assigning approximately 16,000 to treatment.

The full experimental sample is thus around 22,500 voters: 16,000 in treatment and 6,500 in control, drawn from the 85,000-person target population. The remaining 62,500 voters in the target population will be canvassed by the outside group's regular (non-experimental) program.

Randomization is at the household level, with blocks defined by precinct and past turnout category (voted in one of the last three elections vs. voted in two). This blocked design ensures that the treatment and control groups are well-balanced on the most important predictor of turnout.

Trish's Operational Challenge

The experiment's value depends entirely on the canvassers actually following the randomized assignment — visiting only the voters in the treatment group, not substituting convenient non-experiment voters for hard-to-reach treatment voters, and not treating control-group voters incidentally when they're working a neighborhood.

This operational fidelity is harder to achieve than it sounds. Canvassers are human. If a voter on the control list answers their door, a canvasser's natural instinct is to have the conversation. If a treatment-list voter is on the third floor of an apartment building without a working buzzer, a canvasser might substitute the more-accessible household next door. Any of these deviations from protocol undermines the randomization and introduces bias into the treatment effect estimate.

Trish's solution is threefold: training, monitoring, and protocol design.

Training: Every canvasser received forty-five minutes of additional training specifically on the experimental protocol — why it exists, why following it exactly matters, and specifically what to do when they encounter the common temptations to deviate (apartment buildings, multi-family homes, voters who ask about campaign activity while they're in control precincts).

Monitoring: The outside group's digital canvassing app is configured to flag any door a canvasser approaches that is not in their assigned treatment list. Supervisors review these flags daily. Canvassers with high flag rates receive additional coaching.

Protocol design: Canvassers are given lists that only show their treatment-group assignments — the control group is invisible to them, not just labeled "do not contact." This eliminates the temptation to contact control-group voters who appear on the list.

⚠️ Common Pitfall: "Operational security" in canvassing experiments has a double meaning. In campaign contexts, operational security means not revealing the campaign's strategy to opponents. In experimental contexts, it means keeping canvassers from knowing the experiment's design in ways that would allow them to selectively contact voters or behave differently with treatment versus control cases. These two meanings can sometimes create genuine tension: campaign staff who are told nothing about why some voters are excluded from contact may create workarounds that undermine the experiment.

The Complications

Two weeks into the experiment, Trish flags a problem: one of the three counties in the experiment is experiencing an unusually high contact rate (52%) relative to the other two counties (38% and 35%). Digging into the field notes, Carlos discovers that the supervisor in the high-contact county has been assigning experienced canvassers specifically to the treatment precincts and less-experienced canvassers to general (non-experiment) work — a rational operational decision that, unfortunately, creates a different type of canvasser in the experimental treatment group than in the outside group's regular program.

This introduces confounding: if the treatment effect in that county is larger than in the others, it might partly reflect the quality of the canvassers rather than simply the effect of contact. Carlos and Vivian discuss whether to continue with the county in the analysis, flag it as a potential moderation analysis, or strip it from the main analysis and treat it as a sensitivity check.

They decide on a pre-registered decision rule: proceed with all three counties in the main analysis, but include county as a block in the statistical model and flag the canvasser quality difference in the limitations section. If the effect is significantly larger in the high-contact county, they'll investigate whether the contact rate or the canvasser quality is driving the difference.

🧪 Try This: Power analysis is essential for experimental design, but it requires assumptions about baseline turnout and the true effect size — neither of which is known precisely in advance. Try computing the required sample size for a GOTV experiment under different assumptions: (a) baseline turnout 45%, effect 3 percentage points; (b) baseline turnout 55%, effect 2 percentage points; (c) baseline turnout 40%, effect 1.5 percentage points. How sensitive is the required sample size to these assumptions? What does this tell you about the importance of pilot data before finalizing an experimental design?

30.8 The Ethics of Political Field Experiments

The conduct of field experiments in political contexts raises ethical issues that researchers have wrestled with extensively since the social pressure mailer controversy of 2008.

In most research with human subjects, informed consent is required: participants must be told that they are participating in a study and must agree to participation before the study proceeds. In political field experiments, informed consent is routinely waived, for a practical reason: telling voters they are in a GOTV experiment changes their behavior, undermining the experiment's validity.

The ethical basis for waiving consent in political experiments is that the treatment (voter contact) is an ordinary feature of political life — campaigns contact voters all the time, without asking permission. The only difference in an experiment is that contact is randomized rather than targeted. The ethical harm of being in the treatment group (receiving unexpected political contact) is no greater than the harm of receiving the campaign's normal outreach.

This argument is more convincing for GOTV experiments than for some others. The social pressure mailer experiment — which disclosed individuals' and their neighbors' voting history — involved revealing personal information in a way that many participants found disturbing and that is not "normal" political contact. Several experiments using highly novel treatments have attracted ethical criticism precisely because the treatments crossed into territory that goes beyond ordinary political practice.

IRB Review

Institutional Review Board (IRB) review is the primary mechanism for ethical oversight of research with human subjects in academic contexts. Whether political field experiments require IRB review is contested. Some researchers argue that experiments conducted in the context of normal campaign activity — with the same treatments as a campaign would normally deploy — are "exempt" from IRB review because they don't constitute research on human subjects in the relevant sense. Others argue that any systematic study of human political behavior requires review.

The IRB question has become more fraught as political field experiments have grown in scale and novelty. Experiments that involve novel treatments — unusual pressure tactics, personalized information disclosure, deceptive elements — are more likely to require and receive careful IRB scrutiny. Meridian Research Group's standard practice is to obtain IRB review for all experiments through its academic partners, even when the academic IRB exemption might be available.

The Use of Results

A subtler ethical issue is who benefits from the findings of political field experiments, and whether those benefits are appropriately distributed. Most political field experiments are conducted in the context of partisan campaigns or advocacy organizations — they're designed to answer questions that are useful to one side of a political contest. The findings may improve the effectiveness of Democratic GOTV programs, or Republican persuasion programs, or advocacy organizations working on specific issues.

The academic researchers who conduct and publish these experiments are bound by norms of scientific openness — they publish their findings in peer-reviewed journals, where they are available to everyone. But the immediate beneficiary of the experiment's findings is typically the campaign or organization that provided access and funding.

⚖️ Ethical Analysis: The Meridian experiment in the Garza-Whitfield race is funded by a progressive outside group working in support of Garza. If the experiment finds a large canvassing effect, the finding will help the outside group improve its operations in the current race and future races. It will also be published and contribute to the broader research literature, which may eventually benefit campaigns on both sides. But the first-mover advantage — knowing the findings before they're public — accrues to the funding organization. Is this arrangement ethically appropriate? It is the current norm in political field research. Whether it should remain the norm is an open question.

30.8a The Political Science Experimental Infrastructure

The explosion of political field experiments since 2000 required not just methodological innovation but institutional infrastructure — organizations and norms that make experiments possible at the scale and frequency the literature now assumes.

The Analyst Institute. Founded in 2006, the Analyst Institute is a research consortium that partners with progressive campaigns and advocacy organizations to conduct field experiments. Its model addresses a structural problem: individual campaigns don't have the resources or methodological expertise to run rigorous experiments on their own, and academic researchers don't have the access or the operational relationships to run experiments in real campaign contexts. The Analyst Institute bridges this gap by providing research design and analysis in exchange for data and field access.

Over its existence, the Analyst Institute has accumulated one of the largest archives of political field experiment results in the world — hundreds of experiments covering canvassing, phone banking, mail, digital advertising, and other voter contact modes across diverse electoral contexts. Its meta-analyses of this archive provide the most contextually grounded evidence available on GOTV effectiveness, far more nuanced than the published academic literature because it encompasses many experiments that were never written up for academic publication.

EGAP (Evidence in Governance and Politics). EGAP is an international research network of political scientists focused on field experiments in governance and politics. It maintains a repository of experiment pre-registrations, conducts collaborative multi-site experiments, and promotes norms of transparency and replication in the political experiment literature. EGAP's work spans both established and developing democracies, providing comparative perspective on how context affects experimental findings.

The academic-campaign partnership model. The most common structure for political field experiments involves a three-way partnership: a campaign or advocacy organization provides access to a GOTV program, an academic researcher provides methodological design and analysis, and an institutional host (a university IRB) provides ethical oversight. This model has produced the bulk of the published experimental literature but also creates tensions around data ownership, publication timelines, and the competing incentives of the partners.

Campaigns need results quickly enough to inform current-cycle operations. Academic researchers need to publish eventually, which requires careful write-up and peer review. The publication delay typically ranges from six months to three years — long after the campaign cared about the results, and long enough that the sponsoring side enjoys a private informational advantage over its competitors in the interim. Managing these timeline tensions is an ongoing challenge in the academic-campaign partnership model.

The pre-registration norm and its limits. Pre-registration has become increasingly standard in political experiments, driven by the broader replication crisis in social science and by EGAP's promotion of transparency norms. Pre-registration protects against the most obvious forms of specification searching but does not prevent all forms of analytic flexibility — researchers can still make defensible choices within the space of their pre-registered analysis plan that affect findings. And pre-registration is most effective when the community of researchers treats non-pre-registered studies skeptically, which the political science experimental community is still working toward.

🔗 Connection to the Meridian Model: Vivian Park's approach to partnering with academic teams reflects the best features of the academic-campaign partnership model. Meridian brings operational expertise and field access; the academic partners bring methodological rigor and IRB oversight; the combined product is more credible than either could produce alone. The pre-registration that the academic partners insist on — logging the analysis plan on EGAP's registry before data collection begins — is an operational inconvenience that pays dividends in the credibility of the findings.

30.9 Non-Experimental Alternatives: When You Can't Randomize

Field experiments are the gold standard for causal inference, but randomization is not always feasible. Campaigns can't randomize their resource allocation for purely research purposes; many interventions can't be withheld from deserving populations; some effects are too large to permit control groups. Political scientists have developed several non-experimental approaches to causal inference that can produce credible estimates when experiments are impossible.

Regression Discontinuity Design

The regression discontinuity (RD) design exploits situations in which treatment assignment is determined by whether a continuous running variable crosses a threshold. The key insight is that voters (or precincts) just above and just below the threshold are, in expectation, very similar to each other — so comparing outcomes just above and just below the cutoff approximates random assignment near the threshold.

Classic applications in political science include:

- The effect of winning close elections on future fundraising, policy influence, and legislative behavior
- The effect of campaign finance thresholds (above a certain donation size, additional disclosure requirements apply) on donor behavior
- The effect of electoral competitiveness (above a certain margin, races are coded as safe; below, as competitive) on incumbent behavior

RD designs are powerful but have important limitations: the estimates apply only to voters or units near the threshold (not to the broader population), the design requires that no other treatment changes discontinuously at the same threshold, and the "bandwidth" selection (how wide a window around the cutoff to use) involves tradeoffs between bias and precision.
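At its core, the RD comparison is a local difference in means around the cutoff. A deliberately naive sketch (serious applications use local linear regression and data-driven bandwidth selection, for example via the rdrobust package):

```python
import numpy as np

def rd_estimate(running, outcome, cutoff, bandwidth):
    """Compare mean outcomes just above vs. just below the cutoff,
    within +/- bandwidth of it."""
    below = (running >= cutoff - bandwidth) & (running < cutoff)
    above = (running >= cutoff) & (running <= cutoff + bandwidth)
    return outcome[above].mean() - outcome[below].mean()
```

The bandwidth argument makes the bias-precision tradeoff explicit: widen it and you gain sample but lose comparability; narrow it and the units are more similar but the estimate gets noisier.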

Matching

Matching is a technique for constructing a comparison group from observational data that resembles the treatment group on measured characteristics. If you want to estimate the effect of a canvassing program that targeted specific precincts, you can find comparison precincts that were similar to the canvassed precincts on relevant background characteristics (turnout history, partisan composition, demographic composition) and compare outcomes between the matched pairs.

Matching works well when the treatment and comparison groups differ primarily on observed characteristics that you can measure and match on. It fails when there are important unobserved confounders — characteristics that affect both treatment assignment and outcomes but that you haven't measured. The credibility of a matching analysis depends entirely on how plausible the "ignorability" assumption is — the assumption that, conditional on the matched variables, treatment assignment is as good as random.
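A minimal nearest-neighbor matcher shows the mechanics; the hard part is not the code but the ignorability assumption it silently relies on. A sketch (covariates are assumed already standardized, matching is 1:1 with replacement):

```python
import numpy as np

def nearest_neighbor_match(X_treated, X_control):
    """For each treated unit, return the index of the closest control unit
    in covariate space (Euclidean distance)."""
    matches = []
    for x in X_treated:
        distances = np.linalg.norm(X_control - x, axis=1)
        matches.append(int(distances.argmin()))
    return matches
```

Nothing in this code can detect an unobserved confounder; if canvassed precincts differ from their matches on something you never measured, the matched comparison is biased no matter how close the observed covariates are.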

Difference-in-Differences

Difference-in-differences (DiD) exploits situations in which you have outcome data for treatment and comparison units both before and after the treatment. If a campaign ran a canvassing program in some precincts but not others, and you have turnout data from the current election and a comparable prior election, you can estimate the canvassing effect by comparing the change in turnout across elections in canvassed versus non-canvassed precincts.

The key identifying assumption in DiD is "parallel trends" — the assumption that, without the treatment, turnout in the canvassed and non-canvassed precincts would have changed at the same rate. If the canvassed precincts were trending upward before the campaign anyway, the DiD estimate will overstate the canvassing effect.

DiD is widely used in political science and economics because the data requirements are relatively modest (you need panel data with pre- and post-treatment observations). Its credibility depends on the plausibility of the parallel trends assumption in the specific application.
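The DiD estimator itself is one line of arithmetic. A sketch with invented turnout rates:

```python
def diff_in_diff(treat_pre, treat_post, control_pre, control_post):
    """Difference-in-differences: the treated group's change minus
    the comparison group's change."""
    return (treat_post - treat_pre) - (control_post - control_pre)

# Canvassed precincts rose 58% -> 63%; comparison precincts rose 55% -> 57%
print(f"{diff_in_diff(0.58, 0.63, 0.55, 0.57):.1%}")  # 3.0% if parallel trends holds
```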

🔗 Connection to the Prediction vs. Explanation Theme: The contrast between experimental and non-experimental methods maps directly onto the distinction between prediction and explanation. Observational models — the support scores and turnout propensity scores that campaigns use — are prediction tools. They use historical patterns to predict future behavior; they don't establish why the patterns exist. Field experiments are explanation tools. They tell you not just that contacted voters turn out more, but that contact itself causes the increase in turnout. The difference matters enormously for what you can do with the finding: a predictive model tells you who to target; a causal estimate tells you what effect your targeting will have.

30.10 What Field Experiments Have Found: The Summary Evidence

After more than two decades of political field experiments, a substantial base of cumulative evidence has emerged. Several major findings are sufficiently robust to be considered well-established.

Personal canvassing is the most effective single voter contact mode. Meta-analyses consistently find that face-to-face canvassing produces turnout effects in the range of 2–4 percentage points per contact, with significant variation by canvasser quality, population, and context. The effect is larger when canvassers are neighbors rather than strangers, and when scripts emphasize social norms rather than pure information.

Social pressure is powerful but ethically contested. The "name and shame" social pressure mailer — showing voters their own and their neighbors' turnout histories — produces effects comparable to or larger than personal canvassing, at dramatically lower cost per contact. But the tactic generates backlash: many recipients object strongly to receiving the mail, and it has been associated with increased anger at the sending campaign. Most campaigns use modified versions that invoke social norms without listing specific neighbors.

Phone banking has small effects, text banking has slightly larger ones. Phone contact effects have declined substantially from what earlier (less experimental) studies suggested, as answer rates have fallen. Text banking effects are somewhat larger, particularly among younger voters, and the cost per contact is lower — making it competitive on a cost-per-vote basis even with modest per-contact effects.

Direct mail has small but real effects. Each piece of direct mail produces a turnout effect of approximately 0.3–0.7 percentage points per piece delivered. At typical mail costs, this is competitive on a cost-per-vote basis, particularly for high-volume mail programs that combine multiple pieces.

Persuasion effects are smaller and less consistent. The evidence on whether voter contact changes vote choice (rather than just turnout) is more mixed. Some experiments find meaningful persuasion effects for specific message types and populations; many find effects indistinguishable from zero. The honest summary is that moving voters across partisan lines is hard, and the evidence for large persuasion effects from any single contact mode is weak.

Context and population matter significantly. Effect sizes vary substantially across elections (presidential vs. midterm vs. primary), populations (high-propensity vs. low-propensity voters), and geographies (urban vs. suburban vs. rural). Effect estimates from one context should be applied to other contexts with caution.

30.10a Factorial Designs and Multi-Arm Experiments

Most of the foundational political field experiments were two-arm studies: treatment versus control, one GOTV intervention versus nothing. As the field has matured, more sophisticated multi-arm and factorial designs have become standard, enabling researchers to answer multiple questions within a single experiment.

Multi-arm experiments. A multi-arm experiment simply randomizes units to more than two conditions — for example, three arms comparing canvassing, phone banking, and control. This design is more informative than running separate two-arm experiments because it allows direct comparison of the active treatments, and it is more efficient because a shared control group serves multiple treatment comparisons.

The original Gerber-Green (2000) study was actually a multi-arm design: it had seven conditions (various combinations of canvassing, mail, and phone contact). This design allowed the researchers to estimate the effect of each contact mode against the control, and also to estimate whether combinations of contact modes produced larger effects than single modes (they found limited evidence of synergy between modes).

Factorial designs. A factorial design varies two or more factors simultaneously, with all combinations of factor levels represented. Suppose a campaign wants to test both the contact mode (canvassing vs. mail) and the message frame (economic vs. healthcare). A 2x2 factorial design would have four arms: canvassing-economic, canvassing-healthcare, mail-economic, mail-healthcare, plus a control. This design allows estimation of the main effects of mode and message, and also whether the combination of canvassing and a specific message frame is more effective than either factor alone — the interaction effect.

Factorial designs are powerful but require larger sample sizes to maintain statistical power for each comparison. The required total sample grows roughly in proportion to the number of arms: a 2x2 factorial plus control has five conditions and therefore needs roughly five arms' worth of sample, compared with two arms' worth for a simple treatment-control study. This limits factorial designs to experiments with very large GOTV targets.
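Generating a factorial assignment is mechanically simple; the design discipline lies in powering every comparison you intend to make. A sketch of the chapter's 2x2-plus-control example (arms are equiprobable in expectation; a field implementation would also block, as described earlier):

```python
import random
from itertools import product

def factorial_assignment(unit_ids, factors, include_control=True, seed=7):
    """Assign each unit to one cell of a full factorial design,
    plus an optional pure-control arm."""
    arms = [dict(zip(factors, combo)) for combo in product(*factors.values())]
    if include_control:
        arms.append({"control": True})
    rng = random.Random(seed)
    return {uid: rng.choice(arms) for uid in unit_ids}

assignment = factorial_assignment(
    range(25_000),
    factors={"mode": ["canvass", "mail"], "message": ["economic", "healthcare"]},
)
```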

Sequential designs and adaptive experiments. An emerging approach in political experiment design is adaptive experimentation — designs that use interim results to update the allocation of new subjects across arms, concentrating resources in the most promising conditions. These designs can identify the best treatment more efficiently than fixed allocation, but they require more sophisticated analysis and raise issues about interim peeking (analyzing results before the experiment is complete) that need to be handled carefully.

The Meridian group does not currently use adaptive designs — Vivian considers the operational complexity too high for typical campaign contexts — but larger organizations like the Analyst Institute have begun exploring them for multi-site experiments where interim results from early sites can inform allocation decisions for later sites.

Moderation analysis. Beyond estimating average treatment effects, experimenters often want to understand whether effects differ across subgroups — whether canvassing is more effective for younger voters than older ones, for example, or more effective in competitive precincts than safe precincts. This analysis of treatment effect heterogeneity is sometimes called "moderation analysis" (asking what moderates the treatment effect) or "subgroup analysis."

Moderation analysis in field experiments requires care. When researchers specify subgroups after seeing the data, they risk false positive findings from multiple testing. Pre-specified subgroup analyses (specified before looking at the data) are more credible. The Meridian experiment pre-specified two subgroups — high-density versus lower-density precincts, and first-time voters versus experienced voters — based on prior literature suggesting these dimensions moderate canvassing effects.
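Computationally, a pre-specified moderation analysis is just the ITT recomputed within each subgroup (typically alongside an interaction test in the full model). A minimal sketch of the subgroup piece:

```python
import numpy as np

def subgroup_itt(assigned, voted, subgroup):
    """Intent-to-treat effect within each level of a subgroup variable.
    assigned, voted: boolean arrays; subgroup: array of category labels."""
    return {
        g: voted[(subgroup == g) & assigned].mean()
           - voted[(subgroup == g) & ~assigned].mean()
        for g in np.unique(subgroup)
    }
```

The code is the same whether the subgroups were pre-specified or dredged up after the fact; only the pre-registration record distinguishes a credible moderation finding from a multiple-testing artifact.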

30.11 Connecting Experiments to Campaign Practice

The relationship between field experiment findings and campaign practice is more complicated than a simple "research finds X, campaigns do X" picture would suggest.

On the positive side, the field experiment literature has demonstrably changed how campaigns allocate their resources. The finding that personal canvassing dramatically outperforms phone banking in per-contact efficacy has shifted campaign investment toward field organizing. The finding that neighbor-to-neighbor canvassing is more effective than stranger canvassing drove program design changes in multiple cycles. The social pressure mailer finding created an entire genre of civic mobilization direct mail.

On the complicated side, the experiment findings are often applied without adequate attention to the conditions under which they were obtained. An experiment finding a 3-point canvassing effect in a high-stakes gubernatorial race in a competitive state may not generalize to a school board election in a low-salience environment. An experiment finding large social pressure effects may not replicate with a different population or in a different political climate.

Vivian Park is precise about this with clients. "The research tells you the effect in the environments where it was studied," she tells campaigns. "It tells you very little about the effect in your specific race, with your specific canvassers, in your specific moment. That's why we run our own experiments wherever we can, rather than just applying the literature."

📊 Real-World Application: The Analyst Institute, a research consortium that has partnered with hundreds of progressive campaigns and organizations, has assembled perhaps the largest repository of political field experiment data in existence. Its meta-analyses of GOTV experiments provide the best available estimates of treatment effects under specific conditions, and its practical guidance translates those estimates into resource allocation recommendations. Campaigns with access to Analyst Institute resources can get substantially more contextually appropriate guidance than the published academic literature alone provides.

30.12 The Meridian Study's Results and Implications

Meridian Research Group's canvassing experiment concluded with the general election. In the weeks following Election Day, Carlos and Trish matched the experimental sample to the voter file and calculated the initial intent-to-treat effects.

The ITT estimate across the three counties was 2.8 percentage points (95% CI: 1.4–4.2 pp). With a contact rate of approximately 43%, the implied LATE was roughly 6.5 percentage points. Both effects were statistically significant and substantively consistent with the published literature.
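The ITT-to-LATE arithmetic is simple enough to verify directly. A minimal sketch, using the point estimates reported above and assuming one-sided noncompliance (control households cannot be canvassed):

```python
# ITT-to-LATE arithmetic under one-sided noncompliance, using the point
# estimates reported in the text.
itt = 0.028           # intent-to-treat effect: 2.8 percentage points
contact_rate = 0.43   # share of treatment-assigned voters actually reached

# Wald estimator: the effect of contact among compliers.
late = itt / contact_rate
print(f"LATE = {late:.3f} ({100 * late:.1f} percentage points)")
```

In a full analysis this ratio is typically estimated by two-stage least squares, with assignment as an instrument for contact, which also produces an appropriate standard error.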

A notable finding emerged from the county-level heterogeneity analysis: the county with the higher contact rate and (potentially) higher canvasser quality showed a larger ITT effect (3.9 pp) than the other two counties (2.1 and 2.4 pp). Carlos's analysis suggested that roughly half this difference was attributable to the higher contact rate and the other half potentially attributable to canvasser quality — a finding with practical implications for campaign resource allocation.

The study's results are being written up for publication and will be shared with the academic partnership and, after an appropriate lag, with the broader research community. For Meridian's client — the outside group — the findings arrive immediately after the election, in time to inform GOTV program design for the next cycle.

30.13 Conclusion: Why Experiments Are Irreplaceable

The field experiment is, in a meaningful sense, the best tool political science has for answering the questions that matter most for democratic practice. Not "who is likely to vote?" but "what makes them vote?" Not "who supports this candidate?" but "what makes them support her?" The experimental method — random assignment, counterfactual comparison, statistical inference — provides answers to causal questions that no amount of observational data can definitively resolve.

This doesn't mean experiments are perfect. They are expensive. They can be run only in real electoral environments where campaigns or organizations consent to randomization. Their findings don't always generalize to new contexts. They raise ethical questions that require careful management. And there are important political questions — the effects of large-scale structural changes, the effects of decades-long information environments, the effects of deep cultural forces — that cannot be addressed experimentally at all.

But for the specific questions that campaigns care most about — does this contact mode work, does this message move voters, does this mobilization strategy produce votes — the field experiment provides the only genuinely credible answer. The prediction vs. explanation distinction is not just a methodological preference; it is the difference between knowing what patterns exist and knowing why they exist. The former is useful for targeting. The latter is what campaigns, researchers, and ultimately democratic citizens need to evaluate whether the tools of modern political mobilization actually serve the goals they're supposed to serve.


Key Terms

Randomized controlled trial (RCT) — A study design in which units are randomly assigned to treatment or control conditions, enabling unbiased causal inference by ensuring groups are equivalent before the experiment.

Intent-to-treat (ITT) — Analysis comparing treatment-assigned and control-assigned groups regardless of whether treatment was actually received. The operationally relevant estimate for campaigns.

Local average treatment effect (LATE) — The estimated effect of actual contact among compliers, the voters who would have been contacted if assigned to treatment; derived by dividing the ITT by the contact rate.

Compliance rate — The proportion of treatment-assigned voters who actually received the treatment (e.g., were successfully canvassed).

Blocked randomization — Randomization stratified by key background variables to ensure balanced groups and improve statistical precision.

Cluster randomization — Randomization at the group level (households, precincts, blocks) rather than the individual level, used to prevent spillover between treatment and control units.

Spillover — Contamination of the control group through treatment-to-control transmission mechanisms such as social networks or geographic proximity.

Social pressure mailer — A type of GOTV direct mail that shows recipients their own and their neighbors' voting histories, found in experiments to produce large turnout effects.

Regression discontinuity design — A quasi-experimental method that exploits threshold-based treatment assignment to estimate causal effects for units near the threshold.

Difference-in-differences — A quasi-experimental method that estimates causal effects by comparing changes over time between treatment and comparison groups.

Statistical power — The probability that an experiment will detect a true treatment effect if one exists; depends on sample size, effect size, and outcome variance.