Case Study 30.1: The Meridian Canvassing Experiment — Design, Execution, and What Went Wrong

Background

This case study tells the full story of the Meridian Research Group's canvassing experiment in the Garza-Whitfield Senate race — from the initial design conversations to the post-election analysis — with particular attention to the operational complications that almost derailed the study and what the team learned from them.

The Partnership Arrangement

Meridian's client for this experiment was Civic Engagement Forward (CEF), a 501(c)(4) organization that was running an independent canvassing operation in support of progressive candidates, including candidates ideologically aligned with Maria Garza's Senate campaign. CEF was legally prohibited from coordinating directly with the Garza campaign, but it had built a significant field presence in three of the state's suburban counties over the summer.

The research partnership arose because Civic Engagement Forward's executive director, having read about Analyst Institute research at a conference, approached Vivian Park about embedding an experiment in their fall canvassing program. The director's motivation was both practical (she wanted to know whether their canvassing model was actually producing votes) and political (she could use rigorous evidence of effectiveness to raise money from donors who were increasingly demanding proof of impact).

Vivian's initial meeting with the director established the parameters: Meridian would design the randomization and analysis; CEF would provide access to its canvassing program and its field data; an academic team from a major state university would provide IRB coverage and peer review; Meridian would publish the results in a peer-reviewed outlet with appropriate delay to protect operational security.

The Design Phase

Carlos Mendez spent three weeks on the design. His starting point was the power analysis: what sample size would be required to detect a meaningful effect with adequate statistical power?

Key design parameters:

  - Target population: Low-to-moderate propensity registered Democrats and unaffiliated voters in the three target counties (CEF's existing target universe)
  - Baseline turnout estimate: 55%, based on previous comparable elections
  - Target effect size to detect: 2.5 percentage points (consistent with published meta-analyses)
  - Statistical power target: 80% at α = 0.05
  - Expected contact rate: 40% (based on CEF's historical contact rates in similar environments)

The power analysis produced a required sample of approximately 6,500 per arm at the individual level. Given the expected 40% contact rate, he needed to assign approximately 16,250 to the treatment group to generate 6,500 actual contacts.
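
A minimal sketch of that calculation, using the standard two-proportion normal-approximation formula. The formula alone lands slightly below the figures above; the gap is consistent with conservative rounding or a continuity correction, and the exact adjustments Carlos used are not recorded here.

    from scipy.stats import norm

    # Two-proportion sample-size calculation (normal approximation).
    # Design parameters from the text: baseline turnout 55%, detect +2.5 pp,
    # 80% power at two-sided alpha = 0.05, expected contact rate 40%.
    p0, delta = 0.55, 0.025
    p1 = p0 + delta
    alpha, power = 0.05, 0.80

    z_a = norm.ppf(1 - alpha / 2)   # ~1.96
    z_b = norm.ppf(power)           # ~0.84

    # Required n per arm to detect a difference of `delta` in proportions.
    n_per_arm = (z_a + z_b) ** 2 * (p0 * (1 - p0) + p1 * (1 - p1)) / delta ** 2

    # Scale treatment assignments up by the expected contact rate.
    contact_rate = 0.40
    n_treatment_assigned = n_per_arm / contact_rate

    print(f"n per arm: {n_per_arm:,.0f}")                         # ~6,200
    print(f"treatment assignments: {n_treatment_assigned:,.0f}")  # ~15,500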

Carlos's first instinct was to randomize at the individual level. Vivian pushed back: the experiment was in densely settled suburban neighborhoods with significant multi-unit housing. Spillover through household and neighbor conversations was a genuine concern. After discussion, they settled on a household-level randomization — all eligible voters in the same household would be assigned to the same condition.

This change had a cost: households with multiple voters would contribute correlated outcomes (family members influence each other's turnout), which reduces the effective sample size. Carlos recalculated with an intraclass correlation (ICC) of 0.3 for within-household turnout — a reasonable estimate based on prior literature — and found he needed approximately 25,000 to 30,000 household-assignments to maintain adequate power, with a control group of about 8,000 households.
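
The design-effect arithmetic behind this recalculation can be sketched as follows. The mean household size below is an illustrative assumption, and the figures cover the treatment arm only, so they will not reproduce the study's totals, which also depend on the actual household-size distribution and the arm split.

    # Kish design effect for household clustering (equal cluster sizes).
    # The ICC of 0.3 comes from the text; the mean number of eligible
    # voters per household (m_bar) is an illustrative assumption.
    icc = 0.3
    m_bar = 1.7                              # hypothetical mean household size

    deff = 1 + (m_bar - 1) * icc             # variance inflation from clustering
    n_individual = 16_250                    # treatment assignments, original plan
    n_inflated = n_individual * deff         # individuals needed to preserve power
    n_households = n_inflated / m_bar        # implied treatment household count

    print(f"design effect: {deff:.2f}")                  # 1.21
    print(f"individuals needed: {n_inflated:,.0f}")      # ~19,700
    print(f"treatment households: {n_households:,.0f}")  # ~11,600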

The blocking design stratified on county (three levels), past turnout history (voted in 0–1 of the last 3 elections, voted in 2, voted in all 3), and a coarse support-score category (low-moderate, moderate, moderate-high). This produced 27 strata, and randomization was conducted within each stratum.

The Randomization Meeting

Two weeks before canvassing was scheduled to begin, Carlos, Trish, and CEF's field director held a two-hour meeting to finalize the randomization. Three issues arose that hadn't been anticipated in the design phase.

Issue 1: The precinct-level problem. CEF's canvassing program was organized by precinct: canvassers were assigned to specific precincts and given walking lists for their assigned geography. If the experiment randomized at the household level within precincts, canvassers would be walking past control-group households on the way to treatment-group households — raising the risk of incidental contact. Trish proposed an alternative: randomize at the precinct level within the three counties. Control-group precincts would not be canvassed at all.

Carlos ran the numbers. Precinct-level randomization would demand a larger total sample, because statistical power now depended on the number of precincts and the between-precinct variance rather than on the number of individual voters, but it would almost entirely eliminate the spillover risk. CEF had access to 78 precincts across the three counties within the target universe; Carlos determined that with the 78 precincts randomly assigned approximately 2:1 to treatment and control, the experiment would have adequate power for a 2-percentage-point effect.
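
A sketch of the kind of check Carlos would have run. The precinct size and the residual between-precinct ICC below are illustrative assumptions (the ICC presumes that stratification and covariate adjustment absorb most between-precinct variance), not figures from the study.

    import math

    # Minimum detectable effect (MDE) for a cluster-randomized design:
    # 78 precincts split ~2:1 treatment:control. Precinct size (m) and
    # residual between-precinct ICC (rho) are illustrative assumptions.
    k_t, k_c = 52, 26
    m = 600                    # assumed registered voters per precinct
    rho = 0.002                # assumed residual between-precinct ICC
    p = 0.55                   # baseline turnout
    z = 1.9600 + 0.8416        # z_{alpha/2} + z_{beta}: alpha=0.05, power=0.80

    deff = 1 + (m - 1) * rho                       # clustering inflation
    var_diff = p * (1 - p) * deff / m * (1 / k_t + 1 / k_c)
    mde = z * math.sqrt(var_diff)

    print(f"MDE: {mde * 100:.1f} pp")              # ~2.0 pp under these assumptions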

They switched to precinct-level randomization, accepting the loss of some statistical precision in exchange for dramatically reduced spillover risk.
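
The assignment itself is mechanically simple. A sketch follows, with placeholder precinct counts per county, since the actual distribution of the 78 precincts across the three counties is not given in the text.

    import numpy as np
    import pandas as pd

    # Precinct-level ~2:1 assignment, drawn within county so the allocation
    # ratio holds in each county. Precinct counts per county are placeholders.
    rng = np.random.default_rng(20221108)          # fixed seed: reproducible draw

    precincts = pd.DataFrame({
        "precinct_id": range(78),
        "county": ["A"] * 30 + ["B"] * 26 + ["C"] * 22,
    })

    parts = []
    for county, group in precincts.groupby("county"):
        shuffled = group.sample(frac=1, random_state=rng)
        n_control = round(len(shuffled) / 3)       # ~1/3 of precincts to control
        shuffled["arm"] = (["control"] * n_control
                           + ["treatment"] * (len(shuffled) - n_control))
        parts.append(shuffled)

    assignments = pd.concat(parts)
    print(assignments["arm"].value_counts())       # 52 treatment, 26 control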

Issue 2: The non-experimental program. CEF was also running canvassing outside the experimental sample — in counties that weren't part of the experiment and in precincts they had originally planned to target before the study was designed. The non-experimental program was larger than the experimental program. Trish needed to make sure experimental control precincts weren't being canvassed by the non-experimental program.

This required a clear operational delineation: the experimental precincts were quarantined from the non-experimental program, and CEF's field director had to ensure that canvassers working the non-experimental program knew not to work in the experimental precincts. A logistical error here would be undetectable in the data and could entirely invalidate the control group.

Issue 3: The time constraint. The experiment needed to run long enough to generate adequate contacts before Election Day. With Election Day six weeks away and precinct randomization complete, the experimental canvassing program needed to begin immediately, leaving minimal time to train canvassers on the experimental protocol.

Trish's solution was to use CEF's existing canvassing staff rather than recruiting new canvassers for the experiment — they were already trained and in the field. She condensed the experimental protocol training into a single forty-five-minute session attached to the regular weekly briefing, and every canvasser received a briefing card describing what to do if they encountered an address outside their assigned list.

What Went Wrong

Three problems emerged during the six-week canvassing period.

Problem 1: The apartment building gap. Several of the experimental treatment precincts contained large apartment complexes where canvassing was difficult — building security, non-responsive buzzers, and high residential mobility combined to produce contact rates of 18–22% in those precincts, substantially below the overall target. More troubling, canvassers were spending large amounts of time in these buildings relative to contacts made: one canvasser reported spending forty minutes trying to reach voters in a single building and successfully contacting only three.

This created a contact rate problem. The apartment precincts were dragging the overall contact rate down, and since the LATE divides the ITT by the contact rate, a similar ITT across precinct types would mechanically inflate the pooled LATE estimate. Separately, if the apartment dwellers who could be reached differed systematically from contacted voters in other housing types, the population of "actually contacted" voters would be a selected one, complicating any per-contact interpretation.
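
The mechanics are easiest to see with invented numbers. In the sketch below, the ITT is held constant across precinct types while the contact rate differs; the pooled LATE then lands above the per-contact effect in the easy-to-canvass precincts.

    # Worked illustration of the LATE concern (all numbers invented).
    # LATE = ITT / contact rate (the Wald estimator, with precinct
    # assignment as the instrument for actual contact).
    segments = {
        # name: (share of treatment sample, ITT in pp, contact rate)
        "single-family precincts": (0.8, 2.5, 0.48),
        "apartment precincts":     (0.2, 2.5, 0.20),
    }

    pooled_itt = sum(w * itt for w, itt, _ in segments.values())
    pooled_contact = sum(w * c for w, _, c in segments.values())

    for name, (w, itt, c) in segments.items():
        print(f"{name}: LATE = {itt / c:.1f} pp")   # 5.2 pp and 12.5 pp
    print(f"pooled LATE = {pooled_itt / pooled_contact:.1f} pp")  # ~5.9 pp
    # With the ITT flat across precinct types, the low apartment contact
    # rate implies a very large per-contact effect there, pulling the
    # pooled LATE above the single-family per-contact effect.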

Carlos flagged this as a potential moderation analysis: he would estimate the ITT separately for high-density multi-unit precincts and lower-density residential precincts to see whether the effects differed.

Problem 2: The experienced-canvasser allocation. As described in the main chapter, one county's field supervisor made a rational but experiment-disrupting decision to assign more experienced canvassers to the experimental treatment precincts. Her reasoning was that experimental precincts were priority precincts (they were in CEF's primary target area) and deserved the best canvassers.

When Trish discovered this during a weekly check-in call, she had a difficult conversation with the county supervisor: the allocation needed to be random, not performance-based. The supervisor was confused and resistant — "you're telling me to put worse canvassers in important precincts?" — and it took Vivian's personal call to explain the experimental logic. The allocation was corrected for the final three weeks of canvassing, but the first three weeks had already produced an imbalanced canvasser quality distribution.

Problem 3: A protocol breach in county two. In the second county, a canvasser working a treatment precinct realized that a personal acquaintance of hers lived in a control precinct two blocks from one of her assigned addresses. She stopped by to say hello and mentioned that she was out volunteering for a civic engagement campaign. The acquaintance, curious, asked for information, and the canvasser gave her a voter registration update card. This was a clear protocol breach: contact with a control-group voter.

When this came to light through a canvasser supervisor's field report, Carlos and Vivian assessed the impact: one control-group household, one protocol breach, minimal contamination risk to the larger experiment given 78 precincts in the design. They logged the breach, noted the specific household, and excluded it from the analysis. They also added a protocol refresher to the next week's canvasser briefing.

The Analysis and Results

Post-election analysis matched the experimental sample to the voter file and calculated ITT effects by county and in aggregate. The results:

Aggregate ITT: 2.8 percentage points (95% CI: 1.4–4.2 pp). Statistically significant at p < 0.01.

Contact rate: 43% (of treatment-precinct registered voters successfully contacted).

LATE: 6.5 percentage points (95% CI: 3.3–9.7 pp).

County-level heterogeneity: County one (high-contact, potentially high-quality canvassers): 3.9 pp ITT. Counties two and three: 2.1 and 2.4 pp. The difference is statistically significant in a pooled model with county-by-treatment interaction terms.
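
A sketch of the estimating equation behind the pooled model, with synthetic data so the snippet runs; the column names are hypothetical, not the study's schema. Standard errors are clustered at the precinct level, the unit of randomization.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # ITT model with county-by-treatment interactions and precinct-clustered
    # standard errors. The synthetic data exists only to make this runnable.
    rng = np.random.default_rng(0)
    n = 5_000
    df = pd.DataFrame({
        "county": rng.choice(["A", "B", "C"], size=n),
        "precinct_id": rng.integers(0, 78, size=n),
    })
    df["treat"] = (df["precinct_id"] % 3 != 0).astype(int)     # ~2:1 assignment
    df["voted"] = rng.binomial(1, 0.55 + 0.028 * df["treat"])  # baseline + ITT

    model = smf.ols("voted ~ treat * C(county)", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["precinct_id"]}
    )
    print(model.params["treat"])   # ITT in the reference county
    # The treat:C(county)[T.B] and treat:C(county)[T.C] terms test whether
    # the other counties' ITT effects differ from the reference county's.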

Carlos's analysis of the canvasser quality issue found that county one's first three weeks of contacts were associated with higher conversation quality scores on CEF's field app (which tracked canvasser self-reported quality ratings), consistent with the interpretation that the initial high-quality canvasser allocation was partially driving the larger effect. Differences in contact rates alone explained approximately 55% of the ITT gap across counties; the remaining 45% was unexplained, consistent with but not definitively attributable to canvasser quality.

Key limitation documented in the write-up: The canvasser quality allocation issue in county one creates ambiguity about whether the larger county-one effect reflects the effect of a higher-quality canvassing model (of practical significance for CEF's program design) or simply a higher contact rate (which CEF could achieve in other counties with additional canvasser training). Disentangling these effects would require a follow-up experiment specifically designed to vary canvasser quality.

Discussion Questions

  1. The team switched from individual-level to precinct-level randomization to address spillover concerns. What was gained and what was lost in this switch? Under what conditions would the individual-level design have been preferable despite the spillover risk?

  2. The canvasser quality problem in county one created ambiguity about whether to attribute the larger effect to contact rate or canvasser quality. Design a follow-up experiment that would cleanly separate these two factors.

  3. The protocol breach involving a control-group household was treated as minor given 78 precincts in the design. At what scale of protocol breach — how many households, what type of contamination — would you have considered stopping the experiment? How would you make that determination?

  4. Trish's training approach condensed experimental protocol education into a forty-five-minute session attached to a regular briefing. What are the limits of this approach? What would more adequate training look like, and what operational costs would it impose?

  5. The publication of the results will occur with a lag designed to protect operational security. Should there be any limit on how long Meridian and CEF can delay publication? What interests are served by the delay, and what interests are harmed?