Capstone 3 Data Appendix: The Campaign Analytics Plan

Section A: Dataset Reference Guide

Primary Dataset: oda_voters.csv

The ODA voter file is a teaching-scale version of the data infrastructure used in real campaign analytics. With approximately 50,000 rows, it provides the foundation for all universe construction work in this capstone. Real Senate race voter files have millions of records; the relative compositions and analytical relationships in this teaching dataset accurately reflect real voter file characteristics.

Complete column reference:

| Column | Type | Range / Values | Notes |
| --- | --- | --- | --- |
| voter_id | string | V-XXXXX | Unique identifier |
| state | string | Single-state | All records from the race's state |
| county | string | 12 county names | See county reference table below |
| age | integer | 18-85 | Current age at election date |
| gender | categorical | M, F, Other | |
| race_ethnicity | categorical | White, Black, Hispanic, Asian, Other | Self-reported where available, estimated otherwise |
| education | categorical | Less than HS, HS Diploma, Some College, College Degree, Graduate Degree | |
| income_bracket | categorical | Under 30K, 30-60K, 60-100K, Over 100K | Estimated from public records and commercial data |
| party_reg | categorical | Democrat, Republican, Independent, Other | Official registration status |
| vote_history_2018 | binary | 0, 1 | 1 = participated in 2018 midterm election |
| vote_history_2020 | binary | 0, 1 | 1 = participated in 2020 presidential election |
| vote_history_2022 | binary | 0, 1 | 1 = participated in 2022 midterm election |
| urban_rural | categorical | Urban, Suburban, Rural | |
| support_score | float | 0-100 | Garza support probability × 100 |
| persuadability_score | float | 0-100 | Higher = more movable by campaign contact |
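
A quick schema check can catch loading problems before any universe work begins. The sketch below is one possible approach, assuming the column names and ranges listed in the column reference; `validate_voters` and the three-row `sample` frame are hypothetical names used only for illustration.

```python
import pandas as pd

# Expected numeric ranges and category values, taken from the column reference
NUMERIC_RANGES = {
    'age': (18, 85),
    'support_score': (0, 100),
    'persuadability_score': (0, 100),
}
BINARY_COLUMNS = ['vote_history_2018', 'vote_history_2020', 'vote_history_2022']
CATEGORIES = {
    'gender': {'M', 'F', 'Other'},
    'party_reg': {'Democrat', 'Republican', 'Independent', 'Other'},
    'urban_rural': {'Urban', 'Suburban', 'Rural'},
}

def validate_voters(df):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if df['voter_id'].duplicated().any():
        problems.append('voter_id contains duplicates')
    for col, (low, high) in NUMERIC_RANGES.items():
        # Missing values also fail this check, because between() is False for NaN
        if not df[col].between(low, high).all():
            problems.append(f'{col} has values outside {low}-{high}')
    for col in BINARY_COLUMNS:
        if not df[col].isin([0, 1]).all():
            problems.append(f'{col} has values other than 0 and 1')
    for col, allowed in CATEGORIES.items():
        unexpected = set(df[col].dropna().unique()) - allowed
        if unexpected:
            problems.append(f'{col} has unexpected values: {sorted(unexpected)}')
    return problems

# Hypothetical three-row sample; in practice pass pd.read_csv('oda_voters.csv')
sample = pd.DataFrame({
    'voter_id': ['V-00001', 'V-00002', 'V-00003'],
    'age': [34, 67, 17],  # 17 is deliberately out of range
    'gender': ['F', 'M', 'Other'],
    'party_reg': ['Democrat', 'Independent', 'Republican'],
    'urban_rural': ['Urban', 'Rural', 'Suburban'],
    'vote_history_2018': [1, 0, 0],
    'vote_history_2020': [1, 1, 0],
    'vote_history_2022': [1, 1, 0],
    'support_score': [72.5, 41.0, 55.3],
    'persuadability_score': [30.1, 66.4, 80.0],
})
problems = validate_voters(sample)
print(problems)
```

Running the check on the real file before Step 1 of the worked example in Section B makes later surprises, such as an empty universe caused by a mistyped column, easier to trace.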

County reference table:

| County Name | Type | Approx. Share of State Voters | Strategic Designation |
| --- | --- | --- | --- |
| Metro Central | Urban | 22% | Garza base; GOTV priority |
| Metro South | Urban | 14% | Garza base; GOTV priority |
| Lakeview | Suburban | 12% | Swing; persuasion and GOTV |
| Riverside | Suburban | 9% | Swing; persuasion priority |
| Valley North | Suburban | 7% | Lean Whitfield; limited investment |
| Highfield | Rural | 8% | Whitfield stronghold; GOTV for Garza voters only |
| Carson | Rural | 7% | Whitfield stronghold; GOTV for Garza voters only |
| Eastport | Rural | 5% | Whitfield stronghold |
| Garfield | Rural | 4% | Split rural; some Garza opportunity |
| Millbrook | Suburban/rural mix | 5% | Lean Whitfield; some persuasion opportunity |
| Crestview | Suburban | 4% | Lean Garza; GOTV priority |
| Other/small | Mixed | 3% | Distributed; low investment |

Expected support score distribution by county:

| County | Mean Support Score | Std Dev | Notes |
| --- | --- | --- | --- |
| Metro Central | 66.2 | 18.4 | Garza-favorable; large Latino share |
| Metro South | 63.8 | 19.1 | Garza-favorable; large Black share |
| Lakeview | 51.3 | 22.7 | Genuinely swing; college-educated suburbs |
| Riverside | 49.8 | 23.1 | Most competitive county in the state |
| Valley North | 44.2 | 20.5 | Lean Whitfield; some college-ed women opportunity |
| Highfield | 38.5 | 16.2 | Whitfield stronghold |
| Carson | 37.1 | 15.8 | Whitfield stronghold |
| Garfield | 46.3 | 21.9 | Split rural; Garza has a small universe here |

Secondary Dataset: oda_polls.csv

Used for polling plan design and public polling analysis.

Key columns for this capstone:

| Column | Notes |
| --- | --- |
| date | Date poll was fielded (start date) |
| pollster | Organization conducting the poll |
| methodology | phone, online, mixed, IVR |
| pct_d, pct_r | Candidate percentages |
| sample_size | Total N |
| margin_error | Reported margin of error |
| population | LV (likely voters), RV (registered voters), A (adults) |
| race_type | senate_general, senate_primary, governor_general, etc. |

Public polling summary for the Garza-Whitfield race (60-day window):

The following represents the polling landscape as of the campaign's 60-day-out point:

| Date | Pollster | Method | Garza % | Whitfield % | MOE | Pop |
| --- | --- | --- | --- | --- | --- | --- |
| -62 days | University Poll | Phone | 45 | 43 | 4.2 | RV |
| -58 days | SurveyUSA | Online | 47 | 44 | 3.8 | LV |
| -54 days | FOX State | Phone/Online | 44 | 45 | 3.5 | LV |
| -51 days | BluePath (D) | Phone | 49 | 42 | 4.1 | LV |
| -48 days | Meridian Research | Mixed | 46 | 44 | 3.6 | LV |
| -44 days | Impact Research (D) | Online | 47 | 43 | 3.9 | LV |
| -42 days | Cygnal (R) | Online | 44 | 46 | 4.0 | LV |
| -38 days | Emerson | Online/IVR | 46 | 45 | 3.7 | LV |

Weighted public average (as of 60 days out): Garza +2.1 points
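
The appendix does not state how the public average is weighted, so the following is only a sketch of how such an average could be computed from the eight polls listed. The inverse-squared-MOE weights are an assumption chosen for illustration (more precise polls count more) and are not expected to reproduce the +2.1 figure exactly.

```python
# (pollster, Garza %, Whitfield %, reported MOE), copied from the public polling table
polls = [
    ('University Poll', 45, 43, 4.2),
    ('SurveyUSA', 47, 44, 3.8),
    ('FOX State', 44, 45, 3.5),
    ('BluePath (D)', 49, 42, 4.1),
    ('Meridian Research', 46, 44, 3.6),
    ('Impact Research (D)', 47, 43, 3.9),
    ('Cygnal (R)', 44, 46, 4.0),
    ('Emerson', 46, 45, 3.7),
]

# Garza-minus-Whitfield margin for each poll
margins = [garza - whitfield for _, garza, whitfield, _ in polls]
simple_average = sum(margins) / len(margins)

# Assumed weighting scheme: inverse of squared MOE
weights = [1 / moe ** 2 for *_, moe in polls]
weighted_average = sum(m * w for m, w in zip(margins, weights)) / sum(weights)

print(f"Simple average margin: Garza {simple_average:+.1f}")
print(f"Precision-weighted margin: Garza {weighted_average:+.1f}")
```

Other defensible schemes weight by recency, sample size, or pollster track record, or discount partisan-sponsored polls. Whichever you choose, state it in your polling plan.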

Meridian internal poll (most recent, 3 weeks old):

  • Sample: N=802 likely voters
  • Method: Mixed mode (50% phone, 50% online panel)
  • Garza: 46%, Whitfield: 44%, Undecided/Other: 10%
  • MOE: ±3.5%

Key Meridian subgroup findings (for targeting design):

| Segment | Garza % | Whitfield % | Notes |
| --- | --- | --- | --- |
| Latino voters | 68 | 24 | 8% undecided; room to grow |
| Black voters | 87 | 9 | 4% undecided; low persuasion room |
| White college-ed women | 55 | 40 | 5% undecided; key persuasion target |
| White college-ed men | 45 | 50 | 5% undecided; contested |
| White non-college | 37 | 58 | 5% undecided; tough terrain for Garza |
| Young voters 18-29 | 58 | 33 | 9% undecided; high upside if turnout |
| Suburban voters (all) | 51 | 44 | 5% undecided |
| Rural voters (all) | 34 | 61 | 5% undecided; Garza limited here |

Supplementary Dataset: oda_ads.csv

Useful for understanding the advertising landscape and informing the digital program design.

Key columns for this capstone:

| Column | Notes |
| --- | --- |
| sponsor | Who paid for the ad |
| party | D, R, third-party |
| platform | TV, digital, radio |
| state | State of airing/targeting |
| market | Media market |
| spend_usd | Estimated spend |
| impressions | Estimated impressions |
| issue_topic | Primary issue topic of the ad |
| tone | positive, negative, contrast |
| target_demo | Demographic target specification |

Advertising context for the Garza campaign's 60-day window:

The race's advertising environment as of 60 days out:

  • Garza campaign spending rate: approximately $280K/week on advertising across all platforms
  • Whitfield campaign spending rate: approximately $230K/week
  • Total outside spending (both sides combined): approximately $1.2M/week
  • Top issue topics in Garza ads: healthcare (38%), economic security (29%), immigration/AG record (18%), Whitfield contrast (15%)
  • Top issue topics in Whitfield ads: border security/immigration (42%), economy/jobs (31%), Garza contrast (22%), other (5%)

Section B: Voter Universe Technical Reference

Worked Example: Universe Construction

The following is a worked example of universe construction using the oda_voters.csv dataset. Use it as a reference rather than copying it: your implementation should include your own threshold justifications and priority tier criteria.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

# Load voter file
voters = pd.read_csv('oda_voters.csv')

print("=== VOTER FILE OVERVIEW ===")
print(f"Total registered voters: {len(voters):,}")

# --- STEP 1: CONSTRUCT TURNOUT PROPENSITY ---
# Reasoning: 2022 gets the largest weight (most recent, same cycle type: midterm)
# 2018 is weighted next (older, but also a midterm)
# 2020 gets the smallest weight (presidential: higher baseline turnout,
# so participation says less about midterm behavior)
# Age adjustment: older voters have higher base turnout; the bonus is floored at 0

def build_turnout_propensity(row):
    base = (row['vote_history_2022'] * 55 +
            row['vote_history_2020'] * 20 +
            row['vote_history_2018'] * 25)
    # Age adjustment: +0.25 per year over 40, capped at +10 (cap reached at age 80)
    age_bonus = min(10, max(0, (row['age'] - 40) * 0.25))
    return min(100, base + age_bonus)

voters['turnout_propensity'] = voters.apply(build_turnout_propensity, axis=1)

# Validate distribution looks reasonable
print("\nTurnout propensity distribution:")
print(voters['turnout_propensity'].describe())
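bins = [0, 10, 25, 50, 75, 90, 100]
print("\nBy bucket:")
# include_lowest=True keeps voters with a propensity of exactly 0 in the first bucket
print(pd.cut(voters['turnout_propensity'], bins, include_lowest=True).value_counts().sort_index())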

# --- STEP 2: PERSUASION UNIVERSE ---
persuasion = voters[
    (voters['support_score'] >= 40) &
    (voters['support_score'] <= 60) &
    (voters['persuadability_score'] >= 50) &
    (voters['turnout_propensity'] >= 30)
].copy()

# Priority tiers based on combination of persuadability and turnout
def persuasion_tier(row):
    if (row['persuadability_score'] >= 70 and
            row['turnout_propensity'] >= 60):
        return 'Tier 1'
    elif (row['persuadability_score'] >= 58 or
              row['turnout_propensity'] >= 50):
        return 'Tier 2'
    else:
        return 'Tier 3'

persuasion['tier'] = persuasion.apply(persuasion_tier, axis=1)

print(f"\n=== PERSUASION UNIVERSE ===")
print(f"Total: {len(persuasion):,} ({len(persuasion)/len(voters)*100:.1f}% of file)")
print(persuasion['tier'].value_counts())

# --- STEP 3: GOTV UNIVERSE ---
gotv = voters[
    (voters['support_score'] >= 65) &
    (voters['turnout_propensity'] >= 30) &
    (voters['turnout_propensity'] <= 78)
].copy()

def gotv_priority(row):
    # Each row value is a scalar, so use a chained comparison rather than Series.between
    if (row['support_score'] >= 78 and
            45 <= row['turnout_propensity'] <= 72):
        return 'High'
    else:
        return 'Standard'

gotv['priority'] = gotv.apply(gotv_priority, axis=1)

print(f"\n=== GOTV UNIVERSE ===")
print(f"Total: {len(gotv):,} ({len(gotv)/len(voters)*100:.1f}% of file)")
print(gotv['priority'].value_counts())

# Demographic breakdown of GOTV universe
print("\nGOTV by race/ethnicity:")
print(gotv['race_ethnicity'].value_counts())
print("\nGOTV by urban/rural:")
print(gotv['urban_rural'].value_counts())

# --- STEP 4: COUNTY SUMMARY ---
# Collect one summary row per county, then build the DataFrame once
summary_rows = []

for county in voters['county'].unique():
    county_voters = voters[voters['county'] == county]
    county_persu = persuasion[persuasion['county'] == county]
    county_gotv = gotv[gotv['county'] == county]

    summary_rows.append({
        'county': county,
        'total_voters': len(county_voters),
        'persuasion_total': len(county_persu),
        'persuasion_tier1': (county_persu['tier'] == 'Tier 1').sum(),
        'gotv_total': len(county_gotv),
        'gotv_high': (county_gotv['priority'] == 'High').sum(),
        'mean_support': round(county_voters['support_score'].mean(), 1),
        'mean_turnout_prop': round(county_voters['turnout_propensity'].mean(), 1)
    })

county_summary = pd.DataFrame(summary_rows)

county_summary = county_summary.sort_values('total_voters', ascending=False)
print("\n=== COUNTY UNIVERSE SUMMARY ===")
print(county_summary.to_string(index=False))

# --- STEP 5: VISUALIZATIONS ---
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Viz 1: Support score distribution by county type
urban_rural_groups = voters.groupby('urban_rural')['support_score']
for label, group in urban_rural_groups:
    axes[0].hist(group, bins=30, alpha=0.6, label=label, density=True)
axes[0].axvline(x=40, color='red', linestyle='--', alpha=0.7, label='Persuasion bounds')
axes[0].axvline(x=60, color='red', linestyle='--', alpha=0.7)
axes[0].set_xlabel('Support Score (0=Whitfield, 100=Garza)')
axes[0].set_ylabel('Density')
axes[0].set_title('Support Score Distribution by Geography')
axes[0].legend()

# Viz 2: Universe size by county
plot_counties = county_summary.head(8)
x = range(len(plot_counties))
width = 0.35
axes[1].bar([i - width/2 for i in x],
            plot_counties['persuasion_total'],
            width, label='Persuasion Universe', color='steelblue', alpha=0.8)
axes[1].bar([i + width/2 for i in x],
            plot_counties['gotv_total'],
            width, label='GOTV Universe', color='darkorange', alpha=0.8)
axes[1].set_xticks(x)
axes[1].set_xticklabels(plot_counties['county'], rotation=45, ha='right')
axes[1].set_ylabel('Voters in Universe')
axes[1].set_title('Persuasion and GOTV Universe Size by County')
axes[1].legend()

plt.tight_layout()
plt.savefig('universe_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
print("\nVisualization saved to universe_analysis.png")
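
Row-wise `apply` is workable on a 50,000-row teaching file but becomes slow on files with millions of records. The sketch below shows a column-wise alternative that is intended to give the same result as `build_turnout_propensity`, using the same weights and age bonus. The function name `turnout_propensity_vectorized` and the three-row `demo` frame are hypothetical.

```python
import pandas as pd

def turnout_propensity_vectorized(df):
    """Same weights and age bonus as build_turnout_propensity, computed column-wise."""
    base = (df['vote_history_2022'] * 55 +
            df['vote_history_2020'] * 20 +
            df['vote_history_2018'] * 25)
    # +0.25 per year over 40, floored at 0 and capped at +10
    age_bonus = ((df['age'] - 40) * 0.25).clip(lower=0, upper=10)
    # Cap the final score at 100
    return (base + age_bonus).clip(upper=100)

# Hypothetical three-voter frame to check the arithmetic by hand
demo = pd.DataFrame({
    'age': [22, 60, 80],
    'vote_history_2018': [0, 0, 1],
    'vote_history_2020': [1, 0, 1],
    'vote_history_2022': [0, 1, 1],
})
demo['turnout_propensity'] = turnout_propensity_vectorized(demo)
print(demo)
```

The 22-year-old who voted only in 2020 scores 20, the 60-year-old who voted only in 2022 scores 55 + 5 = 60, and the 80-year-old with a full vote history is capped at 100.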

Section C: Contact Program Standards and Benchmarks

Canvassing Program Benchmarks

| Metric | Low | Typical | High | Notes |
| --- | --- | --- | --- | --- |
| Doors per hour | 6-8 | 8-10 | 10-14 | Urban more efficient than rural |
| Contact rate (someone answers) | 18% | 25-30% | 38% | Lower in rural, higher in dense urban |
| Conversations per contact hour | 1.5-2.0 | 2.5-3.5 | 4.0+ | |
| Turnout lift per contact | 2-4 pp | 4-6 pp | 6-8 pp | GOTV contacts; phone lower end |
| Persuasion lift per contact | 0.5-1 pp | 1-2 pp | 2-3 pp | Persuasion contacts |
| Effective volunteer shift (3 hr) | 6-8 contacts | 8-12 contacts | 12-18 contacts | After accounting for travel, setup |

Important: "Contacts" means a real conversation with a voter, not doors knocked. A volunteer who knocks 30 doors in a three-hour shift and speaks with 9 people has made 9 contacts, not 30.
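
The distinction between doors and contacts drives volunteer planning. As a rough sketch, the arithmetic in the example above can be turned into a shift estimate; the function names and the 5,000-contact goal are hypothetical, and the default rates are simply values from the typical range of the benchmarks.

```python
import math

def contacts_per_shift(doors_per_hour, shift_hours, contact_rate):
    """Expected conversations in one shift: doors knocked times the share that answer."""
    return doors_per_hour * shift_hours * contact_rate

def shifts_needed(target_contacts, doors_per_hour=10, shift_hours=3, contact_rate=0.30):
    """Volunteer shifts required to reach a contact goal, rounded up."""
    per_shift = contacts_per_shift(doors_per_hour, shift_hours, contact_rate)
    return math.ceil(target_contacts / per_shift)

# 30 doors in a three-hour shift at a 30% contact rate
example_contacts = contacts_per_shift(10, 3, 0.30)

# Hypothetical goal of 5,000 completed contacts
example_shifts = shifts_needed(5000)

print(f"Contacts per shift: {example_contacts:.0f}")
print(f"Shifts needed for 5,000 contacts: {example_shifts}")
```

At 9 contacts per shift, a 5,000-contact goal needs 556 shifts, which is why recruitment targets should be set from contacts rather than doors.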

Volunteer retention patterns: Campaigns typically lose 30-40% of recruited volunteers after their first shift. Retaining volunteers requires recognition systems, clear communication about impact, training that sets accurate expectations, and a social community of fellow volunteers.

Phone Banking Benchmarks

| Metric | Typical Range | Notes |
| --- | --- | --- |
| Calls per hour | 8-15 | Varies by list quality and script length |
| Contact rate (live answer) | 8-15% | Cold lists; 20-30% for warm (prior contacts) |
| Conversation completion rate | 40-55% | Of contacts who pick up |
| Volunteer hours per 1,000 calls dialed | 80-120 hours | Using volunteer phone bank; counts call attempts, not live contacts |
| Turnout lift per phone contact | 1-3 pp | Live calls; IVR/robocall is 0-0.5 pp |
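
The phone benchmarks chain together as a funnel: call attempts, then live answers, then completed conversations. The sketch below illustrates that chain; `phone_funnel` is a hypothetical helper, and the 10% and 50% rates are illustrative picks from within the ranges given, not recommended values.

```python
def phone_funnel(calls, contact_rate, completion_rate):
    """Return (live answers, completed conversations) for a number of call attempts."""
    live_answers = calls * contact_rate
    conversations = live_answers * completion_rate
    return live_answers, conversations

# Illustrative rates: 10% live-answer rate on a cold list, 50% conversation completion
live, completed = phone_funnel(calls=1000, contact_rate=0.10, completion_rate=0.50)
print(f"{live:.0f} live answers, {completed:.0f} completed conversations per 1,000 calls")
```

Under these assumptions only about one call attempt in twenty becomes a completed conversation, so phone goals stated in conversations imply far more dialing than the headline number suggests.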

Mail Program Benchmarks

| Metric | Typical | Notes |
| --- | --- | --- |
| Delivery rate | 92-96% | 4-8% returned undeliverable |
| Piece read rate | 30-60% | Varies significantly by design and segment |
| Response / action rate | 1-5% | For pieces with a specific call to action |
| Cost per piece (all-in) | $0.55-0.85 | Printing, postage, list processing |
| Persuasion effect per piece | 0.3-0.8 pp | From field experiment literature |
| Turnout effect per GOTV piece | 0.3-1.0 pp | Modestly effective; best as supplement |

Mail timing notes: USPS delivery times for bulk mail are 5-10 days. Political mail sent at non-bulk (first class) rates arrives in 1-3 days. All mail drop dates in the contact program plan should account for delivery time — a piece intended to arrive at 15 days out should be dropped at 22-25 days out if sent bulk.
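
Drop dates can be worked backward from the target arrival date. The following sketch assumes you plan for the slowest delivery time in the range; `latest_drop_date` is a hypothetical helper and the election date is a placeholder used only to make the arithmetic concrete.

```python
from datetime import date, timedelta

def latest_drop_date(election_day, arrive_days_out, max_delivery_days):
    """Latest date to drop mail so it arrives on time even at the slowest delivery."""
    target_arrival = election_day - timedelta(days=arrive_days_out)
    return target_arrival - timedelta(days=max_delivery_days)

# Placeholder election date (an assumption, not taken from the capstone)
election_day = date(2026, 11, 3)

# Bulk mail: up to 10 days in transit; first class: up to 3 days
bulk_drop = latest_drop_date(election_day, arrive_days_out=15, max_delivery_days=10)
first_class_drop = latest_drop_date(election_day, arrive_days_out=15, max_delivery_days=3)

print(f"Bulk drop by {bulk_drop} ({(election_day - bulk_drop).days} days out)")
print(f"First-class drop by {first_class_drop} ({(election_day - first_class_drop).days} days out)")
```

Planning to the slow end of the bulk window puts the drop at 25 days out, the conservative end of the 22-25 day range given above.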

Digital Advertising Benchmarks

| Metric | Typical Range | Notes |
| --- | --- | --- |
| CPM (cost per 1K impressions) | $5-25 | Varies by platform and targeting precision |
| Click-through rate (display) | 0.05-0.2% | Very low for display; higher for video |
| Video completion rate | 35-60% | Platform and length dependent |
| Persuasion effect per exposure | 0.3-0.8 pp | From field experiment literature |
| GOTV effect per exposure | 0.3-1.5 pp | Varies widely by creative quality |
| Voter file match rate | 55-75% | Share of file successfully matched to platform |

Text/SMS Benchmarks

| Metric | Typical | Notes |
| --- | --- | --- |
| Delivery rate (opt-in list) | 95-99% | Much higher than purchased lists |
| Open rate | 85-95% | SMS vastly outperforms email |
| Response rate (two-way SMS) | 15-25% | For conversational texts |
| Opt-out rate per send | 1-3% | Higher if contact frequency is too high |
| GOTV effect (opt-in list) | 2-5 pp | Higher effectiveness than cold contact |

Section D: Polling Design Reference

Meridian Research Group Survey Design Standards

The following summarizes Meridian's standard survey design practices, consistent with what Dr. Vivian Park and Carlos Mendez apply to the Garza campaign's internal surveys.

Likely voter screen: Meridian uses a seven-question screen based on the Gallup model, plus a state-specific component accounting for the state's early voting infrastructure. The screen is calibrated to the state's historical midterm turnout (approximately 48% of registered voters).

Sampling methodology: Meridian uses a multi-mode approach: 50% address-based sampling with online response, 50% cell phone (live interviewer). This approach oversamples demographic groups with lower online panel participation (older voters, lower-income households, rural voters) and weights back to population benchmarks.

Standard survey battery for Senate race:

  1. Horse race question: "If the election for U.S. Senate were held today, for whom would you vote: Maria Garza, the Democrat, or Tom Whitfield, the Republican?" (Rotate name order; include "someone else," "don't know/no preference" options)

  2. Candidate favorability: Four-point scale (very favorable, somewhat favorable, somewhat unfavorable, very unfavorable) for each candidate; "don't know/no opinion" as fifth option

  3. Issue priority: "What is the single most important issue in deciding your vote for U.S. Senate?" (Open-ended, coded to issue categories)

  4. Issue battery: Importance ratings for five to eight issues using four-point scale

  5. Candidate attribute ratings: "Thinking about [Candidate], please tell me how well each of the following describes her/him..." — attributes include: shares my values, has the experience needed, trustworthy, understands people like me, would fight for the middle class, effective in government

  6. Message test (when included): Split-sample — half receive Message A, half receive Message B; remeasure horse race after exposure

  7. Demographics: Age, gender, race/ethnicity, education, income, party identification, religious attendance, zip code (for urban/rural classification)
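
For the split-sample message test in item 6, respondents need to be assigned to arms at random so that the two halves are comparable. A minimal sketch of one way to do this follows; `assign_message_arms`, the respondent IDs, and the seed are all hypothetical, and a real survey platform would normally handle assignment itself.

```python
import numpy as np

def assign_message_arms(respondent_ids, seed=0):
    """Randomly split respondents into two near-equal arms, 'A' and 'B'."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(respondent_ids)
    half = len(shuffled) // 2
    return {
        'A': sorted(shuffled[:half].tolist()),
        'B': sorted(shuffled[half:].tolist()),
    }

# Hypothetical sample of 800 respondents with IDs 1 through 800
arms = assign_message_arms(list(range(1, 801)), seed=42)
print(len(arms['A']), len(arms['B']))
```

Fixing the seed makes the assignment reproducible, which helps when the split has to be documented or audited later.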

Sample size guidance:

| Survey Type | Minimum N | Recommended N | MOE at 95% CI |
| --- | --- | --- | --- |
| Full benchmark | 600 | 800-1,000 | ±3.5-4.1% |
| Tracking | 400 | 500-600 | ±4.2-4.8% |
| Message test (split sample) | 300 per arm | 400-500 per arm | ±4.8-5.3% |
| County-level (substate) | 300 | 400 | ±5.2% |
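
Margins of error like these follow from sample size. The sketch below uses the standard formula for a proportion under simple random sampling with p = 0.5 and no design effect. That is an assumption: weighted mixed-mode surveys usually have a somewhat larger effective margin, so the guidance figures need not match this calculation exactly.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a 95% CI for a proportion, in percentage points (simple random sample)."""
    return z * math.sqrt(p * (1 - p) / n) * 100

for n in (300, 400, 600, 800, 1000):
    print(f"N={n:>5}: ±{margin_of_error(n):.1f} points")

# Meridian's internal poll had N=802
meridian_moe = margin_of_error(802)
print(f"N=  802: ±{meridian_moe:.1f} points")
```

For N=802 the formula gives about ±3.5 points, in line with the margin reported for the Meridian internal poll. Note that quadrupling the sample only halves the margin, which is why subgroup estimates from a statewide poll are much noisier than the topline.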

Note on likely voter screen variability: As Election Day approaches, the likely voter pool becomes more stable and the screen more accurate. Polls conducted 60+ days out have higher uncertainty in LV composition than polls at 14 days out. Account for this in how you communicate uncertainty from early-cycle polls.


Section E: Budget Reference — Line Item Detail

Standard Campaign Analytics Budget Components (Competitive Senate Race)

Use the following as a benchmark. Your budget should be calibrated to your specific plan — the program choices you made in Deliverable 2 should drive the budget, not the other way around.

Direct mail: Estimating cost

Cost per piece (all-in): $0.60-0.80 for standard political mail (design, print, postage, list processing)

  • Total mail cost = (number of pieces) × (cost per piece)
  • Standard persuasion sequence: 4 pieces to persuasion universe
  • Standard GOTV sequence: 2 pieces to GOTV universe
  • Spanish-language premium: add ~15% for translation and separate printing run

Example: 175,000 persuasion voters × 4 pieces + 225,000 GOTV voters × 2 pieces = 1,150,000 pieces total × $0.70 = $805,000

(Students' budgets will differ based on their universe sizes from Deliverable 1)
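
To adapt the example to your own universe sizes, the calculation can be wrapped in a small function. This is a sketch only; `mail_cost` is a hypothetical helper whose defaults mirror the standard sequences and the $0.70 per-piece figure used in the example.

```python
def mail_cost(persuasion_voters, gotv_voters, cost_per_piece=0.70,
              persuasion_pieces=4, gotv_pieces=2):
    """Total pieces and total cost for the standard persuasion and GOTV sequences."""
    total_pieces = persuasion_voters * persuasion_pieces + gotv_voters * gotv_pieces
    return total_pieces, total_pieces * cost_per_piece

# Universe sizes from the example: 175,000 persuasion and 225,000 GOTV voters
pieces, cost = mail_cost(175_000, 225_000)
print(f"{pieces:,} pieces, ${cost:,.0f}")
```

This reproduces the 1,150,000 pieces and $805,000 in the example. The Spanish-language premium is not included and would need to be added for the share of pieces printed in Spanish.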

Digital advertising: Estimating cost

Voter file match of universe to platforms: $15,000-25,000 per platform (one-time or per-cycle fee)

Advertising placement: CPM $8-18 depending on targeting precision

  • Total digital cost = (desired impressions) × (CPM / 1000)
  • A 60-day persuasion digital program might target 50,000 persuasion voters at 20 impressions each = 1,000,000 impressions × ($12 CPM / 1000) = $12,000 plus platform fees
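
The same CPM arithmetic can be sketched in code so that the platform fee is not forgotten. `digital_cost` is a hypothetical helper, and the single platform at the low end of the match-fee range is an assumption made for illustration.

```python
def digital_cost(voters, impressions_per_voter, cpm, platform_fee=0.0):
    """Total impressions and total cost: CPM-based placement plus any match fee."""
    impressions = voters * impressions_per_voter
    placement = impressions * cpm / 1000
    return impressions, placement + platform_fee

# 50,000 persuasion voters at 20 impressions each, $12 CPM
impressions, placement_only = digital_cost(50_000, 20, cpm=12)

# Assumed: one platform, match fee at the low end of the $15,000-25,000 range
_, with_fee = digital_cost(50_000, 20, cpm=12, platform_fee=15_000)

print(f"{impressions:,} impressions; placement ${placement_only:,.0f}; with fee ${with_fee:,.0f}")
```

In this illustration the match fee exceeds the placement cost, which is typical for small universes and worth flagging in the budget narrative.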

Canvassing: Estimating cost

  • Volunteer canvassing direct costs: $15-25 per completed contact (staff coordination, materials, training time)
  • If using paid canvassers: $40-60 per completed contact
  • Typically 80-90% of canvassing is done by volunteers; paid canvassers fill gaps

Polling: Market rates

| Survey type | Typical cost (Meridian-equivalent quality) |
| --- | --- |
| Statewide benchmark (N=800, mixed mode) | $38,000-55,000 |
| Statewide tracking (N=500, mixed mode) | $22,000-30,000 |
| County-level (N=400, phone) | $18,000-28,000 |
| Message test (split sample, N=800) | $30,000-45,000 |

VAN platform costs

  • State Democratic Party VAN access: $8,000-15,000 for the campaign cycle (varies by state and negotiation)
  • Additional modules (data integration, texting): $5,000-12,000
  • Field director and data staff time for VAN administration: significant; typically 0.5 FTE during peak campaign period

Section F: Ethics Reference — Extended Frameworks

The Voter Privacy Spectrum

Political campaigns operate in a complex legal and ethical landscape around voter data. The following framework, drawn from Chapter 38 and Chapter 39 discussions, summarizes the key distinctions.

Tier 1: Public record data (legally and broadly ethically uncontroversial)

  • Voter registration records (name, address, party registration, vote history), which are public records in most states
  • Candidate financial disclosure records
  • Campaign finance records (FEC, state filings)
  • Official electoral results

Tier 2: Commercial append data (legal, ethically contested)

  • Consumer behavior segments (lifestyle, purchasing patterns)
  • Estimated income and homeownership
  • Estimated age and gender
  • Estimated education level
  • Consumer-based political scores

Tier 3: Sensitive commercial data (legal but ethically problematic per this capstone's framework)

  • Inferred health conditions from consumer purchase data
  • Inferred financial distress indicators
  • Inferred religious practice
  • Inferred relationship status from social media behavioral signals
  • Lookalike audience modeling matched to social media behavioral profiles

The Garza campaign's analytics plan explicitly prohibits Tier 3 uses. Your ethics review should document this constraint and apply the reasoning from Section 3 of the main capstone text.

VAN Data Ethics Standards

VAN (Voter Activation Network) maintains data use policies that all campaigns accessing the system must agree to. Key constraints:

  • Voter file data may not be used for commercial purposes
  • Voter file data may not be sold or transferred to non-political third parties
  • All canvass data and voter contact records remain the property of the party, not the campaign — they persist for future cycles
  • Campaigns accessing the party's master file have an obligation to return contact data to the party infrastructure

These contractual obligations are separate from — and in addition to — the ethical constraints the campaign's analytics plan adopts.

Equity Framework for Targeting Decisions

The following checklist questions are drawn from ODA's equity framework and applied to the campaign context:

1. Does the targeting plan systematically invest less in communities of color? Review: What share of Tier 1 GOTV targets are voters of color vs. white voters? What share of canvassing resources go to majority-minority precincts vs. majority-white precincts? Is the disparity, if any, explained by cost-efficiency factors or by strategic deprioritization?

2. Does the language access plan cover the full Spanish-speaking universe? Review: What percentage of Hispanic voters in the GOTV universe are receiving Spanish-language outreach? What's the gap? What's preventing full coverage?

3. Does the young voter program address structural barriers? Review: Does the program provide actionable registration and polling information, or only motivational messaging? Does it reach students at community colleges and vocational schools, not just four-year universities?

4. Does the GOTV program reach voters with disabilities? Review: Are canvassing scripts and materials accessible? Is there a protocol for voters who cannot answer the door due to mobility limitations?

5. Are historically underserved communities' political concerns reflected in the campaign's issue messaging? Review: Does the campaign's message matrix include messages that speak to the specific concerns of Black, Latino, and working-class communities, or are messages primarily calibrated to suburban moderates?
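
The first checklist question can be given a quantitative starting point by comparing each group's share of a targeted universe with its share of the full file. The sketch below is one way to do that, not a prescribed method; `representation_gap` and the ten-voter `toy_voters` frame are hypothetical, and only the `race_ethnicity` column name comes from the dataset.

```python
import pandas as pd

def representation_gap(all_voters, targeted, group_col='race_ethnicity'):
    """Compare each group's share of the targeted universe with its share of the file."""
    file_share = all_voters[group_col].value_counts(normalize=True)
    target_share = targeted[group_col].value_counts(normalize=True)
    out = pd.DataFrame({'file_share': file_share, 'target_share': target_share}).fillna(0.0)
    # Positive gap: over-represented among targets; negative: under-represented
    out['gap_pp'] = (out['target_share'] - out['file_share']) * 100
    return out.sort_values('gap_pp')

# Hypothetical ten-voter file; 'high_priority' stands in for a high-priority GOTV flag
toy_voters = pd.DataFrame({
    'race_ethnicity': ['White'] * 5 + ['Black'] * 2 + ['Hispanic'] * 3,
    'high_priority': [True, True, True, False, False, True, False, False, False, True],
})
toy_targets = toy_voters[toy_voters['high_priority']]
gaps = representation_gap(toy_voters, toy_targets)
print(gaps.round(1))
```

In this toy data Hispanic voters are 30% of the file but 20% of targets, a gap of 10 points. A gap by itself does not say whether the cause is cost-efficiency or strategic deprioritization; that explanation is what the ethics review should supply.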