Capstone 3 Data Appendix: The Campaign Analytics Plan

Section A: Dataset Reference Guide

Primary Dataset: `oda_voters.csv`

The ODA voter file is a teaching-scale version of the data infrastructure used in real campaign analytics. With approximately 50,000 rows, it provides the foundation for all universe construction work in this capstone. Real Senate race voter files have millions of records; the relative compositions and analytical relationships in this teaching dataset accurately reflect real voter file characteristics.

Complete column reference:

Column	Type	Range / Values	Notes
`voter_id`	string	V-XXXXX	Unique identifier
`state`	string	Single-state	All records from the race's state
`county`	string	12 county names	See county reference table below
`age`	integer	18-85	Current age at election date
`gender`	categorical	M, F, Other	—
`race_ethnicity`	categorical	White, Black, Hispanic, Asian, Other	Self-reported where available, estimated otherwise
`education`	categorical	Less than HS, HS Diploma, Some College, College Degree, Graduate Degree	—
`income_bracket`	categorical	Under 30K, 30-60K, 60-100K, Over 100K	Estimated from public records and commercial data
`party_reg`	categorical	Democrat, Republican, Independent, Other	Official registration status
`vote_history_2018`	binary	0, 1	1 = participated in 2018 midterm election
`vote_history_2020`	binary	0, 1	1 = participated in 2020 presidential election
`vote_history_2022`	binary	0, 1	1 = participated in 2022 midterm election
`urban_rural`	categorical	Urban, Suburban, Rural	—
`support_score`	float	0-100	Garza support probability × 100
`persuadability_score`	float	0-100	Higher = more movable by campaign contact

County reference table:

County Name	Type	Approx. Share of State Voters	Strategic Designation
Metro Central	Urban	22%	Garza base; GOTV priority
Metro South	Urban	14%	Garza base; GOTV priority
Lakeview	Suburban	12%	Swing; persuasion and GOTV
Riverside	Suburban	9%	Swing; persuasion priority
Valley North	Suburban	7%	Lean Whitfield; limited investment
Highfield	Rural	8%	Whitfield stronghold; GOTV for Garza voters only
Carson	Rural	7%	Whitfield stronghold; GOTV for Garza voters only
Eastport	Rural	5%	Whitfield stronghold
Garfield	Rural	4%	Split rural; some Garza opportunity
Millbrook	Suburban/rural mix	5%	Lean Whitfield; some persuasion opportunity
Crestview	Suburban	4%	Lean Garza; GOTV priority
Other/small	Mixed	3%	Distributed; low investment

Expected support score distribution by county:

County	Mean Support Score	Std Dev	Notes
Metro Central	66.2	18.4	Garza-favorable; large Latino share
Metro South	63.8	19.1	Garza-favorable; large Black share
Lakeview	51.3	22.7	Genuinely swing; college-educated suburbs
Riverside	49.8	23.1	Most competitive county in the state
Valley North	44.2	20.5	Lean Whitfield; some college-ed women opportunity
Highfield	38.5	16.2	Whitfield stronghold
Carson	37.1	15.8	Whitfield stronghold
Garfield	46.3	21.9	Split rural; Garza has a small universe here

Secondary Dataset: `oda_polls.csv`

Used for polling plan design and public polling analysis.

Key columns for this capstone:

Column	Notes
`date`	Date poll was fielded (start date)
`pollster`	Organization conducting the poll
`methodology`	phone, online, mixed, IVR
`pct_d`, `pct_r`	Candidate percentages
`sample_size`	Total N
`margin_error`	Reported margin of error
`population`	LV (likely voters), RV (registered voters), A (adults)
`race_type`	senate_general, senate_primary, governor_general, etc.

Public polling summary for the Garza-Whitfield race (60-day window):

The following represents the polling landscape as of the campaign's 60-day-out point:

Date	Pollster	Method	Garza %	Whitfield %	MOE	Pop
-62 days	University Poll	Phone	45	43	4.2	RV
-58 days	SurveyUSA	Online	47	44	3.8	LV
-54 days	FOX State	Phone/Online	44	45	3.5	LV
-51 days	BluePath (D)	Phone	49	42	4.1	LV
-48 days	Meridian Research	Mixed	46	44	3.6	LV
-44 days	Impact Research (D)	Online	47	43	3.9	LV
-42 days	Cygnal (R)	Online	44	46	4.0	LV
-38 days	Emerson	Online/IVR	46	45	3.7	LV

Weighted public average (as of 60 days out): Garza +2.1 points
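
The weighting scheme behind the +2.1 figure is not specified in this appendix. As a rough cross-check, the sketch below computes the simple unweighted mean of the Garza margins from the table above. The variable names are illustrative, and the unweighted mean is a stand-in for the weighted average, not the method used to produce it.

```python
# Garza % and Whitfield % for the eight public polls listed above,
# in table order (-62 days through -38 days).
polls = [
    ("University Poll", 45, 43),
    ("SurveyUSA", 47, 44),
    ("FOX State", 44, 45),
    ("BluePath (D)", 49, 42),
    ("Meridian Research", 46, 44),
    ("Impact Research (D)", 47, 43),
    ("Cygnal (R)", 44, 46),
    ("Emerson", 46, 45),
]

# Positive margin = Garza lead, negative = Whitfield lead.
margins = [garza - whitfield for _, garza, whitfield in polls]
unweighted_avg = sum(margins) / len(margins)
print(f"Unweighted average margin: Garza {unweighted_avg:+.1f}")
```

The unweighted mean is Garza +2.0, close to the weighted +2.1. A weighted version would multiply each margin by a weight (for example, one based on sample size or recency) before averaging.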

Meridian internal poll (most recent, 3 weeks old):
- Sample: N=802 likely voters
- Method: Mixed mode (50% phone, 50% online panel)
- Garza: 46%, Whitfield: 44%, Undecided/Other: 10%
- MOE: ±3.5%

Key Meridian subgroup findings (for targeting design):

Segment	Garza %	Whitfield %	Notes
Latino voters	68	24	8% undecided — room to grow
Black voters	87	9	4% undecided — low persuasion room
White college-ed women	55	40	5% undecided — key persuasion target
White college-ed men	45	50	5% undecided — contested
White non-college	37	58	5% undecided — tough terrain for Garza
Young voters 18-29	58	33	9% undecided — high upside if turnout
Suburban voters (all)	51	44	5% undecided
Rural voters (all)	34	61	5% undecided — Garza limited here

Supplementary Dataset: `oda_ads.csv`

Useful for understanding the advertising landscape and informing the digital program design.

Key columns for this capstone:

Column	Notes
`sponsor`	Who paid for the ad
`party`	D, R, third-party
`platform`	TV, digital, radio
`state`	State of airing/targeting
`market`	Media market
`spend_usd`	Estimated spend
`impressions`	Estimated impressions
`issue_topic`	Primary issue topic of the ad
`tone`	positive, negative, contrast
`target_demo`	Demographic target specification

Advertising context for the Garza campaign's 60-day window:

The race's advertising environment as of 60 days out:

Garza campaign spending rate: approximately $280K/week on advertising across all platforms
Whitfield campaign spending rate: approximately $230K/week
Total outside spending (both sides combined): approximately $1.2M/week
Top issue topics in Garza ads: healthcare (38%), economic security (29%), immigration/AG record (18%), Whitfield contrast (15%)
Top issue topics in Whitfield ads: border security/immigration (42%), economy/jobs (31%), Garza contrast (22%), other (5%)

Section B: Voter Universe Technical Reference

Worked Example: Universe Construction

The following is a worked example of universe construction using the `oda_voters.csv` dataset. Use it as a reference rather than copying it: your implementation should include your own threshold justifications and priority tier criteria.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

# Load voter file
voters = pd.read_csv('oda_voters.csv')

print("=== VOTER FILE OVERVIEW ===")
print(f"Total registered voters: {len(voters):,}")

# --- STEP 1: CONSTRUCT TURNOUT PROPENSITY ---
# Reasoning: 2022 gets the most weight (most recent, same cycle type: midterm)
# 2018 gets moderate weight (older, but same cycle type)
# 2020 gets the least weight (presidential, so higher baseline turnout)
# Age adjustment: older voters have higher base turnout, floor at 0

def build_turnout_propensity(row):
    base = (row['vote_history_2022'] * 55 +
            row['vote_history_2020'] * 20 +
            row['vote_history_2018'] * 25)
    # Age adjustment: +0.25 per year over 40, capped at +10 (reached at age 80)
    age_bonus = min(10, max(0, (row['age'] - 40) * 0.25))
    return min(100, base + age_bonus)

voters['turnout_propensity'] = voters.apply(build_turnout_propensity, axis=1)

# Validate distribution looks reasonable
print("\nTurnout propensity distribution:")
print(voters['turnout_propensity'].describe())
bins = [0, 10, 25, 50, 75, 90, 100]
print("\nBy bucket:")
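# include_lowest=True keeps voters with a propensity of exactly 0 in the first bucket
print(pd.cut(voters['turnout_propensity'], bins, include_lowest=True).value_counts().sort_index())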

# --- STEP 2: PERSUASION UNIVERSE ---
persuasion = voters[
    (voters['support_score'] >= 40) &
    (voters['support_score'] <= 60) &
    (voters['persuadability_score'] >= 50) &
    (voters['turnout_propensity'] >= 30)
].copy()

# Priority tiers based on combination of persuadability and turnout
def persuasion_tier(row):
    if (row['persuadability_score'] >= 70 and
            row['turnout_propensity'] >= 60):
        return 'Tier 1'
    elif (row['persuadability_score'] >= 58 or
              row['turnout_propensity'] >= 50):
        return 'Tier 2'
    else:
        return 'Tier 3'

persuasion['tier'] = persuasion.apply(persuasion_tier, axis=1)

print(f"\n=== PERSUASION UNIVERSE ===")
print(f"Total: {len(persuasion):,} ({len(persuasion)/len(voters)*100:.1f}% of file)")
print(persuasion['tier'].value_counts())

# --- STEP 3: GOTV UNIVERSE ---
gotv = voters[
    (voters['support_score'] >= 65) &
    (voters['turnout_propensity'] >= 30) &
    (voters['turnout_propensity'] <= 78)
].copy()

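def gotv_priority(row):
    # Inside apply(axis=1) each value is a scalar, so use a chained
    # comparison rather than the Series-only .between() method
    if (row['support_score'] >= 78 and
            45 <= row['turnout_propensity'] <= 72):
        return 'High'
    else:
        return 'Standard'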

gotv['priority'] = gotv.apply(gotv_priority, axis=1)

print(f"\n=== GOTV UNIVERSE ===")
print(f"Total: {len(gotv):,} ({len(gotv)/len(voters)*100:.1f}% of file)")
print(gotv['priority'].value_counts())

# Demographic breakdown of GOTV universe
print("\nGOTV by race/ethnicity:")
print(gotv['race_ethnicity'].value_counts())
print("\nGOTV by urban/rural:")
print(gotv['urban_rural'].value_counts())

# --- STEP 4: COUNTY SUMMARY ---
county_rows = []

for county in voters['county'].unique():
    county_voters = voters[voters['county'] == county]
    county_persu = persuasion[persuasion['county'] == county]
    county_gotv = gotv[gotv['county'] == county]

    # Collect one dict per county and build the DataFrame once after the
    # loop, instead of calling pd.concat on every iteration
    county_rows.append({
        'county': county,
        'total_voters': len(county_voters),
        'persuasion_total': len(county_persu),
        'persuasion_tier1': (county_persu['tier'] == 'Tier 1').sum(),
        'gotv_total': len(county_gotv),
        'gotv_high': (county_gotv['priority'] == 'High').sum(),
        'mean_support': round(county_voters['support_score'].mean(), 1),
        'mean_turnout_prop': round(county_voters['turnout_propensity'].mean(), 1)
    })

county_summary = pd.DataFrame(county_rows)

county_summary = county_summary.sort_values('total_voters', ascending=False)
print("\n=== COUNTY UNIVERSE SUMMARY ===")
print(county_summary.to_string(index=False))

# --- STEP 5: VISUALIZATIONS ---
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Viz 1: Support score distribution by county type
urban_rural_groups = voters.groupby('urban_rural')['support_score']
for label, group in urban_rural_groups:
    axes[0].hist(group, bins=30, alpha=0.6, label=label, density=True)
axes[0].axvline(x=40, color='red', linestyle='--', alpha=0.7, label='Persuasion bounds')
axes[0].axvline(x=60, color='red', linestyle='--', alpha=0.7)
axes[0].set_xlabel('Support Score (0=Whitfield, 100=Garza)')
axes[0].set_ylabel('Density')
axes[0].set_title('Support Score Distribution by Geography')
axes[0].legend()

# Viz 2: Universe size by county
plot_counties = county_summary.head(8)
x = range(len(plot_counties))
width = 0.35
axes[1].bar([i - width/2 for i in x],
            plot_counties['persuasion_total'],
            width, label='Persuasion Universe', color='steelblue', alpha=0.8)
axes[1].bar([i + width/2 for i in x],
            plot_counties['gotv_total'],
            width, label='GOTV Universe', color='darkorange', alpha=0.8)
axes[1].set_xticks(x)
axes[1].set_xticklabels(plot_counties['county'], rotation=45, ha='right')
axes[1].set_ylabel('Voters in Universe')
axes[1].set_title('Persuasion and GOTV Universe Size by County')
axes[1].legend()

plt.tight_layout()
plt.savefig('universe_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
print("\nVisualization saved to universe_analysis.png")
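
To see how the Step 1 weights behave before running the full pipeline, the sketch below restates the same scoring rule as a plain function and evaluates three hypothetical voter profiles. The profiles are made up for illustration and are not rows from `oda_voters.csv`.

```python
def turnout_propensity(voted_2022, voted_2020, voted_2018, age):
    """Same weights and age bonus as build_turnout_propensity above."""
    base = voted_2022 * 55 + voted_2020 * 20 + voted_2018 * 25
    age_bonus = min(10, max(0, (age - 40) * 0.25))
    return min(100, base + age_bonus)

# Hypothetical profiles: (voted 2022, voted 2020, voted 2018, age)
profiles = {
    "consistent voter, age 70": (1, 1, 1, 70),
    "presidential-only voter, age 30": (0, 1, 0, 30),
    "midterm voter, age 52": (1, 0, 1, 52),
}

scores = {name: turnout_propensity(*p) for name, p in profiles.items()}
for name, score in scores.items():
    print(f"{name}: {score}")
```

Under these weights the 30-year-old presidential-only voter scores 20, which falls below the 30-point turnout floor used for both the persuasion and GOTV universes in the example. Consequences like this are the kind of thing your own threshold justification should address.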

Section C: Contact Program Standards and Benchmarks

Canvassing Program Benchmarks

Metric	Low	Typical	High	Notes
Doors per hour	6-8	8-10	10-14	Urban more efficient than rural
Contact rate (someone answers)	18%	25-30%	38%	Lower in rural, higher in dense urban
Conversations per contact hour	1.5-2.0	2.5-3.5	4.0+	—
Turnout lift per contact	2-4 pp	4-6 pp	6-8 pp	GOTV contacts; phone lower end
Persuasion lift per contact	0.5-1 pp	1-2 pp	2-3 pp	Persuasion contacts
Effective volunteer shift (3 hr)	6-8 contacts	8-12 contacts	12-18 contacts	After accounting for travel, setup

Important: "Contacts" means a real conversation with a voter, not doors knocked. A volunteer who knocks 30 doors in a three-hour shift and speaks with 9 people has made 9 contacts, not 30.
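
The distinction between doors and contacts matters most when converting a contact goal into volunteer shifts. The sketch below works through that arithmetic. The function names and the 20,000-contact goal are hypothetical; the 30% contact rate and the 8-12 contacts per shift come from the "Typical" column above.

```python
import math

def contacts_from_doors(doors_knocked, contact_rate):
    """Expected conversations from a number of door knocks."""
    return doors_knocked * contact_rate

def shifts_needed(target_contacts, contacts_per_shift):
    """Three-hour volunteer shifts needed to reach a contact goal."""
    return math.ceil(target_contacts / contacts_per_shift)

# The example in the text: 30 doors at a 30% contact rate.
example_contacts = contacts_from_doors(30, 0.30)
print(f"Contacts from 30 doors: {example_contacts:.0f}")

# Hypothetical goal of 20,000 contacts, at both ends of the typical
# productivity range (8-12 contacts per three-hour shift).
for per_shift in (8, 12):
    print(f"{per_shift} contacts/shift -> {shifts_needed(20_000, per_shift):,} shifts")
```

Planning on doors rather than contacts would understate the required shifts by roughly a factor of three at a 30% contact rate.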

Volunteer retention patterns: Campaigns typically lose 30-40% of recruited volunteers after their first shift. Improving retention requires recognition systems, clear communication about impact, training that sets accurate expectations, and a social community of fellow volunteers.

Phone Banking Benchmarks

Metric	Typical Range	Notes
Calls per hour	8-15	Varies by list quality and script length
Contact rate (live answer)	8-15%	Cold lists; 20-30% for warm (prior contacts)
Conversation completion rate	40-55%	Of contacts who pick up
Volunteer hours per 1,000 contacts	80-120 hours	Using volunteer phone bank
Turnout lift per phone contact	1-3 pp	Live calls; IVR/robocall is 0-0.5 pp

Mail Program Benchmarks

Metric	Typical	Notes
Delivery rate	92-96%	4-8% returned undeliverable
Piece read rate	30-60%	Varies significantly by design and segment
Response / action rate	1-5%	For pieces with a specific call to action
Cost per piece (all-in)	$0.55-0.85	Printing, postage, list processing
Persuasion effect per piece	0.3-0.8 pp	From field experiment literature
Turnout effect per GOTV piece	0.3-1.0 pp	Modestly effective; best as supplement

Mail timing notes: USPS delivery times for bulk mail are 5-10 days. Political mail sent at non-bulk (first class) rates arrives in 1-3 days. All mail drop dates in the contact program plan should account for delivery time — a piece intended to arrive at 15 days out should be dropped at 22-25 days out if sent bulk.
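
A minimal sketch of the drop-date arithmetic follows, assuming the 5-10 day bulk transit range stated above. The election date and the function name are hypothetical and used only for illustration.

```python
from datetime import date, timedelta

def drop_window(election_day, arrive_days_out, min_transit=5, max_transit=10):
    """Return (latest_safe_drop, earliest_arrival) for a bulk mail piece.

    Dropping max_transit days before the target date means the piece
    arrives by the target even in the slowest case; it may then arrive
    up to (max_transit - min_transit) days early.
    """
    target = election_day - timedelta(days=arrive_days_out)
    latest_safe_drop = target - timedelta(days=max_transit)
    earliest_arrival = latest_safe_drop + timedelta(days=min_transit)
    return latest_safe_drop, earliest_arrival

# Hypothetical election date, for illustration only.
election_day = date(2026, 11, 3)
drop, earliest = drop_window(election_day, arrive_days_out=15)
print("Drop by:", drop, f"({(election_day - drop).days} days out)")
print("Could arrive as early as:", earliest)
```

Using the slowest transit time gives a drop date 25 days out, the conservative end of the 22-25 day range in the note above. The trade-off is that the piece may land several days before the intended date.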

Digital Advertising Benchmarks

Metric	Typical Range	Notes
CPM (cost per 1K impressions)	$5-25	Varies by platform and targeting precision
Click-through rate (display)	0.05-0.2%	Very low for display; higher for video
Video completion rate	35-60%	Platform and length dependent
Persuasion effect per exposure	0.3-0.8 pp	From field experiment literature
GOTV effect per exposure	0.3-1.5 pp	Varies widely by creative quality
Voter file match rate	55-75%	Share of file successfully matched to platform

Text/SMS Benchmarks

Metric	Typical	Notes
Delivery rate (opt-in list)	95-99%	Much higher than purchased lists
Open rate	85-95%	SMS vastly outperforms email
Response rate (two-way SMS)	15-25%	For conversational texts
Opt-out rate per send	1-3%	Higher if contact frequency is too high
GOTV effect (opt-in list)	2-5 pp	Higher effectiveness than cold contact

Section D: Polling Design Reference

Meridian Research Group Survey Design Standards

The following summarizes Meridian's standard survey design practices, consistent with what Dr. Vivian Park and Carlos Mendez apply to the Garza campaign's internal surveys.

Likely voter screen: Meridian uses a seven-question screen based on the Gallup model, plus a state-specific component accounting for the state's early voting infrastructure. The screen is calibrated to the state's historical midterm turnout (approximately 48% of registered voters).

Sampling methodology: Meridian uses a multi-mode approach: 50% address-based sampling with online response, 50% cell phone (live interviewer). This approach oversamples demographic groups with lower online panel participation (older voters, lower-income households, rural voters) and weights back to population benchmarks.

Standard survey battery for Senate race:

Horse race question: "If the election for U.S. Senate were held today, for whom would you vote: Maria Garza, the Democrat, or Tom Whitfield, the Republican?" (Rotate name order; include "someone else," "don't know/no preference" options)
Candidate favorability: Four-point scale (very favorable, somewhat favorable, somewhat unfavorable, very unfavorable) for each candidate; "don't know/no opinion" as fifth option
Issue priority: "What is the single most important issue in deciding your vote for U.S. Senate?" (Open-ended, coded to issue categories)
Issue battery: Importance ratings for five to eight issues using four-point scale
Candidate attribute ratings: "Thinking about [Candidate], please tell me how well each of the following describes her/him..." — attributes include: shares my values, has the experience needed, trustworthy, understands people like me, would fight for the middle class, effective in government
Message test (when included): Split-sample — half receive Message A, half receive Message B; remeasure horse race after exposure
Demographics: Age, gender, race/ethnicity, education, income, party identification, religious attendance, zip code (for urban/rural classification)

Sample size guidance:

Survey Type	Minimum N	Recommended N	MOE at 95% CI
Full benchmark	600	800-1,000	±3.5-4.1%
Tracking	400	500-600	±4.2-4.8%
Message test (split sample)	300 per arm	400-500 per arm	±4.8-5.3%
County-level (substate)	300	400	±5.2%
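
For reference, the margin of error under simple random sampling is MOE = z × sqrt(p(1 − p) / n), with z = 1.96 for 95% confidence and p = 0.5 as the most conservative case. The sketch below evaluates that textbook formula at several sample sizes; the function name is illustrative. Values in the table above need not match it exactly, because weighting a sample typically widens the margin of error through a design effect.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error, in percentage points, for a simple random sample."""
    return 100 * z * math.sqrt(p * (1 - p) / n)

for n in (300, 400, 500, 600, 800, 1000):
    print(f"N={n:>5}: +/-{margin_of_error(n):.1f} pts")
```

Because the margin shrinks with the square root of N, halving it requires roughly four times the sample. Subgroup estimates drawn from a statewide survey carry the wider margin implied by the subgroup's own N, not the full-sample figure.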

Note on likely voter screen variability: As Election Day approaches, the likely voter pool becomes more stable and the screen more accurate. Polls conducted 60+ days out have higher uncertainty in LV composition than polls at 14 days out. Account for this in how you communicate uncertainty from early-cycle polls.

Section E: Budget Reference — Line Item Detail

Standard Campaign Analytics Budget Components (Competitive Senate Race)

Use the following as a benchmark. Your budget should be calibrated to your specific plan — the program choices you made in Deliverable 2 should drive the budget, not the other way around.

Direct mail: Estimating cost

Cost per piece (all-in): $0.60-0.80 for standard political mail (design, print, postage, list processing)
- Total mail cost = (number of pieces) × (cost per piece)
- Standard persuasion sequence: 4 pieces to persuasion universe
- Standard GOTV sequence: 2 pieces to GOTV universe
- Spanish-language premium: add ~15% for translation and separate printing run

Example: 175,000 persuasion voters × 4 pieces + 225,000 GOTV voters × 2 pieces = 1,150,000 pieces total × $0.70 = $805,000

(Students' budgets will differ based on their universe sizes from Deliverable 1)
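
The mail arithmetic above can be wrapped in a small helper so that universe sizes from Deliverable 1 can be swapped in directly. This is a sketch with an illustrative function name. It omits the Spanish-language premium, since the share of pieces that premium applies to depends on your plan.

```python
def mail_cost(persuasion_voters, gotv_voters,
              persuasion_pieces=4, gotv_pieces=2, cost_per_piece=0.70):
    """Total pieces and cost for the standard persuasion and GOTV sequences."""
    pieces = persuasion_voters * persuasion_pieces + gotv_voters * gotv_pieces
    return pieces, pieces * cost_per_piece

# Universe sizes from the worked example in the text.
pieces, cost = mail_cost(175_000, 225_000)
print(f"{pieces:,} pieces, ${cost:,.0f}")
```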

Digital advertising: Estimating cost

Voter file match of universe to platforms: $15,000-25,000 per platform (one-time or per-cycle fee)
Advertising placement: CPM $8-18 depending on targeting precision
- Total digital cost = (desired impressions) × (CPM / 1000)
- A 60-day persuasion digital program might target 50,000 persuasion voters at 20 impressions each = 1,000,000 impressions × $12 CPM = $12,000 plus platform fees
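
A short sketch of the same calculation follows, with the platform match fee added as a separate fixed cost. The function name is illustrative, and the single $15,000 fee is an assumption taken from the low end of the quoted range.

```python
def digital_cost(universe_size, impressions_per_voter, cpm, platform_fees=0):
    """Placement cost from CPM pricing, plus optional fixed platform fees."""
    impressions = universe_size * impressions_per_voter
    placement = impressions * cpm / 1000
    return impressions, placement + platform_fees

# The example from the text: 50,000 voters, 20 impressions each, $12 CPM.
impressions, cost = digital_cost(50_000, 20, cpm=12)
print(f"{impressions:,} impressions, ${cost:,.0f} before platform fees")

# Assumed: one platform match fee at the low end of the quoted range.
_, with_fee = digital_cost(50_000, 20, cpm=12, platform_fees=15_000)
print(f"${with_fee:,.0f} including a $15,000 match fee")
```

At this scale the fixed match fee is larger than the placement spend itself, so the number of platforms can matter as much to the budget as the CPM.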

Canvassing: Estimating cost

Volunteer canvassing direct costs: $15-25 per completed contact (staff coordination, materials, training time)
If using paid canvassers: $40-60 per completed contact
Typically 80-90% of canvassing is done by volunteers; paid canvassers fill gaps
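
One way to turn these ranges into a line item is a blended cost per contact. The sketch below assumes the 80-90% volunteer figure applies to the share of completed contacts, which is an interpretation rather than something the benchmarks state. The 20,000-contact program, the function name, and the midpoint costs are hypothetical.

```python
def canvass_cost(total_contacts, volunteer_share, volunteer_cpc, paid_cpc):
    """Blended canvassing cost given the share of contacts made by volunteers."""
    volunteer_contacts = total_contacts * volunteer_share
    paid_contacts = total_contacts - volunteer_contacts
    return volunteer_contacts * volunteer_cpc + paid_contacts * paid_cpc

# Hypothetical program: 20,000 contacts, 85% made by volunteers, using
# midpoint costs of $20 (volunteer) and $50 (paid) per completed contact.
total = canvass_cost(20_000, 0.85, volunteer_cpc=20, paid_cpc=50)
print(f"${total:,.0f} total, ${total / 20_000:.2f} per contact")
```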

Polling: Market rates

Survey type	Typical cost (Meridian-equivalent quality)
Statewide benchmark (N=800, mixed mode)	$38,000-55,000
Statewide tracking (N=500, mixed mode)	$22,000-30,000
County-level (N=400, phone)	$18,000-28,000
Message test (split sample, N=800)	$30,000-45,000

VAN platform costs

State Democratic Party VAN access: $8,000-15,000 for the campaign cycle (varies by state and negotiation)
Additional modules (data integration, texting): $5,000-12,000
Field director and data staff time for VAN administration: significant; typically 0.5 FTE during peak campaign period

Section F: Ethics Reference — Extended Frameworks

The Voter Privacy Spectrum

Political campaigns operate in a complex legal and ethical landscape around voter data. The following framework, drawn from Chapter 38 and Chapter 39 discussions, summarizes the key distinctions.

Tier 1: Public record data (legally and broadly ethically uncontroversial):
- Voter registration records (name, address, party registration, vote history), which are public records in most states
- Candidate financial disclosure records
- Campaign finance records (FEC, state filings)
- Official electoral results

Tier 2: Commercial append data (legal, ethically contested):
- Consumer behavior segments (lifestyle, purchasing patterns)
- Estimated income and homeownership
- Estimated age and gender
- Estimated education level
- Consumer-based political scores

Tier 3: Sensitive commercial data (legal but ethically problematic per this capstone's framework):
- Inferred health conditions from consumer purchase data
- Inferred financial distress indicators
- Inferred religious practice
- Inferred relationship status from social media behavioral signals
- Lookalike audience modeling matched to social media behavioral profiles

The Garza campaign's analytics plan explicitly prohibits Tier 3 uses. Your ethics review should document this constraint and apply the reasoning from Section 3 of the main capstone text.

VAN Data Ethics Standards

VAN (Voter Activation Network) maintains data use policies that all campaigns accessing the system must agree to. Key constraints:

Voter file data may not be used for commercial purposes
Voter file data may not be sold or transferred to non-political third parties
All canvass data and voter contact records remain the property of the party, not the campaign — they persist for future cycles
Campaigns accessing the party's master file have an obligation to return contact data to the party infrastructure

These contractual obligations are separate from — and in addition to — the ethical constraints the campaign's analytics plan adopts.

Equity Framework for Targeting Decisions

The following checklist questions are drawn from ODA's equity framework and applied to the campaign context:

1. Does the targeting plan systematically invest less in communities of color? Review: What share of Tier 1 GOTV targets are voters of color vs. white voters? What share of canvassing resources go to majority-minority precincts vs. majority-white precincts? Is the disparity, if any, explained by cost-efficiency factors or by strategic deprioritization?

2. Does the language access plan cover the full Spanish-speaking universe? Review: What percentage of Hispanic voters in the GOTV universe are receiving Spanish-language outreach? What's the gap? What's preventing full coverage?

3. Does the young voter program address structural barriers? Review: Does the program provide actionable registration and polling information, or only motivational messaging? Does it reach students at community colleges and vocational schools, not just four-year universities?

4. Does the GOTV program reach voters with disabilities? Review: Are canvassing scripts and materials accessible? Is there a protocol for voters who cannot answer the door due to mobility limitations?

5. Are historically underserved communities' political concerns reflected in the campaign's issue messaging? Review: Does the campaign's message matrix include messages that speak to the specific concerns of Black, Latino, and working-class communities, or are messages primarily calibrated to suburban moderates?
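
The first review question can be checked numerically by comparing each group's share of the top-priority GOTV targets with its share of the full GOTV universe. The sketch below uses a handful of hypothetical rows and treats the `priority` label from the Section B worked example as a stand-in for top-tier targets; both are assumptions for illustration. With the pandas DataFrame from Section B, the same shares are available from `value_counts(normalize=True)`.

```python
from collections import Counter

def share_by_group(records, key="race_ethnicity"):
    """Return each group's share of the given records."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Hypothetical GOTV universe rows, for illustration only.
gotv_rows = [
    {"race_ethnicity": "Black", "priority": "High"},
    {"race_ethnicity": "Hispanic", "priority": "Standard"},
    {"race_ethnicity": "White", "priority": "High"},
    {"race_ethnicity": "Hispanic", "priority": "Standard"},
    {"race_ethnicity": "White", "priority": "High"},
    {"race_ethnicity": "Black", "priority": "Standard"},
]

overall = share_by_group(gotv_rows)
high = share_by_group([r for r in gotv_rows if r["priority"] == "High"])

# A negative gap means the group is under-represented among High targets
# relative to its share of the whole GOTV universe.
for group in overall:
    gap = high.get(group, 0.0) - overall[group]
    print(f"{group}: universe {overall[group]:.0%}, "
          f"high {high.get(group, 0.0):.0%}, gap {gap:+.0%}")
```

A gap on its own does not settle the question. The review still has to ask whether the disparity reflects cost-efficiency factors or strategic deprioritization.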
