Capstone 3 Data Appendix: The Campaign Analytics Plan
Section A: Dataset Reference Guide
Primary Dataset: oda_voters.csv
The ODA voter file is a teaching-scale version of the data infrastructure used in real campaign analytics. With approximately 50,000 rows, it provides the foundation for all universe construction work in this capstone. Real Senate race voter files have millions of records; the relative compositions and analytical relationships in this teaching dataset accurately reflect real voter file characteristics.
Complete column reference:
| Column | Type | Range / Values | Notes |
|---|---|---|---|
| voter_id | string | V-XXXXX | Unique identifier |
| state | string | Single-state | All records from the race's state |
| county | string | 12 county names | See county reference table below |
| age | integer | 18-85 | Current age at election date |
| gender | categorical | M, F, Other | - |
| race_ethnicity | categorical | White, Black, Hispanic, Asian, Other | Self-reported where available, estimated otherwise |
| education | categorical | Less than HS, HS Diploma, Some College, College Degree, Graduate Degree | - |
| income_bracket | categorical | Under 30K, 30-60K, 60-100K, Over 100K | Estimated from public records and commercial data |
| party_reg | categorical | Democrat, Republican, Independent, Other | Official registration status |
| vote_history_2018 | binary | 0, 1 | 1 = participated in 2018 midterm election |
| vote_history_2020 | binary | 0, 1 | 1 = participated in 2020 presidential election |
| vote_history_2022 | binary | 0, 1 | 1 = participated in 2022 midterm election |
| urban_rural | categorical | Urban, Suburban, Rural | - |
| support_score | float | 0-100 | Garza support probability × 100 |
| persuadability_score | float | 0-100 | Higher = more movable by campaign contact |
County reference table:
| County Name | Type | Approx. Share of State Voters | Strategic Designation |
|---|---|---|---|
| Metro Central | Urban | 22% | Garza base; GOTV priority |
| Metro South | Urban | 14% | Garza base; GOTV priority |
| Lakeview | Suburban | 12% | Swing; persuasion and GOTV |
| Riverside | Suburban | 9% | Swing; persuasion priority |
| Valley North | Suburban | 7% | Lean Whitfield; limited investment |
| Highfield | Rural | 8% | Whitfield stronghold; GOTV for Garza voters only |
| Carson | Rural | 7% | Whitfield stronghold; GOTV for Garza voters only |
| Eastport | Rural | 5% | Whitfield stronghold |
| Garfield | Rural | 4% | Split rural; some Garza opportunity |
| Millbrook | Suburban/rural mix | 5% | Lean Whitfield; some persuasion opportunity |
| Crestview | Suburban | 4% | Lean Garza; GOTV priority |
| Other/small | Mixed | 3% | Distributed; low investment |
Expected support score distribution by county:
| County | Mean Support Score | Std Dev | Notes |
|---|---|---|---|
| Metro Central | 66.2 | 18.4 | Garza-favorable; large Latino share |
| Metro South | 63.8 | 19.1 | Garza-favorable; large Black share |
| Lakeview | 51.3 | 22.7 | Genuinely swing; college-educated suburbs |
| Riverside | 49.8 | 23.1 | Most competitive county in the state |
| Valley North | 44.2 | 20.5 | Lean Whitfield; some college-ed women opportunity |
| Highfield | 38.5 | 16.2 | Whitfield stronghold |
| Carson | 37.1 | 15.8 | Whitfield stronghold |
| Garfield | 46.3 | 21.9 | Split rural; Garza has a small universe here |
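A quick sanity check after loading the file is to compare observed county means against the expected values in the table above. The sketch below is one way to do that; the function name `check_county_means`, the 2-point tolerance, and the toy data are illustrative assumptions, not part of the capstone materials.

```python
import pandas as pd

# Expected county means copied from the table above
EXPECTED_MEAN = {
    "Metro Central": 66.2, "Metro South": 63.8, "Lakeview": 51.3,
    "Riverside": 49.8, "Valley North": 44.2, "Highfield": 38.5,
    "Carson": 37.1, "Garfield": 46.3,
}

def check_county_means(voters: pd.DataFrame, tolerance: float = 2.0) -> pd.DataFrame:
    """Compare observed mean support_score per county with the expected mean.

    `tolerance` (in score points) is a hypothetical threshold; pick your own.
    """
    observed = voters.groupby("county")["support_score"].agg(["mean", "std", "count"])
    expected = pd.Series(EXPECTED_MEAN, name="expected_mean")
    report = observed.join(expected, how="inner")  # only counties listed in the table
    diff = report["mean"] - report["expected_mean"]
    within = diff.abs() <= tolerance
    report["diff"] = diff
    report = report.round(1)
    report["within_tolerance"] = within
    return report

# Toy data standing in for oda_voters.csv (values invented for illustration)
demo = pd.DataFrame({
    "county": ["Riverside"] * 4 + ["Carson"] * 4,
    "support_score": [30.0, 45.0, 55.0, 70.0, 20.0, 35.0, 40.0, 75.0],
})
report = check_county_means(demo)
print(report)
```

With the real file, pass the `voters` DataFrame instead of `demo`. A county flagged as outside tolerance is a prompt to inspect the data, not proof of an error.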
Secondary Dataset: oda_polls.csv
Used for polling plan design and public polling analysis.
Key columns for this capstone:
| Column | Notes |
|---|---|
| date | Date poll was fielded (start date) |
| pollster | Organization conducting the poll |
| methodology | phone, online, mixed, IVR |
| pct_d, pct_r | Candidate percentages |
| sample_size | Total N |
| margin_error | Reported margin of error |
| population | LV (likely voters), RV (registered voters), A (adults) |
| race_type | senate_general, senate_primary, governor_general, etc. |
Public polling summary for the Garza-Whitfield race (60-day window):
The following represents the polling landscape as of the campaign's 60-day-out point:
| Date | Pollster | Method | Garza % | Whitfield % | MOE | Pop |
|---|---|---|---|---|---|---|
| -62 days | University Poll | Phone | 45 | 43 | 4.2 | RV |
| -58 days | SurveyUSA | Online | 47 | 44 | 3.8 | LV |
| -54 days | FOX State | Phone/Online | 44 | 45 | 3.5 | LV |
| -51 days | BluePath (D) | Phone | 49 | 42 | 4.1 | LV |
| -48 days | Meridian Research | Mixed | 46 | 44 | 3.6 | LV |
| -44 days | Impact Research (D) | Online | 47 | 43 | 3.9 | LV |
| -42 days | Cygnal (R) | Online | 44 | 46 | 4.0 | LV |
| -38 days | Emerson | Online/IVR | 46 | 45 | 3.7 | LV |
Weighted public average (as of 60 days out): Garza +2.1 points
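The weights behind the +2.1 figure are not specified in this appendix, so the sketch below does not try to reproduce it. It only shows the mechanics of a polling average on the eight polls in the table: first an equal-weight average, then a version that discounts partisan-sponsored polls. Treating the "(D)" and "(R)" labels as partisan sponsorship and the 0.5 discount are both assumptions made for illustration.

```python
# (pollster, garza_pct, whitfield_pct, partisan_sponsor) copied from the table above.
# The partisan flag is inferred from the "(D)"/"(R)" labels, which is an assumption.
POLLS = [
    ("University Poll", 45, 43, False),
    ("SurveyUSA", 47, 44, False),
    ("FOX State", 44, 45, False),
    ("BluePath (D)", 49, 42, True),
    ("Meridian Research", 46, 44, False),
    ("Impact Research (D)", 47, 43, True),
    ("Cygnal (R)", 44, 46, True),
    ("Emerson", 46, 45, False),
]

def average_margin(polls, partisan_weight=1.0):
    """Weighted mean of (Garza - Whitfield); partisan polls get partisan_weight, others 1.0."""
    weights = [partisan_weight if partisan else 1.0 for *_, partisan in polls]
    margins = [garza - whitfield for _, garza, whitfield, _ in polls]
    return sum(wt * m for wt, m in zip(weights, margins)) / sum(weights)

simple = average_margin(POLLS)                               # every poll weighted equally
discounted = average_margin(POLLS, partisan_weight=0.5)      # hypothetical discount
print(f"Equal-weight average margin: Garza {simple:+.1f}")
print(f"Partisan polls at half weight: Garza {discounted:+.1f}")
```

A production average would usually also weight by sample size, recency, and population (LV vs. RV). Those inputs are available in oda_polls.csv but are not shown in the summary table.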
Meridian internal poll (most recent, 3 weeks old):
- Sample: N=802 likely voters
- Method: Mixed mode (50% phone, 50% online panel)
- Garza: 46%, Whitfield: 44%, Undecided/Other: 10%
- MOE: ±3.5%
Key Meridian subgroup findings (for targeting design):
| Segment | Garza % | Whitfield % | Notes |
|---|---|---|---|
| Latino voters | 68 | 24 | 8% undecided — room to grow |
| Black voters | 87 | 9 | 4% undecided — low persuasion room |
| White college-ed women | 55 | 40 | 5% undecided — key persuasion target |
| White college-ed men | 45 | 50 | 5% undecided — contested |
| White non-college | 37 | 58 | 5% undecided — tough terrain for Garza |
| Young voters 18-29 | 58 | 33 | 9% undecided — high upside if turnout |
| Suburban voters (all) | 51 | 44 | 5% undecided |
| Rural voters (all) | 34 | 61 | 5% undecided — Garza limited here |
Supplementary Dataset: oda_ads.csv
Useful for understanding the advertising landscape and informing the digital program design.
Key columns for this capstone:
| Column | Notes |
|---|---|
| sponsor | Who paid for the ad |
| party | D, R, third-party |
| platform | TV, digital, radio |
| state | State of airing/targeting |
| market | Media market |
| spend_usd | Estimated spend |
| impressions | Estimated impressions |
| issue_topic | Primary issue topic of the ad |
| tone | positive, negative, contrast |
| target_demo | Demographic target specification |
Advertising context for the Garza campaign's 60-day window:
The race's advertising environment as of 60 days out:
- Garza campaign spending rate: approximately $280K/week on advertising across all platforms
- Whitfield campaign spending rate: approximately $230K/week
- Total outside spending (both sides combined): approximately $1.2M/week
- Top issue topics in Garza ads: healthcare (38%), economic security (29%), immigration/AG record (18%), Whitfield contrast (15%)
- Top issue topics in Whitfield ads: border security/immigration (42%), economy/jobs (31%), Garza contrast (22%), other (5%)
Section B: Voter Universe Technical Reference
Worked Example: Universe Construction
The following is a worked example of universe construction using the oda_voters.csv dataset. Use it as a reference rather than copying it: your implementation should include your own threshold justifications and priority tier criteria.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

# Load voter file
voters = pd.read_csv('oda_voters.csv')
print("=== VOTER FILE OVERVIEW ===")
print(f"Total registered voters: {len(voters):,}")

# --- STEP 1: CONSTRUCT TURNOUT PROPENSITY ---
# Reasoning: 2022 gets the most weight (most recent, same cycle type: midterm)
# 2018 is next (older, but same cycle type)
# 2020 gets the least weight (presidential: higher baseline turnout)
# Age adjustment: older voters have higher base turnout, floor at 0
def build_turnout_propensity(row):
    base = (row['vote_history_2022'] * 55 +
            row['vote_history_2020'] * 20 +
            row['vote_history_2018'] * 25)
    # Age adjustment: +0.25 per year over 40, capped at +10 (cap reached at age 80)
    age_bonus = min(10, max(0, (row['age'] - 40) * 0.25))
    return min(100, base + age_bonus)

voters['turnout_propensity'] = voters.apply(build_turnout_propensity, axis=1)

# Validate distribution looks reasonable
print("\nTurnout propensity distribution:")
print(voters['turnout_propensity'].describe())
bins = [0, 10, 25, 50, 75, 90, 100]
print("\nBy bucket:")
# include_lowest=True keeps voters with a score of exactly 0 in the first bucket
print(pd.cut(voters['turnout_propensity'], bins, include_lowest=True)
      .value_counts().sort_index())

# --- STEP 2: PERSUASION UNIVERSE ---
persuasion = voters[
    (voters['support_score'] >= 40) &
    (voters['support_score'] <= 60) &
    (voters['persuadability_score'] >= 50) &
    (voters['turnout_propensity'] >= 30)
].copy()

# Priority tiers based on combination of persuadability and turnout
def persuasion_tier(row):
    if (row['persuadability_score'] >= 70 and
            row['turnout_propensity'] >= 60):
        return 'Tier 1'
    elif (row['persuadability_score'] >= 58 or
            row['turnout_propensity'] >= 50):
        return 'Tier 2'
    else:
        return 'Tier 3'

persuasion['tier'] = persuasion.apply(persuasion_tier, axis=1)
print("\n=== PERSUASION UNIVERSE ===")
print(f"Total: {len(persuasion):,} ({len(persuasion)/len(voters)*100:.1f}% of file)")
print(persuasion['tier'].value_counts())

# --- STEP 3: GOTV UNIVERSE ---
gotv = voters[
    (voters['support_score'] >= 65) &
    (voters['turnout_propensity'] >= 30) &
    (voters['turnout_propensity'] <= 78)
].copy()

def gotv_priority(row):
    # row values are scalars, so use a chained comparison rather than Series.between
    if (row['support_score'] >= 78 and
            45 <= row['turnout_propensity'] <= 72):
        return 'High'
    else:
        return 'Standard'

gotv['priority'] = gotv.apply(gotv_priority, axis=1)
print("\n=== GOTV UNIVERSE ===")
print(f"Total: {len(gotv):,} ({len(gotv)/len(voters)*100:.1f}% of file)")
print(gotv['priority'].value_counts())

# Demographic breakdown of GOTV universe
print("\nGOTV by race/ethnicity:")
print(gotv['race_ethnicity'].value_counts())
print("\nGOTV by urban/rural:")
print(gotv['urban_rural'].value_counts())

# --- STEP 4: COUNTY SUMMARY ---
# Collect one dict per county, then build the DataFrame once
county_rows = []
for county in voters['county'].unique():
    county_voters = voters[voters['county'] == county]
    county_persu = persuasion[persuasion['county'] == county]
    county_gotv = gotv[gotv['county'] == county]
    county_rows.append({
        'county': county,
        'total_voters': len(county_voters),
        'persuasion_total': len(county_persu),
        'persuasion_tier1': (county_persu['tier'] == 'Tier 1').sum(),
        'gotv_total': len(county_gotv),
        'gotv_high': (county_gotv['priority'] == 'High').sum(),
        'mean_support': round(county_voters['support_score'].mean(), 1),
        'mean_turnout_prop': round(county_voters['turnout_propensity'].mean(), 1)
    })
county_summary = pd.DataFrame(county_rows)
county_summary = county_summary.sort_values('total_voters', ascending=False)
print("\n=== COUNTY UNIVERSE SUMMARY ===")
print(county_summary.to_string(index=False))

# --- STEP 5: VISUALIZATIONS ---
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Viz 1: Support score distribution by county type
urban_rural_groups = voters.groupby('urban_rural')['support_score']
for label, group in urban_rural_groups:
    axes[0].hist(group, bins=30, alpha=0.6, label=label, density=True)
axes[0].axvline(x=40, color='red', linestyle='--', alpha=0.7, label='Persuasion bounds')
axes[0].axvline(x=60, color='red', linestyle='--', alpha=0.7)
axes[0].set_xlabel('Support Score (0=Whitfield, 100=Garza)')
axes[0].set_ylabel('Density')
axes[0].set_title('Support Score Distribution by Geography')
axes[0].legend()

# Viz 2: Universe size by county (eight largest counties)
plot_counties = county_summary.head(8)
x = range(len(plot_counties))
width = 0.35
axes[1].bar([i - width/2 for i in x],
            plot_counties['persuasion_total'],
            width, label='Persuasion Universe', color='steelblue', alpha=0.8)
axes[1].bar([i + width/2 for i in x],
            plot_counties['gotv_total'],
            width, label='GOTV Universe', color='darkorange', alpha=0.8)
axes[1].set_xticks(x)
axes[1].set_xticklabels(plot_counties['county'], rotation=45, ha='right')
axes[1].set_ylabel('Voters in Universe')
axes[1].set_title('Persuasion and GOTV Universe Size by County')
axes[1].legend()

plt.tight_layout()
plt.savefig('universe_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
print("\nVisualization saved to universe_analysis.png")
```
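The worked example computes turnout propensity row by row with `apply`, which is easy to read but slow on large files. As an optional alternative, the sketch below expresses the same formula from `build_turnout_propensity` with column arithmetic. The function name `turnout_propensity_vectorized` and the four-row test frame are illustrative assumptions.

```python
import pandas as pd

def turnout_propensity_vectorized(df: pd.DataFrame) -> pd.Series:
    """Column-wise version of the Step 1 formula: weighted vote history plus age bonus."""
    base = (df["vote_history_2022"] * 55 +
            df["vote_history_2020"] * 20 +
            df["vote_history_2018"] * 25)
    # +0.25 per year over 40, floored at 0 and capped at +10
    age_bonus = ((df["age"] - 40) * 0.25).clip(lower=0, upper=10)
    return (base + age_bonus).clip(upper=100)

# Small invented frame to check the arithmetic by hand
sample = pd.DataFrame({
    "vote_history_2022": [1, 0, 1, 0],
    "vote_history_2020": [1, 1, 0, 0],
    "vote_history_2018": [1, 0, 0, 0],
    "age": [85, 30, 60, 18],
})
scores = turnout_propensity_vectorized(sample)
print(scores.tolist())
```

On the 50,000-row teaching file either approach is fine. The vectorized form mainly matters if you rerun the score many times while tuning weights.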
Section C: Contact Program Standards and Benchmarks
Canvassing Program Benchmarks
| Metric | Low | Typical | High | Notes |
|---|---|---|---|---|
| Doors per hour | 6-8 | 8-10 | 10-14 | Urban more efficient than rural |
| Contact rate (someone answers) | 18% | 25-30% | 38% | Lower in rural, higher in dense urban |
| Conversations per contact hour | 1.5-2.0 | 2.5-3.5 | 4.0+ | — |
| Turnout lift per contact | 2-4 pp | 4-6 pp | 6-8 pp | GOTV contacts; phone lower end |
| Persuasion lift per contact | 0.5-1 pp | 1-2 pp | 2-3 pp | Persuasion contacts |
| Effective volunteer shift (3 hr) | 6-8 contacts | 8-12 contacts | 12-18 contacts | After accounting for travel, setup |
Important: "Contacts" means a real conversation with a voter, not doors knocked. A volunteer who knocks 30 doors in a three-hour shift and speaks with 9 people has made 9 contacts, not 30.
Volunteer retention patterns: Campaigns typically lose 30-40% of recruited volunteers after their first shift. Building volunteer retention requires: recognition systems, clear communication about impact, training that sets accurate expectations, and a social community of fellow volunteers.
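The benchmarks above can be chained into a rough staffing estimate. The sketch below assumes a hypothetical goal of 20,000 GOTV conversations, 10 contacts per shift (middle of the typical range), 35% first-shift attrition, and four shifts per retained volunteer. The contact goal and the shifts-per-volunteer figure are invented for illustration, and the function names are not from the capstone.

```python
import math

def shifts_needed(target_contacts, contacts_per_shift=10):
    """Three-hour shifts required to reach the contact goal."""
    return math.ceil(target_contacts / contacts_per_shift)

def recruits_needed(total_shifts, first_shift_attrition=0.35, shifts_per_retained=4):
    """Recruits required when some volunteers quit after one shift.

    Every recruit works one shift; those who stay work shifts_per_retained in total.
    """
    shifts_per_recruit = 1 + (1 - first_shift_attrition) * (shifts_per_retained - 1)
    return math.ceil(total_shifts / shifts_per_recruit)

def expected_net_votes(contacts, lift_pp):
    """Contacts times turnout lift, with lift given in percentage points."""
    return contacts * lift_pp / 100

target = 20_000                      # hypothetical GOTV conversation goal
shifts = shifts_needed(target)
recruits = recruits_needed(shifts)
votes = expected_net_votes(target, lift_pp=5)   # 5 pp is mid typical range
print(f"{shifts:,} shifts, about {recruits:,} recruits, roughly {votes:,.0f} net votes")
```

Rerunning with the low and high benchmark values gives a range rather than a point estimate, which is usually the more honest way to present a field plan.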
Phone Banking Benchmarks
| Metric | Typical Range | Notes |
|---|---|---|
| Calls per hour | 8-15 | Varies by list quality and script length |
| Contact rate (live answer) | 8-15% | Cold lists; 20-30% for warm (prior contacts) |
| Conversation completion rate | 40-55% | Of contacts who pick up |
| Volunteer hours per 1,000 contacts | 80-120 hours | Using volunteer phone bank |
| Turnout lift per phone contact | 1-3 pp | Live calls; IVR/robocall is 0-0.5 pp |
Mail Program Benchmarks
| Metric | Typical | Notes |
|---|---|---|
| Delivery rate | 92-96% | 4-8% returned undeliverable |
| Piece read rate | 30-60% | Varies significantly by design and segment |
| Response / action rate | 1-5% | For pieces with a specific call to action |
| Cost per piece (all-in) | $0.55-0.85 | Printing, postage, list processing |
| Persuasion effect per piece | 0.3-0.8 pp | From field experiment literature |
| Turnout effect per GOTV piece | 0.3-1.0 pp | Modestly effective; best as supplement |
Mail timing notes: USPS delivery times for bulk mail are 5-10 days. Political mail sent at non-bulk (first class) rates arrives in 1-3 days. All mail drop dates in the contact program plan should account for delivery time — a piece intended to arrive at 15 days out should be dropped at 22-25 days out if sent bulk.
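The timing rule above reduces to adding delivery days to the target arrival point. The sketch below computes the drop window for a piece meant to land 15 days out. The election date is a hypothetical placeholder, and `drop_window` is an illustrative helper rather than part of any campaign tool.

```python
from datetime import date, timedelta

BULK_DELIVERY_DAYS = (5, 10)         # fastest, slowest (from the note above)
FIRST_CLASS_DELIVERY_DAYS = (1, 3)

def drop_window(arrive_days_out, delivery_days):
    """Return (latest_drop_days_out, safest_drop_days_out) for a target arrival."""
    fastest, slowest = delivery_days
    return arrive_days_out + fastest, arrive_days_out + slowest

election_day = date(2026, 11, 3)     # hypothetical date for illustration
latest, safest = drop_window(15, BULK_DELIVERY_DAYS)
safest_drop_date = election_day - timedelta(days=safest)
print(f"Bulk: drop between {safest} and {latest} days out; "
      f"safest drop date {safest_drop_date.isoformat()}")
```

The raw window for bulk mail is 20 to 25 days out. The 22-25 day guidance in the note sits in the conservative part of that window, which protects against slow delivery.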
Digital Advertising Benchmarks
| Metric | Typical Range | Notes |
|---|---|---|
| CPM (cost per 1K impressions) | $5-25 | Varies by platform and targeting precision |
| Click-through rate (display) | 0.05-0.2% | Very low for display; higher for video |
| Video completion rate | 35-60% | Platform and length dependent |
| Persuasion effect per exposure | 0.3-0.8 pp | From field experiment literature |
| GOTV effect per exposure | 0.3-1.5 pp | Varies widely by creative quality |
| Voter file match rate | 55-75% | Share of file successfully matched to platform |
Text/SMS Benchmarks
| Metric | Typical | Notes |
|---|---|---|
| Delivery rate (opt-in list) | 95-99% | Much higher than purchased lists |
| Open rate | 85-95% | SMS vastly outperforms email |
| Response rate (two-way SMS) | 15-25% | For conversational texts |
| Opt-out rate per send | 1-3% | Higher if contact frequency is too high |
| GOTV effect (opt-in list) | 2-5 pp | Higher effectiveness than cold contact |
Section D: Polling Design Reference
Meridian Research Group Survey Design Standards
The following summarizes Meridian's standard survey design practices, consistent with what Dr. Vivian Park and Carlos Mendez apply to the Garza campaign's internal surveys.
Likely voter screen: Meridian uses a seven-question screen based on the Gallup model, plus a state-specific component accounting for the state's early voting infrastructure. The screen is calibrated to the state's historical midterm turnout (approximately 48% of registered voters).
Sampling methodology: Meridian uses a multi-mode approach: 50% address-based sampling with online response, 50% cell phone (live interviewer). This approach oversamples demographic groups with lower online panel participation (older voters, lower-income households, rural voters) and weights back to population benchmarks.
Standard survey battery for Senate race:
- Horse race question: "If the election for U.S. Senate were held today, for whom would you vote: Maria Garza, the Democrat, or Tom Whitfield, the Republican?" (Rotate name order; include "someone else," "don't know/no preference" options)
- Candidate favorability: Four-point scale (very favorable, somewhat favorable, somewhat unfavorable, very unfavorable) for each candidate; "don't know/no opinion" as fifth option
- Issue priority: "What is the single most important issue in deciding your vote for U.S. Senate?" (Open-ended, coded to issue categories)
- Issue battery: Importance ratings for five to eight issues using four-point scale
- Candidate attribute ratings: "Thinking about [Candidate], please tell me how well each of the following describes her/him..." Attributes include: shares my values, has the experience needed, trustworthy, understands people like me, would fight for the middle class, effective in government
- Message test (when included): Split-sample design in which half receive Message A and half receive Message B; remeasure horse race after exposure
- Demographics: Age, gender, race/ethnicity, education, income, party identification, religious attendance, zip code (for urban/rural classification)
Sample size guidance:
| Survey Type | Minimum N | Recommended N | MOE at 95% CI |
|---|---|---|---|
| Full benchmark | 600 | 800-1,000 | ±3.5-4.1% |
| Tracking | 400 | 500-600 | ±4.2-4.8% |
| Message test (split sample) | 300 per arm | 400-500 per arm | ±4.8-5.3% |
| County-level (substate) | 300 | 400 | ±5.2% |
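For reference, the textbook margin of error for a proportion under simple random sampling is z × sqrt(p(1 − p) / n). The sketch below computes it for the sample sizes in the table. Its results will not match the table exactly; one plausible reason is that the table's figures include a design effect from weighting, but that is an assumption, and the `design_effect` value of 1.2 is purely hypothetical.

```python
import math

def margin_of_error(n, p=0.5, z=1.96, design_effect=1.0):
    """95% margin of error in percentage points for proportion p and sample size n.

    design_effect > 1 inflates the variance to approximate the cost of weighting.
    """
    return 100 * z * math.sqrt(design_effect * p * (1 - p) / n)

for n in (300, 400, 500, 600, 800, 1000):
    srs = margin_of_error(n)
    weighted = margin_of_error(n, design_effect=1.2)   # hypothetical design effect
    print(f"N={n:>5}: +/-{srs:.1f} pts (SRS), +/-{weighted:.1f} pts (deff=1.2)")
```

Applied to the Meridian internal poll (N=802), the simple random sampling formula gives about ±3.5 points, consistent with the reported margin.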
Note on likely voter screen variability: As Election Day approaches, the likely voter pool becomes more stable and the screen more accurate. Polls conducted 60+ days out have higher uncertainty in LV composition than polls at 14 days out. Account for this in how you communicate uncertainty from early-cycle polls.
Section E: Budget Reference — Line Item Detail
Standard Campaign Analytics Budget Components (Competitive Senate Race)
Use the following as a benchmark. Your budget should be calibrated to your specific plan — the program choices you made in Deliverable 2 should drive the budget, not the other way around.
Direct mail: Estimating cost
- Cost per piece (all-in): $0.60-0.80 for standard political mail (design, print, postage, list processing)
- Total mail cost = (number of pieces) × (cost per piece)
- Standard persuasion sequence: 4 pieces to persuasion universe
- Standard GOTV sequence: 2 pieces to GOTV universe
- Spanish-language premium: add ~15% for translation and separate printing run
Example: 175,000 persuasion voters × 4 pieces + 225,000 GOTV voters × 2 pieces = 1,150,000 pieces total × $0.70 = $805,000
(Students' budgets will differ based on their universe sizes from Deliverable 1)
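The mail estimate can be wrapped in a small function so that universe sizes from Deliverable 1 plug straight in. The sketch below reproduces the example figures. The `spanish_piece_share` parameter (the fraction of pieces printed in Spanish) is a hypothetical input, and applying the ~15% premium only to that fraction is one interpretation of the guidance above.

```python
def mail_budget(persuasion_voters, gotv_voters, cost_per_piece=0.70,
                persuasion_pieces=4, gotv_pieces=2,
                spanish_piece_share=0.0, spanish_premium=0.15):
    """Return (total_pieces, total_cost) for a persuasion + GOTV mail program."""
    pieces = persuasion_voters * persuasion_pieces + gotv_voters * gotv_pieces
    base_cost = pieces * cost_per_piece
    # Premium applied only to the Spanish-language share of pieces (assumption)
    premium = base_cost * spanish_piece_share * spanish_premium
    return pieces, base_cost + premium

pieces, cost = mail_budget(175_000, 225_000)
print(f"{pieces:,} pieces, ${cost:,.0f}")

# Same program with a hypothetical 20% of pieces in Spanish
_, cost_with_spanish = mail_budget(175_000, 225_000, spanish_piece_share=0.20)
print(f"With Spanish-language run: ${cost_with_spanish:,.0f}")
```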
Digital advertising: Estimating cost
- Voter file match of universe to platforms: $15,000-25,000 per platform (one-time or per-cycle fee)
- Advertising placement: CPM $8-18 depending on targeting precision
- Total digital cost = (desired impressions) × (CPM / 1000)
- A 60-day persuasion digital program might target 50,000 persuasion voters at 20 impressions each = 1,000,000 impressions × $12 CPM = $12,000 plus platform fees
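The digital estimate above is the CPM formula plus per-platform match fees. A minimal sketch follows, reproducing the 50,000-voter example. The choice of two platforms is a hypothetical assumption used only to show how the fee range enters the total.

```python
def digital_media_cost(target_voters, impressions_per_voter, cpm):
    """Return (impressions, media_cost) using cost = impressions * CPM / 1000."""
    impressions = target_voters * impressions_per_voter
    return impressions, impressions * cpm / 1000

impressions, media = digital_media_cost(50_000, 20, cpm=12)

n_platforms = 2                       # hypothetical number of platforms matched
low_total = media + n_platforms * 15_000
high_total = media + n_platforms * 25_000
print(f"{impressions:,} impressions, media ${media:,.0f}, "
      f"total ${low_total:,.0f}-${high_total:,.0f} with match fees")
```

In this example the match fees exceed the media cost, so the number of platforms is a larger budget lever than the CPM.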
Canvassing: Estimating cost
- Volunteer canvassing direct costs: $15-25 per completed contact (staff coordination, materials, training time)
- If using paid canvassers: $40-60 per completed contact
- Typically 80-90% of canvassing is done by volunteers; paid canvassers fill gaps
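Because volunteers and paid canvassers cost different amounts per contact, a blended rate is a convenient budgeting input. The sketch below computes the low and high ends of that blend from the ranges above and applies them to a hypothetical goal of 20,000 completed contacts. Both the contact goal and the helper name are illustrative.

```python
def blended_cost_per_contact(volunteer_share, volunteer_cost, paid_cost):
    """Weighted average cost per completed contact across volunteer and paid canvassing."""
    return volunteer_share * volunteer_cost + (1 - volunteer_share) * paid_cost

# Low end: 90% volunteer at $15, paid at $40. High end: 80% volunteer at $25, paid at $60.
low_rate = blended_cost_per_contact(0.90, 15, 40)
high_rate = blended_cost_per_contact(0.80, 25, 60)

target_contacts = 20_000              # hypothetical contact goal
print(f"Blended cost per contact: ${low_rate:.2f}-${high_rate:.2f}")
print(f"Budget range: ${low_rate * target_contacts:,.0f}-${high_rate * target_contacts:,.0f}")
```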
Polling: Market rates
| Survey type | Typical cost (Meridian-equivalent quality) |
|---|---|
| Statewide benchmark (N=800, mixed mode) | $38,000-55,000 |
| Statewide tracking (N=500, mixed mode) | $22,000-30,000 |
| County-level (N=400, phone) | $18,000-28,000 |
| Message test (split sample, N=800) | $30,000-45,000 |
VAN platform costs
- State Democratic Party VAN access: $8,000-15,000 for the campaign cycle (varies by state and negotiation)
- Additional modules (data integration, texting): $5,000-12,000
- Field director and data staff time for VAN administration: significant; typically 0.5 FTE during peak campaign period
Section F: Ethics Reference — Extended Frameworks
The Voter Privacy Spectrum
Political campaigns operate in a complex legal and ethical landscape around voter data. The following framework, drawn from Chapter 38 and Chapter 39 discussions, summarizes the key distinctions.
Tier 1 - Public record data (legally and broadly ethically uncontroversial):
- Voter registration records (name, address, party registration, vote history), which are public records in most states
- Candidate financial disclosure records
- Campaign finance records (FEC, state filings)
- Official electoral results

Tier 2 - Commercial append data (legal, ethically contested):
- Consumer behavior segments (lifestyle, purchasing patterns)
- Estimated income and homeownership
- Estimated age and gender
- Estimated education level
- Consumer-based political scores

Tier 3 - Sensitive commercial data (legal but ethically problematic per this capstone's framework):
- Inferred health conditions from consumer purchase data
- Inferred financial distress indicators
- Inferred religious practice
- Inferred relationship status from social media behavioral signals
- Lookalike audience modeling matched to social media behavioral profiles
The Garza campaign's analytics plan explicitly prohibits Tier 3 uses. Your ethics review should document this constraint and apply the reasoning from Section 3 of the main capstone text.
VAN Data Ethics Standards
VAN (Voter Activation Network) maintains data use policies that all campaigns accessing the system must agree to. Key constraints:
- Voter file data may not be used for commercial purposes
- Voter file data may not be sold or transferred to non-political third parties
- All canvass data and voter contact records remain the property of the party, not the campaign — they persist for future cycles
- Campaigns accessing the party's master file have an obligation to return contact data to the party infrastructure
These contractual obligations are separate from — and in addition to — the ethical constraints the campaign's analytics plan adopts.
Equity Framework for Targeting Decisions
The following checklist questions are drawn from ODA's equity framework and applied to the campaign context:
1. Does the targeting plan systematically invest less in communities of color? Review: What share of Tier 1 GOTV targets are voters of color vs. white voters? What share of canvassing resources go to majority-minority precincts vs. majority-white precincts? Is the disparity, if any, explained by cost-efficiency factors or by strategic deprioritization?
2. Does the language access plan cover the full Spanish-speaking universe? Review: What percentage of Hispanic voters in the GOTV universe are receiving Spanish-language outreach? What's the gap? What's preventing full coverage?
3. Does the young voter program address structural barriers? Review: Does the program provide actionable registration and polling information, or only motivational messaging? Does it reach students at community colleges and vocational schools, not just four-year universities?
4. Does the GOTV program reach voters with disabilities? Review: Are canvassing scripts and materials accessible? Is there a protocol for voters who cannot answer the door due to mobility limitations?
5. Are historically underserved communities' political concerns reflected in the campaign's issue messaging? Review: Does the campaign's message matrix include messages that speak to the specific concerns of Black, Latino, and working-class communities, or are messages primarily calibrated to suburban moderates?
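Question 1 can be checked directly against the universe tables. The sketch below compares each group's share of the GOTV universe with its share of high-priority targets. It treats `priority == 'High'` from the worked example in Section B as the top GOTV tier, which is an assumption, and the ten-row frame is invented data for illustration.

```python
import pandas as pd

def representation_report(gotv: pd.DataFrame, group_col: str = "race_ethnicity") -> pd.DataFrame:
    """Compare each group's share of the universe with its share of High-priority targets."""
    universe_share = gotv[group_col].value_counts(normalize=True)
    high_share = gotv.loc[gotv["priority"] == "High", group_col].value_counts(normalize=True)
    report = pd.DataFrame({
        "universe_share": universe_share,
        "high_priority_share": high_share,
    }).fillna(0.0)
    # Ratio below 1 means the group is under-represented among High-priority targets
    report["representation_ratio"] = report["high_priority_share"] / report["universe_share"]
    return report

# Invented toy universe: 4 White, 3 Black, 3 Hispanic voters
toy_gotv = pd.DataFrame({
    "race_ethnicity": ["White"] * 4 + ["Black"] * 3 + ["Hispanic"] * 3,
    "priority": ["High", "High", "High", "Standard",
                 "High", "Standard", "Standard",
                 "High", "Standard", "Standard"],
})
equity = representation_report(toy_gotv)
print(equity.round(2))
```

A ratio below 1 does not settle the question on its own. The checklist asks whether the gap is explained by cost-efficiency factors or by strategic deprioritization, and that requires looking at the thresholds that produced it.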