Capstone 1 Data Appendix: Data Sources, Methodology Notes, and Analysis Steps
Overview
This appendix documents all data sources referenced in the Garza-Whitfield Senate race audit and provides detailed methodology notes for each major analytical procedure. Students using Option A should treat this appendix as the full documentation of the data environment in which their analysis occurs. Students using Options B or C should use this appendix as a template for their own data documentation.
The appendix is organized to mirror the sections of the main capstone document. Each analytical section has a corresponding appendix entry describing: (1) what data was used, (2) where it came from, (3) how it was processed, and (4) what analytical decisions were made that could have been made differently.
Documenting methodological decisions is one of the most important — and most undervalued — skills in political analytics. The reproducibility of an analysis depends on knowing not just what calculation was performed, but which version of the data, which inclusion/exclusion criteria, and which definitional choices shaped that calculation.
Section A: Polling Data
A.1 Poll Compilation Methodology
Source: All polls in Table 1 were compiled from the following sources: (a) the American Association for Public Opinion Research's online database of disclosed polls; (b) FiveThirtyEight's archived poll tracker for the relevant Senate race; (c) individual pollster press releases and university report pages; (d) the state's primary daily newspapers (Riverside Courier, Metro Tribune), which reported on polls as they were released.
Inclusion criteria: A poll was included in the dataset if it (1) was released publicly (not just reported to have occurred), (2) covered the statewide Senate race, (3) was conducted within the 90-day window, and (4) reported a likely-voter or registered-voter sample. Polls that reported only a ballot test among all adults (without a likely-voter or registered-voter filter) were excluded on the grounds that they do not measure the electorate likely to vote.
Exclusion decisions: One additional poll, released by a firm with no prior track record and no methodology disclosure, was identified but excluded from the dataset. It showed Whitfield ahead by 11 points. The decision to exclude rather than down-weight this poll is defensible (a grade of F would have given it zero weight in any case), but students who encounter analogous polls in their own research should document the decision explicitly, as is done here.
Data recorded for each poll:
- Identification: pollster name; sponsor name; sponsor political affiliation (if any)
- Methodology: mode (live phone, IVR, online panel, text-to-web); cell phone inclusion (yes/no)
- Field dates: start date; end date; days to Election Day (calculated from end date)
- Samples: total sample size; likely voter sample size
- Results: Garza topline (%); Whitfield topline (%); undecided/other (%)
- Disclosure: weighting variables disclosed (yes/no); AAPOR Transparency Initiative membership (yes/no); likely voter screen methodology (disclosed, partial, not disclosed); question wording disclosed (yes/no); track record data available (yes/no)
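Students building the poll dataset programmatically may find it useful to fix these fields as an explicit record type. The following is a minimal sketch assuming Python 3.10+; the class name, field names, and types are illustrative choices, not part of the capstone materials.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class PollRecord:
    """One row of the poll dataset described in Section A.1."""
    pollster: str
    sponsor: str | None
    sponsor_affiliation: str | None   # party or ideological alignment, if any
    method: str                       # "live phone", "IVR", "online panel", "text-to-web"
    includes_cell: bool
    start_date: date
    end_date: date
    sample_size_total: int
    sample_size_lv: int | None        # likely voter subsample, if reported
    garza_pct: float
    whitfield_pct: float
    undecided_pct: float
    weighting_disclosed: bool
    aapor_ti_member: bool
    lv_screen: str                    # "disclosed", "partial", or "not disclosed"
    wording_disclosed: bool
    track_record_available: bool

    def days_to_election(self, election_day: date) -> int:
        """Days between the last field date and Election Day."""
        return (election_day - self.end_date).days
```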
A.2 Poll Quality Grading Rubric — Detailed Criteria
The ODA quality grading rubric applies the following criteria. Each criterion is scored as Met, Partial, or Not Met.
AAPOR Transparency Initiative membership (weight: 20% of grade):
- Met: Pollster is a current AAPOR TI member (verified via AAPOR's public member list).
- Not Met: Pollster is not a TI member or membership status is unverifiable.

Methodology transparency (weight: 25% of grade):
- Met: Pollster discloses the specific method (not just "telephone" but "live caller using RDD with cell phone supplements"), discloses whether cells, landlines, or both are called, and discloses field dates.
- Partial: Method is named but the cell/landline breakdown is missing or field dates are only approximate.
- Not Met: Method is not described beyond a general category, or is not disclosed at all.

Sample construction transparency (weight: 20% of grade):
- Met: Sample size is disclosed; likely voter screen methodology is described in sufficient detail to understand what criteria determine likely voter status (e.g., "respondents who reported voting in two of the last three general elections and stated definite intent to vote").
- Partial: Sample size is disclosed but the likely voter screen is described only generally ("those who said they were likely to vote").
- Not Met: Sample size is not disclosed, or the poll reports only a registered-voter sample with no explanation of why a likely-voter screen was not applied.

Weighting transparency (weight: 20% of grade):
- Met: Variables used for weighting are disclosed (e.g., age, gender, race/ethnicity, education, geography) and the targets for each weight variable are described (e.g., "weighted to 2020 Census estimates for race/ethnicity").
- Partial: Some weighting variables are disclosed but targets are not described.
- Not Met: No weighting information is disclosed.

Question wording disclosure (weight: 15% of grade):
- Met: Full question wording for the Senate ballot test question is published, including whether it was "generic" (just names) or included party labels, titles, or descriptions.
- Partial: Question is paraphrased but not quoted verbatim.
- Not Met: Question wording is not disclosed.
Grade conversion:
- All five criteria fully Met: Grade A
- Four criteria fully Met, one Partial: Grade A
- Three criteria fully Met, two Partial or one Not Met: Grade B+
- Two criteria fully Met, with one or more criteria Not Met: Grade B
- Majority of criteria Not Met: Grade C
- Methodology not disclosed: Grade F
Partisan sponsor adjustment: Any poll sponsored by a campaign, political party, party committee, super PAC, 501(c)(4) with documented political mission, or candidate-aligned advocacy organization receives a one-grade penalty applied after the base grade is assigned. A Grade B partisan poll becomes a Grade C for weighting purposes.
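A minimal sketch of this grade-conversion logic, including the partisan sponsor penalty, follows. The function names are illustrative, and the handling of a case the rubric leaves ambiguous (a poll with two criteria Met and three Not Met matches both the Grade B and Grade C rules) is an editorial choice: the majority-Not-Met rule is checked first.

```python
GRADE_LADDER = ["A", "B+", "B", "C", "F"]  # best to worst

def base_grade(scores: list[str], methodology_disclosed: bool = True) -> str:
    """Convert the five Met/Partial/Not Met criterion scores to a letter grade."""
    if not methodology_disclosed:
        return "F"
    met = scores.count("Met")
    partial = scores.count("Partial")
    not_met = scores.count("Not Met")
    if met == 5 or (met == 4 and partial == 1):
        return "A"
    if met == 3 and (partial == 2 or not_met == 1):
        return "B+"
    if not_met >= 3:                  # majority of criteria Not Met
        return "C"
    if met == 2:
        return "B"
    return "C"                        # conservative fallback for mixed cases

def final_grade(scores: list[str], partisan_sponsor: bool,
                methodology_disclosed: bool = True) -> str:
    """Apply the one-grade partisan sponsor penalty after the base grade."""
    grade = base_grade(scores, methodology_disclosed)
    if partisan_sponsor and grade != "F":
        idx = GRADE_LADDER.index(grade)
        grade = GRADE_LADDER[min(idx + 1, len(GRADE_LADDER) - 1)]
    return grade

# Example: a Grade B poll with a partisan sponsor becomes a Grade C
print(final_grade(["Met", "Met", "Partial", "Not Met", "Partial"], partisan_sponsor=True))
```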
A.3 Recency Adjustment — Rationale
The recency adjustment reflects a fundamental property of political polls: their evidentiary value about current vote intention decays as elections approach and conditions change. A poll conducted 88 days before Election Day measures vote intention at a point when most voters have not yet engaged with the race in any depth. By Day -12, the race is fully engaged, late-breaking developments have occurred, and both campaigns' advertising has shaped opinion substantially.
The specific recency multipliers (0.6 for polls 60+ days out, 0.8 for 30–60 days, 1.0 for 14–30 days, 1.2 for the final two weeks) are judgment calls, not derived from a mathematical formula. They are calibrated to give approximately twice as much weight to a recent high-quality poll as to an equally high-quality poll from the start of the window, while not entirely discarding the early polling signal.
Students who believe a different recency decay function is more appropriate may use it, provided they document their reasoning. Common alternatives include exponential decay functions and step functions with different cut points.
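A minimal sketch of the step function and the combined grade-times-recency weight follows. The numeric weights for letter grades are assumptions introduced for illustration (the rubric assigns letter grades but no numeric mapping), and the boundary-day assignments at 14, 30, and 60 days are one defensible reading of the overlapping ranges above.

```python
# Assumed numeric weights for letter grades -- illustrative only
GRADE_WEIGHTS = {"A": 1.0, "B+": 0.85, "B": 0.7, "C": 0.4, "F": 0.0}

def recency_multiplier(days_to_election: int) -> float:
    """Step-function recency multipliers from Section A.3."""
    if days_to_election >= 60:
        return 0.6
    if days_to_election >= 30:
        return 0.8
    if days_to_election >= 14:
        return 1.0
    return 1.2  # final two weeks

def combined_weight(grade: str, days_to_election: int) -> float:
    return GRADE_WEIGHTS[grade] * recency_multiplier(days_to_election)

# An A-rated poll in the final week vs. the same poll 80 days out
print(combined_weight("A", 5), combined_weight("A", 80))   # 1.2 vs 0.6
```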
A.4 House Effect Estimation
For pollsters with two or more polls in the dataset (Meridian Research: Polls 1, 7, 12; Coastal University: Polls 3, 10; National Political Survey: Polls 6, 14; DataPulse: Polls 8, 13), house effects are estimated by comparing each pollster's result to the field average at the time the poll was conducted.
Meridian Research house effect: Poll 1 (Day -86–88): field average at time ≈ G+1.5; Meridian showed G+2; difference = +0.5 (slightly Garza-favorable). Poll 7 (Day -55–58): field average ≈ G+2; Meridian showed G+3; difference = +1.0. Poll 12 (Day -28–30): field average ≈ G+1.2; Meridian showed G+1; difference = -0.2. Average Meridian house effect: approximately +0.4 toward Garza. This is within the expected range of sampling variability and does not represent a clear systematic bias.
Coastal University house effect: Consistent G+2 across both polls (Days -77–80 and -40–44). No detectable house effect relative to the nonpartisan consensus at those time points.
National Political Survey house effect: Consistent G+2 across both polls (Days -61–65 and -9–12). No detectable house effect.
DataPulse text-to-web: Poll 8 showed a tie at Day -51–52, while the nonpartisan consensus at that time was approximately G+2. Poll 13 at Day -18–19 showed G+1, consistent with the consensus. The Day -52 tie is approximately 2 points more Whitfield-favorable than the consensus — a potentially meaningful house effect for the text-to-web methodology, or possibly a true signal of movement that the live-caller polls missed.
Note: House effect estimates based on two polls are very imprecise. These estimates should be treated as directional indicators rather than precise measurements.
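The arithmetic behind these estimates is simple enough to verify directly; the sketch below reproduces the Meridian Research calculation from the numbers quoted above (all margins are Garza minus Whitfield, in points).

```python
# (Meridian margin, field average at the time), from Section A.4
meridian = [
    (2.0, 1.5),   # Poll 1, Day -86 to -88
    (3.0, 2.0),   # Poll 7, Day -55 to -58
    (1.0, 1.2),   # Poll 12, Day -28 to -30
]
diffs = [poll - field for poll, field in meridian]   # +0.5, +1.0, -0.2
house_effect = sum(diffs) / len(diffs)
print(f"Meridian house effect: {house_effect:+.1f} toward Garza")   # ~+0.4
```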
Section B: Demographic and Electoral Geography Data
B.1 Voter Registration File
Source: The state's official voter registration database, accessed via the state Secretary of State's office. The file used in this analysis represents the registration snapshot taken 90 days before Election Day (the start of the audit window).
Variables used: Registration date; party affiliation code (D, R, NPA, minor parties); county of residence; census tract of residence (to enable ACS demographic matching); voter history (flags for participation in the last four general elections).
Party registration categorization: All major-party affiliates (D or R codes) are categorized as Democratic or Republican. All non-affiliated and minor-party registrants are categorized as NPA/Other. Students should be aware that the composition of the "NPA/Other" category varies substantially across counties — in Vega County, a substantial share of NPA registrants are Independents with weak partisan leanings; in Redstone County, many NPA registrants are conservative-leaning voters who have not affiliated formally with the Republican Party.
B.2 Demographic Estimates
Source: U.S. Census Bureau, American Community Survey 5-year estimates (most recent available). County-level population totals and demographic composition (race/ethnicity, educational attainment, age distribution) are taken from Table B03002 (Hispanic/Latino origin by race), Table B15003 (educational attainment), and Table B01001 (age by sex).
Matching voter file to ACS: The voter registration file provides county of residence and census tract for each registered voter, enabling matching to ACS tract-level estimates. The demographic composition in Table 2 (registered voters) is derived from a combination of the voter file's party registration data and ACS demographic estimates applied at the tract level. This is an approximation, not a precise measurement — ACS demographics describe the full population, not just registered voters, and the demographic composition of registered voters may differ from the general population in ways that ACS cannot capture.
BISG disclaimer: Estimates of Hispanic/Latino composition that go below the county level (e.g., the Vega County sub-community analysis) rely in part on Bayesian Improved Surname Geocoding (BISG), a probabilistic method that uses surnames and census-tract demographics to estimate the racial/ethnic composition of a list of individuals. BISG is a standard analytical tool with documented accuracy rates, but it has known failure modes: it systematically misclassifies Latinos with non-Hispanic surnames (e.g., English-origin surnames from long-assimilated families), and it cannot distinguish between Latino subgroups (Mexican-American vs. Cuban-American) except through supplementary geocoding of country-of-origin populations in specific census tracts. All BISG-based estimates should be treated as approximations with uncertainty ranges of ±5–8 percentage points.
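For intuition, the core Bayes step behind BISG can be written in a few lines. The probabilities below are made-up illustrative values; real implementations use the Census Bureau's surname tables and full multi-category race/ethnicity distributions at the tract level.

```python
# P(surname | group): hypothetical values standing in for Census surname tables
p_surname_given_group = {"Hispanic": 0.90, "Non-Hispanic": 0.02}
# P(group | tract): hypothetical values standing in for ACS tract estimates
p_group_given_tract = {"Hispanic": 0.45, "Non-Hispanic": 0.55}

# Posterior: P(group | surname, tract) is proportional to
# P(surname | group) * P(group | tract)
unnormalized = {g: p_surname_given_group[g] * p_group_given_tract[g]
                for g in p_surname_given_group}
total = sum(unnormalized.values())
posterior = {g: p / total for g, p in unnormalized.items()}
print(posterior)   # e.g., {'Hispanic': 0.974, 'Non-Hispanic': 0.026}
```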
B.3 Historical Election Results
Source: State Secretary of State election results database, official canvass results for the following elections: 2016 presidential; 2018 Senate (Democratic win); 2020 presidential; 2022 Senate (Republican win). County-level results for all four elections are compiled into a single historical dataset.
Precinct-level data: For Millbrook County and Vega County — the two most analytically complex counties — precinct-level results from the 2020 presidential election are also used to identify within-county geographic patterns. Precinct results are matched to census tracts to enable demographic cross-tabulation. This granular analysis supports the swing-universe identification described in Section 3 of the main capstone document.
Two-party vs. all-party results: All historical election results used in this analysis are expressed as two-party (D vs. R) share — the total vote for the Democratic candidate divided by the combined Democratic plus Republican vote. Third-party and write-in votes are excluded from this calculation. This convention allows cleaner comparison across years with different third-party vote levels. Students should note that in 2022 a Libertarian candidate received approximately 2.8% of the statewide vote, which means the two-party calculation overstates both major-party candidates' performance relative to their actual share of all votes.
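The conversion itself is one line of arithmetic, illustrated below with placeholder 2022-style numbers (not the actual results in the capstone dataset).

```python
dem, rep, lib = 47.0, 50.2, 2.8            # percent of all votes cast
dem_two_party = dem / (dem + rep) * 100
print(f"D two-party share: {dem_two_party:.1f}%")   # 48.4%, vs. 47.0% of all votes
```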
Section C: Turnout Modeling
C.1 Baseline Turnout Rate Construction
Method: The baseline turnout rate for each county is calculated as the average of the county's turnout rates in the two most recent comparable elections (2018 and 2022 Senate elections), adjusted by ±1.5 percentage points based on current early-vote data relative to 2022 pace.
The choice of 2018 and 2022 as the comparison elections reflects the judgment that midterm Senate elections are more analogous to the 2024 race than presidential-year Senate elections (which inflate turnout due to the presence of the presidential contest). Students who prefer to use a three-election average (2016, 2018, 2022) or a four-election average (all elections since 2014) should document their rationale and check whether the alternative produces a substantially different baseline.
Turnout denominator: All turnout rates are expressed as a percentage of registered voters as of 90 days before Election Day (the registration snapshot date). This is the appropriate denominator for a model predicting how many registered voters will participate. Expressing turnout as a percentage of voting-age population or citizen voting-age population would require converting to a different scale and is not used in this model.
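The construction for a single county is sketched below; the turnout rates and the early-vote adjustment are placeholder values (the adjustment stays within the stated ±1.5-point band), not figures from the capstone dataset.

```python
t_2018, t_2022 = 0.58, 0.56      # turnout among registered voters, placeholders
early_vote_adj = 0.010           # from early-vote pace vs. 2022; bounded by +/-0.015
baseline_turnout = (t_2018 + t_2022) / 2 + early_vote_adj
print(f"Baseline turnout: {baseline_turnout:.1%}")   # 58.0%
```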
C.2 Candidate Vote Share Assumptions
Method: Baseline vote share assumptions for each county are derived from the average of the candidate's share in the 2018 and 2022 Senate elections, adjusted for (a) the demographic changes in the county since 2022 (based on registration change data) and (b) the current polling evidence, where county-level crosstabs are available.
Polling crosstabs limitation: Of the fourteen polls in the dataset, only four (the two Meridian Research polls and the two National Political Survey polls) provided county-level or regional crosstabs. These crosstabs are based on very small sub-samples and carry margins of error of ±8–12 percentage points. They are used directionally — to check whether county-level assumptions are broadly consistent with available polling — but not as precise inputs.
Confidence note: The vote share assumptions in Table 3 carry substantial uncertainty, particularly for Vega County (where Hispanic subgroup composition creates analytical complexity) and for the exurban counties (where the education-realignment trend makes recent elections somewhat more predictive than older ones). Students should treat the county-level assumptions as informed estimates rather than precise predictions.
C.3 Three-Scenario Specifications
Low turnout scenario (55% overall): Achieves the lower bound by applying a uniform 5-percentage-point reduction to all county turnout rates, then adjusting Riverside County downward an additional 2 points (reflecting the hypothesis that soft Democratic registrants in urban areas are more sensitive to low-enthusiasm environments) and adjusting Redstone County upward 1 point (reflecting the hypothesis that motivated rural Republican voters maintain high turnout even in low-enthusiasm environments). No changes to candidate vote share assumptions — the scenario is driven purely by turnout composition, not persuasion.
Medium turnout scenario (60% overall): Baseline model as described in Table 3 and Section C.1–C.2 above.
High turnout scenario (65% overall): Achieves the upper bound by applying a uniform 5-percentage-point increase to all county turnout rates, then adjusting Riverside County upward an additional 1 point and Vega County upward an additional 1 point beyond the uniform increase, reflecting the hypothesis that mobilization efforts and candidate enthusiasm generate above-average surges in these specific counties. No changes to candidate vote share assumptions.
Whitfield-win low-turnout scenario: As described in Table 6, this scenario requires simultaneously: overall turnout of 55% (the low scenario); Riverside County underperformance of 3+ points below the already-reduced low-scenario baseline; and Millbrook County breaking 5 points more Republican than the baseline. This compound scenario requires three things to go wrong for Garza simultaneously, which is why the medium scenario favors Garza, as do most modelers. But the scenario is internally consistent and historically achievable: 2022 Republican performance in many Sun Belt states exhibited exactly this pattern of compound Democratic underperformance.
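To see how a scenario specification of this kind rolls up to a statewide margin, the sketch below applies the low-scenario turnout shifts to illustrative county inputs. County names match the capstone, but the registration counts, turnout rates, and vote shares are placeholders rather than Table 3 values, and the helper function is illustrative.

```python
counties = {
    #            (registered, turnout, Garza two-party share) -- placeholders
    "Riverside": (600_000, 0.62, 0.58),
    "Millbrook": (350_000, 0.60, 0.49),
    "Vega":      (300_000, 0.55, 0.54),
    "Redstone":  (250_000, 0.66, 0.38),
}

def scenario_margin(counties, uniform_shift=0.0, county_shifts=None):
    """Garza-minus-Whitfield statewide margin (points) under turnout shifts."""
    county_shifts = county_shifts or {}
    net = total = 0.0
    for name, (reg, turnout, share) in counties.items():
        t = turnout + uniform_shift + county_shifts.get(name, 0.0)
        votes = reg * t
        net += votes * (2 * share - 1)    # county margin contribution
        total += votes
    return net / total * 100

# Baseline vs. the low-turnout scenario from Section C.3:
# uniform -5 points, Riverside -2 more, Redstone +1
low = scenario_margin(counties, -0.05, {"Riverside": -0.02, "Redstone": 0.01})
print(f"Baseline: {scenario_margin(counties):+.1f}  Low: {low:+.1f}")
```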
Section D: Media and Advertising Data
D.1 Broadcast Television Advertising Data
Source: FCC public inspection files (accessible via fcc.gov/media/public-inspection-file-listings) and AdImpact's publicly available tracking summaries. FCC files document all political advertising purchased on broadcast television stations, including buyer name, program, airdate, airtimes, and rate paid. AdImpact tracks estimated gross rating points and total spend estimates across cable and broadcast.
Note on digital advertising data: Digital advertising — Facebook, YouTube, Google Search, Connected TV, and programmatic display — is substantially less transparent than broadcast television. Meta's Ad Library provides advertiser-level disclosure of active and recently completed political ads, but spend estimates are not provided for individual placements and geographic targeting is disclosed only at a broad level. Google's Political Ads Transparency Center provides more granular spend data but has coverage gaps. The digital advertising figures in Table 4 are estimates derived from AdImpact's digital tracking methodology, which aggregates Meta Ad Library disclosures with proprietary panel data on ad exposure. Uncertainty in digital spend estimates is approximately ±15–20%.
Geographic allocation: The geographic distribution of broadcast spending (described in Section 5 of the capstone document) is derived from identifying which media markets' stations received the advertising buys. The Millbrook County "market" corresponds to the designated market area (DMA) covering Millbrook County and adjacent communities. Because DMA boundaries do not perfectly align with county boundaries, some Millbrook County spending reaches adjacent exurban communities, and some spending allocated to the "Vega County market" may reach portions of Riverside County.
D.2 Message Analysis Methodology
Sources: Analysis of advertising messaging draws on: (a) candidate campaign ad archives published on official campaign YouTube channels; (b) FCC public inspection file creative titles and descriptions; (c) independent ad trackers maintained by the Wesleyan Media Project, which catalogs political advertising content and provides issue-coding for broadcast spots; (d) news coverage describing ad content where direct viewing was not possible.
Coding framework: Ads are characterized along four dimensions: (1) tone (positive/contrast/attack), (2) primary issue focus (economy, healthcare, immigration/border, public safety, representation, other), (3) target audience (suburban moderates, base mobilization, Hispanic/Latino community, other), and (4) format (biographical/character, policy position, opponent attack, testimonial). This coding framework is adapted from the Wesleyan Media Project's standard political advertising codebook.
D.3 Fact-Check Sources
All fact-check ratings cited in Section 5 of the capstone document are derived from the following organizations: the Riverside Courier's fact-checking desk; PolitiFact's state-level bureau; FactCheck.org's coverage of the race. Where multiple fact-checkers have reviewed the same claim, ratings are reported from the most detailed and best-documented review. Discrepancies between fact-checker ratings for the same claim are noted.
Section E: Campaign Finance Data
E.1 FEC Filing Data
Source: Federal Election Commission data accessed via FEC.gov, Open Secrets (OpenSecrets.org), and the Campaign Finance Institute's database. All figures reflect official filed reports through the most recent pre-election deadline (typically the 15-day pre-election report). Late contributions and last-minute spending may not be captured.
Key filing types consulted:
- Form 3 (campaign committee receipts and disbursements): Used for candidate campaign totals, itemized contributions, and operating expenditures.
- Form 3X (PAC receipts and disbursements, including super PACs): Used for Senate Majority PAC and American Leadership Fund data.
- Form 5 (independent expenditures by persons other than political committees): Not applicable in this race.
- Form 24 (independent expenditure reports): Used for tracking independent expenditures (IEs) by all committees. IEs aggregating $10,000 or more must be reported within 48 hours; in the final 20 days before the election, the threshold drops to $1,000 with 24-hour reporting.
State-level disclosure: The fictional state in this capstone has its own campaign finance disclosure portal for state-registered committees and expenditures. The Heritage Alliance and Progress Now 501(c)(4)s are registered with the state as "political committees" under state law, which requires disclosure of expenditures but not donor lists. This is a common regulatory structure in U.S. states.
E.2 Small-Dollar Donor Calculation
Method: The small-dollar donor percentage is calculated from itemized vs. unitemized contributions as reported on Form 3. Under FEC rules, contributions from a donor whose giving aggregates to $200 or less over the election cycle do not require itemization; once a donor's aggregate exceeds $200, contributions must be itemized (donor name, address, occupation, employer). The "small-dollar" percentage is therefore an approximation: it represents the share of total funds raised that came through unitemized contributions, which are dominated by donors giving $200 or less but may also include early contributions from donors whose cumulative giving later crossed the itemization threshold.
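The calculation itself is a simple ratio, sketched below with placeholder dollar figures rather than either campaign's actual filings.

```python
unitemized = 4_200_000        # contributions aggregating $200 or less per donor
total_receipts = 11_000_000   # all Form 3 receipts for the period
small_dollar_pct = unitemized / total_receipts * 100
print(f"Small-dollar share: {small_dollar_pct:.1f}%")   # 38.2%
```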
ActBlue/WinRed processing: Both campaigns process small-dollar online contributions through ActBlue (Garza) and WinRed (Whitfield). These platforms aggregate contributions and report them to the FEC on behalf of campaigns. The frequency of online fundraising solicitations can be tracked through public email list subscriptions, though this approach requires a systematic subscription and archiving methodology that is beyond the scope of this capstone.
E.3 Dark Money Documentation
Source: OpenSecrets' 501(c)(4) tracker; the Campaign Finance Institute's outside spending database; news investigations by the Riverside Courier and the Metro Tribune into the Heritage Alliance's donor network.
Methodological note on "connected to fossil fuel interests" claims: The characterization of the Heritage Alliance's donor network as having connections to fossil fuel industry interests is derived from: (a) FEC records showing organizational transfers from two 501(c)(4)s (Americans for Responsible Energy Policy and the National Resource Enterprise Forum) to Heritage Alliance in the current cycle; (b) IRS Form 990 disclosures from prior years for both organizations, which show substantial revenue from a small number of major donors; (c) board membership overlap between the two organizations and industry trade associations documented by a news investigation; and (d) the stated policy positions of all three organizations, which consistently favor fossil fuel development. This chain of evidence is suggestive but not definitive regarding specific donor identities, and should be represented as such.
Section F: Forecasting Methodology
F.1 Component Weight Justification
The four-component forecast model assigns weights as follows: polling (55%), fundamentals (20%), demographic/structural (15%), campaign-specific (10%). These weights reflect the following reasoning:
Polling (55%): In the final two weeks of a well-polled Senate race, polling is the primary source of information about current vote intention. A 55% weight reflects substantial but not overwhelming confidence in the polling signal, recognizing that polling averages in competitive Senate races have shown mean absolute errors of approximately 2–3 points in recent cycles.
Fundamentals (20%): Economic and political fundamentals correctly predict the winner in roughly 50–55% of competitive races: better than chance, but substantially less predictive than late-cycle polling. A 20% weight reflects this moderate predictive value.
Demographic/structural (15%): Long-term demographic trends are already partly captured in the polling (which reflects the current electorate) and in the historical election results used to construct the baseline. The incremental predictive value of structural factors beyond polling and fundamentals is modest, hence the 15% weight.
Campaign-specific (10%): Campaign quality, resource deployment, and ground game effects have documented but typically small effects on electoral outcomes — rarely more than 1–2 percentage points in competitive Senate races, often less. The 10% weight reflects this small but real marginal effect.
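The blend itself is a weighted sum. In the sketch below, only the weights come from the capstone; the per-component margin estimates are placeholders, chosen so the blend lands near the document's +1.7 point estimate.

```python
components = {
    #               (weight, Garza-minus-Whitfield margin estimate, points)
    "polling":      (0.55, 1.8),
    "fundamentals": (0.20, 1.0),
    "structural":   (0.15, 2.5),
    "campaign":     (0.10, 1.5),
}
assert abs(sum(w for w, _ in components.values()) - 1.0) < 1e-9
point_estimate = sum(w * m for w, m in components.values())
print(f"Integrated point estimate: Garza {point_estimate:+.2f}")   # ~+1.7
```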
F.2 Probability Conversion Method
The point estimate (Garza +1.7) is converted to a win probability using a t-distribution with 6 degrees of freedom and a scale parameter of 4.5. The t-distribution (rather than a normal distribution) reflects the heavier-tailed nature of actual polling error distributions in competitive Senate races — large errors (4+ points) occur more frequently than a normal distribution would predict.
Specifically: the probability that the true margin exceeds 0 (Garza wins) given a t-distribution centered at +1.7 with scale 4.5 and 6 degrees of freedom is approximately 64%. This calculation can be reproduced in Python as follows:
```python
from scipy import stats

# Parameters
point_estimate = 1.7   # Garza's lead in percentage points
scale = 4.5            # scale parameter (analogous to standard deviation)
df = 6                 # degrees of freedom for the t-distribution

# Probability of Garza winning (true margin > 0).
# We want P(X > 0) where X = point_estimate + scale * T and T ~ t(df).
# Equivalently: P(T > -point_estimate / scale) with df degrees of freedom.
t_statistic = point_estimate / scale
win_probability = stats.t.sf(-t_statistic, df)   # same as stats.t.cdf(t_statistic, df)
print(f"Garza win probability: {win_probability:.2%}")
# Expected output: ~64%
```
Students who use the forecasting methods from Chapter 28 should be able to derive a similar calculation using the Bayesian updating framework developed there.
Section G: Analysis Steps Checklist
This checklist documents the complete sequence of analytical steps performed to produce the capstone document. Students using Option A should be able to reproduce each step from the data described in this appendix. Students using Options B or C should adapt this checklist to document their own analysis.
- [ ] Compiled all public polls from the 90-day window using AAPOR database and news archives
- [ ] Recorded all metadata variables for each poll (method, dates, sample size, sponsor, disclosures)
- [ ] Applied quality grading rubric to each poll
- [ ] Applied partisan sponsor adjustment where applicable
- [ ] Computed combined weight (grade × recency) for each poll
- [ ] Computed weighted polling average for both candidates
- [ ] Estimated house effects for multi-poll pollsters
- [ ] Identified trend patterns and linked to plausible causal factors
- [ ] Downloaded voter registration file snapshot from state SoS
- [ ] Downloaded ACS 5-year demographic estimates for all counties
- [ ] Constructed county-level demographic table (Table 2)
- [ ] Downloaded historical election results for 2016, 2018, 2020, 2022
- [ ] Converted all historical results to two-party vote share
- [ ] Identified county political trend patterns (improvement/decline by party)
- [ ] Applied BISG-assisted analysis for Vega County Latino subgroup composition
- [ ] Identified and sized the swing universe by county
- [ ] Ran counterfactual scenario (Hispanic turnout +5%) with step-by-step arithmetic
- [ ] Computed baseline turnout rates using 2018/2022 average methodology
- [ ] Constructed turnout projection table (Table 3) with county-level assumptions
- [ ] Computed net vote estimate for each county in baseline scenario
- [ ] Developed Low, Medium, High scenario specifications
- [ ] Computed margins for each scenario
- [ ] Downloaded early vote data from state portal
- [ ] Calculated early vote party registration composition
- [ ] Compared early vote pace to 2022 (percentage above/below)
- [ ] Compiled ad spending from FCC filings and AdImpact tracker
- [ ] Organized spending by entity and medium (Table 4)
- [ ] Coded advertising messages for tone, issue focus, target audience, format
- [ ] Compiled fact-check tracker from fact-checker archives
- [ ] Downloaded FEC filings for both campaigns (Form 3) and major super PACs (Form 3X)
- [ ] Calculated small-dollar donor percentages from itemized/unitemized data
- [ ] Compiled outside spending (IE and electioneering communication) by entity
- [ ] Documented dark money entity connections using available public records
- [ ] Computed integrated forecast using four-component model
- [ ] Converted point estimate to win probability using t-distribution
- [ ] Identified key sensitivity assumptions and threshold values for each
- [ ] Analyzed polling sample demographic composition (representation gaps)
- [ ] Documented voter access concerns with supporting evidence
- [ ] Applied Adaeze's equity checklist to the audit document itself
- [ ] Wrote all sections of the audit document
- [ ] Wrote conclusions section answering all six audit questions explicitly
- [ ] Completed this data appendix
End of Capstone 1 Data Appendix