Case Study 5.1: When the Voter File Lies — Data Quality in Practice
Background
Nadia Osei has been with the Garza campaign for two weeks when she discovers that something is wrong with the voter file. It is not a dramatic discovery — no single outlier catches her eye, no alarm fires. It is the slow accumulation of small inconsistencies that eventually adds up to a problem she cannot ignore.
It starts with age. Running a simple describe() on the voter file for the Garza-Whitfield state, she notices that the minimum age in the file is 14. The voting age is 18. Fourteen-year-olds cannot be registered voters; this is an impossible value. She runs a count:
age_impossible = voters[voters['age'] < 18]
print(f"Voters with age < 18: {len(age_impossible)}")
There are 847 records with ages below 18. Some of these, she suspects, are data entry errors in the source records: a "1967" birth year entered as "2007," producing an apparent teenager when the person is actually in their late fifties. Others may be genuine registration errors. A small number may be preregistered youth who appear in the file but are not yet eligible to vote.
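These categories can be separated with a quick bucket check. A minimal sketch on toy data (the real file is far larger, and the 16-17 preregistration cutoff is an assumption here, since preregistration rules vary by state):

```python
import pandas as pd

# Toy stand-in for the voter file (the real file has over a million rows).
voters = pd.DataFrame({
    'voter_id': [1, 2, 3, 4, 5],
    'age':      [14, 17, 17, 42, 59],
})

# Split under-18 records into two rough buckets: ages 16-17, which could
# plausibly be preregistered youth, and ages below 16, which are almost
# certainly data entry errors. The cutoff is an assumption for illustration.
under_18 = voters[voters['age'] < 18]
maybe_prereg = under_18[under_18['age'].between(16, 17)]
likely_errors = under_18[under_18['age'] < 16]

print(f"Under 18 total: {len(under_18)}")
print(f"Possible preregistered youth (16-17): {len(maybe_prereg)}")
print(f"Likely entry errors (<16): {len(likely_errors)}")
```

Buckets like these turn a single alarming count into an actionable triage list.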
The age problem leads her to look more carefully at other fields.
The Cascade of Data Quality Issues
Issue 1: Inconsistent race/ethnicity coding.
The race_ethnicity column uses different codes from what appears to be multiple data sources. Some records show "Hispanic," others show "Latino," others show "Hispanic/Latino," and a small number show "H/L" — apparently abbreviated from a legacy database. For analytical purposes, these should all map to the same category, but an automated value_counts() would split them into four separate buckets.
# What Nadia found:
print(voters['race_ethnicity'].value_counts())
# Output includes: Hispanic (45,221), Latino (3,104), Hispanic/Latino (891), H/L (67)
Issue 2: Zip codes in the state column.
A subset of records, about 1.2%, has a five-digit number in the state column that looks like a zip code. This happened because a data integration step misaligned columns when merging records from different county systems. The voters in question live in the state, but their records have corrupted geographic fields.
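A pattern check makes these records easy to flag. A minimal sketch on toy data, assuming that any state value consisting of exactly five digits is a misaligned zip code:

```python
import pandas as pd

# Toy stand-in: two records where a zip code leaked into `state`.
voters = pd.DataFrame({
    'voter_id': [1, 2, 3, 4],
    'state':    ['TX', '78701', 'TX', '73301'],
})

# A state value that is exactly five digits is almost certainly a
# misaligned zip code from the county-merge step.
looks_like_zip = voters['state'].astype(str).str.fullmatch(r'\d{5}')
voters['flag_state_ziplike'] = looks_like_zip

print(f"Zip-like state values: {looks_like_zip.sum()} "
      f"({looks_like_zip.mean():.1%} of file)")
```

Note the flag-not-delete pattern, consistent with the protocol Nadia adopts below.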
Issue 3: Duplicate voter IDs.
Running a uniqueness check on voter_id reveals 2,341 duplicate IDs. These likely represent the same individual registered in two different counties — a legal impossibility in most states, but a data reality that happens through name changes, address moves processed at different times, and legacy system merge failures.
dup_ids = voters[voters.duplicated(subset=['voter_id'], keep=False)]
print(f"Records with duplicate voter_id: {len(dup_ids)}")
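Resolving such duplicates is a judgment call rather than a mechanical fix. One defensible sketch, on toy data with a hypothetical county column, is to treat the most recent registration as current; this assumes reg_date reliably indicates recency, which the legacy-merge history in this case makes uncertain:

```python
import pandas as pd

# Toy stand-in: voter 101 appears twice, once per county.
# The `county` column is hypothetical, for illustration only.
voters = pd.DataFrame({
    'voter_id': [101, 101, 202],
    'county':   ['Travis', 'Bexar', 'Travis'],
    'reg_date': ['2019-03-01', '2023-06-15', '2021-01-10'],
})
voters['reg_date'] = pd.to_datetime(voters['reg_date'])

# Assumption: the most recent registration is the current one.
deduped = (
    voters.sort_values('reg_date')
          .drop_duplicates(subset=['voter_id'], keep='last')
)
print(deduped[['voter_id', 'county']])
```

Any such rule should itself be documented, since a different rule (e.g., keep the record with the richer vote history) would yield a different universe.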
Issue 4: Vote history inconsistencies.
Some voters have vote_history_2022 == 1 but vote_history_2020 == 0 and vote_history_2018 == 0. While it is entirely possible to vote in 2022 as a first-time voter, the pattern is worth examining. More concerning: 34 voters show vote_history_2022 == 1 with a registration date after November 2022, meaning they are listed as having voted before they were registered. This is another data integrity failure, not a miracle of time travel.
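The first-time-voter pattern can be pulled with a single boolean mask. A minimal sketch on toy data:

```python
import pandas as pd

# Toy stand-in with the vote-history columns from the file.
voters = pd.DataFrame({
    'voter_id':          [1, 2, 3],
    'vote_history_2018': [1, 0, 0],
    'vote_history_2020': [1, 0, 0],
    'vote_history_2022': [1, 1, 0],
})

# Voters showing 2022 participation with no prior history: possibly
# genuine first-time voters, possibly corrupted history fields.
first_time_2022 = (
    (voters['vote_history_2022'] == 1) &
    (voters['vote_history_2020'] == 0) &
    (voters['vote_history_2018'] == 0)
)
print(f"2022-only voters: {first_time_2022.sum()}")
```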
Nadia's Data Cleaning Protocol
Nadia develops a systematic cleaning protocol that she documents thoroughly, because every decision she makes will affect every analysis that follows:
Step 1: Flag, don't delete. Rather than removing problematic records, she creates boolean flag columns that mark each type of data quality issue. This preserves the original data while making the problems visible:
import pandas as pd  # needed for the date parsing below

voters['flag_age_impossible'] = voters['age'] < 18
voters['flag_dup_voter_id'] = voters.duplicated(subset=['voter_id'], keep=False)
voters['flag_reg_after_vote'] = (
    (voters['vote_history_2022'] == 1) &
    (pd.to_datetime(voters['reg_date'], errors='coerce') > pd.Timestamp('2022-11-08'))
)
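With the flags in place, a short summary shows how large each problem is and how often the problems overlap in the same record. A minimal sketch on toy data already carrying Step 1's flag columns:

```python
import pandas as pd

# Toy stand-in with the boolean flag columns from Step 1.
voters = pd.DataFrame({
    'voter_id':            [1, 2, 3, 4],
    'flag_age_impossible': [True, False, False, False],
    'flag_dup_voter_id':   [True, True, False, False],
    'flag_reg_after_vote': [False, False, True, False],
})

flag_cols = [c for c in voters.columns if c.startswith('flag_')]

# Per-issue counts, plus records affected by more than one issue.
per_issue = voters[flag_cols].sum()
multi_issue = (voters[flag_cols].sum(axis=1) > 1).sum()

print(per_issue)
print(f"Records with multiple issues: {multi_issue}")
```

A table like this is also the natural artifact to attach to the Step 3 documentation.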
Step 2: Standardize categoricals. She creates a mapping dictionary for race/ethnicity harmonization and applies it:
ethnicity_map = {
    'Hispanic': 'Hispanic/Latino',
    'Latino': 'Hispanic/Latino',
    'H/L': 'Hispanic/Latino',
    'Hispanic/Latino': 'Hispanic/Latino',
    # ... other mappings
}
voters['race_ethnicity_clean'] = voters['race_ethnicity'].map(ethnicity_map).fillna(
    voters['race_ethnicity']
)
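A quick value_counts() on the cleaned column confirms the harmonization worked: the four variants collapse into one bucket while unmapped values pass through unchanged. A minimal sketch on toy data:

```python
import pandas as pd

# Toy stand-in with the four Hispanic/Latino variants plus one other value.
voters = pd.DataFrame({
    'race_ethnicity': ['Hispanic', 'Latino', 'H/L', 'Hispanic/Latino', 'White'],
})

ethnicity_map = {
    'Hispanic': 'Hispanic/Latino',
    'Latino': 'Hispanic/Latino',
    'H/L': 'Hispanic/Latino',
    'Hispanic/Latino': 'Hispanic/Latino',
}
voters['race_ethnicity_clean'] = (
    voters['race_ethnicity'].map(ethnicity_map).fillna(voters['race_ethnicity'])
)

# Unmapped values (here, 'White') fall back to their original label.
print(voters['race_ethnicity_clean'].value_counts())
```

The fillna() fallback is what keeps categories outside the mapping dictionary from silently becoming missing values.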
Step 3: Document everything. For each cleaning decision, she writes a comment specifying what she did, why, and what the alternative would have been:
# Age: flagging but NOT removing records with age < 18.
# Reason: some may be preregistered youth; others are data errors.
# These records are excluded from turnout modeling (min_age_filter)
# but preserved in the full file for transparency.
# Total affected: 847 records (0.07% of file)
The Analytical Implications
These data quality issues are not cosmetic. Each one, if unaddressed, would corrupt downstream analysis in specific ways:
Age errors distort the age distribution analysis and — more seriously — could cause a turnout model trained on this data to learn spurious relationships between apparent age and voting behavior.
Ethnicity inconsistency would cause a groupby('race_ethnicity') analysis to split a single demographic group into four distinct buckets, dramatically underestimating the size and characteristics of the Hispanic/Latino community in the state — which is precisely the community whose mobilization Garza's campaign most depends on.
Duplicate voter IDs would double-count some individuals in any count-based analysis and inflate the apparent size of the registered voter universe.
Vote history inconsistencies would corrupt the turnout model's training labels — producing a model that learned from incorrect "voted" designations — and would inflate estimated turnout rates.
None of these individually would sink the campaign's analytics program. Together, uncleaned, they would produce a quietly distorted picture of the electorate that no amount of subsequent analytical sophistication could repair.
The Broader Lesson
The lesson Nadia takes from this episode is not that voter file data is unreliable. It is that all data has quality issues, and the analyst's job is to find them, characterize them, and handle them transparently. The campaigns that trust vendor-supplied data without checking are trusting a product that has been assembled from dozens of inconsistent county databases, processed through multiple automated cleaning steps, and delivered with error rates that are rarely disclosed.
The rule Nadia writes in her analysis log at the end of the week: "Never describe the data before looking at it. Never model the data before cleaning it. Never publish the analysis before documenting the cleaning."
It is inelegant as maxims go. But it is accurate.
Discussion Questions
1. Nadia's protocol flags problematic records rather than deleting them. What are the advantages and risks of this approach compared to deletion? Under what circumstances might deletion be preferable?
2. The ethnicity inconsistency problem (Hispanic vs. Latino vs. H/L) is not technically a "data error" — each value represents a legitimate way to describe a demographic group. What does this tell you about the nature of data quality in political datasets? Who defines what the "correct" value is?
3. Duplicate voter IDs can arise from legitimate cross-county registrations, name changes, and address changes — not just data entry errors. How would you determine which duplicates represent genuine data problems vs. which represent legitimate multiple records for the same person?
4. If Nadia had not run these quality checks and had proceeded with the raw voter file, which of the four issues would have had the largest analytical impact on the campaign's Latino mobilization strategy? Justify your answer.
5. Data quality documentation (Step 3 in Nadia's protocol) takes time that campaigns rarely feel they have. Make an argument for why documentation is analytically as important as the cleaning itself. What would be lost if the cleaning decisions were undocumented?