Case Study 2: When "Clean" Data Lies — The Danger of Over-Cleaning
Tier 2 — Attributed Findings: This case study is inspired by real, documented examples of how data cleaning and preprocessing decisions have led to biased outcomes in healthcare, criminal justice, and public health. The Obermeyer et al. (2019) study on racial bias in healthcare algorithms was published in Science and is cited directly. The specific scenario involving "Lakewood County" and its characters is fictional and composite, but the types of bias described are well-documented in the data science and algorithmic fairness literature.
The Story
Lakewood County's Department of Public Health was proud of its data. Over the past five years, the department had invested heavily in modernizing its data infrastructure — replacing paper forms with digital intake systems, standardizing reporting protocols, and hiring a small analytics team. When the department received a federal grant to study COVID-19 vaccination equity across the county, it felt ready.
The lead analyst, Dr. Amara Obi, had a clear mandate: identify which communities had the lowest vaccination rates, understand why, and recommend targeted interventions. She had access to vaccination records from all county clinics, demographic data from the Census Bureau, and social determinants of health data from the county's community health assessment.
The data was messy — of course it was. But Dr. Obi was thorough. She built a careful cleaning pipeline, documented every step, and produced a dataset she was confident in. Her final report identified three neighborhoods as priority areas for intervention, and the county allocated $2 million in outreach funding based on her recommendations.
Six months later, a community health worker named David Herrera raised an alarm. The neighborhoods where his organization worked — predominantly Hispanic communities with large numbers of recent immigrants — hadn't been identified as priority areas. But from David's ground-level view, vaccination rates in these communities were far lower than the report suggested.
Dr. Obi went back to her data. What she found haunted her.
The Cleaning Decisions
Let's walk through Dr. Obi's cleaning pipeline and identify where things went wrong. The errors weren't careless — they were the kinds of reasonable-sounding decisions that any analyst might make. That's what makes them dangerous.
Decision 1: Dropping Records with Missing Addresses
The vaccination records included patient addresses, which Dr. Obi needed to assign each record to a neighborhood. About 8% of records had missing or incomplete address fields. Dr. Obi investigated and found two main causes:
- Some clinic intake forms had a non-mandatory address field, and rushed staff sometimes skipped it.
- Some patients provided a P.O. box or a shelter address that couldn't be mapped to a neighborhood.
Dr. Obi's decision: drop the 8% of records with missing addresses. Her rationale was clear and documented: "Cannot assign to neighborhood without valid address. Removing 8% of records. Impact: minor."
The problem she didn't see: the patients with missing addresses were not a random sample. They disproportionately included:
- Unhoused individuals, who used shelter addresses or left the field blank
- Undocumented immigrants, who were sometimes reluctant to provide a home address
- Residents of informal housing (shared rooms, temporary arrangements) whose addresses didn't match standard formats
By dropping these records, Dr. Obi didn't just lose 8% of her data. She lost 8% of her data in a way that systematically under-counted vaccinations among the most vulnerable populations — and simultaneously removed evidence that these populations existed in the dataset at all.
Decision 2: Standardizing Race/Ethnicity Categories
The raw data had 47 distinct values in the race/ethnicity field, including free-text entries. Dr. Obi standardized them into six categories aligned with Census Bureau reporting: White, Black or African American, Hispanic or Latino, Asian, Other, and Unknown/Not Reported.
Reasonable. Standard. And deeply problematic.
The problem: the "Other" category absorbed 11% of records. Within that 11%:
- 34% were patients who had written in specific ethnic identities (Hmong, Somali, Marshallese, Maya) that didn't map cleanly to the Census categories
- 22% were multiracial individuals
- 44% were records where staff had entered a single word or abbreviation that didn't clearly fit the standard categories
The "Unknown/Not Reported" category absorbed another 9%.
When Dr. Obi analyzed vaccination rates by race/ethnicity, the "Other" and "Unknown" categories showed vaccination rates close to the county average — unremarkable. But hidden within "Other" were communities with dramatically different vaccination rates. The Marshallese community, for example, had a vaccination rate of 31% — one of the lowest in the county. The Somali community was at 38%. But lumped into "Other" alongside groups with higher rates, their crisis was invisible.
Decision 3: Removing "Implausible" Vaccination Dates
Some vaccination records had dates that seemed wrong — dates in the future (data entry errors), dates before the vaccines were available (probably records from the wrong table), and dates from December 2020 that were before the official public rollout in the county but after the earliest healthcare worker vaccinations began.
Dr. Obi set a validation rule: vaccination dates must fall between January 1, 2021 (the start of the county's public vaccination program) and the current date. Anything outside that window was flagged as an error and removed.
The problem: The county had conducted a small number of vaccinations in late December 2020 for frontline healthcare workers — part of the initial Phase 1a allocation. By setting January 1, 2021 as the lower bound, Dr. Obi deleted these legitimate records. More importantly, she also deleted records from several pop-up clinics held in December 2020 that specifically targeted high-risk essential workers in the meatpacking and agricultural industries — workers who were predominantly Hispanic and immigrant.
It was only 340 records. But those 340 records represented some of the earliest and most proactive vaccination efforts in the county's most vulnerable communities.
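A safer pattern is to flag out-of-window dates for review rather than delete them outright, with the known pre-rollout window (Phase 1a and the December pop-up clinics) called out explicitly. The sketch below assumes hypothetical column names (`patient_id`, `vax_date`) and illustrative dates; it is not Dr. Obi's actual pipeline.

```python
import pandas as pd

# Illustrative records; column names are assumptions, not the real schema.
records = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "vax_date": pd.to_datetime(
        ["2020-12-18", "2021-03-02", "2019-06-01", "2026-01-01"]
    ),
})

PROGRAM_START = pd.Timestamp("2021-01-01")   # county public rollout
PHASE_1A_START = pd.Timestamp("2020-12-14")  # earliest legitimate doses
TODAY = pd.Timestamp("2024-01-01")           # fixed here for reproducibility

# Flag, don't delete: out-of-window rows get a status a human can review.
def date_status(d):
    if PHASE_1A_START <= d < PROGRAM_START:
        return "review: pre-rollout (possible Phase 1a / pop-up clinic)"
    if PROGRAM_START <= d <= TODAY:
        return "valid"
    return "error: implausible date"

records["date_status"] = records["vax_date"].map(date_status)
print(records[["patient_id", "date_status"]])
```

Under this rule, the December 2020 pop-up clinic records would have surfaced as a review queue instead of vanishing silently.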
Decision 4: Deduplicating on Name + Date of Birth
Dr. Obi needed to count unique individuals vaccinated, not just doses administered. She used a composite key of first name, last name, and date of birth to identify duplicate records.
The problem: In several of the county's immigrant communities, naming conventions differed from the Western first-name/last-name format. Some Vietnamese patients had family names that were extremely common (Nguyen, Tran, Le), leading to a higher rate of false matches. Some Hispanic patients were recorded inconsistently — sometimes with both maternal and paternal surnames, sometimes with just one. Some records from community health clinics used phonetic spellings of names from languages without Latin scripts.
The result: Dr. Obi's deduplication algorithm had a higher false-positive rate (incorrectly merging distinct patients) in non-white communities, which meant it under-counted unique individuals vaccinated in those communities. The error was small in percentage terms — perhaps 1-2% — but it consistently went in the same direction: undercounting minority patients.
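Both failure modes are easy to reproduce on toy data. The sketch below (hypothetical names and columns, not the county's records) shows a false merge on a common Vietnamese surname and a missed merge on an inconsistently recorded double surname:

```python
import pandas as pd

# Illustrative dose records; names and columns are invented for this example.
doses = pd.DataFrame({
    "first":  ["Minh", "Minh", "Maria", "Maria"],
    "last":   ["Nguyen", "Nguyen", "Garcia Lopez", "Garcia"],
    "dob":    ["1980-05-01", "1980-05-01", "1975-02-10", "1975-02-10"],
    "clinic": ["A", "B", "C", "C"],
})

# Naive dedup on name + DOB: two distinct Minh Nguyens born the same day
# collapse into one "unique" patient (false merge), while one Maria recorded
# with and without her second surname counts as two people (missed merge).
naive_unique = doses.drop_duplicates(subset=["first", "last", "dob"])
print(len(naive_unique))
```

Mitigations worth considering include adding a stable identifier (medical record number, normalized address) to the key where available, and auditing the merge rate by demographic group so that a systematically higher false-positive rate in one community is visible.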
Decision 5: Removing Clinics with "Incomplete" Reporting
Three community health clinics had been flagged by IT as having "incomplete data" — their records were sometimes submitted days late and occasionally had formatting inconsistencies. Dr. Obi excluded these clinics from the analysis to ensure "data quality."
The problem: These three clinics were community health centers that specifically served uninsured and underinsured populations. They were chronically understaffed, which explained both the reporting delays and the formatting issues. Their data wasn't lower quality in terms of accuracy — it was lower quality in terms of timeliness and formatting. The actual vaccination records were valid.
By excluding them, Dr. Obi removed the primary data source for three of the county's lowest-income neighborhoods. Her analysis literally could not see these communities.
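One way to avoid conflating timeliness with accuracy is to score them separately per clinic, and only exclude data that actually fails validation. A minimal sketch, with invented clinic names and columns:

```python
import pandas as pd

# Illustrative per-record quality signals; all values are made up.
records = pd.DataFrame({
    "clinic": ["Eastside", "Eastside", "Downtown", "Downtown"],
    "days_late": [5, 7, 0, 1],                       # submission delay
    "passes_validation": [True, True, True, False],  # record accuracy
})

# Score timeliness and validity separately: a clinic can be slow but
# accurate, and only invalid records actually threaten the analysis.
quality = records.groupby("clinic").agg(
    avg_days_late=("days_late", "mean"),
    pct_valid=("passes_validation", "mean"),
)
print(quality)
```

On numbers like these, "Eastside" is late but fully valid: dropping it wholesale would discard accurate records, which is exactly what happened to the three community health centers.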
The Cumulative Effect
No single decision was catastrophic. Each one removed a small slice of data — 8% here, a few hundred records there, three clinics with formatting issues. Each decision had a documented, defensible rationale. Each was the kind of cleaning operation that a textbook (including this one) might recommend.
But the effects weren't random. They were correlated. Every decision disproportionately affected the same populations: low-income residents, immigrants, people of color, unhoused individuals. The cumulative result was a dataset that systematically under-represented the county's most vulnerable communities.
When Dr. Obi's report said "these three neighborhoods have the lowest vaccination rates," it was wrong — not because the math was wrong, but because the data had been cleaned in a way that made certain neighborhoods invisible.
The $2 million in outreach funding went to neighborhoods that needed it. But the neighborhoods that needed it most were left out.
What Should Have Happened
David Herrera's alarm led to a review of the analysis. Dr. Obi — to her credit — conducted the review herself, identified each problematic decision, and published a corrected report. The corrected analysis added two additional priority neighborhoods and redistributed the funding.
But the better question is: how could these errors have been prevented in the first place?
Practice 1: The Demographic Audit
Before dropping any data, check whether the rows you're about to remove have a demographic profile that differs from the rows you're keeping.
# Before dropping rows with missing addresses
dropped = df[df['address'].isnull()]
kept = df[df['address'].notnull()]

# Compare the demographic mix of the two groups
print("Kept — Race/Ethnicity:")
print(kept['race_ethnicity'].value_counts(normalize=True).round(3))
print()
print("Dropped — Race/Ethnicity:")
print(dropped['race_ethnicity'].value_counts(normalize=True).round(3))
If the demographic profiles differ significantly, your cleaning operation will introduce bias.
Practice 2: Granular Categories Before Aggregation
Don't aggregate categories before you understand what's in them. Dr. Obi should have analyzed the distribution of free-text race/ethnicity entries before collapsing them into six categories. If she'd seen that "Marshallese" appeared 2,400 times, she might have preserved it as its own category — or at least checked the vaccination rate for that group before merging it into "Other."
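This inspect-then-collapse workflow can be sketched in a few lines. The data and the minimum-group threshold below are illustrative assumptions, not the county's actual figures:

```python
import pandas as pd

# Invented free-text entries standing in for the 47 raw values.
raw = pd.Series(
    ["Marshallese"] * 5 + ["Somali"] * 4 + ["Hmong"] * 1 + ["B"] * 1,
    name="race_ethnicity_raw",
)

# Step 1: inspect the free-text distribution BEFORE collapsing anything.
counts = raw.value_counts()
print(counts)

# Step 2: preserve any write-in group above a minimum size as its own
# category; only truly sparse or uninterpretable values go to "Other".
MIN_GROUP = 3  # the threshold is an analytical choice worth documenting
standardized = raw.where(raw.map(counts) >= MIN_GROUP, other="Other")
print(standardized.value_counts())
```

The threshold itself is a judgment call with ethical weight (see Discussion Question 3); the point is that it should be made after seeing the distribution, not implied by a mapping table written in advance.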
Practice 3: Sensitivity Analysis
Run the analysis under multiple cleaning scenarios and see whether the conclusions change.
# Scenario A: drop rows with missing addresses
result_a = analyze(df.dropna(subset=['address']))

# Scenario B: keep them, assigning missing neighborhoods to "Unknown"
df_b = df.copy()
df_b['neighborhood'] = df_b['neighborhood'].fillna('Unknown')
result_b = analyze(df_b)

# Compare: do the priority neighborhoods change between scenarios?
If your conclusions are fragile — if they change depending on how you clean the data — that's crucial information that should appear in your report.
Practice 4: Community Validation
Data analysis about communities should be validated with communities. David Herrera knew something was wrong because he worked in those neighborhoods every day. His lived experience was data too — qualitative data that quantitative analysis alone couldn't capture.
Building relationships with community organizations isn't just good ethics. It's good data science. They can tell you when your numbers don't match reality.
Practice 5: Documenting Who Is Missing
Every cleaning log should include a section titled "Who is missing?" that explicitly considers whether the removed, imputed, or recategorized data disproportionately affects specific groups.
This isn't a standard section in most data cleaning workflows. It should be.
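A "Who is missing?" entry can even be generated mechanically alongside each cleaning step. The helper below is a sketch under assumed column names (`race_ethnicity`, `address`); the function name `who_is_missing` is invented for illustration:

```python
import pandas as pd

# Tiny illustrative dataset; values are made up.
df = pd.DataFrame({
    "race_ethnicity": ["White", "White", "Hispanic", "Hispanic", "Somali"],
    "address": ["1 Main St", "2 Oak Ave", None, "3 Elm St", None],
})

def who_is_missing(full, removed, group_col):
    """Return a log entry comparing the demographic mix of the
    removed rows against the full dataset."""
    return pd.DataFrame({
        "full_share": full[group_col].value_counts(normalize=True),
        "removed_share": removed[group_col].value_counts(normalize=True),
    }).fillna(0.0).round(3)

removed = df[df["address"].isnull()]
log_entry = who_is_missing(df, removed, "race_ethnicity")
print(log_entry)
```

Here the removed rows are 50% Somali even though Somali patients are 20% of the dataset: exactly the kind of disproportion the log section is meant to surface.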
The Broader Lesson
The Obermeyer et al. (2019) study published in Science examined a commercial healthcare algorithm used by major U.S. health systems to identify patients who would benefit from "high-risk care management" programs. The researchers found that the algorithm was significantly less likely to refer Black patients, even when they were equally or more sick than white patients. The root cause was not malice or even carelessness — it was a data processing decision. The algorithm used healthcare spending as a proxy for healthcare needs. Because Black patients historically had less access to healthcare (and thus lower spending), the algorithm equated lower spending with lower need. The data was "clean" — there were no missing values or duplicate records. But the choice of what the data represented encoded a bias that affected millions of patients.
Dr. Obi's story is smaller in scale but identical in structure. Clean data is not the same as accurate data. A dataset can be perfectly formatted, fully complete, and free of duplicates — and still lie, because the cleaning decisions that got it there systematically erased certain voices.
The threshold concept of this chapter — cleaning IS analysis — has a corollary: cleaning can also be bias. Every dropna(), every replace(), every drop_duplicates() can remove exactly the data you most need to see.
This is not an argument against data cleaning. Messy data must be cleaned. But it is an argument for cleaning carefully, cleaning thoughtfully, and always asking the question that too many analysts skip:
Whose data am I about to erase, and what will that mean for my conclusions?
Discussion Questions
- Dr. Obi documented every cleaning decision and provided a rationale for each. Yet the cumulative effect was still biased. What additional documentation practices might have caught the problem earlier?
- The case describes a tension between data quality (removing records from clinics with formatting issues) and data equity (those clinics served vulnerable populations). How would you navigate this tension? Is there a way to have both?
- The "Other" category in race/ethnicity absorbed communities with very different vaccination rates. At what point does aggregation become erasure? How do you decide when a category is "too small" to report separately — and what are the ethical implications of that decision?
- David Herrera's community knowledge revealed a problem that the data analysis missed. How would you design a data science workflow that incorporates community feedback as a standard step, not just an after-the-fact correction?
- Reflect on the Obermeyer et al. (2019) study. The algorithm used healthcare spending as a proxy for healthcare need — a data choice, not a cleaning choice. Where is the line between "cleaning" and "analytical design"? Does the distinction matter?