Case Study 1: The Messy Reality of Hospital Records


Tier 3 — Illustrative/Composite Example: Elena Martinez is a fictional character, and Memorial Regional Health System is a fictional organization. However, this case study is built from widely documented patterns in healthcare data quality. The types of data problems described here — duplicate patient records, inconsistent coding, missing values correlated with demographics — are well-established challenges in health informatics literature. All names, institutions, and specific figures are invented for pedagogical purposes.


The Setting

Elena Martinez has been working as a data analyst at Memorial Regional Health System for just over a year. The health system operates four hospitals and twenty-three outpatient clinics spread across two counties. Each facility has its own electronic health records (EHR) system — a legacy of mergers and acquisitions over the past decade. The central IT department has been working on unifying these systems, but the project is behind schedule and over budget, which is to say it's a completely normal healthcare IT project.

Elena's boss, the Chief Quality Officer, has given her what sounds like a straightforward assignment: create a report showing hospital readmission rates by diagnosis and demographic group for the past two years. The Centers for Medicare and Medicaid Services (CMS) tracks readmission rates as a quality metric, and hospitals with high rates face financial penalties. Understanding who is being readmitted and why is essential for reducing those rates.

"It should be simple," her boss said. "We have all the data in the warehouse."

Elena has learned that "it should be simple" and "we have all the data" are two of the most dangerous phrases in data science.

The Data

Elena pulls a dataset from the enterprise data warehouse. It has 847,000 rows — one per hospital visit — covering 24 months from January 2022 through December 2023. The columns include:

| Column | Description | Expected Type |
| --- | --- | --- |
| patient_id | Unique patient identifier | String |
| visit_date | Date of hospital admission | Date |
| discharge_date | Date of discharge | Date |
| facility | Which hospital or clinic | Categorical |
| primary_diagnosis | ICD-10 diagnosis code | String |
| diagnosis_desc | Human-readable diagnosis | String |
| age | Patient age at visit | Numeric |
| sex | Patient sex | Categorical |
| race_ethnicity | Patient race/ethnicity | Categorical |
| insurance_type | Insurance category | Categorical |
| zip_code | Patient home ZIP code | String |
| length_of_stay | Days in hospital | Numeric |
| readmission_flag | Was this a readmission within 30 days? | Boolean |

She opens her Jupyter notebook, loads the data, and begins her assessment.

import pandas as pd

df = pd.read_csv('hospital_visits.csv')
print(f"Shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")

The output is not encouraging.

Problem 1: The Duplicate Patients

Elena's first surprise comes when she checks for duplicate patient IDs. She expects each patient to appear multiple times (people can have multiple hospital visits), but she finds something else entirely.

A patient named "Robert Chen" appears under three different IDs: one from the Main Campus hospital, one from the Westside facility, and one from an outpatient clinic. Same date of birth, same address, same insurance — but three IDs because each system assigned its own.

This is the duplicate patient problem, and it's endemic in healthcare. When Elena runs a systematic check, she finds:

  • 847,000 total visit records
  • 312,000 unique patient IDs
  • Estimated 280,000 unique actual patients (roughly 32,000 patients have duplicate IDs across facilities)

The implications are serious. If Elena counts readmissions using patient_id, she'll miss readmissions where a patient was first admitted at one hospital and readmitted at another — because they have different IDs in the two systems. She'll undercount readmissions.

Elena's approach:

She can't fix the master patient index (that's a multi-year IT project). But she can build an approximate matching algorithm:

# Create a composite key for fuzzy patient matching.
# Note: date_of_birth and last_name come from the patient demographics
# extract, not the visit-level columns listed earlier.
df['match_key'] = (
    df['date_of_birth'].astype(str) + '_' +
    df['sex'].str.lower() + '_' +
    df['zip_code'].astype(str).str[:5] + '_' +
    df['last_name'].str.lower().str[:4]
)

# Group by match_key to find likely-same patients
potential_dupes = df.groupby('match_key')['patient_id'].nunique()
multi_id = potential_dupes[potential_dupes > 1]
print(f"Patients with multiple IDs: {len(multi_id)}")

She assigns a unified patient_key to matched records and logs the entire process, including the matching criteria and the number of patients affected.

Cleaning log entry: "Identified approximately 32,000 patients with multiple IDs across facilities using composite matching on date_of_birth + sex + ZIP + last_name_prefix. Created unified patient_key. Matching is imperfect — estimated 2-3% false positive rate based on manual review of 200 random matches."
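The effect of the merge on readmission counting can be illustrated with a small sketch. The rows, the facility-local IDs, and the `count_readmissions` helper below are all invented for illustration; the real pipeline works on the full visit table:

```python
import pandas as pd

# Toy data: one patient seen at two facilities under two different IDs,
# but sharing a unified patient_key after matching.
visits = pd.DataFrame({
    'patient_id':  ['MC-001', 'WS-884'],
    'patient_key': ['K-0001', 'K-0001'],
    'visit_date':  pd.to_datetime(['2023-03-01', '2023-03-18']),
})

def count_readmissions(df, id_col):
    """Count visits occurring within 30 days of the same person's prior visit."""
    df = df.sort_values([id_col, 'visit_date'])
    gap = df.groupby(id_col)['visit_date'].diff()
    return int((gap <= pd.Timedelta(days=30)).sum())

print(count_readmissions(visits, 'patient_id'))   # 0 — readmission invisible
print(count_readmissions(visits, 'patient_key'))  # 1 — cross-facility readmission found
```

Counting by `patient_id` misses the cross-facility readmission entirely; counting by the unified key recovers it.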

Problem 2: The Missing Race/Ethnicity Data

The race_ethnicity column stops Elena cold. Overall, 14% of records have missing race/ethnicity data. But the missingness is far from random:

| Facility | Missing Race/Ethnicity |
| --- | --- |
| Main Campus | 8% |
| Eastside Hospital | 11% |
| Westside Facility | 24% |
| Northpoint Hospital | 19% |
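A breakdown like this comes from grouping the missingness mask by facility. A minimal sketch, using a few invented toy rows in place of the warehouse extract:

```python
import pandas as pd

# Toy records standing in for the warehouse extract.
df = pd.DataFrame({
    'facility':       ['Main Campus', 'Main Campus', 'Main Campus',
                       'Main Campus', 'Westside', 'Westside'],
    'race_ethnicity': ['White', 'Black', 'Asian', None, 'Hispanic', None],
})

# Share of missing race/ethnicity per facility, as a percentage.
missing_by_facility = (
    df['race_ethnicity'].isna()
      .groupby(df['facility'])
      .mean()
      .mul(100)
      .round(1)
)
print(missing_by_facility)
```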

When Elena digs deeper, she discovers that the Westside and Northpoint facilities serve neighborhoods with larger immigrant populations. Intake staff at those facilities reported that patients sometimes declined to provide race/ethnicity information, and in busy periods, the field was sometimes skipped during registration.

This is a textbook case of data that is not Missing Completely at Random. The missing values are concentrated among facilities that serve specific communities. If Elena simply drops the rows with missing race/ethnicity, she'll disproportionately remove records from hospitals serving immigrant communities — exactly the populations she needs to understand for the equity analysis her boss requested.

Elena's approach:

She does not drop the missing rows. Instead, she takes a three-pronged approach:

  1. Run the analysis both ways — with and without the missing rows — and report whether the conclusions differ.
  2. Create a "Not Reported" category for race/ethnicity, treating it as its own group rather than deleting it.
  3. Use ZIP code as a supplementary indicator — while not a substitute for individual race/ethnicity data, ZIP-level demographic data can provide context about the communities being served by each facility.
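The first two steps might look like this in pandas. This is a sketch with invented toy rows, and the rate computation is deliberately simplified to an overall mean:

```python
import pandas as pd

df = pd.DataFrame({
    'race_ethnicity':   ['White', None, 'Hispanic', None],
    'readmission_flag': [True, False, True, True],
})

# Step 2: keep missing values as an explicit category instead of dropping them.
df['race_ethnicity'] = df['race_ethnicity'].fillna('Not Reported')

# Step 1: sensitivity check — the metric with and without the missing group.
rate_all      = df['readmission_flag'].mean()
rate_reported = df.loc[df['race_ethnicity'] != 'Not Reported',
                       'readmission_flag'].mean()
print(f"All records: {rate_all:.2f}, excluding 'Not Reported': {rate_reported:.2f}")
```

If the two rates diverge, that divergence belongs in the report, not in a footnote.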

Cleaning log entry: "Race/ethnicity missing for 14% of records overall, concentrated at Westside (24%) and Northpoint (19%). Missingness is NOT random — correlates with facility and likely with patient demographics. Decision: retain missing values as 'Not Reported' category. Will run sensitivity analysis with and without this group."

Problem 3: The Inconsistent Diagnosis Codes

The primary_diagnosis column uses ICD-10 codes, which should be standardized. But Elena discovers three types of inconsistency:

  1. Precision differences: Some records use the three-character code (I10 for essential hypertension), others use the full code with decimal (I10.0). These are the same diagnosis at different levels of specificity.

  2. Version conflicts: A small number of records from early 2022 still use ICD-9 codes from the old system — the data migration was incomplete.

  3. Free-text contamination: About 1,200 records have diagnosis_desc values in the primary_diagnosis field instead of codes — someone apparently swapped the columns during a data export.

# Find records where primary_diagnosis doesn't look like an ICD-10 code.
# ICD-10 codes follow the pattern: letter + 2 digits, optionally followed by
# a decimal point and 1-4 more characters (ICD-10-CM allows letters, such as
# the 'X' placeholder, after the decimal).
icd10_pattern = r'^[A-Z]\d{2}(\.[0-9A-Z]{1,4})?$'
non_standard = df[~df['primary_diagnosis'].str.match(icd10_pattern, na=False)]
print(f"Non-standard diagnosis codes: {len(non_standard)}")

Elena's approach:

She standardizes ICD-10 codes to the three-character level for the readmission analysis (grouping I10 and I10.0 together). She flags the ICD-9 records and the swapped-field records separately, maps the ICD-9 codes to their ICD-10 equivalents where possible, and excludes the truly unmappable records — documenting how many and why.
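Truncating to the three-character category is a one-liner once the ICD-9 codes are mapped. The `icd9_to_icd10` dictionary below is a tiny invented sample; a real analysis would use a full crosswalk such as the CMS General Equivalence Mappings:

```python
import pandas as pd

# Mixed codes as they might appear in the extract; '401.9' is ICD-9.
codes = pd.Series(['I10', 'I10.0', 'E11.9', '401.9'])

# Invented sample mapping — a real pipeline would load a full GEM crosswalk.
icd9_to_icd10 = {'401.9': 'I10'}

# Map ICD-9 codes to ICD-10 where possible, then truncate to the
# three-character category level.
standardized = codes.replace(icd9_to_icd10).str[:3]
print(standardized.tolist())  # ['I10', 'I10', 'E11', 'I10']
```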

Problem 4: The Impossible Values

During her routine range checks, Elena finds:

  • 47 records with age values of 0 (legitimate — newborns are patients too)
  • 3 records with age values over 120 (maximum documented human age is 122)
  • 189 records with length_of_stay of 0 days (same-day discharges — legitimate for observation stays)
  • 12 records with length_of_stay over 365 days (long-term care patients — unusual but real)
  • 1 record with a negative length_of_stay (discharge date before admission date — definitely an error)
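Range checks like these are simple boolean filters. A sketch over a few invented toy rows (the counts printed are for the toy data, not Elena's 847,000 records):

```python
import pandas as pd

df = pd.DataFrame({
    'age':            [0, 134, 45],
    'length_of_stay': [2, 0, -1],
})

# Flag values outside plausible ranges; what counts as "plausible" is a
# domain decision, not a statistical one.
print((df['age'] > 120).sum())            # ages over the documented human maximum
print((df['length_of_stay'] < 0).sum())   # discharge before admission
print((df['length_of_stay'] == 0).sum())  # same-day stays — may be legitimate
```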

This is where outlier handling requires domain knowledge. Elena consults with a clinical informaticist, who confirms:

  • Ages 0-120 are plausible (though ages over 110 are extremely rare and worth flagging)
  • Same-day stays are legitimate for observation and outpatient procedures
  • Stays over 365 days happen for patients in long-term ventilator units or rehabilitation
  • Negative length of stay is a data entry error

Elena's approach:

She removes the 1 record with negative length of stay, flags the 3 records with age over 120 for manual review, and keeps everything else. She adds a note in her cleaning log about consulting with the clinical informaticist — domain expertise informing the cleaning decision.
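The handling step itself is short. A sketch with invented toy rows; `review_flag` is a hypothetical column name, not one from the warehouse schema:

```python
import pandas as pd

df = pd.DataFrame({
    'age':            [0, 134, 45],
    'length_of_stay': [2, 14, -1],
})

# Remove the impossible record; flag the implausible one; keep the rest.
df = df[df['length_of_stay'] >= 0].copy()
df['review_flag'] = df['age'] > 120
print(len(df), int(df['review_flag'].sum()))  # 2 rows kept, 1 flagged
```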

The Result

After two weeks of cleaning, Elena's dataset has gone from 847,000 raw records to 845,459 cleaned records. She can account for every removed row:

| Step | Records Affected | Action |
| --- | --- | --- |
| Unified duplicate patient IDs | ~32,000 patients re-linked | No records removed; IDs merged |
| Removed swapped-field diagnosis records | 1,200 | Removed (could not reliably extract code) |
| Removed ICD-9 records that couldn't be mapped | 340 | Removed |
| Removed negative length of stay | 1 | Removed |
| Flagged age > 120 for review | 3 | Kept, flagged |
| Recoded missing race/ethnicity | 0 removed | Kept; recoded as "Not Reported" |

Her analysis ultimately reveals that readmission rates differ significantly across demographic groups and that certain diagnoses (heart failure, pneumonia, COPD) drive the majority of readmissions — findings that would have been distorted if she'd handled the data quality issues carelessly.

What Elena Learned

Reflecting on the project, Elena identifies three lessons that the textbooks didn't prepare her for:

First, cleaning IS the analysis. Her decision to merge duplicate patient IDs increased the measured readmission rate by 4 percentage points — because cross-facility readmissions that were previously invisible now showed up. A different analyst who didn't merge IDs would have reached a different (and less accurate) conclusion.

Second, missing data tells a story. The pattern of missing race/ethnicity data — concentrated at facilities serving immigrant communities — was itself a finding. It pointed to systemic gaps in data collection that the health system needed to address.

Third, domain knowledge is irreplaceable. Elena couldn't have made correct decisions about newborn ages, same-day stays, or long-term care stays without talking to the clinical informaticist. The data alone didn't tell her whether age 0 was a valid patient or an error. The person who understands the context did.

Connection: Elena's cleaning pipeline exemplifies the chapter's threshold concept: every cleaning decision encodes an assumption. Merging duplicate IDs assumes the matching algorithm is correct. Keeping "Not Reported" as a category assumes missingness is informative. Standardizing ICD-10 codes to three characters assumes that level of detail is sufficient for the analysis. These are analytical choices, not merely technical ones.

In Chapter 9, Elena will reshape this cleaned data — pivoting readmission rates by diagnosis and demographic group for comparison. In Chapter 15, she'll visualize the disparities she's found. But none of that work would be meaningful without the careful, documented cleaning she did here.


Discussion Questions

  1. Elena's patient matching algorithm has an estimated 2-3% false positive rate (incorrectly merging two different patients). How might this affect her readmission analysis? Should she use a more conservative matching criterion?

  2. The decision to create a "Not Reported" category for race/ethnicity rather than imputing it is a methodological choice. What are the trade-offs? Under what circumstances might imputation be more appropriate?

  3. Elena removed 1,200 records with swapped diagnosis fields. That's 0.14% of the data — tiny in percentage terms. But what if those 1,200 records were disproportionately from one facility or one time period? How would you check for this, and what would it mean for the analysis?