Case Study 2: Cleaning Drug Names — When Text Data Can Save Lives
Tier 3 — Illustrative/Composite Example: This case study uses a fictional hospital and fictional patient data to illustrate real challenges in medication name standardization. The types of errors described — misspellings, abbreviations, inconsistent formats, and look-alike/sound-alike drug name confusions — are well-documented in patient safety literature, including publications by the Institute for Safe Medication Practices (ISMP) and the Joint Commission. All names, institution details, and specific data values are invented for pedagogical purposes.
The Setting
Elena, our public health analyst, has been asked to help with a problem at Meridian General Hospital that goes beyond her usual vaccination work. The hospital's pharmacy department needs to audit six months of electronic health records to identify potential medication errors — cases where the wrong drug may have been administered because of a name confusion.
The problem is real and urgent. According to the Institute for Safe Medication Practices, medication name confusion is one of the most common sources of drug errors. "Metformin" and "metoprolol." "Hydroxyzine" and "hydralazine." "Celebrex" and "Celexa." These names look or sound similar enough that a hurried pharmacist or nurse might select the wrong one — and the consequences can be serious.
Elena's task: clean and standardize six months of medication dispensing records so that every entry maps to an unambiguous, canonical drug name. The pharmacy team can then cross-reference dispensed medications against prescriptions to find mismatches.
The Data
Elena loads the dispensing records:
import pandas as pd
records = pd.read_csv("dispensing_records.csv")
print(records.shape)
print(records.columns.tolist())
(45200, 8)
['record_id', 'patient_id', 'dispense_date', 'medication_text',
'dose_text', 'route', 'prescriber', 'pharmacy_tech']
The medication_text column is the problem. It was entered by dozens of different pharmacy technicians, each with their own habits:
records["medication_text"].value_counts().head(20)
Metformin 500mg                    892
metformin HCl 500 mg               756
METFORMIN HYDROCHLORIDE 500MG      634
Lisinopril 10mg                    521
lisinopril 10 mg                   498
Atorvastatin 20mg tab              412
ATORVASTATIN CALCIUM 20 MG         389
atorvastatin 20mg tablet           367
Metformin 500 mg tablet            345
Omeprazole 20mg cap                312
omeprazole DR 20 mg                298
Amlodipine 5mg                     276
amlodipine besylate 5 mg tab       254
Metoprolol Succinate 50mg ER       223
metoprolol succinate ER 50 mg      198
metoprolol tartrate 25mg           187
HYDROCHLOROTHIAZIDE 25MG           176
losartan potassium 50 mg           165
Losartan 50mg                      154
Atenolol 50mg                      143
Name: medication_text, dtype: int64
Elena counts unique values:
records["medication_text"].nunique()
1847
Nearly two thousand unique medication entries for what should be a manageable formulary. This isn't just a data quality problem — it's a patient safety problem. If "metformin" appears as 15 different text variations, then a naive exact-match query for "show me all patients on metformin" would miss most of them.
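A quick check makes that concrete (the spelling in the exact match below is one variant taken from the counts above):
# An exact match finds only one of the many metformin spellings
exact = records[records["medication_text"] == "Metformin 500mg"]
print(len(exact))  # 892 records -- one variant among at least 15

# A case-insensitive substring search casts a far wider net
loose = records[records["medication_text"].str.contains(
    "metformin", case=False, na=False)]
print(len(loose))  # a few thousand records once case and spacing no longer matter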
Step 1: Understanding the Structure of Medication Text
Before writing any code, Elena studies the patterns. A typical medication entry has this structure:
[Drug Name] [Salt Form] [Dose Amount][Dose Unit] [Formulation]
For example:
- "Metformin HCl 500 mg tablet" = Drug: Metformin, Salt: HCl, Dose: 500mg, Form: tablet
- "Atorvastatin calcium 20mg tab" = Drug: Atorvastatin, Salt: calcium, Dose: 20mg, Form: tablet
But not every entry follows this pattern. Some omit the salt form. Some abbreviate the formulation. Some include extra information like "ER" (extended release) or "DR" (delayed release). Some are just plain wrong.
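To see that structure in code, here is a small standalone sketch. The salt and formulation vocabularies in this pattern are abbreviated examples for illustration, not the full lists Elena would use:
import re

# Simplified parse of [Drug Name] [Salt Form] [Dose][Unit] [Formulation]
pattern = re.compile(
    r"^(?P<drug>[A-Za-z-]+)\s*"
    r"(?P<salt>HCl|hydrochloride|calcium|besylate)?\s*"
    r"(?P<dose>\d+\.?\d*)\s*(?P<unit>mg|mcg|g)\s*"
    r"(?P<form>tablet|tab|capsule|cap)?",
    flags=re.IGNORECASE)

for text in ["Metformin HCl 500 mg tablet", "Atorvastatin calcium 20mg tab"]:
    match = pattern.match(text)
    print(match.groupdict() if match else f"no match: {text}")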
Step 2: The Normalization Pipeline
Elena builds her pipeline one step at a time, checking results after each step:
# Step 1: Basic standardization
records["med_clean"] = (records["medication_text"]
.str.strip()
.str.lower()
.str.replace(r"\s+", " ", regex=True))
print(f"After basic cleaning: {records['med_clean'].nunique()} unique")
After basic cleaning: 1204 unique
# Step 2: Standardize common abbreviations
abbreviations = {
r"\btab\b": "tablet",
r"\btabs\b": "tablet",
r"\bcap\b": "capsule",
r"\bcaps\b": "capsule",
r"\binj\b": "injection",
r"\bsoln\b": "solution",
r"\bsusp\b": "suspension",
r"\bhcl\b": "hydrochloride",
r"\ber\b": "extended-release",
r"\bdr\b": "delayed-release",
r"\bsr\b": "sustained-release",
r"\bxr\b": "extended-release",
r"\bla\b": "long-acting",
}
for pattern, replacement in abbreviations.items():
records["med_clean"] = records["med_clean"].str.replace(
pattern, replacement, regex=True)
print(f"After abbreviation standardization: {records['med_clean'].nunique()} unique")
After abbreviation standardization: 987 unique
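The \b word boundaries in these patterns are doing real work. Without them, an abbreviation that is also a prefix of its own expansion corrupts the text, as this small demonstration (not part of Elena's pipeline) shows:
import re

s = "atorvastatin 20mg tablet"
print(re.sub(r"tab", "tablet", s))      # 'atorvastatin 20mg tabletlet' -- corrupted
print(re.sub(r"\btab\b", "tablet", s))  # unchanged: 'tablet' is not a bare 'tab'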
# Step 3: Remove salt forms (they're important for chemistry
# but not for identifying the core drug)
salt_forms = [
r"\bhydrochloride\b", r"\bbesylate\b", r"\bcalcium\b",
r"\bsodium\b", r"\bpotassium\b", r"\bsuccinate\b",
r"\btartrate\b", r"\bfumarate\b", r"\bmaleate\b",
r"\bmesylate\b"
]
salt_pattern = "|".join(salt_forms)
records["drug_core"] = (records["med_clean"]
.str.replace(salt_pattern, "", regex=True)
.str.replace(r"\s+", " ", regex=True)
.str.strip())
print(f"After removing salt forms: {records['drug_core'].nunique()} unique")
After removing salt forms: 723 unique
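Consistent with checking results after each step, a quick spot check (a sketch, not shown in Elena's printed output) lists which strings the salt stripping actually changed. It would also surface drugs like potassium chloride, where the "salt" word is part of the drug name itself:
# Spot-check the before/after pairs the salt stripping changed
changed = records.loc[records["med_clean"] != records["drug_core"],
                      ["med_clean", "drug_core"]].drop_duplicates()
print(changed.head(10))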
Step 3: Extracting Components with Regex
Now Elena uses capture groups to extract the drug name and dose separately:
import re
# Pattern: drug name (letters and hyphens), then dose (number + unit)
extracted = records["drug_core"].str.extract(
r"^(?P<drug_name>[a-z][a-z\s-]+?)\s*" # drug name
r"(?P<dose_amount>\d+\.?\d*)\s*" # dose number
r"(?P<dose_unit>mg|mcg|ml|g|units?)?" # dose unit
)
records["drug_name"] = extracted["drug_name"].str.strip()
records["dose_amount"] = pd.to_numeric(extracted["dose_amount"],
errors="coerce")
records["dose_unit"] = extracted["dose_unit"].fillna("mg")
print(records["drug_name"].nunique())
198
Down from 1,847 unique medication texts to 198 unique drug names. But Elena isn't done — she needs to verify these are all legitimate drug names and catch any misspellings.
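The extraction pattern also fails quietly on anything it can't parse, leaving drug_name empty. In keeping with the flag-rather-than-guess approach, a short check (a sketch) surfaces those rows:
# Rows the extraction pattern could not parse get a missing drug_name
unparsed = records[records["drug_name"].isna()]
print(f"{len(unparsed)} records could not be parsed")
print(unparsed["medication_text"].value_counts().head())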
Step 4: Catching Dangerous Look-Alikes
This is where the work becomes genuinely important for patient safety. Elena creates a reference list of known look-alike/sound-alike drug pairs:
lookalike_pairs = [
("metformin", "metoprolol"),
("hydroxyzine", "hydralazine"),
("clonidine", "klonopin"),
("losartan", "valsartan"),
("prednisone", "prednisolone"),
("glipizide", "glyburide"),
("tramadol", "trazodone"),
("bupropion", "buspirone"),
]
# For each pair, find records where one might be confused for the other
for drug_a, drug_b in lookalike_pairs:
    # Collect entries whose names resemble either drug in the pair
ambiguous = records[
records["drug_name"].str.contains(drug_a[:5], na=False) |
records["drug_name"].str.contains(drug_b[:5], na=False)
]
if len(ambiguous) > 0:
variants = ambiguous["drug_name"].value_counts()
        if len(variants) > 2:  # more names than the legitimate forms suggest
print(f"\nPotential confusion: {drug_a} / {drug_b}")
print(variants.head())
Potential confusion: metformin / metoprolol
metformin                      2627
metoprolol                      608
metformin extended-release      345
metoprolol extended-release     198
metforman                         3
Name: drug_name, dtype: int64
There it is. Three records with "metforman" — a misspelling. Is it a misspelling of "metformin" or "metoprolol"? They're different drugs with different purposes (metformin is for diabetes; metoprolol is for blood pressure). Elena flags these for pharmacist review rather than auto-correcting them — because the wrong assumption could mask a real medication error.
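For misspellings like this, string distance can help triage. Here is a sketch using Python's standard-library difflib (an addition for illustration, not Elena's code) that compares each unrecognized base name against the look-alike list and prints anything spelled suspiciously close to a known drug:
from difflib import get_close_matches

known_drugs = sorted({d for pair in lookalike_pairs for d in pair})
for name in records["drug_name"].dropna().unique():
    base = name.split()[0]
    if base not in known_drugs:
        close = get_close_matches(base, known_drugs, n=2, cutoff=0.8)
        if close:
            print(f"{base!r} is close to {close} -- flag for review")
# e.g. 'metforman' is close to ['metformin'] -- flag for review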
Step 5: Building the Canonical Mapping
Elena creates a master mapping from every observed variation to a canonical name:
# Build the mapping
canonical_map = {}
# Get all unique drug names
all_drugs = records["drug_name"].dropna().unique()
# Reference list of canonical names (from hospital formulary)
formulary = [
"metformin", "lisinopril", "atorvastatin", "omeprazole",
"amlodipine", "metoprolol", "hydrochlorothiazide",
"losartan", "atenolol", "simvastatin", "levothyroxine",
"pantoprazole", "gabapentin", "prednisone", "furosemide",
# ... (full formulary would have hundreds of entries)
]
# For each observed drug name, find the closest formulary match
for drug in all_drugs:
    # Match the first word (the base drug name) against the formulary
    base_drug = drug.split()[0]
if base_drug in formulary:
canonical_map[drug] = base_drug
else:
canonical_map[drug] = f"REVIEW: {drug}"
records["drug_canonical"] = records["drug_name"].map(canonical_map)
Any entry whose base name doesn't appear in the formulary gets flagged with "REVIEW:" for manual inspection. This is critical — in healthcare data, automated corrections can be dangerous. It's better to flag uncertainties than to silently make wrong assumptions.
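With the mapping applied, a couple of lines show the size of the review queue (illustrative usage of the columns built above):
# Count and preview the entries queued for manual review
review_mask = records["drug_canonical"].str.startswith("REVIEW", na=False)
print(f"{review_mask.sum()} records flagged for pharmacist review")
print(records.loc[review_mask, "drug_canonical"].value_counts().head())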
Step 6: The Dose Component
Elena also standardizes the dose text to enable numeric comparisons:
# Extract dose from original text
dose_extracted = records["medication_text"].str.extract(
r"(\d+\.?\d*)\s*(mg|mcg|ml|g|units?|iu)",
flags=re.IGNORECASE
)
dose_extracted.columns = ["amount", "unit"]
dose_extracted["amount"] = pd.to_numeric(
dose_extracted["amount"], errors="coerce")
dose_extracted["unit"] = dose_extracted["unit"].str.lower()
# Flag unusual doses (potential data entry errors)
# Each drug has a known dose range
dose_ranges = {
"metformin": (250, 2000),
"lisinopril": (2.5, 40),
"atorvastatin": (10, 80),
"amlodipine": (2.5, 10),
"metoprolol": (25, 200),
}
records["dose_amount"] = dose_extracted["amount"]
records["dose_unit"] = dose_extracted["unit"]
# Check for out-of-range doses
for drug, (low, high) in dose_ranges.items():
mask = (records["drug_canonical"] == drug) & (
(records["dose_amount"] < low) |
(records["dose_amount"] > high))
flagged = records[mask]
if len(flagged) > 0:
print(f"{drug}: {len(flagged)} records with "
f"unusual doses")
print(flagged[["medication_text", "dose_amount"]]
.head(3))
print()
metformin: 7 records with unusual doses
       medication_text  dose_amount
1234  metformin 5000mg       5000.0
2891   Metformin 50 mg         50.0
3102     metformin 5mg          5.0

lisinopril: 3 records with unusual doses
       medication_text  dose_amount
456   lisinopril 100mg        100.0
1678  Lisinopril 0.5mg          0.5
4023  lisinopril 50 mg         50.0
These out-of-range doses are potential errors. "Metformin 5000mg" could be a typo for 500mg. "Lisinopril 100mg" doesn't exist as a standard dose. Each of these gets flagged for pharmacist review.
The Results
Elena delivers a cleaned dataset with:
- drug_canonical: A standardized drug name matching the hospital formulary
- dose_amount: Numeric dose value
- dose_unit: Standardized unit
- review_flags: A column listing any concerns (misspelling, unusual dose, look-alike match); a sketch of how this column can be assembled follows
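The excerpts above never showed review_flags being constructed. A minimal sketch, assembled from the two checks already in place (the "REVIEW:" prefix and the dose-range mask; the flag wording here is an illustrative assumption):
# Assemble review_flags from the earlier checks (a sketch)
flags = pd.Series(pd.NA, index=records.index, dtype="object")

# Names that failed the formulary match during canonical mapping
unmatched = records["drug_canonical"].str.startswith("REVIEW", na=False)
flags[unmatched] = "possible misspelling"

# Doses outside the known range for their drug
for drug, (low, high) in dose_ranges.items():
    bad_dose = (records["drug_canonical"] == drug) & (
        (records["dose_amount"] < low) | (records["dose_amount"] > high))
    flags[bad_dose] = "unusual dose"

records["review_flags"] = flags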
summary = records.groupby("drug_canonical").agg(
total_records=("record_id", "count"),
unique_texts=("medication_text", "nunique"),
flagged=("review_flags", lambda x: x.notna().sum())
).sort_values("total_records", ascending=False).head(10)
print(summary)
                total_records  unique_texts  flagged
drug_canonical
metformin                2972            15       10
atorvastatin             1168            11        2
lisinopril               1019             9        3
metoprolol                806             8        4
omeprazole                610             6        2
amlodipine                530             7        1
...
The pharmacy department now has a clean dataset where "metformin" is always "metformin," regardless of whether the technician typed "Metformin HCl 500mg," "METFORMIN HYDROCHLORIDE 500MG," or "metformin 500 mg tablet." And the flagged records — misspellings, unusual doses, look-alike confusions — are queued for human review.
The Takeaway
This case study is different from most data cleaning stories because the stakes are tangible. A misspelled drug name isn't just a messy data point — it could represent a medication error that affected a patient. That reality shaped Elena's approach in several important ways:
1. Conservative automation, aggressive flagging. Elena automated the straightforward cases (case normalization, abbreviation expansion, salt form removal) but flagged anything ambiguous for human review. In healthcare data, a false positive (flagging something that's fine) is much safer than a false negative (silently approving something wrong).
2. Domain knowledge matters. Elena's regex patterns were informed by pharmaceutical naming conventions — she knew that drug names are followed by salt forms, then doses, then formulations. Without that domain knowledge, the patterns would have been less effective.
3. Look-alike/sound-alike awareness. The most dangerous text data errors aren't the obviously wrong ones — they're the plausibly wrong ones. "Metforman" is obviously a typo. But if someone typed "metoprolol" when they meant "metformin," there's no string method or regex that would catch it. That's why Elena built the look-alike pair check.
4. Validation through dose ranges. Text cleaning isn't just about the text. By extracting numeric doses and comparing them to known ranges, Elena caught data entry errors that would have been invisible to purely text-based cleaning.
5. Every cleaning decision is documented. Elena's code includes comments explaining why each pattern was chosen, which abbreviations are expanded, and why certain entries are flagged rather than auto-corrected. When the pharmacy team asks "why was this record flagged?", the code provides the answer.
The tools are the same ones you learned in Chapter 10 — str.lower(), str.replace(), str.extract(), regex patterns with capture groups. But the stakes remind us that data cleaning isn't just a technical exercise. It's a responsibility. The data we clean becomes the basis for decisions that affect real people, and we owe it to them to clean it carefully, transparently, and honestly.