Case Study 2: Cleaning Drug Names — When Text Data Can Save Lives
Tier 3 — Illustrative/Composite Example: This case study uses a fictional hospital and fictional patient data to illustrate real challenges in medication name standardization. The types of errors described — misspellings, abbreviations, inconsistent formats, and look-alike/sound-alike drug name confusions — are well-documented in patient safety literature, including publications by the Institute for Safe Medication Practices (ISMP) and the Joint Commission. All names, institution details, and specific data values are invented for pedagogical purposes.
The Setting
Elena, our public health analyst, has been asked to help with a problem at Meridian General Hospital that goes beyond her usual vaccination work. The hospital's pharmacy department needs to audit six months of electronic health records to identify potential medication errors — cases where the wrong drug may have been administered because of a name confusion.
The problem is real and urgent. According to the Institute for Safe Medication Practices, medication name confusion is one of the most common sources of drug errors. "Metformin" and "metoprolol." "Hydroxyzine" and "hydralazine." "Celebrex" and "Celexa." These names look or sound similar enough that a hurried pharmacist or nurse might select the wrong one — and the consequences can be serious.
Elena's task: clean and standardize six months of medication dispensing records so that every entry maps to an unambiguous, canonical drug name. The pharmacy team can then cross-reference dispensed medications against prescriptions to find mismatches.
The Data
Elena loads the dispensing records:
import pandas as pd
records = pd.read_csv("dispensing_records.csv")
print(records.shape)
print(records.columns.tolist())
(45200, 8)
['record_id', 'patient_id', 'dispense_date', 'medication_text',
'dose_text', 'route', 'prescriber', 'pharmacy_tech']
The medication_text column is the problem. It was entered by dozens of different pharmacy technicians, each with their own habits:
records["medication_text"].value_counts().head(20)
Metformin 500mg                    892
metformin HCl 500 mg               756
METFORMIN HYDROCHLORIDE 500MG      634
Lisinopril 10mg                    521
lisinopril 10 mg                   498
Atorvastatin 20mg tab              412
ATORVASTATIN CALCIUM 20 MG         389
atorvastatin 20mg tablet           367
Metformin 500 mg tablet            345
Omeprazole 20mg cap                312
omeprazole DR 20 mg                298
Amlodipine 5mg                     276
amlodipine besylate 5 mg tab       254
Metoprolol Succinate 50mg ER       223
metoprolol succinate ER 50 mg      198
metoprolol tartrate 25mg           187
HYDROCHLOROTHIAZIDE 25MG           176
losartan potassium 50 mg           165
Losartan 50mg                      154
Atenolol 50mg                      143
Name: medication_text, dtype: int64
Elena counts unique values:
records["medication_text"].nunique()
1847
Nearly two thousand unique medication entries for what should be a manageable formulary. This isn't just a data quality problem — it's a patient safety problem. If "metformin" appears as 15 different text variations, then a naive exact-match query for "show me all patients on metformin" would miss most of them.
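A quick check makes that concrete (the spelling in the exact match below is one variant taken from the counts above):
# An exact match finds only one of the many metformin spellings
exact = records[records["medication_text"] == "Metformin 500mg"]
print(len(exact))  # 892 records -- one variant among at least 15

# A case-insensitive substring search casts a far wider net
loose = records[records["medication_text"].str.contains(
    "metformin", case=False, na=False)]
print(len(loose))  # a few thousand records once case and spacing no longer matter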
Step 1: Understanding the Structure of Medication Text
Before writing any code, Elena studies the patterns. A typical medication entry has this structure:
[Drug Name] [Salt Form] [Dose Amount][Dose Unit] [Formulation]
For example:
- "Metformin HCl 500 mg tablet" = Drug: Metformin, Salt: HCl, Dose: 500mg, Form: tablet
- "Atorvastatin calcium 20mg tab" = Drug: Atorvastatin, Salt: calcium, Dose: 20mg, Form: tablet
But not every entry follows this pattern. Some omit the salt form. Some abbreviate the formulation. Some include extra information like "ER" (extended release) or "DR" (delayed release). Some are just plain wrong.
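To see that structure in code, here is a small standalone sketch. The salt and formulation vocabularies in this pattern are abbreviated examples for illustration, not the full lists Elena would use:
import re

# Simplified parse of [Drug Name] [Salt Form] [Dose][Unit] [Formulation]
pattern = re.compile(
    r"^(?P<drug>[A-Za-z-]+)\s*"
    r"(?P<salt>HCl|hydrochloride|calcium|besylate)?\s*"
    r"(?P<dose>\d+\.?\d*)\s*(?P<unit>mg|mcg|g)\s*"
    r"(?P<form>tablet|tab|capsule|cap)?",
    flags=re.IGNORECASE)

for text in ["Metformin HCl 500 mg tablet", "Atorvastatin calcium 20mg tab"]:
    match = pattern.match(text)
    print(match.groupdict() if match else f"no match: {text}")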
Step 2: The Normalization Pipeline
Elena builds her pipeline one step at a time, checking results after each step:
# Step 1: Basic standardization
records["med_clean"] = (records["medication_text"]
.str.strip()
.str.lower()
.str.replace(r"\s+", " ", regex=True))
print(f"After basic cleaning: {records['med_clean'].nunique()} unique")
After basic cleaning: 1204 unique
# Step 2: Standardize common abbreviations
abbreviations = {
r"\btab\b": "tablet",
r"\btabs\b": "tablet",
r"\bcap\b": "capsule",
r"\bcaps\b": "capsule",
r"\binj\b": "injection",
r"\bsoln\b": "solution",
r"\bsusp\b": "suspension",
r"\bhcl\b": "hydrochloride",
r"\ber\b": "extended-release",
r"\bdr\b": "delayed-release",
r"\bsr\b": "sustained-release",
r"\bxr\b": "extended-release",
r"\bla\b": "long-acting",
}
for pattern, replacement in abbreviations.items():
records["med_clean"] = records["med_clean"].str.replace(
pattern, replacement, regex=True)
print(f"After abbreviation standardization: {records['med_clean'].nunique()} unique")
After abbreviation standardization: 987 unique
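The \b word boundaries in these patterns are doing real work. Without them, an abbreviation that is also a prefix of its own expansion corrupts the text, as this small demonstration (not part of Elena's pipeline) shows:
import re

s = "atorvastatin 20mg tablet"
print(re.sub(r"tab", "tablet", s))      # 'atorvastatin 20mg tabletlet' -- corrupted
print(re.sub(r"\btab\b", "tablet", s))  # unchanged: 'tablet' is not a bare 'tab'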
# Step 3: Remove salt forms (they're important for chemistry
# but not for identifying the core drug)
salt_forms = [
r"\bhydrochloride\b", r"\bbesylate\b", r"\bcalcium\b",
r"\bsodium\b", r"\bpotassium\b", r"\bsuccinate\b",
r"\btartrate\b", r"\bfumarate\b", r"\bmaleate\b",
r"\bmesylate\b"
]
salt_pattern = "|".join(salt_forms)
records["drug_core"] = (records["med_clean"]
.str.replace(salt_pattern, "", regex=True)
.str.replace(r"\s+", " ", regex=True)
.str.strip())
print(f"After removing salt forms: {records['drug_core'].nunique()} unique")
After removing salt forms: 723 unique
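Consistent with checking results after each step, a quick spot check (a sketch, not shown in Elena's printed output) lists which strings the salt stripping actually changed. It would also surface drugs like potassium chloride, where the "salt" word is part of the drug name itself:
# Spot-check the before/after pairs the salt stripping changed
changed = records.loc[records["med_clean"] != records["drug_core"],
                      ["med_clean", "drug_core"]].drop_duplicates()
print(changed.head(10))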
Step 3: Extracting Components with Regex
Now Elena uses capture groups to extract the drug name and dose separately:
import re
# Pattern: drug name (letters and hyphens), then dose (number + unit)
extracted = records["drug_core"].str.extract(
r"^(?P<drug_name>[a-z][a-z\s-]+?)\s*" # drug name
r"(?P<dose_amount>\d+\.?\d*)\s*" # dose number
r"(?P<dose_unit>mg|mcg|ml|g|units?)?" # dose unit
)
records["drug_name"] = extracted["drug_name"].str.strip()
records["dose_amount"] = pd.to_numeric(extracted["dose_amount"],
errors="coerce")
records["dose_unit"] = extracted["dose_unit"].fillna("mg")
print(records["drug_name"].nunique())
198
Down from 1,847 unique medication texts to 198 unique drug names. But Elena isn't done — she needs to verify these are all legitimate drug names and catch any misspellings.
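The extraction pattern also fails quietly on anything it can't parse, leaving drug_name empty. In keeping with the flag-rather-than-guess approach, a short check (a sketch) surfaces those rows:
# Rows the extraction pattern could not parse get a missing drug_name
unparsed = records[records["drug_name"].isna()]
print(f"{len(unparsed)} records could not be parsed")
print(unparsed["medication_text"].value_counts().head())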
Step 4: Catching Dangerous Look-Alikes
This is where the work becomes genuinely important for patient safety. Elena creates a reference list of known look-alike/sound-alike drug pairs:
lookalike_pairs = [
("metformin", "metoprolol"),
("hydroxyzine", "hydralazine"),
("clonidine", "klonopin"),
("losartan", "valsartan"),
("prednisone", "prednisolone"),
("glipizide", "glyburide"),
("tramadol", "trazodone"),
("bupropion", "buspirone"),
]
# For each pair, find records where one might be confused for the other
for drug_a, drug_b in lookalike_pairs:
    # Collect entries whose names resemble either drug in the pair
ambiguous = records[
records["drug_name"].str.contains(drug_a[:5], na=False) |
records["drug_name"].str.contains(drug_b[:5], na=False)
]
if len(ambiguous) > 0:
variants = ambiguous["drug_name"].value_counts()
        if len(variants) > 2:  # more names than the legitimate forms suggest
print(f"\nPotential confusion: {drug_a} / {drug_b}")
print(variants.head())
Potential confusion: metformin / metoprolol
metformin                      2627
metoprolol                      608
metformin extended-release      345
metoprolol extended-release     198
metforman                         3
Name: drug_name, dtype: int64
There it is. Three records with "metforman" — a misspelling. Is it a misspelling of "metformin" or "metoprolol"? They're different drugs with different purposes (metformin is for diabetes; metoprolol is for blood pressure). Elena flags these for pharmacist review rather than auto-correcting them — because the wrong assumption could mask a real medication error.
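For misspellings like this, string distance can help triage. Here is a sketch using Python's standard-library difflib (an addition for illustration, not Elena's code) that compares each unrecognized base name against the look-alike list and prints anything spelled suspiciously close to a known drug:
from difflib import get_close_matches

known_drugs = sorted({d for pair in lookalike_pairs for d in pair})
for name in records["drug_name"].dropna().unique():
    base = name.split()[0]
    if base not in known_drugs:
        close = get_close_matches(base, known_drugs, n=2, cutoff=0.8)
        if close:
            print(f"{base!r} is close to {close} -- flag for review")
# e.g. 'metforman' is close to ['metformin'] -- flag for review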
Step 5: Building the Canonical Mapping
Elena creates a master mapping from every observed variation to a canonical name:
# Build the mapping
canonical_map = {}
# Get all unique drug names
all_drugs = records["drug_name"].dropna().unique()
# Reference list of canonical names (from hospital formulary)
formulary = [
"metformin", "lisinopril", "atorvastatin", "omeprazole",
"amlodipine", "metoprolol", "hydrochlorothiazide",
"losartan", "atenolol", "simvastatin", "levothyroxine",
"pantoprazole", "gabapentin", "prednisone", "furosemide",
# ... (full formulary would have hundreds of entries)
]
# For each observed drug name, find the closest formulary match
for drug in all_drugs:
    # Match the first word (the base drug name) against the formulary
    base_drug = drug.split()[0]
if base_drug in formulary:
canonical_map[drug] = base_drug
else:
canonical_map[drug] = f"REVIEW: {drug}"
records["drug_canonical"] = records["drug_name"].map(canonical_map)
Any entry whose base name doesn't appear in the formulary gets flagged with "REVIEW:" for manual inspection. This is critical — in healthcare data, automated corrections can be dangerous. It's better to flag uncertainties than to silently make wrong assumptions.
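With the mapping applied, a couple of lines show the size of the review queue (illustrative usage of the columns built above):
# Count and preview the entries queued for manual review
review_mask = records["drug_canonical"].str.startswith("REVIEW", na=False)
print(f"{review_mask.sum()} records flagged for pharmacist review")
print(records.loc[review_mask, "drug_canonical"].value_counts().head())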
Step 6: The Dose Component
Elena also standardizes the dose text to enable numeric comparisons:
# Extract dose from original text
dose_extracted = records["medication_text"].str.extract(
r"(\d+\.?\d*)\s*(mg|mcg|ml|g|units?|iu)",
flags=re.IGNORECASE
)
dose_extracted.columns = ["amount", "unit"]
dose_extracted["amount"] = pd.to_numeric(
dose_extracted["amount"], errors="coerce")
dose_extracted["unit"] = dose_extracted["unit"].str.lower()
# Flag unusual doses (potential data entry errors)
# Each drug has a known dose range
dose_ranges = {
"metformin": (250, 2000),
"lisinopril": (2.5, 40),
"atorvastatin": (10, 80),
"amlodipine": (2.5, 10),
"metoprolol": (25, 200),
}
records["dose_amount"] = dose_extracted["amount"]
records["dose_unit"] = dose_extracted["unit"]
# Check for out-of-range doses
for drug, (low, high) in dose_ranges.items():
mask = (records["drug_canonical"] == drug) & (
(records["dose_amount"] < low) |
(records["dose_amount"] > high))
flagged = records[mask]
if len(flagged) > 0:
print(f"{drug}: {len(flagged)} records with "
f"unusual doses")
print(flagged[["medication_text", "dose_amount"]]
.head(3))
print()
metformin: 7 records with unusual doses
       medication_text  dose_amount
1234  metformin 5000mg       5000.0
2891   Metformin 50 mg         50.0
3102     metformin 5mg          5.0

lisinopril: 3 records with unusual doses
       medication_text  dose_amount
456   lisinopril 100mg        100.0
1678  Lisinopril 0.5mg          0.5
4023  lisinopril 50 mg         50.0
These out-of-range doses are potential errors. "Metformin 5000mg" could be a typo for 500mg. "Lisinopril 100mg" doesn't exist as a standard dose. Each of these gets flagged for pharmacist review.
The Results
Elena delivers a cleaned dataset with:
- drug_canonical: A standardized drug name matching the hospital formulary
- dose_amount: Numeric dose value
- dose_unit: Standardized unit
- review_flags: A column listing any concerns (misspelling, unusual dose, look-alike match); a sketch of how this column can be assembled follows
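The excerpts above never showed review_flags being constructed. A minimal sketch, assembled from the two checks already in place (the "REVIEW:" prefix and the dose-range mask; the flag wording here is an illustrative assumption):
# Assemble review_flags from the earlier checks (a sketch)
flags = pd.Series(pd.NA, index=records.index, dtype="object")

# Names that failed the formulary match during canonical mapping
unmatched = records["drug_canonical"].str.startswith("REVIEW", na=False)
flags[unmatched] = "possible misspelling"

# Doses outside the known range for their drug
for drug, (low, high) in dose_ranges.items():
    bad_dose = (records["drug_canonical"] == drug) & (
        (records["dose_amount"] < low) | (records["dose_amount"] > high))
    flags[bad_dose] = "unusual dose"

records["review_flags"] = flags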
summary = records.groupby("drug_canonical").agg(
total_records=("record_id", "count"),
unique_texts=("medication_text", "nunique"),
flagged=("review_flags", lambda x: x.notna().sum())
).sort_values("total_records", ascending=False).head(10)
print(summary)
                total_records  unique_texts  flagged
drug_canonical
metformin                2972            15       10
atorvastatin             1168            11        2
lisinopril               1019             9        3
metoprolol                806             8        4
omeprazole                610             6        2
amlodipine                530             7        1
...
The pharmacy department now has a clean dataset where "metformin" is always "metformin," regardless of whether the technician typed "Metformin HCl 500mg," "METFORMIN HYDROCHLORIDE 500MG," or "metformin 500 mg tablet." And the flagged records — misspellings, unusual doses, look-alike confusions — are queued for human review.
The Takeaway
This case study is different from most data cleaning stories because the stakes are tangible. A misspelled drug name isn't just a messy data point — it could represent a medication error that affected a patient. That reality shaped Elena's approach in several important ways:
1. Conservative automation, aggressive flagging. Elena automated the straightforward cases (case normalization, abbreviation expansion, salt form removal) but flagged anything ambiguous for human review. In healthcare data, a false positive (flagging something that's fine) is much safer than a false negative (silently approving something wrong).
2. Domain knowledge matters. Elena's regex patterns were informed by pharmaceutical naming conventions — she knew that drug names are followed by salt forms, then doses, then formulations. Without that domain knowledge, the patterns would have been less effective.
3. Look-alike/sound-alike awareness. The most dangerous text data errors aren't the obviously wrong ones — they're the plausibly wrong ones. "Metforman" is obviously a typo. But if someone typed "metoprolol" when they meant "metformin," there's no string method or regex that would catch it. That's why Elena built the look-alike pair check.
4. Validation through dose ranges. Text cleaning isn't just about the text. By extracting numeric doses and comparing them to known ranges, Elena caught data entry errors that would have been invisible to purely text-based cleaning.
5. Every cleaning decision is documented. Elena's code includes comments explaining why each pattern was chosen, which abbreviations are expanded, and why certain entries are flagged rather than auto-corrected. When the pharmacy team asks "why was this record flagged?", the code provides the answer.
The tools are the same ones you learned in Chapter 10 — str.lower(), str.replace(), str.extract(), regex patterns with capture groups. But the stakes remind us that data cleaning isn't just a technical exercise. It's a responsibility. The data we clean becomes the basis for decisions that affect real people, and we owe it to them to clean it carefully, transparently, and honestly.