Case Study 1: Parsing Messy Survey Responses — Extracting Meaning from Free Text
Tier 3 — Illustrative/Composite Example: This case study uses a fictional but realistic dataset modeled on the kinds of free-text survey responses that public health departments routinely collect. The survey, respondent names, and all data are invented for pedagogical purposes. The text cleaning challenges described — inconsistent capitalization, abbreviations, misspellings, and mixed-format answers — are representative of real free-text data challenges documented in public health informatics literature.
The Setting
Priya, our sports journalist from Chapter 1, has taken on a side project. Her city's public health department conducted a survey asking residents about their experiences with COVID-19 vaccination, and the results have landed on her desk because — in her editor's words — "you're the one who knows Python."
The survey had a mix of structured questions (multiple choice, yes/no) and free-text questions. The structured questions were easy: the data arrived in clean columns with predictable values. But three questions were open-ended:
- "Which vaccine did you receive?" (free text)
- "Where did you receive your vaccine?" (free text)
- "What concerns, if any, do you have about vaccination?" (free text)
The survey collected 2,400 responses. The free-text fields are a mess. Priya's job is to extract structured, analyzable data from the chaos.
The Data
Priya loads the data and takes her first look:
import pandas as pd
df = pd.read_csv("vaccination_survey.csv")
print(df.shape)
print(df.columns.tolist())
(2400, 12)
['respondent_id', 'age', 'zip_code', 'vaccine_received',
'vaccine_date', 'location_text', 'concerns_text',
'doses_received', 'would_recommend', 'gender',
'income_bracket', 'survey_date']
She examines the vaccine_received column:
df["vaccine_received"].value_counts().head(15)
Pfizer 312
pfizer 287
Moderna 245
moderna 198
PFIZER 142
Pfizer-BioNTech 118
Johnson & Johnson 95
J&J 87
moderna vaccine 76
Pfizer BioNTech 64
pfizer biontech covid vaccine 52
don't remember 48
Johnson and Johnson 45
janssen 38
Moderna COVID-19 35
Name: vaccine_received, dtype: int64
Fifteen rows shown, and three of the top five — "Pfizer," "pfizer," and "PFIZER" — are clearly the same vaccine. Priya counts unique values:
df["vaccine_received"].nunique()
147
One hundred and forty-seven unique values for what should be, at most, five or six vaccine manufacturers. This is the quintessential text data challenge.
Step 1: Standardize First, Classify Second
Priya knows from Chapter 10 that the first move is always general standardization — strip whitespace, normalize case, collapse spaces:
df["vaccine_clean"] = (df["vaccine_received"]
                       .str.strip()
                       .str.lower()
                       .str.replace(r"\s+", " ", regex=True))
print(f"Unique values after basic cleaning: {df['vaccine_clean'].nunique()}")
Unique values after basic cleaning: 89
From 147 to 89 — a big improvement, but still far from the target of five or six. The remaining variations are substantive: abbreviations ("j&j"), brand names with and without the parent company ("pfizer" vs. "pfizer-biontech"), and extra descriptors ("moderna vaccine," "moderna covid-19 vaccine").
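The effect of each standardization step is easiest to see on a tiny example. This toy Series (values invented for illustration) shows how stripping, lowercasing, and whitespace-collapsing fold superficially different strings together:

```python
import pandas as pd

# Toy sample (invented) showing how each cleaning step collapses variants.
raw = pd.Series(["  Pfizer ", "PFIZER", "pfizer   biontech", "Moderna", "moderna"])

clean = (raw
         .str.strip()                              # drop leading/trailing spaces
         .str.lower()                              # fold case
         .str.replace(r"\s+", " ", regex=True))    # collapse internal space runs

print(raw.nunique())    # 5 -- every raw string is distinct
print(clean.nunique())  # 3 -- "pfizer", "pfizer biontech", "moderna"
```

Only the substantive differences — extra words, abbreviations, brand-name variants — survive this pass, which is exactly what the classification step is for.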
Step 2: Pattern-Based Classification
Priya decides to classify each response into a standard vaccine name using keyword matching. She writes patterns for each manufacturer:
import re

def classify_vaccine(text):
    if pd.isna(text):
        return "unknown"
    if re.search(r"pfizer|biontech|bnt162|comirnaty", text):
        return "Pfizer-BioNTech"
    elif re.search(r"moderna|mrna-1273|spikevax", text):
        return "Moderna"
    elif re.search(r"janssen|johnson|j&j|j & j", text):
        return "Johnson & Johnson"
    elif re.search(r"astra\s*zeneca|oxford|azd1222|vaxzevria", text):
        return "AstraZeneca"
    elif re.search(r"sinovac|coronavac", text):
        return "Sinovac"
    elif re.search(r"don'?t\s*(know|remember)|unsure|forgot|not sure|can'?t recall", text):
        return "Unknown (respondent unsure)"
    else:
        return "Other/Unclassified"
df["vaccine_standard"] = df["vaccine_clean"].apply(classify_vaccine)
print(df["vaccine_standard"].value_counts())
Pfizer-BioNTech 1021
Moderna 682
Johnson & Johnson 312
Unknown (respondent unsure) 158
AstraZeneca 89
Other/Unclassified 78
Sinovac 35
unknown 25
Name: vaccine_standard, dtype: int64
From 147 unique values down to 8 categories. Priya examines the "Other/Unclassified" entries to see if she missed any patterns:
other = df[df["vaccine_standard"] == "Other/Unclassified"]
other["vaccine_clean"].value_counts().head(10)
flu shot 12
covaxin 9
novavax 8
sputnik v 7
sinopharm 6
not vaccinated 5
the one at the pharmacy 4
idk the blue one 3
3
mRNA one 2
Name: vaccine_clean, dtype: int64
Some of these are legitimate vaccine names she hadn't included (Covaxin, Novavax, Sputnik V, Sinopharm). Others are genuinely unhelpful responses ("the one at the pharmacy," "idk the blue one"). Priya adds the missing vaccines to her function and moves the rest to an "Other" category. This is the iterative nature of text cleaning — you write a first pass, inspect the residual, and refine.
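Her refined second pass might look something like this — a sketch, not the chapter's exact code: the branch order, the "Not vaccinated" label, and the alternate spellings in the new patterns are illustrative choices.

```python
import re
import pandas as pd

def classify_vaccine(text):
    """Second pass: adds the vaccines found while inspecting the residual."""
    if pd.isna(text):
        return "unknown"
    if re.search(r"pfizer|biontech|bnt162|comirnaty", text):
        return "Pfizer-BioNTech"
    elif re.search(r"moderna|mrna-1273|spikevax", text):
        return "Moderna"
    elif re.search(r"janssen|johnson|j&j|j & j", text):
        return "Johnson & Johnson"
    elif re.search(r"astra\s*zeneca|oxford|azd1222|vaxzevria", text):
        return "AstraZeneca"
    elif re.search(r"sinovac|coronavac", text):
        return "Sinovac"
    # New branches added after inspecting "Other/Unclassified":
    elif re.search(r"covaxin", text):
        return "Covaxin"
    elif re.search(r"novavax", text):
        return "Novavax"
    elif re.search(r"sputnik", text):
        return "Sputnik V"
    elif re.search(r"sinopharm", text):
        return "Sinopharm"
    elif re.search(r"not vaccinated|unvaccinated", text):
        return "Not vaccinated"
    elif re.search(r"don'?t\s*(?:know|remember)|unsure|forgot|not sure|can'?t recall", text):
        return "Unknown (respondent unsure)"
    else:
        return "Other/Unclassified"
```

Re-running `.apply(classify_vaccine)` after each refinement shrinks the residual until what's left is genuinely unclassifiable.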
Step 3: Tackling the Location Column
The location_text column is equally messy:
df["location_text"].value_counts().head(10)
CVS 234
Walgreens 189
cvs pharmacy 156
walgreens 134
Hospital 112
Local pharmacy 98
Doctor's office 87
CVS Pharmacy 76
Walmart 72
my doctor 65
Name: location_text, dtype: int64
Priya applies the same strategy — standardize, then classify:
df["location_clean"] = (df["location_text"]
                        .str.strip()
                        .str.lower()
                        .str.replace(r"\s+", " ", regex=True))

def classify_location(text):
    if pd.isna(text):
        return "unknown"
    if re.search(r"cvs", text):
        return "CVS Pharmacy"
    elif re.search(r"walgreen", text):
        return "Walgreens"
    elif re.search(r"walmart|sam'?s club", text):
        return "Walmart"
    elif re.search(r"hospital|medical center|med center", text):
        return "Hospital"
    elif re.search(r"doctor|physician|clinic|primary care", text):
        return "Doctor/Clinic"
    elif re.search(r"pharmacy|drug\s?store|rite.?aid|kroger", text):
        return "Other Pharmacy"
    elif re.search(r"work|employer|office", text):
        return "Workplace"
    elif re.search(r"mass\s?vax|convention|stadium|fair", text):
        return "Mass Vaccination Site"
    else:
        return "Other"
df["location_category"] = df["location_clean"].apply(classify_location)
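Once both columns are classified, the payoff is that they can be combined in ordinary pandas operations — for instance, cross-tabulating vaccine against location type with `pd.crosstab`. A minimal sketch on an invented toy frame (the real `df` would substitute in):

```python
import pandas as pd

# Toy frame standing in for the cleaned survey (values invented).
toy = pd.DataFrame({
    "vaccine_standard":  ["Pfizer-BioNTech", "Moderna", "Pfizer-BioNTech", "Moderna"],
    "location_category": ["CVS Pharmacy", "CVS Pharmacy", "Hospital", "Walgreens"],
})

# Counts of each standardized vaccine at each location category.
counts = pd.crosstab(toy["location_category"], toy["vaccine_standard"])
print(counts)
```

None of this was possible with 147 raw vaccine strings and an equally messy location column.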
Step 4: Mining the Concerns Column
The concerns column is the most challenging because responses are longer and more varied:
df["concerns_text"].dropna().head(5).tolist()
["I'm worried about long-term side effects that we don't know about yet",
"None, I think vaccines are great",
"had a bad reaction to my first dose - fever and chills for 2 days",
"concerned about fertility issues, heard rumors on social media",
"no concerns"]
Rather than classifying each response into a single category, Priya creates binary flags for common themes:
concerns_clean = df["concerns_text"].str.strip().str.lower()

df["concern_side_effects"] = concerns_clean.str.contains(
    r"side effect|reaction|fever|pain|chills|tired|sore",
    na=False)
df["concern_long_term"] = concerns_clean.str.contains(
    r"long.?term|future|years|don'?t know (?:what|the)",
    na=False)
df["concern_fertility"] = concerns_clean.str.contains(
    r"fertil|pregnan|baby|child|reproductive",
    na=False)
df["concern_trust"] = concerns_clean.str.contains(
    r"trust|government|rushed|conspiracy|big pharma",
    na=False)
df["concern_none"] = concerns_clean.str.contains(
    r"^(?:none|no concerns|n/?a|nothing|nope|all good)",
    na=False)
Now she can quantify concerns:
concern_cols = [c for c in df.columns if c.startswith("concern_")]
df[concern_cols].sum().sort_values(ascending=False)
concern_none 892
concern_side_effects 534
concern_long_term 412
concern_trust 287
concern_fertility 156
dtype: int64
These numbers add up to more than 2,400 because some responses mention multiple concerns. That's correct and expected — the binary flag approach allows for multi-label categorization.
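Because the flags are plain booleans, multi-label questions reduce to boolean arithmetic: `&` finds responses mentioning both themes, `|` finds responses mentioning either. A sketch on invented toy values (column names from the case study):

```python
import pandas as pd

# Toy flags standing in for two of the concern columns (values invented).
toy = pd.DataFrame({
    "concern_side_effects": [True,  True,  False, False],
    "concern_long_term":    [True,  False, True,  False],
})

# Responses mentioning BOTH side effects and long-term worries.
both = (toy["concern_side_effects"] & toy["concern_long_term"]).sum()

# Responses mentioning at least one of the two.
either = (toy["concern_side_effects"] | toy["concern_long_term"]).sum()

print(both, either)   # 1 3
```

The same idiom on the real columns would tell Priya, say, how often side-effect worries co-occur with long-term worries — a question a single-category scheme can't answer.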
The Takeaway
Priya started with three columns of free text — 2,400 responses of unstructured, inconsistent human language. Through a systematic process of standardization, pattern matching, and iterative refinement, she extracted structured data that can be analyzed, grouped, charted, and reported.
The key insights:
- Always standardize first. Stripping whitespace and normalizing case eliminates the majority of "false" unique values.
- Use regex for classification, not perfection. The goal isn't to parse every possible variation of "Pfizer-BioNTech COVID-19 Vaccine." It's to correctly classify each response into an actionable category. A few "Other" entries are acceptable.
- Inspect your residuals. After classification, always look at what fell through to "Other/Unclassified." You'll usually find patterns you missed that are easy to add.
- Binary flags work for multi-topic text. When responses can contain multiple themes, create separate boolean columns instead of forcing each response into a single category.
- Document your patterns. Priya added comments explaining why each regex pattern was chosen. When a colleague asks "why is 'the blue one' classified as Other?", the documentation answers the question.
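One way to make that documentation habit concrete — a possible refactor, not the chapter's code — is to pull the patterns out of the if/elif chain into an ordered list of (label, pattern) pairs, where each entry carries its own comment:

```python
import re
import pandas as pd

# Order matters: the first matching pattern wins, just like an if/elif chain.
VACCINE_PATTERNS = [
    # Brand, parent company, trial code, and trade name all map to one label.
    ("Pfizer-BioNTech",   r"pfizer|biontech|bnt162|comirnaty"),
    ("Moderna",           r"moderna|mrna-1273|spikevax"),
    ("Johnson & Johnson", r"janssen|johnson|j&j"),
    # "Don't know" variants, with optional apostrophes.
    ("Unknown (respondent unsure)", r"don'?t\s*(?:know|remember)|unsure|not sure"),
]

def classify(text, patterns, default="Other/Unclassified"):
    """Return the label of the first pattern that matches, else the default."""
    if pd.isna(text):
        return "unknown"
    for label, pattern in patterns:
        if re.search(pattern, text):
            return label
    return default
```

Applying it is one line — `df["vaccine_standard"] = df["vaccine_clean"].apply(lambda t: classify(t, VACCINE_PATTERNS))` — and adding a newly discovered vaccine means adding one commented pair, not editing control flow.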
Free-text data is messy, but it's also rich. The techniques Priya used here — the same techniques you learned in Chapter 10 — are the bridge between messy human language and clean, analyzable data.