Case Study 1: Parsing Messy Survey Responses — Extracting Meaning from Free Text
Tier 3 — Illustrative/Composite Example: This case study uses a fictional but realistic dataset modeled on the kinds of free-text survey responses that public health departments routinely collect. The survey, respondent names, and all data are invented for pedagogical purposes. The text cleaning challenges described — inconsistent capitalization, abbreviations, misspellings, and mixed-format answers — are representative of real free-text data challenges documented in public health informatics literature.
The Setting
Priya, our sports journalist from Chapter 1, has taken on a side project. Her city's public health department conducted a survey asking residents about their experiences with COVID-19 vaccination, and the results have landed on her desk because — in her editor's words — "you're the one who knows Python."
The survey had a mix of structured questions (multiple choice, yes/no) and free-text questions. The structured questions were easy: the data arrived in clean columns with predictable values. But three questions were open-ended:
- "Which vaccine did you receive?" (free text)
- "Where did you receive your vaccine?" (free text)
- "What concerns, if any, do you have about vaccination?" (free text)
The survey collected 2,400 responses. The free-text fields are a mess. Priya's job is to extract structured, analyzable data from the chaos.
The Data
Priya loads the data and takes her first look:
import pandas as pd
df = pd.read_csv("vaccination_survey.csv")
print(df.shape)
print(df.columns.tolist())
(2400, 12)
['respondent_id', 'age', 'zip_code', 'vaccine_received',
'vaccine_date', 'location_text', 'concerns_text',
'doses_received', 'would_recommend', 'gender',
'income_bracket', 'survey_date']
She examines the vaccine_received column:
df["vaccine_received"].value_counts().head(15)
Pfizer 312
pfizer 287
Moderna 245
moderna 198
PFIZER 142
Pfizer-BioNTech 118
Johnson & Johnson 95
J&J 87
moderna vaccine 76
Pfizer BioNTech 64
pfizer biontech covid vaccine 52
don't remember 48
Johnson and Johnson 45
janssen 38
Moderna COVID-19 35
Name: vaccine_received, dtype: int64
Fifteen rows shown, and three of the top five — "Pfizer," "pfizer," and "PFIZER" — are clearly the same vaccine. Priya counts unique values:
df["vaccine_received"].nunique()
147
One hundred and forty-seven unique values for what should be, at most, five or six vaccine manufacturers. This is the quintessential text data challenge.
Step 1: Standardize First, Classify Second
Priya knows from Chapter 10 that the first move is always general standardization — strip whitespace, normalize case, collapse spaces:
df["vaccine_clean"] = (df["vaccine_received"]
                       .str.strip()
                       .str.lower()
                       .str.replace(r"\s+", " ", regex=True))
print(f"Unique values after basic cleaning: {df['vaccine_clean'].nunique()}")
Unique values after basic cleaning: 89
From 147 to 89 — a big improvement, but still far from the target of five or six. The remaining variations are substantive: abbreviations ("j&j"), brand names with and without the parent company ("pfizer" vs. "pfizer-biontech"), and extra descriptors ("moderna vaccine," "moderna covid-19 vaccine").
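The effect of each standardization step is easiest to see on a tiny example. This toy Series (values invented for illustration) shows how stripping, lowercasing, and whitespace-collapsing fold superficially different strings together:

```python
import pandas as pd

# Toy sample (invented) showing how each cleaning step collapses variants.
raw = pd.Series(["  Pfizer ", "PFIZER", "pfizer   biontech", "Moderna", "moderna"])

clean = (raw
         .str.strip()                              # drop leading/trailing spaces
         .str.lower()                              # fold case
         .str.replace(r"\s+", " ", regex=True))    # collapse internal space runs

print(raw.nunique())    # 5 -- every raw string is distinct
print(clean.nunique())  # 3 -- "pfizer", "pfizer biontech", "moderna"
```

Only the substantive differences — extra words, abbreviations, brand-name variants — survive this pass, which is exactly what the classification step is for.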
Step 2: Pattern-Based Classification
Priya decides to classify each response into a standard vaccine name using keyword matching. She writes patterns for each manufacturer:
import re

def classify_vaccine(text):
    if pd.isna(text):
        return "unknown"
    if re.search(r"pfizer|biontech|bnt162|comirnaty", text):
        return "Pfizer-BioNTech"
    elif re.search(r"moderna|mrna-1273|spikevax", text):
        return "Moderna"
    elif re.search(r"janssen|johnson|j&j|j & j", text):
        return "Johnson & Johnson"
    elif re.search(r"astra\s*zeneca|oxford|azd1222|vaxzevria", text):
        return "AstraZeneca"
    elif re.search(r"sinovac|coronavac", text):
        return "Sinovac"
    elif re.search(r"don'?t\s*(know|remember)|unsure|forgot|not sure|can'?t recall", text):
        return "Unknown (respondent unsure)"
    else:
        return "Other/Unclassified"
df["vaccine_standard"] = df["vaccine_clean"].apply(classify_vaccine)
print(df["vaccine_standard"].value_counts())
Pfizer-BioNTech 1021
Moderna 682
Johnson & Johnson 312
Unknown (respondent unsure) 158
AstraZeneca 89
Other/Unclassified 78
Sinovac 35
unknown 25
Name: vaccine_standard, dtype: int64
From 147 unique values down to 8 categories. Priya examines the "Other/Unclassified" entries to see if she missed any patterns:
other = df[df["vaccine_standard"] == "Other/Unclassified"]
other["vaccine_clean"].value_counts().head(10)
flu shot 12
covaxin 9
novavax 8
sputnik v 7
sinopharm 6
not vaccinated 5
the one at the pharmacy 4
idk the blue one 3
3
mRNA one 2
Name: vaccine_clean, dtype: int64
Some of these are legitimate vaccine names she hadn't included (Covaxin, Novavax, Sputnik V, Sinopharm). Others are genuinely unhelpful responses ("the one at the pharmacy," "idk the blue one"). Priya adds the missing vaccines to her function and moves the rest to an "Other" category. This is the iterative nature of text cleaning — you write a first pass, inspect the residual, and refine.
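Her refined second pass might look something like this — a sketch, not the chapter's exact code: the branch order, the "Not vaccinated" label, and the alternate spellings in the new patterns are illustrative choices.

```python
import re
import pandas as pd

def classify_vaccine(text):
    """Second pass: adds the vaccines found while inspecting the residual."""
    if pd.isna(text):
        return "unknown"
    if re.search(r"pfizer|biontech|bnt162|comirnaty", text):
        return "Pfizer-BioNTech"
    elif re.search(r"moderna|mrna-1273|spikevax", text):
        return "Moderna"
    elif re.search(r"janssen|johnson|j&j|j & j", text):
        return "Johnson & Johnson"
    elif re.search(r"astra\s*zeneca|oxford|azd1222|vaxzevria", text):
        return "AstraZeneca"
    elif re.search(r"sinovac|coronavac", text):
        return "Sinovac"
    # New branches added after inspecting "Other/Unclassified":
    elif re.search(r"covaxin", text):
        return "Covaxin"
    elif re.search(r"novavax", text):
        return "Novavax"
    elif re.search(r"sputnik", text):
        return "Sputnik V"
    elif re.search(r"sinopharm", text):
        return "Sinopharm"
    elif re.search(r"not vaccinated|unvaccinated", text):
        return "Not vaccinated"
    elif re.search(r"don'?t\s*(?:know|remember)|unsure|forgot|not sure|can'?t recall", text):
        return "Unknown (respondent unsure)"
    else:
        return "Other/Unclassified"
```

Re-running `.apply(classify_vaccine)` after each refinement shrinks the residual until what's left is genuinely unclassifiable.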
Step 3: Tackling the Location Column
The location_text column is equally messy:
df["location_text"].value_counts().head(10)
CVS 234
Walgreens 189
cvs pharmacy 156
walgreens 134
Hospital 112
Local pharmacy 98
Doctor's office 87
CVS Pharmacy 76
Walmart 72
my doctor 65
Name: location_text, dtype: int64
Priya applies the same strategy — standardize, then classify:
df["location_clean"] = (df["location_text"]
                        .str.strip()
                        .str.lower()
                        .str.replace(r"\s+", " ", regex=True))

def classify_location(text):
    if pd.isna(text):
        return "unknown"
    if re.search(r"cvs", text):
        return "CVS Pharmacy"
    elif re.search(r"walgreen", text):
        return "Walgreens"
    elif re.search(r"walmart|sam'?s club", text):
        return "Walmart"
    elif re.search(r"hospital|medical center|med center", text):
        return "Hospital"
    elif re.search(r"doctor|physician|clinic|primary care", text):
        return "Doctor/Clinic"
    elif re.search(r"pharmacy|drug\s?store|rite.?aid|kroger", text):
        return "Other Pharmacy"
    elif re.search(r"work|employer|office", text):
        return "Workplace"
    elif re.search(r"mass\s?vax|convention|stadium|fair", text):
        return "Mass Vaccination Site"
    else:
        return "Other"
df["location_category"] = df["location_clean"].apply(classify_location)
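Once both columns are classified, the payoff is that they can be combined in ordinary pandas operations — for instance, cross-tabulating vaccine against location type with `pd.crosstab`. A minimal sketch on an invented toy frame (the real `df` would substitute in):

```python
import pandas as pd

# Toy frame standing in for the cleaned survey (values invented).
toy = pd.DataFrame({
    "vaccine_standard":  ["Pfizer-BioNTech", "Moderna", "Pfizer-BioNTech", "Moderna"],
    "location_category": ["CVS Pharmacy", "CVS Pharmacy", "Hospital", "Walgreens"],
})

# Counts of each standardized vaccine at each location category.
counts = pd.crosstab(toy["location_category"], toy["vaccine_standard"])
print(counts)
```

None of this was possible with 147 raw vaccine strings and an equally messy location column.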
Step 4: Mining the Concerns Column
The concerns column is the most challenging because responses are longer and more varied:
df["concerns_text"].dropna().head(5).tolist()
["I'm worried about long-term side effects that we don't know about yet",
"None, I think vaccines are great",
"had a bad reaction to my first dose - fever and chills for 2 days",
"concerned about fertility issues, heard rumors on social media",
"no concerns"]
Rather than classifying each response into a single category, Priya creates binary flags for common themes:
concerns_clean = df["concerns_text"].str.strip().str.lower()

df["concern_side_effects"] = concerns_clean.str.contains(
    r"side effect|reaction|fever|pain|chills|tired|sore",
    na=False)
df["concern_long_term"] = concerns_clean.str.contains(
    r"long.?term|future|years|don'?t know (?:what|the)",
    na=False)
df["concern_fertility"] = concerns_clean.str.contains(
    r"fertil|pregnan|baby|child|reproductive",
    na=False)
df["concern_trust"] = concerns_clean.str.contains(
    r"trust|government|rushed|conspiracy|big pharma",
    na=False)
df["concern_none"] = concerns_clean.str.contains(
    r"^(?:none|no concerns|n/?a|nothing|nope|all good)",
    na=False)
Now she can quantify concerns:
concern_cols = [c for c in df.columns if c.startswith("concern_")]
df[concern_cols].sum().sort_values(ascending=False)
concern_none 892
concern_side_effects 534
concern_long_term 412
concern_trust 287
concern_fertility 156
dtype: int64
These numbers add up to more than 2,400 because some responses mention multiple concerns. That's correct and expected — the binary flag approach allows for multi-label categorization.
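Because the flags are plain booleans, multi-label questions reduce to boolean arithmetic: `&` finds responses mentioning both themes, `|` finds responses mentioning either. A sketch on invented toy values (column names from the case study):

```python
import pandas as pd

# Toy flags standing in for two of the concern columns (values invented).
toy = pd.DataFrame({
    "concern_side_effects": [True,  True,  False, False],
    "concern_long_term":    [True,  False, True,  False],
})

# Responses mentioning BOTH side effects and long-term worries.
both = (toy["concern_side_effects"] & toy["concern_long_term"]).sum()

# Responses mentioning at least one of the two.
either = (toy["concern_side_effects"] | toy["concern_long_term"]).sum()

print(both, either)   # 1 3
```

The same idiom on the real columns would tell Priya, say, how often side-effect worries co-occur with long-term worries — a question a single-category scheme can't answer.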
The Takeaway
Priya started with three columns of free text — 2,400 responses of unstructured, inconsistent human language. Through a systematic process of standardization, pattern matching, and iterative refinement, she extracted structured data that can be analyzed, grouped, charted, and reported.
The key insights:
- Always standardize first. Stripping whitespace and normalizing case eliminates the majority of "false" unique values.
- Use regex for classification, not perfection. The goal isn't to parse every possible variation of "Pfizer-BioNTech COVID-19 Vaccine." It's to correctly classify each response into an actionable category. A few "Other" entries are acceptable.
- Inspect your residuals. After classification, always look at what fell through to "Other/Unclassified." You'll usually find patterns you missed that are easy to add.
- Binary flags work for multi-topic text. When responses can contain multiple themes, create separate boolean columns instead of forcing each response into a single category.
- Document your patterns. Priya added comments explaining why each regex pattern was chosen. When a colleague asks "why is 'the blue one' classified as Other?", the documentation answers the question.
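One way to make that documentation habit concrete — a possible refactor, not the chapter's code — is to pull the patterns out of the if/elif chain into an ordered list of (label, pattern) pairs, where each entry carries its own comment:

```python
import re
import pandas as pd

# Order matters: the first matching pattern wins, just like an if/elif chain.
VACCINE_PATTERNS = [
    # Brand, parent company, trial code, and trade name all map to one label.
    ("Pfizer-BioNTech",   r"pfizer|biontech|bnt162|comirnaty"),
    ("Moderna",           r"moderna|mrna-1273|spikevax"),
    ("Johnson & Johnson", r"janssen|johnson|j&j"),
    # "Don't know" variants, with optional apostrophes.
    ("Unknown (respondent unsure)", r"don'?t\s*(?:know|remember)|unsure|not sure"),
]

def classify(text, patterns, default="Other/Unclassified"):
    """Return the label of the first pattern that matches, else the default."""
    if pd.isna(text):
        return "unknown"
    for label, pattern in patterns:
        if re.search(pattern, text):
            return label
    return default
```

Applying it is one line — `df["vaccine_standard"] = df["vaccine_clean"].apply(lambda t: classify(t, VACCINE_PATTERNS))` — and adding a newly discovered vaccine means adding one commented pair, not editing control flow.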
Free-text data is messy, but it's also rich. The techniques Priya used here — the same techniques you learned in Chapter 10 — are the bridge between messy human language and clean, analyzable data.