Chapter 10 Exercises: Working with Text Data
How to use these exercises: Work through the sections in order. Parts A-D focus on Chapter 10 material, building from recall to original analysis. Part E applies your skills to a new dataset. Part M mixes in concepts from earlier chapters to reinforce prior learning. You'll need Python, pandas, and the `re` module for most problems.
Difficulty key: 1-star: Foundational | 2-star: Intermediate | 3-star: Advanced | 4-star: Extension
Part A: Conceptual Understanding (1-star)
These questions check whether you absorbed the core ideas from the chapter. Write clear, concise answers.
Exercise 10.1 — Text vs. numbers
Explain why text data requires special handling in data analysis. Give three specific examples of how the same entity might appear as different text values in a real dataset, and explain why a computer would treat them as different.
Guidance
A strong answer mentions that computers compare text character-by-character, so "USA" and "United States" have zero characters in common at the same positions. Examples might include: country names ("USA" vs. "United States" vs. "U.S.A."), person names ("Dr. Smith" vs. "Smith, Dr." vs. "smith"), or product codes ("iPhone 14" vs. "iphone14" vs. "IPHONE 14"). Each variation produces a separate category in grouping operations, silently fragmenting your analysis.
Exercise 10.2 — The .str accessor
In your own words, explain what the .str accessor does in pandas and why it exists. What problem does it solve that regular Python string methods don't?
Guidance
The `.str` accessor allows you to apply string methods to every element in a pandas Series at once, without writing a loop. Regular Python string methods work on individual strings — you can call `"hello".upper()` but not `my_series.upper()`. The `.str` accessor bridges this gap: `my_series.str.upper()` applies `.upper()` to each value. It also handles `NaN` values gracefully (skipping them instead of raising an error), which is critical for real-world data with missing values.
Exercise 10.3 — Regex as a mini-language
Explain the threshold concept from this chapter — "regular expressions as a mini-language for describing patterns" — in your own words. Why is it more accurate to think of regex as a language than as a search tool?
Guidance
A search tool finds exact text (like Ctrl+F). Regex is a language because it has its own vocabulary (`\d`, `\w`, `.`), grammar (quantifiers, groups, alternation), and semantics (greedy vs. lazy matching). You don't search for a specific string — you *describe a pattern* of characters. For example, `\d{3}-\d{4}` doesn't search for "555-1234"; it describes the *structure* "three digits, a hyphen, four digits" — matching any string with that structure. This is fundamentally different from literal search.
Exercise 10.4 — When NOT to use regex
For each of the following tasks, state whether you would use a simple string method or a regular expression, and justify your choice:
- Converting a column of city names to lowercase
- Extracting 5-digit ZIP codes from addresses
- Removing leading and trailing whitespace
- Finding all entries that contain either "vaccine" or "vaccination"
- Replacing "NY" with "New York" in a city column
Guidance
1. **String method** (`.str.lower()`) — no pattern matching needed.
2. **Regex** (`.str.extract(r"(\d{5})")`) — you need to match a pattern (five consecutive digits), not a fixed string.
3. **String method** (`.str.strip()`) — simple, built-in, no pattern needed.
4. **Regex** (`.str.contains(r"vaccin(e|ation)", case=False)`) — you need alternation, though `str.contains("vaccin", case=False)` would also work since both words start with "vaccin".
5. **String method** (`.replace({"NY": "New York"})`) — this is a fixed mapping, not a pattern. However, be careful: `str.replace("NY", "New York")` would also replace "NY" inside words like "ANYTHING." Use word-boundary regex if needed.
Exercise 10.5 — Regex building blocks
Without running any code, describe in plain English what each of these regex patterns matches:
1. `\d{3}-\d{2}-\d{4}`
2. `^[A-Z]{2}$`
3. `\b[Dd]r\.?\s`
4. `\$\d+\.\d{2}`
5. `(\w+)@(\w+)\.(\w+)`
Guidance
1. Three digits, a hyphen, two digits, a hyphen, four digits (looks like a US Social Security Number format).
2. A string that is exactly two uppercase letters (the `^` and `$` anchors enforce this).
3. The word "Dr" or "dr" (word boundary ensures it's the start of a word), optionally followed by a period, followed by a space. Matches "Dr. " or "Dr " or "dr " etc.
4. A dollar sign, followed by one or more digits, a decimal point, and exactly two digits (a price like $12.99).
5. Three capture groups: one or more word characters, then @, then one or more word characters, then a dot, then one or more word characters. A simplified email pattern. Group 1 is the username, group 2 is the domain name, group 3 is the TLD.
Part B: Applied Problems (2-star)
These problems require you to write code using the techniques from Chapter 10.
Exercise 10.6 — Cleaning company names
You have the following Series of company names. Write a pandas pipeline to standardize them so that equivalent names become identical.
companies = pd.Series([
" Apple Inc. ", "apple", "APPLE INC",
"Microsoft Corp.", "microsoft corporation",
"MICROSOFT", "Google LLC", "GOOGLE",
"google llc", " Amazon.com, Inc. ",
"amazon", "AMAZON.COM"
])
Your cleaned result should have no more than 4 unique values. Show the value_counts() of your result.
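One possible pipeline, shown as a sketch (the suffix list here is illustrative; real datasets usually need a longer one):

```python
import pandas as pd

companies = pd.Series([
    " Apple Inc. ", "apple", "APPLE INC",
    "Microsoft Corp.", "microsoft corporation",
    "MICROSOFT", "Google LLC", "GOOGLE",
    "google llc", " Amazon.com, Inc. ",
    "amazon", "AMAZON.COM"
])

cleaned = (
    companies
    .str.strip()
    .str.lower()
    # drop ".com" before general punctuation removal, so it stays a whole token
    .str.replace(r"\.com\b", "", regex=True)
    # remove common legal suffixes (illustrative list), with an optional trailing period
    .str.replace(r"\b(?:inc|corp(?:oration)?|llc)\b\.?", "", regex=True)
    # remove leftover punctuation, then strip trailing spaces left by suffix removal
    .str.replace(r"[.,]", "", regex=True)
    .str.strip()
)
print(cleaned.value_counts())
```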
Guidance
Start with `.str.strip().str.lower()`. Then remove common suffixes like "inc.", "inc", "corp.", "corporation", "llc", ".com" using either `str.replace()` or a mapping. Finally, strip again to remove any trailing spaces left after suffix removal. The key insight is that standardization often requires multiple passes.
Exercise 10.7 — Extracting numbers from descriptions
Given this Series of product descriptions, extract the numeric value and the unit into separate columns:
products = pd.Series([
"500mg tablet",
"2.5mL oral solution",
"100mcg nasal spray",
"10 mg capsule",
"0.5 mL injection",
"250mg/5mL suspension"
])
Use str.extract() with named capture groups. Your result should have columns amount and unit.
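A sketch of the extraction (the unit alternation `mg|mcg|mL` is an assumption based on the six entries above; extend it for other units):

```python
import pandas as pd

products = pd.Series([
    "500mg tablet",
    "2.5mL oral solution",
    "100mcg nasal spray",
    "10 mg capsule",
    "0.5 mL injection",
    "250mg/5mL suspension"
])

# Named capture groups become the column names of the extracted DataFrame.
# Note: str.extract() returns only the FIRST match, so "250mg/5mL" yields 250 mg.
result = products.str.extract(r"(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>mg|mcg|mL)")
print(result)
```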
Guidance
The pattern needs to match a number (with an optional decimal part) followed by optional whitespace and a unit. Something like `r"(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>[a-zA-Z]+)"` handles the simple cases; note that `str.extract()` returns only the first match, so "250mg/5mL suspension" yields just "250" and "mg".
Exercise 10.8 — Filtering with .str.contains()
Create a DataFrame with 10 rows of fictional patient notes (as strings). Write code to:
- Find all notes mentioning fever (case-insensitive)
- Find all notes mentioning either "headache" or "migraine"
- Find all notes that do NOT mention any pain-related terms
- Count how many notes mention a specific medication name of your choice
Handle missing values appropriately in each case.
Guidance
Use `str.contains()` with `case=False` and `na=False`. For task 2, use alternation: `str.contains(r"headache|migraine", case=False, na=False)`. For task 3, use the `~` operator to negate: `~df["notes"].str.contains(r"pain|ache|sore", case=False, na=False)`. For task 4, use `.sum()` on the boolean result.
Exercise 10.9 — Splitting compound columns
A dataset has an address column with entries like:
addresses = pd.Series([
"123 Main St, Springfield, IL 62704",
"456 Oak Ave, Portland, OR 97201",
"789 Pine Rd, Austin, TX 78701",
"1010 Elm Blvd, Denver, CO 80201",
"555 Cedar Ln, Miami, FL 33101"
])
Split this column into four separate columns: street, city, state, and zip. You may use a combination of str.split() and str.extract().
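The single-regex route can be sketched with named groups, which become the column names directly:

```python
import pandas as pd

addresses = pd.Series([
    "123 Main St, Springfield, IL 62704",
    "456 Oak Ave, Portland, OR 97201",
    "789 Pine Rd, Austin, TX 78701",
    "1010 Elm Blvd, Denver, CO 80201",
    "555 Cedar Ln, Miami, FL 33101"
])

# street and city are "anything up to a comma"; state is two capital letters;
# zip is five digits anchored at the end of the string.
parts = addresses.str.extract(
    r"^(?P<street>.+),\s*(?P<city>.+),\s*(?P<state>[A-Z]{2})\s*(?P<zip>\d{5})$"
)
print(parts)
```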
Guidance
First split on `", "` with `expand=True` to get street, city, and state+zip. Then split the state+zip column on space. Alternatively, use a single regex: `r"^(.+),\s*(.+),\s*([A-Z]{2})\s*(\d{5})$"` with `str.extract()`.
Exercise 10.10 — Text replacement pipeline
You have a column of survey responses about vaccination experiences:
responses = pd.Series([
"Got my 1st dose at CVS pharmacy",
"Received 2nd shot at walgreens",
"3rd booster dose at Rite-Aid",
"first vaccine at local clinic",
"second jab at the hospital",
"Got vaccinated at walmart pharmacy"
])
Write a normalization pipeline that:
1. Converts to lowercase
2. Replaces "1st"/"first" with "dose_1", "2nd"/"second" with "dose_2", "3rd"/"third" with "dose_3"
3. Replaces "shot"/"jab"/"vaccine"/"vaccinated" with "vaccination"
4. Extracts the location (the text after "at")
Guidance
Chain `.str.lower()` with multiple `.str.replace()` calls. For the ordinal standardization, use regex: `.str.replace(r"\b(1st|first)\b", "dose_1", regex=True)`. For extracting the location, use `.str.extract(r"at\s+(.+)$")`.
Exercise 10.11 — Regex capture groups
Write regex patterns with capture groups to extract the following from each string. Test each pattern with re.search() or re.findall().
- Extract the year, month, and day from "2023-03-15" (three groups)
- Extract the first name and last name from "Smith, John" (two groups)
- Extract the area code and number from "(555) 123-4567" (two groups)
- Extract the protocol and domain from "https://www.example.com" (two groups)
- Extract the amount and currency from "$1,234.56 USD" (two groups)
Guidance
1. `r"(\d{4})-(\d{2})-(\d{2})"` — three groups of digits separated by hyphens.
2. `r"(\w+),\s*(\w+)"` — word characters before the comma (last name) and after (first name).
3. `r"\((\d{3})\)\s*(\d{3}-\d{4})"` — digits inside parentheses (area code), then remaining digits with hyphen.
4. `r"(https?)://(?:www\.)?(.+)"` — "http" or "https" in group 1, domain in group 2.
5. `r"\$([\d,]+\.\d{2})\s+(\w+)"` — digits/commas/decimal in group 1, currency code in group 2.
Exercise 10.12 — Greedy vs. lazy
Given the string:
html = '<div class="title">Hello World</div><div class="body">Content here</div>'
- Write a greedy regex that captures everything between the first `<div` and the last `</div>`.
- Write a lazy regex that captures just the content between the first pair of `>` and `<`.
- Explain in your own words when greedy matching is useful and when lazy matching is necessary.
Guidance
1. Greedy: `r"<div.*</div>"` matches from the first `<div` to the last `</div>`, because `.*` grabs as much of the string as possible.
2. Lazy: `r">(.*?)<"` stops at the first `<`, capturing just "Hello World".
3. Greedy matching is useful when you want the largest possible span between outer delimiters. Lazy matching is necessary when delimiters repeat, since a greedy match would overshoot past the first closing delimiter.
Exercise 10.13 — Building a text normalization function
Write a function called normalize_text that takes a pandas Series of strings and:
- Strips whitespace
- Converts to lowercase
- Removes all punctuation except hyphens and apostrophes
- Collapses multiple spaces into one
- Strips again (in case removing punctuation left edge spaces)
Test it on a Series of 5 messy strings you create. Show the before and after.
Guidance
def normalize_text(series):
return (series
.str.strip()
.str.lower()
.str.replace(r"[^\w\s'-]", "", regex=True)
.str.replace(r"\s+", " ", regex=True)
.str.strip())
Be thoughtful about which punctuation to keep. Hyphens are important in names like "Pfizer-BioNTech" and apostrophes matter in "Côte d'Ivoire." Test edge cases: What happens with `None`? With an empty string? With a string that's all punctuation?
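To answer those edge-case questions concretely, a quick check (the function is repeated here so the snippet runs on its own; `None` propagates as `NaN` through the `.str` methods, and an all-punctuation string collapses to an empty string):

```python
import pandas as pd

def normalize_text(series):
    return (series
            .str.strip()
            .str.lower()
            .str.replace(r"[^\w\s'-]", "", regex=True)  # keep hyphens and apostrophes
            .str.replace(r"\s+", " ", regex=True)       # collapse runs of whitespace
            .str.strip())

messy = pd.Series(["  Hello,   WORLD!! ", "Pfizer-BioNTech", "Côte d'Ivoire", None, "!!!"])
clean = normalize_text(messy)
print(clean)
```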
Part C: Real-World Scenarios (3-star)
These problems simulate real data challenges you might encounter professionally.
Exercise 10.14 — Cleaning medication names
A hospital database has these medication entries:
meds = pd.Series([
"Metformin HCl 500mg",
"metformin 500 mg tablet",
"METFORMIN HYDROCHLORIDE 500MG",
"Lisinopril 10mg",
"lisinopril 10 mg tablet",
"LISINOPRIL 10MG TAB",
"Atorvastatin 20mg",
"atorvastatin calcium 20 mg",
"ATORVASTATIN 20MG TABLET",
"Omeprazole 20mg capsule",
"omeprazole 20 mg cap",
"OMEPRAZOLE DR 20MG"
])
Write code to extract: (a) the drug name (just the generic name, no salts or formulations), (b) the dosage amount, and (c) the dosage unit. Create a clean DataFrame with columns drug, dose_amount, and dose_unit.
Guidance
Start by lowercasing. Extract the dose with a regex like `r"(\d+)\s*(mg|mcg|ml)"`. For the drug name, you might take the first word, or use a list of known drug names to match against. The trickiest part is handling "hydrochloride," "hcl," "calcium," "dr" (delayed-release), and "tab"/"tablet"/"cap"/"capsule" — these are all extra information that should be stripped. A pragmatic approach: after lowercasing, take the first word as the drug name, then extract the dose pattern.
Exercise 10.15 — Parsing semi-structured log entries
A web server log has entries like:
logs = pd.Series([
'192.168.1.1 - - [15/Mar/2023:09:30:45] "GET /api/data HTTP/1.1" 200 1234',
'10.0.0.5 - - [15/Mar/2023:09:31:02] "POST /api/submit HTTP/1.1" 201 567',
'192.168.1.1 - - [15/Mar/2023:09:31:15] "GET /api/users HTTP/1.1" 404 89',
'172.16.0.10 - - [15/Mar/2023:09:32:00] "GET /index.html HTTP/1.1" 200 4567',
'10.0.0.5 - - [15/Mar/2023:09:32:45] "DELETE /api/item/42 HTTP/1.1" 403 0'
])
Use str.extract() with named capture groups to parse each entry into columns: ip, datetime, method, path, status_code, and response_size.
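One way to assemble such a pattern, built in pieces for readability (shown on the first two log lines; this exact pattern is one possible sketch, not the only correct answer):

```python
import pandas as pd

logs = pd.Series([
    '192.168.1.1 - - [15/Mar/2023:09:30:45] "GET /api/data HTTP/1.1" 200 1234',
    '10.0.0.5 - - [15/Mar/2023:09:31:02] "POST /api/submit HTTP/1.1" 201 567',
])

pattern = (
    r'(?P<ip>[\d.]+) - - '        # client IP address
    r'\[(?P<datetime>[^\]]+)\] '  # everything between the square brackets
    r'"(?P<method>[A-Z]+) '       # HTTP method in uppercase
    r'(?P<path>\S+) [^"]+" '      # request path; protocol version is discarded
    r'(?P<status_code>\d{3}) '    # 3-digit status code
    r'(?P<response_size>\d+)'     # response size in bytes
)
parsed = logs.str.extract(pattern)
print(parsed)
```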
Guidance
The pattern should capture: an IP address (`[\d.]+`), the date in brackets (`\[([^\]]+)\]`), the HTTP method (`[A-Z]+`), the path (`/\S+`), the status code (`\d{3}`), and the response size (`\d+`). A complete pattern might look like: `r'(?P<ip>[\d.]+) - - \[(?P<datetime>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+) [^"]+" (?P<status_code>\d{3}) (?P<response_size>\d+)'`.
Exercise 10.16 — Survey response categorization
You have free-text survey responses to the question "Why haven't you been vaccinated?":
reasons = pd.Series([
"I don't trust the vaccine safety",
"Can't afford to miss work for side effects",
"No transportation to vaccination site",
"My doctor advised against it due to allergies",
"I don't believe COVID is dangerous",
"The nearest clinic is 50 miles away",
"I'm worried about long-term effects",
"Religious beliefs prevent me",
"I already had COVID so I have natural immunity",
"Too busy with work to schedule an appointment",
"I'm afraid of needles",
"Don't trust the government",
"Vaccine was not available in my area",
"I heard it causes infertility",
"My insurance doesn't cover it"
])
Using str.contains() with regex, categorize each response into one or more of these categories: "trust/safety concerns", "access/logistics", "medical reasons", "misinformation", "other". A response can fall into multiple categories. Create a DataFrame showing the response and its categories.
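A sketch of the flag-column approach, shown on a few of the responses (the patterns are illustrative starting points, not definitive category definitions):

```python
import pandas as pd

reasons = pd.Series([
    "I don't trust the vaccine safety",
    "No transportation to vaccination site",
    "My doctor advised against it due to allergies",
    "I heard it causes infertility",
])

# Illustrative category patterns; tune them against the full response set.
categories = {
    "trust_safety": r"trust|safety|afraid|worried|long-term|government|scared",
    "access_logistics": r"transport|miles|away|busy|schedule|available|afford|insurance",
    "medical": r"doctor|allerg|medical",
    "misinformation": r"infertil|dangerous|natural immunity|don't believe",
}

flags = pd.DataFrame({"response": reasons})
for name, pattern in categories.items():
    flags[name] = reasons.str.contains(pattern, case=False, na=False)

# Anything that matched no category falls into "other"
flags["other"] = ~flags[list(categories)].any(axis=1)
print(flags)
```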
Guidance
Define regex patterns for each category:
- trust/safety: `r"trust|safety|afraid|worried|long-term|government|scared"`
- access/logistics: `r"transport|miles|away|busy|work|schedule|available|afford|insurance"`
- medical: `r"doctor|allerg|medical|health condition"`
- misinformation: `r"infertil|dangerous|natural immunity|don't believe"`
Create a boolean column for each category using `str.contains()`. Some responses may match multiple categories (e.g., "can't afford" is both access and financial). Others may match none — those get "other."
Exercise 10.17 — Address parsing and standardization
Write a comprehensive address standardization function that handles these variations:
addresses = pd.Series([
"123 Main Street, Apt 4B, Springfield, IL 62704",
"456 oak ave., suite 200, portland, or 97201",
"789 PINE RD, AUSTIN TX 78701",
"1010 elm boulevard, #302, denver, co 80201",
"555 Cedar Lane, Miami FL 33101"
])
Your function should:
1. Standardize street suffixes ("Street" -> "St", "Avenue" -> "Ave", "Road" -> "Rd", "Boulevard" -> "Blvd", "Lane" -> "Ln")
2. Standardize apartment/suite notation ("Apt", "Suite", "#" -> "Unit")
3. Extract the ZIP code
4. Capitalize properly (title case for street/city, uppercase for state)
Guidance
This is a multi-step problem. First, clean and standardize case. Then use regex replacements for street suffixes: `.str.replace(r"\b(street|st\.?)\b", "St", case=False, regex=True)`. Extract the ZIP code with `str.extract(r"(\d{5})")`. For the state, extract the two-letter code before the ZIP: `str.extract(r"\b([A-Z]{2})\s*\d{5}")` (after uppercasing). This is a realistic problem — address standardization is one of the most common text data challenges in practice.
Exercise 10.18 — Finding patterns in social media text
Create a Series of 10 fictional social media posts about health topics. Then write code to:
- Extract all hashtags (words starting with #)
- Extract all mentions (words starting with @)
- Find all URLs (simplified: anything starting with http or https)
- Count the number of emoji-like patterns (sequences of punctuation like ":)", ":(", "<3")
- Remove all hashtags, mentions, and URLs to get just the "clean" text
Guidance
Hashtags: `str.findall(r"#\w+")`. Mentions: `str.findall(r"@\w+")`. URLs: `str.findall(r"https?://\S+")`. For cleaning, use `str.replace()` to remove each pattern type. Remember that `str.findall()` returns a list for each row — to get a flat count, use `str.findall(...).str.len()`.
Part D: Synthesis Problems (3-star / 4-star)
These problems require combining multiple concepts or thinking beyond the chapter material.
Exercise 10.19 — Regex crossword (3-star)
For each pair of regex patterns, find a string that matches BOTH patterns:
1. `r"^\d{3}"` AND `r"\d{3}$"` (but the string is longer than 3 characters)
2. `r"[aeiou]{2}"` AND `r"^[^aeiou]"`
3. `r"\bdata\b"` AND `r"science$"`
4. `r"^[A-Z]"` AND `r"[a-z]$"` AND `r"\d"`
Guidance
1. "123-456" — starts with 3 digits and ends with 3 digits, with other characters in between.
2. "beautiful" — starts with a consonant ('b') and contains two consecutive vowels ('ea').
3. "data science" — contains "data" as a whole word and ends with "science".
4. "Room42a" — starts with uppercase 'R', ends with lowercase 'a', contains a digit '4'.
Exercise 10.20 — Building a data validation function (3-star)
Write a function validate_column(series, rules) where rules is a dictionary mapping rule names to regex patterns. The function should return a DataFrame showing, for each value, which rules pass and which fail.
Test it with a column of email addresses and rules for:
- "has_at_sign": must contain @
- "has_domain": must have text after @
- "valid_tld": must end with .com, .org, .edu, or .net
- "no_spaces": must not contain any spaces
Guidance
def validate_column(series, rules):
results = pd.DataFrame({"value": series})
for rule_name, pattern in rules.items():
results[rule_name] = series.str.contains(pattern, na=False)
return results
For the "no_spaces" rule, use a negated approach: `results["no_spaces"] = ~series.str.contains(r"\s", na=True)`. This exercise combines text searching with DataFrame construction.
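A usage sketch with the email rules (the sample addresses and the TLD pattern are illustrative; the function is repeated so the snippet runs on its own):

```python
import pandas as pd

def validate_column(series, rules):
    results = pd.DataFrame({"value": series})
    for rule_name, pattern in rules.items():
        results[rule_name] = series.str.contains(pattern, na=False)
    return results

emails = pd.Series(["alice@example.com", "bob@site", "no at sign", "c d@x.org"])
rules = {
    "has_at_sign": r"@",
    "has_domain": r"@\S+",                  # something non-space after the @
    "valid_tld": r"\.(?:com|org|edu|net)$", # non-capturing group avoids a pandas warning
}
report = validate_column(emails, rules)
# "no_spaces" needs negation, so it's added separately:
report["no_spaces"] = ~emails.str.contains(r"\s", na=True)
print(report)
```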
Exercise 10.21 — Reverse engineering regex (4-star)
For each of these regex patterns, provide three strings that MATCH and three that DON'T. Then explain in plain English what the pattern is designed to validate.
1. `r"^[A-Z]\d[A-Z]\s\d[A-Z]\d$"`
2. `r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$"`
3. `r"^(?:ISBN[- ]?)?(\d{9}[\dX])$"`
Guidance
1. Canadian postal codes (e.g., "K1A 0B1", "M5V 2T6"). Matches: "A1A 1A1", "K1A 0B1", "M5V 2T6". Doesn't match: "12345", "a1a 1a1", "K1A0B1" (no space).
2. IPv4 addresses (e.g., "192.168.1.1"). Matches: "192.168.1.1", "10.0.0.1", "255.255.255.0". Note that "999.999.999.999" also matches; the pattern checks syntax only, not the 0-255 range. Doesn't match: "192.168.1" (only 3 octets), "abc.def.ghi.jkl", "1.2.3.4.5" (5 octets).
3. ISBN-10 numbers. The `(?:ISBN[- ]?)?` is an optional non-capturing group for the "ISBN" prefix. Matches: "0123456789", "ISBN 0123456789", "012345678X". Doesn't match: "12345" (too short), "ISBN ABCDEFGHIJ" (not digits), "01234567890" (11 digits).
Exercise 10.22 — Complete text cleaning pipeline (4-star)
You are given a messy CSV file with 200 rows of product listings from an online marketplace. Each row has a title column with entries like:
" BRAND NEW!! Apple iPhone 14 Pro Max - 256GB - Space Black (Unlocked)!! "
"samsung Galaxy S23 ultra 512 gb phantom black SEALED"
"Google Pixel 7 Pro, 128gb, Obsidian, Factory Unlocked, NIB"
Design and implement a complete text cleaning pipeline that extracts:
- brand (Apple, Samsung, Google, etc.)
- model (iPhone 14 Pro Max, Galaxy S23 Ultra, etc.)
- storage (256GB, 512GB, etc. — standardized format)
- color (standardized)
- condition_tags (list of tags like "unlocked", "sealed", "new")
Write the pipeline as a set of functions, each handling one extraction task. Document your regex patterns with comments. Create at least 10 test entries and show the cleaned output.
Guidance
This is a substantial project. Break it into steps: (1) normalize text (strip, lowercase, collapse spaces), (2) extract storage with `r"(\d+)\s*gb"`, (3) identify brand with a dictionary of known brands, (4) extract condition tags with alternation, (5) for the model, this is the hardest part — you may need brand-specific patterns. The key learning: real text extraction is iterative. You'll write a pattern, test it, find edge cases, and refine. Document which cases your pipeline handles and which it doesn't.
Part M: Mixed Review (2-star)
These problems revisit concepts from earlier chapters, mixed with Chapter 10 material.
Exercise 10.23 — Combining text cleaning with data types (Ch. 3, 8, 10)
A column contains values like "$1,234.56", "$89.00", " $12,345 ", "N/A", and "$0.00". Write code that:
- Strips whitespace (Ch. 10)
- Identifies which values are "N/A" and replaces them with NaN (Ch. 8)
- Removes the dollar sign and commas (Ch. 10)
- Converts to float (Ch. 3/8)
Chain these operations and show the resulting Series with its dtype.
Guidance
s.str.strip().replace("N/A", np.nan).str.replace(r"[\$,]", "", regex=True).astype(float)
The order matters: you must handle "N/A" before trying to convert to float, and you must remove `$` and `,` before conversion. This exercise connects text cleaning (Ch. 10) with type coercion (Ch. 8) and data types (Ch. 3).
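Put together as a runnable sketch (`np.nan` rather than `pd.NA` keeps the final `.astype(float)` safe):

```python
import numpy as np
import pandas as pd

s = pd.Series(["$1,234.56", "$89.00", "  $12,345 ", "N/A", "$0.00"])

cleaned = (
    s.str.strip()                            # Ch. 10: whitespace
     .replace("N/A", np.nan)                 # Ch. 8: mark explicit missing values
     .str.replace(r"[\$,]", "", regex=True)  # Ch. 10: drop $ and thousands commas
     .astype(float)                          # Ch. 3/8: coerce to numeric dtype
)
print(cleaned)
print(cleaned.dtype)
```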
Exercise 10.24 — Text cleaning meets groupby (Ch. 9, 10)
Create a DataFrame with a messy region column containing values like "North America", "north america", "N. America", "NORTH AMERICA", "South America", "south america", "S. America", "Europe", "EUROPE", "europe", and a sales column with random values. First clean the region column, then use groupby (Ch. 9) to compute the total sales per region. Show how the results differ before and after text cleaning.
Guidance
Before cleaning, groupby will produce many small groups ("North America" and "north america" are separate). After cleaning with `.str.strip().str.lower()` and mapping abbreviations, you'll have three clean groups with properly aggregated totals. The lesson: text cleaning isn't just cosmetic — it directly affects the correctness of your analysis.
Exercise 10.25 — Regex and missing data (Ch. 8, 10)
You have a DataFrame where some entries in a phone column are valid phone numbers and others are placeholder text:
phones = pd.Series([
"555-123-4567", "N/A", "000-000-0000",
"(555) 987-6543", "not provided",
"555.111.2222", None, "TBD",
"555-000-1111", "unknown"
])
Write code that:
1. Flags which entries are valid phone numbers (match a phone number pattern)
2. Flags which are explicit "missing" indicators ("N/A", "not provided", "TBD", "unknown")
3. Flags which are "suspicious" (all-zero patterns like "000-000-0000")
4. Creates a status column with values "valid", "missing", "suspicious", or "null"
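The four steps can be sketched as follows (the phone pattern and the "suspicious" rule are illustrative assumptions, not the only reasonable choices):

```python
import numpy as np
import pandas as pd

phones = pd.Series([
    "555-123-4567", "N/A", "000-000-0000",
    "(555) 987-6543", "not provided",
    "555.111.2222", None, "TBD",
    "555-000-1111", "unknown"
])

# 1. Valid: 3-3-4 digits with -, ., or space separators, optional (area code)
valid = phones.str.contains(r"^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}$", na=False)
# 2. Explicit missing-value markers
missing = phones.str.lower().isin(["n/a", "not provided", "tbd", "unknown"])
# 3. Suspicious: nothing but zeros and separators (would otherwise look valid)
suspicious = phones.str.contains(r"^[0()\-.\s]+$", na=False)
# 4. Combine; suspicious outranks valid, and true nulls fall through to "null"
status = np.select(
    [suspicious, valid, missing],
    ["suspicious", "valid", "missing"],
    default="null",
)
print(pd.DataFrame({"phone": phones, "status": status}))
```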