Chapter 10 Exercises: Working with Text Data
How to use these exercises: Work through the sections in order. Parts A-D focus on Chapter 10 material, building from recall to original analysis. Part E applies your skills to a new dataset. Part M mixes in concepts from earlier chapters to reinforce prior learning. You'll need Python, pandas, and the `re` module for most problems.
Difficulty key: 1-star: Foundational | 2-star: Intermediate | 3-star: Advanced | 4-star: Extension
Part A: Conceptual Understanding (1-star)
These questions check whether you absorbed the core ideas from the chapter. Write clear, concise answers.
Exercise 10.1 — Text vs. numbers
Explain why text data requires special handling in data analysis. Give three specific examples of how the same entity might appear as different text values in a real dataset, and explain why a computer would treat them as different.
Guidance
A strong answer mentions that computers compare text character-by-character, so "USA" and "United States" have zero characters in common at the same positions. Examples might include: country names ("USA" vs. "United States" vs. "U.S.A."), person names ("Dr. Smith" vs. "Smith, Dr." vs. "smith"), or product codes ("iPhone 14" vs. "iphone14" vs. "IPHONE 14"). Each variation produces a separate category in grouping operations, silently fragmenting your analysis.
Exercise 10.2 — The .str accessor
In your own words, explain what the .str accessor does in pandas and why it exists. What problem does it solve that regular Python string methods don't?
Guidance
The `.str` accessor allows you to apply string methods to every element in a pandas Series at once, without writing a loop. Regular Python string methods work on individual strings — you can call `"hello".upper()` but not `my_series.upper()`. The `.str` accessor bridges this gap: `my_series.str.upper()` applies `.upper()` to each value. It also handles `NaN` values gracefully (skipping them instead of raising an error), which is critical for real-world data with missing values.
Exercise 10.3 — Regex as a mini-language
Explain the threshold concept from this chapter — "regular expressions as a mini-language for describing patterns" — in your own words. Why is it more accurate to think of regex as a language than as a search tool?
Guidance
A search tool finds exact text (like Ctrl+F). Regex is a language because it has its own vocabulary (`\d`, `\w`, `.`), grammar (quantifiers, groups, alternation), and semantics (greedy vs. lazy matching). You don't search for a specific string — you *describe a pattern* of characters. For example, `\d{3}-\d{4}` doesn't search for "555-1234"; it describes the *structure* "three digits, a hyphen, four digits" — matching any string with that structure. This is fundamentally different from literal search.
Exercise 10.4 — When NOT to use regex
For each of the following tasks, state whether you would use a simple string method or a regular expression, and justify your choice:
- Converting a column of city names to lowercase
- Extracting 5-digit ZIP codes from addresses
- Removing leading and trailing whitespace
- Finding all entries that contain either "vaccine" or "vaccination"
- Replacing "NY" with "New York" in a city column
Guidance
1. **String method** (`.str.lower()`) — no pattern matching needed.
2. **Regex** (`.str.extract(r"(\d{5})")`) — you need to match a pattern (five consecutive digits), not a fixed string.
3. **String method** (`.str.strip()`) — simple, built-in, no pattern needed.
4. **Regex** (`.str.contains(r"vaccin(e|ation)", case=False)`) — you need alternation, though `str.contains("vaccin", case=False)` would also work since both words start with "vaccin".
5. **String method** (`.replace({"NY": "New York"})`) — this is a fixed mapping, not a pattern. However, be careful: `str.replace("NY", "New York")` would also replace "NY" inside words like "ANYTHING." Use word-boundary regex if needed.
Exercise 10.5 — Regex building blocks
Without running any code, describe in plain English what each of these regex patterns matches:
1. `\d{3}-\d{2}-\d{4}`
2. `^[A-Z]{2}$`
3. `\b[Dd]r\.?\s`
4. `\$\d+\.\d{2}`
5. `(\w+)@(\w+)\.(\w+)`
Guidance
1. Three digits, a hyphen, two digits, a hyphen, four digits (looks like a US Social Security Number format).
2. A string that is exactly two uppercase letters (the `^` and `$` anchors enforce this).
3. The word "Dr" or "dr" (word boundary ensures it's the start of a word), optionally followed by a period, followed by a space. Matches "Dr. " or "Dr " or "dr " etc.
4. A dollar sign, followed by one or more digits, a decimal point, and exactly two digits (a price like $12.99).
5. Three capture groups: one or more word characters, then @, then one or more word characters, then a dot, then one or more word characters. A simplified email pattern. Group 1 is the username, group 2 is the domain name, group 3 is the TLD.
Part B: Applied Problems (2-star)
These problems require you to write code using the techniques from Chapter 10.
Exercise 10.6 — Cleaning company names
You have the following Series of company names. Write a pandas pipeline to standardize them so that equivalent names become identical.
companies = pd.Series([
" Apple Inc. ", "apple", "APPLE INC",
"Microsoft Corp.", "microsoft corporation",
"MICROSOFT", "Google LLC", "GOOGLE",
"google llc", " Amazon.com, Inc. ",
"amazon", "AMAZON.COM"
])
Your cleaned result should have no more than 4 unique values. Show the value_counts() of your result.
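One possible pipeline, shown as a sketch (the suffix list here is illustrative; real datasets usually need a longer one):

```python
import pandas as pd

companies = pd.Series([
    " Apple Inc. ", "apple", "APPLE INC",
    "Microsoft Corp.", "microsoft corporation",
    "MICROSOFT", "Google LLC", "GOOGLE",
    "google llc", " Amazon.com, Inc. ",
    "amazon", "AMAZON.COM"
])

cleaned = (
    companies
    .str.strip()
    .str.lower()
    # drop ".com" before general punctuation removal, so it stays a whole token
    .str.replace(r"\.com\b", "", regex=True)
    # remove common legal suffixes (illustrative list), with an optional trailing period
    .str.replace(r"\b(?:inc|corp(?:oration)?|llc)\b\.?", "", regex=True)
    # remove leftover punctuation, then strip trailing spaces left by suffix removal
    .str.replace(r"[.,]", "", regex=True)
    .str.strip()
)
print(cleaned.value_counts())
```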
Guidance
Start with `.str.strip().str.lower()`. Then remove common suffixes like "inc.", "inc", "corp.", "corporation", "llc", ".com" using either `str.replace()` or a mapping. Finally, strip again to remove any trailing spaces left after suffix removal. The key insight is that standardization often requires multiple passes.
Exercise 10.7 — Extracting numbers from descriptions
Given this Series of product descriptions, extract the numeric value and the unit into separate columns:
products = pd.Series([
"500mg tablet",
"2.5mL oral solution",
"100mcg nasal spray",
"10 mg capsule",
"0.5 mL injection",
"250mg/5mL suspension"
])
Use str.extract() with named capture groups. Your result should have columns amount and unit.
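A sketch of the extraction (the unit alternation `mg|mcg|mL` is an assumption based on the six entries above; extend it for other units):

```python
import pandas as pd

products = pd.Series([
    "500mg tablet",
    "2.5mL oral solution",
    "100mcg nasal spray",
    "10 mg capsule",
    "0.5 mL injection",
    "250mg/5mL suspension"
])

# Named capture groups become the column names of the extracted DataFrame.
# Note: str.extract() returns only the FIRST match, so "250mg/5mL" yields 250 mg.
result = products.str.extract(r"(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>mg|mcg|mL)")
print(result)
```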
Guidance
The pattern needs to match a number (with an optional decimal part) followed by optional whitespace and a unit. Something like `r"(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>[a-zA-Z]+)"` handles the simple cases; note that `str.extract()` returns only the first match, so "250mg/5mL suspension" yields just "250" and "mg".
Exercise 10.8 — Filtering with .str.contains()
Create a DataFrame with 10 rows of fictional patient notes (as strings). Write code to:
- Find all notes mentioning fever (case-insensitive)
- Find all notes mentioning either "headache" or "migraine"
- Find all notes that do NOT mention any pain-related terms
- Count how many notes mention a specific medication name of your choice
Handle missing values appropriately in each case.
Guidance
Use `str.contains()` with `case=False` and `na=False`. For task 2, use alternation: `str.contains(r"headache|migraine", case=False, na=False)`. For task 3, use the `~` operator to negate: `~df["notes"].str.contains(r"pain|ache|sore", case=False, na=False)`. For task 4, use `.sum()` on the boolean result.
Exercise 10.9 — Splitting compound columns
A dataset has an address column with entries like:
addresses = pd.Series([
"123 Main St, Springfield, IL 62704",
"456 Oak Ave, Portland, OR 97201",
"789 Pine Rd, Austin, TX 78701",
"1010 Elm Blvd, Denver, CO 80201",
"555 Cedar Ln, Miami, FL 33101"
])
Split this column into four separate columns: street, city, state, and zip. You may use a combination of str.split() and str.extract().
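The single-regex route can be sketched with named groups, which become the column names directly:

```python
import pandas as pd

addresses = pd.Series([
    "123 Main St, Springfield, IL 62704",
    "456 Oak Ave, Portland, OR 97201",
    "789 Pine Rd, Austin, TX 78701",
    "1010 Elm Blvd, Denver, CO 80201",
    "555 Cedar Ln, Miami, FL 33101"
])

# street and city are "anything up to a comma"; state is two capital letters;
# zip is five digits anchored at the end of the string.
parts = addresses.str.extract(
    r"^(?P<street>.+),\s*(?P<city>.+),\s*(?P<state>[A-Z]{2})\s*(?P<zip>\d{5})$"
)
print(parts)
```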
Guidance
First split on `", "` with `expand=True` to get street, city, and state+zip. Then split the state+zip column on space. Alternatively, use a single regex: `r"^(.+),\s*(.+),\s*([A-Z]{2})\s*(\d{5})$"` with `str.extract()`.
Exercise 10.10 — Text replacement pipeline
You have a column of survey responses about vaccination experiences:
responses = pd.Series([
"Got my 1st dose at CVS pharmacy",
"Received 2nd shot at walgreens",
"3rd booster dose at Rite-Aid",
"first vaccine at local clinic",
"second jab at the hospital",
"Got vaccinated at walmart pharmacy"
])
Write a normalization pipeline that:
1. Converts to lowercase
2. Replaces "1st"/"first" with "dose_1", "2nd"/"second" with "dose_2", "3rd"/"third" with "dose_3"
3. Replaces "shot"/"jab"/"vaccine"/"vaccinated" with "vaccination"
4. Extracts the location (the text after "at")
Guidance
Chain `.str.lower()` with multiple `.str.replace()` calls. For the ordinal standardization, use regex: `.str.replace(r"\b(1st|first)\b", "dose_1", regex=True)`. For extracting the location, use `.str.extract(r"at\s+(.+)$")`.
Exercise 10.11 — Regex capture groups
Write regex patterns with capture groups to extract the following from each string. Test each pattern with re.search() or re.findall().
- Extract the year, month, and day from "2023-03-15" (three groups)
- Extract the first name and last name from "Smith, John" (two groups)
- Extract the area code and number from "(555) 123-4567" (two groups)
- Extract the protocol and domain from "https://www.example.com" (two groups)
- Extract the amount and currency from "$1,234.56 USD" (two groups)
Guidance
1. `r"(\d{4})-(\d{2})-(\d{2})"` — three groups of digits separated by hyphens.
2. `r"(\w+),\s*(\w+)"` — word characters before the comma (last name) and after (first name).
3. `r"\((\d{3})\)\s*(\d{3}-\d{4})"` — digits inside parentheses (area code), then remaining digits with hyphen.
4. `r"(https?)://(?:www\.)?(.+)"` — "http" or "https" in group 1, domain in group 2.
5. `r"\$([\d,]+\.\d{2})\s+(\w+)"` — digits/commas/decimal in group 1, currency code in group 2.
Exercise 10.12 — Greedy vs. lazy
Given the string:
html = '<div class="title">Hello World</div><div class="body">Content here</div>'
- Write a greedy regex that captures everything between the first `<div` and the last `</div>`.
- Write a lazy regex that captures just the content between the first pair of `>` and `<`.
- Explain in your own words when greedy matching is useful and when lazy matching is necessary.
Guidance
1. Greedy: `r"<div.*</div>"` matches from the first `<div` to the last `</div>`, because `.*` grabs as much of the string as possible.
2. Lazy: `r">(.*?)<"` stops at the first `<`, capturing just "Hello World".
3. Greedy matching is useful when you want the largest possible span between outer delimiters. Lazy matching is necessary when delimiters repeat, since a greedy match would overshoot past the first closing delimiter.
Exercise 10.13 — Building a text normalization function
Write a function called normalize_text that takes a pandas Series of strings and:
- Strips whitespace
- Converts to lowercase
- Removes all punctuation except hyphens and apostrophes
- Collapses multiple spaces into one
- Strips again (in case removing punctuation left edge spaces)
Test it on a Series of 5 messy strings you create. Show the before and after.
Guidance
def normalize_text(series):
return (series
.str.strip()
.str.lower()
.str.replace(r"[^\w\s'-]", "", regex=True)
.str.replace(r"\s+", " ", regex=True)
.str.strip())
Be thoughtful about which punctuation to keep. Hyphens are important in names like "Pfizer-BioNTech" and apostrophes matter in "Côte d'Ivoire." Test edge cases: What happens with `None`? With an empty string? With a string that's all punctuation?
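To answer those edge-case questions concretely, a quick check (the function is repeated here so the snippet runs on its own; `None` propagates as `NaN` through the `.str` methods, and an all-punctuation string collapses to an empty string):

```python
import pandas as pd

def normalize_text(series):
    return (series
            .str.strip()
            .str.lower()
            .str.replace(r"[^\w\s'-]", "", regex=True)  # keep hyphens and apostrophes
            .str.replace(r"\s+", " ", regex=True)       # collapse runs of whitespace
            .str.strip())

messy = pd.Series(["  Hello,   WORLD!! ", "Pfizer-BioNTech", "Côte d'Ivoire", None, "!!!"])
clean = normalize_text(messy)
print(clean)
```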
Part C: Real-World Scenarios (3-star)
These problems simulate real data challenges you might encounter professionally.
Exercise 10.14 — Cleaning medication names
A hospital database has these medication entries:
meds = pd.Series([
"Metformin HCl 500mg",
"metformin 500 mg tablet",
"METFORMIN HYDROCHLORIDE 500MG",
"Lisinopril 10mg",
"lisinopril 10 mg tablet",
"LISINOPRIL 10MG TAB",
"Atorvastatin 20mg",
"atorvastatin calcium 20 mg",
"ATORVASTATIN 20MG TABLET",
"Omeprazole 20mg capsule",
"omeprazole 20 mg cap",
"OMEPRAZOLE DR 20MG"
])
Write code to extract: (a) the drug name (just the generic name, no salts or formulations), (b) the dosage amount, and (c) the dosage unit. Create a clean DataFrame with columns drug, dose_amount, and dose_unit.
Guidance
Start by lowercasing. Extract the dose with a regex like `r"(\d+)\s*(mg|mcg|ml)"`. For the drug name, you might take the first word, or use a list of known drug names to match against. The trickiest part is handling "hydrochloride," "hcl," "calcium," "dr" (delayed-release), and "tab"/"tablet"/"cap"/"capsule" — these are all extra information that should be stripped. A pragmatic approach: after lowercasing, take the first word as the drug name, then extract the dose pattern.
Exercise 10.15 — Parsing semi-structured log entries
A web server log has entries like:
logs = pd.Series([
'192.168.1.1 - - [15/Mar/2023:09:30:45] "GET /api/data HTTP/1.1" 200 1234',
'10.0.0.5 - - [15/Mar/2023:09:31:02] "POST /api/submit HTTP/1.1" 201 567',
'192.168.1.1 - - [15/Mar/2023:09:31:15] "GET /api/users HTTP/1.1" 404 89',
'172.16.0.10 - - [15/Mar/2023:09:32:00] "GET /index.html HTTP/1.1" 200 4567',
'10.0.0.5 - - [15/Mar/2023:09:32:45] "DELETE /api/item/42 HTTP/1.1" 403 0'
])
Use str.extract() with named capture groups to parse each entry into columns: ip, datetime, method, path, status_code, and response_size.
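One way to assemble such a pattern, built in pieces for readability (shown on the first two log lines; this exact pattern is one possible sketch, not the only correct answer):

```python
import pandas as pd

logs = pd.Series([
    '192.168.1.1 - - [15/Mar/2023:09:30:45] "GET /api/data HTTP/1.1" 200 1234',
    '10.0.0.5 - - [15/Mar/2023:09:31:02] "POST /api/submit HTTP/1.1" 201 567',
])

pattern = (
    r'(?P<ip>[\d.]+) - - '        # client IP address
    r'\[(?P<datetime>[^\]]+)\] '  # everything between the square brackets
    r'"(?P<method>[A-Z]+) '       # HTTP method in uppercase
    r'(?P<path>\S+) [^"]+" '      # request path; protocol version is discarded
    r'(?P<status_code>\d{3}) '    # 3-digit status code
    r'(?P<response_size>\d+)'     # response size in bytes
)
parsed = logs.str.extract(pattern)
print(parsed)
```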
Guidance
The pattern should capture: an IP address (`[\d.]+`), the date in brackets (`\[([^\]]+)\]`), the HTTP method (`[A-Z]+`), the path (`/\S+`), the status code (`\d{3}`), and the response size (`\d+`). A complete pattern might look like: `r'(?P<ip>[\d.]+) - - \[(?P<datetime>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+) [^"]+" (?P<status_code>\d{3}) (?P<response_size>\d+)'`.
Exercise 10.16 — Survey response categorization
You have free-text survey responses to the question "Why haven't you been vaccinated?":
reasons = pd.Series([
"I don't trust the vaccine safety",
"Can't afford to miss work for side effects",
"No transportation to vaccination site",
"My doctor advised against it due to allergies",
"I don't believe COVID is dangerous",
"The nearest clinic is 50 miles away",
"I'm worried about long-term effects",
"Religious beliefs prevent me",
"I already had COVID so I have natural immunity",
"Too busy with work to schedule an appointment",
"I'm afraid of needles",
"Don't trust the government",
"Vaccine was not available in my area",
"I heard it causes infertility",
"My insurance doesn't cover it"
])
Using str.contains() with regex, categorize each response into one or more of these categories: "trust/safety concerns", "access/logistics", "medical reasons", "misinformation", "other". A response can fall into multiple categories. Create a DataFrame showing the response and its categories.
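A sketch of the flag-column approach, shown on a few of the responses (the patterns are illustrative starting points, not definitive category definitions):

```python
import pandas as pd

reasons = pd.Series([
    "I don't trust the vaccine safety",
    "No transportation to vaccination site",
    "My doctor advised against it due to allergies",
    "I heard it causes infertility",
])

# Illustrative category patterns; tune them against the full response set.
categories = {
    "trust_safety": r"trust|safety|afraid|worried|long-term|government|scared",
    "access_logistics": r"transport|miles|away|busy|schedule|available|afford|insurance",
    "medical": r"doctor|allerg|medical",
    "misinformation": r"infertil|dangerous|natural immunity|don't believe",
}

flags = pd.DataFrame({"response": reasons})
for name, pattern in categories.items():
    flags[name] = reasons.str.contains(pattern, case=False, na=False)

# Anything that matched no category falls into "other"
flags["other"] = ~flags[list(categories)].any(axis=1)
print(flags)
```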
Guidance
Define regex patterns for each category:
- trust/safety: `r"trust|safety|afraid|worried|long-term|government|scared"`
- access/logistics: `r"transport|miles|away|busy|work|schedule|available|afford|insurance"`
- medical: `r"doctor|allerg|medical|health condition"`
- misinformation: `r"infertil|dangerous|natural immunity|don't believe"`
Create a boolean column for each category using `str.contains()`. Some responses may match multiple categories (e.g., "can't afford" is both access and financial). Others may match none — those get "other."
Exercise 10.17 — Address parsing and standardization
Write a comprehensive address standardization function that handles these variations:
addresses = pd.Series([
"123 Main Street, Apt 4B, Springfield, IL 62704",
"456 oak ave., suite 200, portland, or 97201",
"789 PINE RD, AUSTIN TX 78701",
"1010 elm boulevard, #302, denver, co 80201",
"555 Cedar Lane, Miami FL 33101"
])
Your function should:
1. Standardize street suffixes ("Street" -> "St", "Avenue" -> "Ave", "Road" -> "Rd", "Boulevard" -> "Blvd", "Lane" -> "Ln")
2. Standardize apartment/suite notation ("Apt", "Suite", "#" -> "Unit")
3. Extract the ZIP code
4. Capitalize properly (title case for street/city, uppercase for state)
Guidance
This is a multi-step problem. First, clean and standardize case. Then use regex replacements for street suffixes: `.str.replace(r"\b(street|st\.?)\b", "St", case=False, regex=True)`. Extract the ZIP code with `str.extract(r"(\d{5})")`. For the state, extract the two-letter code before the ZIP: `str.extract(r"\b([A-Z]{2})\s*\d{5}")` (after uppercasing). This is a realistic problem — address standardization is one of the most common text data challenges in practice.
Exercise 10.18 — Finding patterns in social media text
Create a Series of 10 fictional social media posts about health topics. Then write code to:
- Extract all hashtags (words starting with #)
- Extract all mentions (words starting with @)
- Find all URLs (simplified: anything starting with http or https)
- Count the number of emoji-like patterns (sequences of punctuation like ":)", ":(", "<3")
- Remove all hashtags, mentions, and URLs to get just the "clean" text
Guidance
Hashtags: `str.findall(r"#\w+")`. Mentions: `str.findall(r"@\w+")`. URLs: `str.findall(r"https?://\S+")`. For cleaning, use `str.replace()` to remove each pattern type. Remember that `str.findall()` returns a list for each row — to get a flat count, use `str.findall(...).str.len()`.
Part D: Synthesis Problems (3-star / 4-star)
These problems require combining multiple concepts or thinking beyond the chapter material.
Exercise 10.19 — Regex crossword (3-star)
For each pair of regex patterns, find a string that matches BOTH patterns:
1. `r"^\d{3}"` AND `r"\d{3}$"` (but the string is longer than 3 characters)
2. `r"[aeiou]{2}"` AND `r"^[^aeiou]"`
3. `r"\bdata\b"` AND `r"science$"`
4. `r"^[A-Z]"` AND `r"[a-z]$"` AND `r"\d"`
Guidance
1. "123-456" — starts with 3 digits and ends with 3 digits, with other characters in between.
2. "beautiful" — starts with a consonant ('b') and contains two consecutive vowels ('ea').
3. "data science" — contains "data" as a whole word and ends with "science".
4. "Room42a" — starts with uppercase 'R', ends with lowercase 'a', contains a digit '4'.
Exercise 10.20 — Building a data validation function (3-star)
Write a function validate_column(series, rules) where rules is a dictionary mapping rule names to regex patterns. The function should return a DataFrame showing, for each value, which rules pass and which fail.
Test it with a column of email addresses and rules for:
- "has_at_sign": must contain @
- "has_domain": must have text after @
- "valid_tld": must end with .com, .org, .edu, or .net
- "no_spaces": must not contain any spaces
Guidance
def validate_column(series, rules):
results = pd.DataFrame({"value": series})
for rule_name, pattern in rules.items():
results[rule_name] = series.str.contains(pattern, na=False)
return results
For the "no_spaces" rule, use a negated approach: `results["no_spaces"] = ~series.str.contains(r"\s", na=True)`. This exercise combines text searching with DataFrame construction.
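A usage sketch with the email rules (the sample addresses and the TLD pattern are illustrative; the function is repeated so the snippet runs on its own):

```python
import pandas as pd

def validate_column(series, rules):
    results = pd.DataFrame({"value": series})
    for rule_name, pattern in rules.items():
        results[rule_name] = series.str.contains(pattern, na=False)
    return results

emails = pd.Series(["alice@example.com", "bob@site", "no at sign", "c d@x.org"])
rules = {
    "has_at_sign": r"@",
    "has_domain": r"@\S+",                  # something non-space after the @
    "valid_tld": r"\.(?:com|org|edu|net)$", # non-capturing group avoids a pandas warning
}
report = validate_column(emails, rules)
# "no_spaces" needs negation, so it's added separately:
report["no_spaces"] = ~emails.str.contains(r"\s", na=True)
print(report)
```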
Exercise 10.21 — Reverse engineering regex (4-star)
For each of these regex patterns, provide three strings that MATCH and three that DON'T. Then explain in plain English what the pattern is designed to validate.
1. `r"^[A-Z]\d[A-Z]\s\d[A-Z]\d$"`
2. `r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$"`
3. `r"^(?:ISBN[- ]?)?(\d{9}[\dX])$"`
Guidance
1. Canadian postal codes (e.g., "K1A 0B1", "M5V 2T6"). Matches: "A1A 1A1", "K1A 0B1", "M5V 2T6". Doesn't match: "12345", "a1a 1a1", "K1A0B1" (no space).
2. IPv4 addresses (e.g., "192.168.1.1"). Matches: "192.168.1.1", "10.0.0.1", "255.255.255.0". Note that "999.999.999.999" also matches; the pattern checks syntax only, not the 0-255 range. Doesn't match: "192.168.1" (only 3 octets), "abc.def.ghi.jkl", "1.2.3.4.5" (5 octets).
3. ISBN-10 numbers. The `(?:ISBN[- ]?)?` is an optional non-capturing group for the "ISBN" prefix. Matches: "0123456789", "ISBN 0123456789", "012345678X". Doesn't match: "12345" (too short), "ISBN ABCDEFGHIJ" (not digits), "01234567890" (11 digits).
Exercise 10.22 — Complete text cleaning pipeline (4-star)
You are given a messy CSV file with 200 rows of product listings from an online marketplace. Each row has a title column with entries like:
" BRAND NEW!! Apple iPhone 14 Pro Max - 256GB - Space Black (Unlocked)!! "
"samsung Galaxy S23 ultra 512 gb phantom black SEALED"
"Google Pixel 7 Pro, 128gb, Obsidian, Factory Unlocked, NIB"
Design and implement a complete text cleaning pipeline that extracts:
- brand (Apple, Samsung, Google, etc.)
- model (iPhone 14 Pro Max, Galaxy S23 Ultra, etc.)
- storage (256GB, 512GB, etc. — standardized format)
- color (standardized)
- condition_tags (list of tags like "unlocked", "sealed", "new")
Write the pipeline as a set of functions, each handling one extraction task. Document your regex patterns with comments. Create at least 10 test entries and show the cleaned output.
Guidance
This is a substantial project. Break it into steps: (1) normalize text (strip, lowercase, collapse spaces), (2) extract storage with `r"(\d+)\s*gb"`, (3) identify brand with a dictionary of known brands, (4) extract condition tags with alternation, (5) for the model, this is the hardest part — you may need brand-specific patterns. The key learning: real text extraction is iterative. You'll write a pattern, test it, find edge cases, and refine. Document which cases your pipeline handles and which it doesn't.
Part M: Mixed Review (2-star)
These problems revisit concepts from earlier chapters, mixed with Chapter 10 material.
Exercise 10.23 — Combining text cleaning with data types (Ch. 3, 8, 10)
A column contains values like "$1,234.56", "$89.00", " $12,345 ", "N/A", and "$0.00". Write code that:
- Strips whitespace (Ch. 10)
- Identifies which values are "N/A" and replaces them with NaN (Ch. 8)
- Removes the dollar sign and commas (Ch. 10)
- Converts to float (Ch. 3/8)
Chain these operations and show the resulting Series with its dtype.
Guidance
s.str.strip().replace("N/A", np.nan).str.replace(r"[\$,]", "", regex=True).astype(float)
The order matters: you must handle "N/A" before trying to convert to float, and you must remove `$` and `,` before conversion. This exercise connects text cleaning (Ch. 10) with type coercion (Ch. 8) and data types (Ch. 3).
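Put together as a runnable sketch (`np.nan` rather than `pd.NA` keeps the final `.astype(float)` safe):

```python
import numpy as np
import pandas as pd

s = pd.Series(["$1,234.56", "$89.00", "  $12,345 ", "N/A", "$0.00"])

cleaned = (
    s.str.strip()                            # Ch. 10: whitespace
     .replace("N/A", np.nan)                 # Ch. 8: mark explicit missing values
     .str.replace(r"[\$,]", "", regex=True)  # Ch. 10: drop $ and thousands commas
     .astype(float)                          # Ch. 3/8: coerce to numeric dtype
)
print(cleaned)
print(cleaned.dtype)
```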
Exercise 10.24 — Text cleaning meets groupby (Ch. 9, 10)
Create a DataFrame with a messy region column containing values like "North America", "north america", "N. America", "NORTH AMERICA", "South America", "south america", "S. America", "Europe", "EUROPE", "europe", and a sales column with random values. First clean the region column, then use groupby (Ch. 9) to compute the total sales per region. Show how the results differ before and after text cleaning.
Guidance
Before cleaning, groupby will produce many small groups ("North America" and "north america" are separate). After cleaning with `.str.strip().str.lower()` and mapping abbreviations, you'll have three clean groups with properly aggregated totals. The lesson: text cleaning isn't just cosmetic — it directly affects the correctness of your analysis.
Exercise 10.25 — Regex and missing data (Ch. 8, 10)
You have a DataFrame where some entries in a phone column are valid phone numbers and others are placeholder text:
phones = pd.Series([
"555-123-4567", "N/A", "000-000-0000",
"(555) 987-6543", "not provided",
"555.111.2222", None, "TBD",
"555-000-1111", "unknown"
])
Write code that:
1. Flags which entries are valid phone numbers (match a phone number pattern)
2. Flags which are explicit "missing" indicators ("N/A", "not provided", "TBD", "unknown")
3. Flags which are "suspicious" (all-zero patterns like "000-000-0000")
4. Creates a status column with values "valid", "missing", "suspicious", or "null"
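The four steps can be sketched as follows (the phone pattern and the "suspicious" rule are illustrative assumptions, not the only reasonable choices):

```python
import numpy as np
import pandas as pd

phones = pd.Series([
    "555-123-4567", "N/A", "000-000-0000",
    "(555) 987-6543", "not provided",
    "555.111.2222", None, "TBD",
    "555-000-1111", "unknown"
])

# 1. Valid: 3-3-4 digits with -, ., or space separators, optional (area code)
valid = phones.str.contains(r"^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}$", na=False)
# 2. Explicit missing-value markers
missing = phones.str.lower().isin(["n/a", "not provided", "tbd", "unknown"])
# 3. Suspicious: nothing but zeros and separators (would otherwise look valid)
suspicious = phones.str.contains(r"^[0()\-.\s]+$", na=False)
# 4. Combine; suspicious outranks valid, and true nulls fall through to "null"
status = np.select(
    [suspicious, valid, missing],
    ["suspicious", "valid", "missing"],
    default="null",
)
print(pd.DataFrame({"phone": phones, "status": status}))
```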