Chapter 10 Quiz: Working with Text Data

Instructions: This quiz tests your understanding of Chapter 10. Answer all questions before checking the solutions. For multiple choice, select the best answer — some options may be partially correct. For short answer questions, aim for 2-4 clear sentences. Total points: 100.


Section 1: Multiple Choice (8 questions, 5 points each)


Question 1. What does the .str accessor in pandas allow you to do?

  • (A) Convert a DataFrame column to string type
  • (B) Apply string methods to every element in a Series at once, handling NaN gracefully
  • (C) Access individual characters in a string by position
  • (D) Create regular expression patterns from strings
Answer **Correct: (B)**

  • **(A)** describes `astype(str)`, not the `.str` accessor. The accessor doesn't convert types — it provides access to string methods.
  • **(B)** is correct. The `.str` accessor is a gateway to vectorized string operations. It applies the method to each element and returns `NaN` for missing values instead of raising errors.
  • **(C)** describes Python's string indexing (`"hello"[0]`). While `.str[0]` does work on a Series, that's just one of many capabilities.
  • **(D)** is not what `.str` does. Regex patterns are created as raw strings.
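A runnable illustration of the distinction between (A) and (B), using a small invented Series:

```python
import pandas as pd

s = pd.Series(["Alice", None, "Bob"])

# .str applies the string method element-wise; missing values stay missing
upper = s.str.upper()
print(upper.tolist())    # ['ALICE', nan, 'BOB']

# astype(str) converts types instead -- None becomes the literal string "None"
as_text = s.astype(str)
print(as_text.tolist())  # ['Alice', 'None', 'Bob']
```

Note that `astype(str)` silently turns missing values into the string `"None"`, which is rarely what you want.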

Question 2. Which of the following is the correct way to find all rows in a DataFrame column df["name"] that contain the word "Smith" (case-insensitive), handling missing values safely?

  • (A) df["name"].contains("Smith")
  • (B) df["name"].str.contains("Smith", case=False, na=False)
  • (C) df["name"].str.find("Smith", ignore_case=True)
  • (D) df[df["name"] == "Smith"]
Answer **Correct: (B)**

  • **(A)** is missing the `.str` accessor — `contains` is not a method on a pandas Series directly.
  • **(B)** correctly uses `.str.contains()` with `case=False` for case-insensitive matching and `na=False` to treat NaN values as non-matches (returning `False` instead of `NaN`).
  • **(C)** `.str.find()` returns position indices, not boolean values, and doesn't have an `ignore_case` parameter.
  • **(D)** checks for exact equality with "Smith," not containment. "John Smith" would not match.
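A quick demonstration of option (B) on invented sample data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["John Smith", "jane SMITH", "Bob Jones", None]})

# case=False matches any capitalization; na=False treats the None row
# as a non-match instead of returning NaN
mask = df["name"].str.contains("Smith", case=False, na=False)
print(df[mask]["name"].tolist())  # ['John Smith', 'jane SMITH']
```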

Question 3. In regular expressions, what does the pattern \d+ match?

  • (A) Exactly one digit
  • (B) One or more digits
  • (C) Zero or more digits
  • (D) The literal characters \d+
Answer **Correct: (B)**

  • **(A)** describes `\d` without the quantifier — a single digit.
  • **(B)** is correct. `\d` matches a single digit (0-9), and `+` means "one or more of the preceding." Together, `\d+` matches any sequence of one or more consecutive digits: "5", "42", "12345".
  • **(C)** describes `\d*` — the `*` quantifier means "zero or more."
  • **(D)** would require escaping the metacharacters — e.g. `re.escape(r"\d+")` produces a pattern that matches the literal text `\d+`.
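The difference between `\d` and `\d+` is easy to see with `re.findall`:

```python
import re

text = "Order 42 shipped on day 7"
print(re.findall(r"\d", text))   # ['4', '2', '7'] -- each digit separately
print(re.findall(r"\d+", text))  # ['42', '7']     -- whole runs of digits
```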

Question 4. What is a capture group in a regular expression?

  • (A) A way to name your regex pattern for reuse
  • (B) A portion of the pattern enclosed in parentheses that extracts specific parts of a match
  • (C) A character class that captures all possible characters
  • (D) A technique for matching text that spans multiple lines
Answer **Correct: (B)**

  • **(A)** describes `re.compile()`, not capture groups.
  • **(B)** is correct. Parentheses `()` in a regex define a capture group. When the pattern matches, you can extract just the content that matched inside the parentheses. In pandas, `.str.extract()` turns each capture group into a separate column.
  • **(C)** describes character classes like `[A-Z]`, not capture groups.
  • **(D)** describes multiline mode (`re.MULTILINE`), not capture groups.
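A small sketch (invented date strings) showing how `.str.extract()` turns each capture group into a column:

```python
import pandas as pd

dates = pd.Series(["2023-01-15", "2024-12-31"])

# Three capture groups -> a DataFrame with three columns (year, month, day)
parts = dates.str.extract(r"(\d{4})-(\d{2})-(\d{2})")
print(parts.iloc[0].tolist())  # ['2023', '01', '15']
```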

Question 5. Given s = pd.Series(["Hello World", None, "Foo Bar"]), what does s.str.lower() return?

  • (A) ["hello world", None, "foo bar"] — NaN values are skipped
  • (B) An error because one value is None
  • (C) ["hello world", "none", "foo bar"] — None is converted to the string "none"
  • (D) ["hello world", "", "foo bar"] — None is converted to an empty string
Answer **Correct: (A)**

  • **(A)** is correct. The `.str` accessor gracefully handles missing values by propagating `NaN` (displayed as `None` or `NaN`) without raising errors. This is one of its key advantages over writing a manual loop.
  • **(B)** would happen if you tried to call `.lower()` directly on `None` in regular Python, but the `.str` accessor protects against this.
  • **(C)** and **(D)** are incorrect — `None` is not converted to any string; it remains as a missing value.
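Running the exact scenario from the question confirms this:

```python
import pandas as pd

s = pd.Series(["Hello World", None, "Foo Bar"])
lowered = s.str.lower()

print(lowered[0])           # hello world
print(pd.isna(lowered[1]))  # True -- None propagates as missing, no error
print(lowered[2])           # foo bar
```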

Question 6. What is the difference between re.findall(r"<.+>", text) and re.findall(r"<.+?>", text) when applied to "<b>hello</b>"?

  • (A) They produce identical results
  • (B) The first (greedy) matches "<b>hello</b>" as one match; the second (lazy) matches "<b>" and "</b>" separately
  • (C) The first matches only "<b>"; the second matches "<b>hello</b>"
  • (D) The first raises an error; the second works correctly
Answer **Correct: (B)**

  • **(A)** is incorrect — greedy and lazy matching produce different results when there are multiple possible endpoints.
  • **(B)** is correct. The greedy `.+` matches as much as possible: from the first `<` to the last `>`, swallowing everything in between. The lazy `.+?` matches as little as possible: from `<` to the nearest `>`, producing two separate matches.
  • **(C)** is backwards.
  • **(D)** Both are valid patterns.
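You can verify both behaviors directly:

```python
import re

text = "<b>hello</b>"
print(re.findall(r"<.+>", text))   # ['<b>hello</b>']  -- greedy
print(re.findall(r"<.+?>", text))  # ['<b>', '</b>']   -- lazy
```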

Question 7. Which of the following regex patterns correctly matches a US ZIP code that may or may not have the +4 extension (e.g., "90210" or "90210-1234")?

  • (A) r"\d{5}-\d{4}"
  • (B) r"\d{5}(-\d{4})?"
  • (C) r"\d{5}-?\d{4}?"
  • (D) r"\d{5,9}"
Answer **Correct: (B)**

  • **(A)** only matches ZIP+4 format (requires the hyphen and four extra digits). "90210" alone would not match.
  • **(B)** is correct. `\d{5}` matches the base ZIP, and `(-\d{4})?` makes the hyphen-plus-four-digits group optional (the `?` after the group means zero or one occurrences). Both "90210" and "90210-1234" match.
  • **(C)** doesn't make the extension optional: `{4}?` is the *lazy* form of `{4}` and still requires exactly four trailing digits. "90210" alone would not match, while a malformed "902101234" (no hyphen) would.
  • **(D)** matches any sequence of 5 to 9 digits, which would match "123456789" — clearly not a ZIP code format.
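A quick check of pattern (B) with `re.fullmatch`, which requires the whole string to match:

```python
import re

pattern = r"\d{5}(-\d{4})?"
for candidate in ["90210", "90210-1234", "90210-123"]:
    print(candidate, bool(re.fullmatch(pattern, candidate)))
# 90210 True
# 90210-1234 True
# 90210-123 False
```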

Question 8. When should you prefer simple string methods over regular expressions?

  • (A) Always — regex is never the right choice for data cleaning
  • (B) When the operation involves a fixed, known substring rather than a variable pattern
  • (C) When working with large datasets — regex is too slow for big data
  • (D) When the data has missing values — regex can't handle NaN
Answer **Correct: (B)**

  • **(A)** is too extreme. Regex is the right tool when you need pattern matching (extracting dates, validating formats, matching alternatives).
  • **(B)** is correct. If you're replacing "USA" with "United States," you don't need regex — `str.replace("USA", "United States", regex=False)` is simpler and clearer. Save regex for when you need to describe a *pattern* rather than a *literal string*.
  • **(C)** is misleading. Regex can be slower than literal string operations, but the speed difference is rarely significant in pandas workflows. Clarity is the primary reason to prefer string methods.
  • **(D)** is incorrect. Both `.str.replace()` and `.str.contains()` handle NaN gracefully regardless of whether regex is used.

Section 2: True/False (3 questions, 5 points each)


Question 9. True or False: The regex pattern \b matches a literal backspace character.

Answer **False.** In a regular expression (when using a raw string `r"\b"`), `\b` matches a **word boundary** — the position between a word character (`\w`) and a non-word character. It matches a *position*, not a character. This is why `r"\bcat\b"` matches "cat" as a whole word but not "catfish" or "concatenate." Note: outside of regex, `\b` does mean backspace in regular Python strings, which is another reason to always use raw strings for regex patterns.
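Both claims in the answer are easy to verify: `\b` as a regex word boundary, and `\b` as a backspace in a plain (non-raw) Python string:

```python
import re

text = "cat catfish concatenate"
print(re.findall(r"\bcat\b", text))  # ['cat'] -- whole word only

# In a plain string, \b really is one backspace character;
# in a raw string it is two characters: backslash + 'b'
print(len("\b"), len(r"\b"))  # 1 2
```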

Question 10. True or False: df["col"].str.replace(".", "", regex=False) and df["col"].str.replace(".", "") behave identically.

Answer **False.** When `"."` is interpreted as a regex pattern, `.` matches *any character* — so `str.replace(".", "")` removes ALL characters, not just literal periods. The `regex=False` parameter ensures the dot is treated as a literal character. Whether the two calls differ depends on the pandas version: before pandas 2.0 the default was `regex=True`, so the calls behaved differently; starting with pandas 2.0 the default is `regex=False`. To be safe and explicit, always specify `regex=True` or `regex=False`.
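Because the default changed across pandas versions, passing `regex=` explicitly makes the behavior unambiguous everywhere:

```python
import pandas as pd

s = pd.Series(["3.14", "a.b.c"])

# Literal dot: only the periods are removed
print(s.str.replace(".", "", regex=False).tolist())  # ['314', 'abc']

# Regex dot: "." matches every character, so everything is removed
print(s.str.replace(".", "", regex=True).tolist())   # ['', '']
```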

Question 11. True or False: Regular expressions can effectively parse and validate arbitrarily nested structures like HTML or JSON.

Answer **False.** Regular expressions cannot handle arbitrarily nested structures because they lack the ability to count matching pairs of delimiters (like nested HTML tags or nested `{}` braces). This is a fundamental limitation of regular languages in computer science theory. For parsing HTML, use a parser like BeautifulSoup. For JSON, use `json.loads()`. Regex can extract simple patterns from HTML/JSON, but it cannot validate the overall structure.

Section 3: Short Answer (4 questions, 5 points each)


Question 12. Explain the difference between .str.contains(), .str.match(), and .str.extract() in pandas. When would you use each?

Answer

  • **`.str.contains(pattern)`** returns a Boolean Series: `True` if the pattern appears *anywhere* in the string, `False` otherwise. Use it for filtering rows (e.g., "find all entries mentioning 'Pfizer'").
  • **`.str.match(pattern)`** returns a Boolean Series: `True` if the pattern matches *from the beginning* of the string. It's like `str.contains` with an implicit `^` anchor. Use it for validating formats (e.g., "does this look like a valid ID code?").
  • **`.str.extract(pattern)`** returns a DataFrame with the content captured by parenthetical groups in the pattern. Use it when you want to *pull out* specific parts of a string (e.g., "extract the year, month, and day from a date string").

The key distinction: `contains` and `match` answer "does this match?", while `extract` answers "what did it capture?"
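The three methods side by side, on a small invented Series of ID codes:

```python
import pandas as pd

codes = pd.Series(["ID-001", "old ID-002", "misc"])

print(codes.str.contains(r"ID-\d+").tolist())  # [True, True, False]  (anywhere)
print(codes.str.match(r"ID-\d+").tolist())     # [True, False, False] (from start)
print(codes.str.extract(r"ID-(\d+)")[0].tolist())  # ['001', '002', nan]
```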

Question 13. What is text normalization? Describe the typical steps in a text normalization pipeline for cleaning a categorical column.

Answer Text normalization is the process of transforming equivalent text values into a consistent, standardized form so that the same entity is always represented the same way. The typical pipeline is:

  1. **Strip whitespace** — remove leading/trailing spaces with `.str.strip()`
  2. **Standardize case** — convert to lowercase (or uppercase) with `.str.lower()`
  3. **Remove or standardize punctuation** — remove unnecessary dots, commas, etc.
  4. **Collapse whitespace** — replace multiple spaces with a single space
  5. **Map known variants** — use a dictionary to replace abbreviations and synonyms with standard forms
  6. **Verify** — check `.value_counts()` to confirm the normalization worked

The order matters: you should standardize case before mapping variants, so your mapping dictionary only needs lowercase keys.
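The pipeline above can be sketched on a small invented sample:

```python
import pandas as pd

s = pd.Series(["  USA ", "U.S.A.", "usa", "United  States", None])

clean = (s
    .str.strip()                           # 1. strip outer whitespace
    .str.lower()                           # 2. standardize case
    .str.replace(".", "", regex=False)     # 3. remove punctuation
    .str.replace(r"\s+", " ", regex=True)  # 4. collapse inner whitespace
    .replace({"usa": "united states"}))    # 5. map known variants

print(clean.value_counts())  # 6. verify: all four variants collapse to one value
```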

Question 14. Why is the na=False parameter important when using .str.contains() as a filter mask? What happens if you omit it?

Answer When a column contains missing values (`NaN`), `.str.contains()` returns `NaN` for those rows by default — not `True` or `False`. If you then try to use this Series as a boolean mask to filter a DataFrame (e.g., `df[df["col"].str.contains("pattern")]`), pandas will raise a `ValueError` because it can't interpret `NaN` as a boolean. Setting `na=False` tells pandas to treat missing values as non-matches (return `False`), which produces a clean boolean Series suitable for filtering. You could also use `na=True` if you want missing values to be included in your filtered result.
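A short demonstration on invented data, comparing the mask with and without `na=False`:

```python
import pandas as pd

df = pd.DataFrame({"col": ["apple pie", None, "apple tart"]})

# Without na=, the mask contains NaN and is not a clean boolean Series
mask = df["col"].str.contains("apple")
print(mask.tolist())  # [True, nan, True]

# With na=False, missing values become non-matches and filtering works
safe = df["col"].str.contains("apple", na=False)
print(df[safe]["col"].tolist())  # ['apple pie', 'apple tart']
```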

Question 15. Explain in plain English what the regex pattern ^(?P<first>[A-Z][a-z]+)\s(?P<last>[A-Z][a-z]+)$ matches and what each part does.

Answer This pattern matches a string containing a first name and last name, both starting with an uppercase letter followed by one or more lowercase letters, separated by a space. Breaking it down:

  • `^` — the match must start at the beginning of the string
  • `(?P<first>...)` — a named capture group called "first"
  • `[A-Z]` — one uppercase letter (the capital letter that starts the name)
  • `[a-z]+` — one or more lowercase letters (the rest of the name)
  • `\s` — exactly one whitespace character (the space between names)
  • `(?P<last>...)` — a named capture group called "last" with the same letter pattern
  • `$` — the match must end at the end of the string

It would match "John Smith" (capturing first="John", last="Smith") but not "john smith" (no capital letters), "John O'Brien" (apostrophe not allowed), or "John Michael Smith" (three words, not two).
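The match and non-match cases can be checked directly with `re.match`:

```python
import re

pattern = r"^(?P<first>[A-Z][a-z]+)\s(?P<last>[A-Z][a-z]+)$"

m = re.match(pattern, "John Smith")
print(m.group("first"), m.group("last"))  # John Smith

for bad in ["john smith", "John O'Brien", "John Michael Smith"]:
    print(re.match(pattern, bad))  # None each time
```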

Section 4: Applied Scenarios (3 questions, 5 points each)


Question 16. You receive a dataset of vaccination records where the vaccine_name column contains entries like "Pfizer-BioNTech COVID-19 Vaccine", "PFIZER", "pfizer biontech", "Moderna COVID-19 Vaccine", "moderna", and "MODERNA mRNA". Write the pandas code to standardize this column so that all Pfizer entries become "Pfizer" and all Moderna entries become "Moderna". Handle missing values.

Answer
# Step 1: Create a lowercase working copy
clean = df["vaccine_name"].str.strip().str.lower()

# Step 2: Classify using str.contains
import numpy as np
df["vaccine_standard"] = np.where(
    clean.str.contains("pfizer|biontech", na=False), "Pfizer",
    np.where(
        clean.str.contains("moderna|mrna", na=False), "Moderna",
        "Other"
    )
)
Alternative approach using a function with `apply()`:
def standardize_vaccine(name):
    if pd.isna(name):
        return "Unknown"
    name = name.strip().lower()
    if "pfizer" in name or "biontech" in name:
        return "Pfizer"
    elif "moderna" in name or "mrna" in name:
        return "Moderna"
    return "Other"

df["vaccine_standard"] = df["vaccine_name"].apply(standardize_vaccine)
Both approaches are valid. The key is: (1) normalize case first, (2) use pattern matching (either `str.contains` or `in`) to classify, (3) handle NaN explicitly.

Question 17. A colleague writes this regex to extract phone numbers: r"\d{3}-\d{3}-\d{4}". Their dataset has phone numbers in formats like "555-123-4567", "(555) 123-4567", "555.123.4567", and "5551234567". Explain why their pattern only matches the first format, and write an improved pattern that matches all four.

Answer The pattern `r"\d{3}-\d{3}-\d{4}"` requires literal hyphens between the digit groups. It fails because:

  • "(555) 123-4567" uses parentheses and a space instead of the first hyphen
  • "555.123.4567" uses dots instead of hyphens
  • "5551234567" has no separators at all

An improved pattern:
r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
Breakdown:

  • `\(?` — optional opening parenthesis
  • `\d{3}` — three digits
  • `\)?` — optional closing parenthesis
  • `[-.\s]?` — optional separator (hyphen, dot, or space)
  • `\d{3}` — three digits
  • `[-.\s]?` — optional separator
  • `\d{4}` — four digits

This matches all four formats. For production use, you'd also want anchors (`^...$`) to ensure the entire string is a phone number, not just a substring.
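A quick sanity check of the improved pattern against all four formats:

```python
import re

pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
numbers = ["555-123-4567", "(555) 123-4567",
           "555.123.4567", "5551234567"]
for number in numbers:
    print(number, bool(re.fullmatch(pattern, number)))  # True for all four
```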

Question 18. You have a column of product descriptions: ["250mL bottle", "500 mL jug", "1L container", "750ml flask", "2 L tank"]. Write code using str.extract() to create two new columns: amount_ml (all amounts converted to milliliters as floats) and original_unit. Show the expected output.

Answer
import pandas as pd

products = pd.Series(["250mL bottle", "500 mL jug",
                       "1L container", "750ml flask",
                       "2 L tank"])

# Extract amount and unit (named groups become columns)
extracted = products.str.extract(
    r"(?P<amount>\d+\.?\d*)\s*(?P<original_unit>[mM]?[lL])"
)

# Convert amount to float
extracted["amount"] = extracted["amount"].astype(float)

# Standardize unit to lowercase
extracted["original_unit"] = extracted["original_unit"].str.lower()

# Convert L to mL
extracted["amount_ml"] = extracted.apply(
    lambda row: row["amount"] * 1000 if row["original_unit"] == "l"
    else row["amount"], axis=1
)
Expected output:
   amount original_unit  amount_ml
0   250.0            ml      250.0
1   500.0            ml      500.0
2     1.0             l     1000.0
3   750.0            ml      750.0
4     2.0             l     2000.0
The key challenge is handling both "mL"/"ml" and "L"/"l" units and performing the conversion.

Section 5: Code Analysis (2 questions, 5 points each)


Question 19. What does the following code produce? Trace through each step and show the intermediate results.

import pandas as pd

s = pd.Series(["  John Smith (MD)  ",
               "Jane Doe (PhD)",
               "Bob Wilson",
               None,
               "Alice Brown (RN)"])

result = (s
    .str.strip()
    .str.extract(r"^(\w+)\s(\w+)(?:\s\((\w+)\))?$"))

result.columns = ["first", "last", "title"]
print(result)
Answer Step-by-step:

1. `.str.strip()` removes leading/trailing whitespace:

   0    "John Smith (MD)"
   1    "Jane Doe (PhD)"
   2    "Bob Wilson"
   3    NaN
   4    "Alice Brown (RN)"

2. `.str.extract()` applies the pattern with three groups:

  • `(\w+)` — first name
  • `(\w+)` — last name
  • `(?:\s\((\w+)\))?` — optional space + parenthesized title (non-capturing outer group, capturing inner group)

Result:
   first   last title
0   John  Smith    MD
1   Jane    Doe   PhD
2    Bob Wilson   NaN
3    NaN    NaN   NaN
4  Alice  Brown    RN
Row 2 has no title (parenthesized part is optional, so `NaN`). Row 3 is all `NaN` because the input was `None`.

Question 20. The following code has a bug. Identify the bug, explain why it causes a problem, and fix it.

import pandas as pd

df = pd.DataFrame({
    "price": ["$12.99", "$8.50", "N/A", "$25.00", None]
})

# Goal: convert to numeric, treating N/A as missing
df["price_clean"] = (df["price"]
    .str.replace("$", "")
    .str.replace("N/A", "")
    .astype(float))
Answer **The bug:** `str.replace("$", "")` without an explicit `regex=` argument is ambiguous: in pandas versions before 2.0 the default was `regex=True`, so `$` is treated as the end-of-string anchor and the dollar sign is never removed. There is a second bug: replacing "N/A" with an empty string `""` and then calling `.astype(float)` fails because an empty string can't be converted to float. **Fixed code:**
import numpy as np

df["price_clean"] = (df["price"]
    .str.replace("$", "", regex=False)  # literal dollar sign
    .replace("N/A", np.nan)             # replace N/A with a proper missing value
    .astype(float))                     # now conversion works
Or alternatively, let `pd.to_numeric` coerce anything non-numeric to missing:
df["price_clean"] = pd.to_numeric(
    df["price"].str.replace(r"[$,]", "", regex=True),  # remove $ and commas
    errors="coerce")  # "N/A" (and None) become NaN
Two key fixes: (1) escape the dollar sign or use `regex=False`, and (2) convert "N/A" to a proper pandas missing value (`pd.NA` or `np.nan`) rather than an empty string.