Learning Objectives
- Apply pandas string methods (.str.lower, .str.contains, .str.split, .str.replace) to clean and transform text columns
- Construct regular expressions for common data extraction tasks (dates, numbers, codes, emails) using re module and pandas .str.extract
- Extract structured information from unstructured text fields using capture groups
- Standardize messy categorical text data (inconsistent capitalization, abbreviations, misspellings)
- Evaluate when regex is the right tool versus simpler string methods, avoiding over-engineering
In This Chapter
- Chapter Overview
- 10.1 Why Text Data Is Different (And Why It Matters)
- 10.2 The .str Accessor: String Methods for Entire Columns
- 10.3 Searching Text with .str.contains()
- 10.4 Splitting Text with .str.split()
- 10.5 Replacing Text with .str.replace()
- 10.6 Introduction to Regular Expressions: A Mini-Language for Patterns
- 10.7 Quantifiers: How Many Times Should a Pattern Repeat?
- 10.8 Character Classes: Matching Categories of Characters
- 10.9 Capture Groups: Extracting the Parts You Care About
- 10.10 Threshold Concept: Regular Expressions as a Mini-Language for Describing Patterns
- 10.11 Putting It Together: Alternation, Anchors, and Escaping
- 10.12 Greedy vs. Lazy Matching
- 10.13 Text Normalization: A Systematic Approach
- 10.14 Tokenization: Breaking Text into Words
- 10.15 When to Use Regex vs. String Methods: A Decision Guide
- 10.16 Project Checkpoint: Extracting Vaccine Manufacturers from Messy Text
- 10.17 Common Regex Patterns for Data Science
- 10.18 The re Module: Beyond findall
- 10.19 Debugging Regex: When Your Pattern Doesn't Match
- 10.20 Real-World Application: Text Data in Public Health
- 10.21 Spaced Review: Concepts from Chapters 1-9
- Chapter Summary
Chapter 10: Working with Text Data — String Methods, Regular Expressions, and Extracting Meaning
"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." — Jamie Zawinski, early Netscape developer
Chapter Overview
Here's something nobody tells you in the first week of a data science course: a huge amount of real-world data is text. Not neat numbers in tidy columns. Text. Messy, inconsistent, riddled-with-typos text.
Survey responses where someone typed "new york" and someone else typed "New York City" and a third person typed "NYC." Medical records where a drug is spelled "Metformin," "metformin," "METFORMIN," and — alarmingly — "metforman." Product listings where the size is buried inside a sentence like "Available in 250mL and 500mL bottles." Vaccination records where the manufacturer field reads "Pfizer-BioNTech COVID-19 Vaccine" in one row and "PFIZER" in the next.
If you can't wrangle text, you can't wrangle most real data.
The good news is that pandas gives you a powerful set of tools for working with text — the .str accessor, which lets you apply string operations to an entire column at once. And for the really tricky patterns, there's an ancient and powerful tool called regular expressions (regex), which is essentially a mini-language for describing patterns in text.
Regular expressions have a fearsome reputation. That quote at the top of this chapter? It's one of the most famous jokes in programming. But here's the thing: regex earned that reputation because people try to learn it all at once, or use it for problems where a simple string method would suffice. We're going to learn it gradually, starting with problems where regex is genuinely the right tool, and we're going to learn when not to use it just as carefully as we learn how to use it.
In this chapter, you will learn to:
- Apply pandas string methods (.str.lower, .str.contains, .str.split, .str.replace) to clean and transform text columns
- Construct regular expressions for common data extraction tasks (dates, numbers, codes) using the re module and pandas .str.extract
- Extract structured information from unstructured text fields using capture groups
- Standardize messy categorical text data (inconsistent capitalization, abbreviations, misspellings)
- Evaluate when regex is the right tool versus simpler string methods, avoiding over-engineering
10.1 Why Text Data Is Different (And Why It Matters)
Let's start with a question: why can't we just treat text like any other column?
Try this in your notebook:
import pandas as pd
countries = pd.Series(["United States", "united states",
"USA", "U.S.A.", "US"])
countries.nunique()
5
Five "unique" values — but they all refer to the same country. If you tried to group vaccination data by country using this column, you'd get five separate groups for the United States. Your analysis would be silently, catastrophically wrong.
This is the fundamental challenge of text data: computers see text as sequences of characters, not as meaning. To a computer, "USA" and "United States" have nothing in common. They share zero characters in the same positions. Getting a computer to understand that these refer to the same entity requires you to explicitly tell it — through cleaning, standardization, and pattern matching.
The Three Pillars of Text Wrangling
Every text data problem falls into one of three categories:
- Standardization — Making equivalent values look the same ("NYC" and "New York City" should become a single value)
- Extraction — Pulling structured information out of unstructured text (getting the number "250" out of "250mL bottle")
- Searching — Finding rows that match certain patterns (all entries containing a phone number)
Pandas string methods handle most standardization tasks. Regular expressions are the power tool for extraction and complex searching. Knowing which pillar your problem falls into tells you which tool to reach for.
A Quick Refresher: Python String Methods
Before we dive into pandas, let's remember that Python strings already have useful methods. You met some of these back in Chapter 3:
name = " Elena Rodriguez "
name.strip() # 'Elena Rodriguez'
name.lower() # ' elena rodriguez '
name.upper() # ' ELENA RODRIGUEZ '
name.replace("e", "X") # ' XlXna RodriguXz '
"Elena" in name # True
name.startswith(" ") # True
name.split() # ['Elena', 'Rodriguez']
These methods work great on individual strings. But what about a column of 50,000 strings? You could write a loop:
# This works, but it's slow and clunky
cleaned = []
for country in df["country"]:
cleaned.append(country.strip().lower())
df["country_clean"] = cleaned
That loop processes one string at a time. It's slow on large datasets and it's verbose. Pandas has a better way.
Check Your Understanding
- Why does "USA" == "United States" evaluate to False in Python?
- If a column has values "male", "Male", "MALE", and "M", how many unique values would pandas count?
- What Python string method would you use to remove leading and trailing spaces from a name?
10.2 The .str Accessor: String Methods for Entire Columns
The .str accessor is one of pandas' most practical features. It lets you call string methods on every value in a Series at once — no loop required.
Your First .str Operations
countries = pd.Series(["United States", " Brazil ",
"GERMANY", "united kingdom"])
countries.str.lower()
0 united states
1 brazil
2 germany
3 united kingdom
dtype: object
countries.str.strip()
0 United States
1 Brazil
2 GERMANY
3 united kingdom
dtype: object
countries.str.upper()
0 UNITED STATES
1 BRAZIL
2 GERMANY
3 UNITED KINGDOM
dtype: object
You can chain them, just like regular Python string methods:
countries.str.strip().str.lower()
0 united states
1 brazil
2 germany
3 united kingdom
dtype: object
That single line does what our four-line loop did before. Every value gets stripped of whitespace, then converted to lowercase. No loop needed.
How .str Handles Missing Values
Here's something important: real data has missing values, and the .str accessor handles them gracefully.
messy = pd.Series(["Pfizer", None, "MODERNA", " janssen "])
messy.str.lower().str.strip()
0 pfizer
1 None
2 moderna
3 janssen
dtype: object
The None stays as None (technically NaN) instead of crashing with an error. If you tried to call .lower() on None in regular Python, you'd get an AttributeError. The .str accessor just skips missing values. This is exactly the behavior you want when cleaning a messy column.
The Essential .str Methods
Here's your toolkit. You don't need to memorize all of these right now — but knowing they exist means you'll recognize when to use them.
Case conversion:
s.str.lower() # all lowercase
s.str.upper() # ALL UPPERCASE
s.str.title() # Title Case
s.str.capitalize() # First letter capitalized
Whitespace handling:
s.str.strip() # remove leading/trailing whitespace
s.str.lstrip() # remove leading whitespace only
s.str.rstrip() # remove trailing whitespace only
Searching:
s.str.contains("pattern") # True/False for each row
s.str.startswith("prefix") # True/False
s.str.endswith("suffix") # True/False
s.str.find("substring") # position of first match (-1 if none)
Replacing:
s.str.replace("old", "new") # replace substring
Splitting and joining:
s.str.split(",") # split each value into a list
s.str.split(",", expand=True) # split into separate columns
s.str.cat(sep=", ") # join all values into one string
Length and slicing:
s.str.len() # length of each string
s.str[0] # first character
s.str[:3] # first three characters
s.str[-4:] # last four characters
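Slicing is especially handy when values follow a fixed-width layout. A minimal sketch, using hypothetical record IDs (the ID format here is illustrative, not from the chapter's dataset):

```python
import pandas as pd

# Hypothetical vaccination-record IDs with a fixed-width layout
ids = pd.Series(["VAX-2023-0001", "VAX-2023-0002", "VAX-2023-0003"])

prefixes = ids.str[:3]      # program prefix: 'VAX'
sequence = ids.str[-4:]     # sequence number: '0001', '0002', '0003'
lengths = ids.str.len()     # every ID should be 13 characters: a quick sanity check
```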
Let's see some of these in action with a realistic example.
Practical Example: Cleaning Country Names
Elena's vaccination dataset has a column called country that's a mess. Let's clean it step by step:
df = pd.DataFrame({
"country": [" United States ", "united states",
"USA", "U.S.A.", "Brazil", "BRAZIL",
"brazil ", "United Kingdom", "UK",
"Côte d'Ivoire", None, "germany"]
})
# Step 1: Strip whitespace and standardize case
df["clean"] = df["country"].str.strip().str.lower()
print(df["clean"].unique())
['united states' 'usa' 'u.s.a.' 'brazil' 'united kingdom'
'uk' nan "côte d'ivoire" 'germany']
That collapsed 12 entries down to 8 distinct values (plus a missing value) — but "united states," "usa," and "u.s.a." are still separate. For those, we need to map abbreviations to standard names:
# Step 2: Replace known abbreviations
replacements = {
"usa": "united states",
"u.s.a.": "united states",
"uk": "united kingdom"
}
df["clean"] = df["clean"].replace(replacements)
print(df["clean"].unique())
['united states' 'brazil' 'united kingdom' nan
"côte d'ivoire" 'germany']
Down to 5 unique countries (plus one missing value). That's clean.
Notice the workflow: we used .str methods for general standardization (strip, lower), then used .replace() with a dictionary for specific mappings. This two-step approach — general standardization first, then specific fixes — is the standard recipe for cleaning categorical text.
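That two-step recipe is worth capturing as a small helper you can reuse across columns. A minimal sketch (the function name and the mapping are illustrative, not from the chapter):

```python
import pandas as pd

def clean_categories(s, mapping):
    """Standardize a text column: strip whitespace, lowercase, then map known variants."""
    return s.str.strip().str.lower().replace(mapping)

raw = pd.Series([" USA ", "Brazil", "u.s.a."])
cleaned = clean_categories(
    raw, {"usa": "united states", "u.s.a.": "united states"}
)
```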
Check Your Understanding
- What's the difference between s.str.strip() and s.str.replace(" ", "")?
- Why do we call .str.strip() before .str.lower() rather than the other way around? (Hint: does the order actually matter here?)
- What would happen if we used df["country"].lower() instead of df["country"].str.lower()?
10.3 Searching Text with .str.contains()
One of the most common text operations is searching: finding all rows where a column contains a certain word or pattern.
products = pd.Series([
"Pfizer-BioNTech COVID-19 Vaccine",
"Moderna COVID-19 Vaccine",
"Janssen (Johnson & Johnson) Vaccine",
"AstraZeneca COVID-19 Vaccine",
"Sinovac COVID-19 Vaccine",
"Flu Vaccine (seasonal)",
"MMR Vaccine"
])
# Find all COVID-19 vaccines
products.str.contains("COVID-19")
0 True
1 True
2 False
3 True
4 True
5 False
6 False
dtype: bool
This returns a Boolean Series — perfect for filtering a DataFrame:
df = pd.DataFrame({"product": products, "doses": [100, 80, 50, 70, 60, 40, 30]})
covid_vaccines = df[df["product"].str.contains("COVID-19")]
Case-Insensitive Searching
What if the data has inconsistent capitalization?
notes = pd.Series(["Patient received Pfizer vaccine",
"PFIZER administered",
"pfizer booster given",
"Moderna first dose"])
notes.str.contains("Pfizer")
0 True
1 False
2 False
3 False
dtype: bool
Only the first row matched because str.contains is case-sensitive by default. Fix it with case=False:
notes.str.contains("Pfizer", case=False)
0 True
1 True
2 True
3 False
dtype: bool
Now all three Pfizer entries match, regardless of capitalization.
Handling NaN in .str.contains()
If your column has missing values, str.contains will return NaN for those rows by default, which can cause trouble when you try to use the result as a filter:
messy = pd.Series(["Pfizer vaccine", None, "Moderna vaccine"])
messy.str.contains("Pfizer")
0 True
1 NaN
2 False
dtype: object
That NaN will cause an error if you try to use it as a boolean mask. Use na=False to treat missing values as non-matches:
messy.str.contains("Pfizer", na=False)
0 True
1 False
2 False
dtype: bool
This is one of those small details that will save you twenty minutes of debugging. Make na=False a habit.
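In practice, case=False and na=False usually travel together. A minimal sketch with a hypothetical notes column:

```python
import pandas as pd

df = pd.DataFrame({"note": ["Pfizer vaccine", None, "PFIZER booster", "Moderna dose"]})

# Case-insensitive search that treats missing notes as non-matches
mask = df["note"].str.contains("pfizer", case=False, na=False)
pfizer_rows = df[mask]   # rows 0 and 2
```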
10.4 Splitting Text with .str.split()
Sometimes useful information is packed into a single column, separated by a delimiter. str.split breaks it apart.
locations = pd.Series([
"New York, NY",
"Los Angeles, CA",
"Chicago, IL",
"Houston, TX"
])
# Split on comma
locations.str.split(", ")
0 [New York, NY]
1 [Los Angeles, CA]
2 [Chicago, IL]
3 [Houston, TX]
dtype: object
Each value becomes a list. That's useful, but usually you want separate columns. Use expand=True:
locations.str.split(", ", expand=True)
0 1
0 New York NY
1 Los Angeles CA
2 Chicago IL
3 Houston TX
Now you have two columns. You can assign them to your DataFrame:
df = pd.DataFrame({"location": locations})
df[["city", "state"]] = df["location"].str.split(", ", expand=True)
Splitting with a Limit
Sometimes you only want to split on the first occurrence:
entries = pd.Series([
"Smith, John, MD",
"Garcia, Maria, PhD",
"Chen, Wei, DDS"
])
# Split into exactly 2 parts (name, rest)
entries.str.split(", ", n=1, expand=True)
0 1
0 Smith John, MD
1 Garcia Maria, PhD
2 Chen Wei, DDS
The n=1 parameter means "split at most once," giving you two columns. The second column contains everything after the first comma.
Getting a Specific Part After Splitting
If you just need one part, use .str.get() or indexing:
locations.str.split(", ").str[0] # city names
locations.str.split(", ").str[1] # state codes
0 New York
1 Los Angeles
2 Chicago
3 Houston
dtype: object
Check Your Understanding
- What's the difference between str.split(",") and str.split(", ")? When does it matter?
- If a value is "Red, Green, Blue" and you use str.split(", ", n=1), what will the result be?
- Why might expand=True be more useful than the default behavior?
10.5 Replacing Text with .str.replace()
The .str.replace() method substitutes one substring for another across an entire column.
vaccines = pd.Series([
"Pfizer-BioNTech COVID-19 Vaccine",
"Moderna COVID-19 Vaccine (mRNA-1273)",
"Johnson & Johnson (Janssen) Vaccine"
])
# Remove "COVID-19" from all entries
vaccines.str.replace("COVID-19 ", "")
0 Pfizer-BioNTech Vaccine
1 Moderna Vaccine (mRNA-1273)
2 Johnson & Johnson (Janssen) Vaccine
dtype: object
You can chain replacements:
# Standardize to just manufacturer names
(vaccines
.str.replace("COVID-19 ", "")
.str.replace("Vaccine", "")
.str.replace(r"\(.*?\)", "", regex=True)
.str.strip())
0 Pfizer-BioNTech
1 Moderna
2 Johnson & Johnson
dtype: object
Wait — what was that regex=True about? That's a sneak preview of what's coming next. The pattern \(.*?\) is a regular expression that matches anything inside parentheses. We'll learn exactly how it works in the next section.
Important note: the default behavior of str.replace changed across pandas versions. Before pandas 2.0, the pattern was treated as a regular expression by default; in pandas 2.0 and later, the default is a literal replacement (regex=False). To be safe and explicit, always pass regex=False when you mean a literal replacement, and regex=True when you mean a pattern:
# Literal replacement — no regex
s.str.replace(".", "", regex=False)
# Pattern replacement — uses regex
s.str.replace(r"\d+", "NUM", regex=True)
10.6 Introduction to Regular Expressions: A Mini-Language for Patterns
This is the big one. Regular expressions — universally abbreviated as regex — are a pattern-matching language that has been part of computing since the 1960s. They work in Python, JavaScript, Java, Ruby, SQL, grep, sed, and dozens of other tools. Learning regex once means you can use it everywhere.
But regex has a reputation for being cryptic. A pattern like ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$ is, honestly, not pretty. It matches email addresses, but reading it feels like decoding ancient glyphs.
Here's our approach: we're going to start small. Really small. And we're going to build up piece by piece, always with practical examples, always asking "why would I need this?"
What Is a Regular Expression?
A regular expression is a string that describes a pattern of characters. You use it to say things like:
- "Find any string that starts with a digit"
- "Find any string that contains a phone number"
- "Extract the part of this string that looks like a date"
Think of regex as a way to describe the shape of text without knowing the exact text.
Literal Characters: The Simplest Patterns
The simplest regex is just a string of normal characters:
import re
text = "The vaccine was administered on 2023-03-15"
re.search("vaccine", text)
<re.Match object; span=(4, 11), match='vaccine'>
The pattern "vaccine" matches the literal word "vaccine." This is no different from using in or str.contains. But regex can do much more.
The re Module Basics
Python's re module has four essential functions:
import re
text = "Patient ID: 12345, Date: 2023-03-15"
# Does the pattern appear anywhere in the text?
re.search(r"\d+", text) # finds '12345'
# Does the text start with this pattern?
re.match(r"Patient", text) # matches 'Patient'
# Find ALL occurrences
re.findall(r"\d+", text) # ['12345', '2023', '03', '15']
# Replace pattern with something else
re.sub(r"\d+", "X", text) # 'Patient ID: X, Date: X-X-X'
Notice the r before each pattern string. This is a raw string — it tells Python not to interpret backslashes as escape sequences. Always use raw strings for regex patterns. Always. It will save you from mysterious bugs.
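To see why raw strings matter, compare what happens when the backslash is and isn't protected. The classic trap is \b: in a normal Python string it's a backspace character, not a regex word boundary:

```python
import re

text = "word boundary"

# Without r"", Python turns \b into a backspace character before re ever sees it
no_match = re.findall("\bword\b", text)    # finds nothing
# With a raw string, \b reaches the regex engine as a word-boundary anchor
match = re.findall(r"\bword\b", text)      # finds 'word'
```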
Your First Special Characters: The Dot and the Backslash-d
Now let's learn two special characters that make regex more than just literal matching.
The dot (.) matches any single character (except a newline):
re.findall(r"c.t", "cat cot cut chart coat")
['cat', 'cot', 'cut']
The pattern c.t means "a c, then any character, then a t." It matches "cat," "cot," and "cut" but not "chart" (too many characters between c and t) or "coat" (same reason).
The \d matches any digit (0-9):
re.findall(r"\d", "Room 42, Floor 3")
['4', '2', '3']
Each \d matches one digit. But what if you want to match a multi-digit number?
10.7 Quantifiers: How Many Times Should a Pattern Repeat?
Quantifiers tell regex how many times the preceding character or group should appear.
| Quantifier | Meaning | Example | Matches |
|---|---|---|---|
| + | One or more | \d+ | "1", "42", "12345" |
| * | Zero or more | \d* | "", "1", "42" |
| ? | Zero or one | \d? | "", "5" |
| {n} | Exactly n | \d{3} | "123", "456" |
| {n,m} | Between n and m | \d{2,4} | "12", "123", "1234" |
The most useful quantifier for data work is + (one or more):
text = "Patient ID: 12345, Date: 2023-03-15"
# \d+ means "one or more digits"
re.findall(r"\d+", text)
['12345', '2023', '03', '15']
Compare this to \d alone, which would give you individual digits. The + says "keep matching digits until you hit something that isn't a digit."
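The bounded quantifiers from the table are handy for fixed-width fields like years. A quick sketch:

```python
import re

date = "2023-03-15"

years = re.findall(r"\d{4}", date)   # only the 4-digit run qualifies
pairs = re.findall(r"\d{2}", date)   # every 2-digit run, left to right
```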
Combining Dots and Quantifiers
# .+ means "one or more of any character"
re.search(r"ID: .+,", "Patient ID: 12345, Date: 2023, verified")
<re.Match object; span=(8, 30), match='ID: 12345, Date: 2023,'>
Hmm, that matched more than expected — all the way to the last comma, not the first one. The .+ is greedy — it matches as much as possible. We'll learn about greedy versus lazy matching later in this chapter.
A Practical Example: Extracting Numbers from Text
Suppose you have a column of medication dosages written in free text:
dosages = pd.Series([
"Take 500mg twice daily",
"Apply 2.5mL topically",
"Inject 10 units subcutaneously",
"250 mcg nasal spray"
])
# Extract the first number from each entry
dosages.str.extract(r"(\d+\.?\d*)")
0
0 500
1 2.5
2 10
3 250
Wait — there's a new method here. Let's talk about .str.extract() and capture groups.
Check Your Understanding
- What's the difference between \d and \d+?
- What would re.findall(r"\d{4}", "Phone: 555-1234, Ext 42") return?
- Why do we use r"..." (raw strings) for regex patterns?
10.8 Character Classes: Matching Categories of Characters
Sometimes you need to match a specific set of characters, not just "any character" (.) or "any digit" (\d). That's what character classes are for.
Built-in Character Classes
| Pattern | Matches | Equivalent To |
|---|---|---|
| \d | Any digit | [0-9] |
| \D | Any non-digit | [^0-9] |
| \w | Any "word" character | [A-Za-z0-9_] |
| \W | Any non-word character | [^A-Za-z0-9_] |
| \s | Any whitespace | [ \t\n\r\f] |
| \S | Any non-whitespace | [^ \t\n\r\f] |
The uppercase versions are always the opposite of the lowercase ones. \d matches digits; \D matches everything that isn't a digit.
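A quick sketch contrasting \w+ (runs of word characters) with \S+ (runs of anything that isn't whitespace):

```python
import re

text = "COVID-19 (mRNA) vaccine!"

words = re.findall(r"\w+", text)     # punctuation splits the runs
chunks = re.findall(r"\S+", text)    # punctuation stays attached
```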
Custom Character Classes with Square Brackets
Square brackets let you define your own set of allowed characters:
# Match any vowel
re.findall(r"[aeiou]", "hello world")
['e', 'o', 'o']
# Match any character that's a letter or a hyphen
re.findall(r"[A-Za-z-]+", "Pfizer-BioNTech COVID-19")
['Pfizer-BioNTech', 'COVID-']
You can use ranges inside brackets: [A-Z] means any uppercase letter, [a-z] any lowercase letter, [0-9] any digit.
Negated Character Classes
A caret (^) at the start of a character class means "anything except these characters":
# Match anything that's not a digit or hyphen
re.findall(r"[^0-9-]+", "2023-03-15 vaccine administered")
[' vaccine administered']
A Practical Example: Validating Format Codes
Suppose your data has country codes that should be exactly two uppercase letters:
codes = pd.Series(["US", "uk", "FR", "123", "DE", "X", "BR"])
# Check which ones match the pattern: exactly 2 uppercase letters
codes.str.match(r"^[A-Z]{2}$")
0 True
1 False
2 True
3 False
4 True
5 False
6 True
dtype: bool
Let's unpack that pattern: ^[A-Z]{2}$
- ^ — start of string
- [A-Z] — one uppercase letter
- {2} — exactly two times
- $ — end of string
Together: "The entire string must be exactly two uppercase letters." The anchors ^ and $ are important — without them, the pattern would match any string that contains two uppercase letters, even "123AB456".
10.9 Capture Groups: Extracting the Parts You Care About
Here's where regex becomes truly powerful for data science. Capture groups let you extract specific parts of a match, not just find whether a pattern exists.
A capture group is created by wrapping part of a pattern in parentheses:
text = "Date: 2023-03-15"
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
match.group(0) # '2023-03-15' (entire match)
match.group(1) # '2023' (first group)
match.group(2) # '03' (second group)
match.group(3) # '15' (third group)
The pattern (\d{4})-(\d{2})-(\d{2}) says "match four digits, a hyphen, two digits, a hyphen, two digits." The parentheses mark three capture groups: year, month, and day.
Using Capture Groups with pandas .str.extract()
This is where capture groups shine in data science. The .str.extract() method pulls out the captured groups into separate DataFrame columns:
dates = pd.Series([
"Administered on 2023-03-15",
"Scheduled for 2023-04-20",
"Completed 2023-01-10",
"No date recorded"
])
dates.str.extract(r"(\d{4})-(\d{2})-(\d{2})")
0 1 2
0 2023 03 15
1 2023 04 20
2 2023 01 10
3 NaN NaN NaN
Each capture group becomes a column. Rows that don't match get NaN. You can name the groups for cleaner output:
dates.str.extract(
r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
)
year month day
0 2023 03 15
1 2023 04 20
2 2023 01 10
3 NaN NaN NaN
The (?P<name>...) syntax gives each capture group a name, which becomes the column header.
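Extracted dates are still strings. A common follow-up (a sketch, reusing the dates Series from above) is to capture the whole date and hand it to pd.to_datetime, which turns the non-matching row's NaN into NaT:

```python
import pandas as pd

dates = pd.Series([
    "Administered on 2023-03-15",
    "Scheduled for 2023-04-20",
    "Completed 2023-01-10",
    "No date recorded"
])

# Capture the full date, then parse it; the non-matching row becomes NaT
full = dates.str.extract(r"(\d{4}-\d{2}-\d{2})")[0]
parsed = pd.to_datetime(full, format="%Y-%m-%d")
```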
Extracting Multiple Matches with .str.extractall()
If each string might contain multiple matches, use .str.extractall():
notes = pd.Series([
"Received Pfizer on 2023-01-15, booster 2023-07-20",
"Moderna 2023-03-10",
"No vaccination record"
])
notes.str.extractall(r"(\d{4}-\d{2}-\d{2})")
0
match
0 0 2023-01-15
1 2023-07-20
1 0 2023-03-10
The result has a multi-index: the original row number and a match number within each row.
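Because extractall indexes its results by original row, a group-by on that index level answers "how many matches did each row have?" (a sketch using the same notes data):

```python
import pandas as pd

notes = pd.Series([
    "Received Pfizer on 2023-01-15, booster 2023-07-20",
    "Moderna 2023-03-10",
    "No vaccination record"
])

date_matches = notes.str.extractall(r"(\d{4}-\d{2}-\d{2})")
# Level 0 of the multi-index is the original row number
counts = date_matches.groupby(level=0).size()   # row 0 has 2 dates, row 1 has 1
```

Note that row 2, which had no matches, simply doesn't appear in the result.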
A Practical Example: Parsing Vaccine Entries
Elena's dataset has a column with vaccine descriptions that follow no consistent format:
vaccines = pd.Series([
"Pfizer-BioNTech (BNT162b2) 30mcg",
"Moderna (mRNA-1273) 100mcg",
"AstraZeneca (AZD1222) 0.5mL",
"Johnson & Johnson single dose",
"Sinovac (CoronaVac) 600SU"
])
# Extract manufacturer and code (if present)
vaccines.str.extract(
r"^(?P<manufacturer>[A-Za-z&\s-]+?)\s*\((?P<code>[^)]+)\)"
)
manufacturer code
0 Pfizer-BioNTech BNT162b2
1 Moderna mRNA-1273
2 AstraZeneca AZD1222
3 NaN NaN
4 Sinovac CoronaVac
Row 3 (Johnson & Johnson) didn't match because it has no code in parentheses. That's fine — NaN tells us which entries need different handling.
Let's break down that pattern piece by piece:
- ^ — start of string
- (?P<manufacturer>[A-Za-z&\s-]+?) — capture group named "manufacturer": one or more letters, ampersands, spaces, or hyphens (lazy match)
- \s* — zero or more spaces
- \( — a literal opening parenthesis (escaped because ( is special in regex)
- (?P<code>[^)]+) — capture group named "code": one or more characters that aren't a closing parenthesis
- \) — a literal closing parenthesis
Check Your Understanding
- What's the difference between re.search(r"\d+", text) and re.findall(r"\d+", text)?
- In the pattern (\d{4})-(\d{2})-(\d{2}), how many capture groups are there?
- What does str.extract() return when a row doesn't match the pattern?
10.10 Threshold Concept: Regular Expressions as a Mini-Language for Describing Patterns
Threshold Concept Alert: This is a concept that, once you truly grasp it, fundamentally changes how you see text data. It may feel uncomfortable at first.
Here's the mental shift: a regular expression is not a string. It's a program.
When you write r"\d{4}-\d{2}-\d{2}", you're not writing a string that somehow matches dates. You're writing instructions in a mini programming language. Those instructions say:
- Match a digit. Do this four times.
- Match a literal hyphen.
- Match a digit. Do this two times.
- Match a literal hyphen.
- Match a digit. Do this two times.
This language has its own vocabulary (\d, \w, \s), its own control structures (quantifiers, alternation), its own grouping mechanism (parentheses), and its own anchoring system (^, $). It's a language within a language.
Why does this matter? Because once you see regex as a language for describing patterns, three things change:
First, you stop trying to memorize patterns and start composing them. You don't memorize "the regex for a phone number." Instead, you think: "A phone number is three digits, a separator, three digits, a separator, four digits" — and you compose the pattern: \d{3}[-.\s]\d{3}[-.\s]\d{4}.
Second, you start seeing text as having structure. That messy free-text field? It's not chaos. It has patterns. Dates follow patterns. Product codes follow patterns. Addresses follow patterns. Regex is the tool for describing those patterns to a computer.
Third, you understand why regex can be both powerful and dangerous. A programming language that lets you express complex ideas in a few characters is powerful. But dense code is hard to read, debug, and maintain. A 50-character regex that nobody can understand is worse than five lines of clear Python that do the same thing.
This is why experienced data scientists follow a rule: use the simplest tool that works. If str.lower() solves your problem, don't use regex. If str.replace("old", "new") works, don't use regex. Save regex for the problems where you genuinely need pattern matching — extracting variable-format data, validating complex formats, finding patterns that can't be described by a literal string.
Before the threshold: "Regex is a weird way to search for text." After the threshold: "Regex is a language for describing the structure of text, and I can compose patterns from simple building blocks."
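To make the "compose, don't memorize" point concrete, here is the phone-number pattern from above assembled from named pieces (a minimal sketch):

```python
import re

# Each piece of the pattern has a name and a meaning
separator = r"[-.\s]"    # hyphen, dot, or whitespace
phone = rf"\d{{3}}{separator}\d{{3}}{separator}\d{{4}}"

found = re.findall(phone, "Call 555-123-4567 or 555.987.6543")
```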
Threshold Check
- Explain in your own words why regex is described as a "mini-language" rather than just a search tool.
- Given a string like "Invoice #2023-0042, Amount: $1,234.56", describe in plain English the structure you see — then sketch a regex pattern for each piece.
- A colleague wrote the regex ^[A-Z]{2}\d{6}[A-Z]$ and says "it matches passport numbers." Without running it, describe in words what strings this pattern will match.
10.11 Putting It Together: Alternation, Anchors, and Escaping
Let's round out our regex toolkit with three more concepts.
Alternation: Matching One Thing OR Another
The pipe character | means "or":
re.findall(r"cat|dog", "I have a cat and a dog and a catfish")
['cat', 'dog', 'cat']
Use parentheses to limit the scope of alternation. (Note the ?: below — it makes the group non-capturing, so findall returns the whole match instead of just the captured letter.)
# Alternation: matches "gray" or "grey"
re.findall(r"gray|grey", "The gray cat and grey dog")
# Parentheses limit the alternation to a single letter
re.findall(r"gr(?:a|e)y", "The gray cat and grey dog")
# A character class does the same thing even more concisely
re.findall(r"gr[ae]y", "The gray cat and grey dog")
In data science, alternation is great for matching multiple variants:
vaccines = pd.Series([
"Pfizer vaccine",
"Moderna shot",
"J&J jab",
"AstraZeneca injection"
])
# Find any entry mentioning a vaccine (different words)
vaccines.str.contains(
r"vaccine|shot|jab|injection", case=False
)
0 True
1 True
2 True
3 True
dtype: bool
Anchors: Matching Position, Not Characters
Anchors don't match characters — they match positions in the string.
| Anchor | Matches |
|---|---|
| ^ | Start of string |
| $ | End of string |
| \b | Word boundary |
# Without anchors: matches "cat" anywhere
re.findall(r"cat", "The cat caught a catfish")
# ['cat', 'cat', 'cat']
# With word boundary: matches "cat" as a whole word
re.findall(r"\bcat\b", "The cat caught a catfish")
# ['cat']
Word boundaries (\b) are extremely useful in data science for matching whole words without accidentally matching substrings:
# Searching for the state "IN" (Indiana)
states = pd.Series(["IN", "INDIANA", "MISSING", "IN PROGRESS"])
# Bad: matches "IN" inside other words
states.str.contains("IN")
# All True!
# Better: match "IN" as a complete word
states.str.contains(r"\bIN\b")
# True, False, False, True
Hmm, that last one ("IN PROGRESS") still matched. If you want to match only strings that are exactly "IN":
states.str.match(r"^IN$")
# True, False, False, False
Escaping Special Characters
In regex, characters like ., *, +, ?, (, ), [, ], {, }, ^, $, |, and \ have special meanings. If you want to match them literally, you need to escape them with a backslash:
# The dot matches any character
re.findall(r"3.14", "3.14 and 3X14")
# ['3.14', '3X14']
# Escaped dot matches only a literal dot
re.findall(r"3\.14", "3.14 and 3X14")
# ['3.14']
This comes up constantly with data that contains periods, dollar signs, parentheses, and other punctuation:
# Match a price like "$12.99"
re.findall(r"\$\d+\.\d{2}", "Total: $12.99 plus $3.50 shipping")
# ['$12.99', '$3.50']
The pattern \$\d+\.\d{2} means: a literal dollar sign, one or more digits, a literal dot, exactly two digits.
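When the literal text you want to match comes from data rather than from your own typing, re.escape adds the backslashes for you:

```python
import re

price = "$12.99"                 # literal text, full of regex special characters
pattern = re.escape(price)       # every special character gets escaped
found = re.findall(pattern, "Total: $12.99 plus $3.50 shipping")
```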
10.12 Greedy vs. Lazy Matching
This is a subtlety that trips up even experienced regex users.
By default, quantifiers are greedy — they match as much as possible:
text = "<b>bold</b> and <i>italic</i>"
re.findall(r"<.+>", text)
['<b>bold</b> and <i>italic</i>']
The .+ matched everything from the first < to the last >, swallowing all the text in between. That's not what we wanted.
Adding a ? after a quantifier makes it lazy — it matches as little as possible:
re.findall(r"<.+?>", text)
['<b>', '</b>', '<i>', '</i>']
Now .+? matches just enough to reach the nearest >.
In data science, you'll encounter this when extracting text between delimiters:
notes = pd.Series([
"Diagnosis: (Type 2 Diabetes) Treatment: (Metformin)",
"Notes: (Patient stable) Follow-up: (2 weeks)"
])
# Greedy: captures everything between first ( and last )
notes.str.extract(r"\((.+)\)")
# 0 Type 2 Diabetes) Treatment: (Metformin
# 1 Patient stable) Follow-up: (2 weeks
# Lazy: captures just the first parenthesized group
notes.str.extract(r"\((.+?)\)")
# 0 Type 2 Diabetes
# 1 Patient stable
The lazy version stops at the first closing parenthesis, which is almost always what you want.
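When you need every parenthesized chunk rather than just the first, `.str.findall()` applies the same lazy pattern repeatedly within each row. A short sketch on the same data:

```python
import pandas as pd

notes = pd.Series([
    "Diagnosis: (Type 2 Diabetes) Treatment: (Metformin)",
    "Notes: (Patient stable) Follow-up: (2 weeks)"
])

# findall with a lazy group returns every parenthesized chunk per row
all_groups = notes.str.findall(r"\((.+?)\)")
# 0    [Type 2 Diabetes, Metformin]
# 1       [Patient stable, 2 weeks]
```

Unlike `.str.extract()`, which returns one column per capture group, `.str.findall()` gives you a list per row, which you can then explode or index into.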
Check Your Understanding
- What's the difference between re.findall(r"<.+>", html) and re.findall(r"<.+?>", html)?
- Why does \$ match a literal dollar sign instead of "end of string"?
- How would you use \b to search for the word "data" without matching "database" or "metadata"?
10.13 Text Normalization: A Systematic Approach
Now that you have both string methods and regex in your toolkit, let's talk about a complete workflow for text normalization — the process of making equivalent text values consistent.
The Normalization Pipeline
Here's the workflow Elena uses in her public health work:
Step 1: Strip whitespace .str.strip()
Step 2: Standardize case .str.lower() or .str.upper()
Step 3: Remove punctuation .str.replace(r"[^\w\s]", "", regex=True)
Step 4: Collapse whitespace .str.replace(r"\s+", " ", regex=True)
Step 5: Map known variants .replace(mapping_dict)
Step 6: Verify results .value_counts()
Let's apply this to a messy column of country names:
df = pd.DataFrame({"country": [
" United States ", "united states", "USA",
"U.S.A.", "US", " U.S. ", "Brazil", "BRAZIL",
" brazil", "United Kingdom", "U.K.", "UK",
None, "germany", "Côte d'Ivoire"
]})
# Steps 1-2: Strip and lowercase
df["clean"] = df["country"].str.strip().str.lower()
# Step 3: Remove periods only (a blanket punctuation strip
# would also delete the apostrophe in Côte d'Ivoire)
df["clean"] = df["clean"].str.replace(".", "", regex=False)
# Step 4: Collapse multiple spaces
df["clean"] = df["clean"].str.replace(r"\s+", " ", regex=True)
# Step 5: Map abbreviations to standard names
mapping = {
"usa": "united states",
"us": "united states",
"uk": "united kingdom",
}
df["clean"] = df["clean"].replace(mapping)
# Step 6: Check results
print(df["clean"].value_counts())
united states 6
brazil 3
united kingdom 3
germany 1
côte d'ivoire 1
Name: count, dtype: int64
Fourteen messy entries collapsed into five clean values (plus one NaN). That's text normalization at work.
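The six steps collapse naturally into a reusable helper. Here's a sketch (the function name normalize_text is our own choice; adapt the punctuation class to your data):

```python
import pandas as pd

def normalize_text(s, mapping=None):
    """Steps 1-5 of the normalization pipeline; step 6 (verify) stays manual."""
    out = (s.str.strip()                                # 1: strip whitespace
            .str.lower()                                # 2: standardize case
            .str.replace(r"[^\w\s']", "", regex=True)   # 3: drop punctuation, keep apostrophes
            .str.replace(r"\s+", " ", regex=True))      # 4: collapse whitespace
    if mapping:
        out = out.replace(mapping)                      # 5: map known variants
    return out

clean = normalize_text(pd.Series([" USA ", "U.S.A.", "united   states"]),
                       {"usa": "united states"})
# All three collapse to "united states"
```

Keeping the pipeline in one function means every text column in a project gets cleaned the same way, instead of slightly different chains scattered across the notebook.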
When Mapping Isn't Enough: Fuzzy Matching
Sometimes the variations are too numerous or unpredictable to map manually. Consider medication names:
meds = pd.Series([
"metformin", "Metformin", "metforman",
"metphormin", "METFORMIN HCL", "metformin 500mg"
])
The third and fourth entries are misspellings. A simple mapping dictionary can't catch every possible misspelling. For these cases, there are fuzzy matching libraries like fuzzywuzzy or thefuzz, which we'll mention in the Further Reading. For now, know that this problem exists, and that regex combined with string methods can handle most standardization tasks — but not misspellings.
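You don't need a third-party library to see the idea, though: Python's standard-library difflib can snap a misspelling to the closest known name. A minimal sketch, assuming a hand-picked list of canonical spellings:

```python
import difflib
import pandas as pd

meds = pd.Series(["metformin", "metforman", "metphormin", "lisinopril"])

# Canonical spellings we trust (a hand-picked list for illustration)
canonical = ["metformin", "lisinopril", "atorvastatin"]

def closest_match(name, choices, cutoff=0.8):
    """Snap a name to its closest canonical spelling, or leave it alone."""
    matches = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else name

corrected = meds.apply(lambda m: closest_match(m, canonical))
# Both misspellings snap to "metformin"
```

The cutoff parameter controls how aggressive the matching is; too low and distinct drugs start merging, so always inspect the results with .value_counts() afterward.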
10.14 Tokenization: Breaking Text into Words
Tokenization is the process of splitting text into individual words (or "tokens"). It's the first step in any text analysis.
The simplest approach uses .str.split():
responses = pd.Series([
"The vaccine was safe and effective",
"I experienced mild side effects",
"No issues at all"
])
# Split into word lists
responses.str.split()
0 [The, vaccine, was, safe, and, effective]
1 [I, experienced, mild, side, effects]
2 [No, issues, at, all]
dtype: object
For counting words, combine with .str.len():
responses.str.split().str.len()
0 6
1 5
2 4
dtype: int64
For more sophisticated tokenization (handling punctuation, contractions, and special cases), libraries like NLTK or spaCy are the professional tools. But for the kind of text wrangling you'll do in data cleaning, str.split() combined with regex is usually sufficient.
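When punctuation gets in the way of plain .str.split(), a regex with .str.findall() pulls out just the word characters. A sketch on made-up responses:

```python
import pandas as pd

responses = pd.Series([
    "Vaccine was safe, effective -- no complaints!",
    "Side-effects? None."
])

# \w+ grabs runs of word characters, dropping punctuation
tokens = responses.str.lower().str.findall(r"\w+")
# 0    [vaccine, was, safe, effective, no, complaints]
# 1                              [side, effects, none]
```

Note that this treats hyphenated words like "side-effects" as two tokens; whether that's right depends on your analysis.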
Counting Specific Words
Want to know how many times a specific word appears in each entry?
reviews = pd.Series([
"Great product, really great quality",
"Good but not great",
"Great great great!"
])
reviews.str.lower().str.count(r"\bgreat\b")
0 2
1 1
2 3
dtype: int64
The \b word boundaries ensure we count "great" as a whole word, not as part of "greatest" or "greatly."
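To count every word across the whole column, not just one target word, tokenize each entry, explode the lists into one row per word, and tally:

```python
import pandas as pd

reviews = pd.Series([
    "Great product, really great quality",
    "Good but not great",
    "Great great great!"
])

word_counts = (reviews.str.lower()
                      .str.findall(r"\b\w+\b")   # tokenize each review
                      .explode()                 # one row per word
                      .value_counts())
# "great" leads with 6 occurrences
```

This tokenize-explode-count pattern is the quick-and-dirty version of a word frequency analysis, and it's often all you need during data cleaning.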
10.15 When to Use Regex vs. String Methods: A Decision Guide
This might be the most important section in the chapter. Regex is powerful, but it's also easy to overuse. Here's a decision guide:
Use simple string methods when:
- You're doing case conversion (.str.lower(), .str.upper())
- You're stripping whitespace (.str.strip())
- You're replacing a known, fixed substring (.str.replace("old", "new", regex=False))
- You're splitting on a simple delimiter (.str.split(","))
- You're checking for a known, fixed substring (.str.contains("exact text"))
Use regex when:
- You need to match a pattern rather than a fixed string ("any sequence of digits")
- You need to extract part of a string (capture groups with .str.extract())
- You need to match with flexibility (one word OR another, optional characters)
- You need anchoring (must start with, must end with, whole word match)
- You need to match repetition (exactly 3 digits, one or more letters)
Avoid regex when:
- A string method solves the problem just as well (simpler is better)
- The pattern would be unreadable (more than ~30 characters; consider breaking it up)
- You're trying to parse a structured format like HTML or JSON (use a proper parser)
- You're trying to match natural language meaning, not structure (use NLP tools)
Here's a concrete comparison:
# Task: Check if a string contains "covid"
# String method — simple, clear, fast
df["text"].str.contains("covid", case=False)
# Regex — unnecessary for this task
df["text"].str.contains(r"(?i)covid")
# Use the string method. It's clearer.
# Task: Extract a 5-digit ZIP code from an address
# String method — awkward, fragile
# (How do you know which 5 digits are the ZIP?)
# Regex — this is what it's for
df["address"].str.extract(r"(\d{5})(?:-\d{4})?$")
The guiding principle: start simple, escalate to regex only when you need pattern matching. Your future self (and your colleagues) will thank you.
10.16 Project Checkpoint: Extracting Vaccine Manufacturers from Messy Text
Let's apply everything we've learned to Elena's vaccination project. Her dataset has a column called vaccine_info that contains free-text descriptions of vaccines. She needs to extract two things:
- The manufacturer name
- Whether it's a primary dose or a booster
Here's a representative sample:
vaccine_data = pd.DataFrame({
"record_id": range(1, 11),
"vaccine_info": [
"Pfizer-BioNTech COVID-19 Vaccine (primary series)",
"MODERNA mRNA vaccine - booster dose",
"janssen single dose vaccine",
"Pfizer booster (3rd dose)",
"AstraZeneca/Oxford vaccine primary",
"moderna covid vaccine primary dose",
"PFIZER-BIONTECH (bivalent booster)",
"J&J/Janssen - single dose",
"pfizer primary series",
"Unknown vaccine type"
]
})
Step 1: Standardize text
vaccine_data["clean"] = (vaccine_data["vaccine_info"]
.str.strip()
.str.lower()
.str.replace(r"\s+", " ", regex=True))
Step 2: Extract manufacturer using pattern matching
# Define manufacturer patterns
manufacturer_patterns = {
"pfizer": r"pfizer|biontech",
"moderna": r"moderna|mrna-1273",
"janssen": r"janssen|j&j|johnson",
"astrazeneca": r"astrazeneca|oxford|azd1222",
"sinovac": r"sinovac|coronavac"
}
def identify_manufacturer(text):
    if pd.isna(text):
        return "unknown"
    for name, pattern in manufacturer_patterns.items():
        if re.search(pattern, text):
            return name
    return "unknown"
vaccine_data["manufacturer"] = (vaccine_data["clean"]
.apply(identify_manufacturer))
Step 3: Identify dose type
vaccine_data["is_booster"] = (vaccine_data["clean"]
.str.contains(r"booster|3rd dose|bivalent", na=False))
Step 4: Verify results
print(vaccine_data[["vaccine_info", "manufacturer",
"is_booster"]])
vaccine_info manufacturer is_booster
0 Pfizer-BioNTech COVID-19 Vaccine (primary se... pfizer False
1 MODERNA mRNA vaccine - booster dose moderna True
2 janssen single dose vaccine janssen False
3 Pfizer booster (3rd dose) pfizer True
4 AstraZeneca/Oxford vaccine primary astrazeneca False
5 moderna covid vaccine primary dose moderna False
6 PFIZER-BIONTECH (bivalent booster) pfizer True
7 J&J/Janssen - single dose janssen False
8 pfizer primary series pfizer False
9 Unknown vaccine type unknown False
Ten messy free-text entries, now with clean manufacturer names and booster flags. This is the kind of text wrangling that Elena does daily — and it's the kind of work that makes the difference between a dataset you can analyze and a dataset you can only stare at.
Step 5: Standardize country name variations
While we're at it, let's also tackle the country name standardization that Elena needs:
# Suppose the dataset has these country variations
countries = pd.Series([
"United States of America", "USA", "U.S.",
"Republic of Korea", "South Korea", "Korea, South",
"Viet Nam", "Vietnam",
"Russian Federation", "Russia",
"Dem. Rep. Congo", "Democratic Republic of the Congo",
"DR Congo", "Cote d'Ivoire", "Ivory Coast"
])
# Build a mapping
country_map = {
"usa": "united states",
"u.s.": "united states",
"united states of america": "united states",
"republic of korea": "south korea",
"korea, south": "south korea",
"viet nam": "vietnam",
"russian federation": "russia",
"dem. rep. congo": "democratic republic of the congo",
"dr congo": "democratic republic of the congo",
"cote d'ivoire": "ivory coast"
}
cleaned = (countries
.str.strip()
.str.lower()
.replace(country_map))
print(cleaned.value_counts())
united states 3
south korea 3
democratic republic of the congo 3
vietnam 2
ivory coast 2
russia 2
Name: count, dtype: int64
Fifteen variations collapsed into six clean country names.
10.17 Common Regex Patterns for Data Science
Here's a reference table of patterns you'll use again and again. You don't need to memorize these — bookmark this page and come back to it.
| What You Want to Match | Pattern | Example Matches |
|---|---|---|
| Integer | \d+ | "42", "12345" |
| Decimal number | \d+\.?\d* | "42", "3.14", "100.0" |
| Date (YYYY-MM-DD) | \d{4}-\d{2}-\d{2} | "2023-03-15" |
| Date (MM/DD/YYYY) | \d{1,2}/\d{1,2}/\d{4} | "3/15/2023", "12/01/2022" |
| Time (HH:MM) | \d{1,2}:\d{2} | "9:30", "14:05" |
| US phone number | \d{3}[-.\s]?\d{3}[-.\s]?\d{4} | "555-123-4567" |
| Email (simplified) | [\w.+-]+@[\w-]+\.[\w.]+ | "user@example.com" |
| US ZIP code | \d{5}(-\d{4})? | "90210", "90210-1234" |
| Leading/trailing spaces | ^\s+\|\s+$ | " hello " |
| Text in parentheses | \(([^)]+)\) | "(hello)" captures "hello" |
| Text in quotes | "([^"]+)" | '"hello"' captures 'hello' |
10.18 The re Module: Beyond findall
We've been using re.search, re.findall, and re.sub. Let's round out our knowledge of the re module with a few more useful features.
re.compile(): Precompiling Patterns
If you're going to use the same pattern many times, compile it first for better performance:
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
# Now use it multiple times
date_pattern.findall("Events on 2023-01-15 and 2023-02-20")
date_pattern.search("Next date: 2023-05-01")
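A compiled pattern isn't just for findall and search; it exposes split, sub, and the other re functions as methods too. A small sketch (the delimiter set here is our own example):

```python
import re

# One compiled pattern reused for splitting on several delimiters
sep = re.compile(r"\s*[;,/]\s*")

parts = sep.split("Pfizer; Moderna, Janssen / AstraZeneca")
# ['Pfizer', 'Moderna', 'Janssen', 'AstraZeneca']
```

Including \s* on both sides of the delimiter class means the surrounding whitespace is consumed by the split, so no .str.strip() pass is needed afterward.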
re.IGNORECASE: Case-Insensitive Matching
re.findall(r"pfizer", "Pfizer and PFIZER and pfizer",
re.IGNORECASE)
['Pfizer', 'PFIZER', 'pfizer']
In pandas, you can achieve this with the case=False parameter or the (?i) flag:
# These are equivalent
df["text"].str.contains("pfizer", case=False)
df["text"].str.contains(r"(?i)pfizer")
re.VERBOSE: Readable Regex with Comments
For complex patterns, the re.VERBOSE flag lets you add whitespace and comments:
phone_pattern = re.compile(r"""
(\d{3}) # area code
[-.\s]? # optional separator
(\d{3}) # first three digits
[-.\s]? # optional separator
(\d{4}) # last four digits
""", re.VERBOSE)
phone_pattern.findall("Call 555-123-4567 or 800.555.1234")
[('555', '123', '4567'), ('800', '555', '1234')]
This is much more readable than (\d{3})[-.\s]?(\d{3})[-.\s]?(\d{4}). Use re.VERBOSE for any pattern longer than about 20 characters.
10.19 Debugging Regex: When Your Pattern Doesn't Match
Regex debugging is a skill in itself. Here are strategies that work:
Strategy 1: Test with re.findall() on a Small String
Before applying a regex to a column of 50,000 values, test it on a single string:
test = "Pfizer-BioNTech (BNT162b2) 30mcg"
print(re.findall(r"\((\w+)\)", test))
# ['BNT162b2']
Strategy 2: Build Up Incrementally
Don't write the whole pattern at once. Start with the simplest version and add complexity:
text = "Date: 03/15/2023, Amount: $1,234.56"
# Step 1: Match any digits
re.findall(r"\d+", text)
# ['03', '15', '2023', '1', '234', '56']
# Step 2: Match date pattern
re.findall(r"\d{2}/\d{2}/\d{4}", text)
# ['03/15/2023']
# Step 3: Add capture groups
re.findall(r"(\d{2})/(\d{2})/(\d{4})", text)
# [('03', '15', '2023')]
Strategy 3: Use Online Regex Testers
Websites like regex101.com let you type a pattern and a test string, and they show you exactly what matches, which groups are captured, and why. They even explain each part of your pattern in plain English. These tools are invaluable for learning and debugging.
Strategy 4: Check for Common Mistakes
| Symptom | Common Cause |
|---|---|
| Pattern matches nothing | Forgot to escape special characters such as ., $, and ( |
| Pattern matches too much | Used .+ instead of .+? (greedy vs. lazy) |
| Pattern matches part of a word | Forgot word boundary \b |
| Pattern works on test string but not pandas column | Forgot na=False in str.contains |
| str.replace doesn't change anything | Need regex=True parameter |
10.20 Real-World Application: Text Data in Public Health
Text data challenges are everywhere in Elena's work:
Patient notes: "Pt reports fever x2 days post-vaccination" needs to be parsed for symptoms ("fever"), duration ("2 days"), and timing ("post-vaccination").
Survey responses: Free-text answers to "Why didn't you get vaccinated?" need to be categorized into themes (cost, access, trust, medical exemption).
Drug names: "Metformin HCl 500mg extended-release tablet" and "metformin hydrochloride 500 mg ER tab" are the same medication but look completely different to a computer.
Address matching: "123 Main St, Apt 4B" and "123 Main Street, #4B" need to be recognized as the same location.
In each case, the workflow is the same: standardize first (case, whitespace, punctuation), then use pattern matching to extract structure, then map variants to canonical forms. The tools you learned in this chapter — .str methods, regex, and capture groups — are the foundation for all of it.
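As a small taste of that workflow, here's a hedged sketch that pulls the duration out of shorthand notes like the first example above (the pattern and column names are our own illustration, not a clinical standard):

```python
import re
import pandas as pd

notes = pd.Series([
    "Pt reports fever x2 days post-vaccination",
    "Pt reports headache x1 day post-vaccination"
])

# "x<number> day(s)/week(s)" -> capture the number and the unit
extracted = notes.str.extract(r"x(\d+)\s*(day|week)s?", flags=re.IGNORECASE)
extracted.columns = ["duration", "unit"]
#   duration unit
# 0        2  day
# 1        1  day
```

Real clinical notes are far messier than this, but the shape of the solution (standardize, then extract with capture groups, then map variants) is exactly the one this chapter has been building.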
10.21 Spaced Review: Concepts from Chapters 1-9
Learning sticks when you revisit it. Take five minutes to answer these questions from earlier chapters — without looking back. If you can't answer one, that's a signal to review.
From Chapter 1: What are the three types of data science questions? Give an example of each using vaccination data.
From Chapter 3: What's the difference between = and == in Python? Why does this distinction matter?
From Chapter 5: What's the difference between a list and a dictionary? When would you use each?
From Chapter 7: What does df.head() show you? What about df.info()? Which gives you data types?
From Chapter 8: Name three types of missing data (MCAR, MAR, MNAR). Why does it matter which type you have?
From Chapter 9: What's the difference between merge() and concat() in pandas?
Chapter Summary
You started this chapter treating text as an opaque blob of characters. Now you can see the structure inside it.
The .str accessor gives you vectorized string operations — case conversion, stripping, splitting, replacing — that work on entire columns at once, handle missing values gracefully, and make text standardization efficient.
Regular expressions are a mini-language for describing patterns. You learned the building blocks: literal characters, special characters (\d, \w, \s, .), quantifiers (+, *, ?, {n}), character classes ([A-Z], [^0-9]), anchors (^, $, \b), alternation (|), and capture groups (()).
Capture groups combined with .str.extract() let you pull structured data out of unstructured text — extracting dates, codes, names, and numbers from messy free-text fields.
And you learned the most important lesson of all: use the simplest tool that works. String methods first. Regex when you need pattern matching. And always test on a small sample before running against your full dataset.
Text data is everywhere. The skills you built in this chapter will serve you in every data project you ever work on.
What's Next
In Chapter 11, you'll tackle another tricky data type: dates and times. You'll learn to parse date strings, work with time zones, resample time series data, and compute rolling averages — skills that are essential for any analysis involving trends over time. If you're working on the vaccination project, you'll parse date columns and compute rolling 7-day averages of vaccination rates.
This chapter covered text wrangling with pandas .str methods and regular expressions. In the exercises that follow, you'll practice these skills on realistic text data, from cleaning survey responses to extracting information from medical records. Take your time with regex — it rewards patient practice.