Learning Objectives
- Apply pandas string methods (.str.lower, .str.contains, .str.split, .str.replace) to clean and transform text columns
- Construct regular expressions for common data extraction tasks (dates, numbers, codes, emails) using re module and pandas .str.extract
- Extract structured information from unstructured text fields using capture groups
- Standardize messy categorical text data (inconsistent capitalization, abbreviations, misspellings)
- Evaluate when regex is the right tool versus simpler string methods, avoiding over-engineering
In This Chapter
- Chapter Overview
- 10.1 Why Text Data Is Different (And Why It Matters)
- 10.2 The .str Accessor: String Methods for Entire Columns
- 10.3 Searching Text with .str.contains()
- 10.4 Splitting Text with .str.split()
- 10.5 Replacing Text with .str.replace()
- 10.6 Introduction to Regular Expressions: A Mini-Language for Patterns
- 10.7 Quantifiers: How Many Times Should a Pattern Repeat?
- 10.8 Character Classes: Matching Categories of Characters
- 10.9 Capture Groups: Extracting the Parts You Care About
- 10.10 Threshold Concept: Regular Expressions as a Mini-Language for Describing Patterns
- 10.11 Putting It Together: Alternation, Anchors, and Escaping
- 10.12 Greedy vs. Lazy Matching
- 10.13 Text Normalization: A Systematic Approach
- 10.14 Tokenization: Breaking Text into Words
- 10.15 When to Use Regex vs. String Methods: A Decision Guide
- 10.16 Project Checkpoint: Extracting Vaccine Manufacturers from Messy Text
- 10.17 Common Regex Patterns for Data Science
- 10.18 The re Module: Beyond findall
- 10.19 Debugging Regex: When Your Pattern Doesn't Match
- 10.20 Real-World Application: Text Data in Public Health
- 10.21 Spaced Review: Concepts from Chapters 1-9
- Chapter Summary
Chapter 10: Working with Text Data — String Methods, Regular Expressions, and Extracting Meaning
"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." — Jamie Zawinski, early Netscape developer
Chapter Overview
Here's something nobody tells you in the first week of a data science course: a huge amount of real-world data is text. Not neat numbers in tidy columns. Text. Messy, inconsistent, riddled-with-typos text.
Survey responses where someone typed "new york" and someone else typed "New York City" and a third person typed "NYC." Medical records where a drug is spelled "Metformin," "metformin," "METFORMIN," and — alarmingly — "metforman." Product listings where the size is buried inside a sentence like "Available in 250mL and 500mL bottles." Vaccination records where the manufacturer field reads "Pfizer-BioNTech COVID-19 Vaccine" in one row and "PFIZER" in the next.
If you can't wrangle text, you can't wrangle most real data.
The good news is that pandas gives you a powerful set of tools for working with text — the .str accessor, which lets you apply string operations to an entire column at once. And for the really tricky patterns, there's an ancient and powerful tool called regular expressions (regex), which is essentially a mini-language for describing patterns in text.
Regular expressions have a fearsome reputation. That quote at the top of this chapter? It's one of the most famous jokes in programming. But here's the thing: regex earned that reputation because people try to learn it all at once, or use it for problems where a simple string method would suffice. We're going to learn it gradually, starting with problems where regex is genuinely the right tool, and we're going to learn when not to use it just as carefully as we learn how to use it.
In this chapter, you will learn to:
- Apply pandas string methods (.str.lower, .str.contains, .str.split, .str.replace) to clean and transform text columns
- Construct regular expressions for common data extraction tasks (dates, numbers, codes) using the re module and pandas .str.extract
- Extract structured information from unstructured text fields using capture groups
- Standardize messy categorical text data (inconsistent capitalization, abbreviations, misspellings)
- Evaluate when regex is the right tool versus simpler string methods, avoiding over-engineering
10.1 Why Text Data Is Different (And Why It Matters)
Let's start with a question: why can't we just treat text like any other column?
Try this in your notebook:
import pandas as pd
countries = pd.Series(["United States", "united states",
"USA", "U.S.A.", "US"])
countries.nunique()
5
Five "unique" values — but they all refer to the same country. If you tried to group vaccination data by country using this column, you'd get five separate groups for the United States. Your analysis would be silently, catastrophically wrong.
This is the fundamental challenge of text data: computers see text as sequences of characters, not as meaning. To a computer, "USA" and "United States" have nothing in common. They share zero characters in the same positions. Getting a computer to understand that these refer to the same entity requires you to explicitly tell it — through cleaning, standardization, and pattern matching.
The Three Pillars of Text Wrangling
Every text data problem falls into one of three categories:
- Standardization — Making equivalent values look the same ("NYC" and "New York City" should become a single value)
- Extraction — Pulling structured information out of unstructured text (getting the number "250" out of "250mL bottle")
- Searching — Finding rows that match certain patterns (all entries containing a phone number)
Pandas string methods handle most standardization tasks. Regular expressions are the power tool for extraction and complex searching. Knowing which pillar your problem falls into tells you which tool to reach for.
A Quick Refresher: Python String Methods
Before we dive into pandas, let's remember that Python strings already have useful methods. You met some of these back in Chapter 3:
name = " Elena Rodriguez "
name.strip() # 'Elena Rodriguez'
name.lower() # ' elena rodriguez '
name.upper() # ' ELENA RODRIGUEZ '
name.replace("e", "X") # ' XlXna RodriguXz '
"Elena" in name # True
name.startswith(" ") # True
name.split() # ['Elena', 'Rodriguez']
These methods work great on individual strings. But what about a column of 50,000 strings? You could write a loop:
# This works, but it's slow and clunky
cleaned = []
for country in df["country"]:
cleaned.append(country.strip().lower())
df["country_clean"] = cleaned
That loop processes one string at a time. It's slow on large datasets and it's verbose. Pandas has a better way.
Check Your Understanding
- Why does "USA" == "United States" evaluate to False in Python?
- If a column has values "male", "Male", "MALE", and "M", how many unique values would pandas count?
- What Python string method would you use to remove leading and trailing spaces from a name?
10.2 The .str Accessor: String Methods for Entire Columns
The .str accessor is one of pandas' most practical features. It lets you call string methods on every value in a Series at once — no loop required.
Your First .str Operations
countries = pd.Series(["United States", " Brazil ",
"GERMANY", "united kingdom"])
countries.str.lower()
0 united states
1 brazil
2 germany
3 united kingdom
dtype: object
countries.str.strip()
0 United States
1 Brazil
2 GERMANY
3 united kingdom
dtype: object
countries.str.upper()
0 UNITED STATES
1 BRAZIL
2 GERMANY
3 UNITED KINGDOM
dtype: object
You can chain them, just like regular Python string methods:
countries.str.strip().str.lower()
0 united states
1 brazil
2 germany
3 united kingdom
dtype: object
That single line does what our four-line loop did before. Every value gets stripped of whitespace, then converted to lowercase. No loop needed.
How .str Handles Missing Values
Here's something important: real data has missing values, and the .str accessor handles them gracefully.
messy = pd.Series(["Pfizer", None, "MODERNA", " janssen "])
messy.str.lower().str.strip()
0 pfizer
1 None
2 moderna
3 janssen
dtype: object
The None stays as None (technically NaN) instead of crashing with an error. If you tried to call .lower() on None in regular Python, you'd get an AttributeError. The .str accessor just skips missing values. This is exactly the behavior you want when cleaning a messy column.
The Essential .str Methods
Here's your toolkit. You don't need to memorize all of these right now — but knowing they exist means you'll recognize when to use them.
Case conversion:
s.str.lower() # all lowercase
s.str.upper() # ALL UPPERCASE
s.str.title() # Title Case
s.str.capitalize() # First letter capitalized
Whitespace handling:
s.str.strip() # remove leading/trailing whitespace
s.str.lstrip() # remove leading whitespace only
s.str.rstrip() # remove trailing whitespace only
Searching:
s.str.contains("pattern") # True/False for each row
s.str.startswith("prefix") # True/False
s.str.endswith("suffix") # True/False
s.str.find("substring") # position of first match (-1 if none)
Replacing:
s.str.replace("old", "new") # replace substring
Splitting and joining:
s.str.split(",") # split each value into a list
s.str.split(",", expand=True) # split into separate columns
s.str.cat(sep=", ") # join all values into one string
Length and slicing:
s.str.len() # length of each string
s.str[0] # first character
s.str[:3] # first three characters
s.str[-4:] # last four characters
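Slicing is especially handy when values follow a fixed-width layout. A minimal sketch, using hypothetical record IDs (the ID format here is illustrative, not from the chapter's dataset):

```python
import pandas as pd

# Hypothetical vaccination-record IDs with a fixed-width layout
ids = pd.Series(["VAX-2023-0001", "VAX-2023-0002", "VAX-2023-0003"])

prefixes = ids.str[:3]      # program prefix: 'VAX'
sequence = ids.str[-4:]     # sequence number: '0001', '0002', '0003'
lengths = ids.str.len()     # every ID should be 13 characters: a quick sanity check
```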
Let's see some of these in action with a realistic example.
Practical Example: Cleaning Country Names
Elena's vaccination dataset has a column called country that's a mess. Let's clean it step by step:
df = pd.DataFrame({
"country": [" United States ", "united states",
"USA", "U.S.A.", "Brazil", "BRAZIL",
"brazil ", "United Kingdom", "UK",
"Côte d'Ivoire", None, "germany"]
})
# Step 1: Strip whitespace and standardize case
df["clean"] = df["country"].str.strip().str.lower()
print(df["clean"].unique())
['united states' 'usa' 'u.s.a.' 'brazil' 'united kingdom'
'uk' nan "côte d'ivoire" 'germany']
That collapsed 12 entries down to 8 distinct values (plus a missing value) — but "united states," "usa," and "u.s.a." are still separate. For those, we need to map abbreviations to standard names:
# Step 2: Replace known abbreviations
replacements = {
"usa": "united states",
"u.s.a.": "united states",
"uk": "united kingdom"
}
df["clean"] = df["clean"].replace(replacements)
print(df["clean"].unique())
['united states' 'brazil' 'united kingdom' nan
"côte d'ivoire" 'germany']
Down to 5 unique countries (plus one missing value). That's clean.
Notice the workflow: we used .str methods for general standardization (strip, lower), then used .replace() with a dictionary for specific mappings. This two-step approach — general standardization first, then specific fixes — is the standard recipe for cleaning categorical text.
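That two-step recipe is worth capturing as a small helper you can reuse across columns. A minimal sketch (the function name and the mapping are illustrative, not from the chapter):

```python
import pandas as pd

def clean_categories(s, mapping):
    """Standardize a text column: strip whitespace, lowercase, then map known variants."""
    return s.str.strip().str.lower().replace(mapping)

raw = pd.Series([" USA ", "Brazil", "u.s.a."])
cleaned = clean_categories(
    raw, {"usa": "united states", "u.s.a.": "united states"}
)
```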
Check Your Understanding
- What's the difference between s.str.strip() and s.str.replace(" ", "")?
- Why do we call .str.strip() before .str.lower() rather than the other way around? (Hint: does the order actually matter here?)
- What would happen if we used df["country"].lower() instead of df["country"].str.lower()?
10.3 Searching Text with .str.contains()
One of the most common text operations is searching: finding all rows where a column contains a certain word or pattern.
products = pd.Series([
"Pfizer-BioNTech COVID-19 Vaccine",
"Moderna COVID-19 Vaccine",
"Janssen (Johnson & Johnson) Vaccine",
"AstraZeneca COVID-19 Vaccine",
"Sinovac COVID-19 Vaccine",
"Flu Vaccine (seasonal)",
"MMR Vaccine"
])
# Find all COVID-19 vaccines
products.str.contains("COVID-19")
0 True
1 True
2 False
3 True
4 True
5 False
6 False
dtype: bool
This returns a Boolean Series — perfect for filtering a DataFrame:
df = pd.DataFrame({"product": products, "doses": [100, 80, 50, 70, 60, 40, 30]})
covid_vaccines = df[df["product"].str.contains("COVID-19")]
Case-Insensitive Searching
What if the data has inconsistent capitalization?
notes = pd.Series(["Patient received Pfizer vaccine",
"PFIZER administered",
"pfizer booster given",
"Moderna first dose"])
notes.str.contains("Pfizer")
0 True
1 False
2 False
3 False
dtype: bool
Only the first row matched because str.contains is case-sensitive by default. Fix it with case=False:
notes.str.contains("Pfizer", case=False)
0 True
1 True
2 True
3 False
dtype: bool
Now all three Pfizer entries match, regardless of capitalization.
Handling NaN in .str.contains()
If your column has missing values, str.contains will return NaN for those rows by default, which can cause trouble when you try to use the result as a filter:
messy = pd.Series(["Pfizer vaccine", None, "Moderna vaccine"])
messy.str.contains("Pfizer")
0 True
1 NaN
2 False
dtype: object
That NaN will cause an error if you try to use it as a boolean mask. Use na=False to treat missing values as non-matches:
messy.str.contains("Pfizer", na=False)
0 True
1 False
2 False
dtype: bool
This is one of those small details that will save you twenty minutes of debugging. Make na=False a habit.
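In practice, case=False and na=False usually travel together. A minimal sketch with a hypothetical notes column:

```python
import pandas as pd

df = pd.DataFrame({"note": ["Pfizer vaccine", None, "PFIZER booster", "Moderna dose"]})

# Case-insensitive search that treats missing notes as non-matches
mask = df["note"].str.contains("pfizer", case=False, na=False)
pfizer_rows = df[mask]   # rows 0 and 2
```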
10.4 Splitting Text with .str.split()
Sometimes useful information is packed into a single column, separated by a delimiter. str.split breaks it apart.
locations = pd.Series([
"New York, NY",
"Los Angeles, CA",
"Chicago, IL",
"Houston, TX"
])
# Split on comma
locations.str.split(", ")
0 [New York, NY]
1 [Los Angeles, CA]
2 [Chicago, IL]
3 [Houston, TX]
dtype: object
Each value becomes a list. That's useful, but usually you want separate columns. Use expand=True:
locations.str.split(", ", expand=True)
0 1
0 New York NY
1 Los Angeles CA
2 Chicago IL
3 Houston TX
Now you have two columns. You can assign them to your DataFrame:
df = pd.DataFrame({"location": locations})
df[["city", "state"]] = df["location"].str.split(", ", expand=True)
Splitting with a Limit
Sometimes you only want to split on the first occurrence:
entries = pd.Series([
"Smith, John, MD",
"Garcia, Maria, PhD",
"Chen, Wei, DDS"
])
# Split into exactly 2 parts (name, rest)
entries.str.split(", ", n=1, expand=True)
0 1
0 Smith John, MD
1 Garcia Maria, PhD
2 Chen Wei, DDS
The n=1 parameter means "split at most once," giving you two columns. The second column contains everything after the first comma.
Getting a Specific Part After Splitting
If you just need one part, use .str.get() or indexing:
locations.str.split(", ").str[0] # city names
locations.str.split(", ").str[1] # state codes
0 New York
1 Los Angeles
2 Chicago
3 Houston
dtype: object
Check Your Understanding
- What's the difference between str.split(",") and str.split(", ")? When does it matter?
- If a value is "Red, Green, Blue" and you use str.split(", ", n=1), what will the result be?
- Why might expand=True be more useful than the default behavior?
10.5 Replacing Text with .str.replace()
The .str.replace() method substitutes one substring for another across an entire column.
vaccines = pd.Series([
"Pfizer-BioNTech COVID-19 Vaccine",
"Moderna COVID-19 Vaccine (mRNA-1273)",
"Johnson & Johnson (Janssen) Vaccine"
])
# Remove "COVID-19" from all entries
vaccines.str.replace("COVID-19 ", "")
0 Pfizer-BioNTech Vaccine
1 Moderna Vaccine (mRNA-1273)
2 Johnson & Johnson (Janssen) Vaccine
dtype: object
You can chain replacements:
# Standardize to just manufacturer names
(vaccines
.str.replace("COVID-19 ", "")
.str.replace("Vaccine", "")
.str.replace(r"\(.*?\)", "", regex=True)
.str.strip())
0 Pfizer-BioNTech
1 Moderna
2 Johnson & Johnson
dtype: object
Wait — what was that regex=True about? That's a sneak preview of what's coming next. The pattern \(.*?\) is a regular expression that matches anything inside parentheses. We'll learn exactly how it works in the next section.
Important note: the default behavior of str.replace changed across pandas versions. Before pandas 2.0, the pattern was treated as a regular expression by default; in pandas 2.0 and later, the default is a literal replacement (regex=False). To be safe and explicit, always pass regex=False when you mean a literal replacement, and regex=True when you mean a pattern:
# Literal replacement — no regex
s.str.replace(".", "", regex=False)
# Pattern replacement — uses regex
s.str.replace(r"\d+", "NUM", regex=True)
10.6 Introduction to Regular Expressions: A Mini-Language for Patterns
This is the big one. Regular expressions — universally abbreviated as regex — are a pattern-matching language that has been part of computing since the 1960s. They work in Python, JavaScript, Java, Ruby, SQL, grep, sed, and dozens of other tools. Learning regex once means you can use it everywhere.
But regex has a reputation for being cryptic. A pattern like ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$ is, honestly, not pretty. It matches email addresses, but reading it feels like decoding ancient glyphs.
Here's our approach: we're going to start small. Really small. And we're going to build up piece by piece, always with practical examples, always asking "why would I need this?"
What Is a Regular Expression?
A regular expression is a string that describes a pattern of characters. You use it to say things like:
- "Find any string that starts with a digit"
- "Find any string that contains a phone number"
- "Extract the part of this string that looks like a date"
Think of regex as a way to describe the shape of text without knowing the exact text.
Literal Characters: The Simplest Patterns
The simplest regex is just a string of normal characters:
import re
text = "The vaccine was administered on 2023-03-15"
re.search("vaccine", text)
<re.Match object; span=(4, 11), match='vaccine'>
The pattern "vaccine" matches the literal word "vaccine." This is no different from using in or str.contains. But regex can do much more.
The re Module Basics
Python's re module has four essential functions:
import re
text = "Patient ID: 12345, Date: 2023-03-15"
# Does the pattern appear anywhere in the text?
re.search(r"\d+", text) # finds '12345'
# Does the text start with this pattern?
re.match(r"Patient", text) # matches 'Patient'
# Find ALL occurrences
re.findall(r"\d+", text) # ['12345', '2023', '03', '15']
# Replace pattern with something else
re.sub(r"\d+", "X", text) # 'Patient ID: X, Date: X-X-X'
Notice the r before each pattern string. This is a raw string — it tells Python not to interpret backslashes as escape sequences. Always use raw strings for regex patterns. Always. It will save you from mysterious bugs.
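To see why raw strings matter, compare what happens when the backslash is and isn't protected. The classic trap is \b: in a normal Python string it's a backspace character, not a regex word boundary:

```python
import re

text = "word boundary"

# Without r"", Python turns \b into a backspace character before re ever sees it
no_match = re.findall("\bword\b", text)    # finds nothing
# With a raw string, \b reaches the regex engine as a word-boundary anchor
match = re.findall(r"\bword\b", text)      # finds 'word'
```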
Your First Special Characters: The Dot and the Backslash-d
Now let's learn two special characters that make regex more than just literal matching.
The dot (.) matches any single character (except a newline):
re.findall(r"c.t", "cat cot cut chart coat")
['cat', 'cot', 'cut']
The pattern c.t means "a c, then any character, then a t." It matches "cat," "cot," and "cut" but not "chart" (too many characters between c and t) or "coat" (same reason).
The \d matches any digit (0-9):
re.findall(r"\d", "Room 42, Floor 3")
['4', '2', '3']
Each \d matches one digit. But what if you want to match a multi-digit number?
10.7 Quantifiers: How Many Times Should a Pattern Repeat?
Quantifiers tell regex how many times the preceding character or group should appear.
| Quantifier | Meaning | Example | Matches |
|---|---|---|---|
| + | One or more | \d+ | "1", "42", "12345" |
| * | Zero or more | \d* | "", "1", "42" |
| ? | Zero or one | \d? | "", "5" |
| {n} | Exactly n | \d{3} | "123", "456" |
| {n,m} | Between n and m | \d{2,4} | "12", "123", "1234" |
The most useful quantifier for data work is + (one or more):
text = "Patient ID: 12345, Date: 2023-03-15"
# \d+ means "one or more digits"
re.findall(r"\d+", text)
['12345', '2023', '03', '15']
Compare this to \d alone, which would give you individual digits. The + says "keep matching digits until you hit something that isn't a digit."
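The bounded quantifiers from the table are handy for fixed-width fields like years. A quick sketch:

```python
import re

date = "2023-03-15"

years = re.findall(r"\d{4}", date)   # only the 4-digit run qualifies
pairs = re.findall(r"\d{2}", date)   # every 2-digit run, left to right
```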
Combining Dots and Quantifiers
# .+ means "one or more of any character"
re.search(r"ID: .+,", "Patient ID: 12345, Date: 2023, verified")
<re.Match object; span=(8, 30), match='ID: 12345, Date: 2023,'>
Hmm, that matched more than expected — all the way to the last comma, not the first one. The .+ is greedy — it matches as much as possible. We'll learn about greedy versus lazy matching later in this chapter.
A Practical Example: Extracting Numbers from Text
Suppose you have a column of medication dosages written in free text:
dosages = pd.Series([
"Take 500mg twice daily",
"Apply 2.5mL topically",
"Inject 10 units subcutaneously",
"250 mcg nasal spray"
])
# Extract the first number from each entry
dosages.str.extract(r"(\d+\.?\d*)")
0
0 500
1 2.5
2 10
3 250
Wait — there's a new method here. Let's talk about .str.extract() and capture groups.
Check Your Understanding
- What's the difference between \d and \d+?
- What would re.findall(r"\d{4}", "Phone: 555-1234, Ext 42") return?
- Why do we use r"..." (raw strings) for regex patterns?
10.8 Character Classes: Matching Categories of Characters
Sometimes you need to match a specific set of characters, not just "any character" (.) or "any digit" (\d). That's what character classes are for.
Built-in Character Classes
| Pattern | Matches | Equivalent To |
|---|---|---|
| \d | Any digit | [0-9] |
| \D | Any non-digit | [^0-9] |
| \w | Any "word" character | [A-Za-z0-9_] |
| \W | Any non-word character | [^A-Za-z0-9_] |
| \s | Any whitespace | [ \t\n\r\f] |
| \S | Any non-whitespace | [^ \t\n\r\f] |
The uppercase versions are always the opposite of the lowercase ones. \d matches digits; \D matches everything that isn't a digit.
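A quick sketch contrasting \w+ (runs of word characters) with \S+ (runs of anything that isn't whitespace):

```python
import re

text = "COVID-19 (mRNA) vaccine!"

words = re.findall(r"\w+", text)     # punctuation splits the runs
chunks = re.findall(r"\S+", text)    # punctuation stays attached
```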
Custom Character Classes with Square Brackets
Square brackets let you define your own set of allowed characters:
# Match any vowel
re.findall(r"[aeiou]", "hello world")
['e', 'o', 'o']
# Match any character that's a letter or a hyphen
re.findall(r"[A-Za-z-]+", "Pfizer-BioNTech COVID-19")
['Pfizer-BioNTech', 'COVID-']
You can use ranges inside brackets: [A-Z] means any uppercase letter, [a-z] any lowercase letter, [0-9] any digit.
Negated Character Classes
A caret (^) at the start of a character class means "anything except these characters":
# Match anything that's not a digit or hyphen
re.findall(r"[^0-9-]+", "2023-03-15 vaccine administered")
[' vaccine administered']
A Practical Example: Validating Format Codes
Suppose your data has country codes that should be exactly two uppercase letters:
codes = pd.Series(["US", "uk", "FR", "123", "DE", "X", "BR"])
# Check which ones match the pattern: exactly 2 uppercase letters
codes.str.match(r"^[A-Z]{2}$")
0 True
1 False
2 True
3 False
4 True
5 False
6 True
dtype: bool
Let's unpack that pattern: ^[A-Z]{2}$
- ^ — start of string
- [A-Z] — one uppercase letter
- {2} — exactly two times
- $ — end of string
Together: "The entire string must be exactly two uppercase letters." The anchors ^ and $ are important — without them, the pattern would match any string that contains two uppercase letters, even "123AB456".
10.9 Capture Groups: Extracting the Parts You Care About
Here's where regex becomes truly powerful for data science. Capture groups let you extract specific parts of a match, not just find whether a pattern exists.
A capture group is created by wrapping part of a pattern in parentheses:
text = "Date: 2023-03-15"
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
match.group(0) # '2023-03-15' (entire match)
match.group(1) # '2023' (first group)
match.group(2) # '03' (second group)
match.group(3) # '15' (third group)
The pattern (\d{4})-(\d{2})-(\d{2}) says "match four digits, a hyphen, two digits, a hyphen, two digits." The parentheses mark three capture groups: year, month, and day.
Using Capture Groups with pandas .str.extract()
This is where capture groups shine in data science. The .str.extract() method pulls out the captured groups into separate DataFrame columns:
dates = pd.Series([
"Administered on 2023-03-15",
"Scheduled for 2023-04-20",
"Completed 2023-01-10",
"No date recorded"
])
dates.str.extract(r"(\d{4})-(\d{2})-(\d{2})")
0 1 2
0 2023 03 15
1 2023 04 20
2 2023 01 10
3 NaN NaN NaN
Each capture group becomes a column. Rows that don't match get NaN. You can name the groups for cleaner output:
dates.str.extract(
r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
)
year month day
0 2023 03 15
1 2023 04 20
2 2023 01 10
3 NaN NaN NaN
The (?P<name>...) syntax gives each capture group a name, which becomes the column header.
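Extracted dates are still strings. A common follow-up (a sketch, reusing the dates Series from above) is to capture the whole date and hand it to pd.to_datetime, which turns the non-matching row's NaN into NaT:

```python
import pandas as pd

dates = pd.Series([
    "Administered on 2023-03-15",
    "Scheduled for 2023-04-20",
    "Completed 2023-01-10",
    "No date recorded"
])

# Capture the full date, then parse it; the non-matching row becomes NaT
full = dates.str.extract(r"(\d{4}-\d{2}-\d{2})")[0]
parsed = pd.to_datetime(full, format="%Y-%m-%d")
```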
Extracting Multiple Matches with .str.extractall()
If each string might contain multiple matches, use .str.extractall():
notes = pd.Series([
"Received Pfizer on 2023-01-15, booster 2023-07-20",
"Moderna 2023-03-10",
"No vaccination record"
])
notes.str.extractall(r"(\d{4}-\d{2}-\d{2})")
0
match
0 0 2023-01-15
1 2023-07-20
1 0 2023-03-10
The result has a multi-index: the original row number and a match number within each row.
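Because extractall indexes its results by original row, a group-by on that index level answers "how many matches did each row have?" (a sketch using the same notes data):

```python
import pandas as pd

notes = pd.Series([
    "Received Pfizer on 2023-01-15, booster 2023-07-20",
    "Moderna 2023-03-10",
    "No vaccination record"
])

date_matches = notes.str.extractall(r"(\d{4}-\d{2}-\d{2})")
# Level 0 of the multi-index is the original row number
counts = date_matches.groupby(level=0).size()   # row 0 has 2 dates, row 1 has 1
```

Note that row 2, which had no matches, simply doesn't appear in the result.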
A Practical Example: Parsing Vaccine Entries
Elena's dataset has a column with vaccine descriptions that follow no consistent format:
vaccines = pd.Series([
"Pfizer-BioNTech (BNT162b2) 30mcg",
"Moderna (mRNA-1273) 100mcg",
"AstraZeneca (AZD1222) 0.5mL",
"Johnson & Johnson single dose",
"Sinovac (CoronaVac) 600SU"
])
# Extract manufacturer and code (if present)
vaccines.str.extract(
r"^(?P<manufacturer>[A-Za-z&\s-]+?)\s*\((?P<code>[^)]+)\)"
)
manufacturer code
0 Pfizer-BioNTech BNT162b2
1 Moderna mRNA-1273
2 AstraZeneca AZD1222
3 NaN NaN
4 Sinovac CoronaVac
Row 3 (Johnson & Johnson) didn't match because it has no code in parentheses. That's fine — NaN tells us which entries need different handling.
Let's break down that pattern piece by piece:
- ^ — start of string
- (?P<manufacturer>[A-Za-z&\s-]+?) — capture group named "manufacturer": one or more letters, ampersands, spaces, or hyphens (lazy match)
- \s* — zero or more spaces
- \( — a literal opening parenthesis (escaped because ( is special in regex)
- (?P<code>[^)]+) — capture group named "code": one or more characters that aren't a closing parenthesis
- \) — a literal closing parenthesis
Check Your Understanding
- What's the difference between re.search(r"\d+", text) and re.findall(r"\d+", text)?
- In the pattern (\d{4})-(\d{2})-(\d{2}), how many capture groups are there?
- What does str.extract() return when a row doesn't match the pattern?
10.10 Threshold Concept: Regular Expressions as a Mini-Language for Describing Patterns
Threshold Concept Alert: This is a concept that, once you truly grasp it, fundamentally changes how you see text data. It may feel uncomfortable at first.
Here's the mental shift: a regular expression is not a string. It's a program.
When you write r"\d{4}-\d{2}-\d{2}", you're not writing a string that somehow matches dates. You're writing instructions in a mini programming language. Those instructions say:
- Match a digit. Do this four times.
- Match a literal hyphen.
- Match a digit. Do this two times.
- Match a literal hyphen.
- Match a digit. Do this two times.
This language has its own vocabulary (\d, \w, \s), its own control structures (quantifiers, alternation), its own grouping mechanism (parentheses), and its own anchoring system (^, $). It's a language within a language.
Why does this matter? Because once you see regex as a language for describing patterns, three things change:
First, you stop trying to memorize patterns and start composing them. You don't memorize "the regex for a phone number." Instead, you think: "A phone number is three digits, a separator, three digits, a separator, four digits" — and you compose the pattern: \d{3}[-.\s]\d{3}[-.\s]\d{4}.
Second, you start seeing text as having structure. That messy free-text field? It's not chaos. It has patterns. Dates follow patterns. Product codes follow patterns. Addresses follow patterns. Regex is the tool for describing those patterns to a computer.
Third, you understand why regex can be both powerful and dangerous. A programming language that lets you express complex ideas in a few characters is powerful. But dense code is hard to read, debug, and maintain. A 50-character regex that nobody can understand is worse than five lines of clear Python that do the same thing.
This is why experienced data scientists follow a rule: use the simplest tool that works. If str.lower() solves your problem, don't use regex. If str.replace("old", "new") works, don't use regex. Save regex for the problems where you genuinely need pattern matching — extracting variable-format data, validating complex formats, finding patterns that can't be described by a literal string.
Before the threshold: "Regex is a weird way to search for text." After the threshold: "Regex is a language for describing the structure of text, and I can compose patterns from simple building blocks."
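To make the "compose, don't memorize" point concrete, here is the phone-number pattern from above assembled from named pieces (a minimal sketch):

```python
import re

# Each piece of the pattern has a name and a meaning
separator = r"[-.\s]"    # hyphen, dot, or whitespace
phone = rf"\d{{3}}{separator}\d{{3}}{separator}\d{{4}}"

found = re.findall(phone, "Call 555-123-4567 or 555.987.6543")
```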
Threshold Check
- Explain in your own words why regex is described as a "mini-language" rather than just a search tool.
- Given a string like "Invoice #2023-0042, Amount: $1,234.56", describe in plain English the structure you see — then sketch a regex pattern for each piece.
- A colleague wrote the regex ^[A-Z]{2}\d{6}[A-Z]$ and says "it matches passport numbers." Without running it, describe in words what strings this pattern will match.
10.11 Putting It Together: Alternation, Anchors, and Escaping
Let's round out our regex toolkit with three more concepts.
Alternation: Matching One Thing OR Another
The pipe character | means "or":
re.findall(r"cat|dog", "I have a cat and a dog and a catfish")
['cat', 'dog', 'cat']
Use parentheses to limit the scope of alternation. (Note the ?: below — it makes the group non-capturing, so findall returns the whole match instead of just the captured letter.)
# Alternation: matches "gray" or "grey"
re.findall(r"gray|grey", "The gray cat and grey dog")
# Parentheses limit the alternation to a single letter
re.findall(r"gr(?:a|e)y", "The gray cat and grey dog")
# A character class does the same thing even more concisely
re.findall(r"gr[ae]y", "The gray cat and grey dog")
In data science, alternation is great for matching multiple variants:
vaccines = pd.Series([
"Pfizer vaccine",
"Moderna shot",
"J&J jab",
"AstraZeneca injection"
])
# Find any entry mentioning a vaccine (different words)
vaccines.str.contains(
r"vaccine|shot|jab|injection", case=False
)
0 True
1 True
2 True
3 True
dtype: bool
Anchors: Matching Position, Not Characters
Anchors don't match characters — they match positions in the string.
| Anchor | Matches |
|---|---|
| ^ | Start of string |
| $ | End of string |
| \b | Word boundary |
# Without anchors: matches "cat" anywhere
re.findall(r"cat", "The cat caught a catfish")
# ['cat', 'cat', 'cat']
# With word boundary: matches "cat" as a whole word
re.findall(r"\bcat\b", "The cat caught a catfish")
# ['cat']
Word boundaries (\b) are extremely useful in data science for matching whole words without accidentally matching substrings:
# Searching for the state "IN" (Indiana)
states = pd.Series(["IN", "INDIANA", "MISSING", "IN PROGRESS"])
# Bad: matches "IN" inside other words
states.str.contains("IN")
# All True!
# Better: match "IN" as a complete word
states.str.contains(r"\bIN\b")
# True, False, False, True
Hmm, that last one ("IN PROGRESS") still matched. If you want to match only strings that are exactly "IN":
states.str.match(r"^IN$")
# True, False, False, False
Escaping Special Characters
In regex, characters like ., *, +, ?, (, ), [, ], {, }, ^, $, |, and \ have special meanings. If you want to match them literally, you need to escape them with a backslash:
# The dot matches any character
re.findall(r"3.14", "3.14 and 3X14")
# ['3.14', '3X14']
# Escaped dot matches only a literal dot
re.findall(r"3\.14", "3.14 and 3X14")
# ['3.14']
This comes up constantly with data that contains periods, dollar signs, parentheses, and other punctuation:
# Match a price like "$12.99"
re.findall(r"\$\d+\.\d{2}", "Total: $12.99 plus $3.50 shipping")
# ['$12.99', '$3.50']
The pattern \$\d+\.\d{2} means: a literal dollar sign, one or more digits, a literal dot, exactly two digits.
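When the literal text you want to match comes from data rather than from your own typing, re.escape adds the backslashes for you:

```python
import re

price = "$12.99"                 # literal text, full of regex special characters
pattern = re.escape(price)       # every special character gets escaped
found = re.findall(pattern, "Total: $12.99 plus $3.50 shipping")
```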
10.12 Greedy vs. Lazy Matching
This is a subtlety that trips up even experienced regex users.
By default, quantifiers are greedy — they match as much as possible:
text = "<b>bold</b> and <i>italic</i>"
re.findall(r"<.+>", text)
['<b>bold</b> and <i>italic</i>']
The .+ matched everything from the first < to the last >, swallowing all the text in between. That's not what we wanted.
Adding a ? after a quantifier makes it lazy — it matches as little as possible:
re.findall(r"<.+?>", text)
['<b>', '</b>', '<i>', '</i>']
Now .+? matches just enough to reach the nearest >.
In data science, you'll encounter this when extracting text between delimiters:
notes = pd.Series([
"Diagnosis: (Type 2 Diabetes) Treatment: (Metformin)",
"Notes: (Patient stable) Follow-up: (2 weeks)"
])
# Greedy: captures everything between first ( and last )
notes.str.extract(r"\((.+)\)")
# 0 Type 2 Diabetes) Treatment: (Metformin
# 1 Patient stable) Follow-up: (2 weeks
# Lazy: captures just the first parenthesized group
notes.str.extract(r"\((.+?)\)")
# 0 Type 2 Diabetes
# 1 Patient stable
The lazy version stops at the first closing parenthesis, which is almost always what you want.
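When you need every parenthesized chunk rather than just the first, `.str.findall()` applies the same lazy pattern repeatedly within each row. A short sketch on the same data:

```python
import pandas as pd

notes = pd.Series([
    "Diagnosis: (Type 2 Diabetes) Treatment: (Metformin)",
    "Notes: (Patient stable) Follow-up: (2 weeks)"
])

# findall with a lazy group returns every parenthesized chunk per row
all_groups = notes.str.findall(r"\((.+?)\)")
# 0    [Type 2 Diabetes, Metformin]
# 1       [Patient stable, 2 weeks]
```

Unlike `.str.extract()`, which returns one column per capture group, `.str.findall()` gives you a list per row, which you can then explode or index into.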
Check Your Understanding
- What's the difference between re.findall(r"<.+>", html) and re.findall(r"<.+?>", html)?
- Why does \$ match a literal dollar sign instead of "end of string"?
- How would you use \b to search for the word "data" without matching "database" or "metadata"?
10.13 Text Normalization: A Systematic Approach
Now that you have both string methods and regex in your toolkit, let's talk about a complete workflow for text normalization — the process of making equivalent text values consistent.
The Normalization Pipeline
Here's the workflow Elena uses in her public health work:
Step 1: Strip whitespace .str.strip()
Step 2: Standardize case .str.lower() or .str.upper()
Step 3: Remove punctuation .str.replace(r"[^\w\s]", "", regex=True)
Step 4: Collapse whitespace .str.replace(r"\s+", " ", regex=True)
Step 5: Map known variants .replace(mapping_dict)
Step 6: Verify results .value_counts()
Let's apply this to a messy column of country names:
df = pd.DataFrame({"country": [
" United States ", "united states", "USA",
"U.S.A.", "US", " U.S. ", "Brazil", "BRAZIL",
" brazil", "United Kingdom", "U.K.", "UK",
None, "germany", "Côte d'Ivoire"
]})
# Steps 1-2: Strip and lowercase
df["clean"] = df["country"].str.strip().str.lower()
# Step 3: Remove periods only (a blanket punctuation strip
# would also delete the apostrophe in Côte d'Ivoire)
df["clean"] = df["clean"].str.replace(".", "", regex=False)
# Step 4: Collapse multiple spaces
df["clean"] = df["clean"].str.replace(r"\s+", " ", regex=True)
# Step 5: Map abbreviations to standard names
mapping = {
"usa": "united states",
"us": "united states",
"uk": "united kingdom",
}
df["clean"] = df["clean"].replace(mapping)
# Step 6: Check results
print(df["clean"].value_counts())
united states 6
brazil 3
united kingdom 3
germany 1
côte d'ivoire 1
Name: count, dtype: int64
Fourteen messy entries collapsed into five clean values (plus one NaN). That's text normalization at work.
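The six steps collapse naturally into a reusable helper. Here's a sketch (the function name normalize_text is our own choice; adapt the punctuation class to your data):

```python
import pandas as pd

def normalize_text(s, mapping=None):
    """Steps 1-5 of the normalization pipeline; step 6 (verify) stays manual."""
    out = (s.str.strip()                                # 1: strip whitespace
            .str.lower()                                # 2: standardize case
            .str.replace(r"[^\w\s']", "", regex=True)   # 3: drop punctuation, keep apostrophes
            .str.replace(r"\s+", " ", regex=True))      # 4: collapse whitespace
    if mapping:
        out = out.replace(mapping)                      # 5: map known variants
    return out

clean = normalize_text(pd.Series([" USA ", "U.S.A.", "united   states"]),
                       {"usa": "united states"})
# All three collapse to "united states"
```

Keeping the pipeline in one function means every text column in a project gets cleaned the same way, instead of slightly different chains scattered across the notebook.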
When Mapping Isn't Enough: Fuzzy Matching
Sometimes the variations are too numerous or unpredictable to map manually. Consider medication names:
meds = pd.Series([
"metformin", "Metformin", "metforman",
"metphormin", "METFORMIN HCL", "metformin 500mg"
])
The third and fourth entries are misspellings. A simple mapping dictionary can't catch every possible misspelling. For these cases, there are fuzzy matching libraries like fuzzywuzzy or thefuzz, which we'll mention in the Further Reading. For now, know that this problem exists, and that regex combined with string methods can handle most standardization tasks — but not misspellings.
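You don't need a third-party library to see the idea, though: Python's standard-library difflib can snap a misspelling to the closest known name. A minimal sketch, assuming a hand-picked list of canonical spellings:

```python
import difflib
import pandas as pd

meds = pd.Series(["metformin", "metforman", "metphormin", "lisinopril"])

# Canonical spellings we trust (a hand-picked list for illustration)
canonical = ["metformin", "lisinopril", "atorvastatin"]

def closest_match(name, choices, cutoff=0.8):
    """Snap a name to its closest canonical spelling, or leave it alone."""
    matches = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else name

corrected = meds.apply(lambda m: closest_match(m, canonical))
# Both misspellings snap to "metformin"
```

The cutoff parameter controls how aggressive the matching is; too low and distinct drugs start merging, so always inspect the results with .value_counts() afterward.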
10.14 Tokenization: Breaking Text into Words
Tokenization is the process of splitting text into individual words (or "tokens"). It's the first step in any text analysis.
The simplest approach uses .str.split():
responses = pd.Series([
"The vaccine was safe and effective",
"I experienced mild side effects",
"No issues at all"
])
# Split into word lists
responses.str.split()
0 [The, vaccine, was, safe, and, effective]
1 [I, experienced, mild, side, effects]
2 [No, issues, at, all]
dtype: object
For counting words, combine with .str.len():
responses.str.split().str.len()
0 6
1 5
2 4
dtype: int64
For more sophisticated tokenization (handling punctuation, contractions, and special cases), libraries like NLTK or spaCy are the professional tools. But for the kind of text wrangling you'll do in data cleaning, str.split() combined with regex is usually sufficient.
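When punctuation gets in the way of plain .str.split(), a regex with .str.findall() pulls out just the word characters. A sketch on made-up responses:

```python
import pandas as pd

responses = pd.Series([
    "Vaccine was safe, effective -- no complaints!",
    "Side-effects? None."
])

# \w+ grabs runs of word characters, dropping punctuation
tokens = responses.str.lower().str.findall(r"\w+")
# 0    [vaccine, was, safe, effective, no, complaints]
# 1                              [side, effects, none]
```

Note that this treats hyphenated words like "side-effects" as two tokens; whether that's right depends on your analysis.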
Counting Specific Words
Want to know how many times a specific word appears in each entry?
reviews = pd.Series([
"Great product, really great quality",
"Good but not great",
"Great great great!"
])
reviews.str.lower().str.count(r"\bgreat\b")
0 2
1 1
2 3
dtype: int64
The \b word boundaries ensure we count "great" as a whole word, not as part of "greatest" or "greatly."
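To count every word across the whole column, not just one target word, tokenize each entry, explode the lists into one row per word, and tally:

```python
import pandas as pd

reviews = pd.Series([
    "Great product, really great quality",
    "Good but not great",
    "Great great great!"
])

word_counts = (reviews.str.lower()
                      .str.findall(r"\b\w+\b")   # tokenize each review
                      .explode()                 # one row per word
                      .value_counts())
# "great" leads with 6 occurrences
```

This tokenize-explode-count pattern is the quick-and-dirty version of a word frequency analysis, and it's often all you need during data cleaning.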
10.15 When to Use Regex vs. String Methods: A Decision Guide
This might be the most important section in the chapter. Regex is powerful, but it's also easy to overuse. Here's a decision guide:
Use simple string methods when:
- You're doing case conversion (.str.lower(), .str.upper())
- You're stripping whitespace (.str.strip())
- You're replacing a known, fixed substring (.str.replace("old", "new", regex=False))
- You're splitting on a simple delimiter (.str.split(","))
- You're checking for a known, fixed substring (.str.contains("exact text"))
Use regex when:
- You need to match a pattern rather than a fixed string ("any sequence of digits")
- You need to extract part of a string (capture groups with .str.extract())
- You need to match with flexibility (one word OR another, optional characters)
- You need anchoring (must start with, must end with, whole word match)
- You need to match repetition (exactly 3 digits, one or more letters)
Avoid regex when:
- A string method solves the problem just as well (simpler is better)
- The pattern would be unreadable (more than ~30 characters; consider breaking it up)
- You're trying to parse a structured format like HTML or JSON (use a proper parser)
- You're trying to match natural language meaning, not structure (use NLP tools)
Here's a concrete comparison:
# Task: Check if a string contains "covid"
# String method — simple, clear, fast
df["text"].str.contains("covid", case=False)
# Regex — unnecessary for this task
df["text"].str.contains(r"(?i)covid")
# Use the string method. It's clearer.
# Task: Extract a 5-digit ZIP code from an address
# String method — awkward, fragile
# (How do you know which 5 digits are the ZIP?)
# Regex — this is what it's for
df["address"].str.extract(r"(\d{5})(?:-\d{4})?$")
The guiding principle: start simple, escalate to regex only when you need pattern matching. Your future self (and your colleagues) will thank you.
10.16 Project Checkpoint: Extracting Vaccine Manufacturers from Messy Text
Let's apply everything we've learned to Elena's vaccination project. Her dataset has a column called vaccine_info that contains free-text descriptions of vaccines. She needs to extract two things:
- The manufacturer name
- Whether it's a primary dose or a booster
Here's a representative sample:
vaccine_data = pd.DataFrame({
"record_id": range(1, 11),
"vaccine_info": [
"Pfizer-BioNTech COVID-19 Vaccine (primary series)",
"MODERNA mRNA vaccine - booster dose",
"janssen single dose vaccine",
"Pfizer booster (3rd dose)",
"AstraZeneca/Oxford vaccine primary",
"moderna covid vaccine primary dose",
"PFIZER-BIONTECH (bivalent booster)",
"J&J/Janssen - single dose",
"pfizer primary series",
"Unknown vaccine type"
]
})
Step 1: Standardize text
vaccine_data["clean"] = (vaccine_data["vaccine_info"]
.str.strip()
.str.lower()
.str.replace(r"\s+", " ", regex=True))
Step 2: Extract manufacturer using pattern matching
# Define manufacturer patterns
manufacturer_patterns = {
"pfizer": r"pfizer|biontech",
"moderna": r"moderna|mrna-1273",
"janssen": r"janssen|j&j|johnson",
"astrazeneca": r"astrazeneca|oxford|azd1222",
"sinovac": r"sinovac|coronavac"
}
def identify_manufacturer(text):
    if pd.isna(text):
        return "unknown"
    for name, pattern in manufacturer_patterns.items():
        if re.search(pattern, text):
            return name
    return "unknown"
vaccine_data["manufacturer"] = (vaccine_data["clean"]
.apply(identify_manufacturer))
Step 3: Identify dose type
vaccine_data["is_booster"] = (vaccine_data["clean"]
.str.contains(r"booster|3rd dose|bivalent", na=False))
Step 4: Verify results
print(vaccine_data[["vaccine_info", "manufacturer",
"is_booster"]])
vaccine_info manufacturer is_booster
0 Pfizer-BioNTech COVID-19 Vaccine (primary se... pfizer False
1 MODERNA mRNA vaccine - booster dose moderna True
2 janssen single dose vaccine janssen False
3 Pfizer booster (3rd dose) pfizer True
4 AstraZeneca/Oxford vaccine primary astrazeneca False
5 moderna covid vaccine primary dose moderna False
6 PFIZER-BIONTECH (bivalent booster) pfizer True
7 J&J/Janssen - single dose janssen False
8 pfizer primary series pfizer False
9 Unknown vaccine type unknown False
Ten messy free-text entries, now with clean manufacturer names and booster flags. This is the kind of text wrangling that Elena does daily — and it's the kind of work that makes the difference between a dataset you can analyze and a dataset you can only stare at.
Step 5: Standardize country name variations
While we're at it, let's also tackle the country name standardization that Elena needs:
# Suppose the dataset has these country variations
countries = pd.Series([
"United States of America", "USA", "U.S.",
"Republic of Korea", "South Korea", "Korea, South",
"Viet Nam", "Vietnam",
"Russian Federation", "Russia",
"Dem. Rep. Congo", "Democratic Republic of the Congo",
"DR Congo", "Cote d'Ivoire", "Ivory Coast"
])
# Build a mapping
country_map = {
"usa": "united states",
"u.s.": "united states",
"united states of america": "united states",
"republic of korea": "south korea",
"korea, south": "south korea",
"viet nam": "vietnam",
"russian federation": "russia",
"dem. rep. congo": "democratic republic of the congo",
"dr congo": "democratic republic of the congo",
"cote d'ivoire": "ivory coast"
}
cleaned = (countries
.str.strip()
.str.lower()
.replace(country_map))
print(cleaned.value_counts())
united states 3
south korea 3
democratic republic of the congo 3
vietnam 2
ivory coast 2
russia 2
Name: count, dtype: int64
Fifteen variations collapsed into six clean country names.
10.17 Common Regex Patterns for Data Science
Here's a reference table of patterns you'll use again and again. You don't need to memorize these — bookmark this page and come back to it.
| What You Want to Match | Pattern | Example Matches |
|---|---|---|
| Integer | \d+ | "42", "12345" |
| Decimal number | \d+\.?\d* | "42", "3.14", "100.0" |
| Date (YYYY-MM-DD) | \d{4}-\d{2}-\d{2} | "2023-03-15" |
| Date (MM/DD/YYYY) | \d{1,2}/\d{1,2}/\d{4} | "3/15/2023", "12/01/2022" |
| Time (HH:MM) | \d{1,2}:\d{2} | "9:30", "14:05" |
| US phone number | \d{3}[-.\s]?\d{3}[-.\s]?\d{4} | "555-123-4567" |
| Email (simplified) | [\w.+-]+@[\w-]+\.[\w.]+ | "user@example.com" |
| US ZIP code | \d{5}(-\d{4})? | "90210", "90210-1234" |
| Leading/trailing spaces | ^\s+\|\s+$ | " hello " |
| Text in parentheses | \(([^)]+)\) | "(hello)" captures "hello" |
| Text in quotes | "([^"]+)" | '"hello"' captures 'hello' |
10.18 The re Module: Beyond findall
We've been using re.search, re.findall, and re.sub. Let's round out our knowledge of the re module with a few more useful features.
re.compile(): Precompiling Patterns
If you're going to use the same pattern many times, compile it first for better performance:
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
# Now use it multiple times
date_pattern.findall("Events on 2023-01-15 and 2023-02-20")
date_pattern.search("Next date: 2023-05-01")
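A compiled pattern isn't just for findall and search; it exposes split, sub, and the other re functions as methods too. A small sketch (the delimiter set here is our own example):

```python
import re

# One compiled pattern reused for splitting on several delimiters
sep = re.compile(r"\s*[;,/]\s*")

parts = sep.split("Pfizer; Moderna, Janssen / AstraZeneca")
# ['Pfizer', 'Moderna', 'Janssen', 'AstraZeneca']
```

Including \s* on both sides of the delimiter class means the surrounding whitespace is consumed by the split, so no .str.strip() pass is needed afterward.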
re.IGNORECASE: Case-Insensitive Matching
re.findall(r"pfizer", "Pfizer and PFIZER and pfizer",
re.IGNORECASE)
['Pfizer', 'PFIZER', 'pfizer']
In pandas, you can achieve this with the case=False parameter or the (?i) flag:
# These are equivalent
df["text"].str.contains("pfizer", case=False)
df["text"].str.contains(r"(?i)pfizer")
re.VERBOSE: Readable Regex with Comments
For complex patterns, the re.VERBOSE flag lets you add whitespace and comments:
phone_pattern = re.compile(r"""
(\d{3}) # area code
[-.\s]? # optional separator
(\d{3}) # first three digits
[-.\s]? # optional separator
(\d{4}) # last four digits
""", re.VERBOSE)
phone_pattern.findall("Call 555-123-4567 or 800.555.1234")
[('555', '123', '4567'), ('800', '555', '1234')]
This is much more readable than (\d{3})[-.\s]?(\d{3})[-.\s]?(\d{4}). Use re.VERBOSE for any pattern longer than about 20 characters.
10.19 Debugging Regex: When Your Pattern Doesn't Match
Regex debugging is a skill in itself. Here are strategies that work:
Strategy 1: Test with re.findall() on a Small String
Before applying a regex to a column of 50,000 values, test it on a single string:
test = "Pfizer-BioNTech (BNT162b2) 30mcg"
print(re.findall(r"\((\w+)\)", test))
# ['BNT162b2']
Strategy 2: Build Up Incrementally
Don't write the whole pattern at once. Start with the simplest version and add complexity:
text = "Date: 03/15/2023, Amount: $1,234.56"
# Step 1: Match any digits
re.findall(r"\d+", text)
# ['03', '15', '2023', '1', '234', '56']
# Step 2: Match date pattern
re.findall(r"\d{2}/\d{2}/\d{4}", text)
# ['03/15/2023']
# Step 3: Add capture groups
re.findall(r"(\d{2})/(\d{2})/(\d{4})", text)
# [('03', '15', '2023')]
Strategy 3: Use Online Regex Testers
Websites like regex101.com let you type a pattern and a test string, and they show you exactly what matches, which groups are captured, and why. They even explain each part of your pattern in plain English. These tools are invaluable for learning and debugging.
Strategy 4: Check for Common Mistakes
| Symptom | Common Cause |
|---|---|
| Pattern matches nothing | Forgot to escape special characters such as ., $, and ( |
| Pattern matches too much | Used .+ instead of .+? (greedy vs. lazy) |
| Pattern matches part of a word | Forgot word boundary \b |
| Pattern works on test string but not pandas column | Forgot na=False in str.contains |
| str.replace doesn't change anything | Need regex=True parameter |
10.20 Real-World Application: Text Data in Public Health
Text data challenges are everywhere in Elena's work:
Patient notes: "Pt reports fever x2 days post-vaccination" needs to be parsed for symptoms ("fever"), duration ("2 days"), and timing ("post-vaccination").
Survey responses: Free-text answers to "Why didn't you get vaccinated?" need to be categorized into themes (cost, access, trust, medical exemption).
Drug names: "Metformin HCl 500mg extended-release tablet" and "metformin hydrochloride 500 mg ER tab" are the same medication but look completely different to a computer.
Address matching: "123 Main St, Apt 4B" and "123 Main Street, #4B" need to be recognized as the same location.
In each case, the workflow is the same: standardize first (case, whitespace, punctuation), then use pattern matching to extract structure, then map variants to canonical forms. The tools you learned in this chapter — .str methods, regex, and capture groups — are the foundation for all of it.
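As a small taste of that workflow, here's a hedged sketch that pulls the duration out of shorthand notes like the first example above (the pattern and column names are our own illustration, not a clinical standard):

```python
import re
import pandas as pd

notes = pd.Series([
    "Pt reports fever x2 days post-vaccination",
    "Pt reports headache x1 day post-vaccination"
])

# "x<number> day(s)/week(s)" -> capture the number and the unit
extracted = notes.str.extract(r"x(\d+)\s*(day|week)s?", flags=re.IGNORECASE)
extracted.columns = ["duration", "unit"]
#   duration unit
# 0        2  day
# 1        1  day
```

Real clinical notes are far messier than this, but the shape of the solution (standardize, then extract with capture groups, then map variants) is exactly the one this chapter has been building.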
10.21 Spaced Review: Concepts from Chapters 1-9
Learning sticks when you revisit it. Take five minutes to answer these questions from earlier chapters — without looking back. If you can't answer one, that's a signal to review.
From Chapter 1: What are the three types of data science questions? Give an example of each using vaccination data.
From Chapter 3: What's the difference between = and == in Python? Why does this distinction matter?
From Chapter 5: What's the difference between a list and a dictionary? When would you use each?
From Chapter 7: What does df.head() show you? What about df.info()? Which gives you data types?
From Chapter 8: Name three types of missing data (MCAR, MAR, MNAR). Why does it matter which type you have?
From Chapter 9: What's the difference between merge() and concat() in pandas?
Chapter Summary
You started this chapter treating text as an opaque blob of characters. Now you can see the structure inside it.
The .str accessor gives you vectorized string operations — case conversion, stripping, splitting, replacing — that work on entire columns at once, handle missing values gracefully, and make text standardization efficient.
Regular expressions are a mini-language for describing patterns. You learned the building blocks: literal characters, special characters (\d, \w, \s, .), quantifiers (+, *, ?, {n}), character classes ([A-Z], [^0-9]), anchors (^, $, \b), alternation (|), and capture groups (()).
Capture groups combined with .str.extract() let you pull structured data out of unstructured text — extracting dates, codes, names, and numbers from messy free-text fields.
And you learned the most important lesson of all: use the simplest tool that works. String methods first. Regex when you need pattern matching. And always test on a small sample before running against your full dataset.
Text data is everywhere. The skills you built in this chapter will serve you in every data project you ever work on.
What's Next
In Chapter 11, you'll tackle another tricky data type: dates and times. You'll learn to parse date strings, work with time zones, resample time series data, and compute rolling averages — skills that are essential for any analysis involving trends over time. If you're working on the vaccination project, you'll parse date columns and compute rolling 7-day averages of vaccination rates.
This chapter covered text wrangling with pandas .str methods and regular expressions. In the exercises that follow, you'll practice these skills on realistic text data, from cleaning survey responses to extracting information from medical records. Take your time with regex — it rewards patient practice.