This is your reference card for Chapter 10 — the chapter where you learned to see structure inside messy text. Keep this nearby whenever you're cleaning text columns.
The Text Cleaning Workflow
Every text standardization task follows the same arc:
1. STRIP WHITESPACE       .str.strip()
        |                 Remove leading/trailing spaces.
        v
2. STANDARDIZE CASE       .str.lower() or .str.upper()
        |                 Eliminate case-based duplicates.
        v
3. REMOVE/STANDARDIZE     .str.replace(r"[^\w\s]", "", regex=True)
   PUNCTUATION            Be careful: hyphens and apostrophes may be meaningful.
        |
        v
4. COLLAPSE WHITESPACE    .str.replace(r"\s+", " ", regex=True)
        |                 Turn "New    York" into "New York".
        v
5. MAP KNOWN VARIANTS     .replace(mapping_dict)
        |                 "usa" -> "united states", "uk" -> "united kingdom".
        v
6. VERIFY RESULTS         .value_counts()
                          Always check what you've got.
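The six steps above can be chained into one pipeline. This is a minimal sketch: the sample values and the `variants` mapping are made up for illustration, and a real column will likely need a longer mapping.

```python
import pandas as pd

# Hypothetical messy column; the values are invented for illustration.
raw = pd.Series([" USA ", "usa", "U.S.A.", "United  States", "uk", "UK "])

# Hypothetical mapping of known variants to canonical names.
variants = {"usa": "united states", "uk": "united kingdom"}

cleaned = (
    raw.str.strip()                               # 1. strip edge whitespace
       .str.lower()                               # 2. standardize case
       .str.replace(r"[^\w\s]", "", regex=True)   # 3. drop punctuation
       .str.replace(r"\s+", " ", regex=True)      # 4. collapse whitespace
       .replace(variants)                         # 5. map whole-value variants
)

# 6. verify: six raw spellings should collapse to two categories
counts = cleaned.value_counts()
print(counts)
```

Note that step 5 uses Series.replace (no .str), which swaps whole values that match a dictionary key rather than substrings.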
Essential .str Methods Cheat Sheet
| Method | What It Does | Example |
| --- | --- | --- |
| .str.lower() | Lowercase everything | "NYC" -> "nyc" |
| .str.upper() | Uppercase everything | "nyc" -> "NYC" |
| .str.strip() | Remove edge whitespace | " hi " -> "hi" |
| .str.contains(pat) | Boolean: does it contain? | Returns True/False Series |
| .str.startswith(pat) | Boolean: starts with? | Returns True/False Series |
| .str.replace(old, new) | Replace substring | "USA" -> "United States" |
| .str.split(sep) | Split into list | "a,b,c" -> ["a","b","c"] |
| .str.split(sep, expand=True) | Split into columns | Returns DataFrame |
| .str.extract(regex) | Extract capture groups | Returns DataFrame of groups |
| .str.extractall(regex) | Extract all matches | Returns multi-indexed DataFrame |
| .str.findall(regex) | Find all matches | Returns lists |
| .str.count(pat) | Count occurrences | Returns integer Series |
| .str.len() | Length of each string | Returns integer Series |
| .str[0] / .str[:3] | Slice each string | First char / first 3 chars |
| .str.cat(sep=",") | Join all values | Returns single string |
| .str.match(regex) | Boolean: matches from start? | Like contains with ^ anchor |
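The return types are the part of this cheat sheet that is easiest to forget, so here is a small sketch that exercises the splitting and extraction methods side by side. The sample Series are invented for illustration.

```python
import pandas as pd

s = pd.Series(["a,b,c", "d,e,f"])

as_lists = s.str.split(",")                  # Series whose values are lists
as_columns = s.str.split(",", expand=True)   # DataFrame with one column per piece
joined = s.str.cat(sep="|")                  # one string: "a,b,c|d,e,f"

doses = pd.Series(["10 mg and 20 mg", "5 mg"])

first = doses.str.extract(r"(\d+)")      # DataFrame, first match per row only
every = doses.str.extractall(r"(\d+)")   # DataFrame, one row per match, MultiIndex
found = doses.str.findall(r"\d+")        # Series of lists, no capture group needed
```

A reasonable way to choose: extract when each row has at most one value of interest, extractall or findall when a row can hold several.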
Critical parameters:
- str.contains(pat, case=False) — case-insensitive search
- str.contains(pat, na=False) — treat NaN as False (essential for filtering)
- str.replace(old, new, regex=False) — literal replacement (no regex interpretation)
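A short sketch of the three parameters above in action. The sample values are invented; the point is what happens to the missing value and to the regex-special characters in "$5.00".

```python
import pandas as pd

s = pd.Series(["Cost: $5.00", "cost unknown", None, "FREE"])

# case=False ignores capitalization; na=False turns the missing value into
# False, so the result is a clean boolean mask that can be used for filtering.
mask = s.str.contains("cost", case=False, na=False)
matching_rows = s[mask]

# regex=False treats "$5.00" as literal text, so "$" and "." need no escaping.
replaced = s.str.replace("$5.00", "five dollars", regex=False)
```

Without na=False, the mask would contain a missing value in the third position, and using it to index the Series would raise an error.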
Regex Syntax Quick Reference
Character Matchers
| Pattern | Matches | Example |
| --- | --- | --- |
| . | Any character (except newline) | c.t matches "cat", "cot" |
| \d | Any digit (0-9) | \d\d matches "42" |
| \D | Any non-digit | \D+ matches "hello" |
| \w | Word character (letter, digit, _) | \w+ matches "hello_1" |
| \W | Non-word character | \W matches "!", " " |
| \s | Whitespace (space, tab, newline) | \s+ matches " " |
| \S | Non-whitespace | \S+ matches "hello" |
| [abc] | Any of a, b, or c | [aeiou] matches vowels |
| [A-Z] | Any uppercase letter | [A-Z]{2} matches "NY" |
| [^abc] | Anything except a, b, c | [^0-9] matches non-digits |
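To see the character matchers above on a concrete string, here is a minimal sketch using Python's built-in re module. The sample text is invented for illustration.

```python
import re

text = "Order #42: 3 cats, NY_office  (2023)"

digits = re.findall(r"\d+", text)       # runs of digits
words = re.findall(r"\w+", text)        # runs of letters, digits, underscores
caps = re.findall(r"[A-Z]{2}", text)    # exactly two uppercase letters in a row
wildcard = re.findall(r"c.t", "cat cot c-t ct")  # "." needs one character

print(digits)    # ['42', '3', '2023']
print(words)     # ['Order', '42', '3', 'cats', 'NY_office', '2023']
print(caps)      # ['NY']
print(wildcard)  # ['cat', 'cot', 'c-t']
```

Note that \w keeps "NY_office" together because the underscore counts as a word character.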
Quantifiers
| Pattern | Meaning | Example |
| --- | --- | --- |
| + | One or more | \d+ matches "123" |
| * | Zero or more | \d* matches "" or "123" |
| ? | Zero or one | \d? matches "" or "5" |
| {n} | Exactly n | \d{4} matches "2023" |
| {n,m} | Between n and m | \d{1,3} matches "1"-"999" |
| +? | One or more (lazy) | .+? matches as little as possible |
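The difference between greedy and lazy quantifiers is easiest to see on text with repeated delimiters. A minimal sketch with the re module, using an invented HTML-like string:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as it can, so one match spans the whole string.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first ">" it can, giving one match per tag.
lazy = re.findall(r"<.+?>", html)

# {n,m} bounds the repetition: fullmatch succeeds only for 1 to 3 digits.
three_ok = bool(re.fullmatch(r"\d{1,3}", "999"))
four_ok = bool(re.fullmatch(r"\d{1,3}", "1000"))

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```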
Anchors and Boundaries
| Pattern | Matches Position |
| --- | --- |
| ^ | Start of string |
| $ | End of string |
| \b | Word boundary |
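Anchors match positions rather than characters, which is why \b turns a substring search into a whole-word search. A small sketch with an invented sentence:

```python
import re

text = "data database metadata data"

loose = re.findall(r"data", text)        # also hits "database" and "metadata"
whole = re.findall(r"\bdata\b", text)    # only the standalone word

starts_with_data = bool(re.search(r"^data", text))  # anchored to the start
ends_with_base = bool(re.search(r"base$", text))    # anchored to the end

print(len(loose), len(whole))  # 4 2
```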
Groups and Alternation
| Pattern | Meaning |
| --- | --- |
| (...) | Capture group |
| (?P<name>...) | Named capture group |
| (?:...) | Non-capturing group |
| a\|b | a OR b |
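The three kinds of group behave differently once you ask for results. A minimal sketch using the re module, with invented sample strings:

```python
import re

# Named groups let you pull parts out by name instead of by position.
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "Released 2023-07")
year, month = m.group("year"), m.group("month")

# With a capture group, findall returns only the captured part.
captured = re.findall(r"(cat|dog)s?", "cats and dog")

# With a non-capturing group, findall returns the whole match.
uncaptured = re.findall(r"(?:cat|dog)s?", "cats and dog")

print(year, month)   # 2023 07
print(captured)      # ['cat', 'dog']
print(uncaptured)    # ['cats', 'dog']
```

In pandas, named groups in .str.extract() become the column names of the returned DataFrame, which is often a good reason to prefer them.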
Escaping
These characters are special in regex and need \ to match literally: . * + ? ( ) [ ] { } ^ $ | \
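A quick sketch of what goes wrong without escaping, and of re.escape(), which adds the backslashes for you when the text to match is known in advance. The sample strings are invented.

```python
import re

# Unescaped "." means "any character", so it also matches the "x".
unescaped = re.findall(r"3.50", "3.50 3x50")

# Escaped "\." matches only a literal period.
escaped = re.findall(r"3\.50", "3.50 3x50")

# re.escape builds a safe pattern from literal text full of special characters.
literal = "$3.50 (approx.)"
pattern = re.escape(literal)
found = re.search(pattern, "Total: $3.50 (approx.)") is not None

print(unescaped)  # ['3.50', '3x50']
print(escaped)    # ['3.50']
```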
Decision Guide: String Methods vs. Regex
| Task | Tool | Example |
| --- | --- | --- |
| Change case | String method | .str.lower() |
| Strip whitespace | String method | .str.strip() |
| Replace fixed text | String method | .str.replace("old", "new", regex=False) |
| Split on delimiter | String method | .str.split(",") |
| Check for fixed text | String method | .str.contains("exact", regex=False) |
| Match a pattern | Regex | .str.contains(r"\d{3}-\d{4}") |
| Extract parts of text | Regex | .str.extract(r"(\d+)\s*(mg)") |
| Match alternatives | Regex | .str.contains(r"cat\|dog") |
| Whole word match | Regex | .str.contains(r"\bdata\b") |
| Validate format | Regex | .str.match(r"^[A-Z]{2}\d{4}$") |
Rule of thumb: Start with string methods. Reach for regex only when you need pattern matching, extraction, or alternation.
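As an illustration of the rule of thumb, the sketch below uses a literal search for a fixed character and switches to regex only for the extraction. The medication notes are invented sample data.

```python
import pandas as pd

notes = pd.Series(["aspirin 500 mg", "ibuprofen 200mg", "rest (no dose)"])

# Fixed text: a literal search is enough. regex=False matters here because
# a bare "(" is not a valid regular expression.
has_paren = notes.str.contains("(", regex=False)

# Pattern plus extraction: this is where regex earns its place.
dose = notes.str.extract(r"(\d+)\s*(mg)")
dose.columns = ["amount", "unit"]   # hypothetical column names

print(has_paren.tolist())  # [False, False, True]
print(dose)
```

Rows with no match come back as missing values in every extracted column, so it is worth checking how many rows failed to match before converting the amounts to numbers.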
Common Patterns for Data Science
# In every snippet below, s is a pandas Series of strings
# Extract a date (YYYY-MM-DD)
s.str.extract(r"(\d{4})-(\d{2})-(\d{2})")
# Extract a number with optional decimal
s.str.extract(r"(\d+\.?\d*)")
# Extract text inside parentheses
s.str.extract(r"\(([^)]+)\)")
# Match a phone number (flexible separators)
s.str.contains(r"\d{3}[-.\s]?\d{3}[-.\s]?\d{4}")
# Extract email components
s.str.extract(r"([\w.+-]+)@([\w-]+\.[\w.]+)")
# Remove HTML-like tags
s.str.replace(r"<[^>]+>", "", regex=True)
# Collapse whitespace
s.str.replace(r"\s+", " ", regex=True)
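To check these patterns before trusting them on real data, it helps to run them on a few rows where you already know the answer. A minimal sketch, with an invented three-row Series:

```python
import pandas as pd

s = pd.Series([
    "Visit on 2023-07-15 (follow-up)",
    "<p>Call 555-123-4567</p>",
    "Email  jane.doe@example.com   today",
])

dates = s.str.extract(r"(\d{4})-(\d{2})-(\d{2})")             # year, month, day
inside = s.str.extract(r"\(([^)]+)\)")                        # text in parentheses
has_phone = s.str.contains(r"\d{3}[-.\s]?\d{3}[-.\s]?\d{4}")  # phone-like number
email = s.str.extract(r"([\w.+-]+)@([\w-]+\.[\w.]+)")         # user, domain
no_tags = s.str.replace(r"<[^>]+>", "", regex=True)           # strip tags
tidy = s.str.replace(r"\s+", " ", regex=True)                 # collapse spaces

print(has_phone.tolist())  # [False, True, False]
```

Rows that do not match an extract pattern come back as missing values, so each result lines up with the original Series by index.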
Debugging Checklist
When your regex doesn't work as expected:
- Test on a small string first: use re.findall() on a single example
- Build incrementally: start with the simplest pattern that matches something, then add complexity
- Check for unescaped special characters: ., $, (, ), *, +, ? all need \ for literal matching
- Check greedy vs. lazy: if matching too much, add ? after your quantifier
- Check for missing na=False: required when using str.contains() as a filter mask on a column with missing values
- Check regex=True/False: make sure str.replace() is interpreting your pattern correctly (pandas 2.0 and later default to regex=False)
- Use regex101.com: paste your pattern and test string for visual debugging
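The first two checklist items can look like this in practice: one test string, re.findall(), and a pattern that grows one piece at a time. The phone-like sample string is invented.

```python
import re

sample = "Call 555-1234 or 555.9876 today"

# Step 1: the simplest pattern that matches something.
step1 = re.findall(r"\d+", sample)

# Step 2: add structure (three digits, a hyphen, four digits).
step2 = re.findall(r"\d{3}-\d{4}", sample)

# Step 3: allow either separator with a character class.
step3 = re.findall(r"\d{3}[-.]\d{4}", sample)

print(step1)  # ['555', '1234', '555', '9876']
print(step2)  # ['555-1234']
print(step3)  # ['555-1234', '555.9876']
```

If a step suddenly returns nothing, the piece you just added is the one to inspect.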
Key Vocabulary
| Term | Definition |
| --- | --- |
| str accessor | Pandas interface (.str) for applying string methods to a Series |
| Regular expression (regex) | A pattern-matching mini-language for describing text structure |
| Character class | A set of characters to match: [A-Z], \d, \w |
| Quantifier | Specifies repetition: +, *, ?, {n} |
| Capture group | Parenthesized part of a regex that extracts matched text |
| Greedy matching | Default: match as much as possible (+, *) |
| Lazy matching | Match as little as possible (+?, *?) |
| Anchor | Matches a position, not a character: ^, $, \b |
| Text normalization | Process of making equivalent text values consistent |
| Tokenization | Splitting text into individual words or tokens |