Key Takeaways: Working with Text Data

This is your reference card for Chapter 10 — the chapter where you learned to see structure inside messy text. Keep this nearby whenever you're cleaning text columns.


The Text Cleaning Workflow

Every text standardization task follows the same arc:

1. STRIP WHITESPACE        .str.strip()
    |                      Remove leading/trailing spaces.
    v
2. STANDARDIZE CASE        .str.lower() or .str.upper()
    |                      Eliminate case-based duplicates.
    v
3. REMOVE/STANDARDIZE      .str.replace(r"[^\w\s]", "", regex=True)
   PUNCTUATION             Be careful: hyphens and apostrophes may be meaningful.
    |
    v
4. COLLAPSE WHITESPACE     .str.replace(r"\s+", " ", regex=True)
    |                      Turn "New   York" into "New York".
    v
5. MAP KNOWN VARIANTS      .replace(mapping_dict)
    |                      "usa" -> "united states", "uk" -> "united kingdom".
    v
6. VERIFY RESULTS          .value_counts()
                           Always check what you've got.
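
Here is a minimal sketch of the six steps chained together. The sample values and the `variants` mapping are made up for illustration; swap in your own column and dictionary.

```python
import pandas as pd

# Hypothetical messy country column (illustrative data only).
raw = pd.Series(["  USA ", "U.S.A.", "usa", "United   States", "UK", " u.k. "])

# Hypothetical mapping of known variants to canonical names.
variants = {"usa": "united states", "uk": "united kingdom"}

cleaned = (
    raw.str.strip()                               # 1. strip edge whitespace
       .str.lower()                               # 2. standardize case
       .str.replace(r"[^\w\s]", "", regex=True)   # 3. drop punctuation
       .str.replace(r"\s+", " ", regex=True)      # 4. collapse whitespace
       .replace(variants)                         # 5. map whole-value variants
)

# 6. verify: six spellings should collapse to two categories
print(cleaned.value_counts())
```

Note that step 5 uses Series.replace (no .str), which swaps whole values rather than substrings.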

Essential .str Methods Cheat Sheet

Method                         What It Does                  Example
.str.lower()                   Lowercase everything          "NYC" -> "nyc"
.str.upper()                   Uppercase everything          "nyc" -> "NYC"
.str.strip()                   Remove edge whitespace        " hi " -> "hi"
.str.contains(pat)             Boolean: does it contain?     Returns True/False Series
.str.startswith(pat)           Boolean: starts with?         Returns True/False Series
.str.replace(old, new)         Replace substring             "USA" -> "United States"
.str.split(sep)                Split into list               "a,b,c" -> ["a","b","c"]
.str.split(sep, expand=True)   Split into columns            Returns DataFrame
.str.extract(regex)            Extract capture groups        Returns DataFrame of groups
.str.extractall(regex)         Extract all matches           Returns multi-indexed DataFrame
.str.findall(regex)            Find all matches              Returns lists
.str.count(pat)                Count occurrences             Returns integer Series
.str.len()                     Length of each string         Returns integer Series
.str[0] / .str[:3]             Slice each string             First char / first 3 chars
.str.cat(sep=",")              Join all values               Returns single string
.str.match(regex)              Boolean: matches from start?  Like contains with ^ anchor
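
A quick sketch of a few of these methods on a tiny Series. The names are illustrative data, not from the chapter.

```python
import pandas as pd

# Hypothetical sample data.
names = pd.Series(["Ada Lovelace", "Alan Turing"])

parts = names.str.split(" ", expand=True)   # DataFrame with columns 0 and 1
initials = names.str[0]                     # first character of each value
vowels = names.str.count(r"[aeiou]")        # lowercase vowels per value
joined = names.str.cat(sep=", ")            # one string for the whole Series

print(parts)
print(initials.tolist(), vowels.tolist())
print(joined)
```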

Critical parameters:

- str.contains(pat, case=False): case-insensitive search
- str.contains(pat, na=False): treat NaN as False (essential for filtering)
- str.replace(old, new, regex=False): literal replacement (no regex interpretation)
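
The sketch below shows what these parameters change in practice. The `notes` column is hypothetical data with one missing value.

```python
import pandas as pd

# Hypothetical column with a missing value (illustrative data only).
notes = pd.Series(["Price: $5.00", "price unknown", None, "FREE (promo)"])

# case=False ignores case; na=False turns the missing row into False,
# so the result is safe to use as a boolean mask.
mask = notes.str.contains("price", case=False, na=False)
print(notes[mask].tolist())

# regex=False treats "$5.00" as literal text rather than as a pattern.
literal = notes.str.replace("$5.00", "five dollars", regex=False)
print(literal.iloc[0])
```

Without na=False, the missing row would produce a missing value in the mask, and filtering with it can raise an error.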


Regex Syntax Quick Reference

Character Matchers

Pattern  Matches                             Example
.        Any character (except newline)      c.t matches "cat", "cot"
\d       Any digit (0-9)                     \d\d matches "42"
\D       Any non-digit                       \D+ matches "hello"
\w       Word character (letter, digit, _)   \w+ matches "hello_1"
\W       Non-word character                  \W matches "!", " "
\s       Whitespace (space, tab, newline)    \s+ matches " "
\S       Non-whitespace                      \S+ matches "hello"
[abc]    Any of a, b, or c                   [aeiou] matches vowels
[A-Z]    Any uppercase letter                [A-Z]{2} matches "NY"
[^abc]   Anything except a, b, c             [^0-9] matches non-digits
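
To see a few of these matchers in action, here is a small sketch using Python's re module on a made-up string.

```python
import re

# Hypothetical sample text.
text = "Order #42: 3 items, ZIP 10001"

digits = re.findall(r"\d+", text)        # runs of digits
caps = re.findall(r"[A-Z]{2,}", text)    # two or more uppercase letters in a row
non_word = re.findall(r"\W", "a-b c!")   # anything that is not a letter, digit, or _

print(digits)
print(caps)
print(non_word)
```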

Quantifiers

Pattern  Meaning              Example
+        One or more          \d+ matches "123"
*        Zero or more         \d* matches "" or "123"
?        Zero or one          \d? matches "" or "5"
{n}      Exactly n            \d{4} matches "2023"
{n,m}    Between n and m      \d{1,3} matches "7", "42", "999"
+?       One or more (lazy)   .+? matches as little as possible
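
The difference between greedy and lazy quantifiers is easiest to see side by side. A minimal sketch on a made-up string:

```python
import re

# Hypothetical sample text with two pairs of tags.
text = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.+>", text)    # runs on to the last ">" in the string
lazy = re.findall(r"<.+?>", text)     # stops at the first ">" each time

print(greedy)
print(lazy)
```

The greedy pattern returns one long match covering the whole string; the lazy pattern returns the four tags separately.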

Anchors and Boundaries

Pattern  Matches Position
^        Start of string
$        End of string
\b       Word boundary
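
Anchors match positions, which is why they are useful for whole-word searches and format validation. A small sketch with illustrative strings:

```python
import re

# \b keeps "data" from matching inside "database" or "metadata".
whole_word = re.findall(r"\bdata\b", "data in the database is metadata")

# ^ and $ together require the entire string to fit the pattern.
code_ok = bool(re.search(r"^[A-Z]{2}\d{4}$", "NY2023"))
code_bad = bool(re.search(r"^[A-Z]{2}\d{4}$", "NY2023-extra"))

print(whole_word, code_ok, code_bad)
```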

Groups and Alternation

Pattern         Meaning
(...)           Capture group
(?P<name>...)   Named capture group
(?:...)         Non-capturing group
a|b             a OR b
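
Named groups pair nicely with .str.extract(), because the group names become column names. The sketch below assumes a hypothetical `doses` column; the optional non-capturing group produces no column of its own.

```python
import pandas as pd

# Hypothetical dosage strings (illustrative data only).
doses = pd.Series(["500 mg", "approx 20ml", "no dose"])

# (?:...) groups without capturing; (?P<name>...) names the output columns;
# mg|ml matches either unit.
pattern = r"(?:approx\s+)?(?P<amount>\d+)\s*(?P<unit>mg|ml)"
parts = doses.str.extract(pattern)

print(parts)
```

Rows that do not match come back as missing values in every column.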

Escaping

These characters are special in regex and need \ to match literally: . * + ? ( ) [ ] { } ^ $ | \
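
If you would rather not escape by hand, Python's re.escape() can do it for you. A minimal sketch with a made-up price string:

```python
import re

# Hypothetical sample text.
price = "$5.00 (sale)"

pattern = re.escape("$5.00")              # adds \ before the $ and the .
escaped = re.findall(pattern, price)      # finds the literal text
unescaped = re.findall("$5.00", price)    # $ is read as "end of string": no match

print(pattern)
print(escaped, unescaped)
```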


Decision Guide: String Methods vs. Regex

Task                    Tool            Example
Change case             String method   .str.lower()
Strip whitespace        String method   .str.strip()
Replace fixed text      String method   .str.replace("old", "new", regex=False)
Split on delimiter      String method   .str.split(",")
Check for fixed text    String method   .str.contains("exact", regex=False)
Match a pattern         Regex           .str.contains(r"\d{3}-\d{4}")
Extract parts of text   Regex           .str.extract(r"(\d+)\s*(mg)")
Match alternatives      Regex           .str.contains(r"cat|dog")
Whole word match        Regex           .str.contains(r"\bdata\b")
Validate format         Regex           .str.match(r"^[A-Z]{2}\d{4}$")

Rule of thumb: Start with string methods. Reach for regex only when you need pattern matching, extraction, or alternation.
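
One reason the distinction matters: .str.contains() treats its pattern as regex by default, so fixed text that contains special characters can match more than you intended. A small sketch with hypothetical version strings:

```python
import pandas as pd

# Hypothetical version labels (illustrative data only).
versions = pd.Series(["v3.5", "v305", "v3-5"])

as_regex = versions.str.contains("3.5")                  # "." matches any character
as_literal = versions.str.contains("3.5", regex=False)   # plain substring check

print(as_regex.tolist())
print(as_literal.tolist())
```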


Common Patterns for Data Science

# Extract a date (YYYY-MM-DD)
s.str.extract(r"(\d{4})-(\d{2})-(\d{2})")

# Extract a number with optional decimal
s.str.extract(r"(\d+\.?\d*)")

# Extract text inside parentheses
s.str.extract(r"\(([^)]+)\)")

# Match a phone number (flexible separators)
s.str.contains(r"\d{3}[-.\s]?\d{3}[-.\s]?\d{4}")

# Extract email components
s.str.extract(r"([\w.+-]+)@([\w-]+\.[\w.]+)")

# Remove HTML-like tags
s.str.replace(r"<[^>]+>", "", regex=True)

# Collapse whitespace
s.str.replace(r"\s+", " ", regex=True)
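
To try a few of these patterns yourself, here is a sketch on a two-row Series. The sample strings are invented for illustration.

```python
import pandas as pd

# Hypothetical free-text column (illustrative data only).
s = pd.Series(["Visit on 2023-04-17 (follow-up)", "No date <b>recorded</b>"])

dates = s.str.extract(r"(\d{4})-(\d{2})-(\d{2})")     # year, month, day columns
inside = s.str.extract(r"\(([^)]+)\)")                # text inside parentheses
no_tags = s.str.replace(r"<[^>]+>", "", regex=True)   # strip HTML-like tags

print(dates)
print(inside)
print(no_tags.tolist())
```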

Debugging Checklist

When your regex doesn't work as expected:

  1. Test on a small string first — use re.findall() on a single example
  2. Build incrementally — start with the simplest pattern that matches something, then add complexity
  3. Check for unescaped special characters: ., $, (, ), *, +, ? all need \ for literal matching
  4. Check greedy vs. lazy — if matching too much, add ? after your quantifier
  5. Check for missing na=False — required when using str.contains() as a filter mask
  6. Check regex=True/False — make sure str.replace() is interpreting your pattern correctly
  7. Use regex101.com — paste your pattern and test string for visual debugging
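
The first two checklist items can be sketched like this: test on one string with re.findall() and grow the pattern a step at a time. The sample sentence and phone numbers are made up.

```python
import re

# Hypothetical single test string.
sample = "Call 555-867-5309 or 555.123.4567"

step1 = re.findall(r"\d{3}", sample)                          # just digit triples
step2 = re.findall(r"\d{3}-\d{3}-\d{4}", sample)              # one separator style
step3 = re.findall(r"\d{3}[-.\s]?\d{3}[-.\s]?\d{4}", sample)  # flexible separators

print(step1)
print(step2)
print(step3)
```

Each step should match something you can check by eye before you add the next piece.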

Key Vocabulary

Term                         Definition
str accessor                 Pandas interface (.str) for applying string methods to a Series
Regular expression (regex)   A pattern-matching mini-language for describing text structure
Character class              A set of characters to match: [A-Z], \d, \w
Quantifier                   Specifies repetition: +, *, ?, {n}
Capture group                Parenthesized part of a regex that extracts matched text
Greedy matching              Default: match as much as possible (+, *)
Lazy matching                Match as little as possible (+?, *?)
Anchor                       Matches a position, not a character: ^, $, \b
Text normalization           Process of making equivalent text values consistent
Tokenization                 Splitting text into individual words or tokens