Key Takeaways: Working with Text Data

Contributors to Introduction to Data Science

This is your reference card for Chapter 10 — the chapter where you learned to see structure inside messy text. Keep this nearby whenever you're cleaning text columns.

The Text Cleaning Workflow

Every text standardization task follows the same arc:

1. STRIP WHITESPACE        .str.strip()
    |                      Remove leading/trailing spaces.
    v
2. STANDARDIZE CASE        .str.lower() or .str.upper()
    |                      Eliminate case-based duplicates.
    v
3. REMOVE/STANDARDIZE      .str.replace(r"[^\w\s]", "", regex=True)
   PUNCTUATION             Be careful: hyphens and apostrophes may be meaningful.
    |
    v
4. COLLAPSE WHITESPACE     .str.replace(r"\s+", " ", regex=True)
    |                      Turn "New   York" into "New York".
    v
5. MAP KNOWN VARIANTS      .replace(mapping_dict)
    |                      "usa" -> "united states", "uk" -> "united kingdom".
    v
6. VERIFY RESULTS          .value_counts()
                           Always check what you've got.
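The six steps above can be chained into a single pipeline. The following is a minimal sketch, assuming a small invented country column; the sample values and the `variant_map` dictionary are hypothetical and only illustrate the order of operations.

```python
import pandas as pd

# Hypothetical messy country column (values invented for illustration).
raw = pd.Series(["  USA ", "usa", "U.S.A.", "United   States", "UK", " uk."])

# Hypothetical mapping of known variants to one canonical label.
variant_map = {"usa": "united states", "uk": "united kingdom"}

cleaned = (
    raw
    .str.strip()                               # 1. strip edge whitespace
    .str.lower()                               # 2. standardize case
    .str.replace(r"[^\w\s]", "", regex=True)   # 3. remove punctuation
    .str.replace(r"\s+", " ", regex=True)      # 4. collapse whitespace
    .replace(variant_map)                      # 5. map known variants (whole-value match)
)

# 6. verify results: each country should now appear under one label
print(cleaned.value_counts())
```

Note that step 5 uses `Series.replace`, not `.str.replace`, so only values that equal a dictionary key in full are remapped.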

Essential `.str` Methods Cheat Sheet

Method	What It Does	Example
`.str.lower()`	Lowercase everything	"NYC" -> "nyc"
`.str.upper()`	Uppercase everything	"nyc" -> "NYC"
`.str.strip()`	Remove edge whitespace	" hi " -> "hi"
`.str.contains(pat)`	Boolean: does it contain?	Returns True/False Series
`.str.startswith(pat)`	Boolean: starts with?	Returns True/False Series
`.str.replace(old, new)`	Replace substring	"USA" -> "United States"
`.str.split(sep)`	Split into list	"a,b,c" -> ["a","b","c"]
`.str.split(sep, expand=True)`	Split into columns	Returns DataFrame
`.str.extract(regex)`	Extract capture groups	Returns DataFrame of groups
`.str.extractall(regex)`	Extract all matches	Returns multi-indexed DataFrame
`.str.findall(regex)`	Find all matches	Returns lists
`.str.count(pat)`	Count occurrences	Returns integer Series
`.str.len()`	Length of each string	Returns integer Series
`.str[0]` / `.str[:3]`	Slice each string	First char / first 3 chars
`.str.cat(sep=",")`	Join all values	Returns single string
`.str.match(regex)`	Boolean: matches from start?	Like contains with ^ anchor
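Several rows in the table differ mainly in the shape of what they return. The sketch below, using two small invented Series, shows those return shapes side by side; it is an illustration, not an exhaustive tour of the accessor.

```python
import pandas as pd

s = pd.Series(["a,b,c", "d,e,f"])        # invented sample data

lists = s.str.split(",")                 # Series whose values are lists
cols = s.str.split(",", expand=True)     # DataFrame, one column per piece
joined = s.str.cat(sep=";")              # one string for the whole Series
commas = s.str.count(",")                # integer Series

t = pd.Series(["x1 y22", "z333"])        # invented sample data

first = t.str.extract(r"(\d+)")          # DataFrame: first match per row
every = t.str.extractall(r"(\d+)")       # DataFrame: one row per match, MultiIndex
found = t.str.findall(r"\d+")            # Series of lists of all matches

print(cols)
print(every)
```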

Critical parameters:
- str.contains(pat, case=False): case-insensitive search
- str.contains(pat, na=False): treat NaN as False (essential for filtering)
- str.replace(old, new, regex=False): literal replacement (no regex interpretation)
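A short sketch of these three parameters in action, assuming an invented Series that contains one missing value:

```python
import pandas as pd

# Invented sample data; the None entry stands in for a missing value.
s = pd.Series(["Data Science", "big data", None, "Price: $5.00"])

# case=False ignores case; na=False turns the missing value into False,
# so the result is a clean boolean mask that is safe for filtering.
mask = s.str.contains("data", case=False, na=False)
print(s[mask].tolist())   # ['Data Science', 'big data']

# regex=False treats "$5.00" as literal text rather than as a pattern.
replaced = s.str.replace("$5.00", "five dollars", regex=False)
print(replaced[3])        # Price: five dollars
```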

Regex Syntax Quick Reference

Character Matchers

Pattern	Matches	Example
`.`	Any character (except newline)	`c.t` matches "cat", "cot"
`\d`	Any digit (0-9)	`\d\d` matches "42"
`\D`	Any non-digit	`\D+` matches "hello"
`\w`	Word character (letter, digit, _)	`\w+` matches "hello_1"
`\W`	Non-word character	`\W` matches "!", " "
`\s`	Whitespace (space, tab, newline)	`\s+` matches " "
`\S`	Non-whitespace	`\S+` matches "hello"
`[abc]`	Any of a, b, or c	`[aeiou]` matches vowels
`[A-Z]`	Any uppercase letter	`[A-Z]{2}` matches "NY"
`[^abc]`	Anything except a, b, c	`[^0-9]` matches non-digits

Quantifiers

Pattern	Meaning	Example
`+`	One or more	`\d+` matches "123"
`*`	Zero or more	`\d*` matches "" or "123"
`?`	Zero or one	`\d?` matches "" or "5"
`{n}`	Exactly n	`\d{4}` matches "2023"
`{n,m}`	Between n and m	`\d{1,3}` matches "1"-"999"
`+?`	One or more (lazy)	`.+?` matches as little as possible
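The difference between greedy and lazy quantifiers is easiest to see on a string with repeated delimiters. A minimal sketch with Python's `re` module, using an invented test string:

```python
import re

text = "<b>bold</b> and <i>italic</i>"   # invented test string

# Greedy: .+ runs to the LAST ">" it can reach, giving one long match.
greedy = re.findall(r"<.+>", text)

# Lazy: .+? stops at the FIRST ">" it can reach, giving one match per tag.
lazy = re.findall(r"<.+?>", text)

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']

# {n,m} bounds the repetition: four digits do not fit \d{1,3} as a whole.
print(re.fullmatch(r"\d{1,3}", "1234"))   # None
```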

Anchors and Boundaries

Pattern	Matches Position
`^`	Start of string
`$`	End of string
`\b`	Word boundary
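Anchors constrain where a match may occur without consuming any characters. The following sketch applies each one to the same invented Series so the results can be compared directly:

```python
import pandas as pd

s = pd.Series(["data science", "metadata", "big data", "database"])  # invented

starts = s.str.contains(r"^data")       # "data" at the start of the string
ends = s.str.contains(r"data$")         # "data" at the end of the string
whole = s.str.contains(r"\bdata\b")     # "data" as a standalone word

print(pd.DataFrame({"text": s, "^data": starts, "data$": ends, r"\bdata\b": whole}))
```

Only "data science" and "big data" satisfy the word-boundary version, because in "metadata" and "database" the word continues on one side.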

Groups and Alternation

Pattern	Meaning
`(...)`	Capture group
`(?P<name>...)`	Named capture group
`(?:...)`	Non-capturing group
`a|b`	a OR b
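The four constructs can be combined in one pattern. This is a sketch on invented dosage strings; the group names `amount` and `unit` are chosen here for illustration only.

```python
import pandas as pd

s = pd.Series(["500 mg", "20mg", "1.5 ml", "n/a"])   # invented sample data

# (?P<name>...) names the output columns,
# (?:...) groups the optional decimal part without creating a column,
# mg|ml matches either unit.
pattern = r"(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>mg|ml)"

parts = s.str.extract(pattern)
print(parts)
```

Rows that do not match the pattern, such as "n/a", come back as missing values in every column.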

Escaping

These characters are special in regex and need \ to match literally: . * + ? ( ) [ ] { } ^ $ | \
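To illustrate why escaping matters, the sketch below searches an invented string for a price. It also uses `re.escape` from the standard library, which adds the backslashes for you when the text to match is known in advance.

```python
import re

text = "Total: $5.00 (approx.)"   # invented test string

# Unescaped: "$" anchors the end of the string, so nothing can follow it.
unescaped = re.findall(r"$5.00", text)
print(unescaped)                  # []

# Escaped by hand: "$" and "." are now literal characters.
by_hand = re.findall(r"\$5\.00", text)
print(by_hand)                    # ['$5.00']

# re.escape produces an escaped pattern from literal text.
auto = re.findall(re.escape("$5.00"), text)
print(auto)                       # ['$5.00']
```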

Decision Guide: String Methods vs. Regex

Task	Tool	Example
Change case	String method	`.str.lower()`
Strip whitespace	String method	`.str.strip()`
Replace fixed text	String method	`.str.replace("old", "new", regex=False)`
Split on delimiter	String method	`.str.split(",")`
Check for fixed text	String method	`.str.contains("exact", regex=False)`
Match a pattern	Regex	`.str.contains(r"\d{3}-\d{4}")`
Extract parts of text	Regex	`.str.extract(r"(\d+)\s*(mg)")`
Match alternatives	Regex	`.str.contains(r"cat|dog")`
Whole word match	Regex	`.str.contains(r"\bdata\b")`
Validate format	Regex	`.str.match(r"^[A-Z]{2}\d{4}$")`

Rule of thumb: Start with string methods. Reach for regex only when you need pattern matching, extraction, or alternation.

Common Patterns for Data Science

# Assumes `import pandas as pd` and that s is a pandas Series of strings

# Extract a date (YYYY-MM-DD)
s.str.extract(r"(\d{4})-(\d{2})-(\d{2})")

# Extract a number with optional decimal
s.str.extract(r"(\d+\.?\d*)")

# Extract text inside parentheses
s.str.extract(r"\(([^)]+)\)")

# Match a phone number (flexible separators)
s.str.contains(r"\d{3}[-.\s]?\d{3}[-.\s]?\d{4}")

# Extract email components
s.str.extract(r"([\w.+-]+)@([\w-]+\.[\w.]+)")

# Remove HTML-like tags
s.str.replace(r"<[^>]+>", "", regex=True)

# Collapse whitespace
s.str.replace(r"\s+", " ", regex=True)
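To see what these patterns return, here is a sketch that runs several of them on two invented records. The sample text, names, and column labels are hypothetical.

```python
import pandas as pd

# Invented records combining a date, a tag, a role, an email, and a phone number.
s = pd.Series([
    "2023-04-15 <b>Ana</b> (admin) ana.lee@example.com 555-123-4567",
    "2021-12-01 <i>Bo</i> (guest) bo@mail.example.org 555.987.6543",
])

dates = s.str.extract(r"(\d{4})-(\d{2})-(\d{2})")
dates.columns = ["year", "month", "day"]          # label the three groups

roles = s.str.extract(r"\(([^)]+)\)")[0]          # text inside parentheses
emails = s.str.extract(r"([\w.+-]+)@([\w-]+\.[\w.]+)")
has_phone = s.str.contains(r"\d{3}[-.\s]?\d{3}[-.\s]?\d{4}")
no_tags = s.str.replace(r"<[^>]+>", "", regex=True)

print(dates)
print(roles.tolist())
print(emails)
print(no_tags.tolist())
```

Extracted values are strings, so a numeric column such as `year` still needs a conversion (for example with `astype(int)`) before arithmetic.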

Debugging Checklist

When your regex doesn't work as expected:

Test on a small string first: use re.findall() on a single example
Build incrementally: start with the simplest pattern that matches something, then add complexity
Check for unescaped special characters: . * + ? ( ) [ ] { } ^ $ | \ all need a preceding \ to match literally
Check greedy vs. lazy: if the pattern matches too much, add ? after your quantifier
Check for missing na=False: without it, str.contains() returns NaN for missing values, and a mask containing NaN cannot be used to filter rows
Check regex=True/False: make sure str.replace() is interpreting your pattern the way you intend
Use regex101.com: paste your pattern and test string for visual debugging, and select the Python flavor
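The first two checklist items can be practiced together. A minimal sketch, assuming an invented dosage string, that grows a pattern one piece at a time and checks each stage with `re.findall()`:

```python
import re

sample = "Dose: 500 mg twice daily"   # invented single test string

# Step 1: the simplest pattern that matches something.
step1 = re.findall(r"\d+", sample)
print(step1)    # ['500']

# Step 2: add the unit, allowing optional whitespace before it.
step2 = re.findall(r"\d+\s*mg", sample)
print(step2)    # ['500 mg']

# Step 3: add capture groups; findall now returns one tuple per match.
step3 = re.findall(r"(\d+)\s*(mg)", sample)
print(step3)    # [('500', 'mg')]
```

Once the pattern behaves on single strings, the same pattern can be passed to `.str.extract()` on the full column.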

Key Vocabulary

Term	Definition
str accessor	Pandas interface (`.str`) for applying string methods to a Series
Regular expression (regex)	A pattern-matching mini-language for describing text structure
Character class	A set of characters to match: `[A-Z]`, `\d`, `\w`
Quantifier	Specifies repetition: `+`, `*`, `?`, `{n}`
Capture group	Parenthesized part of a regex that extracts matched text
Greedy matching	Default: match as much as possible (`+`, `*`)
Lazy matching	Match as little as possible (`+?`, `*?`)
Anchor	Matches a position, not a character: `^`, `$`, `\b`
Text normalization	Process of making equivalent text values consistent
Tokenization	Splitting text into individual words or tokens
