Learning Objectives

  • Write regular expression patterns for common text matching tasks
  • Use the re module: search, match, findall, sub, split
  • Apply character classes, quantifiers, groups, and anchors
  • Extract structured data from unstructured text using capture groups
  • Know when regex is the right tool and when it's overkill

Chapter 22: Regular Expressions: Pattern Matching Power

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." — Jamie Zawinski, 1997

Chapter Overview

You've been processing text since Chapter 7. You know split(), strip(), replace(), find(), startswith(), and endswith(). Those string methods are workhorses — and for simple tasks, they're all you need.

But real-world text isn't simple.

Consider these problems: you need to find every phone number in a 10,000-line customer database, but the numbers appear as (555) 867-5309, 555-867-5309, 555.867.5309, and 5558675309. You need to pull dates out of log entries where the format shifts between 2025-03-14, 03/14/2025, and March 14, 2025. You need to validate that an email address is at least structurally plausible before sending a confirmation message.

String methods can handle one format. When the problem is "find a pattern that could appear in many shapes," you need a different tool.

That tool is the regular expression — regex for short. A regex is a mini-language for describing text patterns. Instead of saying "find the exact string 555-867-5309," you say "find three digits, then a separator, then three digits, then a separator, then four digits." The regex engine handles the rest.

Regular expressions appear everywhere in professional software. Log analysis, data cleaning, input validation, search-and-replace, web scraping, bioinformatics, network security — any domain where you process text at scale uses regex. Python's re module puts that power directly in your hands.

Fair warning: regex has a learning curve. The syntax is dense, and a complicated pattern can look like a cat walked across the keyboard. But once you learn to read it, regex becomes one of the most productive tools in your kit. A single line of regex can replace twenty lines of manual string parsing — and it'll handle edge cases you'd never think to code by hand.

In this chapter, you will learn to:

  • Write regex patterns for common text matching tasks
  • Use the re module functions: search, match, findall, sub, and split
  • Apply character classes, quantifiers, groups, and anchors to build precise patterns
  • Extract structured data from unstructured text using capture groups
  • Distinguish greedy from lazy matching and know when each is appropriate
  • Know when regex is the right tool — and when plain string methods are enough

Spaced Review: This chapter builds directly on Chapter 7 (string methods, raw strings, immutability) and draws on Chapter 10 (reading files line by line). You'll also see bridging connections to Chapter 21 (data processing pipelines).

Fast Track: If you're comfortable with basic regex syntax from another language, skim 22.1-22.3 and jump to 22.6 (groups).

Deep Dive: The case studies explore log file analysis with regex and how search engines parse user queries.


22.1 Why Regex? (Text Data Is Messy)

Let's start with a real problem. Dr. Patel — the biology researcher you first met in Chapter 1 — processes FASTA files containing DNA sequences. Each sequence has a header line like this:

>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens
>sp|Q9Y6K1|DNM3A_HUMAN DNA (cytosine-5)-methyltransferase 3A OS=Homo sapiens
>tr|A0A0C4DH68|A0A0C4DH68_HUMAN Isoform of P53 OS=Homo sapiens

Dr. Patel needs to extract just the gene names: P53_HUMAN, DNM3A_HUMAN, A0A0C4DH68_HUMAN. They always appear between the second | character and the first space after it.

With string methods, you'd write something like:

def extract_gene_name(header: str) -> str:
    """Extract gene name from a FASTA header — the hard way."""
    # Find the second pipe character
    first_pipe = header.index("|")
    second_pipe = header.index("|", first_pipe + 1)
    # Find the space after the gene name
    space_after = header.index(" ", second_pipe + 1)
    return header[second_pipe + 1 : space_after]

That works, but it's fragile. What if a header has no space after the gene name? What if the format changes slightly? And what if you need to also extract the organism name (Homo sapiens) and the protein identifier (P04637)? You'd need more index() calls, more slicing, more variables — and the code would be hard to read and harder to maintain.

With regex, the same extraction looks like this:

import re

pattern = r">(\w+)\|(\w+)\|(\w+)\s+(.+?)\s+OS=(.+)"
match = re.search(pattern, header)
if match:
    db, protein_id, gene_name, description, organism = match.groups()

One line of pattern, one line to extract five fields. And if the format changes, you update the pattern — not an entire parsing function.

Let's build up to that level of regex fluency step by step.

What Is a Regular Expression?

A regular expression (or regex, or regexp) is a sequence of characters that defines a search pattern. You write a pattern, and the regex engine scans through text looking for substrings that match.

At its simplest, a regex is just a literal string:

import re

text = "The cat sat on the mat."
result = re.search(r"cat", text)
print(result)  # <re.Match object; span=(4, 7), match='cat'>

The pattern cat matches the literal characters c, a, t in sequence. But regex gets its power from metacharacters — characters with special meaning that let you describe classes of strings rather than one specific string.

Raw Strings: The r Prefix

You'll notice we write regex patterns with the r prefix: r"cat" instead of "cat". This creates a raw string — one where backslashes are treated as literal characters, not escape sequences.

This matters because regex uses backslashes extensively. Without the r prefix, Python would interpret \d (a regex shorthand for "any digit") as an escape sequence. The raw string r"\d" passes the literal characters \ and d to the regex engine, which is what you want.

# Without raw string — Python interprets \b as a backspace character
pattern_bad = "\bword\b"

# With raw string — \b reaches the regex engine as a word boundary
pattern_good = r"\bword\b"

Best Practice: Always use raw strings for regex patterns. Always. Even when the pattern doesn't contain backslashes — it's a habit that prevents subtle bugs.


22.2 The re Module: search, match, findall

Python's re module is part of the standard library — no installation needed. It provides several functions for working with regex patterns. Let's meet the most common ones.

re.search() — Find the First Match Anywhere

re.search(pattern, string) scans the entire string and returns a match object for the first occurrence of the pattern, or None if no match is found.

import re

log_line = "2025-03-14 08:23:17 ERROR Database connection timeout"

# Search for a date pattern (we'll refine this later)
result = re.search(r"\d{4}-\d{2}-\d{2}", log_line)
if result:
    print(f"Found date: {result.group()}")   # Found date: 2025-03-14
    print(f"Position: {result.span()}")       # Position: (0, 10)

The match object gives you:

  • .group() — the actual text that matched
  • .span() — the start and end positions (as a tuple)
  • .start() and .end() — the individual positions
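Here's a quick illustration of .start() and .end(), reusing the log line from above:

```python
import re

log_line = "2025-03-14 08:23:17 ERROR Database connection timeout"

result = re.search(r"\d{4}-\d{2}-\d{2}", log_line)
if result:
    print(result.start())  # 0  — index where the match begins
    print(result.end())    # 10 — index just past the last matched character
```

Note that .end() follows Python's usual slicing convention: `log_line[result.start():result.end()]` gives you back the matched text.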

re.match() — Match at the Beginning Only

re.match(pattern, string) only checks whether the pattern matches at the start of the string. It won't find matches in the middle.

log_line = "2025-03-14 08:23:17 ERROR Database connection timeout"

# match() works here — the date IS at the start
result = re.match(r"\d{4}-\d{2}-\d{2}", log_line)
print(result.group())  # 2025-03-14

# match() fails here — "ERROR" isn't at the start
result = re.match(r"ERROR", log_line)
print(result)  # None

Pitfall: A common source of confusion. re.match() does NOT check whether the pattern matches the whole string — it checks whether the pattern matches at the start. If you need to match the entire string, use re.fullmatch().
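A minimal comparison makes the distinction concrete: re.match() happily matches a prefix, while re.fullmatch() demands that the pattern consume the whole string.

```python
import re

# match() succeeds — "123" is a valid prefix of the string
print(re.match(r"\d{3}", "123-4567"))      # <re.Match object; span=(0, 3), match='123'>

# fullmatch() fails — the whole string is not three digits
print(re.fullmatch(r"\d{3}", "123-4567"))  # None

# fullmatch() succeeds only when the pattern covers everything
print(re.fullmatch(r"\d{3}", "123"))       # <re.Match object; span=(0, 3), match='123'>
```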

re.findall() — Find All Matches

re.findall(pattern, string) returns a list of all non-overlapping matches. This is the function you'll reach for most often.

text = "Call 555-1234 or 555-5678. Fax: 555-9999."

# Find all phone-number-like patterns
numbers = re.findall(r"\d{3}-\d{4}", text)
print(numbers)  # ['555-1234', '555-5678', '555-9999']

No match objects — just a clean list of strings. When your pattern contains groups (Section 22.6), findall() returns the group contents instead.

re.split() — Split on a Pattern

Like str.split(), but the delimiter is a regex pattern:

text = "one,two;  three   four"

# Split on any combination of comma, semicolon, or whitespace
parts = re.split(r"[,;\s]+", text)
print(parts)  # ['one', 'two', 'three', 'four']

This is impossible with str.split(), which accepts only one fixed delimiter at a time (or None, which splits on runs of whitespace — but not on mixed punctuation).

Check Your Understanding #1

What's the difference between re.search(r"hello", text) and re.match(r"hello", text) when text = "Say hello!"?

Answer: `re.search()` finds `"hello"` at position 4 and returns a match object. `re.match()` returns `None` because `"hello"` doesn't appear at the *start* of the string — the string starts with `"Say"`.

22.3 Character Classes: [abc], \d, \w, \s

So far our patterns have been literal characters and \d (which we haven't formally explained). Let's fix that.

A character class matches any single character from a defined set. You write it inside square brackets.

import re

# Match any vowel
text = "Hello, World!"
vowels = re.findall(r"[aeiouAEIOU]", text)
print(vowels)  # ['e', 'o', 'o']

Ranges in Character Classes

Use a hyphen to specify a range:

# Any lowercase letter
re.findall(r"[a-z]", "Hello 123")       # ['e', 'l', 'l', 'o']

# Any digit
re.findall(r"[0-9]", "Hello 123")       # ['1', '2', '3']

# Any letter (upper or lower) or digit
re.findall(r"[a-zA-Z0-9]", "Hi! 42")   # ['H', 'i', '4', '2']

Negated Character Classes

A caret ^ inside brackets means "NOT these characters":

# Any character that is NOT a digit
re.findall(r"[^0-9]", "Hello 123")  # ['H', 'e', 'l', 'l', 'o', ' ']

Shorthand Character Classes

Typing [0-9] every time gets old. Regex provides shorthand:

| Shorthand | Equivalent | Meaning |
|-----------|------------|---------|
| `\d` | `[0-9]` | Any digit |
| `\D` | `[^0-9]` | Any NON-digit |
| `\w` | `[a-zA-Z0-9_]` | Any "word" character |
| `\W` | `[^a-zA-Z0-9_]` | Any NON-word character |
| `\s` | `[ \t\n\r\f\v]` | Any whitespace |
| `\S` | `[^ \t\n\r\f\v]` | Any NON-whitespace |
| `.` | (almost anything) | Any character except newline |

The dot . is the wildcard — it matches any single character except \n. It's powerful but dangerously broad. Use a specific character class when you can.
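To see just how broad the dot is, compare it with \d in the same position:

```python
import re

text = "abc a c a-c a7c"

# The dot accepts letters, spaces, punctuation, digits — anything but a newline
print(re.findall(r"a.c", text))   # ['abc', 'a c', 'a-c', 'a7c']

# \d accepts only digits, so it matches just one of the four
print(re.findall(r"a\dc", text))  # ['a7c']
```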

# Find all "words" (sequences of word characters)
text = "Dr. Patel has 3 gene files!"
words = re.findall(r"\w+", text)
print(words)  # ['Dr', 'Patel', 'has', '3', 'gene', 'files']

# Find all whitespace-separated tokens
tokens = re.findall(r"\S+", text)
print(tokens)  # ['Dr.', 'Patel', 'has', '3', 'gene', 'files!']

Connection to Chapter 7: Remember str.isdigit(), str.isalpha(), and str.isalnum() from Chapter 7? They test whether an entire string consists of certain character types. Character classes in regex let you match those same categories within a larger pattern.


22.4 Quantifiers: +, *, ?, {n,m}

A character class matches one character. Quantifiers specify how many times a character or class should repeat.

| Quantifier | Meaning | Example | Matches |
|------------|---------|---------|---------|
| `+` | One or more | `\d+` | "1", "42", "1000" |
| `*` | Zero or more | `\d*` | "", "1", "42" |
| `?` | Zero or one (optional) | `colou?r` | "color", "colour" |
| `{n}` | Exactly n times | `\d{4}` | "2025" (four digits) |
| `{n,}` | At least n times | `\d{2,}` | "42", "100", "1000" |
| `{n,m}` | Between n and m times | `\d{2,4}` | "42", "123", "2025" |

Let's put these together with character classes to build useful patterns.

import re

text = "Order #12345 shipped on 2025-03-14. Contact: support@example.com"

# Find the order number (# followed by digits)
order = re.search(r"#(\d+)", text)
print(order.group())   # #12345
print(order.group(1))  # 12345  (just the digits — we'll explain groups in 22.6)

# Find the date (four digits, dash, two digits, dash, two digits)
date = re.search(r"\d{4}-\d{2}-\d{2}", text)
print(date.group())    # 2025-03-14

Elena's Phone Number Problem

Elena Vasquez — our nonprofit data analyst — receives donor contact information in wildly inconsistent formats. Phone numbers appear as:

(555) 867-5309
555-867-5309
555.867.5309
5558675309
555 867 5309

With quantifiers and character classes, she can write one pattern that handles all of them:

# Optional opening paren, 3 digits, optional closing paren,
# optional separator (space, dash, or dot), 3 digits,
# optional separator, 4 digits
phone_pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"

phones = [
    "(555) 867-5309",
    "555-867-5309",
    "555.867.5309",
    "5558675309",
    "555 867 5309",
]

for phone in phones:
    match = re.search(phone_pattern, phone)
    if match:
        print(f"  Matched: {match.group()}")

Let's break down that pattern piece by piece:

| Fragment | Meaning |
|----------|---------|
| `\(?` | Optional opening parenthesis |
| `\d{3}` | Exactly three digits |
| `\)?` | Optional closing parenthesis |
| `[-.\s]?` | Optional separator: dash, dot, or space |
| `\d{3}` | Exactly three digits |
| `[-.\s]?` | Optional separator |
| `\d{4}` | Exactly four digits |

This is the power of regex. One pattern, five formats.

Check Your Understanding #2

What does the pattern r"[A-Z][a-z]*" match? Give three example strings that would match.

Answer: It matches a single uppercase letter followed by zero or more lowercase letters. Examples: `"H"`, `"Hello"`, `"Python"`. It would also match `"A"` (just one uppercase letter, since `*` allows zero lowercase letters).

22.5 Anchors and Boundaries: ^, $, \b

So far, our patterns float — they match wherever they appear in the string. Anchors pin a pattern to a specific position.

| Anchor | Meaning |
|--------|---------|
| `^` | Start of string (or start of line with `re.MULTILINE`) |
| `$` | End of string (or end of line with `re.MULTILINE`) |
| `\b` | Word boundary (between a `\w` and a `\W` character) |

Start and End Anchors

import re

lines = [
    "ERROR: disk full",
    "WARNING: disk space low",
    "INFO: disk check passed",
    "Disk ERROR found",
]

# Only lines that START with ERROR
for line in lines:
    if re.search(r"^ERROR", line):
        print(line)
# Output: ERROR: disk full

# Only lines that END with a specific word
for line in lines:
    if re.search(r"passed$", line):
        print(line)
# Output: INFO: disk check passed

Word Boundaries

The \b anchor is incredibly useful. It matches the boundary between a word character and a non-word character — it doesn't consume any characters, it just asserts a position.

text = "cat concatenate caterpillar scat category"

# Without \b — finds "cat" inside other words too
print(re.findall(r"cat", text))
# ['cat', 'cat', 'cat', 'cat', 'cat']

# With \b — finds only the standalone word "cat"
print(re.findall(r"\bcat\b", text))
# ['cat']

# Words that START with "cat"
print(re.findall(r"\bcat\w*", text))
# ['cat', 'caterpillar', 'category']
# ("concatenate" is NOT matched — its "cat" isn't at a word boundary)

Pitfall: Remember that \b needs a raw string! Without the r prefix, Python interprets \b as a backspace character, and your pattern silently fails to match what you expect.

Dr. Patel's FASTA Processing with Anchors

Dr. Patel's FASTA files mix header lines (starting with >) and sequence lines (containing only nucleotide characters). Anchors make it easy to separate them:

fasta_data = """>sp|P04637|P53_HUMAN Cellular tumor antigen p53
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP
>sp|Q9Y6K1|DNM3A_HUMAN DNA methyltransferase 3A
MAGEADRPNPEATVFEDSPLLERFLQTDYEHSVDISSLLDAGS"""

lines = fasta_data.strip().split("\n")

headers = [line for line in lines if re.match(r"^>", line)]
sequences = [line for line in lines if re.match(r"^[ACGT]+$", line)]

print(f"Headers:   {len(headers)}")    # Headers:   2
print(f"Sequences: {len(sequences)}")  # Sequences: 2

The pattern ^[ACGT]+$ means "the entire line consists of one or more nucleotide characters and nothing else." That's precise.


22.6 Groups and Capturing: (pattern) and Named Groups

Groups are where regex goes from "find this pattern" to "extract data." Parentheses create a capture group — they tell the regex engine "I want to remember this part separately."

Basic Capture Groups

import re

log = "2025-03-14 08:23:17 ERROR Database connection timeout"

# Capture year, month, day separately
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", log)
if match:
    print(match.group())   # 2025-03-14  (full match)
    print(match.group(1))  # 2025        (first group)
    print(match.group(2))  # 03          (second group)
    print(match.group(3))  # 14          (third group)
    print(match.groups())  # ('2025', '03', '14')

Groups are numbered left to right, starting at 1. group(0) (or just group()) is the entire match.

findall() with Groups

When your pattern contains groups, findall() returns the group contents — not the full match:

text = "Born: 1990-05-21, Graduated: 2012-06-15, Hired: 2015-09-01"

# Without groups — returns full matches
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)  # ['1990-05-21', '2012-06-15', '2015-09-01']

# With groups — returns tuples of group contents
parts = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)
print(parts)  # [('1990', '05', '21'), ('2012', '06', '15'), ('2015', '09', '01')]

This makes findall() with groups a powerful data extraction tool.
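The tuples pair naturally with Python's dict and comprehension machinery. A small sketch, reusing the dates example above (the label-and-date layout is just illustrative):

```python
import re

text = "Born: 1990-05-21, Graduated: 2012-06-15, Hired: 2015-09-01"

# Capture the label and the date together, then build a dict in one step
pairs = re.findall(r"(\w+): (\d{4}-\d{2}-\d{2})", text)
events = {label: date for label, date in pairs}
print(events)
# {'Born': '1990-05-21', 'Graduated': '2012-06-15', 'Hired': '2015-09-01'}
```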

Named Groups

Group numbers are fragile — if you add a group, all the numbers shift. Named groups let you refer to groups by name:

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
match = re.search(pattern, "Event on 2025-03-14")
if match:
    print(match.group("year"))   # 2025
    print(match.group("month"))  # 03
    print(match.group("day"))    # 14
    print(match.groupdict())     # {'year': '2025', 'month': '03', 'day': '14'}

The syntax (?P<name>pattern) creates a named group. The .groupdict() method returns all named groups as a dictionary — perfect for converting matched data into structured form.

Elena's Email Extraction

Elena needs to pull email addresses out of messy donor records. Here's a pattern that captures the username and domain separately:

email_pattern = r"(?P<user>[\w.+-]+)@(?P<domain>[\w.-]+\.\w{2,})"

records = [
    "Jane Doe <jane.doe@example.com> donated $500",
    "Contact: bob_smith@harbor-nonprofit.org",
    "Reply to info@example.co.uk for details",
]

for record in records:
    match = re.search(email_pattern, record)
    if match:
        print(f"  User: {match.group('user'):20s}  "
              f"Domain: {match.group('domain')}")

Output:

  User: jane.doe              Domain: example.com
  User: bob_smith             Domain: harbor-nonprofit.org
  User: info                  Domain: example.co.uk

Non-Capturing Groups

Sometimes you need parentheses for grouping (to apply a quantifier to a multi-character sequence) but you don't want to capture the contents. Use (?:pattern):

# Capturing group — "http" or "https" gets stored
re.search(r"(https?)://", "https://example.com").group(1)  # 'https'

# Non-capturing group — the grouping works, but nothing is stored
re.findall(r"(?:https?)://\S+", "Visit https://a.com or http://b.org")
# ['https://a.com', 'http://b.org']  — full matches, not group captures

22.7 Substitution with re.sub()

re.sub(pattern, replacement, string) finds all matches and replaces them. It's str.replace() on steroids.

Simple Replacement

import re

text = "Call 555-1234 or 555-5678 for info."

# Redact phone numbers
redacted = re.sub(r"\d{3}-\d{4}", "XXX-XXXX", text)
print(redacted)  # Call XXX-XXXX or XXX-XXXX for info.

Using Groups in Replacements

You can reference captured groups in the replacement string using \1, \2, etc. (or \g<name> for named groups):

# Reformat dates from YYYY-MM-DD to MM/DD/YYYY
text = "Start: 2025-03-14, End: 2025-06-30"
reformatted = re.sub(
    r"(\d{4})-(\d{2})-(\d{2})",
    r"\2/\3/\1",
    text,
)
print(reformatted)  # Start: 03/14/2025, End: 06/30/2025

The replacement \2/\3/\1 means: second group (month), slash, third group (day), slash, first group (year).

Using a Function as Replacement

For complex replacements, pass a function instead of a string. The function receives the match object and returns the replacement string:

def censor_email(match):
    """Replace the middle of an email username with asterisks."""
    user = match.group("user")
    domain = match.group("domain")
    if len(user) > 2:
        censored = user[0] + "*" * (len(user) - 2) + user[-1]
    else:
        censored = user[0] + "*"
    return f"{censored}@{domain}"

text = "Contact jane.doe@example.com or bob@test.org"
result = re.sub(
    r"(?P<user>[\w.+-]+)@(?P<domain>[\w.-]+\.\w{2,})",
    censor_email,
    text,
)
print(result)  # Contact j******e@example.com or b*b@test.org

re.sub() with a Count Limit

The optional count parameter limits how many replacements to make:

text = "aaa bbb ccc ddd"
print(re.sub(r"\w+", "X", text))           # X X X X
print(re.sub(r"\w+", "X", text, count=2))  # X X ccc ddd

22.8 Greedy vs. Lazy Matching

This is where beginners get bitten. By default, quantifiers are greedy — they match as much text as possible.

import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* matches as MUCH as possible
match = re.search(r"<.*>", html)
print(match.group())
# <b>bold</b> and <i>italic</i>     <-- matched too much!

The .* ate everything from the first < to the last > — which isn't what you wanted.

Lazy Quantifiers

Adding ? after a quantifier makes it lazy (also called non-greedy or reluctant). Lazy quantifiers match as little as possible.

| Greedy | Lazy | Meaning |
|--------|------|---------|
| `*` | `*?` | Zero or more (but as few as possible) |
| `+` | `+?` | One or more (but as few as possible) |
| `?` | `??` | Zero or one (prefer zero) |
| `{n,m}` | `{n,m}?` | Between n and m (prefer fewer) |

html = '<b>bold</b> and <i>italic</i>'

# Lazy: .*? matches as LITTLE as possible
tags = re.findall(r"<.*?>", html)
print(tags)  # ['<b>', '</b>', '<i>', '</i>']

# Extract tag contents
contents = re.findall(r"<\w+>(.*?)</\w+>", html)
print(contents)  # ['bold', 'italic']

The .*? stops matching as soon as it can — which means it stops at the first > it finds, not the last.

When to Use Each

  • Greedy (default): when you want the longest possible match. Usually right for patterns that don't have an ending delimiter, like \d+ (match all consecutive digits).
  • Lazy: when you're matching content between delimiters, like extracting text between HTML tags, quotes, or brackets.

Debugging Tip: If your regex matches more than you expected, check whether a greedy quantifier is gobbling up too much. Adding ? to make it lazy is usually the fix.
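One more idiom worth knowing: a negated character class often does the same job as a lazy quantifier, by stating exactly which characters the match may consume. A sketch using the HTML example:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# [^>]* means "any run of characters that are not >" — it can't overshoot
print(re.findall(r"<[^>]*>", html))  # ['<b>', '</b>', '<i>', '</i>']
```

Many regex veterans prefer this form because it states intent explicitly, though `<.*?>` is equally correct here.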

Check Your Understanding #3

What does re.findall(r'"(.*?)"', 'She said "hello" and "goodbye"') return?

Answer: `['hello', 'goodbye']`. The lazy `.*?` matches the shortest content between each pair of double quotes. Without the `?`, the greedy `.*` would match `hello" and "goodbye` as a single match.

22.9 When NOT to Use Regex

Here's an important truth that regex enthusiasts often forget: string methods are frequently better than regex.

Prefer String Methods When...

1. You're looking for a fixed string:

# YES — simple and clear
if "ERROR" in log_line:
    ...

# OVERKILL — regex for a fixed string
if re.search(r"ERROR", log_line):
    ...

2. You're splitting on a fixed delimiter:

# YES
parts = line.split(",")

# OVERKILL
parts = re.split(r",", line)

3. You're doing a simple replacement:

# YES
clean = text.replace("  ", " ")

# OVERKILL
clean = re.sub(r"  ", " ", text)

4. You're checking start/end:

# YES
if filename.endswith(".csv"):
    ...

# OVERKILL
if re.search(r"\.csv$", filename):
    ...

Use Regex When...

  • The pattern involves variability: multiple formats, optional parts, unknown length
  • You need to extract parts of the match (capture groups)
  • You're matching character classes (any digit, any letter)
  • The delimiter is complex or variable: split on any whitespace, any punctuation
  • You need anchored matching: word boundaries, start/end of line

The Readability Test

If a colleague can't understand your regex pattern in 30 seconds, consider whether string methods would be clearer — even if they take more lines. Code is read far more often than it's written.

# This regex is clever but opaque:
re.sub(r"(\w)(\w+)", lambda m: m.group(1).upper() + m.group(2).lower(), text)

# This is longer but immediately clear:
words = text.split()
result = " ".join(word.capitalize() for word in words)

Both title-case each word (the regex version preserves the original spacing, while split()/join() collapses runs of whitespace). The second version is better code.

String Methods vs. Regex: Quick Decision Guide

| Task | Use String Methods | Use Regex |
|------|:---:|:---:|
| Find/replace an exact substring | Yes | |
| Split on a single delimiter | Yes | |
| Check if string starts/ends with text | Yes | |
| Strip whitespace | Yes | |
| Validate a complex format | | Yes |
| Extract multiple fields from a line | | Yes |
| Find patterns across variable formats | | Yes |
| Split on multiple/variable delimiters | | Yes |
| Search with word boundaries | | Yes |

Theme: Errors are information, not failures (Ch 3). Don't beat yourself up when a regex doesn't work on the first try. Regex development is inherently iterative — write a simple pattern, test it, refine it, test again. Tools like regex101.com let you test patterns interactively and see exactly what matches.


22.10 Quick Reference: Regex Syntax at a Glance

Keep this table bookmarked. You'll refer to it constantly when writing regex.

Characters and Classes

| Pattern | Matches |
|---------|---------|
| `.` | Any character except newline |
| `\d` | Any digit `[0-9]` |
| `\D` | Any non-digit `[^0-9]` |
| `\w` | Any word character `[a-zA-Z0-9_]` |
| `\W` | Any non-word character `[^a-zA-Z0-9_]` |
| `\s` | Any whitespace `[ \t\n\r\f\v]` |
| `\S` | Any non-whitespace |
| `[abc]` | Any of a, b, or c |
| `[a-z]` | Any lowercase letter |
| `[^abc]` | Any character NOT a, b, or c |

Quantifiers

| Pattern | Meaning |
|---------|---------|
| `*` | Zero or more (greedy) |
| `+` | One or more (greedy) |
| `?` | Zero or one (optional) |
| `{n}` | Exactly n times |
| `{n,}` | n or more times |
| `{n,m}` | Between n and m times |
| `*?` | Zero or more (lazy) |
| `+?` | One or more (lazy) |
| `??` | Zero or one (lazy, prefer zero) |

Anchors and Boundaries

| Pattern | Meaning |
|---------|---------|
| `^` | Start of string/line |
| `$` | End of string/line |
| `\b` | Word boundary |
| `\B` | NOT a word boundary |

Groups and Alternation

| Pattern | Meaning |
|---------|---------|
| `(pattern)` | Capture group |
| `(?:pattern)` | Non-capturing group |
| `(?P<name>pat)` | Named capture group |
| `pattern1\|pattern2` | Alternation (match either) |

Common re Functions

| Function | Returns | Description |
|----------|---------|-------------|
| `re.search(pat, s)` | Match object or None | First match anywhere in string |
| `re.match(pat, s)` | Match object or None | Match at start of string only |
| `re.fullmatch(pat, s)` | Match object or None | Match entire string |
| `re.findall(pat, s)` | List of strings or tuples | All non-overlapping matches |
| `re.finditer(pat, s)` | Iterator of Match objects | Iterate over all matches |
| `re.sub(pat, repl, s)` | String | Replace all matches |
| `re.split(pat, s)` | List of strings | Split string by pattern |
| `re.compile(pat)` | Compiled pattern object | Precompile for repeated use |
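One function from this list deserves a quick demo: re.finditer() yields a full match object for every match, which is useful when you need positions as well as matched text.

```python
import re

text = "Call 555-1234 or 555-5678."

# Unlike findall(), finditer() gives Match objects, one at a time
for m in re.finditer(r"\d{3}-\d{4}", text):
    print(m.group(), m.span())
# 555-1234 (5, 13)
# 555-5678 (17, 25)
```

Because it returns an iterator, finditer() also avoids building a full list in memory, which matters when scanning very large texts.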

22.11 Compiling Patterns with re.compile()

If you use the same pattern repeatedly — say, inside a loop that processes thousands of lines — you can compile it once for better performance:

import re

# Compile the pattern once
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

log_lines = [
    "2025-03-14 08:23:17 ERROR disk full",
    "2025-03-14 08:24:01 INFO recovery started",
    "2025-03-14 08:25:33 WARNING disk space low",
]

for line in log_lines:
    match = date_pattern.search(line)  # Use the compiled pattern object
    if match:
        print(f"  {match.group('year')}/{match.group('month')}/{match.group('day')}")

The compiled pattern object has the same methods (search, match, findall, sub, split) as the re module functions, but you call them on the pattern object instead.

Performance note: Python caches recently used patterns internally, so compiling only matters when you have many distinct patterns or are processing very large datasets in tight loops. For typical scripts, the performance difference is negligible. Compile when it makes the code cleaner (giving the pattern a descriptive name), not just for speed.


22.12 Flags: Modifying Regex Behavior

Regex flags change how the pattern engine operates. The most useful ones:

| Flag | Short Form | Effect |
|------|------------|--------|
| `re.IGNORECASE` | `re.I` | Case-insensitive matching |
| `re.MULTILINE` | `re.M` | `^`/`$` match start/end of each line, not just the string |
| `re.DOTALL` | `re.S` | `.` matches newline characters too |

import re

# Case-insensitive search
text = "Error: file not found. ERROR: disk full. error: timeout."
errors = re.findall(r"error", text, re.IGNORECASE)
print(errors)  # ['Error', 'ERROR', 'error']

# Multiline: ^ matches start of each line
log = """INFO: started
ERROR: disk full
WARNING: low memory
ERROR: timeout"""

error_lines = re.findall(r"^ERROR:.*", log, re.MULTILINE)
print(error_lines)  # ['ERROR: disk full', 'ERROR: timeout']

Combine flags with the | operator: re.IGNORECASE | re.MULTILINE.
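A short sketch combining both flags:

```python
import re

log = """error: disk full
INFO: recovery started
Error: timeout"""

# Match "error" at the start of any line, in any letter case
print(re.findall(r"^error:.*", log, re.IGNORECASE | re.MULTILINE))
# ['error: disk full', 'Error: timeout']
```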


22.13 Project Checkpoint: TaskFlow v2.1

Time to put regex to work in our progressive project. In Chapter 21, TaskFlow v2.0 added CSV/JSON import/export and weather API integration. Now we'll add two regex-powered features:

  1. Advanced search — search tasks with regex patterns, not just plain keywords
  2. Natural language date parsing — interpret inputs like "next Tuesday" or "tomorrow" as actual dates

The current search_tasks() from v2.0 uses keyword in title.lower() — a simple substring match. Let's upgrade it to accept regex patterns:

import re

def search_tasks_regex(tasks: list[dict], pattern: str) -> list[dict]:
    """Search tasks using a regex pattern.

    Falls back to plain substring search if the pattern is
    not valid regex.
    """
    try:
        compiled = re.compile(pattern, re.IGNORECASE)
    except re.error:
        # Invalid regex — treat as literal substring
        return [t for t in tasks if pattern.lower() in t["title"].lower()]

    return [t for t in tasks if compiled.search(t["title"])]

Now users can search with patterns like:

  • "^Buy" — tasks that start with "Buy"
  • "meeting|call" — tasks containing "meeting" or "call"
  • "\d{4}" — tasks containing a 4-digit number
  • "report$" — tasks ending with "report"

Feature 2: Natural Language Date Parsing

This is a simplified date parser — it handles common phrases, not full natural language. The key insight: each phrase is a pattern that maps to a date calculation.

import re
from datetime import datetime, timedelta

WEEKDAYS = {
    "monday": 0, "tuesday": 1, "wednesday": 2, "thursday": 3,
    "friday": 4, "saturday": 5, "sunday": 6,
}

def parse_natural_date(text: str) -> datetime | None:
    """Parse simple natural language date expressions.

    Supports: 'today', 'tomorrow', 'next <weekday>',
    'in N days', and 'YYYY-MM-DD'.
    """
    text = text.strip().lower()
    today = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)

    # "today"
    if re.fullmatch(r"today", text):
        return today

    # "tomorrow"
    if re.fullmatch(r"tomorrow", text):
        return today + timedelta(days=1)

    # "next <weekday>"
    match = re.fullmatch(r"next\s+(\w+)", text)
    if match:
        day_name = match.group(1)
        if day_name in WEEKDAYS:
            target = WEEKDAYS[day_name]
            current = today.weekday()
            days_ahead = (target - current) % 7
            if days_ahead == 0:
                days_ahead = 7  # "next Monday" on a Monday means 7 days
            return today + timedelta(days=days_ahead)

    # "in N days"
    match = re.fullmatch(r"in\s+(\d+)\s+days?", text)
    if match:
        n = int(match.group(1))
        return today + timedelta(days=n)

    # ISO format: YYYY-MM-DD
    match = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", text)
    if match:
        try:
            return datetime(
                int(match.group(1)),
                int(match.group(2)),
                int(match.group(3)),
            )
        except ValueError:
            return None

    return None

This function uses re.fullmatch() to ensure the entire input matches a known pattern — not just a substring. Each branch handles one format, extracts the relevant data with groups, and computes the result.

Integrating Into the Main Loop

def add_task_with_due_date(tasks: list[dict]) -> None:
    """Add a task with an optional natural-language due date."""
    title = input("  Task title: ").strip()
    if not title:
        print("  Title cannot be empty.")
        return

    priority = input("  Priority (high/medium/low) [medium]: ").strip().lower()
    if priority not in {"high", "medium", "low"}:
        priority = "medium"

    category = input("  Category [general]: ").strip() or "general"

    due_input = input("  Due date (e.g., 'tomorrow', 'next Friday', "
                      "'in 3 days', '2025-04-01') [none]: ").strip()

    due_date = None
    if due_input:
        due_date = parse_natural_date(due_input)
        if due_date is None:
            print(f"  Could not parse '{due_input}' — no due date set.")
        else:
            print(f"  Due date: {due_date.strftime('%Y-%m-%d')}")

    task = {
        "title": title,
        "priority": priority,
        "category": category,
        "done": False,
        "created": datetime.now().strftime("%Y-%m-%d %H:%M"),
        "due": due_date.strftime("%Y-%m-%d") if due_date else None,
    }
    tasks.append(task)

What we built: TaskFlow v2.1 adds regex search (users can filter tasks with patterns like "^Buy" or "meeting|call") and natural language date parsing (users type "next Tuesday" and get a real date). Both features are powered by the re module.


Chapter Summary

Regular expressions give you a concise, powerful language for describing text patterns. You learned to:

  1. Import and use the re module: search, match, findall, sub, split
  2. Build patterns with character classes ([abc], \d, \w, \s), quantifiers (+, *, ?, {n,m}), and anchors (^, $, \b)
  3. Extract data with capture groups (pattern) and named groups (?P<name>pattern)
  4. Replace text with re.sub(), including group references and function-based replacements
  5. Control greediness by adding ? after quantifiers for lazy matching
  6. Know the limits — string methods are often simpler and clearer for straightforward tasks

Regex is a skill that rewards practice. You won't memorize every metacharacter on first reading — nobody does. Instead, keep the quick reference table handy, use tools like regex101.com to test patterns interactively, and build your fluency one pattern at a time.

Spaced Review: You'll see regex again in Chapter 24 (web scraping — extracting data from HTML) and the Capstone 3 (data dashboard — parsing mixed-format datasets). The pattern-matching mindset from this chapter will also help you think about data validation wherever it appears.

What's next: In Chapter 23, we shift from what you can do with Python to how you manage Python itself — virtual environments, pip, requirements.txt, and the ecosystem of third-party libraries that make Python the language of choice for everything from web development to data science.