Case Study: Diagnosing a Regex That Hangs (and One That Can't Exist)
"A model of computation is not a computer. It is an argument about what computers can do." — paraphrasing the spirit of Sipser's Introduction to the Theory of Computation
Executive Summary
A web service's input-validation layer is misbehaving in two different ways, and both turn out to be theory problems wearing an engineering costume. First, one validation regex makes the server hang for seconds on certain short inputs — a classic catastrophic-backtracking (ReDoS) incident. Second, a ticket asks you to "just write a regex" that accepts only strings of correctly nested brackets, and after a day of trying, nobody can. In this analysis-focused case study you will diagnose both, using exactly the machinery of this chapter: you will read a regex as a finite automaton (§35.4), explain the blow-up by counting the paths an NFA explores, and then prove — with the pumping lemma (§35.5) — that the second task is not merely hard but impossible in principle, because balanced brackets is not a regular language and therefore lies above finite automata on the Chomsky hierarchy (§35.6).
The deliverable is not new code; it is an understanding: which problems a regex can solve at all, and
which ones blow up or cannot be expressed. That understanding is what separates "add another * and
hope" from "this needs a parser, here's why."
Skills applied
- Reading a regular expression as the finite automaton it compiles to (Thompson's construction, §35.4).
- Explaining catastrophic backtracking in terms of the number of accepting paths an NFA can explore.
- Recasting a "valid input?" question as a language-recognition problem (§35.1).
- Applying the pumping lemma to prove a concrete language (balanced brackets) is not regular (§35.5).
- Placing a language in the Chomsky hierarchy and choosing the right tool for its level (§35.6).
Background
The scenario
You maintain the request-validation layer for an internal API. Two incidents land in the same sprint.
Incident A — the hang. Monitoring shows that a small fraction of requests pin a CPU core for several
seconds before failing. The offending requests all hit a field validated by this pattern, meant to check
"a sequence of as, optionally grouped, ending in b":
import re
# Intended: match strings like "ab", "aab", "aaab", ...
pattern = re.compile(r'^(a+)+b$')
print(bool(pattern.match("aaaaaaaaaaaaaaaaaaaaab"))) # matches, fine
print(bool(pattern.match("aaaaaaaaaaaaaaaaaaaaa!"))) # NO trailing b -> the slow case
# Expected output:
# True
# False
The first call returns quickly. The second — same length, but with the final b replaced by ! so the
match must fail — takes dramatically longer as the run of as grows. A 25-character string with no
trailing b can take seconds. That is the bug.
Incident B — the impossible ticket. A teammate files: "Validate that a configuration string of (
and ) is correctly nested — e.g. (()) is valid, (() and )( are not. Please just use a regex."
Several attempts (^\(*\)*$`, `^(\(\))*$, …) each accept some bad strings or reject some good ones. The
question we will actually answer is sharper than "what's the right regex?": it is "can any regex do
this at all?"
💡 Intuition: Both incidents are about the machine behind the regex. Incident A is about how many paths that machine explores; Incident B is about whether such a machine can exist. §35.4 told us a regex is a finite automaton; this case study cashes that in.
Why this matters
Regular expressions are the first line of input validation in nearly every system, and both failure modes here are real and common. ReDoS (regular-expression denial of service) has taken down production services and is a tracked vulnerability class. And the "write a regex for nested structure" request appears constantly — from validating brackets to "parsing" HTML with a regex — and is always the wrong tool for the same theoretical reason. Knowing the reason saves days.
Phase 1: Recast both incidents as language-recognition problems
Before touching either bug, name the language each regex is supposed to recognize (§35.1). A regex sorts $\Sigma^{*}$ into accept/reject; the set of accepted strings is the object to reason about.
- Incident A. Over $\Sigma = \{a, b\}$ (we will ignore
!; it simply forces a non-match), the pattern^(a+)+b$` is *intended* to denote $L_A = \{a^{n}b \mid n \ge 1\}$ — one or moreas then a singleb. Note immediately that $L_A$ is a perfectly **regular** language: the plain regexa+b(equivalently $aa^{*}b$) denotes it, and a four-state DFA recognizes it. So the *language* is easy; the trouble is entirely in the *chosen expression*(a+)+`. - Incident B. Over $\Sigma = \{(, )\}$, the target is $$L_B = \{\, w \in \{(,)\}^{*} \mid w \text{ is a string of correctly nested parentheses} \,\},$$ e.g. $\varepsilon, (), (()), ()(), \dots$ The question is whether $L_B$ is regular at all.
🔄 Check Your Understanding Give a four-state DFA (in words) for $L_A = \{a^{n}b \mid n \ge 1\}$. What does each state remember?
Answer
States: $S$ (start, no
ayet), $A$ ("seen at least onea"), $F$ (accept, "seenas then ab"), and a dead state $D$. Transitions: $S\xrightarrow{a}A$, $S\xrightarrow{b}D$; $A\xrightarrow{a}A$, $A\xrightarrow{b}F$; $F\xrightarrow{a}D$, $F\xrightarrow{b}D$; $D$ loops on everything. Accept $=\{F\}$. Each state remembers only "how far into the patterna…abwe are" — exactly the fixed, finite memory of §35.2. The language is regular; we will contrast this sharply with $L_B$.
Phase 2: Diagnose Incident A — count the paths the NFA explores
A regex engine that backtracks does not build the DFA; it simulates the NFA from Thompson's construction (§35.4), trying choices and backing up when they fail. Catastrophic backtracking happens when, on a failing input, the number of distinct ways the NFA can try to match grows exponentially.
Look at (a+)+ on the string $a^{n}$ (with the required trailing b absent, so every attempt must
ultimately fail at $). The inner a+ matches a non-empty run of as; the outer + repeats that group.
So a run of $n$ as can be carved into groups in every way you can split $n$ into an ordered sum of
positive parts — and there are $2^{n-1}$ such compositions of $n$ (a fact you proved in the
combinatorics chapters: each of the $n-1$ internal gaps is either a cut or not). The engine, looking for a
match that does not exist, may explore all of them before giving up.
We do not need to run anything to see the blow-up — we count it. Here is an instrumented model that counts how many group-splittings the backtracker would consider, mirroring the engine's search:
def split_attempts(n):
"""Number of ways (a+)+ can partition a^n into non-empty groups = compositions of n.
This is the count of distinct match attempts a backtracking engine may explore."""
if n == 0:
return 1 # empty: one (trivial) way
total = 0
for first_group in range(1, n + 1): # the first group takes 1..n of the a's
total += split_attempts(n - first_group)
return total
print([split_attempts(n) for n in range(1, 8)])
# Expected output:
# [1, 2, 4, 8, 16, 32, 64]
Hand-derive it to be sure the model is right: split_attempts(1) has only first_group = 1, then
split_attempts(0) = 1, so it is $1$. split_attempts(2): first_group=1 gives split_attempts(1)=1,
first_group=2 gives split_attempts(0)=1, total $2$. split_attempts(3):
$\,\texttt{split\_attempts}(2) + \texttt{split\_attempts}(1) + \texttt{split\_attempts}(0) = 2+1+1 = 4$.
The pattern $1, 2, 4, 8, \dots$ is $2^{n-1}$ — exactly the number of compositions of $n$. On a failing
input the engine's work is proportional to this, i.e. exponential in the length of the run of as.
⚠️ Common Pitfall: The blow-up is not a property of the language $L_A$ — that language is regular and a DFA decides it in $O(n)$ time with no backtracking at all. The blow-up is a property of the expression
(a+)+, whose nested quantifiers give the NFA exponentially many equivalent paths. The theory's lesson: equivalent regexes (same language) can have wildly different operational behavior, just as §35.3 warned that NFAs and DFAs are equal in power but not in size or simulation cost.
Where the paths come from: the Thompson NFA of (a+)+
It is worth seeing structurally why (a+)+ has exponentially many paths, because it makes the diagnosis
precise rather than folkloric. Apply Thompson's construction (§35.4) operator by operator. The inner a+
is aa*: an a-edge into a small loop that can consume one or more as, with an $\varepsilon$-edge
closing the loop after each a (the star's "loop back for more" gadget). The outer (...)+ wraps that
in another loop with its own $\varepsilon$ back-edge. So between consuming any two consecutive as, the
machine faces a choice of two $\varepsilon$-paths that both lead onward: "stay inside the current inner
group" or "close the inner group and re-enter via the outer loop." Each of the $n-1$ gaps between $n$ as
is an independent binary choice — and $2^{n-1}$ is exactly the number of ways to resolve $n-1$ independent
binary choices. That product-of-choices is the same $2^{n-1}$ the split_attempts model counted, now read
off the wiring diagram instead of the composition combinatorics.
💡 Intuition: Two nested quantifiers over the same symbol create overlapping $\varepsilon$-loops, and overlapping loops are where a backtracker multiplies its work. The red flag in any pattern is a quantified group whose body is itself quantified over content the outer quantifier can also match:
(a+)+,(a*)*,(a|a)*,(a|aa)+. Each makes "how do I carve this run?" ambiguous, and ambiguity on a failing input is what the engine pays for, branch by branch.
The crucial observation is that every one of those $2^{n-1}$ paths reaches the same NFA state after
consuming the as (they differ only in which $\varepsilon$-loops they took), so a machine that tracked the
set of reachable states — the subset simulation of §35.3 — would collapse them all into one. A
backtracker does not; it re-explores each path. The exponential cost is the price of not determinizing.
The fix, justified by theory
The remedy is to choose an expression whose NFA has one obvious path per input. The pattern a+b
(equivalently aa*b) denotes the same language $L_A$ but is unambiguous: there is exactly one way to match
a run of as followed by a b, so the engine never explores $2^{n-1}$ alternatives.
import re
fixed = re.compile(r'^a+b$') # same language as (a+)+b, no nested quantifier
print(bool(fixed.match("aaaab"))) # accept
print(bool(fixed.match("aaaa!"))) # fail FAST -- one path, not 2^(n-1)
# Expected output:
# True
# False
The failing case now returns immediately: the NFA for a+b has no overlapping repetition, so a
linear left-to-right scan settles it. We removed the nondeterminism that mattered for cost without
changing the language — a direct application of "many machines, one language" (§35.4).
🔗 Connection: A non-backtracking engine (one that simulates the NFA by tracking the set of states it could be in — the on-the-fly subset construction of §35.3, exactly the
nfa_acceptsdriver) runs any regex in time $O(\lvert w\rvert \cdot \lvert Q\rvert)$, with no catastrophic blow-up at all. That is why engines built on automata (e.g. RE2-style) are immune to ReDoS: they never enumerate paths, they track reachable states. The chapter's theory is the engineering fix.
Phase 3: Diagnose Incident B — prove no regex can exist
Now the impossible ticket. We will show $L_B$ (balanced parentheses) is not regular, so by Kleene's theorem (§35.4) no regular expression and no finite automaton recognizes it — the request cannot be satisfied, ever, no matter how clever the regex.
The strategy first. This is a non-regularity proof, so we use the pumping lemma by contradiction (§35.5): assume $L_B$ is regular, obtain a pumping length $p$, choose a string $s \in L_B$ that depends on $p$, use condition $\lvert xy\rvert \le p$ to corner what $y$ must be, then pump to produce an unbalanced string. The discipline (from §35.5): the adversary picks $p$ and the split; we pick $s$ and the exponent $i$, and we must defeat every legal split.
Theorem. $L_B = \{\text{strings of correctly nested parentheses}\}$ is not regular.
Proof. Suppose, for contradiction, that $L_B$ is regular. By the pumping lemma it has a pumping length
$p$. Choose the string
$$s = \underbrace{(\,(\,\cdots\,(}_{p}\ \underbrace{)\,)\,\cdots\,)}_{p} = (^{\,p}\,)^{\,p},$$
which is correctly nested (each of the $p$ opens is matched by one of the $p$ closes), so $s \in L_B$, and
$\lvert s\rvert = 2p \ge p$, so the lemma applies. The lemma gives a split $s = xyz$ with
$\lvert y\rvert > 0$ and $\lvert xy\rvert \le p$. Because $\lvert xy\rvert \le p$ and the first $p$
symbols of $s$ are all (, both $x$ and $y$ lie entirely within that opening block — so $y$ consists of
only (, say $y = (^{\,k}$ with $k \ge 1$ (using $\lvert y\rvert > 0$).
Now pump with $i = 0$ (delete $y$): the string
$$xy^{0}z = (^{\,p-k}\,)^{\,p}$$
has $p - k$ opening parentheses but $p$ closing ones. Since $k \ge 1$, there are strictly fewer ( than
), so the string is unbalanced — some ) has no matching ( — and therefore $xy^{0}z \notin L_B$. But
condition (3) of the pumping lemma demands $xy^{0}z \in L_B$. Contradiction. Therefore $L_B$ is not
regular. $\blacksquare$
So Incident B is unsolvable by regex — not because we are not clever enough, but because of a theorem.
🐛 Find the Error. A teammate counters: "But
^(\(\))*$works —(),()(),()()()all match, and(()doesn't!" They conclude $L_B$ is regular after all. Where is their error?
Answer
(\(\))*recognizes the language $\{(\,)\,(\,)\cdots(\,) \} = \{$()repeated $\} = \{(^{1})^{1}\}^{*}$ — strings of flat, adjacent pairs like()()(). That is a different (and genuinely regular) language, not $L_B$. It rejects the validly nested(())(two opens then two closes), which is in $L_B$. So the regex recognizes a regular subset/cousin of $L_B$, not $L_B$ itself. The pumping-lemma proof stands: no regex captures all of $L_B$, because nesting requires counting unbounded depth, which a finite automaton cannot do. (This is the same trap as "match HTML with a regex.")
Phase 4: Place $L_B$ on the hierarchy and pick the right tool
If a finite automaton cannot recognize $L_B$, what can? The Chomsky hierarchy (§35.6) answers: balanced
parentheses is the canonical context-free (Type 2) language, recognized by a pushdown automaton —
a finite automaton plus a stack. The stack is exactly the unbounded memory the pumping lemma just
proved a regex lacks: push on each (, pop on each ), accept iff the stack ends empty (and you never
pop an empty stack).
def balanced(s):
"""Pushdown-style recognizer for L_B: a DFA's logic plus ONE stack."""
stack = [] # the unbounded memory a regex cannot have
for ch in s:
if ch == '(':
stack.append('(') # push on open
elif ch == ')':
if not stack:
return False # close with nothing to match -> reject
stack.pop() # pop on close
return len(stack) == 0 # accept iff every open was matched
print(balanced("(())"), balanced("(()"), balanced(")("), balanced(""))
# Expected output:
# True False False True
Hand-trace each, watching the stack as the pushdown automaton would: "(())" pushes to depth $2$, pops
back to $0$ → accept; "(()" ends with one ( still on the stack → reject; ")(" tries to pop an empty
stack on the first ) → reject; "" never touches the stack → accept (the $n=0$ case). With a single
LIFO structure we have climbed from Type 3 (regular) to Type 2 (context-free) and recognized exactly the
language the pumping lemma put out of a regex's reach.
🔗 Connection: This is why a compiler has both a lexer and a parser (§35.6). The lexer is a finite automaton (regular) that chops source into tokens; the parser is a pushdown automaton (context-free) that recognizes nested syntax — matched braces, nested
if/else, parenthesized arithmetic. Incident B is the lexer-vs-parser boundary in miniature: nesting belongs to the parser, and asking a regex to do it is asking the lexer to do the parser's job. The right resolution to the ticket is "use a parser (a stack), not a regex," and now you can say why with a theorem, not a hunch.
Discussion Questions
- Incident A's language $L_A = \{a^{n}b\}$ is regular, yet the regex
(a+)+bis catastrophic on failing inputs. Explain to a teammate, in two sentences, the difference between "the language is easy" and "this expression is slow," referencing the $2^{n-1}$ compositions count from Phase 2. - The fix in Phase 2 changed the expression but not the language. Describe how a non-backtracking
engine (tracking the set of reachable states, à la
nfa_accepts) would handle the original(a+)+bwithout blowing up. Which theorem of §35.3 guarantees this is possible? - In the Phase 3 proof, we pumped with $i = 0$ (deletion). Would $i = 2$ also work to derive a
contradiction? Write the resulting string and check whether it is unbalanced. Why does some value of
$i$ always suffice once $y$ is cornered to be all
(? - The teammate's regex
(\(\))*recognizes a regular language. Describe that language precisely in English, and give one string it accepts that is not in $L_B$ and one string in $L_B$ that it rejects. - Suppose the ticket changes to "validate brackets nested at most 3 deep." Is that language regular? Argue informally why bounding the depth changes the answer. (Hint: what can now be remembered with finitely many states?)
Your Turn: Extensions
- Option A (measure the blow-up by hand). Extend the
split_attemptsmodel to also print the ratiosplit_attempts(n+1) / split_attempts(n)for $n = 1, \dots, 6$. Hand-derive the ratios and confirm the growth is a constant factor of $2$ per addeda— the signature of exponential backtracking. - Option B (a second non-regular language). Using the same pumping-lemma recipe, prove that $L = \{a^{n}b^{n} \mid n \ge 0\}$ is not regular. State your $s$, corner $y$ with $\lvert xy\rvert \le p$, and name the breaking exponent $i$. Compare your proof line-by-line with the Phase 3 proof.
- Option C (balanced with two bracket kinds). Extend
balancedto accept correctly nested()and[](so([])is valid but([)]is not), and explain which line enforces "the right kind of close." Then argue that this richer language is still context-free and still not regular. - Option D (depth-bounded is regular). Write a DFA (in words or as a
deltadict) that accepts parentheses nested at most $2$ deep, over $\Sigma = \{(, )\}$, by using states "depth 0, 1, 2, dead." Confirm that bounding the depth collapses the problem back to Type 3.
Key Takeaways
- A regex is a finite automaton, and its cost is about paths. Catastrophic backtracking is exponentially many NFA match-attempts on a failing input — here $2^{n-1}$ compositions — not a property of the (regular, easy) language. Equivalent expressions can differ wildly in speed (§35.3–35.4).
- Some validations are impossible by regex, provably. Balanced parentheses is not regular; the pumping lemma (§35.5) settles it for all infinitely many possible regexes at once. "Just use a regex" is the wrong tool whenever the structure is nested.
- The pumping-lemma discipline is the whole game. Pick $s$ to depend on $p$; let $\lvert xy\rvert \le p$ corner $y$; pump to break the invariant; you must beat every split, but the adversary's hands are tied by condition (2).
- Match the language to its rung. Regular → finite automaton (a regex, fast and stackless); context-free → pushdown automaton (a stack). The lexer/parser split in every compiler is this boundary (§35.6) — and now you can defend it with a proof.