Case Study: Diagnosing a Regex That Hangs (and One That Can't Exist)

DataField.Dev

Case Study: Diagnosing a Regex That Hangs (and One That Can't Exist)

"A model of computation is not a computer. It is an argument about what computers can do." — paraphrasing the spirit of Sipser's Introduction to the Theory of Computation

Executive Summary

A web service's input-validation layer is misbehaving in two different ways, and both turn out to be theory problems wearing an engineering costume. First, one validation regex makes the server hang for seconds on certain short inputs — a classic catastrophic-backtracking (ReDoS) incident. Second, a ticket asks you to "just write a regex" that accepts only strings of correctly nested brackets, and after a day of trying, nobody can. In this analysis-focused case study you will diagnose both, using exactly the machinery of this chapter: you will read a regex as a finite automaton (§35.4), explain the blow-up by counting the paths an NFA explores, and then prove — with the pumping lemma (§35.5) — that the second task is not merely hard but impossible in principle, because balanced brackets is not a regular language and therefore lies above finite automata on the Chomsky hierarchy (§35.6).

The deliverable is not new code; it is an understanding: which problems a regex can solve at all, and which ones blow up or cannot be expressed. That understanding is what separates "add another * and hope" from "this needs a parser, here's why."

Skills applied

Reading a regular expression as the finite automaton it compiles to (Thompson's construction, §35.4).
Explaining catastrophic backtracking in terms of the number of accepting paths an NFA can explore.
Recasting a "valid input?" question as a language-recognition problem (§35.1).
Applying the pumping lemma to prove a concrete language (balanced brackets) is not regular (§35.5).
Placing a language in the Chomsky hierarchy and choosing the right tool for its level (§35.6).

Background

The scenario

You maintain the request-validation layer for an internal API. Two incidents land in the same sprint.

Incident A — the hang. Monitoring shows that a small fraction of requests pin a CPU core for several seconds before failing. The offending requests all hit a field validated by this pattern, meant to check "a sequence of as, optionally grouped, ending in b":

import re
# Intended: match strings like "ab", "aab", "aaab", ...
pattern = re.compile(r'^(a+)+b$')

print(bool(pattern.match("aaaaaaaaaaaaaaaaaaaaab")))   # matches, fine
print(bool(pattern.match("aaaaaaaaaaaaaaaaaaaaa!")))   # NO trailing b -> the slow case
# Expected output:
# True
# False

The first call returns quickly. The second — same length, but with the final b replaced by ! so the match must fail — takes dramatically longer as the run of as grows. A 25-character string with no trailing b can take seconds. That is the bug.

Incident B — the impossible ticket. A teammate files: "Validate that a configuration string of ( and ) is correctly nested — e.g. (()) is valid, (() and )( are not. Please just use a regex." Several attempts (^$*$*$`, `^()*$, …) each accept some bad strings or reject some good ones. The question we will actually answer is sharper than "what's the right regex?": it is "can any regex do this at all?"

💡 Intuition: Both incidents are about the machine behind the regex. Incident A is about how many paths that machine explores; Incident B is about whether such a machine can exist. §35.4 told us a regex is a finite automaton; this case study cashes that in.

Why this matters

Regular expressions are the first line of input validation in nearly every system, and both failure modes here are real and common. ReDoS (regular-expression denial of service) has taken down production services and is a tracked vulnerability class. And the "write a regex for nested structure" request appears constantly — from validating brackets to "parsing" HTML with a regex — and is always the wrong tool for the same theoretical reason. Knowing the reason saves days.

Phase 1: Recast both incidents as language-recognition problems

Before touching either bug, name the language each regex is supposed to recognize (§35.1). A regex sorts $\Sigma^{*}$ into accept/reject; the set of accepted strings is the object to reason about.

Incident A. Over $\Sigma = \{a, b\}$ (we will ignore !; it simply forces a non-match), the pattern ^(a+)+b$` is *intended* to denote $L_A = \{a^{n}b \mid n \ge 1\}$ — one or moreas then a singleb. Note immediately that $L_A$ is a perfectly **regular** language: the plain regexa+b(equivalently $aa^{*}b$) denotes it, and a four-state DFA recognizes it. So the *language* is easy; the trouble is entirely in the *chosen expression*(a+)+`.
Incident B. Over $\Sigma = \{(, )\}$, the target is $$L_B = \{\, w \in \{(,)\}^{*} \mid w \text{ is a string of correctly nested parentheses} \,\},$$ e.g. $\varepsilon, (), (()), ()(), \dots$ The question is whether $L_B$ is regular at all.

🔄 Check Your Understanding Give a four-state DFA (in words) for $L_A = \{a^{n}b \mid n \ge 1\}$. What does each state remember?

Answer

States: $S$ (start, no a yet), $A$ ("seen at least one a"), $F$ (accept, "seen as then a b"), and a dead state $D$. Transitions: $S\xrightarrow{a}A$, $S\xrightarrow{b}D$; $A\xrightarrow{a}A$, $A\xrightarrow{b}F$; $F\xrightarrow{a}D$, $F\xrightarrow{b}D$; $D$ loops on everything. Accept $=\{F\}$. Each state remembers only "how far into the pattern a…ab we are" — exactly the fixed, finite memory of §35.2. The language is regular; we will contrast this sharply with $L_B$.

Phase 2: Diagnose Incident A — count the paths the NFA explores

A regex engine that backtracks does not build the DFA; it simulates the NFA from Thompson's construction (§35.4), trying choices and backing up when they fail. Catastrophic backtracking happens when, on a failing input, the number of distinct ways the NFA can try to match grows exponentially.

Look at (a+)+ on the string $a^{n}$ (with the required trailing b absent, so every attempt must ultimately fail at $). The inner a+ matches a non-empty run of as; the outer + repeats that group. So a run of $n$ as can be carved into groups in every way you can split $n$ into an ordered sum of positive parts — and there are $2^{n-1}$ such compositions of $n$ (a fact you proved in the combinatorics chapters: each of the $n-1$ internal gaps is either a cut or not). The engine, looking for a match that does not exist, may explore all of them before giving up.

We do not need to run anything to see the blow-up — we count it. Here is an instrumented model that counts how many group-splittings the backtracker would consider, mirroring the engine's search:

def split_attempts(n):
    """Number of ways (a+)+ can partition a^n into non-empty groups = compositions of n.
    This is the count of distinct match attempts a backtracking engine may explore."""
    if n == 0:
        return 1                       # empty: one (trivial) way
    total = 0
    for first_group in range(1, n + 1):    # the first group takes 1..n of the a's
        total += split_attempts(n - first_group)
    return total

print([split_attempts(n) for n in range(1, 8)])
# Expected output:
# [1, 2, 4, 8, 16, 32, 64]

Hand-derive it to be sure the model is right: split_attempts(1) has only first_group = 1, then split_attempts(0) = 1, so it is $1$. split_attempts(2): first_group=1 gives split_attempts(1)=1, first_group=2 gives split_attempts(0)=1, total $2$. split_attempts(3): $\,\texttt{split\_attempts}(2) + \texttt{split\_attempts}(1) + \texttt{split\_attempts}(0) = 2+1+1 = 4$. The pattern $1, 2, 4, 8, \dots$ is $2^{n-1}$ — exactly the number of compositions of $n$. On a failing input the engine's work is proportional to this, i.e. exponential in the length of the run of as.

⚠️ Common Pitfall: The blow-up is not a property of the language $L_A$ — that language is regular and a DFA decides it in $O(n)$ time with no backtracking at all. The blow-up is a property of the expression (a+)+, whose nested quantifiers give the NFA exponentially many equivalent paths. The theory's lesson: equivalent regexes (same language) can have wildly different operational behavior, just as §35.3 warned that NFAs and DFAs are equal in power but not in size or simulation cost.

Where the paths come from: the Thompson NFA of `(a+)+`

It is worth seeing structurally why (a+)+ has exponentially many paths, because it makes the diagnosis precise rather than folkloric. Apply Thompson's construction (§35.4) operator by operator. The inner a+ is aa*: an a-edge into a small loop that can consume one or more as, with an $\varepsilon$-edge closing the loop after each a (the star's "loop back for more" gadget). The outer (...)+ wraps that in another loop with its own $\varepsilon$ back-edge. So between consuming any two consecutive as, the machine faces a choice of two $\varepsilon$-paths that both lead onward: "stay inside the current inner group" or "close the inner group and re-enter via the outer loop." Each of the $n-1$ gaps between $n$ as is an independent binary choice — and $2^{n-1}$ is exactly the number of ways to resolve $n-1$ independent binary choices. That product-of-choices is the same $2^{n-1}$ the split_attempts model counted, now read off the wiring diagram instead of the composition combinatorics.

💡 Intuition: Two nested quantifiers over the same symbol create overlapping $\varepsilon$-loops, and overlapping loops are where a backtracker multiplies its work. The red flag in any pattern is a quantified group whose body is itself quantified over content the outer quantifier can also match: (a+)+, (a*)*, (a|a)*, (a|aa)+. Each makes "how do I carve this run?" ambiguous, and ambiguity on a failing input is what the engine pays for, branch by branch.

The crucial observation is that every one of those $2^{n-1}$ paths reaches the same NFA state after consuming the as (they differ only in which $\varepsilon$-loops they took), so a machine that tracked the set of reachable states — the subset simulation of §35.3 — would collapse them all into one. A backtracker does not; it re-explores each path. The exponential cost is the price of not determinizing.

The fix, justified by theory

The remedy is to choose an expression whose NFA has one obvious path per input. The pattern a+b (equivalently aa*b) denotes the same language $L_A$ but is unambiguous: there is exactly one way to match a run of as followed by a b, so the engine never explores $2^{n-1}$ alternatives.

import re
fixed = re.compile(r'^a+b$')                  # same language as (a+)+b, no nested quantifier
print(bool(fixed.match("aaaab")))             # accept
print(bool(fixed.match("aaaa!")))             # fail FAST -- one path, not 2^(n-1)
# Expected output:
# True
# False

The failing case now returns immediately: the NFA for a+b has no overlapping repetition, so a linear left-to-right scan settles it. We removed the nondeterminism that mattered for cost without changing the language — a direct application of "many machines, one language" (§35.4).

🔗 Connection: A non-backtracking engine (one that simulates the NFA by tracking the set of states it could be in — the on-the-fly subset construction of §35.3, exactly the nfa_accepts driver) runs any regex in time $O(\lvert w\rvert \cdot \lvert Q\rvert)$, with no catastrophic blow-up at all. That is why engines built on automata (e.g. RE2-style) are immune to ReDoS: they never enumerate paths, they track reachable states. The chapter's theory is the engineering fix.

Phase 3: Diagnose Incident B — prove no regex can exist

Now the impossible ticket. We will show $L_B$ (balanced parentheses) is not regular, so by Kleene's theorem (§35.4) no regular expression and no finite automaton recognizes it — the request cannot be satisfied, ever, no matter how clever the regex.

The strategy first. This is a non-regularity proof, so we use the pumping lemma by contradiction (§35.5): assume $L_B$ is regular, obtain a pumping length $p$, choose a string $s \in L_B$ that depends on $p$, use condition $\lvert xy\rvert \le p$ to corner what $y$ must be, then pump to produce an unbalanced string. The discipline (from §35.5): the adversary picks $p$ and the split; we pick $s$ and the exponent $i$, and we must defeat every legal split.

Theorem. $L_B = \{\text{strings of correctly nested parentheses}\}$ is not regular.

Proof. Suppose, for contradiction, that $L_B$ is regular. By the pumping lemma it has a pumping length $p$. Choose the string $$s = \underbrace{(\,(\,\cdots\,(}_{p}\ \underbrace{)\,)\,\cdots\,)}_{p} = (^{\,p}\,)^{\,p},$$ which is correctly nested (each of the $p$ opens is matched by one of the $p$ closes), so $s \in L_B$, and $\lvert s\rvert = 2p \ge p$, so the lemma applies. The lemma gives a split $s = xyz$ with $\lvert y\rvert > 0$ and $\lvert xy\rvert \le p$. Because $\lvert xy\rvert \le p$ and the first $p$ symbols of $s$ are all (, both $x$ and $y$ lie entirely within that opening block — so $y$ consists of only (, say $y = (^{\,k}$ with $k \ge 1$ (using $\lvert y\rvert > 0$).

Now pump with $i = 0$ (delete $y$): the string $$xy^{0}z = (^{\,p-k}\,)^{\,p}$$ has $p - k$ opening parentheses but $p$ closing ones. Since $k \ge 1$, there are strictly fewer ( than ), so the string is unbalanced — some ) has no matching ( — and therefore $xy^{0}z \notin L_B$. But condition (3) of the pumping lemma demands $xy^{0}z \in L_B$. Contradiction. Therefore $L_B$ is not regular. $\blacksquare$

So Incident B is unsolvable by regex — not because we are not clever enough, but because of a theorem.

🐛 Find the Error. A teammate counters: "But ^()*$ works — (), ()(), ()()() all match, and (() doesn't!" They conclude $L_B$ is regular after all. Where is their error?

Answer

()* recognizes the language $\{(\,)\,(\,)\cdots(\,) \} = \{$ () repeated $\} = \{(^{1})^{1}\}^{*}$ — strings of flat, adjacent pairs like ()()(). That is a different (and genuinely regular) language, not $L_B$. It rejects the validly nested (()) (two opens then two closes), which is in $L_B$. So the regex recognizes a regular subset/cousin of $L_B$, not $L_B$ itself. The pumping-lemma proof stands: no regex captures all of $L_B$, because nesting requires counting unbounded depth, which a finite automaton cannot do. (This is the same trap as "match HTML with a regex.")

Phase 4: Place $L_B$ on the hierarchy and pick the right tool

If a finite automaton cannot recognize $L_B$, what can? The Chomsky hierarchy (§35.6) answers: balanced parentheses is the canonical context-free (Type 2) language, recognized by a pushdown automaton — a finite automaton plus a stack. The stack is exactly the unbounded memory the pumping lemma just proved a regex lacks: push on each (, pop on each ), accept iff the stack ends empty (and you never pop an empty stack).

def balanced(s):
    """Pushdown-style recognizer for L_B: a DFA's logic plus ONE stack."""
    stack = []                          # the unbounded memory a regex cannot have
    for ch in s:
        if ch == '(':
            stack.append('(')           # push on open
        elif ch == ')':
            if not stack:
                return False            # close with nothing to match -> reject
            stack.pop()                 # pop on close
    return len(stack) == 0              # accept iff every open was matched

print(balanced("(())"), balanced("(()"), balanced(")("), balanced(""))
# Expected output:
# True False False True

Hand-trace each, watching the stack as the pushdown automaton would: "(())" pushes to depth $2$, pops back to $0$ → accept; "(()" ends with one ( still on the stack → reject; ")(" tries to pop an empty stack on the first ) → reject; "" never touches the stack → accept (the $n=0$ case). With a single LIFO structure we have climbed from Type 3 (regular) to Type 2 (context-free) and recognized exactly the language the pumping lemma put out of a regex's reach.

🔗 Connection: This is why a compiler has both a lexer and a parser (§35.6). The lexer is a finite automaton (regular) that chops source into tokens; the parser is a pushdown automaton (context-free) that recognizes nested syntax — matched braces, nested if/else, parenthesized arithmetic. Incident B is the lexer-vs-parser boundary in miniature: nesting belongs to the parser, and asking a regex to do it is asking the lexer to do the parser's job. The right resolution to the ticket is "use a parser (a stack), not a regex," and now you can say why with a theorem, not a hunch.

Discussion Questions

Incident A's language $L_A = \{a^{n}b\}$ is regular, yet the regex (a+)+b is catastrophic on failing inputs. Explain to a teammate, in two sentences, the difference between "the language is easy" and "this expression is slow," referencing the $2^{n-1}$ compositions count from Phase 2.
The fix in Phase 2 changed the expression but not the language. Describe how a non-backtracking engine (tracking the set of reachable states, à la nfa_accepts) would handle the original (a+)+b without blowing up. Which theorem of §35.3 guarantees this is possible?
In the Phase 3 proof, we pumped with $i = 0$ (deletion). Would $i = 2$ also work to derive a contradiction? Write the resulting string and check whether it is unbalanced. Why does some value of $i$ always suffice once $y$ is cornered to be all (?
The teammate's regex ()* recognizes a regular language. Describe that language precisely in English, and give one string it accepts that is not in $L_B$ and one string in $L_B$ that it rejects.
Suppose the ticket changes to "validate brackets nested at most 3 deep." Is that language regular? Argue informally why bounding the depth changes the answer. (Hint: what can now be remembered with finitely many states?)

Your Turn: Extensions

Option A (measure the blow-up by hand). Extend the split_attempts model to also print the ratio split_attempts(n+1) / split_attempts(n) for $n = 1, \dots, 6$. Hand-derive the ratios and confirm the growth is a constant factor of $2$ per added a — the signature of exponential backtracking.
Option B (a second non-regular language). Using the same pumping-lemma recipe, prove that $L = \{a^{n}b^{n} \mid n \ge 0\}$ is not regular. State your $s$, corner $y$ with $\lvert xy\rvert \le p$, and name the breaking exponent $i$. Compare your proof line-by-line with the Phase 3 proof.
Option C (balanced with two bracket kinds). Extend balanced to accept correctly nested () and [] (so ([]) is valid but ([)] is not), and explain which line enforces "the right kind of close." Then argue that this richer language is still context-free and still not regular.
Option D (depth-bounded is regular). Write a DFA (in words or as a delta dict) that accepts parentheses nested at most $2$ deep, over $\Sigma = \{(, )\}$, by using states "depth 0, 1, 2, dead." Confirm that bounding the depth collapses the problem back to Type 3.

Key Takeaways

A regex is a finite automaton, and its cost is about paths. Catastrophic backtracking is exponentially many NFA match-attempts on a failing input — here $2^{n-1}$ compositions — not a property of the (regular, easy) language. Equivalent expressions can differ wildly in speed (§35.3–35.4).
Some validations are impossible by regex, provably. Balanced parentheses is not regular; the pumping lemma (§35.5) settles it for all infinitely many possible regexes at once. "Just use a regex" is the wrong tool whenever the structure is nested.
The pumping-lemma discipline is the whole game. Pick $s$ to depend on $p$; let $\lvert xy\rvert \le p$ corner $y$; pump to break the invariant; you must beat every split, but the adversary's hands are tied by condition (2).
Match the language to its rung. Regular → finite automaton (a regex, fast and stackless); context-free → pushdown automaton (a stack). The lexer/parser split in every compiler is this boundary (§35.6) — and now you can defend it with a proof.

Case Study: Diagnosing a Regex That Hangs (and One That Can't Exist)

Executive Summary

Background

The scenario

Why this matters

Phase 1: Recast both incidents as language-recognition problems

Phase 2: Diagnose Incident A — count the paths the NFA explores

Where the paths come from: the Thompson NFA of (a+)+

The fix, justified by theory

Phase 3: Diagnose Incident B — prove no regex can exist

Phase 4: Place $L_B$ on the hierarchy and pick the right tool

Discussion Questions

Your Turn: Extensions

Key Takeaways

Where the paths come from: the Thompson NFA of `(a+)+`