Case Study: Verifying a Recursive Expression Evaluator by Structural Induction

DataField.Dev

Case Study: Verifying a Recursive Expression Evaluator by Structural Induction

"The structure of the program should reflect the structure of the data." — a maxim of recursive programming

Executive Summary

A calculator app, a spreadsheet, a database query planner, and a compiler all share one tiny, critical component: a function that takes a tree-shaped arithmetic expression and computes its value. In this case study you'll take a real evaluate function written over the recursively defined expression trees of §7.3, and you'll prove it correct for every possible expression — not by testing it on a hundred trees, but by structural induction over the very definition that generates them. The proof mirrors the code line for line: one case per clause of the definition, the recursive clauses handed an inductive hypothesis about their sub-expressions. This is the analysis side of the chapter — we are not building something new so much as certifying that an existing recursive procedure does what it claims, the same skill you'll need when you reason about the BFS, DFS, and tree traversals of Part V.

Skills applied

Recognizing a recursively defined set (expression trees) and the structural induction it licenses.
Translating "this evaluator is correct" into a precise statement $P(e)$ over all expressions $e$.
Writing a structural-induction proof whose cases line up with a recursive function's branches.
Using the inductive hypothesis on each sub-expression (the strong-induction flavor of recursion).
Distinguishing a property a proof does establish from one it leaves open (termination vs. value).

Background

The scenario

You've joined a team maintaining a small expression engine. The data type is exactly the recursive definition from §7.3, lightly relabeled, with binary +, *, and now - (subtraction) thrown in because the product team asked for it. Expressions are built bottom-up as nested tuples — every tuple is a build history, in the sense of §7.3's "factory" intuition:

# Recursive definition mirrored as tagged tuples (see Chapter 7, section 7.3):
#   basis:      ("num", v)         a numeric literal v
#   recursive:  ("+", a, b)        ("*", a, b)        ("-", a, b)
# where a, b are themselves expressions.

def evaluate(e):
    """Compute the numeric value of an expression tree e."""
    tag = e[0]
    if tag == "num":                 # basis clause
        return e[1]
    a = evaluate(e[1])               # recurse on the left sub-expression
    b = evaluate(e[2])               # recurse on the right sub-expression
    if tag == "+":
        return a + b
    if tag == "*":
        return a * b
    if tag == "-":
        return a - b

# Build  ((2 + 3) * 4)  and evaluate it:
expr = ("*", ("+", ("num", 2), ("num", 3)), ("num", 4))
print(evaluate(expr))
# Expected output:
# 20

It passes the one test in the pull request: $((2 + 3) \times 4) = 20$. But "it returned 20 on one tree" is the §7 refrain — computation tests, proof guarantees. Your task in review is to be sure evaluate(e) returns the mathematically correct value of $e$ for every expression the data type can represent, of any depth.

💡 Intuition: The function has exactly the shape of the recursive definition: one branch for the basis ("num") and one branch that first evaluates the two sub-expressions and then combines them. Because the code is structured by the definition, the proof can be too. That is the whole reason structural induction feels mechanical once you see it: you are walking the same two clauses the programmer already walked.

Why this matters

Every interpreter and compiler you will ever use rests on a function shaped like this one. A bug here is not cosmetic — it silently miscomputes spreadsheet totals, query filters, or shader math. And you cannot test your way to confidence: the set of expression trees is infinite (you can always nest one more operation), so no finite suite covers it. Structural induction is the only tool that certifies all of them at once, and it is the standard technique for reasoning about anything defined by a grammar (formal languages, Chapter 35) or built as a recursive data structure (trees, Chapter 31).

🔗 Connection. The expression set here is the §7.3 set $E$ with one extra operator. The proof you are about to write is the §7.4 parenthesis-counting proof's twin — same skeleton, different invariant. If you can do one, you can do the other; that transferability is theme three (abstraction) in action.

Phase 1: State the claim precisely

To prove evaluate correct, we first need a reference notion of an expression's true value — a mathematical function $\mathcal{V}$ defined directly on the recursive structure, independent of any code. We define $\mathcal{V}$ recursively, matching the data type's clauses:

$$\mathcal{V}\big(("num", v)\big) = v,$$ $$\mathcal{V}\big(("+", a, b)\big) = \mathcal{V}(a) + \mathcal{V}(b),$$ $$\mathcal{V}\big(("*", a, b)\big) = \mathcal{V}(a) \cdot \mathcal{V}(b),$$ $$\mathcal{V}\big(("-", a, b)\big) = \mathcal{V}(a) - \mathcal{V}(b).$$

This $\mathcal{V}$ is the specification: it says, in pure mathematics, what each expression ought to evaluate to. Now the correctness claim is crisp:

Claim. For every expression $e$ in the recursively defined set of expression trees, $\operatorname{evaluate}(e) = \mathcal{V}(e)$.

⚠️ Common Pitfall: proving a function equals itself. A frequent mistake is to define the "correct value" by running the code — then the claim "the code computes the correct value" is vacuously true and proves nothing. The specification $\mathcal{V}$ must be defined independently of the implementation (here, by the recursive equations above), so that the proof genuinely connects two different descriptions. Specification first, implementation second.

🔄 Check Your Understanding For the tree $e = ("-", ("num", 7), ("num", 2))$, compute $\mathcal{V}(e)$ from the equations above, and separately hand-trace evaluate(e). Do they agree?

Answer

$\mathcal{V}(e) = \mathcal{V}(("num", 7)) - \mathcal{V}(("num", 2)) = 7 - 2 = 5$. The code: tag is "-", so a = evaluate(("num", 7)) = 7, b = evaluate(("num", 2)) = 2, return 7 - 2 = 5. Both give $5$. (One agreement is reassuring, not a proof — that is what Phase 3 is for.)

Phase 2: Choose the induction and lay out the cases

The expression trees are a recursively defined set: a basis clause (numeric literals) and three recursive clauses (+, *, -). The natural proof is structural induction on $e$. Its shape is fixed by §7.4: one basis step for the basis clause, and one recursive step per recursive clause, each recursive step entitled to assume the claim for the sub-expressions it combines.

So the proof will have exactly four parts:

Clause of the definition	Proof obligation
basis: `("num", v)`	show $\operatorname{evaluate} = \mathcal{V}$ with no hypothesis
recursive: `("+", a, b)`	assume claim for $a$ and $b$; show it for the sum
recursive: `("*", a, b)`	assume claim for $a$ and $b$; show it for the product
recursive: `("-", a, b)`	assume claim for $a$ and $b$; show it for the difference

💡 Intuition: Notice the table is just the function's if-branches turned into proof cases. The basis case matches if tag == "num"; each recursive case matches one of the combining branches. When the proof's cases mirror the code's branches this exactly, you have set the proof up correctly.

The structural inductive hypothesis in each recursive case is: $\operatorname{evaluate}(a) = \mathcal{V}(a)$ *and* $\operatorname{evaluate}(b) = \mathcal{V}(b)$. We are allowed to assume the claim for both sub-expressions at once — this is the structural analogue of strong induction's "all smaller cases," and it is licensed by the exclusion clause (§7.3): every expression is built from literals by finitely many clause applications, so an induction on the number of build steps reaches every tree.

🧩 Productive Struggle. Before reading Phase 3, write the basis case yourself. What does evaluate(("num", v)) return, and what is $\mathcal{V}(("num", v))$? The case is one line — try it.

Phase 3: The structural-induction proof

Proof (by structural induction on the expression $e$). Let $P(e)$ be the statement $\operatorname{evaluate}(e) = \mathcal{V}(e)$.

Basis step ($e = ("num", v)$). The function takes the branch if tag == "num": return e[1], returning $v$. The specification gives $\mathcal{V}(("num", v)) = v$. So $\operatorname{evaluate}(e) = v = \mathcal{V}(e)$, and $P(e)$ holds for every literal.

Recursive step, + case ($e = ("+", a, b)$). Assume the inductive hypotheses $P(a)$ and $P(b)$: $$\operatorname{evaluate}(a) = \mathcal{V}(a), \qquad \operatorname{evaluate}(b) = \mathcal{V}(b).$$ On input $e$, the function computes a_val = evaluate(a), b_val = evaluate(b), and returns a_val + b_val. Therefore $$\operatorname{evaluate}(e) = \operatorname{evaluate}(a) + \operatorname{evaluate}(b) = \mathcal{V}(a) + \mathcal{V}(b) = \mathcal{V}\big(("+", a, b)\big) = \mathcal{V}(e),$$ where the middle equality is the two hypotheses and the third is the definition of $\mathcal{V}$ on a sum. So $P(e)$ holds.

Recursive step, * case ($e = ("*", a, b)$). Identical, with $\cdot$ for $+$: $$\operatorname{evaluate}(e) = \operatorname{evaluate}(a) \cdot \operatorname{evaluate}(b) = \mathcal{V}(a) \cdot \mathcal{V}(b) = \mathcal{V}(e).$$

Recursive step, - case ($e = ("-", a, b)$). Identical, with $-$: $$\operatorname{evaluate}(e) = \operatorname{evaluate}(a) - \operatorname{evaluate}(b) = \mathcal{V}(a) - \mathcal{V}(b) = \mathcal{V}(e).$$

All cases of the definition are covered, each recursive case using only the hypotheses about its sub-expressions. By structural induction, $\operatorname{evaluate}(e) = \mathcal{V}(e)$ for every expression tree $e$. $\blacksquare$

That is a complete correctness proof for an infinite family of inputs, and it never mentioned a single integer $n$ — the induction was on the shape of the tree, exactly as the code's recursion is.

🔗 Connection. Re-read the three recursive cases: they are copies of one another, differing only in the arithmetic operator. That repetition is a hint that the evaluator could be refactored to a single operator table — and the proof would collapse to one recursive case parameterized by the operator. Spotting "these cases are the same proof" is the abstraction instinct (theme three) that turns four lemmas into one.

Phase 4: What the proof does — and does not — guarantee

A correctness proof certifies one property. It is worth being precise about which.

We proved: if evaluate(e) returns a value, that value equals $\mathcal{V}(e)$. We did not prove that evaluate(e) terminates (returns at all) on every input — though here it does, and the reason is a §7.2 well-ordering argument: each recursive call is on a strictly smaller sub-expression (fewer build steps), and a strictly decreasing sequence of natural numbers cannot descend forever, because $\mathbb{N}$ has a floor. Termination and correctness-of-value are separate obligations; a fast proof habit is to handle them separately and explicitly.

Let's also test before we fully trust — theme four, computation and proof as partners. We compare evaluate against an independent re-derivation on a handful of hand-built trees:

# Three trees with values worked out BY HAND, then checked against evaluate:
t1 = ("num", 9)                                          # V = 9
t2 = ("+", ("num", 9), ("*", ("num", 2), ("num", 5)))   # V = 9 + (2*5) = 19
t3 = ("-", ("*", ("num", 4), ("num", 4)), ("num", 1))   # V = (4*4) - 1 = 15

print([evaluate(t1), evaluate(t2), evaluate(t3)])
# Expected output:
# [9, 19, 15]

Three agreements with the hand-computed values $9, 19, 15$. Reassuring — and because we also proved correctness, we know the agreement extends to every tree the test could never enumerate.

🐛 Find the Error. A teammate "optimizes" the + branch to return b + a (claiming addition is commutative, so the order is irrelevant) and at the same time, by copy-paste accident, changes the - branch to return b - a. The test trees above still print [9, 19, 15] by luck. Where does the proof now break, and on which tiny tree does the bug surface?

Answer

The + change is harmless: $b + a = a + b$, so that recursive case's algebra still ends at $\mathcal{V}(a) + \mathcal{V}(b)$. The - change is fatal: the branch now returns $\operatorname{evaluate}(b) - \operatorname{evaluate}(a) = \mathcal{V}(b) - \mathcal{V}(a)$, which is $-\mathcal{V}(e)$, not $\mathcal{V}(e)$ — the recursive - case no longer reaches $\mathcal{V}(e)$, so the proof fails exactly there. It surfaces on ("-", ("num", 7), ("num", 2)): the spec says $5$, the buggy code returns $2 - 7 = -3$. The three test trees missed it because $t_3$'s subtraction happened to be at a place the accident didn't disturb — a perfect illustration of why a proof beats a test suite.

Discussion Questions

The specification $\mathcal{V}$ and the implementation evaluate have identical recursive shapes. Is that a coincidence, or is it forced by the data definition? What would change in the proof if the data type added a unary negation node ("neg", a)?
We argued termination from "each recursive call is on a strictly smaller sub-expression." Make that precise: what natural-number quantity strictly decreases, and which §7.2 principle forbids it from decreasing forever?
The four cases of the proof were near-duplicates. Sketch how you would restructure both the code and the proof so that all binary operators share one recursive case. What is the single inductive hypothesis in the refactored proof?
Suppose expressions could also be malformed (e.g., a ("+", a) node missing its second operand). The exclusion clause of §7.3 says such objects are not in the set $E$. How does that clause protect the proof, and what would you have to add to handle untrusted input that might not respect the definition?
Compare this proof with §7.4's "equal parentheses" proof. Identify the one structural difference (what the invariant tracks) and the many similarities (case structure, use of the hypothesis).

Your Turn: Extensions

Option A (a new operator, analysis). Add an integer-max node ("max", a, b) with specification $\mathcal{V}(("max", a, b)) = \max(\mathcal{V}(a), \mathcal{V}(b))$. Extend evaluate and add the one new recursive case to the proof. Confirm the case is, again, a near-copy.
Option B (a derived measure, structural induction). Define $\operatorname{depth}(e)$ recursively (a literal has depth $0$; a binary node has depth $1 + \max$ of its children's depths) and prove by structural induction that the number of literals in $e$ is at most $2^{\operatorname{depth}(e)}$. This is the §7.5 tree-height bound wearing expression clothes.
Option C (test then prove). Write a small generator that builds random-shaped expression trees of bounded depth, and use it to compare evaluate against a second, independently written evaluator (e.g., one that converts to a string and is checked by hand). State exactly what agreement on 1,000 trees does and does not establish — and why the Phase 3 proof is what closes the gap.

Key Takeaways

Recursively defined data ⟹ structural induction. When the inputs are generated by a basis clause and recursive clauses, the correctness proof has one case per clause — the recursive cases assuming the claim for their sub-objects.
The proof mirrors the code. Each if-branch of the recursive function became one case of the proof; the recursive calls are the inductive hypotheses. Write code shaped by the data and the proof writes itself.
Specify independently of the implementation. The reference value $\mathcal{V}$ had to be defined by its own recursive equations, not "whatever the code returns," or the correctness claim would be empty.
Correctness of value and termination are separate. We proved the returned value is right; termination is a distinct well-ordering argument (each call shrinks the structure, and $\mathbb{N}$ has a floor).
A proof covers what no test can. The expression trees are infinite; one structural induction certified all of them, catching a bug (the swapped -) that three hand-picked tests sailed past.