Case Study: Building a Derivation Checker

DataField.Dev

Case Study: Building a Derivation Checker

"A proof is a program; the formula it proves is a type for the program." — the Curry–Howard correspondence, as commonly summarized

Executive Summary

In this build-focused case study you will write a small program that does what proof assistants like Coq, Lean, and Isabelle do at their core (§3.5–§3.6): it takes a derivation — a numbered list of statements, each citing a named inference rule and the earlier lines it was drawn from — and checks, line by line, that every step is legal. You will not write a prover (the hard, creative job of finding a proof); you will write a checker (the mechanical job of verifying one). That asymmetry — hard to find, easy to check — is the whole reason the checker is short, and §3.5 promised it would be.

You will build the tool in four increments: a representation for formulas and lines, a pattern-matcher for two rules (modus ponens and modus tollens), a driver that walks a derivation and validates every line's citations, and finally an extension to the full §3.2 rule set plus the error-reporting that turns the checker into something you would actually trust. Along the way the checker will catch the bad derivation from §3.5's Check-Your-Understanding (the bogus "modus ponens" that concludes $q$ from $q \rightarrow r$ and $q$).

Skills applied

Representing propositional formulas and derivation lines as data (§3.1 form made concrete).
Encoding inference rules as pattern matches on that data (§3.2).
Enforcing the derivation discipline: every line is a premise or follows from earlier lines by a named rule (§3.5).
Building incrementally and verifying each increment by hand-traced expected output (theme four).

Background: why build a checker at all?

When rustc accepts your program, or Lean accepts your proof, it has constructed (or verified) a derivation in a system of inference rules (§3.6). The trust we place in those tools rests on a small, auditable kernel: the part that checks each step is so simple you can believe it is correct. Everything clever — type inference, tactics, automation — produces candidate derivations that the kernel then checks. If the kernel is right, a bug anywhere else can only ever cause a rejection, never a false acceptance. That is why the checker, not the prover, is the trusted core.

We will build a toy kernel for propositional logic. It is small enough to fit in your head, which is exactly the property a kernel should have.

🔗 Connection: The "easy to check, hard to find" split you make concrete here is one of the deepest ideas in computer science. A short certificate (the derivation) that a fast checker can verify is precisely the structure behind the complexity class NP, and propositional satisfiability — the validity question of §3.1 turned inside out — is its canonical problem. We will study this head-on as SAT and P vs. NP in Chapter 37. Today you build the "checker" half by hand.

Phase 1: Represent formulas and derivation lines

A checker needs the statements as data, not prose. We use a tiny tree: a formula is either a variable name (a string) or a tuple (op, ...) for a connective. Three connectives suffice for this chapter's rules.

# A formula is:
#   a string                      -> a variable, e.g. "p"
#   ("not", f)                    -> negation
#   ("and", f, g) / ("or", f, g)  -> conjunction / disjunction
#   ("imp", f, g)                 -> implication f -> g
# Examples:
p, q, r = "p", "q", "r"
imp_pq  = ("imp", p, q)           # p -> q
not_q   = ("not", q)              # ~q

# A derivation line: (formula, rule_name, [cited line numbers]).
# Premises cite the rule "premise" and an empty list.
line1 = (imp_pq, "premise", [])
line2 = (p,      "premise", [])
line3 = (q,      "mp",      [1, 2])   # claim: q by modus ponens from lines 1, 2
print(line3)
# Expected output:
# ('q', 'mp', [1, 2])

Two design choices matter. First, structural equality is just ==: two formulas are the same statement iff their trees are equal, which Python tuples give us for free — essential, because every rule check reduces to "does this subtree equal that one?" Second, lines are 1-indexed in citations (to match how §3.5 numbers derivations on paper), so the driver will translate to 0-indexed list access.

💡 Intuition: A formula tree makes "pattern" literal. Modus ponens needs a line of the form ("imp", X, Y) and another line equal to X, and then licenses Y. With formulas as trees, "of the form $p \rightarrow q$" becomes "a tuple whose first element is "imp"" — a pattern your code can test directly. The abstract premises-over-line shapes of §3.2 become concrete if statements (theme three: the right representation makes the rule trivial to state).

Phase 2: Encode two rules as pattern matchers

Each inference rule becomes a function: given the cited formulas, does this rule license the claimed conclusion? Start with the two workhorses, modus ponens and modus tollens (§3.2). Each returns True iff the rule legitimately produces concl from the cited formulas.

def check_mp(cited, concl):
    """Modus ponens: from (X -> Y) and X, conclude Y. cited = [f1, f2]."""
    if len(cited) != 2:
        return False
    a, b = cited
    # Accept either citation order: one must be an implication whose
    # antecedent is the other cited line, and whose consequent is concl.
    for imp_form, other in ((a, b), (b, a)):
        if (isinstance(imp_form, tuple) and imp_form[0] == "imp"
                and imp_form[1] == other and imp_form[2] == concl):
            return True
    return False

def check_mt(cited, concl):
    """Modus tollens: from (X -> Y) and ~Y, conclude ~X. cited = [f1, f2]."""
    if len(cited) != 2 or not (isinstance(concl, tuple) and concl[0] == "not"):
        return False
    x = concl[1]                                   # concl is ~X, so X = concl[1]
    for imp_form, neg in ((cited[0], cited[1]), (cited[1], cited[0])):
        if (isinstance(imp_form, tuple) and imp_form[0] == "imp"
                and imp_form[1] == x                       # antecedent is X
                and neg == ("not", imp_form[2])):          # other line is ~Y
            return True
    return False

# Quick hand checks:
print(check_mp([("imp", "p", "q"), "p"], "q"))      # legitimate MP
print(check_mp([("imp", "p", "q"), "q"], "p"))      # AFFIRMING THE CONSEQUENT
print(check_mt([("imp", "p", "q"), ("not", "q")], ("not", "p")))  # legit MT
# Expected output:
# True
# False
# True

Look closely at the middle result. The call check_mp([p->q, q], p) is the program being asked to bless affirming the consequent (§3.3) — and it returns False, because no ordering of the cited lines matches "an implication whose antecedent is the other line and whose consequent is the conclusion." The implication's antecedent is p, not q, so the pattern fails. The fallacy is rejected not by a special rule against it, but simply because it does not match the valid pattern — exactly the §3.3 point that a fallacy is "the valid form run the wrong way."

⚠️ Common Pitfall: Accepting only one citation order. On paper we might write the implication first, but a checker should not care whether the user cited [p->q, p] or [p, p->q]. The loop over ((a, b), (b, a)) handles both. Forgetting this makes a correct derivation get rejected for a cosmetic reason — a false negative, which erodes trust in the tool as surely as a false positive.

Phase 3: Walk the derivation and enforce the discipline

Now the driver. It walks the lines in order and, for each non-premise line, (1) confirms every cited number refers to an earlier line (the forward-only rule from §3.5 that prevents circular reasoning), then (2) dispatches to the matching rule checker. If every line passes, the derivation is valid.

RULES = {"mp": check_mp, "mt": check_mt}   # extended in Phase 4

def check_derivation(lines):
    """lines: list of (formula, rule_name, [1-indexed cited line numbers]).
    Return (True, None) if every line is legal, else (False, offending_index)."""
    for i, (formula, rule, cites) in enumerate(lines):     # i is 0-indexed
        if rule == "premise":
            continue                                       # premises are free
        # (1) Citations must point strictly backward (1-indexed <= i).
        if any(c < 1 or c > i for c in cites):             # line numbers are 1..i
            return (False, i + 1)
        cited_formulas = [lines[c - 1][0] for c in cites]  # translate to 0-indexed
        # (2) The cited rule must legitimately produce this line.
        checker = RULES.get(rule)
        if checker is None or not checker(cited_formulas, formula):
            return (False, i + 1)
    return (True, None)

# The valid derivation of q from (p->q) and p:
good = [(("imp", "p", "q"), "premise", []),
        ("p",               "premise", []),
        ("q",               "mp",      [1, 2])]
print(check_derivation(good))
# Expected output:
# (True, None)

The driver returns (True, None) — every line is a premise or a legal modus ponens citing earlier lines. Two safety properties are now enforced mechanically: a line can never cite itself or a later line (the c > i test, with i+1 being the current line's own number), and a line's rule must actually produce its formula. A human reader needed creativity to write good; the checker needs none to verify it — it just pattern-matches one line at a time.

🔄 Check Your Understanding The bad derivation from §3.5 reads: line 5 is $q \rightarrow r$, line 6 is $q$, and line 7 claims "$q$ — Modus ponens, 5, 6." Without running it, what will check_mp([("imp","q","r"), "q"], "q") return, and why?

Answer

It returns False. For the implication ("imp","q","r"), modus ponens licenses its consequent r, not q. The pattern requires the conclusion to equal imp_form[2] (here r), but the claimed conclusion is q. So the checker rejects the step — catching exactly the §3.5 error that MP from $q \rightarrow r$ and $q$ yields $r$, not $q$.

Phase 4: Extend to the full rule set and report errors

A one-rule checker is a toy; a trustworthy one covers §3.2 and tells you where a derivation broke. Add the remaining rules as pattern matchers and register them. Each is a few lines, because each rule is just a pattern (§3.2).

def check_hs(cited, concl):     # hypothetical syllogism: X->Y, Y->Z |- X->Z
    if len(cited) != 2 or not (isinstance(concl, tuple) and concl[0] == "imp"):
        return False
    X, Z = concl[1], concl[2]
    for f, g in ((cited[0], cited[1]), (cited[1], cited[0])):
        if (isinstance(f, tuple) and f[0] == "imp" and f[1] == X and
            isinstance(g, tuple) and g[0] == "imp" and g[1] == f[2] and g[2] == Z):
            return True
    return False

def check_ds(cited, concl):     # disjunctive syllogism: X v Y, ~X |- Y
    for disj, neg in ((cited[0], cited[1]), (cited[1], cited[0])):
        if (isinstance(disj, tuple) and disj[0] == "or"
                and neg == ("not", disj[1]) and disj[2] == concl):   # ~X, keep Y
            return True
        if (isinstance(disj, tuple) and disj[0] == "or"
                and neg == ("not", disj[2]) and disj[1] == concl):   # ~Y, keep X
            return True
    return False

def check_simp(cited, concl):   # simplification: X and Y |- X (or Y)
    f = cited[0]
    return (isinstance(f, tuple) and f[0] == "and"
            and concl in (f[1], f[2]))

def check_conj(cited, concl):   # conjunction: X, Y |- X and Y
    return (isinstance(concl, tuple) and concl[0] == "and"
            and concl[1] in cited and concl[2] in cited)

def check_add(cited, concl):    # addition: X |- X v Y (any Y)
    return (isinstance(concl, tuple) and concl[0] == "or"
            and cited[0] in (concl[1], concl[2]))

RULES.update({"hs": check_hs, "ds": check_ds,
              "simp": check_simp, "conj": check_conj, "add": check_add})

Now feed the checker the fully worked multi-rule derivation from §3.2 — deriving $s$ from $p \rightarrow q$, $\neg q$, $p \lor r$, $r \rightarrow s$ — exactly as the chapter wrote it (modus tollens, then disjunctive syllogism, then modus ponens):

derivation = [
    (("imp", "p", "q"), "premise", []),     # 1  p -> q
    (("not", "q"),      "premise", []),     # 2  ~q
    (("or", "p", "r"),  "premise", []),     # 3  p v r
    (("imp", "r", "s"), "premise", []),     # 4  r -> s
    (("not", "p"),      "mt",  [1, 2]),     # 5  ~p   (modus tollens)
    ("r",               "ds",  [3, 5]),     # 6  r    (disjunctive syllogism)
    ("s",               "mp",  [4, 6]),     # 7  s    (modus ponens)
]
print(check_derivation(derivation))

# Now break it: claim line 7 by the WRONG rule (cite 3 and 5 again, as if MP).
broken = derivation[:6] + [("s", "mp", [3, 5])]
print(check_derivation(broken))
# Expected output:
# (True, None)
# (False, 7)

The genuine derivation passes; the tampered one is pinned to line 7. In broken, line 7 claims s by modus ponens from line 3 (p v r) and line 5 (~p) — but neither is an implication with consequent s, so check_mp returns False and the driver reports the offending line number. That is the whole behavior of a proof-assistant kernel in miniature: accept a legal derivation; reject an illegal one and point at the break.

🐛 Find the Error. A user submits a derivation whose line 4 is ("p", "ds", [4, 2]) — it cites itself (line 4) among its inputs. Which guard in check_derivation catches this, and what does the function return?

Answer

The citation guard any(c < 1 or c > i for c in cites) catches it. For line 4, i is 3 (0-indexed), so the current line's own number is i + 1 = 4; the citation c = 4 satisfies c > i (since 4 > 3), so the guard fires and the function returns (False, 4). This is the forward-only discipline of §3.5 enforced in one line: no step may rest on itself or anything later, which is exactly what rules out circular reasoning.

Phase 5: What the checker does — and does not — establish

The checker certifies that a derivation is well-formed: every line is a premise or a legal application of a named rule to earlier lines. Because each rule in RULES is valid (truth-preserving, §3.2) and the driver only ever applies a rule to already-established lines, a derivation the checker accepts is a genuine certificate that the conclusion is a valid consequence of the premises — $\Gamma \vdash C$ in the notation of §3.5.

But notice the two things it deliberately does not do, both straight from §3.1:

It says nothing about soundness. The checker is happy to verify a flawless derivation from the premise "all integers are even." Validity of the derivation is orthogonal to truth of the premises — the premises are the user's responsibility, the form is the checker's.
It does not find proofs. Hand it a true conclusion with no derivation and it offers no help. The creativity of §3.5 (work backward from the goal) lives entirely outside the kernel.

🚪 Threshold Concept: trust concentrates in the checker. Once you have a small checker you believe is correct, you can let anything propose derivations — a clever search, a flaky heuristic, even a language model — and still trust the results, because nothing reaches "accepted" without passing the kernel. This inversion (don't trust the prover; trust the checker) is how Coq and Lean earn confidence in proofs far too large for any human to read, and it is the practical payoff of §3.5's "hard to find, easy to check." You just built the trusted half.

Discussion Questions

check_mp accepts both citation orders but check_simp and check_add read only cited[0]. Is that a bug? For which rules does citation order or count genuinely matter, and which should tolerate extra or reordered citations?
The checker rejects affirming the consequent purely because it fails to match check_mp's pattern — there is no rule "against" it. Explain why a checker built from only valid rules can never accept a fallacy, connecting to the §3.1 definition of validity.
check_add lets X license X v Y for any Y the user writes in the conclusion. Is that faithful to the addition rule of §3.2, or too permissive? Justify using the rule's truth-table validity.
Suppose you added a buggy check_mp that also returned True for affirming the consequent. What kind of failure would the kernel then have — false acceptances or false rejections — and why is that the more dangerous kind for a proof assistant?
The kernel checks propositional rules only. What new machinery (data and checks) would check_ui and check_ei (§3.4) require, and which one needs a freshness test on the introduced name?

Your Turn: Extensions

Option A (resolution). Add check_res for resolution ($X \lor Y,\ \neg X \lor Z \vdash Y \lor Z$, §3.2). It is the most pattern-heavy rule; write three hand-traced test lines (including one bad one) and their # Expected output:. Resolution is the engine of the SAT solvers in Chapter 37.
Option B (better diagnostics). Make check_derivation return a human-readable message instead of a line number: e.g., "line 7: 'mp' cannot produce s from lines [3, 5]". Hand-trace the message for the broken derivation above.
Option C (quantifier rules). Extend the formula type with ("forall", var, body) and ("exists", var, body) and implement check_ui (substitute a term for the bound variable) and check_ei (require the introduced name to be fresh — appear in no earlier line). State, in a comment, exactly which earlier-lines check enforces freshness (§3.4).
Option D (cross-check with is_valid). For each propositional rule in RULES, use this chapter's is_valid (Project Checkpoint) to confirm the rule's pattern is genuinely valid over all assignments — i.e., verify your checker only encodes valid rules. Write the is_valid call for modus tollens and its hand-derived output. This is theme four: prove the kernel's rules valid, then let the kernel certify derivations.

Key Takeaways

A derivation checker is short because checking is easy. Verifying a proof is one local pattern-match per line; finding the proof (the creative §3.5 work) lives outside the kernel. That asymmetry is the design.
Rules become patterns on data. With formulas as trees, each §3.2 rule is a few-line function asking "do these subtrees fit the shape?" — abstraction (theme three) turning logic into code.
Fallacies are rejected by omission. The checker has no rule against affirming the consequent; it simply never matches a valid pattern. Building only valid rules makes accepting a fallacy impossible — the §3.1–§3.3 lesson, enforced structurally.
The kernel checks validity, not soundness, and never finds proofs. It certifies $\Gamma \vdash C$ given the premises; the truth of the premises and the discovery of the derivation are the human's job.
Trust concentrates in a small checker. Believe the kernel and you can accept derivations from any source — the principle behind Coq, Lean, and Isabelle, and a direct preview of the certificate-and- verifier structure of NP in Chapter 37.