Case Study: Auditing a Vendor's "Perfect Infinite-Loop Detector"

DataField.Dev

Case Study: Auditing a Vendor's "Perfect Infinite-Loop Detector"

"The first principle is that you must not fool yourself — and you are the easiest person to fool." — Richard P. Feynman

Executive Summary

A vendor is pitching your engineering org a static-analysis product, TerminusGuard, whose headline claim is: "Point it at any codebase and it flags every function that can hang forever — with zero false alarms." The deal is large. Before signing, your team is asked to evaluate the claim on its merits. This case study is that evaluation. You will translate the marketing into the precise language of Chapter 36, locate the exact claim inside the decidable/undecidable map, prove it cannot be met, and then — crucially — figure out what an honest version of the product could actually deliver and how to test the real thing you are buying. The skill on display is not building a tool; it is reading a specification through the lens of computability and knowing, before any code runs, which promises are physics and which are fantasy.

This is analysis, not construction. We write small probes and trace their expected output by hand to expose the claim's structure, not to build a detector.

Skills applied

Translating an informal product claim into a decision problem stated as a language.
Classifying that problem as decidable / recognizable / undecidable using §36.3 and §36.4.
Reproducing the halting-problem contradiction to disprove an impossible guarantee.
Distinguishing sound from complete behavior and reasoning about which a real tool must give up.
Designing a fair acceptance test for the achievable product behind the impossible pitch.

Background

The scenario

TerminusGuard's one-page spec lists three guarantees. Your job is to grade each one.

G1 (completeness). "For every function in your code, if it can run forever on some input, TerminusGuard flags it."

G2 (soundness). "TerminusGuard never flags a function that always finishes — zero false alarms."

G3 (totality). "TerminusGuard itself always returns a verdict in bounded time; it never hangs."

Marketing summarizes G1+G2+G3 as "a perfect, always-on halting checker." The engineer next to you mutters, "That's the halting problem. They can't have all three." Your task is to make that intuition rigorous — and then to say what TerminusGuard could honestly promise instead, so the team can evaluate the product that actually exists.

💡 Intuition: A "perfect halting checker" is exactly a decider for the halting problem (§36.3): it must answer, for any program, halts or loops, always correctly, always terminating. §36.4 proved no such decider exists. So at least one of G1, G2, G3 must fail. The analysis below pins down which, and shows the failure is forced by logic, not by the vendor's competence.

Why this matters

Procurement decisions get made on slide decks. "AI-powered," "provably complete," "catches every bug" are common phrases, and most reviewers cannot tell an aggressive-but-real claim from an impossible one. A working knowledge of undecidability is a buyer's superpower: it lets you reject impossible guarantees in the meeting, redirect the conversation to the achievable trade-off (which side does the tool err on?), and write an acceptance test that measures the real product. The same reasoning applies when you write the spec for an internal linter, a CI gate, or a verification tool.

Phase 1: Translate the claim into a decision problem

Strip the branding. The function TerminusGuard claims to compute takes a program (and, to be careful, an input) and returns a yes/no verdict. Written as a language — the chapter's standard move (§36.3) — the problem it claims to decide is precisely

$$ \mathrm{HALT} = \{\, \langle P, x\rangle : P \text{ is a program that halts on input } x \,\}. $$

The three guarantees map cleanly onto the two halves of "decides $\mathrm{HALT}$" plus the always-halts requirement:

Spec line	Formal meaning in terms of $\mathrm{HALT}$
G2 (soundness)	Whenever TerminusGuard says "this halts" (does not flag), it is right: it never flags a halter. Equivalently, every "no-flag" verdict is a true member of $\mathrm{HALT}$.
G1 (completeness)	Whenever $P$ loops, TerminusGuard flags it: it catches every non-member of $\mathrm{HALT}$.
G3 (totality)	TerminusGuard always returns a verdict — it is a decider, not merely a recognizer.

🔄 Check Your Understanding Which one of G1, G2, G3 corresponds to the difference between a recognizer and a decider?

Answer

G3 (totality). A recognizer for $\mathrm{HALT}$ already gives G2-style soundness and accepts every halter; what it cannot promise is to always return — i.e., to halt on the looping cases. G3 is exactly the upgrade from recognizer to decider, and §36.4 says that upgrade is impossible.

So the marketing summary "G1 ∧ G2 ∧ G3" literally asserts: there is a decider for $\mathrm{HALT}$. That is the proposition Chapter 36 demolished. We now reproduce the demolition aimed straight at TerminusGuard.

Phase 2: Disprove "all three" by the halting argument

Suppose, for contradiction, TerminusGuard satisfies G1, G2, and G3 simultaneously. Then we have, in hand, a program terminus(P, x) that always returns — "loops" if $P$ runs forever on $x$ (G1 + G3), and "halts" otherwise, always correctly (G2). That is a decider $H$ for $\mathrm{HALT}$. We feed it the §36.4 trap.

Here is the contradiction made explicit in code. We cannot actually run terminus (the theorem says it cannot exist); instead we assume it and watch every possible verdict on the contrarian program become wrong — exactly the chapter's check_consistency demonstration.

def make_D(terminus):
    """D(p): ask terminus whether p halts on p, then DO THE OPPOSITE."""
    def D(p):
        if terminus(p, p) == "halts":
            while True:        # terminus predicted 'halts' -> D loops forever
                pass
        return "halted"        # terminus predicted 'loops' -> D halts
    return D

def audit_verdicts():
    # What could terminus(D, D) possibly answer? Try both; D defies each.
    for verdict in ("halts", "loops"):
        actual = "loops" if verdict == "halts" else "halts"   # D does the opposite
        print(f"terminus(D,D)='{verdict}'  ->  D actually {actual}  ->  CONTRADICTION")

audit_verdicts()
# Expected output:
# terminus(D,D)='halts'  ->  D actually loops  ->  CONTRADICTION
# terminus(D,D)='loops'  ->  D actually halts  ->  CONTRADICTION

Read the two lines. If terminus(D, D) says "halts", then by construction D loops — so the verdict was wrong, violating G2 (a halter wrongly... no: a looper wrongly cleared) — and if it says "loops", then D halts, so the flag was false, violating G2 the other way. Either verdict is incorrect, so the correct, always-returning terminus we assumed cannot exist. The only assumption was "G1 ∧ G2 ∧ G3 hold," so that conjunction is false. $\blacksquare$

🚪 Threshold Concept. Notice what the audit did not do: it never inspected TerminusGuard's algorithm, never benchmarked it, never found a bug in their code. It refuted the specification using the program $D$ that any decider must mishandle. This is the buyer's leverage — you can rule out a guarantee from the spec alone, before a single trial, because the impossibility is in the problem, not in the implementation. No amount of engineering, funding, or "AI" moves this wall (§36.2: the Church–Turing thesis makes it a wall for every possible tool, not just this one).

⚠️ Common Pitfall: "Maybe their AI is just smarter." A tempting rebuttal: "Sure, a simple checker can't do it, but a sophisticated one (machine learning, heuristics, a huge model) might." No. The proof assumed nothing about how terminus decides — only that it is correct and always returns. A smarter method is still a method; if it met G1 ∧ G2 ∧ G3 it would be the decider $D$ defeats. Sophistication can shrink the gap; it cannot close it.

Phase 3: Which single guarantee is the lie? Probe each pair

The conjunction is impossible, but pairs of the guarantees are achievable — and which pair you keep determines what kind of tool you are buying. Let's hold each guarantee fixed and see what the other two must become. We probe with three tiny "programs under test," whose halting behavior we know by hand, to feel out each design.

# Three test programs with KNOWN behavior (we reason about them by hand):
def always_halts(x):  return x + 1            # halts on every input
def never_halts(x):
    while True: pass                          # loops on every input
def halts_late(x):
    i = 0
    while i < 10**9: i += 1                    # halts, but only after ~1e9 steps
    return i

programs = [("always_halts", always_halts),
            ("never_halts",  never_halts),
            ("halts_late",   halts_late)]

# A SOUND, TOTAL, but INCOMPLETE checker: only ever clears the trivially-safe ones,
# and abstains (flag = "maybe-loops") on everything it cannot prove halts.
def conservative_check(name):
    provably_halts = {"always_halts"}          # the subset it can prove
    return "ok" if name in provably_halts else "maybe-loops"

for name, _ in programs:
    print(f"{name:13} -> {conservative_check(name)}")
# Expected output:
# always_halts  -> ok
# never_halts   -> maybe-loops
# halts_late    -> maybe-loops

Trace the design choices this exposes:

Keep G2 + G3, drop G1 (sound, total, incomplete). This is conservative_check. It never clears a looper (G2 holds: it only says ok for always_halts, a true halter) and it always returns (G3 holds). The price is G1: it flags halts_late as maybe-loops even though that program does halt — a false alarm on a halter, i.e., incompleteness. This is what type checkers, borrow checkers, and termination checkers do (§36.6): they accept a provable subset and hedge on the rest.
Keep G1 + G3, drop G2 (complete, total, unsound). Flip the default: clear nothing unless you can prove it halts, and flag everything else as looping. Now every real looper is flagged (G1 holds) and the tool always returns (G3 holds), but halts_late gets a wrong "loops" verdict — a false alarm, violating G2.
Keep G1 + G2, drop G3 (sound, complete, but may hang). This is the honest recognizer of §36.3: run $P$ on $x$ and report. It is never wrong when it answers, but on never_halts it itself loops forever and never returns — G3 fails. (This is the "just run it and watch" approach the chapter warns about.)

🔄 Check Your Understanding TerminusGuard's pitch (G1 ∧ G2 ∧ G3) is impossible. Of the three achievable pairs above, which one would you want for a CI gate that blocks a merge when it fires, and why?

Answer

For a blocking gate, false alarms are expensive (they stop good merges), so you want G2 (sound) kept — never block a program that actually finishes. That means the sound-total-incomplete design (conservative_check): it may miss some real loopers (you accept incompleteness and catch those another way), but it never blocks correct code. Keeping G1 instead (complete-but-unsound) would block halts_late and infuriate developers. Keeping G1∧G2 (the recognizer) is a non-starter for CI because it can hang the build.

Phase 3.5: "But we only check your code" — why narrowing the domain doesn't rescue G1 ∧ G2 ∧ G3

A clever sales engineer offers a retreat: "Fine — we don't claim to handle arbitrary programs. We only analyze functions written in your codebase, in your style. Surely on that restricted, realistic class we can be sound, complete, and total." This is the most plausible-sounding rescue, and it is worth dismantling carefully, because the answer reveals exactly what would have to be true for the claim to hold.

The honest reply is conditional: it depends entirely on how restricted the class really is.

If "your code" is restricted to a class that is not Turing complete — say, functions provably built from bounded loops only, or expressible in a total language — then yes, halting is decidable on that class, and G1 ∧ G2 ∧ G3 become achievable for that class. This is not a loophole in §36.4; it is §36.6's "restrict the language" strategy. The catch: the vendor must then prove your code lives in that class, and most real codebases contain general while loops, recursion, and unbounded data structures — i.e., they are Turing complete.
If "your code" is still Turing complete — if a programmer could write a general loop in it — then the rescue fails, and we can prove it fails by a reduction (§36.5), the chapter's main tool.

Here is the reduction, because it is the rigorous core of the rebuttal. Suppose TerminusGuard were sound, complete, and total on the class $\mathcal{C}$ of functions expressible in your (Turing-complete) codebase. We show this would decide the general halting problem, contradicting §36.4. Given an arbitrary halting instance $\langle P, x\rangle$, build (emit the source of, never run) a function $P'$ in your codebase:

# The reduction gadget: emit a codebase-legal function whose termination = P halts on x.
def build_probe(P_name, x):
    """Return SOURCE of a function probe() (legal in 'your codebase') that
    halts  <=>  P halts on x.  We emit text; we never run P."""
    return (f"def probe():           # a perfectly ordinary function in your codebase\n"
            f"    {P_name}({x!r})     # run P on x; if that halts, probe falls through\n"
            f"    return 'done'       # so probe() halts  iff  {P_name}({x!r}) halts")

print(build_probe("P", "data"))
# Expected output:
# def probe():           # a perfectly ordinary function in your codebase
#     P('data')     # run P on x; if that halts, probe falls through
#     return 'done'       # so probe() halts  iff  P('data') halts

By construction, probe() halts if and only if $P$ halts on $x$. Because your codebase is Turing complete, probe is a legal function in it — exactly the kind TerminusGuard claims to handle perfectly. So we feed probe to the (assumed sound-complete-total) TerminusGuard: it returns "halts" or "loops," always correctly, always terminating. Reading off its verdict decides whether $P$ halts on $x$ — for arbitrary $P$ and $x$. That is a decider for the general halting problem, which §36.4 forbids. Contradiction. $\blacksquare$

The reduction makes the trade-off precise: TerminusGuard can be sound-complete-total exactly to the extent that your codebase is not Turing complete. There is no middle ground where "realistic code" buys back the impossible guarantee while still letting programmers write ordinary loops. Whenever a vendor says "but only on realistic inputs," the buyer's question is sharp and answerable: is the input class Turing complete? If a user can write a general loop, the guarantee is impossible; if they provably cannot, the vendor has changed the product into a (legitimate, but different) total-language tool.

🔗 Connection. This is the §36.5 reduction pattern, used here defensively: instead of proving a new problem undecidable, we use a reduction to prove a product claim impossible. Same machinery — "if your tool worked, I could solve $\mathrm{HALT}$, which I can't, so your tool can't work" — which is why the chapter calls reduction the working theorist's (and now the skeptical buyer's) main tool. Note the gadget build_probe is again a program-builder: it emits source and never runs $P$, exactly as §36.5 requires (Exercise 36.14).

⚠️ Common Pitfall: confusing "rare in practice" with "decidable." A subtler version of the rescue: "programs that exploit the halting paradox are pathological; normal code never does that, so in practice we're fine." This conflates two different claims. It may well be true that the contrarian program $D$ never appears in your repo — but undecidability does not require $D$ to appear. The reduction above used a perfectly ordinary probe that just calls another function; the impossibility rides on all loops, not on exotic ones. "Pathological inputs are rare" is a fine reason to expect a heuristic tool to work often; it is not a reason to believe a guarantee holds always.

Phase 4: What TerminusGuard can honestly sell — and how to test it

Undecidability is always about the general, unbounded problem (§36.6). The achievable product lives in the restricted, bounded world. A credible TerminusGuard spec would read:

Honest G1′ (bounded recall). TerminusGuard flags every function that fails to halt within a configured step/time budget $N$. Honest G2′ (soundness preserved). Any function TerminusGuard clears provably halts (within a conservative analysis it can justify); it never clears a function it cannot prove safe. Honest G3′ (totality preserved). TerminusGuard always returns in bounded time.

G1′ is decidable — it is the bounded halting question $\mathrm{HALT}_N$ = "halts within $N$ steps?", which you simulate and answer (§36.6; and Exercise 36.21). G2′ keeps soundness by analyzing a decidable approximation. The cost, openly stated, is the gap between "halts within $N$" and "halts ever," plus the halters G2′ cannot prove and therefore hedges on. That is a product you can reason about and buy.

Now design the acceptance test — the analysis deliverable your team actually needs. Test the achievable behavior, and design inputs that separate the honest claims from the impossible ones.

# An acceptance harness: feed programs with KNOWN behavior and check the tool's
# verdict against the HONEST spec (G1' bounded, G2' sound). We hand-derive truth.
budget = 10**6
def truth_within(name):                       # ground truth we compute by hand
    return {"always_halts": "halts",          # halts at step 1
            "never_halts":  "loops",          # never halts
            "halts_late":   "loops"}[name]    # halts at ~1e9 > budget -> "loops within budget" is False

# A spec-compliant tool MUST: flag never_halts (G1'), and may flag halts_late too
# (it exceeds the budget) -- that is ALLOWED under G1', NOT a soundness bug, because
# G2' only forbids *clearing* a non-halter, never forbids flagging a slow halter.
for name in ("always_halts", "never_halts", "halts_late"):
    must_flag = truth_within(name) == "loops"
    print(f"{name:13} budget={budget}: compliant tool must flag? {must_flag}")
# Expected output:
# always_halts  budget=1000000: compliant tool must flag? False
# never_halts   budget=1000000: compliant tool must flag? True
# halts_late    budget=1000000: compliant tool must flag? True

The harness encodes the key subtlety you must explain to procurement: under the honest spec, flagging halts_late is correct, not a false alarm, because the bounded claim is "halts within the budget," and halts_late does not. A naive reviewer who tested against the unbounded truth would file a false "soundness bug." Stating the spec precisely — bounded recall, preserved soundness — is what makes the test fair and the purchase decision sound.

🔗 Connection. The move "replace the undecidable halts ever? with the decidable halts within $N$?" is the same one behind watchdog timers, request timeouts, and smart-contract gas limits (§36.6). It is also why CI systems impose per-test time limits: they cannot detect every hang, so they bound the wait and flag the overruns. You will design exactly such a limit in the Your Turn section.

Discussion Questions

The audit in Phase 2 refuted G1 ∧ G2 ∧ G3 without examining TerminusGuard's algorithm. Explain to a non-technical buyer, in three sentences, why "we didn't even look at their code" is a strength of the argument, not a gap. Which chapter result licenses it for every possible tool (§36.2)?
Phase 3 showed three achievable pairs. For a tool that highlights suspicious code in the editor (advisory, non-blocking), which guarantee would you drop, and why does the cost calculus differ from the blocking-CI-gate case?
The honest spec keeps soundness (G2′) and bounds recall (G1′). Construct a function that the honest TerminusGuard must flag under a budget of $10^3$ steps but that a human can see obviously halts. Is the flag a bug? Tie your answer to the difference between $\mathrm{HALT}$ and $\mathrm{HALT}_N$.
Rice's theorem (§36.5) generalizes beyond halting to any nontrivial behavioral property. A rival vendor claims to "detect every function that ever returns null." Restate this as a Rice-type claim and explain, in two sentences, why it is impossible in general and what the honest version looks like.
Suppose TerminusGuard quietly weakens G2 (allows a few false alarms) to strengthen G1 (catch more loopers). Is that a defensible engineering trade for a security-critical codebase? Argue both sides using the sound/complete framing of §36.6.

Your Turn: Extensions

Option A (write the honest spec). Rewrite TerminusGuard's three guarantees as a buyable specification: a bounded, decidable recall claim plus an explicit statement of which side the tool errs on. Then list three test programs (with hand-derived behavior) whose verdicts would distinguish a spec-compliant tool from one secretly claiming the impossible.
Option B (the recognizer in code). Implement the honest recognizer for halting: recognize(P, x) that runs P(x) and returns True if it halts. Document, in the docstring, exactly the input on which it violates G3 (never returns), and explain why that is unavoidable for a recognizer (§36.3).
Option C (bounded checker). Using the halts_within pattern from the chapter's code and Exercise 36.13, implement a terminus_bounded(step_fn, state, budget) that returns "flag", "ok-within-budget", and never the dishonest "proven-halts". Hand-trace it on a program that halts at step budget+1 and explain why the verdict it returns is the only honest one.
Option D (procurement memo). Write a one-paragraph recommendation to your VP: accept, reject, or renegotiate the TerminusGuard deal, grounded in the analysis above. Name the impossible guarantee, the achievable substitute, and the single acceptance test that would protect the company.

Key Takeaways

A "perfect halting checker" is a decider for $\mathrm{HALT}$ — and §36.4 says it cannot exist. Any spec asserting complete + sound + always-returns halting detection is refuted by logic alone.
You can refute an impossible guarantee from the specification, before any benchmark. The contrarian program $D$ defeats any claimed decider, so no implementation — however sophisticated — can satisfy the pitch. The Church–Turing thesis makes this hold for every possible tool.
The achievable product keeps two of {sound, complete, total} and openly drops the third. Which two you keep is the real design decision: sound-total-incomplete for blocking gates, the recognizer when hanging is acceptable, complete-total-unsound when missing a looper is the worst outcome.
Bounding the resource turns the undecidable "halts ever?" into the decidable "halts within $N$?" The honest spec lives there, and the fair acceptance test measures bounded behavior — flagging a slow halter is correct under that spec, not a soundness bug.
Computability literacy is buyer's leverage. Reading a vendor claim through §§36.3–36.6 lets you reject fantasy, redirect to the real trade-off, and write a test that measures the product you are actually paying for.