Case Study: Auditing the Reasoning in an Incident Postmortem

DataField.Dev

"It is the mark of an educated mind to be able to entertain a thought without accepting it." — commonly attributed to Aristotle

Executive Summary

A production outage has been "resolved," and an on-call engineer has written up the postmortem. The write-up reads smoothly and the team is ready to sign off. Your job is not to debug the system but to audit the argument: to read the chain of reasoning that led from symptoms to a declared root cause and decide, step by step, whether each inference is valid (good form) and whether the whole thing is sound (good form and true premises). This is the analysis skill of §3.1–§3.3 turned into a real workplace task. By the end you will have classified every inferential step, caught the same fallacy hiding in plain sight twice, separated genuine invalidity from a merely false premise, and used the chapter's is_valid checker to confirm your hand analysis on the disputed steps.

The punchline is the most important lesson of the chapter applied to engineering: a persuasive, fluent-sounding argument can be invalid, and an outage can be declared "fixed" on the strength of reasoning that proves nothing. Spotting that is a debugging skill for thought.

Skills applied

Extracting an argument from prose: identifying premises and the conclusion (§3.1).
Testing validity by the counterexample definition, and naming the rule when an argument is valid (§3.1–§3.2).
Diagnosing affirming the consequent and denying the antecedent in the wild (§3.3).
Distinguishing invalidity (broken form) from unsoundness (a false premise) — different defects, different fixes (§3.3).
Confirming a hand analysis with the Toolkit's is_valid (Project Checkpoint).

Background

The scenario

At 02:14, an alert fired: the checkout service's p99 latency crossed its threshold and a fraction of requests returned 504 Gateway Timeout. The on-call engineer, call them R, mitigated by restarting the database. Latency recovered, and R wrote the postmortem below. (The scenario and write-up are a constructed teaching example, but every inferential move in it is one teams make constantly.)

Postmortem — Checkout 504s (excerpt).

1. If the database is overloaded, checkout requests time out. We know this from past incidents.
2. Checkout requests were timing out (we saw the 504s).
3. Therefore the database was overloaded.
4. If the database is overloaded, restarting it clears the overload.
5. We restarted the database and the timeouts stopped.
6. Therefore the restart cleared a database overload. Root cause confirmed: database overload.
7. Note: if there were a connection-pool leak, we would also see timeouts. We have no alert for a connection-pool leak firing.
8. Therefore there is no connection-pool leak, so we can close this without touching the pool code.

It is tidy. It even cites past incidents. The team is inclined to accept it. But "tidy" is not "valid," and your audit is about to find that the conclusions on lines 3, 6, and 8 are not established by the reasoning offered for them.

💡 Intuition: Read each "Therefore" as the $\therefore$ line of an argument. Above it sit the premises; below it sits a claim. The only question that matters for validity is the one from §3.1: is there any way for the premises to be true while this conclusion is false? If yes, the step is invalid — no matter how reasonable it sounds and no matter whether the conclusion later turns out to be true.

Why this matters

A postmortem's entire purpose is to establish a true root cause so the right fix is applied. If the root cause is accepted on invalid reasoning, two expensive things happen: the real bug stays in production (here, possibly a connection-pool leak), and the team "learns" a false lesson that pollutes the next incident ("last time it was the database, restart it"). Auditing the form of the reasoning — exactly the skill this chapter builds — is how you stop a confident-sounding story from closing a ticket that should stay open. This is theme two of the book in its rawest form: "the timeouts stopped" is not the same as "we found the cause."

Phase 1: Extract the arguments into symbols

Auditing prose starts by stripping each step to its skeleton (§3.1). Introduce propositional variables:

$D$ = "the database is overloaded"
$T$ = "checkout requests time out"
$C$ = "the restart cleared a database overload"
$L$ = "there is a connection-pool leak"
$A$ = "a connection-pool-leak alert fires"

Now rewrite the disputed inferences in premises-over-line form, starting with the two whose forms can be read straight off the prose.

Argument I (lines 1–3): the root-cause claim. $$ \begin{array}{l} D \rightarrow T \\ T \\ \hline \therefore\ D \end{array} $$

Argument III (lines 7–8): the "no leak" claim. ("If there's a leak, an alert fires" is the natural reading of line 7 combined with the team's monitoring assumption; line 7's stated premise is that the alert did not fire, i.e. $\neg A$.) $$ \begin{array}{l} L \rightarrow A \\ \neg A \\ \hline \therefore\ \neg L \end{array} $$

Argument II (lines 4–6) needs a little more care and gets its own phase, because its flaw is the subtler one. For now, notice that you have already done the hardest part of an audit: you have separated what is being claimed from the prose that decorates it.

🔄 Check Your Understanding Before reading on, look only at Argument I. Which classic pattern from §3.2–§3.3 does it match — modus ponens, modus tollens, affirming the consequent, or denying the antecedent?

Answer

Premises $D \rightarrow T$ and $T$, concluding $D$: that is affirming the consequent — the consequent $T$ is affirmed and the antecedent $D$ is (illegitimately) concluded. It is the cache argument from §3.1 wearing a database costume.

Phase 2: Audit Argument I (the root cause)

Argument I has the exact shape of the cache argument from §3.1. Settle it the honest way — a truth table over $D, T$ for premises $D \rightarrow T$ and $T$, conclusion $D$:

$D$	$T$	$D \rightarrow T$	$T$	$D$ (conclusion)
T	T	T	T	T
T	F	F	F	T
F	T	T	T	F
F	F	T	F	F

Row 3 is the counterexample: $D$ false, $T$ true makes both premises true while the conclusion is false. So Argument I is invalid — it commits affirming the consequent. In plain terms, the counterexample is the alternative explanation the team is ignoring: timeouts ($T$) can be true while the database is not overloaded ($D$ false) — for example, because a connection-pool leak is exhausting available connections, or a slow downstream dependency is blocking request threads. The implication "overload ⟹ timeouts" never promised that timeouts can only come from overload.
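
The table above can also be generated mechanically. The following is a minimal sketch, not the chapter's own code: the helper name `implies` and the row layout are assumptions made for illustration. It enumerates the four assignments in the same order as the table and flags any row where both premises hold and the conclusion fails.

```python
from itertools import product

def implies(p, q):
    # Material conditional: false only when p is True and q is False.
    return (not p) or q

rows = []
for D, T in product([True, False], repeat=2):    # same row order as the table
    premises_hold = implies(D, T) and T          # premises: D -> T, and T
    is_counterexample = premises_hold and not D  # conclusion D is false
    rows.append((D, T, implies(D, T), is_counterexample))

for D, T, d_implies_t, is_counterexample in rows:
    flag = "  <- counterexample" if is_counterexample else ""
    print(f"D={D!s:5} T={T!s:5} D->T={d_implies_t!s:5}{flag}")

# Collect the assignments that break the argument.
counterexamples = [(D, T) for D, T, _, bad in rows if bad]
print(counterexamples)
```

Only the row with $D$ false and $T$ true is flagged, which matches the hand analysis.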

⚠️ Common Pitfall — the symptom-to-cause leap. "Symptom $S$ would be produced by cause $X$; we observe $S$; therefore $X$." This is affirming the consequent every time, and it is perhaps the single most common reasoning bug in incident response. A symptom is consistent with many causes; observing it raises $X$ to "a hypothesis worth testing," never to "established." The valid move is to find evidence that distinguishes $X$ from its rivals (here: were connections actually saturated? Was CPU high on the DB? Did pool checkout time spike?).
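
One way to picture "distinguishing evidence" is as a filter over candidate causes. The sketch below is illustrative only: the `SIGNATURES` table and the function `consistent_causes` are hypothetical, and the predicted values are assumptions chosen for the example, not measurements from any real system.

```python
# Hypothetical signatures: what each candidate cause would predict.
# These values are illustrative assumptions, not real measurements.
SIGNATURES = {
    "database overload":    {"timeouts": True, "db_cpu_high": True},
    "connection-pool leak": {"timeouts": True, "db_cpu_high": False},
    "slow dependency":      {"timeouts": True, "db_cpu_high": False},
}

def consistent_causes(observed):
    """Return the candidates whose predictions match every observed value."""
    return [cause for cause, predicted in SIGNATURES.items()
            if all(predicted[key] == value for key, value in observed.items())]

# Observing only the symptom rules nothing out.
symptom_only = consistent_causes({"timeouts": True})
print(symptom_only)

# A hypothetical extra reading (DB CPU not high) narrows the field,
# but still does not single out one cause.
with_cpu_reading = consistent_causes({"timeouts": True, "db_cpu_high": False})
print(with_cpu_reading)
```

Under these assumed signatures, the symptom alone leaves all three candidates standing, which is the pitfall in miniature: an observation every hypothesis predicts cannot choose among them.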

So is the root-cause claim wrong? Not necessarily — and this is the subtle part. The database may in fact have been overloaded. But Argument I does not establish it. An invalid argument can reach a true conclusion by luck (§3.1); that does not make the argument trustworthy. The fix is not to fact-check the premises — they may all be true — it is to replace the form with a valid one backed by distinguishing evidence.

Phase 3: Audit Argument II (the "the fix worked, so that was the cause")

Lines 4–6 are seductive because they invoke the fix that actually made the symptom go away. Keep $D$ = "the database was overloaded" and read line 4 as "overload implies a restart clears it."

The reasoning is: overload would be cleared by a restart; we restarted and timeouts stopped; therefore there was an overload that the restart cleared. Strip it to its inferential core — "if the cause $D$ were present, action $R$ would fix the symptom; we did $R$ and the symptom went away; therefore $D$ was the cause" — and you have, once again, affirming the consequent: the symptom-disappearance is consistent with $D$, but also consistent with other causes that a database restart incidentally resolves.

Here is the killer counterexample, and it is a realistic one: restarting the database also resets the application's connection pool, because clients must reconnect. So if the true cause was a connection-pool leak ($L$), the restart would still make the timeouts stop, by clearing the leaked connections rather than by relieving any overload. The recovery tells you the restart touched whatever was causing the symptom; it does not tell you which cause that was. The conclusion "root cause confirmed: overload" does not follow.
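
The counterexample can be made concrete with a small enumeration. This is one possible formalization, offered as a sketch: the variable $S$ ("the timeouts stopped after the restart") and the premise $L \rightarrow S$ (a restart also clears a leak by resetting the pool) are assumptions introduced here to model the paragraph above.

```python
from itertools import product

def implies(p, q):
    # Material conditional: false only when p is True and q is False.
    return (not p) or q

# D = database was overloaded, L = connection-pool leak,
# S = timeouts stopped after the restart (names assumed for this sketch).
counterexamples = []
for D, L, S in product([False, True], repeat=3):
    # Premises: an overload would be cleared by the restart (D -> S),
    # a leak would be cleared too (L -> S), and the timeouts did stop (S).
    premises_hold = implies(D, S) and implies(L, S) and S
    if premises_hold and not D:       # conclusion "D was the cause" fails
        counterexamples.append((D, L, S))

for D, L, S in counterexamples:
    print(f"counterexample: D={D}, L={L}, S={S}")
```

The assignment with $D$ false and $L$ true is the leak scenario: every premise holds, yet the database was never overloaded.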

🚪 Threshold Concept: a fix that works is not a diagnosis that's correct. This is the engineering face of validity versus soundness. "We changed $X$ and the symptom cleared" affirms a consequent: many changes clear many symptoms for reasons other than your hypothesis. Internalizing this changes how you read every postmortem and every "I fixed it" — you start asking the two separate questions of §3.1: Is the inference valid (could the symptom have cleared for another reason)? and Even if valid, are the premises true? The most dangerous incidents are the ones "fixed" by a change that worked for a reason nobody understood.

🔄 Check Your Understanding The team could rescue a valid conclusion from the restart evidence by weakening the claim. Which of these does the evidence actually support? (a) "The database was overloaded." (b) "The restart cleared some condition that was causing timeouts." (c) "There is no connection-pool leak."

Answer

Only (b). The restart demonstrably affected something tied to the symptom, so the existential "some condition" is supported (it is essentially existential generalization, §3.4, from "this restart helped"). Claim (a) over-specifies which condition (affirming the consequent); claim (c) is a different fallacy entirely, audited next.

Phase 4: Audit Argument III (the "no leak" dismissal)

Argument III is the one that will let a real bug survive. Its form, from Phase 1: $$ \begin{array}{l} L \rightarrow A \\ \neg A \\ \hline \therefore\ \neg L \end{array} $$

Pause: this one is not affirming the consequent. The premises are "$L \rightarrow A$" and the negation of the consequent, $\neg A$, concluding the negation of the antecedent, $\neg L$. That is the shape of modus tollens — which is valid. Has the audit found a good step at last?

Only if the premise $L \rightarrow A$ is true. And here is where validity and soundness part company. The premise claims every connection-pool leak fires the alert: "if there is a leak, the alert fires." Is that true? Almost certainly not. A leak alert fires only if (i) a leak detector exists, (ii) it is correctly configured, (iii) the leak crossed its threshold within the alerting window, and (iv) the alerting pipeline itself was healthy during the incident. A slow leak, or a missing/misconfigured detector, means a leak with no alert — exactly $L$ true while $A$ false.

So Argument III is valid but unsound: the form (modus tollens) is impeccable, but the premise $L \rightarrow A$ is false. This is a completely different defect from Arguments I and II, and it has a completely different fix. You do not rewrite the argument's form — modus tollens is fine. You repair the premise: either establish that the detector truly fires on all leaks (then $\neg A$ really would give $\neg L$), or, recognizing you can't, stop concluding $\neg L$ and go look at the pool metrics directly.
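
To see how fragile the premise $L \rightarrow A$ is, consider a toy model of the four conditions listed above. The function `alert_fires` and its parameter names are hypothetical, and the model assumes the alert fires only when a leak is present and all four conditions hold. It is a sketch of the argument, not a description of any real monitoring stack.

```python
from itertools import product

CONDITIONS = ["detector_exists", "configured", "threshold_crossed", "pipeline_healthy"]

def alert_fires(leak, detector_exists, configured, threshold_crossed, pipeline_healthy):
    # Assumed toy model: the alert needs a leak AND all four conditions.
    return (leak and detector_exists and configured
            and threshold_crossed and pipeline_healthy)

# Enumerate every combination of conditions while a leak IS present,
# and keep the ones where the alert stays silent (L true, A false).
silent_leaks = [conds for conds in product([False, True], repeat=len(CONDITIONS))
                if not alert_fires(True, *conds)]

print(len(silent_leaks), "of", 2 ** len(CONDITIONS), "combinations hide a leak")
```

In this model the premise $L \rightarrow A$ holds in only one combination, the one where every condition is satisfied. Each of the others is a world with $L$ true and $A$ false, which is why $\neg A$ alone cannot deliver $\neg L$.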

💡 Intuition: Lining up the three audited steps shows the whole §3.3 taxonomy in one incident:

Step	Form	Valid?	If it fails, why?	Fix
I (root cause)	$D\!\to\!T,\ T \vdash D$	❌	affirming the consequent	replace the form; gather distinguishing evidence
II (fix worked)	symptom cleared $\vdash$ cause was $D$	❌	affirming the consequent	weaken claim to "some condition," or test the cause
III (no leak)	$L\!\to\!A,\ \neg A \vdash \neg L$	✅ form	unsound: premise $L\!\to\!A$ false	repair the premise, or don't conclude $\neg L$

Two invalid steps, one valid-but-unsound step. Diagnosing which kind of failure you are looking at is the entire point of §3.3 — because the repair is different for each.

Phase 5: Confirm the audit in code

Hand analysis is the proof; the chapter's is_valid is the cross-check (theme four — computation and proof, working together). Re-create is_valid from this chapter's Project Checkpoint and point it at the forms of Arguments I and III. Validity is a property of form, so we test the patterns, not the English.

from itertools import product

def implies(a, b):                 # p -> q is false only when p True, q False
    return (not a) or b

def is_valid(premises, conclusion, names):
    """True iff no assignment makes every premise True and the conclusion False."""
    for vals in product([False, True], repeat=len(names)):
        if all(p(*vals) for p in premises) and not conclusion(*vals):
            return False           # counterexample found
    return True

# Argument I form: (D -> T), T  |-  D     [affirming the consequent]
arg1 = is_valid([lambda D, T: implies(D, T), lambda D, T: T],
                lambda D, T: D, ["D", "T"])

# Argument III form: (L -> A), ~A  |-  ~L  [modus tollens — valid form]
arg3 = is_valid([lambda L, A: implies(L, A), lambda L, A: not A],
                lambda L, A: not L, ["L", "A"])

print("Argument I (root cause) valid:", arg1)
print("Argument III (no leak) valid:", arg3)
# Expected output:
# Argument I (root cause) valid: False
# Argument III (no leak) valid: True

The output confirms the audit's hardest distinction. is_valid returns False for Argument I — it found the counterexample row $D=$False, $T=$True — and True for Argument III, because modus tollens is a valid form. But read the second result carefully: is_valid checks validity, not soundness. It says Argument III's reasoning is airtight; it cannot and does not check whether the premise $L \rightarrow A$ is true in the real system. That second question — the one that actually saves you here — is the engineer's job, not the checker's. The tool verifies form; you verify the world. That division of labor is precisely the §3.1 lesson, now load-bearing in an on-call rotation.

🐛 Find the Error. A teammate, having read your audit, "fixes" line 8 to read: "We restarted the database and the timeouts stopped; if there were no leak, the restart would stop the timeouts; therefore there is no leak." Symbolize it, letting $R$ = "the restart stopped the timeouts" and writing the new conditional "no leak ⟹ restart stops timeouts" as $\neg L \rightarrow R$, then name the new flaw.

Answer

The new argument is "$(\neg L) \rightarrow R$; $\ R$; $\ \therefore \neg L$" — premises $(\neg L) \rightarrow R$ and $R$, concluding $\neg L$. That is affirming the consequent again (the antecedent is $\neg L$; the consequent $R$ is affirmed). The teammate replaced an unsound step with an invalid one — strictly worse. The restart stopping the timeouts is consistent with both "no leak" and "a leak the restart cleared," so it cannot establish either.
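
The same verdict can be cross-checked mechanically. The sketch below restates the chapter's `is_valid` so that it runs on its own, then applies it to the teammate's form. The result name `teammate_fix` is an assumption made for this example.

```python
from itertools import product

def implies(a, b):
    # p -> q is false only when p is True and q is False.
    return (not a) or b

def is_valid(premises, conclusion, names):
    """True iff no assignment makes every premise True and the conclusion False."""
    for vals in product([False, True], repeat=len(names)):
        if all(p(*vals) for p in premises) and not conclusion(*vals):
            return False           # counterexample found
    return True

# Teammate's form: (~L -> R), R  |-  ~L
teammate_fix = is_valid([lambda L, R: implies(not L, R), lambda L, R: R],
                        lambda L, R: not L, ["L", "R"])

print("Teammate's revised line 8 valid:", teammate_fix)
```

The checker reports the form as invalid. The assignment that breaks it is $L$ true with $R$ true: a leak that the restart happened to clear.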

Phase 6: Rewrite the postmortem with valid inferences

An audit that only finds faults is half an audit. Here is the same incident reasoned validly — note how every step is now either a named valid rule or an honestly hedged hypothesis:

1. Timeouts were observed ($T$). (Premise: observation.)
2. Candidate causes of $T$: database overload ($D$), connection-pool leak ($L$), slow dependency ($S$). Each satisfies "cause $\rightarrow T$", so observing $T$ does not select among them (we refuse to affirm the consequent).
3. Distinguishing evidence: DB CPU and active-connection metrics during the incident showed connection count pinned at the pool maximum while DB CPU stayed low. (Premise: measurement.)
4. "If the database were overloaded, DB CPU would be high" ($D \rightarrow \text{highCPU}$); CPU was not high ($\neg \text{highCPU}$); therefore not overloaded ($\neg D$). (Modus tollens: valid, and the premise is well-supported.)
5. Pinned-at-max connections with low DB load is the signature of a pool leak; combined with $\neg D$ and $\neg S$ (the dependency was healthy), the evidence points to $L$. (Inference to the best explanation: a hypothesis now worth a code fix and a regression test, not a closed ticket.)

The rewritten version reaches an actionable and defensible conclusion: investigate and patch the connection pool, add a real leak detector (so that a future $\neg A$ genuinely supports $\neg L$). The restart is recorded as mitigation, not diagnosis.
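
The forms used in the rewrite can be checked the same way. In the sketch below, $H$ stands for "DB CPU was high", and the premise $D \lor L \lor S$ (the candidate list is exhaustive) is an assumption added here to show what the elimination step depends on. The rewrite itself does not state that premise, which is why its last step is labeled a hypothesis rather than a deduction.

```python
from itertools import product

def is_valid(premises, conclusion, names):
    """True iff no assignment makes every premise True and the conclusion False."""
    for vals in product([False, True], repeat=len(names)):
        if all(p(*vals) for p in premises) and not conclusion(*vals):
            return False
    return True

# Modus tollens on overload: (D -> H), ~H  |-  ~D
overload_ruled_out = is_valid([lambda D, H: (not D) or H, lambda D, H: not H],
                              lambda D, H: not D, ["D", "H"])

# Elimination WITH the assumed exhaustive list: (D or L or S), ~D, ~S  |-  L
with_exhaustive_list = is_valid(
    [lambda D, L, S: D or L or S, lambda D, L, S: not D, lambda D, L, S: not S],
    lambda D, L, S: L, ["D", "L", "S"])

# The same elimination WITHOUT that premise: ~D, ~S  |-  L
without_exhaustive_list = is_valid(
    [lambda D, L, S: not D, lambda D, L, S: not S],
    lambda D, L, S: L, ["D", "L", "S"])

print("overload ruled out by modus tollens:", overload_ruled_out)
print("elimination, list assumed exhaustive:", with_exhaustive_list)
print("elimination, no exhaustiveness premise:", without_exhaustive_list)
```

The first two forms are valid and the third is not. Ruling out $D$ and $S$ deductively yields $L$ only if no fourth cause exists, so the honest label for the final step is "best explanation so far."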

Discussion Questions

Argument II ("the fix worked, so that was the cause") and Argument I ("the symptom appeared, so that's the cause") are both affirming the consequent, but they affirm different consequents. State the $p \rightarrow q$ for each and identify what plays the role of $q$ in each.
Argument III was valid but unsound. Construct a different incident in which the same modus tollens form ($L \rightarrow A,\ \neg A \vdash \neg L$) is both valid and sound — i.e., describe monitoring under which $L \rightarrow A$ is actually true.
The rewrite (Phase 6, step 4) uses modus tollens on "$D \rightarrow \text{highCPU}$." Why is this modus tollens trustworthy here while Argument III's modus tollens was not? (Hint: both are valid; compare the premises.)
is_valid returned True for Argument III even though the step was the one that would let a real bug survive. Write two or three sentences for your team explaining what a "valid" verdict from such a tool does and does not guarantee.
Affirming the consequent shows up whenever we reason from effects to causes. Yet good engineers do reason from symptoms to likely causes all the time. Reconcile these: what makes "inference to the best explanation" (Phase 6, step 5) different from the fallacy in Phase 2?

Your Turn: Extensions

Option A (audit your own postmortems). Take a real incident write-up (yours or a public one) and mark every "therefore." Classify each as a named valid rule, affirming the consequent, denying the antecedent, or valid-but-unsound. Tally how many survive.
Option B (a fallacy linter). Extend is_valid into diagnose(premises, conclusion, names) that returns the counterexample assignment (the first vals tuple making premises true and conclusion false) when an argument is invalid, and None when it is valid. Hand-trace it on Argument I and write the # Expected output: (it should surface $D=$False, $T=$True). This makes the abstract "there exists a counterexample" concrete.
Option C (the four moves table). For the implication "$D \rightarrow T$" from this case study, write out all four moves of the §3.3 table (affirm/deny antecedent, affirm/deny consequent), give the database-flavored English for each, and mark which two are valid. Use it as a checklist during your next on-call shift.
Option D (model a richer system). Add a fourth candidate cause to Phase 6 (say, a noisy-neighbor VM) and write the distinguishing-evidence modus tollens that would rule it out. State the premise your metric must satisfy for that modus tollens to be sound, not just valid.

Key Takeaways

Auditing reasoning is extracting arguments, then testing form. Strip each "therefore" to premises and a conclusion, then ask the §3.1 question: could the premises be true and the conclusion false?
Symptom-to-cause and fix-worked-so-cause-found are both affirming the consequent. A symptom is consistent with many causes; a fix can work for reasons other than your hypothesis. Neither establishes the cause — they nominate it.
Invalid and valid-but-unsound are different defects with different repairs. Arguments I and II needed a new form; Argument III's form (modus tollens) was fine and needed a true premise. Diagnose which failure you have before you "fix" it — patching the wrong one (as in the Find the Error) makes things worse.
A validity checker verifies form, never the world. is_valid correctly called Argument III valid and still couldn't catch the bug — because the bug lived in an unsound premise. Tools check validity; engineers must check soundness.
Valid inference yields actionable conclusions. The rewritten postmortem replaced "restart, ticket closed" with distinguishing-evidence modus tollens and an honestly hedged best explanation — keeping the real bug's ticket open. That is theme two in practice: "it stopped failing" is not "we found the cause."
