Case Study 2: Raj Patel's Incident Report

DataField.Dev

A deep dive on the incident report introduced in §21.5: how Raj Patel turns a tired, defensive first draft into a factual, blameless report that produces real fixes. Raj recurs in Chapters 24–26 (documenting an open-source project) and Chapter 34 (the blameless postmortem). Raj and this incident are composite and fictional—built to be realistic.

The situation

It's 4 p.m. on a Tuesday. The payments service that Raj's team owns went down for 47 minutes during business hours—about 3,200 customer checkout attempts failed. The service is back up. Now Raj's manager wants the incident report by tomorrow morning, and the report will be read by people Raj has never met: the VP of engineering, the support team that fielded the angry tickets, and—because payments are involved—someone in finance who needs to know the dollar exposure.

Raj is tired, slightly embarrassed (the bug shipped in his team's release), and his first instinct is to write something that minimizes and moves on. That instinct is the enemy of a good incident report, and watching Raj overcome it is the lesson of this case.

Draft 1: tired and defensive

❌ Before: The payments service went down for a while on Tuesday afternoon. It looks like the new release had a bug in it that caused a memory leak, and eventually the service ran out of memory and crashed. We didn't catch it because the memory alert wasn't set up. Once we figured out what was happening we rolled back the release and things recovered. We should be more careful about testing releases and we'll try to set up better monitoring going forward.

Read it the way the three readers will. The VP can't size the event: "for a while" could be five minutes or five hours; "things recovered" doesn't say when. Finance can't find a number. Support can't reconstruct what happened when, so they can't tell customers anything precise. And the two "fixes" at the end—"be more careful" and "we'll try to set up better monitoring"—have no owner and no date, which means they will quietly never happen. The draft also carries a faint defensiveness ("It looks like," "we'll try to") that undercuts trust. It's not dishonest. It's just useless as a record and toothless as a plan.

What Raj fixes, and why

Raj takes a breath and works through the five sections of the §21.4 template, applying four disciplines:

1. Make the impact concrete. "For a while" becomes "47 minutes (14:18–15:05 UTC)." "Things recovered" gets a timestamp. He adds the customer count (~3,200 failed checkouts), a dollar estimate (~$18,000, noted as recoverable since customers can retry), and the SLA consequence. A reader can now size the event without asking him a single follow-up question—which is the whole point of a summary.

2. Separate the timeline from the root cause. Raj writes a factual, timestamped sequence with no judgment words: deploy at 13:50, memory hits 100% at 14:18, on-call paged at 14:21, cause identified at 14:35, rollback decided at 14:52, recovered at 15:05. Then, separately, he writes the root cause—the why—as its own section. This is the Results-vs-Discussion discipline from Chapter 13: the timeline is the part everyone can agree on; the root cause is the analysis, and keeping them apart means a reader who questions his analysis still trusts his facts.

3. Frame the cause as a system, not a person. This is the hard one. The leak shipped in Raj's team's release; the blameful version writes itself ("we pushed a bad release"). Raj refuses it. His root cause names the mechanism (payment-session objects retained after completion—a missing cleanup call) and the two systemic gaps that let it through: no memory alert (so the leak was invisible until the crash) and load tests that run only 5 minutes (too short to surface a slow leak). Nowhere does the report name a culprit. That's not Raj protecting himself—it's Raj pointing the corrective actions at fixes that will actually work.

4. Make the corrective actions real. "Be more careful" and "we'll try" become a table—four concrete changes, each with one named owner and a due date, so the report can be tracked to completion in the next review.
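The durations Raj quotes follow directly from the timeline timestamps. A minimal Python sketch, assuming all six timestamps fall on the same UTC day; the dictionary keys and the helper name `minutes_between` are illustrative and not part of the §21.4 template:

```python
from datetime import datetime

# Timestamps (UTC) taken from Raj's timeline; the key names are illustrative.
TIMELINE = {
    "deployed": "13:50",
    "crashed": "14:18",
    "paged": "14:21",
    "cause_identified": "14:35",
    "rollback_decided": "14:52",
    "recovered": "15:05",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


outage = minutes_between(TIMELINE["crashed"], TIMELINE["recovered"])
time_to_exhaustion = minutes_between(TIMELINE["deployed"], TIMELINE["crashed"])
time_to_page = minutes_between(TIMELINE["crashed"], TIMELINE["paged"])

print(f"Outage duration: {outage} min")              # 47
print(f"Deploy to crash: {time_to_exhaustion} min")  # 28
print(f"Crash to page: {time_to_page} min")          # 3
```

Deriving the numbers from the timestamps, rather than typing them by hand, keeps the summary, the timeline, and the impact section from drifting apart when a timestamp is corrected.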

Draft 2: the report

✅ After:

```text
INCIDENT REPORT — Payments service outage (out-of-memory crash)
Incident ID: INC-2041    Severity: SEV-1    Status: Resolved
Author: Raj Patel    Date of incident: Mar 11    Report date: Mar 12

SUMMARY
On Mar 11, the payments service was fully unavailable for 47 minutes
(14:18–15:05 UTC) during business hours. ~3,200 checkout attempts failed.
Cause: a memory leak in release v2.8.0 exhausted host memory; no memory
alert existed to warn before the crash. Resolved by rolling back to v2.7.4.
Service is stable; corrective actions below.

TIMELINE (UTC)
13:50  Release v2.8.0 deployed to production.
14:18  Host memory reached 100%; payments service killed by the OOM killer.
14:21  First customer reports of failed payments; on-call paged.
14:35  On-call identifies rising memory in v2.8.0 from host metrics.
14:52  Decision to roll back to v2.7.4.
15:05  Rollback complete; payments recovered; error rate normal.

ROOT CAUSE
Release v2.8.0 introduced a memory leak: payment-session objects were
retained after completion (a missing cleanup call). Under production load,
host memory was exhausted in ~28 minutes. Contributing factors:
(1) no alert on memory, so the leak was invisible until the crash;
(2) load tests run only 5 minutes — too short to surface a slow leak.

IMPACT
- Users affected: ~3,200 failed checkout attempts.
- Duration: 47 minutes (14:18–15:05 UTC).
- Est. lost transactions: ~$18,000 (recoverable — customers can retry).
- SLA: breached the 99.9% monthly availability target for payments.

CORRECTIVE ACTIONS
| # | Action                                                | Owner | Due    | Status |
|---|-------------------------------------------------------|-------|--------|--------|
| 1 | Add the missing session-cleanup call; patch + release | Raj   | Mar 13 | Open   |
| 2 | Add memory alerts (warn 80%, page 90%) on payments    | Priya | Mar 15 | Open   |
| 3 | Extend load tests to 1 hour to surface slow leaks     | Sam   | Mar 20 | Open   |
| 4 | Require approval for release + config to production   | Raj   | Mar 22 | Open   |
```
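The SLA line in the IMPACT section can be checked with simple arithmetic. The sketch below assumes the 99.9% target is measured over the calendar month of March (31 days) and that all 47 minutes count as full downtime. The report states neither the measurement window nor how partial outages are counted, so treat both as assumptions:

```python
# Assumptions (not stated in the report): the availability window is the
# 31-day calendar month, and the entire outage counts as downtime.
DAYS_IN_MONTH = 31
TARGET = 0.999
OUTAGE_MINUTES = 47

total_minutes = DAYS_IN_MONTH * 24 * 60        # 44,640 minutes in the month
budget_minutes = total_minutes * (1 - TARGET)  # downtime the target allows
availability = 1 - OUTAGE_MINUTES / total_minutes

breached = OUTAGE_MINUTES > budget_minutes
print(f"Downtime budget: {budget_minutes:.2f} min")
print(f"Availability: {availability:.4%}")
print(f"SLA breached: {breached}")
```

Under these assumptions the monthly budget is about 44.6 minutes, so this single 47-minute outage exceeds it on its own. A 30-day window gives a tighter budget of 43.2 minutes, and the conclusion is the same.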

The objection Raj had to answer

When Raj circulated the draft, a senior colleague pushed back: "Shouldn't the report say the release wasn't tested properly? Aren't we letting the team off the hook?" This is the standard objection to blameless reporting, and it deserves a straight answer.

Raj's response: the report does say the testing was inadequate. Corrective action #3 exists precisely because the 5-minute load test couldn't catch a slow leak. What the report doesn't do is stop at "the team should have tested better," because that's not a fix; it's a hope that human vigilance will improve, and human vigilance reliably doesn't. The blameless version is more accountable, not less: it commits three named owners to four specific changes by four specific dates. "Be more careful" holds no one accountable for anything. The table holds three named people accountable for concrete work. Blameless isn't the soft option; it's the version with teeth.
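The arithmetic behind corrective action #3 is short. A rough sketch, assuming the load test drives the service at roughly production-equivalent load (the report does not say), so that the ~28 minutes from deploy to crash is a fair estimate of how long the leak needs to exhaust memory; the function name `run_reaches_crash` is illustrative:

```python
# ~28 minutes from deploy (13:50) to out-of-memory crash (14:18), per the report.
MINUTES_TO_EXHAUSTION = 28


def run_reaches_crash(test_minutes: int) -> bool:
    """True if a test of this length runs at least as long as the leak took to crash."""
    return test_minutes >= MINUTES_TO_EXHAUSTION


for minutes in (5, 60):
    coverage = minutes / MINUTES_TO_EXHAUSTION
    print(f"{minutes}-minute test: {coverage:.0%} of time to crash, "
          f"reaches crash: {run_reaches_crash(minutes)}")
```

A 5-minute run covers under a fifth of that window, which is consistent with the old test passing. A 1-hour run is roughly twice as long as the leak needed in production.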

The takeaway

The difference between Raj's two drafts is not honesty—both are honest. It's usefulness. Draft 1 minimizes, blurs the impact, tangles facts with feelings, and ends in wishes. Draft 2 sizes the event precisely, separates what happened from why, attacks the system instead of the person, and converts "we should do better" into a tracked plan with owners and dates. That second report makes the organization smarter and the system safer; the first just closes the ticket. When you write the incident report at 2 a.m., write Raj's second draft. It's a direct line to the blameless postmortem you'll meet in Chapter 34.
