Case Study 2 — Two Postmortems for the Same Outage: Blame vs. Blameless

DataField.Dev

Where Case Study 1 rewrote a single review comment, this one contrasts two whole documents written about the same incident: one that hunts for a culprit and one that debugs the system. It shows the other half of the chapter's threshold idea (debug the system, not the human) at full length, and it makes visible why blamelessness is not a courtesy but a mechanism. The incident is the chronoparse time-zone bug from §34.6; the team is a composite; the pattern is one every engineering organization repeats.

The incident

chronoparse 2.2.0 shipped on a Friday afternoon. Within three minutes, a downstream monitor fired: every timestamp the library produced was off by exactly one hour. A refactor had quietly changed the default time zone—when a caller didn't specify one—from explicit UTC to the server's local zone, which happened to be UTC+1. For 32 minutes, until the team rolled back to 2.1.0, roughly 2.1 million timestamps were written an hour off, and three downstream services ingested the skewed data.
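
To make the mechanism concrete, here is a minimal Python sketch of the kind of change described above. It is an illustration under stated assumptions, not chronoparse's actual code: the function names `parse_v2_1` and `parse_v2_2`, the `local_zone` parameter, and the fixed UTC+1 stand-in for the server's zone are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stand-in for the server's local zone (UTC+1 in the incident).
SERVER_LOCAL = timezone(timedelta(hours=1))


def parse_v2_1(text, tz=None):
    """Assumed pre-refactor behavior: a missing tz means explicit UTC."""
    naive = datetime.fromisoformat(text)
    return naive.replace(tzinfo=tz if tz is not None else timezone.utc)


def parse_v2_2(text, tz=None, local_zone=SERVER_LOCAL):
    """Assumed post-refactor behavior: a missing tz means the server's zone."""
    naive = datetime.fromisoformat(text)
    return naive.replace(tzinfo=tz if tz is not None else local_zone)


# Same input, no tz argument: the two versions disagree about the instant.
good = parse_v2_1("2024-06-14T15:00:00")
bad = parse_v2_2("2024-06-14T15:00:00")
skew = abs(good - bad)  # size of the disagreement between the two instants

# Same input, explicit tz: the two versions agree, so the change stays hidden.
same = parse_v2_1("2024-06-14T15:00:00", tz=timezone.utc) == parse_v2_2(
    "2024-06-14T15:00:00", tz=timezone.utc
)
```

In this sketch `skew` is exactly one hour and `same` is true. A caller who always passes `tz` would never notice the refactor; only callers relying on the default see the shift.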

The engineer who wrote the refactor and ran the deploy was named Sam. That fact is about to matter—not because Sam did anything unusual, but because two different cultures will write about that fact in two opposite ways, and the difference will determine whether chronoparse ever ships this bug again.

Version A: the postmortem that hunts for a culprit

In a blame culture, the write-up reads like this:

"On Friday, Sam deployed version 2.2.0 without properly testing the time-zone handling, causing a 30-minute outage that corrupted millions of records. Sam should have caught this. This was avoidable. Going forward, Sam—and the team generally—needs to be more careful and test thoroughly before deploying. Reminder to everyone: always test your code before you ship it."

Walk through what this document accomplishes, because it's the opposite of what it intends:

It names a person and assigns fault. "Sam deployed… without properly testing," "Sam should have caught this." Every engineer reading it now knows that an incident here ends with your name attached to "should have caught this."
Its corrective action is an exhortation, not a change. "Be more careful," "test thoroughly," "always test your code." These ask humans to stop being human. They change nothing about the conditions that let the bug through.
It teaches exactly the wrong lesson. The lesson Sam learns is not "here's how we prevent this"; it's "deploys are dangerous to my reputation." The lesson everyone else learns is "if you're ever at the center of an incident, say as little as possible."

And here's the mechanism that makes Version A self-defeating: the next time something breaks—and something always breaks—the engineer involved will remember Sam. They'll be slower to raise the alarm, quicker to minimize, more careful about what they put in writing. The information the team needs to find the real cause will be the very information people have learned to withhold. Version A doesn't just fail to prevent the next outage; it actively makes the next one harder to learn from. It optimized for the appearance of accountability and destroyed the thing accountability is supposed to produce: a system that's safer tomorrow.

Notice, too, what Version A never actually establishes: why was it possible for a changed default to reach production untested? It doesn't ask. "Sam should have caught it" is a stopping point that prevents the investigation. The real causes go undiscovered because the document was satisfied the moment it found someone to blame.

Version B: the postmortem that debugs the system

Here is the same incident, written blamelessly—the document from §34.6, reproduced so you can read it against Version A:

# Postmortem: Incorrect time-zone offsets in chronoparse 2.2.0

**Status:** Resolved · **Severity:** SEV-2 · **Duration:** 32 min
**Author:** Sam · **Date:** 2024-06-14

## Summary
chronoparse 2.2.0 applied the system's local time zone instead of UTC when no
`tz` was specified, shifting every parsed timestamp by the local UTC offset.
For 32 minutes, ~40% of downstream timestamp writes were off by +1h until we
rolled back to 2.1.0.

## Root cause
A refactor in 2.2.0 (#345) changed the default time zone from explicit UTC to
the system local zone when `tz` was unspecified. This reached production because:
- No test asserted the *default* tz behavior — tests passed an explicit `tz`,
  so the changed default was never exercised.
- The change was a one-line default inside a large refactor PR; the reviewer's
  attention was on the parsing logic, and the default change wasn't called out
  in the PR description.
- Staging runs in UTC, so the bug was invisible there — it only manifests on
  servers with a non-UTC local zone.

## What went well
- The downstream skew monitor caught it in 3 minutes; rollback was clean.

## Action items
| Action | Owner | Due |
|--------|-------|-----|
| Add tests asserting default-tz behavior (no `tz` → UTC) | Sam | 2024-06-17 |
| Run one staging worker in a non-UTC zone | Priya | 2024-06-21 |
| Add a PR-description checklist item: "Does this change any default?" | Raj | 2024-06-19 |
| Add a deploy guard: block release if tz-default test is absent | Sam | 2024-06-28 |

Read it against Version A and the difference is total—and yet Sam is still here. Sam is the author. Sam is named in the action items, owning two of the fixes. Blameless did not mean anonymous; it meant the analysis is aimed at the system, not at Sam's competence. That distinction is the whole craft:

The root cause is a chain of system failures, not a personal one. Not "Sam didn't test," but "no test exercised the default; the change wasn't flagged in the PR; staging only runs UTC." Each link is a condition that can be changed.
Every action item changes the system and is owned and dated. A test that would have caught it. A staging worker in a non-UTC zone so the bug can't hide. A PR-description prompt so default changes get flagged. A deploy guard so the missing test blocks release. None of them says "be careful." (This is Chapter 21's discipline—an action item without an owner and a date is a wish—now in service of a system fix.)
"What went well" is included. The monitor that caught it in three minutes and the clean rollback are things to keep. A postmortem that only catalogs failures misses half its job.
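
The first action item, a test that asserts the default, could look something like the following. This is a sketch only: `parse` is a hypothetical stand-in for chronoparse's entry point, and the test names are invented for illustration. The point is that the first test deliberately omits `tz`, which is the case the original suite never exercised.

```python
import unittest
from datetime import datetime, timedelta, timezone


def parse(text, tz=None):
    """Hypothetical stand-in for chronoparse's parser (not its real API)."""
    naive = datetime.fromisoformat(text)
    return naive.replace(tzinfo=tz if tz is not None else timezone.utc)


class DefaultTimeZoneTest(unittest.TestCase):
    def test_missing_tz_means_utc(self):
        # Deliberately pass no tz: this is the path the refactor changed.
        parsed = parse("2024-06-14T15:00:00")
        self.assertEqual(parsed.utcoffset(), timedelta(0))

    def test_explicit_tz_is_respected(self):
        plus_one = timezone(timedelta(hours=1))
        parsed = parse("2024-06-14T15:00:00", tz=plus_one)
        self.assertEqual(parsed.utcoffset(), timedelta(hours=1))


def run_suite():
    """Run the two tests above and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(DefaultTimeZoneTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_suite()` runs both tests. If the default were switched to a non-UTC zone, `test_missing_tz_means_utc` would fail while the explicit-`tz` test kept passing, which is the gap the postmortem identifies.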

Crucially, Version B answers the question Version A never asked: why was it possible? Asking that question, and not stopping at "Sam," is what surfaces the four real fixes. Version A found a culprit and learned nothing. Version B found a system and learned four concrete ways to make it safer.

What this case teaches

Blameless is a mechanism, not a manner. It's not that Version B is nicer to Sam (though it is). It's that Version B works: it produces the information and the fixes that prevent recurrence, while Version A produces fear and silence that guarantee it. The choice between them is a choice between learning and not learning.
"Why was it possible?" is the question that does the work. Blame stops at "who." Prevention requires pushing past the person to the conditions—and you push past by asking "why was it possible for this to reach production?" until you reach something you can change.
Naming a person and blaming a person are different acts. Sam is named throughout Version B (as author and as action-item owner) and is blamed nowhere, because the analysis targets the system. You can be fully transparent about who was involved while aiming every causal claim at the system that involved them.

This is the chapter's threshold seen from its second side. In a code review you critique the code, not the coder (Case Study 1). In a postmortem you debug the system, not the human. They are one move—separate the work from the person—and that move is the most important social skill in technical collaboration, because it's the only one that turns mistakes into improvement instead of into fear. And it is, first and last, a writing skill: the difference between Version A and Version B is entirely a difference in how the same facts were written down.

Try it yourself: Take an incident you know—a bug shipped, a deploy broken, an outage—and draft its root-cause section twice. First, the way it would come out if you let the blame instinct lead ("X did Y"). Then rewrite it asking only "why was it possible for this to reach production?", tracing the chain to system conditions you could change, and converting each into an owned, dated action item. Read the two side by side. The second one is the only one that would have prevented the next occurrence—and notice it didn't require letting anyone off the hook, only aiming the analysis at the system instead of the person.
