Case Study 2: The Tidy Diagram That Hid the Outage

DataField.Dev

A composite, fictional-but-realistic deep dive into the chapter's most dangerous failure mode: the diagram that misleads precisely because it looks clean. No real company, no real incident—an illustrative reconstruction of a pattern that recurs.

The diagram everyone trusted

A mid-size SaaS company had one architecture diagram that lived everywhere: the sales deck, the onboarding wiki, the security questionnaire they sent to enterprise prospects. It looked like this:

flowchart LR
    Users([Users]) --> App[Application] --> DB[(Database)]

Figure (described): three boxes in a clean left-to-right line—Users to Application to Database—projecting a simple, well-understood, low-risk system.

It was reassuring. It said, in effect, "this system is simple and under control." Prospects liked it. New engineers learned from it. Nobody questioned it, because there was nothing in it to question, and that was the problem.

What the diagram didn't say

The real system was not three boxes. Sitting invisibly between "Application" and "Database" was a caching layer that served reads to keep the database from melting. The cache had a time-to-live, and under certain conditions it could serve stale data. There was also a read replica with replication lag, a third-party authentication provider the app called on every login, and a background job queue that, if it backed up, delayed order confirmations.
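To make the stale-read mechanism concrete, here is a minimal sketch of a read-through cache with a time-to-live. Everything in it is hypothetical and chosen for illustration: the `TTLCache` class, the 300-second TTL, the `order-1` key, and the injected fake clock are assumptions, not details of the system described here.

```python
class TTLCache:
    """Illustrative read-through cache with a fixed time-to-live."""

    def __init__(self, ttl_seconds, load, clock):
        self.ttl = ttl_seconds
        self.load = load        # key -> value; stands in for the database read
        self.clock = clock      # returns the current time in seconds
        self.entries = {}       # key -> (value, stored_at)

    def get(self, key):
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]       # served from cache, possibly stale
        value = self.load(key)  # miss or expired entry: go to the database
        self.entries[key] = (value, now)
        return value


# Hypothetical data and a fake clock so the example is deterministic.
database = {"order-1": "unpaid"}
now = [0.0]
cache = TTLCache(ttl_seconds=300, load=database.__getitem__, clock=lambda: now[0])

first = cache.get("order-1")       # miss: loads "unpaid" from the database
database["order-1"] = "paid"       # the write reaches the database, not the cache
now[0] = 120.0
during_ttl = cache.get("order-1")  # inside the TTL: the cache still answers "unpaid"
now[0] = 301.0
after_ttl = cache.get("order-1")   # entry expired: reloads "paid"
```

Nothing in this sketch is broken. The cache behaves as designed, and a reader who does not know it exists still sees the application and the database disagree.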

None of that appeared in the diagram. Each box in the diagram was accurate: there really was an application and a database. But the picture made an implicit claim that was false: "nothing surprising can happen here." The cache, the replica lag, the third-party dependency, and the queue were each a way the system could behave unexpectedly, and the diagram had erased all of them.

This is the §32.8 trap exactly. Prose that oversimplified would have looked thin and invited questions. The diagram oversimplified and looked authoritative, so it invited none.

The day it mattered

During a routine deploy, a config change doubled the cache TTL. Reads started returning data that was up to ten minutes stale. Users saw orders that still appeared unpaid after they had been paid, inventory counts that were wrong, and dashboards that disagreed with reality. The on-call engineer, relatively new and trained on the three-box diagram, spent the first thirty minutes looking at the application and the database, because those were the only two things the diagram said existed. The cache, the actual culprit, wasn't on his mental map. It wasn't on anyone's map, because the map didn't include it.
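The arithmetic behind "up to ten minutes stale" can be sketched as a worst-case bound. The helper name `worst_case_staleness` is hypothetical, and the 300-second original TTL is an assumption inferred from "doubled" and "ten minutes"; the case study does not state the actual values or the replica lag.

```python
def worst_case_staleness(cache_ttl_s: float, replica_lag_s: float = 0.0) -> float:
    """Upper bound, in seconds, on the age of a value served from the cache.

    An entry can be filled from a replica that is already replica_lag_s
    behind the primary, and it is then served for up to cache_ttl_s more.
    """
    return cache_ttl_s + replica_lag_s


# Assumed values: a 300 s TTL before the deploy, doubled to 600 s after it.
before_deploy = worst_case_staleness(300)
after_deploy = worst_case_staleness(600)
after_deploy_minutes = after_deploy / 60

# With a hypothetical 5 s of replica lag, the bound grows by that lag.
with_lag = worst_case_staleness(600, replica_lag_s=5)
```

Both terms of the bound come from components that the three-box diagram left out.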

The diagram didn't cause the outage. But it lengthened it, by shaping a mental model that omitted the component that failed. A picture that hides a failure mode trains people not to look there.

The honest redraw

In the postmortem, the team didn't throw out the simple diagram—it was genuinely useful for sales and the thirty-second pitch. Instead they applied the chapter's honest moves:

They captioned the simple one for what it is. It became: "Figure 1: conceptual overview for non-technical audiences. Not the operational architecture—see the runbook diagrams for caching, replication, and failure handling." The simplicity was now true, because it was scoped.
They drew the operational view the on-call engineer actually needed—a diagram that did show the cache, the replica, the auth provider, and the queue, with the read path through the cache made explicit:

flowchart LR
    Users([Users]) --> App[Application]
    App -- "reads (cache-first)" --> Cache[(Redis Cache<br/>TTL-bounded, can be stale)]
    Cache -- "on miss" --> Replica[(Read Replica<br/>lag-bounded)]
    App -- "writes" --> Primary[(Primary DB)]
    App -- "every login" --> Auth[Auth Provider<br/>external]
    App -- "async jobs" --> Queue[[Job Queue]]
    Primary --> Replica

Figure (described): an operational architecture diagram. Users hit the Application. Reads go cache-first to a Redis Cache labeled "TTL-bounded, can be stale," falling through on a miss to a lag-bounded Read Replica. Writes go to a Primary DB, which replicates to the replica. The Application also calls an external Auth Provider on every login and pushes async work to a Job Queue. The labels name the two failure-relevant properties—cache staleness and replica lag—directly on the boxes. What it shows: the read path and its two staleness risks (cache TTL, replica lag) plus the external dependency and the async queue—the things an on-call engineer must hold in their head, which the three-box diagram erased.

They put the operational diagram in the runbook, in Mermaid, in the repo, so it diffs in review and stays in sync as the system evolves (§32.9). The day someone changes the caching strategy, the diagram changes in the same pull request.

The lesson

A diagram can be accurate in every box and still mislead by what it leaves out—the same lesson Chapter 9 drew from the Challenger charts, where the fatal pattern was never isolated into one unmissable figure. Omission is a choice, and a clean diagram makes that choice invisible, which is why it's an ethics question and not just a craft one (a thread Chapter 38 takes up directly).

The fix is never "show absolutely everything"—that's the spaghetti of Case Study 1. The fix is honesty about scope: know what audience each diagram serves, caption what it leaves out, and keep an operational view that doesn't lie to the person at 3 a.m. A diagram simplified for the boardroom can be dangerously wrong for the runbook. Audience is everything (Chapter 2)—and a picture, because it persuades so quietly, owes its reader the truth about what it isn't showing.
