Case Study 1: A Vague Procedure Made Testable

DataField.Dev


This worked example takes a vague but realistic procedure, the kind that lives in a wiki nobody updates, and rebuilds it into instructions a first-timer can follow. It applies the chapter's anchor, the before/after transformation, end to end. The scenario is fictional but realistic.

The situation

Priya just joined a small SaaS company as a support engineer. On her second day, a customer asks her to restore their account from a backup. She finds the company's runbook entry titled "Restoring a Customer Account." Here it is, in full:

Restoring a Customer Account

To restore an account, you basically just need to grab the latest good backup, make sure it's not corrupted, and then run the restore against the customer's environment. Be careful with production. Once it's done, verify everything came back and let the customer know. If anything looks off, escalate.

It was written two years ago by the engineer who built the restore tooling. To him, it's a complete reminder. To Priya, it is useless. She doesn't know where backups live, what "not corrupted" means or how to check it, what "the customer's environment" refers to, how to "run the restore," what "be careful with production" requires her to do, how to "verify everything came back," or what counts as "off." She can't perform a single line without already knowing the task—which defeats the entire purpose of a runbook.

This is the gap the chapter named: the author wrote a description of a task he remembers, not instructions someone can execute. It is also the curse of knowledge in pure form: every vague phrase ("make sure it's not corrupted," "the customer's environment") compresses several steps the author performs automatically into a few words a beginner can't unpack.

The diagnosis

Before rewriting, Priya (with the author's help) names exactly what's broken. Each defect maps to a fix.

Defect in the runbook	What a doer needs instead
No prerequisites. "Grab the latest good backup" assumes access, tools, the customer's account ID.	A "Before you begin" block: required access, the account ID, the backup-tool login.
Vague condition: "not corrupted." No definition, no method.	A checkable step: run the integrity check; pass = a specific output.
Undefined term: "the customer's environment."	Name it: the customer's tenant, identified by account ID, in the restore console.
Missing the destructive-action warning at the point of use. "Be careful with production" comes after the restore instruction and says nothing actionable.	A warning immediately before the restore step, naming the hazard (restore overwrites current data) and what to do (confirm the target tenant, never production-global).
Vague verification: "verify everything came back."	Specific checks: record count matches, the customer can log in, a known record is present.
Vague escalation: "if anything looks off."	A concrete trigger and destination: if integrity fails or counts don't match, stop and page the on-call engineer with the account ID.
No success-confirmation for the customer.	A defined "done" state before notifying the customer.
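
Most of the defects in the table hide behind a soft phrase. As a rough illustration, the Python sketch below scans a procedure for such phrases. The watch-list and the function name `flag_vague_phrases` are assumptions made for this example, not part of the chapter or of any existing tool, and a phrase scan is only a first pass: it cannot find a missing prerequisite or a misplaced warning.

```python
import re

# Hypothetical watch-list, drawn from the vague phrases in the original runbook.
VAGUE_PHRASES = [
    "basically",
    "make sure",
    "be careful",
    "verify everything",
    "looks off",
    "latest good",
]

ORIGINAL_RUNBOOK = (
    "To restore an account, you basically just need to grab the latest good "
    "backup, make sure it's not corrupted, and then run the restore against "
    "the customer's environment. Be careful with production. Once it's done, "
    "verify everything came back and let the customer know. If anything looks "
    "off, escalate."
)


def flag_vague_phrases(text: str) -> list[tuple[str, str]]:
    """Return (phrase, sentence) pairs for each watch-list phrase found."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        for phrase in VAGUE_PHRASES:
            if phrase in lowered:
                hits.append((phrase, sentence))
    return hits


if __name__ == "__main__":
    for phrase, sentence in flag_vague_phrases(ORIGINAL_RUNBOOK):
        print(f"{phrase!r}: {sentence}")
```

Run against the original runbook, the scan flags every sentence in it. Each flagged phrase marks a place where the reader needs a named target, a method, or a pass condition.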

The rewrite

Here is the same procedure, rebuilt with everything the chapter teaches: a scope line, a prerequisites block, a warning placed before the destructive step, one-action imperative steps in performance order, and verification.

Restoring a Customer Account

Use this to restore a single customer's account from a backup after data loss. Audience: support engineers with restore access. Time: ~15 minutes.

Before you begin:
- You need the Restore role in the admin console. If you don't have it, ask your team lead before starting; it is not granted by default.
- Get the customer's account ID (find it in the support ticket or in Admin → Accounts → search by email).
- Confirm you can log in to the Backup Console (backups.internal).

1. In the Backup Console, search for the account ID.

2. In the backups list, select the most recent backup with status Completed (ignore any marked Partial or Failed).

3. Click Verify Integrity. Wait for the result.

4. Confirm the result reads Integrity: OK. If it reads anything else, stop and go to Troubleshooting → Integrity check failed below. Do not proceed with a bad backup.

5. Click Prepare Restore. Note the target tenant shown in the dialog.

⚠️ Warning: Restoring overwrites the account's current data with the backup. Any changes since the backup are lost, and this cannot be undone. Before continuing, confirm the target tenant matches the customer's account ID exactly. Never select "All tenants" or a production-global target.

6. Confirm the target tenant matches the customer's account ID.

7. Type the account ID into the confirmation field to enable the button (this guard exists to prevent restoring the wrong account).

8. Click Run Restore. Wait for the Restore complete banner (typically 2–5 minutes).

9. Verify the restore:
   - Open Admin → Accounts → the customer's account. Confirm the record count matches the backup's record count (shown in the restore summary).
   - Confirm a known record the customer reported missing is now present.
   - Confirm the customer can log in (ask them to try, or use "Log in as" if permitted).

10. Once all three checks pass, send the customer the "Account Restored" template from the ticket and note the backup timestamp you restored from.

Troubleshooting
- Integrity check failed (step 4): The most recent backup is unusable. Select the next-most-recent Completed backup and verify it. If none pass, stop and page the on-call engineer with the account ID. Do not attempt a restore from a backup that failed integrity.
- Target tenant doesn't match the account ID (step 6): Stop. You may be in the wrong account. Re-search by the account ID from the ticket; if it still mismatches, page on-call before running anything.
- Record counts don't match after restore (step 9): Do not tell the customer it's done. Page the on-call engineer with the account ID, the backup timestamp, and both counts.
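
The selection rule in the rewritten steps (most recent Completed backup, verify integrity, fall back to the next one, page on-call if none pass) is precise enough to write down as code, which is one informal test that a procedure has stopped being vague. The Python sketch below is illustrative only. The runbook describes a console UI, and no scripting interface is documented, so the `Backup` record, `choose_backup`, the `integrity_ok` callback, and the sample timestamps are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Backup:
    """Hypothetical stand-in for one row in the backups list."""
    timestamp: str  # ISO 8601 text, so string order matches time order
    status: str     # "Completed", "Partial", or "Failed"


def choose_backup(
    backups: Iterable[Backup],
    integrity_ok: Callable[[Backup], bool],
) -> Optional[Backup]:
    """Return the newest Completed backup that passes the integrity check.

    Returns None when no Completed backup passes; per the runbook, the
    engineer then stops and pages on-call instead of restoring.
    """
    completed = [b for b in backups if b.status == "Completed"]
    for backup in sorted(completed, key=lambda b: b.timestamp, reverse=True):
        if integrity_ok(backup):
            return backup
    return None


if __name__ == "__main__":
    # Made-up sample data for illustration.
    backups = [
        Backup("2024-05-01T02:00:00Z", "Completed"),
        Backup("2024-05-02T02:00:00Z", "Completed"),
        Backup("2024-05-03T02:00:00Z", "Partial"),
    ]
    # Pretend the May 2 backup fails its integrity check.
    damaged = {"2024-05-02T02:00:00Z"}
    chosen = choose_backup(backups, lambda b: b.timestamp not in damaged)
    print(chosen.timestamp if chosen else "No usable backup: page on-call")
```

The original "grab the latest good backup" could not be written this way, because "good" had no definition to code against.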

What changed, and why it matters

Priya can now perform every line without prior knowledge. The transformation did four things:

Description → instruction. Every vague noun phrase ("the latest good backup," "the customer's environment") became a specific, performable action with a named target.
The warning moved to the point of danger. "Be careful with production" became a warning immediately before the overwrite step, naming the irreversible hazard and the exact precaution (match the tenant; type the ID to confirm). A reader who jumps to step 8 still meets it.
Vague conditions became checkable. "Not corrupted" → Integrity: OK. "Verify everything came back" → three named checks. "If anything looks off" → specific triggers and a paging destination.
It became testable. Crucially, the team then did the thing the original author never could: they had a different new hire follow this runbook on a test account, watched silently, and found one more gap—the new hire didn't know the restore summary showed the expected record count, so step 9's "matches the backup's record count" got the parenthetical "(shown in the restore summary)." One observed stall, one fix. That is the difference between a runbook that looks complete and one that works.
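
The third change, vague conditions made checkable, can be made concrete with a small Python sketch of the "done" state. It assumes the engineer records four observations during the verification step. The class and field names are hypothetical, and the sample counts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RestoreObservations:
    """What the engineer observes while verifying (names are hypothetical)."""
    backup_record_count: int
    restored_record_count: int
    known_record_present: bool
    customer_can_log_in: bool


def failed_checks(obs: RestoreObservations) -> list[str]:
    """Return the names of the checks that did not pass (empty means done)."""
    failures = []
    if obs.restored_record_count != obs.backup_record_count:
        failures.append("record count matches")
    if not obs.known_record_present:
        failures.append("known record present")
    if not obs.customer_can_log_in:
        failures.append("customer can log in")
    return failures


if __name__ == "__main__":
    # Invented sample values: the restored count falls short of the backup's.
    obs = RestoreObservations(
        backup_record_count=1200,
        restored_record_count=1187,
        known_record_present=True,
        customer_can_log_in=True,
    )
    failures = failed_checks(obs)
    if failures:
        print("Not done. Failed:", ", ".join(failures))
    else:
        print("Done: send the Account Restored template")
```

An empty list is the defined "done" state. "Verify everything came back" offered nothing comparable to test against.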

The lesson isn't that the original author was careless—he wasn't. He was expert, and his expertise made the gaps invisible to him. The vague runbook was a perfectly good reminder for the one person who didn't need it. Procedures are written for the people who aren't the author, and the only way to know they work is to watch one of those people try.
