Case Study 1 — The Backups That Couldn't Be Restored

Every database team thinks it has backups. Far fewer have tested them. A company discovered, during a real disaster, that its nightly backups had been silently broken for months — and the data was gone. The cheapest insurance in computing is the one habit they'd skipped: restoring.

Background

A company ran a nightly pg_dump of its production database to a backup server. The job had been set up years ago, ran every night, reported success, and the resulting files appeared on disk. Everyone believed the database was safely backed up. Backups were, in everyone's mind, "handled."

Then disaster struck: a storage failure corrupted the production database beyond repair. No problem — restore from last night's backup. Except, when they tried, the restore failed.

What had gone wrong (silently, for months)

Investigation revealed the backups had been broken for months:

  • A schema change long ago had introduced an object the backup job's options didn't capture correctly, so recent dumps were incomplete — missing data — but the job still exited "successfully."
  • The backup files existed and were roughly the expected size, so nobody suspected anything. The job's "success" meant only that it had run, not that it had produced a restorable backup.
  • Because no one had ever restored a backup to verify it, the breakage was invisible. Each night produced another unusable file, and everyone's confidence grew while the team's actual ability to recover recent data stayed at zero.
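
One inexpensive way to narrow the gap between "the job exited 0" and "the dump contains what we expect" is to inspect the dump's table of contents. The sketch below assumes custom-format archives (pg_dump -Fc), whose contents pg_restore --list prints one entry per line. The case study does not say which dump format or options the company used, so the function name, the expected-table set, and the sample listing are all hypothetical.

```python
import re

# Matches data entries in `pg_restore --list` output, assumed to look like:
#   3021; 0 16386 TABLE DATA public orders app_owner
TABLE_DATA = re.compile(r"^\d+; \d+ \d+ TABLE DATA (\S+) (\S+) \S+$")


def tables_missing_from_listing(listing: str, expected: set[str]) -> set[str]:
    """Return expected 'schema.table' names that have no TABLE DATA entry."""
    present = set()
    for line in listing.splitlines():
        match = TABLE_DATA.match(line.strip())
        if match:
            schema, table = match.groups()
            present.add(f"{schema}.{table}")
    return expected - present


# Hypothetical listing: the dump "succeeded" but has no data for billing.invoices.
SAMPLE_LISTING = """\
; Archive created at 2024-01-01 02:00:00 UTC
215; 1259 16386 TABLE public orders app_owner
3021; 0 16386 TABLE DATA public orders app_owner
3022; 0 16402 TABLE DATA public customers app_owner
"""

EXPECTED = {"public.orders", "public.customers", "billing.invoices"}
missing = tables_missing_from_listing(SAMPLE_LISTING, EXPECTED)
print(sorted(missing))
```

A check like this only catches tables that are absent from the archive. It says nothing about whether the data inside is complete or loadable, which is why it complements an actual restore rather than replacing one.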

The result: months-old data at best, and significant permanent loss. A recoverable incident became a business catastrophe — entirely because the backups, though present, were never tested.

What should have happened

The principle from this chapter: a backup you have never restored is a hope, not a backup. The fix is a restore discipline, not just a backup one:

  1. Automated test-restores. A scheduled job that takes the latest backup, restores it to a scratch environment, and verifies it — row counts, key tables present, a checksum or smoke-test query. If the restore fails or the data looks wrong, alert immediately. This turns a silently-broken backup into a same-day alert.
  2. Monitor backup validity, not just completion. "The job ran" ≠ "the backup is restorable." Verify the output.
  3. The 3-2-1 rule. Keep three copies of the data, on two different types of media, with one copy off-site (and ideally immutable), so a single failure (or ransomware) can't take out both production and the backups.
  4. Document and rehearse the recovery procedure — a runbook the team has actually practiced, so recovery during a real incident is calm and fast, not improvised.
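
The verification half of step 1 can be sketched as follows. This is a minimal illustration, not the company's actual job: the restore itself (creating a scratch database and loading the latest dump into it) is left out, and an in-memory sqlite3 database stands in for the restored scratch copy so the example runs without a PostgreSQL server. TableCheck, verify_restore, the table names, and the row thresholds are hypothetical; the function is written against the generic Python DB-API, on the assumption that a PostgreSQL driver's connection could be passed in the same way.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class TableCheck:
    table: str     # table that must exist in the restored copy
    min_rows: int  # fewest rows we would accept as plausible


def verify_restore(conn, checks: list[TableCheck]) -> list[str]:
    """Run checks against a restored scratch database; return problems found."""
    problems = []
    for check in checks:
        # Identifiers cannot be passed as query parameters, so validate the
        # name before formatting it into SQL (unqualified names only here).
        if not check.table.replace("_", "").isalnum():
            problems.append(f"{check.table}: invalid table name")
            continue
        cur = conn.cursor()
        try:
            cur.execute(f"SELECT COUNT(*) FROM {check.table}")
            (count,) = cur.fetchone()
        except Exception as exc:  # missing table, unreadable data, ...
            conn.rollback()  # clear any failed transaction before the next check
            problems.append(f"{check.table}: query failed ({exc})")
            continue
        if count < check.min_rows:
            problems.append(
                f"{check.table}: {count} rows, expected at least {check.min_rows}"
            )
    return problems


# Stand-in for the scratch database the nightly job restored into.
scratch = sqlite3.connect(":memory:")
scratch.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
scratch.executemany("INSERT INTO customers (id) VALUES (?)", [(1,), (2,), (3,)])
scratch.commit()

problems = verify_restore(
    scratch,
    [TableCheck("customers", 3), TableCheck("orders", 1)],
)
for problem in problems:
    print("ALERT:", problem)  # in practice: page someone or fail the job
```

Here the missing orders table produces one alert while customers passes. Returning a list of problems, rather than raising on the first one, lets a single nightly run report everything that looks wrong with the backup.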

After the disaster, the company implemented automated nightly test-restores. Within weeks, the restore job caught a new backup problem — and this time it was an alert and a fix, not a catastrophe.

The analysis

  1. Backups fail silently. A backup job can "succeed" while producing an incomplete or corrupt file because of misconfigured options, a schema change, a full disk, or a permissions issue. The only proof a backup works is restoring it. "We have backups" is a dangerous belief without "we have tested restores."

  2. Test-restore is the single most important backup practice. It's the one habit that converts "backups exist" into "we can actually recover." Automate it, verify the data, and alert on failure, so that a broken backup is caught within a day of breaking rather than discovered in a disaster.

  3. Monitor validity, not completion. Success of the job is not success of the backup. Verify the restored data (counts, key tables, smoke tests). A green checkmark on the cron job lulled this team for months.

  4. 3-2-1 protects against correlated failure. Backups on the same system/site as production can be lost with production (hardware failure, ransomware). Multiple copies, media, and an off-site/immutable one ensure a single event can't destroy both.

  5. Recovery is a rehearsed procedure. A real incident is the worst time to work out how to restore. A documented, practiced runbook makes recovery fast and reliable. (This connects to backups-as-security, Chapter 32.)
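
Point 4 lends itself to a mechanical check over an inventory of where copies of the data live. The sketch below is one possible encoding of the rule, not a standard tool: BackupCopy, check_3_2_1, and the inventory are hypothetical, and the medium and location given for the backup server are assumptions, since the case study does not state them. It follows the common formulation in which the production data counts as one of the three copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    medium: str   # e.g. "local-disk", "object-storage", "tape"
    offsite: bool  # stored away from the production site?
    immutable: bool = False


def check_3_2_1(copies: list[BackupCopy]) -> list[str]:
    """Return the 3-2-1 requirements that this set of copies fails to meet."""
    gaps = []
    if len(copies) < 3:
        gaps.append(f"only {len(copies)} copies, need 3")
    media = {c.medium for c in copies}
    if len(media) < 2:
        gaps.append(f"only {len(media)} media type(s), need 2")
    if not any(c.offsite for c in copies):
        gaps.append("no off-site copy")
    elif not any(c.offsite and c.immutable for c in copies):
        gaps.append("off-site copy is not immutable (recommended)")
    return gaps


# Hypothetical inventory resembling the setup in the case study:
# production plus one backup server, assumed to share a site and medium.
inventory = [
    BackupCopy("production", "local-disk", offsite=False),
    BackupCopy("backup-server", "local-disk", offsite=False),
]

gaps = check_3_2_1(inventory)
for gap in gaps:
    print("3-2-1 gap:", gap)
```

A check like this covers only where the copies live. Whether any of them can actually be restored is still a question for the test-restore in point 2.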

Discussion questions

  1. How could the backup job "succeed" every night while producing unrestorable backups?
  2. Why did the absence of test-restores make the breakage invisible for months?
  3. What's the difference between monitoring backup completion and backup validity?
  4. How does 3-2-1 protect against a scenario where production and backups are lost together?
  5. ⭐ Design an automated test-restore + alerting process. What exactly would it verify, and how often?