Case Study 2 — The Disk That Filled at 2 a.m.

DataField.Dev

Case Study 2 — The Disk That Filled at 2 a.m.

One of the most common — and most preventable — database outages: the disk fills up, and PostgreSQL grinds to a halt because it can no longer write. The cause was a stalled WAL-archiving process and the absence of disk-space monitoring. Both are basic operational hygiene.

Background

A company ran self-hosted PostgreSQL with WAL archiving enabled for point-in-time recovery (Chapter 28/38): each WAL segment was copied to an archive location (a network share) before being recycled. One night, the network share became unreachable (a credentials expiry). PostgreSQL, doing exactly what it's designed to do, refused to recycle WAL segments that hadn't been successfully archived — because discarding un-archived WAL would break recoverability.

So WAL piled up on the local disk. With no one watching disk space, it filled steadily through the night. At ~2 a.m., the disk hit 100% full. PostgreSQL could no longer write WAL — and since it can't commit anything without writing WAL (durability, Chapter 28) — the database effectively stopped accepting writes. The application went down. On-call was paged into a confusing 2 a.m. incident: "the database is up but nothing works."

What went wrong: two missing safeguards

A stalled archive command silently accumulated WAL. When archive_command keeps failing (the unreachable share), PostgreSQL correctly retains WAL rather than lose it — but that means WAL grows without bound on the local disk until the archive recovers. Nobody was alerted that archiving was failing.
No disk-space monitoring/alerting. The disk filled over hours — ample time to react — but there was no alert on free space. The first signal anyone got was the outage itself. A disk-usage alert at 80% would have turned a 2 a.m. catastrophe into a routine daytime fix.

A full disk is one of the most preventable outages: it happens gradually, it's trivial to monitor, and the failure mode (database halts because it can't write) is well known.

The fix and recovery

Immediate recovery: the on-call freed space carefully (you must not delete WAL files manually — that breaks recovery; instead they archived the backlog or expanded the volume), restored the archive destination (renewed credentials), let PostgreSQL drain its WAL backlog to the archive, and the database resumed.

Prevention:

Monitor and alert on free disk space — alert well before full (e.g., 80%/90%), for the data volume and the WAL volume. The single most important fix.
Monitor archive_command success / archiver lag — alert when archiving starts failing, so a stalled archive is caught while there's still disk, not after WAL has piled up for hours.
Put WAL on its own volume (a common practice) so WAL growth can't fill the data disk (and vice versa), and so it's monitored separately.
Set retention/cleanup correctly and consider tools (pgBackRest, barman) that manage WAL archiving and retention robustly.

The analysis

A full disk halts the database. PostgreSQL must write WAL to commit (durability, Chapter 28); no disk space → no writes → effective outage. It's among the most common production database incidents, and among the most preventable.
WAL accumulates when archiving stalls — by design. PostgreSQL retains un-archived WAL rather than risk recoverability. That's correct, but it means a failing archive_command is a slow-motion disk-fill. You must monitor archiving health, not assume it.
Monitor disk space and alert early. The outage was preventable with a single threshold alert; the disk filled over hours. Free space (data and WAL) is a must-monitor metric. (Chapter 38's monitoring discipline applied.)
Never manually delete WAL to free space. It's tempting at 2 a.m., but deleting un-archived/needed WAL breaks PITR and can corrupt recovery. Free space the right way (expand the volume, fix archiving, use proper tooling).
Separate concerns with separate volumes and robust tooling. Putting WAL on its own volume contains the blast radius; dedicated backup tools (pgBackRest/barman) handle archiving and retention more safely than a hand-rolled archive_command.

Discussion questions

Why did WAL pile up when the archive destination became unreachable? Why is that PostgreSQL behaving correctly?
Why does a full disk halt PostgreSQL specifically (tie to the WAL/durability)?
What single monitoring alert would have prevented the 2 a.m. outage?
Why must you not manually delete WAL files to free space?
⭐ Design the monitoring/alerting you'd put in place for a self-hosted PostgreSQL with WAL archiving. What metrics and thresholds?