Case Study 2 — The Performance "Optimization" That Lost Committed Data

DataField.Dev

The write-ahead log is what makes "committed" mean "durable." A team chasing write throughput disabled the very settings that guarantee durability — and a routine power loss erased committed transactions. A cautionary tale about why the WAL discipline exists.

Background

An analytics ingestion pipeline wrote millions of rows per hour, and the team wanted more write throughput. Searching for "make PostgreSQL writes faster," someone found settings that dramatically increase write speed and set them in postgresql.conf:

fsync = off                  # ⚠️ don't force WAL/data to physical disk
synchronous_commit = off     # ⚠️ don't wait for WAL flush before reporting COMMIT success
full_page_writes = off       # ⚠️ don't log full page images after a checkpoint (no torn-page protection)

Write throughput jumped — the database stopped waiting for the disk to confirm writes. For weeks it was "faster," and everyone was pleased. Then the server lost power (a data-center incident). On restart, PostgreSQL recovered — but committed transactions from the final window before the crash were gone, and worse, with fsync = off and full_page_writes = off, parts of the database were corrupted (torn pages), requiring a restore from backup. Committed data — data the application had been told was safely saved — had vanished.

What went wrong: durability turned off

Recall the durability mechanism (Chapter 28): on COMMIT, PostgreSQL flushes the transaction's WAL records to physical disk before reporting success, so that after a crash it can replay them. Each of these settings removes part of that guarantee or of the crash recovery that depends on it:

fsync = off tells PostgreSQL not to force writes to physical disk. The OS may buffer WAL and data in memory and lose it on power failure. The "write-ahead" guarantee — log safely on disk before the change is considered durable — is gone. This also risks corruption, because data pages can be partially written ("torn") with no WAL to repair them.
synchronous_commit = off tells COMMIT to return success without waiting for the WAL flush. The transaction is acknowledged to the application but its WAL record may not be on disk yet — so a crash in that window loses committed transactions (though it doesn't risk corruption like fsync = off does).
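full_page_writes = off removes protection against torn pages on crash. With the setting on, PostgreSQL writes a full image of each page to the WAL the first time that page is modified after a checkpoint, so recovery can overwrite a partially written page with a known-good copy.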
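
The difference between an acknowledged commit and a durable one can be sketched with a toy model. This is not how PostgreSQL is implemented: `ToyWal`, its fields, and `lost_after_crash` are hypothetical names invented for illustration, and the model covers only lost commits, not torn pages.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWal:
    """Toy write-ahead log: a volatile OS buffer plus durable disk storage."""
    synchronous_commit: bool = True
    os_buffer: list = field(default_factory=list)     # lost on power failure
    disk: list = field(default_factory=list)          # survives power failure
    acknowledged: list = field(default_factory=list)  # commits reported as successful

    def flush(self) -> None:
        """Stand-in for fsync: move buffered WAL records to durable storage."""
        self.disk.extend(self.os_buffer)
        self.os_buffer.clear()

    def commit(self, txid: int) -> None:
        self.os_buffer.append(txid)
        if self.synchronous_commit:
            self.flush()                # wait for the flush before acknowledging
        self.acknowledged.append(txid)  # the client now believes txid is safe

    def power_loss(self) -> list:
        """Drop volatile state; return acknowledged commits that did not survive."""
        self.os_buffer.clear()
        return [t for t in self.acknowledged if t not in self.disk]


def lost_after_crash(synchronous_commit: bool) -> list:
    wal = ToyWal(synchronous_commit=synchronous_commit)
    for txid in (1, 2, 3):
        wal.commit(txid)
    wal.flush()  # a later background flush catches up with commits 1-3
    for txid in (4, 5):
        wal.commit(txid)
    return wal.power_loss()


print(lost_after_crash(True))   # []
print(lost_after_crash(False))  # [4, 5]
```

In the asynchronous run, commits 4 and 5 were acknowledged to the client but existed only in the volatile buffer when power failed, which is the window described above.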

In short, the team had traded away the D in ACID. The speed came precisely from skipping the durability work — they were fast because they were no longer guaranteeing that committed data survived.

The fix

Restore from backup (recovering the corrupted database), then turn durability back on:

fsync = on                   # the default — never turn off in production
full_page_writes = on        # the default
synchronous_commit = on      # the default (durable commits)
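
A check like the following could catch such a change before it reaches production. It is a minimal sketch with loud assumptions: `SAFE_VALUES`, `parse_conf`, and `unsafe_settings` are hypothetical names, the parser handles only simple `name = value` lines (not include files, settings written without `=`, boolean spellings such as `true`, or values set through ALTER SYSTEM), and it is deliberately strict in flagging any `synchronous_commit` value other than `on`.

```python
# Durability settings and the values this case study treats as safe.
SAFE_VALUES = {
    "fsync": "on",
    "full_page_writes": "on",
    "synchronous_commit": "on",
}


def parse_conf(text: str) -> dict:
    """Parse simple `name = value  # comment` lines; later lines override earlier ones."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a `name = value` line; out of scope for this sketch
        settings[name.strip().lower()] = value.strip().strip("'").lower()
    return settings


def unsafe_settings(text: str) -> dict:
    """Return the durability settings whose configured value differs from the safe one."""
    settings = parse_conf(text)
    return {
        name: settings[name]
        for name, safe in SAFE_VALUES.items()
        if name in settings and settings[name] != safe
    }


risky = """
fsync = off                  # don't force WAL/data to physical disk
synchronous_commit = off
full_page_writes = off
"""
print(unsafe_settings(risky))
```

A setting that does not appear in the file is left at its default, so the sketch reports only values that were explicitly changed.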

To get legitimate write throughput without sacrificing durability, the team used the right tools:

Batch the writes: load with COPY and multi-row inserts (Chapter 31), and group work into fewer, larger transactions, so each commit's fixed cost is amortized over many rows. This was the real win they needed.
Faster storage: put the WAL on fast disks and add I/O capacity.
(Carefully) synchronous_commit = off only where acceptable: for some workloads, such as non-critical telemetry, losing the last fraction of a second of commits on a crash is genuinely acceptable, and synchronous_commit = off is a legitimate, documented trade there. It must be a conscious decision for data you can afford to lose, not a blanket "go faster" flip. Crucially, it does not risk corruption (unlike fsync = off). The team kept synchronous_commit = on for their financial-adjacent data.
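
One way to keep that decision conscious is to scope it to individual transactions rather than the whole server. PostgreSQL allows synchronous_commit to be changed per transaction with SET LOCAL, which lasts only until the transaction ends. The sketch below only builds the statement sequence and does not connect to a database; `async_commit_transaction` and the `telemetry_events` table are hypothetical names used for illustration.

```python
def async_commit_transaction(statements):
    """Wrap statements in a transaction that commits asynchronously.

    SET LOCAL applies only to the current transaction, so other transactions
    on the same connection keep the server-wide default.
    """
    return [
        "BEGIN",
        "SET LOCAL synchronous_commit TO OFF",
        *statements,
        "COMMIT",
    ]


# Hypothetical table name, for illustration only.
for sql in async_commit_transaction(
    ["INSERT INTO telemetry_events (payload) VALUES ('ping')"]
):
    print(sql + ";")
```

Scoped this way, the server default stays at on, and only the transactions that opt in accept the risk of losing their most recent commits.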

Throughput recovered via batching — without betting the data's integrity on the power never failing.
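
The amortization argument can be made concrete with rough arithmetic. The model below is a simplification under stated assumptions: a single connection, one WAL flush per transaction, and illustrative costs (5 ms per flush, 0.01 ms of work per row) that are not measurements from this incident. `rows_per_second` is a hypothetical helper.

```python
def rows_per_second(batch_size: int, flush_ms: float, row_ms: float) -> float:
    """Estimated throughput when each transaction pays one WAL flush at commit.

    Time per transaction = batch_size * row_ms + flush_ms.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    txn_ms = batch_size * row_ms + flush_ms
    return batch_size * 1000.0 / txn_ms


# Assumed, illustrative costs: 5 ms per flush, 0.01 ms of work per row.
for batch in (1, 100, 10_000):
    print(batch, round(rows_per_second(batch, flush_ms=5.0, row_ms=0.01)))
# 1 200
# 100 16667
# 10000 95238
```

Under these assumed numbers, committing every row spends nearly all of its time waiting on the flush, while larger batches spread that one wait across many rows and every commit remains durable.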

The analysis

The WAL/fsync settings are durability itself. fsync, full_page_writes, and synchronous_commit aren't "performance knobs" — they're the machinery that makes "committed" mean "survives a crash." Turning them off makes writes faster by not guaranteeing durability. The speed is the lost guarantee.
fsync = off risks corruption, not just lost data. Without forced writes and full-page-write protection, a crash can leave torn (partially written) pages with no WAL to repair them, corrupting the database. Never set fsync = off in production. Losing the last few commits is a bounded loss; a corrupted database can mean restoring everything from backup, as it did here.
synchronous_commit = off is a legitimate trade when chosen knowingly. It risks losing the last fraction of a second of committed transactions on a crash, without corruption. For data you can afford to lose, that's a defensible, documented choice. For data you can't, keep synchronous_commit = on. The mistake was turning it off, together with fsync and full_page_writes, blindly for all data to "go faster."
Speed without losing durability comes from batching, not from disabling safety. The real fix for high write throughput is fewer/larger transactions, COPY/bulk loading, and faster storage (Chapters 27, 31, 38) — amortizing the durable-commit cost, not abolishing it.
Understand the WHY before flipping a knob. "Found a setting that makes it faster" without understanding what it trades away is how durability gets silently disabled. Knowing the internals (the WAL, fsync) is what tells you these settings are not free speed — they're the safety you'd be removing.

Discussion questions

How does synchronous_commit = on make a COMMIT durable, and what does turning it off risk?
Why is fsync = off far more dangerous than synchronous_commit = off?
The settings made writes genuinely faster. Where did that speed come from?
What are the legitimate ways to increase write throughput without sacrificing durability?
⭐ Describe a workload where synchronous_commit = off is a reasonable, conscious choice — and one where it absolutely is not.