Case Study 2 — The Migration That Locked Production for an Hour
A single "data cleanup" migration — one big
UPDATEplus aNOT NULLadd — locked a critical table for an hour during business hours. The change was correct; the way it ran was the disaster. The fix was batching, ordering, and timing.
Background
A team needed to (1) backfill a new status column (added empty in a prior migration) for all existing orders, then (2) make it NOT NULL and add a CHECK constraint. A developer wrote it as the obvious single migration and merged it; it deployed at 11 a.m.:
-- one migration, run at 11am on a 60-million-row orders table ❌
UPDATE orders SET status = 'completed' WHERE status IS NULL; -- one huge UPDATE
ALTER TABLE orders ALTER COLUMN status SET NOT NULL; -- validating scan
ALTER TABLE orders ADD CONSTRAINT status_valid CHECK (status IN (...)); -- another scan
Within seconds, order processing stalled. The checkout API timed out. The migration ran for nearly an hour, holding locks the whole time, and the team couldn't safely cancel it midway. It was a full-blown incident during peak traffic.
What went wrong
Three compounding problems, each from this chapter:
-
One giant
UPDATE. Updating 60 million rows in a single statement created an enormous transaction: it locked vast numbers of rows, generated huge write-ahead-log volume, and ran for a long time — blocking and being blocked by normal order writes (Chapters 14, 27). A batched update (10k rows at a time, committing each) would have let normal traffic interleave. -
SET NOT NULLandADD CHECKscan/lock the table. Both must verify every existing row, taking strong locks on a 60M-row table while doing so (Chapter 14). Combined with the longUPDATEin the same migration, the table was effectively unavailable for the whole duration. -
Run at 11 a.m., with no
lock_timeout. The migration ran during peak business hours, and without alock_timeoutit didn't fail fast when it couldn't get a clean lock — it queued, and application queries queued behind it, freezing the system. It also wasn't tested against production-scale data, so its true runtime was a surprise.
The safe version
The same change, restructured into safe steps run at low traffic:
-- (the column was already added nullable in an earlier migration — good)
-- 1. Backfill in BATCHES (a script/loop, committing each batch; re-runnable):
UPDATE orders SET status = 'completed'
WHERE status IS NULL AND order_id IN (
SELECT order_id FROM orders WHERE status IS NULL LIMIT 10000
);
-- ...repeat until zero rows remain...
-- 2. Add the CHECK as NOT VALID (fast, no full scan under exclusive lock):
ALTER TABLE orders ADD CONSTRAINT status_valid CHECK (status IN (...)) NOT VALID;
-- 3. Validate it separately (scans, but doesn't block writes the way the inline add does):
ALTER TABLE orders VALIDATE CONSTRAINT status_valid;
-- 4. Add NOT NULL during a low-traffic window once backfill is complete.
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
-- and globally for the session:
SET lock_timeout = '3s'; -- fail fast instead of freezing app queries
Run during a low-traffic window, with batching and lock_timeout, the same logical change applied without an outage — each step held locks only briefly, and normal order processing continued throughout.
The process changes
Beyond the technical fix, the team changed how migrations ship:
- Migrations are reviewed for lock behavior, not just correctness. A reviewer now asks "does this scan/rewrite/lock a big table? is the data change batched?"
- Big data backfills run as separate, batched jobs, not inside the schema migration.
- Risky migrations run in maintenance/low-traffic windows, with
lock_timeoutset. - Migrations are tested on a production-sized dataset (a restored copy) to learn their real runtime before prod.
The analysis
-
The change was right; the execution was the disaster. Backfilling and constraining
statuswas correct. Doing it as one big locking statement at peak hours turned a routine change into an incident. How a migration runs matters as much as what it does. -
Batch large data migrations. A single 60M-row
UPDATEis a long, lock-heavy mega-transaction. Batches of ~10k, each committed, let production traffic interleave and let a partial run resume. -
SET NOT NULLandADD CHECK/FK take scanning locks. UseNOT VALID+VALIDATE CONSTRAINTto avoid the long blocking scan, and addNOT NULLonly after backfill, in a quiet window. -
lock_timeoutprevents a blocked migration from freezing the app. Without it, the migration queues for a lock and app queries pile up behind it; with it, the migration fails fast and you retry later — far better than an outage. -
Test at scale and time it right. A migration's behavior on 100 dev rows tells you nothing about 60M prod rows. Test on a production-sized copy, and schedule risky changes for low traffic. (Chapter 14's "ALTER that locked the store," now generalized to migrations.)
Discussion questions
- Identify the three compounding mistakes and the fix for each.
- Why does one 60M-row
UPDATEcause locking/availability problems that batches avoid? - How do
NOT VALID+VALIDATE CONSTRAINTreduce lock time vs. an inlineADD CONSTRAINT? - What does
lock_timeoutchange about the failure mode, and why is failing fast better here? - ⭐ Write a review checklist for migrations that would have flagged this one before it merged.