Case Study 2 — The Migration That Locked Production for an Hour

DataField.Dev

Case Study 2 — The Migration That Locked Production for an Hour

A single "data cleanup" migration — one big UPDATE plus a NOT NULL add — locked a critical table for an hour during business hours. The change was correct; the way it ran was the disaster. The fix was batching, ordering, and timing.

Background

A team needed to (1) backfill a new status column (added empty in a prior migration) for all existing orders, then (2) make it NOT NULL and add a CHECK constraint. A developer wrote it as the obvious single migration and merged it; it deployed at 11 a.m.:

-- one migration, run at 11am on a 60-million-row orders table   ❌
UPDATE orders SET status = 'completed' WHERE status IS NULL;          -- one huge UPDATE
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;                  -- validating scan
ALTER TABLE orders ADD CONSTRAINT status_valid CHECK (status IN (...));  -- another scan

Within seconds, order processing stalled. The checkout API timed out. The migration ran for nearly an hour, holding locks the whole time, and the team couldn't safely cancel it midway. It was a full-blown incident during peak traffic.

What went wrong

Three compounding problems, each from this chapter:

One giant UPDATE. Updating 60 million rows in a single statement created an enormous transaction: it locked vast numbers of rows, generated huge write-ahead-log volume, and ran for a long time — blocking and being blocked by normal order writes (Chapters 14, 27). A batched update (10k rows at a time, committing each) would have let normal traffic interleave.
SET NOT NULL and ADD CHECK scan/lock the table. Both must verify every existing row, taking strong locks on a 60M-row table while doing so (Chapter 14). Combined with the long UPDATE in the same migration, the table was effectively unavailable for the whole duration.
Run at 11 a.m., with no lock_timeout. The migration ran during peak business hours, and without a lock_timeout it didn't fail fast when it couldn't get a clean lock — it queued, and application queries queued behind it, freezing the system. It also wasn't tested against production-scale data, so its true runtime was a surprise.

The safe version

The same change, restructured into safe steps run at low traffic:

-- (the column was already added nullable in an earlier migration — good)

-- 1. Backfill in BATCHES (a script/loop, committing each batch; re-runnable):
UPDATE orders SET status = 'completed'
WHERE status IS NULL AND order_id IN (
    SELECT order_id FROM orders WHERE status IS NULL LIMIT 10000
);
-- ...repeat until zero rows remain...

-- 2. Add the CHECK as NOT VALID (fast, no full scan under exclusive lock):
ALTER TABLE orders ADD CONSTRAINT status_valid CHECK (status IN (...)) NOT VALID;
-- 3. Validate it separately (scans, but doesn't block writes the way the inline add does):
ALTER TABLE orders VALIDATE CONSTRAINT status_valid;

-- 4. Add NOT NULL during a low-traffic window once backfill is complete.
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;

-- and globally for the session:
SET lock_timeout = '3s';   -- fail fast instead of freezing app queries

Run during a low-traffic window, with batching and lock_timeout, the same logical change applied without an outage — each step held locks only briefly, and normal order processing continued throughout.

The process changes

Beyond the technical fix, the team changed how migrations ship:

Migrations are reviewed for lock behavior, not just correctness. A reviewer now asks "does this scan/rewrite/lock a big table? is the data change batched?"
Big data backfills run as separate, batched jobs, not inside the schema migration.
Risky migrations run in maintenance/low-traffic windows, with lock_timeout set.
Migrations are tested on a production-sized dataset (a restored copy) to learn their real runtime before prod.

The analysis

The change was right; the execution was the disaster. Backfilling and constraining status was correct. Doing it as one big locking statement at peak hours turned a routine change into an incident. How a migration runs matters as much as what it does.
Batch large data migrations. A single 60M-row UPDATE is a long, lock-heavy mega-transaction. Batches of ~10k, each committed, let production traffic interleave and let a partial run resume.
SET NOT NULL and ADD CHECK/FK take scanning locks. Use NOT VALID + VALIDATE CONSTRAINT to avoid the long blocking scan, and add NOT NULL only after backfill, in a quiet window.
lock_timeout prevents a blocked migration from freezing the app. Without it, the migration queues for a lock and app queries pile up behind it; with it, the migration fails fast and you retry later — far better than an outage.
Test at scale and time it right. A migration's behavior on 100 dev rows tells you nothing about 60M prod rows. Test on a production-sized copy, and schedule risky changes for low traffic. (Chapter 14's "ALTER that locked the store," now generalized to migrations.)

Discussion questions

Identify the three compounding mistakes and the fix for each.
Why does one 60M-row UPDATE cause locking/availability problems that batches avoid?
How do NOT VALID + VALIDATE CONSTRAINT reduce lock time vs. an inline ADD CONSTRAINT?
What does lock_timeout change about the failure mode, and why is failing fast better here?
⭐ Write a review checklist for migrations that would have flagged this one before it merged.