Case Study 2 — The ALTER That Locked the Store
A "quick" schema change took a busy production database down for several minutes during peak hours. The lesson: on a live, populated table, how you run an
ALTERmatters as much as what it does — a preview of why migrations (Chapter 22) are their own discipline.
Background
A team needed to add a currency column to a large orders table (tens of millions of rows), defaulting to 'USD', and make it NOT NULL. A developer ran what looked like a single obvious statement during the afternoon:
-- Run on a live, 40-million-row orders table at 2pm ❌
ALTER TABLE orders ADD COLUMN currency text NOT NULL DEFAULT 'USD';
Within seconds, the checkout API began timing out. Orders couldn't be written. The on-call was paged; the store was effectively down for several minutes until the statement finished. What was supposed to be a trivial column add became an incident.
What went wrong: locks and table rewrites
ALTER TABLE takes an ACCESS EXCLUSIVE lock on the table — the strongest lock, which blocks all other access (reads and writes) for the duration. For most ALTERs that's fine because they're instantaneous (a metadata change). The danger is alterations that must rewrite every row or scan the whole table while holding that lock:
- On older PostgreSQL (pre-11),
ADD COLUMN ... DEFAULT <value>had to write the default into every existing row — a full table rewrite of 40M rows, all while holding the exclusive lock. Nothing else could touchordersuntil it finished. - Even on modern PostgreSQL (where a constant default is a fast metadata-only change), a volatile default (e.g.,
DEFAULT some_function()) or adding aCHECK/foreign-key constraint still scans/rewrites and holds the lock long enough to cause an outage on a hot table.
The developer's mental model — "adding a column is instant" — was true for a nullable, no-default column, but not for this combination on this much data at this traffic level.
The safe approach
The fix is to break the one risky statement into steps that each hold the lock only briefly, and to do it during low traffic:
-- 1. Add the column NULLable with no default — fast, metadata-only, brief lock.
ALTER TABLE orders ADD COLUMN currency text;
-- 2. Backfill in batches, so no single statement locks the table for long.
-- (Loop in the app or a script: update N thousand rows at a time, commit each.)
UPDATE orders SET currency = 'USD' WHERE currency IS NULL AND order_id BETWEEN 1 AND 100000;
-- ... repeat across id ranges ...
-- 3. Set the default for FUTURE rows (fast).
ALTER TABLE orders ALTER COLUMN currency SET DEFAULT 'USD';
-- 4. Add NOT NULL. To avoid a long lock from the validating scan, modern PostgreSQL
-- lets you add a NOT VALID check first, validate it concurrently, then enforce —
-- or run the SET NOT NULL during a maintenance window once backfill is done.
ALTER TABLE orders ADD CONSTRAINT currency_not_null CHECK (currency IS NOT NULL) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT currency_not_null; -- scans without an exclusive lock for writes
Each step is either metadata-only (brief lock) or batched (short locks), so the table stays available. The same care applies to indexes: CREATE INDEX CONCURRENTLY builds an index without blocking writes (Chapter 23), versus a plain CREATE INDEX which locks the table.
The analysis
-
ALTER TABLEtakes an exclusive lock. Quick metadata changes are harmless; changes that rewrite rows or scan the whole table hold that lock for the whole operation, blocking the application. On a hot table, that's an outage. -
"Adding a column is instant" is conditional. A nullable, no-default column is instant. A
NOT NULLadd, a non-constant default, or an added constraint may rewrite/scan. Know which kind ofALTERyou're running. -
Break risky changes into safe steps. Add nullable → backfill in batches → set default → enforce
NOT NULL(withNOT VALID/VALIDATEto avoid the long validating lock). Each step minimizes lock time. -
Timing and tooling matter. Run schema changes in low-traffic windows; use
CONCURRENTLYfor indexes; set alock_timeoutso a blockedALTERfails fast instead of queuing behind (and ahead of) application queries. -
This is why migrations are a discipline (Chapter 22). Production schema changes need to be reviewed, ordered, reversible, and lock-aware — not typed ad hoc into a shell at 2 p.m. The next chapter on schema evolution exists precisely because of incidents like this.
Discussion questions
- Why did a single
ADD COLUMN ... NOT NULL DEFAULTcause an outage on a 40M-row table? - Which kinds of
ALTERare metadata-only (safe) versus row-rewriting/scanning (risky)? - Rewrite the change as a sequence of low-lock steps and explain what each minimizes.
- What do
lock_timeoutandCREATE INDEX CONCURRENTLYcontribute to safe DDL? - ⭐ Why can't a simple
CHECKorSET NOT NULLalways be added instantly, and how doNOT VALID+VALIDATE CONSTRAINThelp?