Case Study 2 — The Intermittent Deadlock
Deadlocks that appear only under load, seemingly at random, are almost always inconsistent lock ordering. A money-transfer feature deadlocked sporadically until the team made every transaction lock accounts in the same order.
Background
A wallet service let users transfer balance between accounts. Each transfer, correctly, ran in a transaction and locked both accounts (with SELECT ... FOR UPDATE) before moving money — pessimistic concurrency control to prevent lost updates (Case Study 1). The code locked the accounts in the order they appeared in the request:
BEGIN;
SELECT balance FROM accounts WHERE account_id = :from FOR UPDATE; -- lock sender
SELECT balance FROM accounts WHERE account_id = :to FOR UPDATE; -- lock receiver
UPDATE accounts SET balance = balance - :amt WHERE account_id = :from;
UPDATE accounts SET balance = balance + :amt WHERE account_id = :to;
COMMIT;
Most transfers worked. But under load, some transfers failed with:
ERROR: deadlock detected
DETAIL: Process A waits for ShareLock on ... Process B waits for ShareLock on ...
The failures were intermittent and seemingly random — impossible to reproduce on demand, frustrating to debug.
What went wrong: inconsistent lock ordering
The deadlock arose whenever two transfers ran concurrently in opposite directions between the same two accounts:
Transfer T1: A → B locks A (from), then wants B (to)
Transfer T2: B → A locks B (from), then wants A (to)
T1 holds A, waits for B ──┐
T2 holds B, waits for A ──┘ ← cycle: classic deadlock
Because each transfer locked accounts in request order (sender first), a transfer A→B locked A then B, while a simultaneous transfer B→A locked B then A. Each held what the other needed — a cyclic wait. PostgreSQL detected it and aborted one transaction (good — it didn't hang), but the abort surfaced as an error to the user, and the application had no retry logic. The randomness was just the timing of opposite-direction transfers between the same pair of accounts colliding.
The fix: consistent lock ordering (+ retry)
The cure for a deadlock cycle is to make cycles impossible by acquiring locks in a consistent global order — here, always lock the lower account id first, regardless of transfer direction:
BEGIN;
-- lock both accounts in a canonical order (lower id first), independent of from/to
SELECT balance FROM accounts WHERE account_id IN (:from, :to)
ORDER BY account_id
FOR UPDATE; -- acquires locks in account_id order
UPDATE accounts SET balance = balance - :amt WHERE account_id = :from;
UPDATE accounts SET balance = balance + :amt WHERE account_id = :to;
COMMIT;
Now every transfer between accounts A and B locks the lower-id account first, whichever direction the money flows. A→B and B→A both lock min(A,B) then max(A,B) — so one simply waits for the other; no cycle can form. The intermittent deadlocks vanished.
As defense in depth, the team also added retry-on-deadlock: deadlocks (and serialization failures) are recoverable, so the application now catches the error and re-runs the transaction a few times before giving up. Consistent ordering prevents the common deadlocks; retry handles any rare residual ones gracefully instead of erroring to the user.
The analysis
-
Inconsistent lock ordering causes deadlocks. When different transactions acquire the same set of locks in different orders, concurrent runs can form a cycle. This is the #1 cause of application deadlocks, and it's intermittent because it needs a specific concurrent collision.
-
Consistent ordering makes cycles impossible. If every transaction locks resources in the same canonical order (lower id first, alphabetical, by primary key — any consistent rule), no cycle can form: there's always a transaction that can proceed. This is the primary deadlock-prevention technique.
-
PostgreSQL detects, doesn't prevent. The database breaks deadlocks by aborting a victim (so the system never hangs), but the application must handle the abort — which means retry. An unhandled deadlock error is a recoverable condition leaking to the user.
-
Retry is the safety net, ordering is the fix. Consistent ordering eliminates the deadlocks you can foresee; retry handles the rare residual ones (and serialization failures under high isolation, Chapter 26). Use both: ordering to make deadlocks rare, retry to make the rare ones harmless.
-
Intermittent = concurrency. A bug that "happens randomly under load" and "can't be reproduced" is almost always a concurrency issue — a race or a deadlock. The fix is reasoning about interleavings, not staring at single-threaded code.
Discussion questions
- Draw the lock cycle that two opposite-direction transfers create. Why is it intermittent?
- How does "always lock the lower account id first" make the cycle impossible?
- Why does PostgreSQL abort one transaction rather than hang, and what must the app do about it?
- Why use both consistent ordering and retry, rather than just one?
- ⭐ Generalize the rule: for any operation that locks multiple rows, how do you choose a lock order that prevents deadlocks across the whole application?