Case Study 2 — The Intermittent Deadlock

DataField.Dev

Case Study 2 — The Intermittent Deadlock

Deadlocks that appear only under load, seemingly at random, are almost always inconsistent lock ordering. A money-transfer feature deadlocked sporadically until the team made every transaction lock accounts in the same order.

Background

A wallet service let users transfer balance between accounts. Each transfer, correctly, ran in a transaction and locked both accounts (with SELECT ... FOR UPDATE) before moving money — pessimistic concurrency control to prevent lost updates (Case Study 1). The code locked the accounts in the order they appeared in the request:

BEGIN;
SELECT balance FROM accounts WHERE account_id = :from FOR UPDATE;   -- lock sender
SELECT balance FROM accounts WHERE account_id = :to   FOR UPDATE;   -- lock receiver
UPDATE accounts SET balance = balance - :amt WHERE account_id = :from;
UPDATE accounts SET balance = balance + :amt WHERE account_id = :to;
COMMIT;

Most transfers worked. But under load, some transfers failed with:

ERROR: deadlock detected
DETAIL: Process A waits for ShareLock on ... Process B waits for ShareLock on ...

The failures were intermittent and seemingly random — impossible to reproduce on demand, frustrating to debug.

What went wrong: inconsistent lock ordering

The deadlock arose whenever two transfers ran concurrently in opposite directions between the same two accounts:

   Transfer T1: A → B   locks A (from), then wants B (to)
   Transfer T2: B → A   locks B (from), then wants A (to)

        T1 holds A, waits for B ──┐
        T2 holds B, waits for A ──┘   ← cycle: classic deadlock

Because each transfer locked accounts in request order (sender first), a transfer A→B locked A then B, while a simultaneous transfer B→A locked B then A. Each held what the other needed — a cyclic wait. PostgreSQL detected it and aborted one transaction (good — it didn't hang), but the abort surfaced as an error to the user, and the application had no retry logic. The randomness was just the timing of opposite-direction transfers between the same pair of accounts colliding.

The fix: consistent lock ordering (+ retry)

The cure for a deadlock cycle is to make cycles impossible by acquiring locks in a consistent global order — here, always lock the lower account id first, regardless of transfer direction:

BEGIN;
-- lock both accounts in a canonical order (lower id first), independent of from/to
SELECT balance FROM accounts WHERE account_id IN (:from, :to)
ORDER BY account_id
FOR UPDATE;                       -- acquires locks in account_id order
UPDATE accounts SET balance = balance - :amt WHERE account_id = :from;
UPDATE accounts SET balance = balance + :amt WHERE account_id = :to;
COMMIT;

Now every transfer between accounts A and B locks the lower-id account first, whichever direction the money flows. A→B and B→A both lock min(A,B) then max(A,B) — so one simply waits for the other; no cycle can form. The intermittent deadlocks vanished.

As defense in depth, the team also added retry-on-deadlock: deadlocks (and serialization failures) are recoverable, so the application now catches the error and re-runs the transaction a few times before giving up. Consistent ordering prevents the common deadlocks; retry handles any rare residual ones gracefully instead of erroring to the user.

The analysis

Inconsistent lock ordering causes deadlocks. When different transactions acquire the same set of locks in different orders, concurrent runs can form a cycle. This is the #1 cause of application deadlocks, and it's intermittent because it needs a specific concurrent collision.
Consistent ordering makes cycles impossible. If every transaction locks resources in the same canonical order (lower id first, alphabetical, by primary key — any consistent rule), no cycle can form: there's always a transaction that can proceed. This is the primary deadlock-prevention technique.
PostgreSQL detects, doesn't prevent. The database breaks deadlocks by aborting a victim (so the system never hangs), but the application must handle the abort — which means retry. An unhandled deadlock error is a recoverable condition leaking to the user.
Retry is the safety net, ordering is the fix. Consistent ordering eliminates the deadlocks you can foresee; retry handles the rare residual ones (and serialization failures under high isolation, Chapter 26). Use both: ordering to make deadlocks rare, retry to make the rare ones harmless.
Intermittent = concurrency. A bug that "happens randomly under load" and "can't be reproduced" is almost always a concurrency issue — a race or a deadlock. The fix is reasoning about interleavings, not staring at single-threaded code.

Discussion questions

Draw the lock cycle that two opposite-direction transfers create. Why is it intermittent?
How does "always lock the lower account id first" make the cycle impossible?
Why does PostgreSQL abort one transaction rather than hang, and what must the app do about it?
Why use both consistent ordering and retry, rather than just one?
⭐ Generalize the rule: for any operation that locks multiple rows, how do you choose a lock order that prevents deadlocks across the whole application?