Case Study 01: Deadlock Investigation at an Online Retailer
Background
FastCart is an online retailer processing 50,000 orders per hour during peak periods. Their DB2 LUW database handles inventory management, order processing, and payment processing. The system had been running smoothly for two years.
Three weeks ago, FastCart launched a "flash sale" feature that allows time-limited discounts on popular items. Since launch, the operations team has observed a dramatic increase in deadlock events — from fewer than 5 per hour to over 200 per hour during flash sale periods.
The Environment
- Platform: DB2 11.5 on Linux (RHEL 8)
- Key Tables:
  - INVENTORY — 500,000 SKUs, updated on every order
  - ORDERS — 100 million rows, append-only during order placement
  - ORDER_ITEMS — 300 million rows, append-only
  - PAYMENTS — 80 million rows
- Isolation Level: CS (Cursor Stability, the default) for all application code
- LOCKTIMEOUT: 30 seconds
- DLCHKTIME: 10000 (10 seconds)
- Application Architecture: Java Spring Boot with connection pooling (HikariCP, 200 connections)
Symptom Definition
- What: Deadlock count spiked from < 5/hour to > 200/hour
- When: Only during flash sale events (typically 2-3 times per week, lasting 2-4 hours each)
- Impact: Each deadlock results in one transaction being rolled back and retried. At 200/hour, this adds latency for 200 customers per hour and wastes database resources on rollback/retry. Customer complaints about "order failed, please try again" errors are increasing.
Investigation
Step 1: Confirm the Deadlock Rate
SELECT
DEADLOCKS,
LOCK_TIMEOUTS,
LOCK_ESCALS,
LOCK_WAITS,
LOCK_WAIT_TIME / 1000 AS LOCK_WAIT_SEC
FROM TABLE(MON_GET_DATABASE(-2)) AS T;
Results during a flash sale:
DEADLOCKS: 847 (accumulated over 4 hours)
LOCK_TIMEOUTS: 23
LOCK_ESCALS: 0
LOCK_WAITS: 45,230
LOCK_WAIT_SEC: 12,340
Observation: Deadlocks are high but lock escalations are zero. This is a pure deadlock problem, not one caused by row locks escalating to table locks.
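Note that MON_GET_DATABASE counters are cumulative since database activation, so hourly rates come from deltas between two snapshots. A minimal sketch of that calculation (the CounterRate class and its method names are hypothetical, not part of FastCart's code):

```java
public final class CounterRate {
    // MON_GET_DATABASE counters only grow; a rate requires the
    // difference between two samples divided by the elapsed time.
    public static double perHour(long earlier, long later, double hoursApart) {
        return (later - earlier) / hoursApart;
    }

    public static void main(String[] args) {
        // DEADLOCKS went from 0 to 847 over the 4-hour flash sale
        System.out.println(perHour(0, 847, 4.0)); // prints 211.75
    }
}
```

This is how the "212/hour" figure in the verification section is derived from the accumulated count of 847 over 4 hours.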
Step 2: Identify the Tables Involved
SELECT TABSCHEMA, TABNAME, LOCK_WAITS, LOCK_WAIT_TIME / 1000 AS WAIT_SEC
FROM TABLE(MON_GET_TABLE('', '', -2)) AS T
WHERE LOCK_WAITS > 0
ORDER BY LOCK_WAIT_TIME DESC
FETCH FIRST 5 ROWS ONLY;
Results:
TABSCHEMA  TABNAME     LOCK_WAITS   WAIT_SEC
FASTCART   INVENTORY       34,500      9,800
FASTCART   ORDERS           8,200      1,900
FASTCART   PAYMENTS         2,530        640
The INVENTORY table accounts for 79% of all lock wait time (9,800 of 12,340 total seconds).
Step 3: Analyze the Deadlock Graph
The DBA team enabled a locking event monitor. (The older FOR DEADLOCKS monitor type is deprecated; in DB2 11.5, deadlock details are captured by the LOCKING event monitor, which writes to an unformatted event table that can later be formatted with the db2evmonfmt tool.)
CREATE EVENT MONITOR DL_MON FOR LOCKING
  WRITE TO UNFORMATTED EVENT TABLE;

SET EVENT MONITOR DL_MON STATE = 1;
After capturing several deadlocks, analysis revealed a consistent pattern:
Deadlock Graph:
Transaction A (ORDER_PROCESS):
HOLDS: X lock on INVENTORY row SKU=12345
WAITS: X lock on INVENTORY row SKU=67890
Transaction B (ORDER_PROCESS):
HOLDS: X lock on INVENTORY row SKU=67890
WAITS: X lock on INVENTORY row SKU=12345
Both transactions are ORDER_PROCESS operations. Both are updating INVENTORY rows to decrement stock quantities. They are accessing the same two SKUs in opposite orders.
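The detector's view of this situation can be pictured as cycle detection on a wait-for graph, where each edge reads "transaction X waits for transaction Y" and a cycle is exactly a deadlock. A minimal sketch (the WaitForGraph class is hypothetical, and each transaction here waits for at most one other):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public final class WaitForGraph {
    // Each entry reads "transaction X currently waits for transaction Y".
    public static boolean hasCycle(Map<String, String> waitsFor) {
        for (String start : waitsFor.keySet()) {
            Set<String> seen = new HashSet<>();
            String cur = start;
            while (cur != null && seen.add(cur)) {
                cur = waitsFor.get(cur); // follow the chain of waiters
            }
            if (cur != null) {
                return true; // revisited a transaction: circular wait
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A holds SKU 12345 and waits for B; B holds SKU 67890 and waits for A
        System.out.println(hasCycle(Map.of("A", "B", "B", "A"))); // prints true
        System.out.println(hasCycle(Map.of("A", "B")));           // prints false
    }
}
```

DB2's deadlock detector performs essentially this check every DLCHKTIME interval (10 seconds here) and rolls back one participant of any cycle it finds.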
Step 4: Examine the Application Code
The order processing code (simplified):
public void processOrder(Order order) {
    for (OrderItem item : order.getItems()) {
        // Decrement inventory for each item in the order
        jdbcTemplate.update(
            "UPDATE inventory SET qty_available = qty_available - ? " +
            "WHERE sku_id = ? AND qty_available >= ?",
            item.getQuantity(), item.getSkuId(), item.getQuantity()
        );
    }
    // Insert order record
    jdbcTemplate.update("INSERT INTO orders ...");
    // Insert order items
    for (OrderItem item : order.getItems()) {
        jdbcTemplate.update("INSERT INTO order_items ...");
    }
}
Root Cause Found: The order processing loop updates inventory rows in the order they appear in the customer's shopping cart. When two customers order the same two items but in different cart order, the inventory updates happen in opposite order, creating the deadlock condition.
During flash sales, thousands of customers order the same popular items simultaneously, dramatically increasing the probability of this deadlock pattern.
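The precondition for this deadlock is that two carts share at least two SKUs and update them in opposite relative orders. A sketch of that check (the OrderingCheck class is hypothetical, used only to make the condition concrete):

```java
import java.util.List;

public final class OrderingCheck {
    // Two carts can produce this deadlock only if some pair of shared
    // SKUs appears in opposite relative order in the two carts.
    public static boolean oppositeOrderExists(List<Integer> cartA, List<Integer> cartB) {
        for (int i = 0; i < cartA.size(); i++) {
            for (int j = i + 1; j < cartA.size(); j++) {
                int posX = cartB.indexOf(cartA.get(i)); // updated first in cart A
                int posY = cartB.indexOf(cartA.get(j)); // updated second in cart A
                if (posX >= 0 && posY >= 0 && posY < posX) {
                    return true; // cart B updates the pair in reverse order
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The exact pattern from the captured deadlock graph
        System.out.println(oppositeOrderExists(
            List.of(12345, 67890), List.of(67890, 12345))); // prints true
    }
}
```

During normal operation few carts overlap, so this condition is rarely met; a flash sale funnels thousands of carts onto the same handful of SKUs, making it common.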
The Fix
Option 1: Sort Items Before Processing (Chosen)
public void processOrder(Order order) {
    // Sort items by SKU to ensure consistent lock ordering
    List<OrderItem> sortedItems = order.getItems().stream()
        .sorted(Comparator.comparing(OrderItem::getSkuId))
        .collect(Collectors.toList());

    for (OrderItem item : sortedItems) {
        jdbcTemplate.update(
            "UPDATE inventory SET qty_available = qty_available - ? " +
            "WHERE sku_id = ? AND qty_available >= ?",
            item.getQuantity(), item.getSkuId(), item.getQuantity()
        );
    }
    // ... rest of order processing
}
By sorting items by SKU before processing, every transaction that needs to update SKUs 12345 and 67890 will always update 12345 first, then 67890. The circular wait condition is eliminated.
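The reasoning can be demonstrated in isolation: sorting imposes a single total order over SKUs, so any two carts containing the same items produce identical update sequences. A small sketch (the LockOrdering class is hypothetical; only the sort mirrors the fix above):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public final class LockOrdering {
    // Sorting gives every transaction the same total order over SKUs,
    // so no two transactions can acquire the same locks in conflicting orders.
    public static List<Integer> updateOrder(List<Integer> cartSkus) {
        return cartSkus.stream()
                .sorted(Comparator.naturalOrder())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> cartA = updateOrder(List.of(67890, 12345));
        List<Integer> cartB = updateOrder(List.of(12345, 67890));
        System.out.println(cartA.equals(cartB)); // prints true
    }
}
```

Whatever order the customer added items in, both transactions now update SKU 12345 before 67890, so a circular wait cannot form.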
Option 2: Batch Update with Single Statement (Considered but rejected)
UPDATE inventory
SET qty_available = qty_available - CASE sku_id
WHEN 12345 THEN 2
WHEN 67890 THEN 1
END
WHERE sku_id IN (12345, 67890)
AND qty_available >= CASE sku_id
WHEN 12345 THEN 2
WHEN 67890 THEN 1
END;
This was rejected because DB2 does not guarantee the order of row updates within a single statement, and the CASE-based validation logic was considered too fragile for production.
Option 3: Retry Logic Enhancement (Supplementary)
Regardless of the primary fix, the retry logic was enhanced to back off and retry when Spring reports a deadlock victim (Spring translates DB2's deadlock SQLCODE -911 into DeadlockLoserDataAccessException):
@Retryable(value = DeadlockLoserDataAccessException.class, maxAttempts = 3,
           backoff = @Backoff(delay = 100, multiplier = 2))
public void processOrder(Order order) {
    // ... order processing with sorted items
}
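The backoff parameters deserve a moment's attention: with delay = 100 and multiplier = 2, the second attempt waits 100 ms and the third waits 200 ms. A sketch of that schedule (the BackoffDelays class is a hypothetical illustration, not Spring Retry internals):

```java
public final class BackoffDelays {
    // Exponential backoff: attempt 1 runs immediately; attempt n (n >= 2)
    // waits initialMs * multiplier^(n - 2) before running.
    public static long delayBeforeAttempt(int attempt, long initialMs, double multiplier) {
        return (long) (initialMs * Math.pow(multiplier, attempt - 2));
    }

    public static void main(String[] args) {
        for (int attempt = 2; attempt <= 3; attempt++) {
            System.out.println(delayBeforeAttempt(attempt, 100, 2.0)); // 100, then 200
        }
    }
}
```

The growing delay spreads retries apart in time, which reduces the chance that two retried transactions collide on the same hot rows again.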
Verification
Before Fix (During Flash Sale)
Deadlocks/hour: 212
Lock waits/hour: 11,300
Avg lock wait time: 273 ms
Order processing avg latency: 890 ms
Customer "order failed" errors: 180/hour
After Fix (During Flash Sale of Equal Magnitude)
Deadlocks/hour: 3
Lock waits/hour: 2,100
Avg lock wait time: 45 ms
Order processing avg latency: 210 ms
Customer "order failed" errors: 2/hour
The deadlock rate dropped by 98.6%. The remaining 3 deadlocks per hour are attributed to rare edge cases involving concurrent inventory restocking operations, which update inventory in a different code path. This secondary path was also corrected in a follow-up deployment.
Lessons Learned
- Lock ordering is the most effective deadlock prevention technique. A one-line code change (sorting by SKU) eliminated 98.6% of deadlocks.
- Deadlock patterns may be latent for years. The vulnerability existed since the system was built, but only manifested under the specific concurrency pattern of flash sales (many concurrent orders for the same items).
- The deadlock event monitor is essential for diagnosis. Without seeing the exact lock graph, the team would have spent hours guessing at the cause.
- Retry logic is necessary but not sufficient. Retries handle the symptom (the occasional deadlock); eliminating the root cause (inconsistent lock ordering) is far more effective.
- Performance testing should include concurrency scenarios. The flash sale feature was load-tested for throughput but not for concurrent access to the same data items.
Discussion Questions
- Why did the deadlock problem only appear during flash sales and not during normal operation?
- Could a different isolation level have prevented the deadlocks? What tradeoffs would be involved?
- What monitoring alerts should FastCart implement to detect future lock-related problems early?
- How would this problem manifest differently on z/OS with page-level locking instead of row-level locking?