Case Study 01: Deadlock Investigation at an Online Retailer
Background
FastCart is an online retailer processing 50,000 orders per hour during peak periods. Their DB2 LUW database handles inventory management, order processing, and payment processing. The system had been running smoothly for two years.
Three weeks ago, FastCart launched a "flash sale" feature that allows time-limited discounts on popular items. Since launch, the operations team has observed a dramatic increase in deadlock events — from fewer than 5 per hour to over 200 per hour during flash sale periods.
The Environment
- Platform: DB2 11.5 on Linux (RHEL 8)
- Key Tables:
  - INVENTORY — 500,000 SKUs, updated on every order
  - ORDERS — 100 million rows, append-only during order placement
  - ORDER_ITEMS — 300 million rows, append-only
  - PAYMENTS — 80 million rows
- Isolation Level: CS (Cursor Stability, the default) for all application code
- LOCKTIMEOUT: 30 seconds
- DLCHKTIME: 10000 (10 seconds)
- Application Architecture: Java Spring Boot with connection pooling (HikariCP, 200 connections)
Symptom Definition
- What: Deadlock count spiked from < 5/hour to > 200/hour
- When: Only during flash sale events (typically 2-3 times per week, lasting 2-4 hours each)
- Impact: Each deadlock results in one transaction being rolled back and retried. At 200/hour, this adds latency for 200 customers per hour and wastes database resources on rollback/retry. Customer complaints about "order failed, please try again" errors are increasing.
Investigation
Step 1: Confirm the Deadlock Rate
SELECT
DEADLOCKS,
LOCK_TIMEOUTS,
LOCK_ESCALS,
LOCK_WAITS,
LOCK_WAIT_TIME / 1000 AS LOCK_WAIT_SEC
FROM TABLE(MON_GET_DATABASE(-2)) AS T;
Results during a flash sale:
DEADLOCKS: 847 (accumulated over 4 hours)
LOCK_TIMEOUTS: 23
LOCK_ESCALS: 0
LOCK_WAITS: 45,230
LOCK_WAIT_SEC: 12,340
Observation: Deadlocks are high but lock escalations are zero. This is a pure deadlock problem, not one caused by row locks escalating to table locks.
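Note that MON_GET_DATABASE counters are cumulative since database activation, so hourly rates come from deltas between two snapshots. A minimal sketch of that calculation (the CounterRate class and its method names are hypothetical, not part of FastCart's code):

```java
public final class CounterRate {
    // MON_GET_DATABASE counters only grow; a rate requires the
    // difference between two samples divided by the elapsed time.
    public static double perHour(long earlier, long later, double hoursApart) {
        return (later - earlier) / hoursApart;
    }

    public static void main(String[] args) {
        // DEADLOCKS went from 0 to 847 over the 4-hour flash sale
        System.out.println(perHour(0, 847, 4.0)); // prints 211.75
    }
}
```

This is how the "212/hour" figure in the verification section is derived from the accumulated count of 847 over 4 hours.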
Step 2: Identify the Tables Involved
SELECT TABSCHEMA, TABNAME, LOCK_WAITS, LOCK_WAIT_TIME / 1000 AS WAIT_SEC
FROM TABLE(MON_GET_TABLE('', '', -2)) AS T
WHERE LOCK_WAITS > 0
ORDER BY LOCK_WAIT_TIME DESC
FETCH FIRST 5 ROWS ONLY;
Results:
TABSCHEMA  TABNAME     LOCK_WAITS   WAIT_SEC
FASTCART   INVENTORY       34,500      9,800
FASTCART   ORDERS           8,200      1,900
FASTCART   PAYMENTS         2,530        640
The INVENTORY table accounts for 79% of all lock wait time (9,800 of 12,340 total seconds).
Step 3: Analyze the Deadlock Graph
The DBA team enabled a locking event monitor. (The older FOR DEADLOCKS monitor type is deprecated; in DB2 11.5, deadlock details are captured by the LOCKING event monitor, which writes to an unformatted event table that can later be formatted with the db2evmonfmt tool.)
CREATE EVENT MONITOR DL_MON FOR LOCKING
  WRITE TO UNFORMATTED EVENT TABLE;

SET EVENT MONITOR DL_MON STATE = 1;
After capturing several deadlocks, analysis revealed a consistent pattern:
Deadlock Graph:
Transaction A (ORDER_PROCESS):
HOLDS: X lock on INVENTORY row SKU=12345
WAITS: X lock on INVENTORY row SKU=67890
Transaction B (ORDER_PROCESS):
HOLDS: X lock on INVENTORY row SKU=67890
WAITS: X lock on INVENTORY row SKU=12345
Both transactions are ORDER_PROCESS operations. Both are updating INVENTORY rows to decrement stock quantities. They are accessing the same two SKUs in opposite orders.
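The detector's view of this situation can be pictured as cycle detection on a wait-for graph, where each edge reads "transaction X waits for transaction Y" and a cycle is exactly a deadlock. A minimal sketch (the WaitForGraph class is hypothetical, and each transaction here waits for at most one other):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public final class WaitForGraph {
    // Each entry reads "transaction X currently waits for transaction Y".
    public static boolean hasCycle(Map<String, String> waitsFor) {
        for (String start : waitsFor.keySet()) {
            Set<String> seen = new HashSet<>();
            String cur = start;
            while (cur != null && seen.add(cur)) {
                cur = waitsFor.get(cur); // follow the chain of waiters
            }
            if (cur != null) {
                return true; // revisited a transaction: circular wait
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A holds SKU 12345 and waits for B; B holds SKU 67890 and waits for A
        System.out.println(hasCycle(Map.of("A", "B", "B", "A"))); // prints true
        System.out.println(hasCycle(Map.of("A", "B")));           // prints false
    }
}
```

DB2's deadlock detector performs essentially this check every DLCHKTIME interval (10 seconds here) and rolls back one participant of any cycle it finds.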
Step 4: Examine the Application Code
The order processing code (simplified):
public void processOrder(Order order) {
    for (OrderItem item : order.getItems()) {
        // Decrement inventory for each item in the order
        jdbcTemplate.update(
            "UPDATE inventory SET qty_available = qty_available - ? " +
            "WHERE sku_id = ? AND qty_available >= ?",
            item.getQuantity(), item.getSkuId(), item.getQuantity()
        );
    }
    // Insert order record
    jdbcTemplate.update("INSERT INTO orders ...");
    // Insert order items
    for (OrderItem item : order.getItems()) {
        jdbcTemplate.update("INSERT INTO order_items ...");
    }
}
Root Cause Found: The order processing loop updates inventory rows in the order they appear in the customer's shopping cart. When two customers order the same two items but in different cart order, the inventory updates happen in opposite order, creating the deadlock condition.
During flash sales, thousands of customers order the same popular items simultaneously, dramatically increasing the probability of this deadlock pattern.
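The precondition for this deadlock is that two carts share at least two SKUs and update them in opposite relative orders. A sketch of that check (the OrderingCheck class is hypothetical, used only to make the condition concrete):

```java
import java.util.List;

public final class OrderingCheck {
    // Two carts can produce this deadlock only if some pair of shared
    // SKUs appears in opposite relative order in the two carts.
    public static boolean oppositeOrderExists(List<Integer> cartA, List<Integer> cartB) {
        for (int i = 0; i < cartA.size(); i++) {
            for (int j = i + 1; j < cartA.size(); j++) {
                int posX = cartB.indexOf(cartA.get(i)); // updated first in cart A
                int posY = cartB.indexOf(cartA.get(j)); // updated second in cart A
                if (posX >= 0 && posY >= 0 && posY < posX) {
                    return true; // cart B updates the pair in reverse order
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The exact pattern from the captured deadlock graph
        System.out.println(oppositeOrderExists(
            List.of(12345, 67890), List.of(67890, 12345))); // prints true
    }
}
```

During normal operation few carts overlap, so this condition is rarely met; a flash sale funnels thousands of carts onto the same handful of SKUs, making it common.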
The Fix
Option 1: Sort Items Before Processing (Chosen)
public void processOrder(Order order) {
    // Sort items by SKU to ensure consistent lock ordering
    List<OrderItem> sortedItems = order.getItems().stream()
        .sorted(Comparator.comparing(OrderItem::getSkuId))
        .collect(Collectors.toList());

    for (OrderItem item : sortedItems) {
        jdbcTemplate.update(
            "UPDATE inventory SET qty_available = qty_available - ? " +
            "WHERE sku_id = ? AND qty_available >= ?",
            item.getQuantity(), item.getSkuId(), item.getQuantity()
        );
    }
    // ... rest of order processing
}
By sorting items by SKU before processing, every transaction that needs to update SKUs 12345 and 67890 will always update 12345 first, then 67890. The circular wait condition is eliminated.
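The reasoning can be demonstrated in isolation: sorting imposes a single total order over SKUs, so any two carts containing the same items produce identical update sequences. A small sketch (the LockOrdering class is hypothetical; only the sort mirrors the fix above):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public final class LockOrdering {
    // Sorting gives every transaction the same total order over SKUs,
    // so no two transactions can acquire the same locks in conflicting orders.
    public static List<Integer> updateOrder(List<Integer> cartSkus) {
        return cartSkus.stream()
                .sorted(Comparator.naturalOrder())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> cartA = updateOrder(List.of(67890, 12345));
        List<Integer> cartB = updateOrder(List.of(12345, 67890));
        System.out.println(cartA.equals(cartB)); // prints true
    }
}
```

Whatever order the customer added items in, both transactions now update SKU 12345 before 67890, so a circular wait cannot form.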
Option 2: Batch Update with Single Statement (Considered but rejected)
UPDATE inventory
SET qty_available = qty_available - CASE sku_id
WHEN 12345 THEN 2
WHEN 67890 THEN 1
END
WHERE sku_id IN (12345, 67890)
AND qty_available >= CASE sku_id
WHEN 12345 THEN 2
WHEN 67890 THEN 1
END;
This was rejected because DB2 does not guarantee the order of row updates within a single statement, and the CASE-based validation logic was considered too fragile for production.
Option 3: Retry Logic Enhancement (Supplementary)
Regardless of the primary fix, the retry logic was enhanced to back off and retry when Spring reports a deadlock victim (Spring translates DB2's deadlock SQLCODE -911 into DeadlockLoserDataAccessException):
@Retryable(value = DeadlockLoserDataAccessException.class, maxAttempts = 3,
           backoff = @Backoff(delay = 100, multiplier = 2))
public void processOrder(Order order) {
    // ... order processing with sorted items
}
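The backoff parameters deserve a moment's attention: with delay = 100 and multiplier = 2, the second attempt waits 100 ms and the third waits 200 ms. A sketch of that schedule (the BackoffDelays class is a hypothetical illustration, not Spring Retry internals):

```java
public final class BackoffDelays {
    // Exponential backoff: attempt 1 runs immediately; attempt n (n >= 2)
    // waits initialMs * multiplier^(n - 2) before running.
    public static long delayBeforeAttempt(int attempt, long initialMs, double multiplier) {
        return (long) (initialMs * Math.pow(multiplier, attempt - 2));
    }

    public static void main(String[] args) {
        for (int attempt = 2; attempt <= 3; attempt++) {
            System.out.println(delayBeforeAttempt(attempt, 100, 2.0)); // 100, then 200
        }
    }
}
```

The growing delay spreads retries apart in time, which reduces the chance that two retried transactions collide on the same hot rows again.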
Verification
Before Fix (During Flash Sale)
Deadlocks/hour: 212
Lock waits/hour: 11,300
Avg lock wait time: 273 ms
Order processing avg latency: 890 ms
Customer "order failed" errors: 180/hour
After Fix (During Flash Sale of Equal Magnitude)
Deadlocks/hour: 3
Lock waits/hour: 2,100
Avg lock wait time: 45 ms
Order processing avg latency: 210 ms
Customer "order failed" errors: 2/hour
The deadlock rate dropped by 98.6%. The remaining 3 deadlocks per hour are attributed to rare edge cases involving concurrent inventory restocking operations, which update inventory in a different code path. This secondary path was also corrected in a follow-up deployment.
Lessons Learned
- Lock ordering is the most effective deadlock prevention technique. A one-line code change (sorting by SKU) eliminated 98.6% of deadlocks.
- Deadlock patterns may be latent for years. The vulnerability existed since the system was built, but only manifested under the specific concurrency pattern of flash sales (many concurrent orders for the same items).
- The deadlock event monitor is essential for diagnosis. Without seeing the exact lock graph, the team would have spent hours guessing at the cause.
- Retry logic is necessary but not sufficient. Retries handle the symptom (the occasional deadlock); eliminating the root cause (inconsistent lock ordering) is far more effective.
- Performance testing should include concurrency scenarios. The flash sale feature was load-tested for throughput but not for concurrent access to the same data items.
Discussion Questions
- Why did the deadlock problem only appear during flash sales and not during normal operation?
- Could a different isolation level have prevented the deadlocks? What tradeoffs would be involved?
- What monitoring alerts should FastCart implement to detect future lock-related problems early?
- How would this problem manifest differently on z/OS with page-level locking instead of row-level locking?