Case Study 1: CNB's Production Deadlock Investigation
The Incident
- Date: Tuesday, March 14
- Time of first alert: 02:07 EST
- Systems affected: Core banking batch cycle, online wire transfer processing
- Duration: 4 hours 23 minutes
- Business impact: Branch network opening delayed 47 minutes; 12,400 wire transfers queued; 3 SLA violations with correspondent banks
Background
Continental National Bank processes approximately 500 million transactions per day across its core banking platform. The platform runs on a z15 Parallel Sysplex with two DB2 data sharing members. Online processing runs 23.5 hours per day (30-minute maintenance window at 3:00am Sunday). The nightly batch cycle runs from midnight to 4:00am while online remains active.
The core banking team:

- Kwame Asante, Lead Systems Programmer, 22 years at CNB
- Lisa Tran, Senior DB2 Application Programmer, 6 years at CNB
- Rob Chen, Senior DBA, 14 years at CNB
The Night It Happened
02:07 — First Alert
The automated monitoring system (IBM OMEGAMON for DB2) triggered a P1 alert: batch job CNBBAT01 abended with SQLCODE -911, reason code 00C90088 (deadlock). The restartable batch framework automatically attempted a restart.
Lisa Tran received the page and logged in remotely.
02:12 — Initial Assessment
Lisa checked the batch job log:
DSNT501I DSNILMCL RESOURCE UNAVAILABLE
REASON 00C90088
TYPE 00000200
RESOURCE DBACCT.TSACCTM.X'00001A2F'
DSNT375I DEADLOCK DETECTED - VICTIM IS PLAN CNBBAT01
CORRELATION-ID BATCH01
CONNECTION-ID TSO
LUW-ID CNBDB201.CNBLUW.B7F22A01A2C3
Key observations:

- Deadlock, not timeout (reason 00C90088, not 00C9008E)
- Resource is in tablespace DBACCT.TSACCTM — the main account master tablespace
- The batch plan CNBBAT01 was chosen as victim
02:15 — Calling for Backup
Lisa called Rob Chen. While on the phone, the batch restart hit another deadlock. Same tablespace, different page.
02:23 — Pulling IFCID 0172
Rob activated the IFCID 0172 trace (it should have been active already — this was flagged as a process gap). He pulled the deadlock records from the DB2 trace:
IFCID 0172 DEADLOCK RECORD
==========================================
DEADLOCK TIME: 2024-03-14 02:11:43.287421
PARTICIPANT 1:
PLAN NAME: CNBBAT01
CORR ID: BATCH01
CONN TYPE: TSO BATCH
RESOURCE HELD:
DBID=0042 OBID=0007 TYPE=ROW
PAGENO=X'00001A2F' ROW=X'03'
LOCK MODE: X (EXCLUSIVE)
DURATION: 00:00:14.332
RESOURCE WAITED:
DBID=0042 OBID=0007 TYPE=ROW
PAGENO=X'00003B71' ROW=X'11'
LOCK MODE: X (EXCLUSIVE)
LOCK WAIT TIME: 00:00:02.104
PARTICIPANT 2:
PLAN NAME: CNBWIRE
CORR ID: WIRE0847
CONN TYPE: CICS
RESOURCE HELD:
DBID=0042 OBID=0007 TYPE=ROW
PAGENO=X'00003B71' ROW=X'11'
LOCK MODE: X (EXCLUSIVE)
DURATION: 00:00:00.847
RESOURCE WAITED:
DBID=0042 OBID=0007 TYPE=ROW
PAGENO=X'00001A2F' ROW=X'03'
LOCK MODE: X (EXCLUSIVE)
LOCK WAIT TIME: 00:00:02.104
VICTIM SELECTED: PARTICIPANT 1 (CNBBAT01)
VICTIM REASON: LOWER PRIORITY (BATCH vs ONLINE)
==========================================
02:30 — Diagnosis
Rob decoded the resource identifiers:
SELECT DBNAME, NAME, CREATOR
FROM SYSIBM.SYSTABLESPACE
WHERE DBID = 66 -- X'0042'
AND OBID = 7; -- X'0007'
Result: DBACCT.TSACCTM — confirmed as the ACCT_MASTER tablespace.
Lisa and Rob mapped the pages to account numbers using the row data:

- Page X'00001A2F', Row X'03' = Account 1000447
- Page X'00003B71', Row X'11' = Account 1000892
The deadlock chain:
| Time | CNBBAT01 (Batch) | CNBWIRE (Online) |
|---|---|---|
| T1 | UPDATE Account 1000447 (gets X lock) | |
| T2 | | Wire transfer: debit Account 1000892 (gets X lock) |
| T3 | UPDATE Account 1000892 — WAITS (X held by CNBWIRE) | |
| T4 | | Wire transfer: credit Account 1000447 — WAITS (X held by CNBBAT01) |
| T5 | DEADLOCK — IRLM detects cycle, rolls back CNBBAT01 | Proceeds |
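The cycle in the table above can be modeled as a small waits-for graph, which is conceptually what IRLM checks at each deadlock detection interval. The Python sketch below is illustrative only (it is not the IRLM algorithm); it builds the graph from the two IFCID 0172 participants and finds the cycle:

```python
# Illustrative waits-for graph for the 02:11:43 deadlock (not IRLM internals).
# An edge A -> B means "transaction A waits for a lock held by transaction B".

def find_cycle(waits_for):
    """Return a deadlock cycle as a list of transactions, or None."""
    for start in waits_for:
        path, node = [], start
        while node is not None:
            if node in path:
                return path[path.index(node):]  # cycle found
            path.append(node)
            node = waits_for.get(node)
    return None

# From the IFCID 0172 record: each participant waits on a row the other holds.
holders = {("X'00001A2F'", "X'03'"): "CNBBAT01",   # Account 1000447
           ("X'00003B71'", "X'11'"): "CNBWIRE"}    # Account 1000892

waits_for = {
    "CNBBAT01": holders[("X'00003B71'", "X'11'")],  # batch waits on wire's row
    "CNBWIRE": holders[("X'00001A2F'", "X'03'")],   # wire waits on batch's row
}

cycle = find_cycle(waits_for)
print(cycle)  # two-transaction cycle between CNBBAT01 and CNBWIRE
```

With only two participants the cycle is trivial, but the same traversal finds longer chains (A waits on B waits on C waits on A), which is why the full IFCID 0172 record lists every participant, not just the victim.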
02:45 — Root Cause Identification
Lisa examined the CNBBAT01 source code. The batch program processed accounts in account-number order, which was correct. But the problem was deeper:
CNBBAT01 processed accounts sequentially (1000001, 1000002, ..., 1000447, ..., 1000892). It held locks on all accounts processed since the last COMMIT. The commit frequency was every 50,000 rows.
This meant when CNBBAT01 was processing account 1000892, it still held the X lock on account 1000447 — it had processed that account ~445 rows earlier, and the next COMMIT was ~49,555 rows away.
CNBWIRE processed wire transfers as they arrived. A wire from Account 1000892 to Account 1000447 was submitted at 02:11am. CNBWIRE locked Account 1000892 first (the debit side) and then tried to lock Account 1000447 (the credit side).
Two root causes:

1. Excessive lock holding by batch. Committing every 50,000 rows meant holding up to 50,000 X locks simultaneously. Any online transaction needing any of those 50,000 rows would contend.
2. Different access orders. Batch accessed accounts in ascending order. The wire program accessed accounts in transaction order (debit first, credit second), which was effectively random relative to account number.
02:50 — Why It Hadn't Happened Before
Lisa asked the question everyone was thinking: this batch program had been running for 3 years. Why now?
Rob checked the tablespace DDL change history. Two weeks earlier, the DBA team had changed ACCT_MASTER from LOCKSIZE PAGE to LOCKSIZE ROW as part of a "concurrency improvement initiative."
Under LOCKSIZE PAGE, the batch program acquired ~1,250 page locks for 50,000 rows (40 rows per page). LOCKMAX was set to 5,000 — well above 1,250.
Under LOCKSIZE ROW, the batch program acquired 50,000 row locks. LOCKMAX was still 5,000.
Lock escalation was occurring. The batch program was hitting LOCKMAX at roughly 5,000 rows into each commit interval, escalating to a tablespace-level X lock, and blocking ALL online access to ACCT_MASTER for the remainder of the commit interval.
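The escalation arithmetic can be checked directly. A back-of-the-envelope model, using the figures from the text (50,000-row commit interval, ~40 rows per page, LOCKMAX 5,000):

```python
# Lock counts per commit interval before and after the LOCKSIZE change.
# All figures come from the case study; this is arithmetic, not DB2 behavior.

ROWS_PER_COMMIT = 50_000
ROWS_PER_PAGE = 40
LOCKMAX = 5_000

page_locks = ROWS_PER_COMMIT // ROWS_PER_PAGE   # under LOCKSIZE PAGE
row_locks = ROWS_PER_COMMIT                     # under LOCKSIZE ROW

print(f"PAGE: {page_locks:>6} locks, escalates: {page_locks > LOCKMAX}")
print(f"ROW:  {row_locks:>6} locks, escalates: {row_locks > LOCKMAX}")

# Under LOCKSIZE ROW the batch hits LOCKMAX after ~5,000 rows, about 10% of
# the way into each commit interval; the remaining ~90% runs behind a
# tablespace-level X lock.
escalation_point_pct = LOCKMAX / ROWS_PER_COMMIT * 100
print(f"Escalation at {escalation_point_pct:.0f}% of each commit interval")
```

This is why the change was invisible under LOCKSIZE PAGE (1,250 locks, comfortably under LOCKMAX) and immediately pathological under LOCKSIZE ROW.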
The deadlock itself happened in the brief window early in a commit interval, before the batch reached LOCKMAX: CNBBAT01 still held individual row locks when a wire transfer arrived that needed two of those specific accounts. (After escalation, online work simply queued behind the tablespace lock; it could time out, but not deadlock.)
But the real damage was the escalation: for most of the batch run, the entire ACCT_MASTER tablespace was exclusively locked by the batch program.
03:00 — Immediate Fix
Rob increased LOCKMAX to 60,000 to prevent escalation while the team designed a proper fix. This was a stopgap — it consumed more IRLM storage but eliminated the immediate problem.
Lisa manually restarted the batch cycle. It completed at 05:23am — 1 hour 23 minutes late, causing the delayed branch opening.
Post-Incident Design Review
Kwame convened a formal design review the following week. The team produced a comprehensive remediation plan.
Finding 1: Batch COMMIT Frequency
Problem: CNBBAT01 committed every 50,000 rows. Under LOCKSIZE ROW, this guaranteed escalation: each commit interval accumulated up to 50,000 row locks against a LOCKMAX of 5,000. Even the emergency LOCKMAX of 60,000 left only a thin margin above the lock count, which is why the commit interval itself had to shrink.
Fix: Reduce the commit interval to 500 rows. Analysis showed that COMMIT overhead at this interval added approximately 4% to batch elapsed time, acceptable since the batch typically completed in 2.5 hours against its 4-hour window.
Code change:
* BEFORE:
01 WS-COMMIT-FREQ PIC S9(8) COMP VALUE 50000.
* AFTER:
01 WS-COMMIT-FREQ PIC S9(8) COMP VALUE 500.
The cursor was also changed to WITH HOLD to maintain position across commits.
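The "+4%" estimate can be sanity-checked with a simple model. The total row count and per-COMMIT cost below are illustrative assumptions (real cost depends on log-write latency and commit scope); the elapsed time and commit intervals come from the case study:

```python
# Sanity check on the "+4% elapsed time" estimate for 500-row commits.
# ROWS_UPDATED and COST_PER_COMMIT_MS are assumed for illustration; the
# elapsed time and commit intervals are from the case study.

BASELINE_ELAPSED_S = 2.5 * 3600      # batch elapsed time before the change
OLD_INTERVAL, NEW_INTERVAL = 50_000, 500
ROWS_UPDATED = 10_000_000            # assumed size of the nightly update
COST_PER_COMMIT_MS = 18              # assumed log-force cost per COMMIT

old_commits = ROWS_UPDATED // OLD_INTERVAL
new_commits = ROWS_UPDATED // NEW_INTERVAL
extra_s = (new_commits - old_commits) * COST_PER_COMMIT_MS / 1000

overhead_pct = extra_s / BASELINE_ELAPSED_S * 100
print(f"{new_commits - old_commits} extra COMMITs -> +{overhead_pct:.1f}% elapsed")
```

The point of such a model is the shape of the trade-off: COMMIT overhead grows linearly as the interval shrinks, while maximum locks held (and worst-case lock wait for online work) shrinks linearly, so the "right" interval is wherever the elapsed-time budget meets the concurrency target.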
Finding 2: Wire Transfer Access Order
Problem: CNBWIRE locked accounts in the order (debit account, credit account). For a transfer from Account A to Account B, it locked A first, then B. If A > B, this was the opposite order from batch (which processes in ascending order).
Fix: CNBWIRE was modified to always lock the lower-numbered account first, regardless of which was the debit and which was the credit.
* AFTER: Always lock lower account first
IF WS-DEBIT-ACCT < WS-CREDIT-ACCT
PERFORM LOCK-AND-DEBIT
PERFORM LOCK-AND-CREDIT
ELSE
PERFORM LOCK-AND-CREDIT
PERFORM LOCK-AND-DEBIT
END-IF
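The same rule can be sketched language-neutrally (this is a hypothetical Python illustration, not the CNBWIRE code): when every transaction acquires a shared pair of locks in ascending account-number order, a two-party cycle is impossible, because both parties request the pair in the same order.

```python
import threading

# Hypothetical sketch of the ordering rule: always lock the lower-numbered
# account first, regardless of which side is the debit and which the credit.

locks = {1000447: threading.Lock(), 1000892: threading.Lock()}
balances = {1000447: 500.0, 1000892: 500.0}

def transfer(debit_acct, credit_acct, amount):
    first, second = sorted((debit_acct, credit_acct))  # the ordering rule
    with locks[first]:
        with locks[second]:
            balances[debit_acct] -= amount
            balances[credit_acct] += amount

# Opposite-direction transfers: exactly the pattern that used to deadlock.
t1 = threading.Thread(target=transfer, args=(1000447, 1000892, 100.0))
t2 = threading.Thread(target=transfer, args=(1000892, 1000447, 100.0))
t1.start(); t2.start(); t1.join(); t2.join()
print(balances)  # both transfers complete; no deadlock is possible
```

Without the `sorted()` step, the two threads could each take their first lock and then wait forever on the other's, which is the T1-T5 chain from the incident in miniature.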
Finding 3: LOCKMAX Configuration
Problem: When LOCKSIZE was changed from PAGE to ROW, LOCKMAX was not recalculated. The same LOCKMAX (5,000) that was adequate for page locks was grossly inadequate for row locks.
Fix: LOCKMAX set to 10,000 as a safety net. With COMMIT every 500 rows, the maximum lock count would be well below this, but the higher LOCKMAX provides margin for abnormal conditions.
Finding 4: IFCID 0172 Trace
Problem: IFCID 0172 was not active in production. Rob had to activate it after the incident, losing the trace data from the first deadlock.
Fix: IFCID 0172 is now permanently active in production. The trace overhead is negligible (it only writes records when deadlocks occur). Kwame's standing order: "If there are no deadlocks, this trace costs us nothing. If there are deadlocks, this trace is priceless."
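One way to keep such a trace permanently active is a DB2 -START TRACE command issued at subsystem startup. The class and destination below are site choices shown for illustration; the general DB2 for z/OS form uses a user-defined performance class with an explicit IFCID list:

```
-START TRACE(PERFM) CLASS(30) IFCID(172) DEST(SMF)
```

Whatever the exact form, the command belongs in the subsystem startup automation, so the trace survives a DB2 recycle rather than depending on someone remembering to reissue it.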
Finding 5: Monitoring
Problem: The monitoring system alerted on batch abends but not on lock escalation. The escalation had been occurring for two weeks (since the LOCKSIZE change) but went undetected because no transactions happened to deadlock until March 14.
Fix: Real-time alert on IFCID 0337 (lock escalation). Any escalation event triggers a P2 alert for investigation within 24 hours. Target: zero escalations.
Finding 6: Change Management
Problem: The LOCKSIZE change was approved through the standard database change process, but the review did not include application programmers. No one analyzed the impact on batch lock counts.
Fix: Any DDL change to LOCKSIZE or LOCKMAX requires signoff from both the DBA team and the application programming team lead. A mandatory checklist now includes: "Has the impact on batch commit intervals been analyzed?"
Results
The remediation was implemented over three weeks. Six months after the changes:
| Metric | Before | After |
|---|---|---|
| Deadlocks per month | 0 (pre-LOCKSIZE change), 12+ (post-change) | 0 |
| Lock escalations per month | ~840 (two weeks of daily batch) | 0 |
| Batch elapsed time | 2.5 hours | 2.6 hours (+4%) |
| Online response time (during batch) | 850ms avg (with escalation spikes to 5s+) | 585ms avg |
| 2am pages | Yes | No |
Discussion Questions

1. The LOCKSIZE change from PAGE to ROW was intended to improve concurrency. In what sense did it succeed? In what sense did it fail? What analysis should have been done before the change?
2. Rob's immediate fix was to increase LOCKMAX to 60,000. What are the risks of this approach? Why is it a stopgap, not a solution?
3. The wire transfer access-order fix requires the program to lock and debit/credit accounts in a potentially unnatural order (credit before debit if the credit account has a lower number). Could this cause any business logic issues? How would you verify?
4. Kwame mandated that application programmers sign off on LOCKSIZE/LOCKMAX changes. Is this sufficient? What other DDL changes could affect application locking behavior?
5. The team chose a commit frequency of 500 rows. How would you determine the optimal commit frequency? What factors would you consider?
6. Could this entire incident have been prevented by keeping LOCKSIZE PAGE? What would the trade-offs be?
7. In a Parallel Sysplex with data sharing, how would this incident's symptoms and diagnosis differ? What additional considerations apply?