Case Study 1: CNB's MAXT Crisis

A Tuesday Afternoon That Changed Operations Forever


Background

Continental National Bank processes 500 million online transactions per day across a 16-region CICS topology spanning 4 LPARs. The core banking AOR pool — CNBAORB1 and CNBAORB2 on LPAR SYSB — handles the highest-value workload: ATM authorizations, branch teller transactions, online fund transfers, and balance inquiries. Together, these two AORs process approximately 4,200 TPS at peak.

The MXT on each AOR was 250 — a value calculated by the previous systems programmer in 2019 and never revisited. The calculation applied a 2.0x safety factor to the 2019 peak concurrency, which was well below current volumes.

On the Tuesday of the incident, three circumstances converged:

  1. Lisa Tran scheduled a DB2 REORG on the general ledger (GL) tablespace. The REORG was approved in the change management process. Lisa scheduled it for 2:30 PM — a period she classified as "off-peak" based on DB2 metrics. She was correct for DB2: DB2 batch activity was minimal at 2:30 PM. She was incorrect for CICS: 2:30 PM is within the afternoon peak window for online transactions.

  2. A new regulatory audit transaction (RGAU) had been deployed two weeks earlier. RGAU performed a complex join across the GL tablespace and three subsidiary tables. Under normal conditions, RGAU completed in 300ms. Under a REORG drain lock, it blocked indefinitely until the drain window released.

  3. The TRANCLASS configuration did not include RGAU. The transaction was unclassed, meaning it competed for tasks from the general MXT pool with no throttling.


The Incident Timeline

14:27 — Lisa initiates the DB2 REORG on tablespace DSNDB04.GENLGR01. The REORG begins with an initial drain of the tablespace.

14:28 — The drain lock is acquired. All SQL access to GENLGR01 is suspended during the drain phase. The drain is expected to last 10–15 seconds before the REORG enters its sort-and-rebuild phase (which allows read access).

14:29 — The drain lasts longer than expected because of the tablespace's size (45 GB after recent growth from the regulatory audit workload). During the drain, every CICS transaction that touches the GL — approximately 35% of the core banking workload — stalls.

14:30 — CNBAORB1's active task count rises from its normal 160 to 220. Tasks are not completing because they are waiting for the DB2 drain lock. New transactions continue to arrive at normal rates (2,100 TPS per AOR), and each arriving transaction receives a TCA and joins the wait.

14:31 — Active task count on CNBAORB1 reaches 248. Rob Calloway's monitoring console shows an amber alert.

14:32 — CNBAORB1 hits MXT (250). DFHZC0101 message appears on the system log. New transactions begin queuing in the TOR. The TOR's dynamic routing algorithm detects the MAXT on AORB1 and begins shifting traffic to AORB2.

14:33 — CNBAORB2, now receiving the combined load of both AORs, hits MXT (250). Both AORs are in MAXT. The TOR queues all new transactions. Response times visible to customers: 30+ seconds. ATM terminals begin timing out.

14:34 — Rob pages Kwame, who is in the capacity planning meeting. "Both B-region AORs are in MAXT. No volume spike. Something changed."

14:35 — Kwame pulls up CICS statistics remotely. He sees the task wait breakdown: 89% of active tasks are in DB2 wait state. He asks Lisa: "Are you running anything on the GL tablespace?"

14:36 — Lisa confirms the REORG. She terminates it immediately: -TERM UTILITY(DSNUTIL.GENLGR01). The drain lock releases.

14:37 — The backlog of 500+ waiting transactions (across both AORs) begins draining. Tasks that were blocked on the GL tablespace complete within milliseconds. Active task counts drop from 250 to 180 within 15 seconds.

14:38 — Active task counts return to normal (160 per AOR). Response times normalize. ATM terminals recover.

Total impact: 6 minutes of degradation. No data loss. No transaction corruption. Estimated 12,000 transactions delayed beyond SLA. Three ATM networks reported timeout alerts.
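The saturation mechanics in the timeline can be approximated with a simple fill-rate model: once a fraction of arriving work blocks indefinitely behind the drain lock, the region's free task slots fill at roughly the arrival rate times the stalled fraction. The sketch below uses the incident's figures; it deliberately ignores routing shifts, retries, and timeouts, so it gives a worst-case lower bound on time-to-MAXT rather than reproducing the observed multi-minute ramp.

```python
def seconds_until_mxt(mxt, normal_active, arrival_tps, stalled_fraction):
    """Worst-case estimate of how long a CICS region can absorb a stall.

    mxt              -- the region's MXT limit
    normal_active    -- steady-state active task count
    arrival_tps      -- transaction arrival rate (per second)
    stalled_fraction -- fraction of arrivals that block and never complete
                        (e.g. work waiting on a DB2 drain lock)
    """
    headroom = mxt - normal_active          # free task slots
    fill_rate = arrival_tps * stalled_fraction  # blocked tasks added/sec
    return headroom / fill_rate

# CNBAORB1 at the incident: 90 free slots, 2,100 TPS, 35% touching the GL.
# The 90-slot headroom is gone in a fraction of a second in the worst case,
# which is why even a "short" drain during online peak is dangerous.
t = seconds_until_mxt(mxt=250, normal_active=160,
                      arrival_tps=2100, stalled_fraction=0.35)
```

The real ramp was slower (about three minutes) because not every GL task blocked for the full drain and routing adapted, but the model makes the key point: headroom is consumed at hundreds of tasks per second, not tens.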


Root Cause Analysis

Kwame convened a post-incident review the following day. The root cause was multi-factorial:

Factor 1: REORG Scheduling Conflict

Lisa scheduled the REORG based on DB2 workload patterns. The change management process did not include a cross-reference to CICS workload patterns. The GL tablespace is a shared resource between batch (DB2-centric) and online (CICS-centric) workloads. Off-peak for DB2 is not off-peak for CICS.

Fix: All REORG scheduling now requires cross-referencing both DB2 and CICS peak windows. REORGs on tablespaces accessed by CICS transactions are restricted to the batch window (22:00–06:00) unless explicitly approved by the CICS operations lead.

Factor 2: No TRANCLASS for RGAU

The regulatory audit transaction was deployed without a TRANCLASS assignment. During normal operation, this was harmless — RGAU ran at low volume (50 TPS). But during the drain, RGAU transactions accumulated without limit, consuming TCA resources that could have been reserved for critical ATM and transfer transactions.

Fix: Every new transaction deployed to a production CICS region must have a TRANCLASS assignment. Unclassed transactions are flagged by an automated CSD audit that runs nightly.
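The nightly audit reduces to a set-difference over the CSD definitions. A minimal sketch, assuming the transaction-to-TRANCLASS mapping has already been parsed out of DFHCSDUP LIST output into a plain dictionary (the parsing itself, and the exact report format, are not shown here):

```python
def audit_unclassed(transactions):
    """Return transaction IDs with no effective TRANCLASS assignment.

    transactions -- dict mapping TRANSACTION name -> TRANCLASS name,
                    with None/"" for unassigned. TRANCLASS(DFHTCL00)
                    is the CICS default meaning "no class", so it is
                    treated as unclassed too.
    """
    return sorted(tran for tran, cls in transactions.items()
                  if cls in (None, "", "DFHTCL00"))

# A transaction like RGAU, deployed without a class, is flagged:
flagged = audit_unclassed({"RGAU": None,
                           "ATMW": "CLSCRIT",
                           "BALQ": "CLSONLN"})
```

In CNB's version the flagged list feeds the change-management system, so an unclassed transaction blocks the next deployment rather than waiting for an incident to expose it.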

Factor 3: MXT Never Revisited

The MXT of 250 was set in 2019, when peak TPS per AOR was 1,600. By the time of the incident, peak TPS had grown to 2,100. By Little's Law, the normal 160 active tasks at 2,100 TPS imply an average task residency of roughly 76 ms; applying the original 2.0x safety factor, a properly sized MXT would have been about 320 (2,100 x 0.076 x 2.0, rounded up). The configured 250 still covered steady state, but the safety margin had eroded from 2.0x to approximately 1.6x (250 / 160).

More importantly, the storage-bounded MXT was never calculated. Kwame's post-incident analysis found it was 450 — well above 250, meaning MXT could have been raised without SOS risk. But raising MXT would not have prevented the incident — it would only have delayed the MAXT by perhaps 30 seconds.

Fix: MXT is recalculated quarterly as part of the capacity planning cycle. Both the task-based and storage-bounded calculations are performed and documented.
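The two quarterly calculations can be sketched as follows. The 76 ms residency and 2.0x safety factor come from the Factor 3 numbers above; the per-task storage figure and the 20% SOS reserve in the storage-bounded calculation are illustrative assumptions, since the inputs behind Kwame's 450 figure are not shown.

```python
import math

def task_based_mxt(peak_tps, avg_residency_s, safety=2.0, round_to=10):
    """Task-based MXT: peak concurrency (Little's Law) times a safety
    factor, rounded up to a convenient multiple."""
    concurrent = peak_tps * avg_residency_s      # L = lambda * W
    return math.ceil(concurrent * safety / round_to) * round_to

def storage_bounded_mxt(edsalim_bytes, per_task_bytes, reserve_fraction=0.2):
    """Storage-bounded MXT: how many concurrent tasks EDSA can hold while
    keeping a reserve against short-on-storage (SOS)."""
    usable = edsalim_bytes * (1 - reserve_fraction)
    return int(usable // per_task_bytes)

# Task-based: 2,100 TPS x 76 ms residency x 2.0 safety -> 320
mxt_task = task_based_mxt(peak_tps=2100, avg_residency_s=0.076)

# Storage-bounded (illustrative): 800M EDSALIM, ~1 MiB peak EDSA per task
mxt_storage = storage_bounded_mxt(800 * 1024 * 1024, 1024 * 1024)

# The effective MXT is the smaller of the two.
mxt = min(mxt_task, mxt_storage)
```

Documenting both numbers each quarter is the point of the fix: when the task-based figure creeps toward the storage-bounded one, that is the early warning that EDSALIM, not MXT, is the next constraint.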

Factor 4: No CICS Health Check for DB2 Utilities

There was no automated mechanism to detect that a DB2 utility was running on a tablespace with active CICS affinity. The DB2 utility scheduler and the CICS monitoring system operated independently.

Fix: Kwame's team implemented a cross-system health check. When a DB2 utility starts on a tablespace listed in the CICS-affinity inventory (a mapping of tablespaces to CICS transaction types), an alert is generated. If the utility involves a drain lock during CICS peak hours, the alert is Severity 1.


The New Configuration

After the incident, Kwame's team implemented the following changes to the core banking AORs:

MXT and TRANCLASS

SIT Parameter: MXT=300

TRANCLASS CLSCRIT  MAXACTIVE(100)   *> ATM auth, wire transfers
TRANCLASS CLSONLN  MAXACTIVE(120)   *> Balance inquiry, fund transfer
TRANCLASS CLSRGAU  MAXACTIVE(30)    *> Regulatory audit transactions
TRANCLASS CLSSYS   MAXACTIVE(20)    *> System monitoring

Total MAXACTIVE: 270 out of MXT 300. The 30-task buffer accommodates any unclassed transaction that slips through the CSD audit.
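The budget arithmetic above is worth automating alongside the CSD audit, so that a future TRANCLASS change cannot silently oversubscribe MXT. A minimal check:

```python
def tranclass_budget(maxactives, mxt):
    """Return (total MAXACTIVE, remaining buffer under MXT).

    maxactives -- dict mapping TRANCLASS name -> MAXACTIVE value
    A negative buffer means the classes oversubscribe MXT.
    """
    total = sum(maxactives.values())
    return total, mxt - total

# CNB's post-incident configuration: 270 classed slots, 30-slot buffer
total, buffer = tranclass_budget(
    {"CLSCRIT": 100, "CLSONLN": 120, "CLSRGAU": 30, "CLSSYS": 20},
    mxt=300)
```

Note that a positive buffer is deliberate, not slack: it is the only capacity available to any transaction that slips through the audit unclassed.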

Storage Monitoring

EDSALIM was increased from 800M to 900M based on measured peak usage plus 30% headroom. A storage monitoring transaction (STGM) runs every 5 minutes and records EUDSA, ECDSA, and ERDSA current and peak values to a CICS TS queue that is batch-extracted nightly for trending.

DB2 Integration Monitoring

A new CICS health check transaction (HLTH) runs every 60 seconds and queries DB2 catalog tables for active utilities on CICS-affinity tablespaces. If a drain-lock utility is detected during CICS peak hours (08:00–18:00), HLTH writes a Severity 1 alert to the operator console.
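The severity decision inside HLTH is a small piece of logic once the utility and tablespace names are in hand. A sketch of that classification, assuming the DB2 catalog query and affinity-inventory lookup have already produced their results (the utility list and peak window here mirror the text; treating LOAD as a drain-capable utility is an illustrative assumption):

```python
from datetime import time

CICS_PEAK_START = time(8, 0)     # CICS peak window per the runbook
CICS_PEAK_END = time(18, 0)
DRAIN_UTILITIES = {"REORG", "LOAD"}   # utilities that can take drain locks

def classify_alert(utility, tablespace, affinity_inventory, now):
    """Severity for an active DB2 utility, or None if no alert is needed.

    affinity_inventory -- set of tablespace names with CICS affinity
    now                -- current time of day (datetime.time)
    """
    if tablespace not in affinity_inventory:
        return None                       # no CICS exposure
    in_peak = CICS_PEAK_START <= now <= CICS_PEAK_END
    if utility in DRAIN_UTILITIES and in_peak:
        return 1                          # drain lock during CICS peak
    return 3                              # informational

# The incident scenario: a REORG on GENLGR01 at 14:30 is Severity 1.
sev = classify_alert("REORG", "GENLGR01", {"GENLGR01"}, time(14, 30))
```

The same check running at 22:00 downgrades to informational, which is exactly the scheduling policy from Factor 1 expressed as monitoring logic.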

Incident Response Runbook Update

The MAXT runbook was updated with a new first step: "Check for active DB2 utilities on CICS-affinity tablespaces." Previously, the first step was "Check transaction volume for spike." The new step catches the DB2-induced MAXT pattern that volume checks would miss.


Lessons for the Reader

Lesson 1: Cross-System Awareness

The most dangerous CICS performance problems originate outside CICS. DB2 utilities, DASD reconfigurations, WLM policy changes, coupling facility structure alterations — any of these can cascade into CICS degradation. Your monitoring must cross system boundaries.

Lesson 2: The MAXT Reflex

When operations teams see MAXT, they reach for MXT. This is conditioning — raising the number makes the immediate symptom disappear. But MAXT is a symptom. Raising MXT without diagnosis is like raising the temperature threshold on a fever alarm — the patient is still sick.

Lesson 3: TRANCLASS as Insurance

TRANCLASS costs nothing to implement and provides enormous value during incidents. A TRANCLASS-limited transaction that saturates during an incident degrades only itself — not the entire region. Without TRANCLASS, every transaction competes equally, and critical workloads suffer alongside bulk operations.

Lesson 4: Change Management Must Cross Silos

Lisa's REORG was properly approved in the change management system — for DB2. The change management system did not require CICS impact assessment. This organizational gap allowed a technically correct DB2 decision (schedule REORG during DB2 off-peak) to create a CICS production incident. After the incident, CNB's change management system requires cross-platform impact assessment for any change affecting shared resources.

Lesson 5: Revisit Your Baselines

The MXT of 250 was correct in 2019. By the time of the incident, it was inadequate — not because it caused the incident (raising it would not have helped), but because the eroded safety margin meant the MAXT hit sooner and harder than it would have with a properly sized MXT. Baselines that are set and forgotten become liabilities.


Discussion Questions

  1. If Kwame's team had not paused the REORG, and instead raised MXT to 500, what would have happened? Model the scenario using the storage-bounded MXT calculation.

  2. The TRANCLASS for RGAU is set to MAXACTIVE(30). With 50 TPS normal volume and 300ms normal response time, what is the expected steady-state concurrent task count for RGAU? Is MAXACTIVE(30) appropriate?

  3. The cross-system health check queries DB2 catalog tables every 60 seconds. Could this health check itself become a performance problem? Under what circumstances?

  4. CNB's post-incident runbook puts "check DB2 utilities" before "check transaction volume." Is this the right priority ordering for all CICS environments, or is it specific to CNB's incident pattern? How would you structure the runbook for a general-purpose CICS operations team?

  5. Lisa scheduled the REORG for 2:30 PM because DB2 batch was quiet. Propose a scheduling framework that accounts for both DB2 and CICS workload patterns. What data would you need, and who should own the scheduling decision?