Case Study 1: CNB's CICS Region Failure and Recovery
47 Seconds from Failure to Full Service
Background
Continental National Bank's CICS environment spans 16 regions across 4 LPARs in a Parallel Sysplex. The core banking workload — account inquiries, funds transfers, ATM authorizations, bill payments — processes through AOR pairs on SYSA and SYSB, with CICSPlex SM managing dynamic workload routing.
On October 17, 2023, at 14:17:03 EST, CNBAORA1 on SYSA suffered an abrupt failure during peak afternoon transaction volume. This case study examines the complete failure-to-recovery sequence, the architectural decisions that made 47-second recovery possible, and the post-incident improvements.
The Failure
Timeline: 14:17:03.000 — The Crash
A foreign exchange rate calculation program (CNBFXRT) had been deployed to production 72 hours earlier. The program contained a subscript error in its rate lookup table that was triggered only when the USD/JPY rate crossed a specific threshold — a threshold that hadn't occurred during testing.
At 14:17:03, a currency conversion transaction invoked CNBFXRT. The subscript overflow wrote 240 bytes past the end of the program's working storage. Those 240 bytes landed in CICS's Dynamic Storage Area (DSA), corrupting the dispatch control area for the Task Control Block (TCB) chain.
The corruption was catastrophic. CICS's dispatcher attempted to dispatch the next task, read the corrupted TCB chain, and generated an S0C4 abend in the CICS kernel. z/OS terminated the CICS address space.
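The overlay mechanics can be sketched in miniature. This is an illustrative model only: CNBFXRT's actual table layout is not shown in the case study, so the table size, slot size, and helper name below are all assumptions. It contrasts an unchecked subscript store (which silently writes into adjacent storage, as the bug did) with the bounds-checked version the post-incident fix added.

```python
# Illustrative sketch only; CNBFXRT's real layout is not public.
# A fixed-size rate table sits in front of "adjacent storage" that
# stands in for the CICS DSA the real overlay corrupted.

TABLE_SLOTS = 10          # hypothetical rate-table capacity
SLOT_BYTES = 8            # hypothetical bytes per rate entry

def store_rate(storage: bytearray, index: int, entry: bytes,
               checked: bool = True) -> None:
    """Write one rate entry at slot `index`.

    With checked=True (the post-incident fix), an out-of-range
    subscript raises instead of overlaying adjacent storage.
    """
    if checked and not 0 <= index < TABLE_SLOTS:
        raise IndexError(f"subscript {index} outside rate table")
    off = index * SLOT_BYTES
    storage[off:off + SLOT_BYTES] = entry

# Working storage: the table plus 240 bytes of "adjacent" storage.
storage = bytearray(TABLE_SLOTS * SLOT_BYTES + 240)

store_rate(storage, 3, b"\x01" * 8)                    # in bounds: fine
overlay_trapped = False
try:
    store_rate(storage, TABLE_SLOTS + 2, b"\xff" * 8)  # out of bounds
except IndexError:
    overlay_trapped = True                             # fix traps it
```

With `checked=False` the second call would have written 8 bytes into the "adjacent" region, which is the in-miniature analogue of the 240-byte DSA overlay.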
At the moment of failure:
| Metric | Value |
|---|---|
| Active tasks | 847 |
| Tasks holding DB2 locks | 312 |
| Tasks with MRO sessions to CNBFORA1 | 45 |
| Tasks with pending MQ puts | 18 |
| Transactions per second (TPS) | 3,214 |
| In-flight funds transfers | 23 |
| In-flight ATM authorizations | 147 |
What Happened to the 847 Active Tasks
Every task in CNBAORA1 ceased execution instantaneously. No graceful shutdown, no task completion, no syncpoint. The tasks were simply gone. But their effects persisted:
- 312 DB2 transactions remained active from DB2's perspective. DB2 still held their locks (row locks on the ACCOUNTS table, page locks on the TRANSACTION_LOG table) and would not release them until CICS either committed or backed out each transaction.
- 45 MRO sessions to CNBFORA1 were broken. The FOR detected the session failures within 2 seconds (the MRO heartbeat interval). The FOR's recovery logic marked the sessions as "connection failure" and began holding any function-shipped VSAM operations from those sessions in a pending state.
- 18 MQ puts that were within CICS's unit of work but not yet committed remained as uncommitted messages on MQ's internal queues. MQ would hold these messages until CICS resolved the UOWs.
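The pattern above, where effects outlive the tasks that created them until a coordinator decides their fate, can be sketched as a toy model. This is not IBM code; the class and method names are invented for illustration. Each resource manager parks its piece of an in-flight unit of work and releases it only on an explicit commit or backout.

```python
# Toy model (not IBM code) of "effects persist until the UOW is
# resolved": each resource manager holds its piece of an in-flight
# unit of work until the coordinator says commit or back out.

class ResourceManager:
    def __init__(self, name: str):
        self.name = name
        self.pending = {}              # uow_id -> held effect

    def register(self, uow_id: str, effect: str) -> None:
        """Record an effect (lock, uncommitted message) for a UOW."""
        self.pending[uow_id] = effect

    def resolve(self, uow_id: str, decision: str):
        """Release the held effect; keep it only on 'commit'."""
        effect = self.pending.pop(uow_id)
        return effect if decision == "commit" else None

db2 = ResourceManager("DB2")           # holds row/page locks
mq = ResourceManager("MQ")             # holds uncommitted puts

db2.register("uow-1", "row lock on ACCOUNTS")
mq.register("uow-1", "uncommitted message")

# Immediately after the crash, nothing has been resolved yet:
held_before = len(db2.pending) + len(mq.pending)      # both still held

# Emergency restart backs the in-flight UOW out; both effects vanish.
db2.resolve("uow-1", "backout")
mq.resolve("uow-1", "backout")
held_after = len(db2.pending) + len(mq.pending)       # nothing held
```

The key point the model captures: the crashed region's tasks are gone, but DB2 and MQ each keep state keyed by UOW until the restarted CICS tells them the outcome.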
The Recovery Sequence
T+0.5s — ARM Detection (14:17:03.500)
z/OS Automatic Restart Management detected the CICS address space termination. ARM's monitoring interval was configured at 500ms for CICS regions.
ARM evaluated the restart policy:
ELEMENT(CNBAORA1)
TYPE(CICS)
RESTART_GROUP(CNBCICS)
RESTART_ATTEMPTS(3)
RESTART_INTERVAL(600)
RESTART_TIMEOUT(120)
RESTART_METHOD(STC)
This was the first failure within the 600-second interval. ARM initiated the restart by submitting the CNBAORA1 started task.
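The decision ARM made here can be sketched as a small policy function. This is a simplification of automatic restart management, not the z/OS implementation; the function name and failure-history representation are assumptions. It encodes the two policy values that mattered: RESTART_ATTEMPTS(3) within RESTART_INTERVAL(600).

```python
# Sketch (assumed logic, not z/OS internals) of the ARM restart
# decision defined by the policy above.

RESTART_ATTEMPTS = 3
RESTART_INTERVAL = 600   # seconds

def should_restart(failure_times: list[float], now: float) -> bool:
    """Restart only if fewer than RESTART_ATTEMPTS prior failures
    fall within the last RESTART_INTERVAL seconds."""
    recent = [t for t in failure_times if now - t < RESTART_INTERVAL]
    return len(recent) < RESTART_ATTEMPTS

# 14:17:03 was the first failure in the interval: restart proceeds.
# A fourth failure inside 600 seconds would exhaust the policy and
# leave the region down for operator intervention.
```

Because this was the first failure in the 600-second window, the check passed and ARM submitted the CNBAORA1 started task.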
T+1.2s — CICSPlex SM Routing Update (14:17:04.200)
CICSPlex SM's CMAS on SYSB detected the failure of CNBAORA1 through the heartbeat mechanism. CPSM immediately:
- Removed CNBAORA1 from the active routing table. No new transactions would be routed to CNBAORA1.
- Adjusted routing weights for the remaining AORs:
  - CNBAORA2 (SYSA): 40% → 40% (unchanged — same LPAR, already carrying its share)
  - CNBAORB1 (SYSB): 30% → 45%
  - CNBAORB2 (SYSB): 30% → 45%
The SYSA AOR (CNBAORA2) was not given additional load because the SYSA LPAR was already absorbing the overhead of the CNBAORA1 failure (ARM restart processing, DB2 lock resolution).
- Sent WTO message to the z/OS console:
EYUSC0201I CNBAORA1 CICS REGION NOT AVAILABLE - REMOVED FROM WORKLOAD CNBCORE
- Triggered SNMP trap to the enterprise monitoring system (IBM OMEGAMON).
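The reweighting rule can be sketched as follows. The policy encoded here, splitting the failed AOR's share only among AORs on other LPARs, is inferred from the case study's explanation; the function and the pre-failure weight of 30% for CNBAORA1 are assumptions chosen to reproduce the post-failure targets listed above (A2 40%, B1 45%, B2 45%).

```python
# Sketch (assumed policy, not CPSM internals): a failed AOR's share
# goes only to AORs on *other* LPARs, because the failing LPAR is
# already absorbing restart and lock-resolution overhead.

def reroute(weights: dict, lpar_of: dict, failed: str) -> dict:
    """Return new routing weights after `failed` is removed."""
    new = {a: w for a, w in weights.items() if a != failed}
    peers = [a for a in new if lpar_of[a] != lpar_of[failed]]
    share = weights[failed] / len(peers)   # split across other-LPAR AORs
    for a in peers:
        new[a] += share
    return new

lpar_of = {"CNBAORA1": "SYSA", "CNBAORA2": "SYSA",
           "CNBAORB1": "SYSB", "CNBAORB2": "SYSB"}

# Illustrative pre-failure weights (assumed); the case study gives
# only the post-failure targets.
weights = {"CNBAORA1": 30, "CNBAORA2": 40,
           "CNBAORB1": 30, "CNBAORB2": 30}

after = reroute(weights, lpar_of, "CNBAORA1")
# → CNBAORA2 stays at 40; CNBAORB1 and CNBAORB2 each rise to 45.0
```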
T+2.0s — CICS Initialization Begins (14:17:05.000)
The CNBAORA1 started task began execution. CICS initialization:
- Read the SIT parameters. START=AUTO triggered the startup decision logic. CICS detected that the previous execution ended abnormally (no shutdown keypoint on the log). Result: EMERGENCY restart.
- Connected to the system log. The log stream CNB.CICS.CNBAORA1.DFHLOG resided on the coupling facility. Despite the disruption on SYSA from the address space termination, the coupling facility was unaffected — the log was intact and accessible.
- Loaded resource definitions from the CSD and CICSPlex SM BAS.
T+8.0s — Log Scan Begins (14:17:11.000)
The recovery manager began scanning the system log backward from the end.
Log scan parameters:
- KEYINTV=60 → maximum 60 seconds of log records to scan
- Transaction volume: ~3,200 TPS with 3 records per transaction → ~576,000 records in 60 seconds
- Actual records between last keypoint and failure: 487,319 (52.3 seconds of activity)
- Log scan rate: ~65,000 records/second
- Log scan duration: 7.5 seconds
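The sizing arithmetic above can be reproduced directly. This is a sketch of the math, not of the CICS recovery manager; the constants come from the figures in the case study.

```python
# Reproduces the log-scan sizing arithmetic from the case study.

TPS = 3_200                  # transactions per second at failure
RECORDS_PER_TXN = 3          # log records written per transaction
KEYINTV = 60                 # seconds between activity keypoints
SCAN_RATE = 65_000           # records the scan processes per second

# Worst case: a full keypoint interval of log records to scan.
worst_case_records = TPS * RECORDS_PER_TXN * KEYINTV   # 576,000

# Actual: 52.3 seconds of activity since the last keypoint.
actual_records = 487_319
scan_seconds = actual_records / SCAN_RATE              # ~7.5 seconds
```

The same formula explains the later corrective action: cutting KEYINTV from 60 to 45 seconds cuts the worst case from 576,000 to 432,000 records.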
The recovery manager classified every active UOW:
| Classification | Count | Action |
|---|---|---|
| Committed (commit record on log) | 0 | None needed — changes already applied |
| In-flight (no commit, no prepare) | 844 | Backout |
| Indoubt (prepare complete, no commit) | 3 | Resolve with resource managers |
Three transactions were in the indoubt state — they had completed phase 1 PREPARE with DB2 and MQ but the coordinator (CICS) had not yet written the commit record.
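The classification rule is mechanical: the UOW's fate is derived purely from which log records exist for it. A minimal sketch of that rule (the function is illustrative, not the recovery manager's code):

```python
# Sketch of the UOW classification rule applied during log scan:
# the action follows from which records are on the log.

def classify(has_prepare: bool, has_commit: bool) -> tuple[str, str]:
    """Map a UOW's log records to (classification, recovery action)."""
    if has_commit:
        return "committed", "none"      # changes already applied
    if has_prepare:
        return "indoubt", "resolve"     # settle with resource managers
    return "in-flight", "backout"       # no prepare: undo everything

# At the failure: 844 UOWs had neither record (in-flight, backout),
# 3 had prepare but no commit (indoubt), 0 had commit records.
```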
T+15.5s — Backout Processing (14:17:18.500)
The recovery manager coordinated backout of the 844 in-flight UOWs:
DB2 backout: CICS reconnected to DB2 Member A on SYSA via the CICS-DB2 attachment. DB2 received backout requests for 309 active threads (3 of the original 312 DB2-holding transactions were in the indoubt category, not the in-flight category). DB2 processed the backouts, releasing row locks and page locks. Duration: 12 seconds.
VSAM backout: CICS sent MRO backout requests to CNBFORA1 for the 45 sessions with pending VSAM operations. The FOR received the backout requests, restored before-images for uncommitted VSAM records, and released record locks. Duration: 3 seconds (overlapped with DB2 backout).
MQ backout: CICS sent backout to the MQ queue manager for 15 uncommitted message puts (3 of the original 18 were in the indoubt category). MQ removed the uncommitted messages from the queues. Duration: 1 second (overlapped).
T+27.5s — Indoubt Resolution (14:17:30.500)
The three indoubt transactions required resolution. The recovery manager examined the system log for each:
- Indoubt UOW 1 (funds transfer, XFER): PREPARE to DB2 and MQ completed, but no commit record was written, so the recovery manager decided BACKOUT. DB2 and MQ received backout signals and rolled back their prepared changes.
- Indoubt UOW 2 (bill payment, BPAY): PREPARE to DB2 completed. No commit record. Single-resource-manager indoubt (only DB2). DB2 received a backout signal.
- Indoubt UOW 3 (ATM authorization, AUTH): PREPARE to DB2 and VSAM (via the FOR) completed. No commit record. DB2 and the FOR received backout signals.
All three indoubt transactions were resolved as backout. No shunted UOWs.
Duration: 4 seconds.
T+31.5s — Resource Reopen (14:17:34.500)
CICS reopened resources:
- VSAM files via CNBFORA1 (MRO reconnection with AUTOCONNECT(YES))
- DB2 attachment fully operational (150 threads available)
- MQ adapter reconnected (CKQQ initialization)
- Transient data queues reopened
- Temporary storage queues restored
Duration: 8 seconds.
T+39.5s — Health Check (14:17:42.500)
CICSPlex SM ran the health check transaction (HCHK) against the restarted CNBAORA1:
* HCHK — CICSPlex SM Health Check Transaction
* Verifies DB2, MQ, and VSAM connectivity
PROCEDURE DIVISION.
EXEC SQL
SELECT CURRENT TIMESTAMP
INTO :WS-DB2-TIMESTAMP
FROM SYSIBM.SYSDUMMY1
END-EXEC
IF SQLCODE NOT = 0
MOVE 'DB2-FAIL' TO WS-HEALTH-STATUS
EXEC CICS RETURN END-EXEC
END-IF
CALL 'CSQCCONN' USING ...
IF WS-MQ-CC NOT = MQCC-OK
MOVE 'MQ-FAIL' TO WS-HEALTH-STATUS
EXEC CICS RETURN END-EXEC
END-IF
EXEC CICS READ FILE('CNBREF01')
INTO(WS-TEST-RECORD)
RIDFLD(WS-TEST-KEY)
RESP(WS-RESP)
END-EXEC
IF WS-RESP NOT = DFHRESP(NORMAL)
MOVE 'VSAM-FAIL' TO WS-HEALTH-STATUS
EXEC CICS RETURN END-EXEC
END-IF
MOVE 'HEALTHY' TO WS-HEALTH-STATUS
EXEC CICS RETURN END-EXEC
.
The health check passed on the first attempt.
T+42.0s — Routing Restored (14:17:45.000)
CICSPlex SM added CNBAORA1 back to the routing table with a gradual ramp-up:
- T+42s: 10% of workload routed to CNBAORA1
- T+47s: 25% of workload (after 5 seconds of successful processing)
- T+60s: Full 50% workload restored
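The ramp can be expressed as a step schedule. The step values come from the case study; the function shape (a simple step lookup, with nothing routed before the first step) is an assumption about how such a ramp would be evaluated.

```python
# Sketch of the gradual ramp-up: share of workload routed to the
# restarted region as a step function of time since failure.

RAMP = [(42.0, 0.10), (47.0, 0.25), (60.0, 0.50)]  # (T+seconds, share)

def routed_share(t_since_failure: float) -> float:
    """Fraction of the workload routed to CNBAORA1 at time T+t."""
    share = 0.0                       # nothing routed before T+42s
    for t_step, s in RAMP:
        if t_since_failure >= t_step:
            share = s
    return share
```

A usage check: at T+41s nothing is routed, at T+47s the region carries 25%, and from T+60s onward it is back to its full 50% share.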
T+47.0s — Full Service (14:17:50.000)
47 seconds after the failure, CNBAORA1 was processing transactions at 25% capacity and ramping up. From the customer perspective, the outage was invisible — other AORs absorbed the workload during the 42-second routing gap.
Impact Assessment
Customer Impact
| Channel | Impact | Duration |
|---|---|---|
| ATM | Zero — transactions rerouted to SYSB AORs | N/A |
| 3270 branch | 23 in-flight transactions received "retry" message | <3 seconds |
| Mobile API | Zero — API retried automatically with idempotency | N/A |
| Web portal | Zero — transactions rerouted to SYSB AORs | N/A |
Data Impact
| Category | Count | Resolution |
|---|---|---|
| Transactions backed out | 844 | Automatically retried by users/systems |
| Indoubt transactions | 3 | Backed out during recovery — no data loss |
| Shunted UOWs | 0 | N/A |
| DB2 lock contention window | 27.5 seconds | 312 row-level locks held during log scan and backout |
Financial Impact
The 27.5-second DB2 lock contention window caused 14 transactions on other AORs to time out waiting for locked rows. All 14 were automatically retried and succeeded. Estimated direct business cost: $63 in staff time. Zero customer-facing financial impact.
Post-Incident Review
Root Cause
The CNBFXRT program's subscript error passed through unit testing, integration testing, and the QA environment because the USD/JPY threshold that triggered it hadn't occurred during the test period. The test data used historical rates from the previous 90 days; the triggering rate pattern had not occurred in that window.
Corrective Actions
- Code fix: The subscript error in CNBFXRT was corrected. Array bounds checking was added to the rate lookup logic.
- Test data enhancement: Lisa Tran's team expanded the test data to include extreme and boundary-condition exchange rates, not just historical ranges.
- CICS storage protection: Kwame proposed enabling CICS storage protection (STGPROT=YES in the SIT), which would have trapped the storage overlay before it corrupted the kernel. The team evaluated the 3–5% CPU overhead and approved it for all AORs.
SIT Override:
STGPROT=YES      Enable storage protection
RENTPGM=PROTECT  Protect reentrant programs from overlays
- Faster keypoint interval: KEYINTV was reduced from 60 to 45 seconds on high-volume AORs, reducing the worst-case log scan from 576,000 records to 432,000 records.
- Recovery test validation: The incident was added to the quarterly recovery test suite as a scripted scenario.
What Went Right
Kwame's post-incident summary to the architecture review board:
- ARM worked exactly as designed: detection in 0.5 seconds, restart initiated within 1 second.
- CICSPlex SM's routing update was immediate. New transactions were never sent to the dead region; the routing table update at T+1.2 seconds prevented cascading failures.
- The coupling facility log stream was the hero. If the system log had been on DASD, the emergency restart would have required the same LPAR (SYSA) to be healthy enough to access the log. The coupling facility log stream was accessible from any LPAR, ensuring reliable recovery.
- RESYNCMEMBER(GROUPRESYNC) was configured correctly. Although DB2 Member A was on the same LPAR and remained available (the LPAR itself didn't fail — only the CICS address space), the configuration would have handled a more severe failure where the LPAR was also affected.
- Row-level locking limited the lock contention blast radius. With page-level locking, the 312 in-flight transactions would have locked approximately 20,000 rows; with row-level locking, only the specific modified rows (approximately 850 across 312 transactions) were locked.
- The 47-second recovery time was well within the 99.99% availability budget. CNB's SLA allows 52 minutes of unplanned downtime per year; this incident consumed 47 seconds — less than 2% of the annual budget.
Lessons for Architects
Lesson 1: The Recovery Architecture Was Designed, Not Accidental
Every component of the 47-second recovery was the result of a deliberate architectural decision made months or years before the incident:
- ARM policy defined and tested
- System log on the coupling facility (not DASD)
- KEYINTV=60 (not the default 300)
- RESYNCMEMBER(GROUPRESYNC)
- RMRETRY=30 (not the default 300)
- CICSPlex SM health check configured
- Active-active AOR pairs with failover capacity
None of these configurations are the CICS defaults. Every one was a conscious choice.
Lesson 2: The Storage Overlay Was Preventable
STGPROT=YES would have trapped the overlay before it reached the kernel. The 3–5% CPU overhead was considered "too expensive" before the incident; after it, the cost was accepted without debate. Kwame's observation: "We now spend $2,000/month in MSU cost for STGPROT. The post-incident review alone cost $6,000 in staff time. Insurance against another 47-second outage is worth far more than $2,000/month."
Lesson 3: Row-Level Locking Pays for Itself During Recovery
The Chapter 8 decision to use row-level locking for the ACCOUNTS table — originally motivated by concurrency (avoiding false contention between unrelated transactions) — had a direct recovery benefit. During the 27.5-second lock contention window, only about 850 rows were locked instead of an estimated 20,000 with page-level locking, reducing the blast radius of the failure by a factor of roughly 23.
Lesson 4: Test Data Must Include Boundary Conditions
The CNBFXRT bug was a reminder that production failures are often caused by conditions that testing didn't anticipate. Expanding test data to include extreme values, boundary conditions, and historically rare scenarios is not thoroughness for its own sake — it's production stability insurance.
Discussion Questions
1. If CNBAORA1's system log had been on DASD instead of the coupling facility, how would the recovery sequence have changed? What would the recovery time have been?
2. If the three indoubt transactions had been resolved as COMMIT instead of BACKOUT, what data integrity issues could have resulted? (Hint: the commit record was never written, so the application never received confirmation.)
3. CNB's recovery took 47 seconds. What changes would be needed to achieve 15-second recovery? Is 15-second recovery architecturally feasible with CICS?
4. The CNBFXRT storage overlay corrupted the CICS kernel. What other CICS mechanisms (besides STGPROT) could have contained the damage to the failing task without bringing down the entire region?
5. Kwame's team configured the CICSPlex SM routing update to NOT increase load on CNBAORA2 (the same-LPAR AOR) during the recovery. Why? What would have happened if CNBAORA2 had absorbed the full 100% of SYSA's workload during recovery?