Case Study 2: Pinnacle Health's Indoubt Transaction Resolution
When Seven Minutes Changed the Recovery Architecture
Background
Pinnacle Health Insurance processes 50 million claims per month through a CICS-based claims adjudication system. During open enrollment (November 1–December 15), transaction volumes triple as members verify eligibility, change plans, and submit pre-authorization requests.
The CICS topology consists of 8 regions across 2 LPARs:
LPAR PRDA                          LPAR PRDB
┌─────────────────────┐            ┌─────────────────────┐
│ PHTORA1 (TOR)       │            │ PHTORB1 (TOR)       │
│ PHAORA1 (AOR)       │◄───IPIC───►│ PHAORB1 (AOR)       │
│ PHAORA2 (AOR)       │◄───IPIC───►│ PHAORB2 (AOR)       │
│ PHFORA1 (FOR)       │            │ PHFORB1 (FOR)       │
│ DB2 Member A        │            │ DB2 Member B        │
│ MQ QMgr PHQA        │            │ MQ QMgr PHQB        │
└─────────────────────┘            └─────────────────────┘
Each AOR handles eligibility verification, claims status inquiry, and pre-authorization transactions. The AOR transactions span DB2 (member eligibility records), MQ (claims notification queue for downstream systems), and VSAM (audit journal in the FOR).
The Incident: November 14, 2022
09:47:12 — AOR Failure
PHAORA1 on LPAR PRDA terminated abnormally at 09:47:12 during peak open enrollment activity. The cause: a z/OS S878 abend — the CICS region exceeded its MEMLIMIT (maximum storage above the 2GB bar). A memory-intensive claims adjudication program had a storage leak that accumulated over 72 hours of continuous operation, gradually consuming 64-bit storage until the MEMLIMIT threshold was breached.
At the moment of failure:
| Metric | Value |
|---|---|
| Active tasks | 423 |
| Tasks holding DB2 locks | 189 |
| Tasks with MQ puts | 67 |
| Tasks with MRO to PHFORA1 | 34 |
| Transaction volume | 1,800 TPS |
| In-flight eligibility verifications | 156 |
| In-flight pre-authorizations | 42 |
09:47:13 — CICSPlex SM Response
CICSPlex SM detected the failure and removed PHAORA1 from the routing table within 1.1 seconds. Remaining AORs absorbed the workload:
- PHAORA2 (PRDA): 25% → 50%
- PHAORB1 (PRDB): 25% → 25%
- PHAORB2 (PRDB): 25% → 25%
With four AORs sharing the load evenly before the failure, the surviving PRDA region absorbed the failed region's entire 25% share, doubling PHAORA2's load on the LPAR that was also handling the ARM restart and DB2 lock resolution.
09:47:14 — ARM Restart Initiated
ARM detected the failure and initiated restart of PHAORA1. The SIT specified START=AUTO, triggering emergency restart.
09:48:22 — Emergency Restart: Log Scan Complete (T+70s)
The recovery manager scanned 412,847 log records (KEYINTV was 90 seconds at the time — a setting that would be reduced after this incident).
UOW Classification:
| Classification | Count | Action |
|---|---|---|
| In-flight | 400 | Backout |
| Indoubt (PREPARE complete, no commit) | 23 | Resolve with resource managers |
The 23 indoubt transactions were an unusually high number. The root cause: the memory leak had slowed CICS's commit processing in the seconds before the failure, widening the indoubt window. Normally, the window is microseconds. Under memory pressure, commit log writes were taking 2–5ms, and more transactions were caught in the prepare-to-commit gap.
09:48:42 — Indoubt Resolution Phase 1 (T+90s)
The recovery manager attempted to resolve all 23 indoubt transactions:
DB2 resolution: CICS reconnected to DB2 Member A on PRDA. DB2 was available and responsive. All 23 indoubt transactions had DB2 as a participant; because no commit records existed on the log, DB2 received backout signals and rolled back all 23 transactions' DB2 changes. Duration: 3 seconds.
MQ resolution — first attempt: 21 of the 23 indoubt transactions had MQ as a participant. For 19 of those 21, the MQ queue manager PHQA on PRDA was available and processed the backout.
MQ resolution — failure: the remaining 2 transactions could not resolve their MQ participant. PHQA had itself run into trouble at 09:47:15 — 3 seconds after the CICS failure, the sudden disconnect of 67 CICS-MQ sessions pushed it into a "channel reconnection" state. While in that state, PHQA was unavailable for indoubt resolution.
VSAM resolution: All 23 transactions had VSAM participants in PHFORA1. The FOR was available, and all VSAM backouts completed successfully, leaving 21 of the 23 UOWs fully resolved and 2 blocked on the unavailable MQ participant.
09:48:45 — Two UOWs Shunted (T+93s)
The recovery manager shunted the 2 unresolved UOWs. CICS became available for new transactions with 2 shunted UOWs.
DFHRM0501 UNIT OF WORK DISPLAY

URID: 00000000007C3A20
  TRANSID: ELIG
  STATUS: SHUNTED
  PARTICIPANTS:
    DB2(DB2P)     - RESOLVED (BACKOUT)
    MQ(PHQA)      - UNRESOLVED (UNAVAILABLE)
    VSAM(PHFORA1) - RESOLVED (BACKOUT)
  LOCKED RESOURCES:
    DB2 ROW: MEMBER_ELIGIBILITY WHERE MEMBER_ID = 'MBR-2847291'

URID: 00000000007C3B48
  TRANSID: ELIG
  STATUS: SHUNTED
  PARTICIPANTS:
    DB2(DB2P)     - RESOLVED (BACKOUT)
    MQ(PHQA)      - UNRESOLVED (UNAVAILABLE)
    VSAM(PHFORA1) - RESOLVED (BACKOUT)
  LOCKED RESOURCES:
    DB2 ROW: MEMBER_ELIGIBILITY WHERE MEMBER_ID = 'MBR-1093847'
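The shunting decision follows a simple rule: a UOW can leave the indoubt state only when every participant has acknowledged the outcome. A minimal Python sketch of that rule, with illustrative names (`Participant`, `resolve_indoubt`), not CICS internals:

```python
from dataclasses import dataclass

@dataclass
class Participant:
    name: str          # e.g. "DB2(DB2P)", "MQ(PHQA)", "VSAM(PHFORA1)"
    available: bool    # can this resource manager be contacted right now?
    resolved: bool = False

@dataclass
class UnitOfWork:
    urid: str
    participants: list

def resolve_indoubt(uow):
    """Attempt backout with every reachable participant.

    Any participant that cannot be contacted leaves the whole UOW
    shunted: its locks are retained until a later retry (or an operator
    FORCE) completes resolution.
    """
    for p in uow.participants:
        if p.available:
            p.resolved = True      # backout acknowledged by the RM
    if all(p.resolved for p in uow.participants):
        return "RESOLVED"
    return "SHUNTED"

# The second UOW from the incident: DB2 and VSAM reachable, PHQA not.
uow = UnitOfWork("00000000007C3B48", [
    Participant("DB2(DB2P)", available=True),
    Participant("MQ(PHQA)", available=False),
    Participant("VSAM(PHFORA1)", available=True),
])
print(resolve_indoubt(uow))   # SHUNTED
```

In the incident, DB2 and VSAM acknowledged the backout but PHQA could not be contacted, so both UOWs stayed shunted with their locks retained.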
09:48:45 to 09:55:32 — The Seven-Minute Lock Contention Window
For 6 minutes and 47 seconds, two DB2 rows were locked by the shunted UOWs. During this window:
Member MBR-2847291: Patricia Hawkins, age 67, enrolled in a Pinnacle Platinum plan. During the shunt window, 3 eligibility verification transactions were attempted for her record (one from a pharmacy, one from her primary care physician's office, one from an oncology department pre-authorization system).
All 3 transactions received DB2 SQLCODE -911 (lock timeout) after the 30-second IRLMRWT lock wait:
- 09:49:15 — Pharmacy verification timeout (patient waiting at counter)
- 09:50:02 — Primary care verification timeout (nurse on hold)
- 09:53:18 — Oncology pre-auth timeout (scheduler attempting to confirm treatment eligibility)
Member MBR-1093847: David Chen-Williams, age 34, enrolled in a Pinnacle Standard plan. During the shunt window, 1 eligibility verification was attempted (from a dental office). It timed out at 09:51:44.
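Callers that receive SQLCODE -911 must re-drive the whole transaction, because DB2 has already rolled the unit of work back. A hedged sketch of client-side retry handling (hypothetical function names, not Pinnacle's code; note that during the shunt window every re-drive against the locked row would also have timed out until the UOW was resolved):

```python
import time

LOCK_TIMEOUT = -911   # DB2 SQLCODE -911: deadlock or lock timeout; DB2 has rolled back

def verify_eligibility_with_retry(execute_txn, max_attempts=3, backoff_s=2.0):
    """Re-drive an eligibility transaction when DB2 signals -911.

    -911 means DB2 has already rolled the unit of work back, so the
    whole transaction must be re-executed, not resumed. A short backoff
    gives a retained lock (e.g. from a shunted UOW) a chance to clear.
    """
    for attempt in range(1, max_attempts + 1):
        sqlcode, result = execute_txn()
        if sqlcode == 0:
            return result
        if sqlcode != LOCK_TIMEOUT or attempt == max_attempts:
            raise RuntimeError(f"verification failed, SQLCODE {sqlcode}")
        time.sleep(backoff_s * attempt)   # linear backoff between re-drives

# Simulated transaction: two -911 timeouts, then success.
attempts = []
def fake_txn():
    attempts.append(1)
    return (0, "ELIGIBLE") if len(attempts) == 3 else (LOCK_TIMEOUT, None)

print(verify_eligibility_with_retry(fake_txn, backoff_s=0.0))  # ELIGIBLE
```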
09:52:15 — RMRETRY First Attempt (T+303s)
CICS's RMRETRY interval was 300 seconds (the default, which had never been changed). At T+303 seconds, the recovery manager retried resolution of the 2 shunted UOWs. MQ queue manager PHQA was still in its recovery sequence. Resolution failed. Next retry scheduled for T+603 seconds.
09:55:20 — MQ Queue Manager Recovery Complete
PHQA completed its recovery and was available for indoubt resolution. But CICS would not retry until the next RMRETRY interval.
09:55:32 — Manual Intervention
Ahmad Rashidi, monitoring the CICS console from the compliance operations center, noticed the shunted UOW alerts. He contacted the CICS system programmer on duty, who issued:
CEMT SET UOW(*) FORCE
This forced CICS to immediately retry resolution of all shunted UOWs. Since PHQA was now available, both UOWs resolved (backout) within 2 seconds. DB2 locks were released. The eligibility records for both members were accessible again.
09:55:34 — Full Service Restored
Total time from failure to full service: 8 minutes 22 seconds. Of that, 90 seconds was emergency restart (acceptable) and 6 minutes 47 seconds was shunted UOW lock contention (unacceptable).
Impact Assessment
Patient Impact
| Member | Transactions Affected | Clinical Impact |
|---|---|---|
| MBR-2847291 (Patricia Hawkins) | 3 timeouts | Pharmacy couldn't verify eligibility — patient waited 12 minutes; Oncology pre-auth delayed 7 minutes |
| MBR-1093847 (David Chen-Williams) | 1 timeout | Dental office couldn't verify — patient rescheduled |
Compliance Impact
Ahmad Rashidi logged the incident as a HIPAA-relevant service availability event. Under Pinnacle's interpretation of the HIPAA Security Rule (45 CFR 164.308(a)(7)(ii)(B)), any disruption to electronic eligibility verification that affects patient care must be documented and reviewed.
The incident did not rise to the level of a reportable breach (no PHI was exposed or compromised), but it was flagged in the quarterly compliance review as a "system availability gap."
Financial Impact
| Category | Cost |
|---|---|
| Staff time for incident response | $1,200 |
| Post-incident review and remediation | $8,500 |
| Compliance documentation and review | $3,200 |
| Patient goodwill (estimated) | $500 |
| Total | $13,400 |
Post-Incident Review
Root Cause Analysis
Diane Okoye led the post-incident review with a focus on two questions: why did 23 transactions end up indoubt (unusually high), and why did resolution take 7 minutes instead of seconds?
Why 23 indoubt transactions:
The memory leak in the claims adjudication program had been consuming 64-bit storage for 72 hours. By the time of the failure, CICS was experiencing memory pressure that slowed log I/O. The commit record write (normally <0.1ms) was taking 2–5ms under the storage pressure. This widened the indoubt window by a factor of 20–50x, catching 23 transactions instead of the expected 0–2.
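The widening can be approximated with Little's law: the expected number of transactions sitting between prepare and commit at any instant is roughly the commit rate times the window duration. A back-of-envelope sketch (the observed 23 exceeds even the 5 ms estimate, consistent with additional queueing delay under memory pressure):

```python
def expected_indoubt(commit_rate_tps, window_s):
    """Little's law: average number of UOWs inside the indoubt window
    (prepare logged, commit record not yet hardened) at any instant."""
    return commit_rate_tps * window_s

RATE = 1800  # TPS at the moment of failure

print(expected_indoubt(RATE, 0.0001))  # healthy ~0.1 ms window -> ~0.18, usually zero caught
print(expected_indoubt(RATE, 0.002))   # 2 ms commit writes -> ~3.6
print(expected_indoubt(RATE, 0.005))   # 5 ms commit writes -> ~9
```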
Why 7-minute resolution:
Three factors compounded:
1. MQ queue manager disruption. The sudden loss of 67 CICS-MQ sessions triggered a reconnection storm in PHQA, making it unavailable for indoubt resolution for approximately 8 minutes.
2. RMRETRY=300 (default). CICS retried shunted UOW resolution only every 5 minutes. Even after PHQA recovered at 09:55:20, CICS would not have retried until 09:57:15 — nearly 2 more minutes of unnecessary lock contention.
3. No RESYNCMEMBER equivalent for MQ. Unlike DB2 (where GROUPRESYNC allows resolution via any data sharing group member), Pinnacle's MQ configuration required resolution through the specific queue manager (PHQA). PHQB on PRDB could not have resolved the indoubt transactions.
Corrective Actions
Diane's remediation plan addressed all three factors:
Action 1: Fix the memory leak (Root Cause)
The claims adjudication program's storage leak was identified and corrected. A GETMAIN/FREEMAIN audit of the program revealed that a dynamically allocated work area was not being freed when the program took an early exit path (a rarely triggered business rule).
Action 2: Reduce RMRETRY from 300 to 30 seconds
SIT Override (all AORs):
RMRETRY=30
With RMRETRY=30, the shunted UOWs would have been retried at 09:49:15, 09:49:45, 09:50:15, and so on. Once PHQA recovered at 09:55:20, the next retry (by 09:55:45) would have resolved the UOWs automatically, achieving what the manual FORCE did without operator intervention. The shunt window is still bounded by MQ recovery time; the faster retry removes the additional wait after MQ becomes available.
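The effect of the retry interval can be modeled directly. A hedged sketch, assuming retries fire at fixed RMRETRY intervals (actual CICS scheduling may differ slightly):

```python
def resolution_time(first_retry_t, mq_ready_t, rmretry_s):
    """Return the first retry at or after the instant MQ is back, given
    retries every rmretry_s seconds. Times are seconds after the
    09:47:12 failure."""
    t = first_retry_t
    while t < mq_ready_t:
        t += rmretry_s
    return t

MQ_READY = 488   # 09:55:20, T+488s

# Incident as observed: first retry at T+303 (09:52:15), RMRETRY=300.
print(resolution_time(303, MQ_READY, 300))  # 603 -> 09:57:15, ~2 min after MQ recovered
# With RMRETRY=30 (first retry ~30s after the T+93 shunt):
print(resolution_time(123, MQ_READY, 30))   # 513 -> 09:55:45
```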
Action 3: Reduce KEYINTV from 90 to 60 seconds
SIT Override (all AORs):
KEYINTV=60
Faster keypoints reduce the log scan time during emergency restart, slightly reducing the overall recovery window.
Action 4: Configure MQ shared queue for indoubt resolution
Diane worked with the MQ team to implement shared queues across PHQA and PHQB. With shared queues, indoubt resolution for MQ can proceed through either queue manager — similar to DB2's GROUPRESYNC capability.
This was the most significant architectural change. It required:
- Coupling facility structures for shared queue data
- Queue redefinition to specify QSGDISP(SHARED)
- Testing of the shared queue failover path
- Updated operational procedures
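The queue-sharing-group definitions might look like the following MQSC sketch; the CF structure and queue names are hypothetical, not Pinnacle's actual objects:

```mqsc
* Hedged sketch: CF structure and queue names are illustrative
DEFINE CFSTRUCT(CLAIMSTR) CFLEVEL(5) RECOVER(YES)
DEFINE QLOCAL(CLAIMS.NOTIFY) QSGDISP(SHARED) CFSTRUCT(CLAIMSTR)
```

With QSGDISP(SHARED), the queue lives in the coupling facility structure and is addressable from any queue manager in the queue sharing group, which is what allows indoubt resolution to proceed through a surviving member.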
Action 5: Add storage monitoring alerts
A CICS monitoring exit was configured to alert operations when 64-bit storage utilization exceeds 70% of MEMLIMIT. This provides early warning of memory leaks before they cause region failures.
CICS Storage Alert Thresholds:
70% MEMLIMIT → INFO alert (investigate within 24 hours)
80% MEMLIMIT → WARNING alert (investigate within 4 hours)
90% MEMLIMIT → CRITICAL alert (recycle region at next maintenance window)
95% MEMLIMIT → EMERGENCY (consider immediate controlled shutdown and restart)
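These thresholds translate directly into a monitoring check. A hedged Python sketch (the 8 GB MEMLIMIT in the example is an assumed value, not Pinnacle's actual setting):

```python
ALERT_LEVELS = [   # (fraction of MEMLIMIT, alert, response); highest first
    (0.95, "EMERGENCY", "consider immediate controlled shutdown and restart"),
    (0.90, "CRITICAL", "recycle region at next maintenance window"),
    (0.80, "WARNING", "investigate within 4 hours"),
    (0.70, "INFO", "investigate within 24 hours"),
]

def classify(used_bytes, memlimit_bytes):
    """Map 64-bit storage utilization to the alert table above."""
    util = used_bytes / memlimit_bytes
    for threshold, level, action in ALERT_LEVELS:
        if util >= threshold:
            return level, action
    return "OK", "no action required"

# A region deep into the leak: 7.4 GB used of an assumed 8 GB MEMLIMIT.
print(classify(7.4 * 2**30, 8 * 2**30))  # ('CRITICAL', 'recycle region at next maintenance window')
```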
Action 6: Establish indoubt UOW escalation procedure
Ahmad Rashidi's manual intervention at 09:55:32 was effective but ad hoc. Diane formalized the escalation:
| Threshold | Action | Authority |
|---|---|---|
| Shunted UOW detected | Alert Level 1 (CICS team) | Automatic |
| Shunted > 2 minutes | Alert Level 2 (Operations manager) | CICS team lead |
| Shunted > 5 minutes | Evaluate FORCE resolution | Operations manager + CICS lead |
| Shunted > 10 minutes | FORCE resolution mandatory | Operations manager |
| Any shunted UOW affecting patient care | Immediate FORCE evaluation | Compliance officer (Ahmad) |
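The escalation table can be encoded as a simple decision function. A hedged sketch; the `affects_patient_care` signal is an assumption about how the transaction or locked resource would be classified, not part of the documented procedure:

```python
def escalation(shunt_minutes, affects_patient_care=False):
    """Return the actions due for a shunted UOW, per the escalation table.

    affects_patient_care would be derived from the transaction ID or the
    locked resource (e.g. a MEMBER_ELIGIBILITY row); that mapping is an
    assumption for illustration.
    """
    actions = ["Alert Level 1 (CICS team)"]   # any shunted UOW, automatic
    if affects_patient_care:
        actions.append("Immediate FORCE evaluation (compliance officer)")
    if shunt_minutes > 2:
        actions.append("Alert Level 2 (Operations manager)")
    if shunt_minutes > 5:
        actions.append("Evaluate FORCE resolution (Operations manager + CICS lead)")
    if shunt_minutes > 10:
        actions.append("FORCE resolution mandatory (Operations manager)")
    return actions

# The incident's two ELIG UOWs at the 6-minute mark, locking member rows:
for action in escalation(6, affects_patient_care=True):
    print(action)
```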
The Deeper Lesson: Recovery Architecture Is Holistic
Diane Okoye's summary to Pinnacle's IT leadership:
"This incident taught us that recovery architecture cannot be designed in silos. Our CICS recovery was well-designed — emergency restart worked, indoubt detection worked, shunting worked. Our MQ configuration was well-designed — it recovered from the session storm within 8 minutes. But the interaction between CICS recovery and MQ recovery created a 7-minute gap that neither team had anticipated.
"The problem was not that either system failed to recover. The problem was that the systems recovered independently instead of cooperatively. CICS shunted UOWs because MQ was unavailable. MQ was unavailable because of a cascade from the CICS failure. Each system was following its own recovery procedure correctly. But the combined effect was a 7-minute lock contention window that affected patient care.
"The fix — shared queues for MQ indoubt resolution — is the architectural equivalent of DB2's GROUPRESYNC. It allows CICS to resolve its MQ-related indoubt transactions through any available MQ queue manager, not just the one on the same LPAR. This transforms the recovery from serial (CICS waits for MQ) to parallel (CICS resolves through the surviving MQ instance).
"But the deeper lesson is this: every time we add a resource manager to a transaction's two-phase commit scope, we add a recovery dependency. DB2. MQ. VSAM. Each is a potential point where indoubt resolution can be blocked. The architectural question isn't just 'does this resource need to be in the 2PC?' — it's 'what happens to recovery if this resource manager is unavailable at resolution time?'"
Ahmad Rashidi's Compliance Perspective
Ahmad added his own section to the post-incident report:
"From a compliance standpoint, the 7-minute eligibility verification outage for two specific members is within our HIPAA operational tolerance. But it revealed a systemic risk: our recovery architecture can create targeted, per-member outages where specific patients lose access to eligibility verification while the system as a whole appears healthy.
"This is worse than a system-wide outage in some ways. A system-wide outage triggers disaster recovery procedures, management notification, and visible incident management. A per-member outage — two locked rows affecting two patients — is invisible to our monitoring dashboards, invisible to our SLA metrics, and invisible to management. It only becomes visible when a pharmacist calls the help desk because a patient can't fill a prescription.
"We need monitoring that detects per-resource lock contention, not just system-level availability. I've added this to the compliance requirements for the next architecture review."
Discussion Questions
1. The 23 indoubt transactions were caused by memory pressure widening the indoubt window. How would you design monitoring to detect this widening before it leads to an incident?
2. Ahmad's observation about "invisible per-member outages" highlights a gap in traditional availability monitoring. Design a monitoring approach that detects lock contention affecting specific business entities (patients, accounts, claims) rather than just system-level metrics.
3. Diane's corrective actions include MQ shared queues. What are the trade-offs of shared queues vs. dedicated queues for recovery? (Consider performance, complexity, and failure modes.)
4. The CEMT SET UOW(*) FORCE command that Ahmad triggered forces immediate resolution of all shunted UOWs. What would happen if PHQA was still unavailable when this command was issued? Is FORCE always safe?
5. Pinnacle's RMRETRY was at the default 300 seconds for years. Why do you think this was never changed? What does this tell you about the importance of reviewing default configurations for recovery parameters?
6. If Pinnacle migrated the VSAM audit journal from the FOR to a DB2 table (eliminating the FOR from the 2PC scope), how would the recovery profile change? Would it be faster, slower, or about the same? Why?