Case Study 2: Pinnacle Health's Indoubt Transaction Resolution
When Seven Minutes Changed the Recovery Architecture
Background
Pinnacle Health Insurance processes 50 million claims per month through a CICS-based claims adjudication system. During open enrollment (November 1–December 15), transaction volumes triple as members verify eligibility, change plans, and submit pre-authorization requests.
The CICS topology consists of 8 regions across 2 LPARs:
LPAR PRDA                          LPAR PRDB
┌─────────────────────┐            ┌─────────────────────┐
│ PHTORA1 (TOR)       │            │ PHTORB1 (TOR)       │
│ PHAORA1 (AOR)       │◄───IPIC───►│ PHAORB1 (AOR)       │
│ PHAORA2 (AOR)       │◄───IPIC───►│ PHAORB2 (AOR)       │
│ PHFORA1 (FOR)       │            │ PHFORB1 (FOR)       │
│ DB2 Member A        │            │ DB2 Member B        │
│ MQ QMgr PHQA        │            │ MQ QMgr PHQB        │
└─────────────────────┘            └─────────────────────┘
Each AOR handles eligibility verification, claims status inquiry, and pre-authorization transactions. The AOR transactions span DB2 (member eligibility records), MQ (claims notification queue for downstream systems), and VSAM (audit journal in the FOR).
The Incident: November 14, 2022
09:47:12 — AOR Failure
PHAORA1 on LPAR PRDA terminated abnormally at 09:47:12 during peak open enrollment activity. The cause: a z/OS S878 abend — the CICS region exceeded its MEMLIMIT (maximum storage above the 2GB bar). A memory-intensive claims adjudication program had a storage leak that accumulated over 72 hours of continuous operation, gradually consuming 64-bit storage until the MEMLIMIT threshold was breached.
At the moment of failure:
| Metric | Value |
|---|---|
| Active tasks | 423 |
| Tasks holding DB2 locks | 189 |
| Tasks with MQ puts | 67 |
| Tasks with MRO to PHFORA1 | 34 |
| Transaction volume | 1,800 TPS |
| In-flight eligibility verifications | 156 |
| In-flight pre-authorizations | 42 |
09:47:13 — CICSPlex SM Response
CICSPlex SM detected the failure and removed PHAORA1 from the routing table within 1.1 seconds. Remaining AORs absorbed the workload:
- PHAORA2 (PRDA): 25% → 50%
- PHAORB1 (PRDB): 25% → 25%
- PHAORB2 (PRDB): 25% → 25%
With four AORs sharing the load evenly before the failure, the surviving PRDA region absorbed the failed region's entire 25% share, doubling PHAORA2's load on the LPAR that was also handling the ARM restart and DB2 lock resolution.
09:47:14 — ARM Restart Initiated
ARM detected the failure and initiated restart of PHAORA1. The SIT specified START=AUTO, triggering emergency restart.
09:48:22 — Emergency Restart: Log Scan Complete (T+70s)
The recovery manager scanned 412,847 log records (KEYINTV was 90 seconds at the time — a setting that would be reduced after this incident).
UOW Classification:
| Classification | Count | Action |
|---|---|---|
| In-flight | 400 | Backout |
| Indoubt (PREPARE complete, no commit) | 23 | Resolve with resource managers |
The 23 indoubt transactions were an unusually high number. The root cause: the memory leak had slowed CICS's commit processing in the seconds before the failure, widening the indoubt window. Normally, the window is microseconds. Under memory pressure, commit log writes were taking 2–5ms, and more transactions were caught in the prepare-to-commit gap.
09:48:42 — Indoubt Resolution Phase 1 (T+90s)
The recovery manager attempted to resolve all 23 indoubt transactions:
DB2 resolution: CICS reconnected to DB2 Member A on PRDA. DB2 was available and responsive. All 23 indoubt transactions had DB2 as a participant; because no commit records existed on the log, DB2 received backout signals and rolled back all 23 transactions' DB2 changes. Duration: 3 seconds.
MQ resolution — first attempt: 21 of the 23 indoubt transactions had MQ as a participant. For 19 of those 21, the MQ queue manager PHQA on PRDA was available and processed the backout.
MQ resolution — failure: the remaining 2 transactions could not resolve their MQ participant. PHQA had itself run into trouble at 09:47:15 — 3 seconds after the CICS failure, the sudden disconnect of 67 CICS-MQ sessions pushed it into a "channel reconnection" state. While in that state, PHQA was unavailable for indoubt resolution.
VSAM resolution: All 23 transactions had VSAM participants in PHFORA1. The FOR was available, and all VSAM backouts completed successfully, leaving 21 of the 23 UOWs fully resolved and 2 blocked on the unavailable MQ participant.
09:48:45 — Two UOWs Shunted (T+93s)
The recovery manager shunted the 2 unresolved UOWs. CICS became available for new transactions with 2 shunted UOWs.
DFHRM0501 UNIT OF WORK DISPLAY

URID: 00000000007C3A20
  TRANSID: ELIG
  STATUS: SHUNTED
  PARTICIPANTS:
    DB2(DB2P)     - RESOLVED (BACKOUT)
    MQ(PHQA)      - UNRESOLVED (UNAVAILABLE)
    VSAM(PHFORA1) - RESOLVED (BACKOUT)
  LOCKED RESOURCES:
    DB2 ROW: MEMBER_ELIGIBILITY WHERE MEMBER_ID = 'MBR-2847291'

URID: 00000000007C3B48
  TRANSID: ELIG
  STATUS: SHUNTED
  PARTICIPANTS:
    DB2(DB2P)     - RESOLVED (BACKOUT)
    MQ(PHQA)      - UNRESOLVED (UNAVAILABLE)
    VSAM(PHFORA1) - RESOLVED (BACKOUT)
  LOCKED RESOURCES:
    DB2 ROW: MEMBER_ELIGIBILITY WHERE MEMBER_ID = 'MBR-1093847'
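The shunting decision follows a simple rule: a UOW can leave the indoubt state only when every participant has acknowledged the outcome. A minimal Python sketch of that rule, with illustrative names (`Participant`, `resolve_indoubt`), not CICS internals:

```python
from dataclasses import dataclass

@dataclass
class Participant:
    name: str          # e.g. "DB2(DB2P)", "MQ(PHQA)", "VSAM(PHFORA1)"
    available: bool    # can this resource manager be contacted right now?
    resolved: bool = False

@dataclass
class UnitOfWork:
    urid: str
    participants: list

def resolve_indoubt(uow):
    """Attempt backout with every reachable participant.

    Any participant that cannot be contacted leaves the whole UOW
    shunted: its locks are retained until a later retry (or an operator
    FORCE) completes resolution.
    """
    for p in uow.participants:
        if p.available:
            p.resolved = True      # backout acknowledged by the RM
    if all(p.resolved for p in uow.participants):
        return "RESOLVED"
    return "SHUNTED"

# The second UOW from the incident: DB2 and VSAM reachable, PHQA not.
uow = UnitOfWork("00000000007C3B48", [
    Participant("DB2(DB2P)", available=True),
    Participant("MQ(PHQA)", available=False),
    Participant("VSAM(PHFORA1)", available=True),
])
print(resolve_indoubt(uow))   # SHUNTED
```

In the incident, DB2 and VSAM acknowledged the backout but PHQA could not be contacted, so both UOWs stayed shunted with their locks retained.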
09:48:45 to 09:55:32 — The Seven-Minute Lock Contention Window
For 6 minutes and 47 seconds, two DB2 rows were locked by the shunted UOWs. During this window:
Member MBR-2847291: Patricia Hawkins, age 67, enrolled in a Pinnacle Platinum plan. During the shunt window, 3 eligibility verification transactions were attempted for her record (one from a pharmacy, one from her primary care physician's office, one from an oncology department pre-authorization system).
All 3 transactions received DB2 SQLCODE -911 (lock timeout) after the 30-second IRLMRWT lock wait:
- 09:49:15 — Pharmacy verification timeout (patient waiting at counter)
- 09:50:02 — Primary care verification timeout (nurse on hold)
- 09:53:18 — Oncology pre-auth timeout (scheduler attempting to confirm treatment eligibility)
Member MBR-1093847: David Chen-Williams, age 34, enrolled in a Pinnacle Standard plan. During the shunt window, 1 eligibility verification was attempted (from a dental office). It timed out at 09:51:44.
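Callers that receive SQLCODE -911 must re-drive the whole transaction, because DB2 has already rolled the unit of work back. A hedged sketch of client-side retry handling (hypothetical function names, not Pinnacle's code; note that during the shunt window every re-drive against the locked row would also have timed out until the UOW was resolved):

```python
import time

LOCK_TIMEOUT = -911   # DB2 SQLCODE -911: deadlock or lock timeout; DB2 has rolled back

def verify_eligibility_with_retry(execute_txn, max_attempts=3, backoff_s=2.0):
    """Re-drive an eligibility transaction when DB2 signals -911.

    -911 means DB2 has already rolled the unit of work back, so the
    whole transaction must be re-executed, not resumed. A short backoff
    gives a retained lock (e.g. from a shunted UOW) a chance to clear.
    """
    for attempt in range(1, max_attempts + 1):
        sqlcode, result = execute_txn()
        if sqlcode == 0:
            return result
        if sqlcode != LOCK_TIMEOUT or attempt == max_attempts:
            raise RuntimeError(f"verification failed, SQLCODE {sqlcode}")
        time.sleep(backoff_s * attempt)   # linear backoff between re-drives

# Simulated transaction: two -911 timeouts, then success.
attempts = []
def fake_txn():
    attempts.append(1)
    return (0, "ELIGIBLE") if len(attempts) == 3 else (LOCK_TIMEOUT, None)

print(verify_eligibility_with_retry(fake_txn, backoff_s=0.0))  # ELIGIBLE
```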
09:52:15 — RMRETRY First Attempt (T+303s)
CICS's RMRETRY interval was 300 seconds (the default, which had never been changed). At T+303 seconds, the recovery manager retried resolution of the 2 shunted UOWs. MQ queue manager PHQA was still in its recovery sequence. Resolution failed. Next retry scheduled for T+603 seconds.
09:55:20 — MQ Queue Manager Recovery Complete
PHQA completed its recovery and was available for indoubt resolution. But CICS would not retry until the next RMRETRY interval.
09:55:32 — Manual Intervention
Ahmad Rashidi, monitoring the CICS console from the compliance operations center, noticed the shunted UOW alerts. He contacted the CICS system programmer on duty, who issued:
CEMT SET UOW(*) FORCE
This forced CICS to immediately retry resolution of all shunted UOWs. Since PHQA was now available, both UOWs resolved (backout) within 2 seconds. DB2 locks were released. The eligibility records for both members were accessible again.
09:55:34 — Full Service Restored
Total time from failure to full service: 8 minutes 22 seconds. Of that, 90 seconds was emergency restart (acceptable) and 6 minutes 47 seconds was shunted UOW lock contention (unacceptable).
Impact Assessment
Patient Impact
| Member | Transactions Affected | Clinical Impact |
|---|---|---|
| MBR-2847291 (Patricia Hawkins) | 3 timeouts | Pharmacy couldn't verify eligibility — patient waited 12 minutes; Oncology pre-auth delayed 7 minutes |
| MBR-1093847 (David Chen-Williams) | 1 timeout | Dental office couldn't verify — patient rescheduled |
Compliance Impact
Ahmad Rashidi logged the incident as a HIPAA-relevant service availability event. Under Pinnacle's interpretation of the HIPAA Security Rule (45 CFR 164.308(a)(7)(ii)(B)), any disruption to electronic eligibility verification that affects patient care must be documented and reviewed.
The incident did not rise to the level of a reportable breach (no PHI was exposed or compromised), but it was flagged in the quarterly compliance review as a "system availability gap."
Financial Impact
| Category | Cost |
|---|---|
| Staff time for incident response | $1,200 |
| Post-incident review and remediation | $8,500 |
| Compliance documentation and review | $3,200 |
| Patient goodwill (estimated) | $500 |
| Total | $13,400 |
Post-Incident Review
Root Cause Analysis
Diane Okoye led the post-incident review with a focus on two questions: why did 23 transactions end up indoubt (unusually high), and why did resolution take 7 minutes instead of seconds?
Why 23 indoubt transactions:
The memory leak in the claims adjudication program had been consuming 64-bit storage for 72 hours. By the time of the failure, CICS was experiencing memory pressure that slowed log I/O. The commit record write (normally <0.1ms) was taking 2–5ms under the storage pressure. This widened the indoubt window by a factor of 20–50x, catching 23 transactions instead of the expected 0–2.
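The widening can be approximated with Little's law: the expected number of transactions sitting between prepare and commit at any instant is roughly the commit rate times the window duration. A back-of-envelope sketch (the observed 23 exceeds even the 5 ms estimate, consistent with additional queueing delay under memory pressure):

```python
def expected_indoubt(commit_rate_tps, window_s):
    """Little's law: average number of UOWs inside the indoubt window
    (prepare logged, commit record not yet hardened) at any instant."""
    return commit_rate_tps * window_s

RATE = 1800  # TPS at the moment of failure

print(expected_indoubt(RATE, 0.0001))  # healthy ~0.1 ms window -> ~0.18, usually zero caught
print(expected_indoubt(RATE, 0.002))   # 2 ms commit writes -> ~3.6
print(expected_indoubt(RATE, 0.005))   # 5 ms commit writes -> ~9
```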
Why 7-minute resolution:
Three factors compounded:
1. MQ queue manager disruption. The sudden loss of 67 CICS-MQ sessions triggered a reconnection storm in PHQA, making it unavailable for indoubt resolution for approximately 8 minutes.
2. RMRETRY=300 (default). CICS retried shunted UOW resolution only every 5 minutes. Even after PHQA recovered at 09:55:20, CICS would not have retried until 09:57:15 — nearly 2 more minutes of unnecessary lock contention.
3. No RESYNCMEMBER equivalent for MQ. Unlike DB2 (where GROUPRESYNC allows resolution via any data sharing group member), Pinnacle's MQ configuration required resolution through the specific queue manager (PHQA). PHQB on PRDB could not have resolved the indoubt transactions.
Corrective Actions
Diane's remediation plan addressed all three factors:
Action 1: Fix the memory leak (Root Cause)
The claims adjudication program's storage leak was identified and corrected. A GETMAIN/FREEMAIN audit of the program revealed that a dynamically allocated work area was not being freed when the program took an early exit path (a rarely triggered business rule).
Action 2: Reduce RMRETRY from 300 to 30 seconds
SIT Override (all AORs):
RMRETRY=30
With RMRETRY=30, the shunted UOWs would have been retried at 09:49:15, 09:49:45, 09:50:15, and so on. Once PHQA recovered at 09:55:20, the next retry (by 09:55:45) would have resolved the UOWs automatically, achieving what the manual FORCE did without operator intervention. The shunt window is still bounded by MQ recovery time; the faster retry removes the additional wait after MQ becomes available.
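The effect of the retry interval can be modeled directly. A hedged sketch, assuming retries fire at fixed RMRETRY intervals (actual CICS scheduling may differ slightly):

```python
def resolution_time(first_retry_t, mq_ready_t, rmretry_s):
    """Return the first retry at or after the instant MQ is back, given
    retries every rmretry_s seconds. Times are seconds after the
    09:47:12 failure."""
    t = first_retry_t
    while t < mq_ready_t:
        t += rmretry_s
    return t

MQ_READY = 488   # 09:55:20, T+488s

# Incident as observed: first retry at T+303 (09:52:15), RMRETRY=300.
print(resolution_time(303, MQ_READY, 300))  # 603 -> 09:57:15, ~2 min after MQ recovered
# With RMRETRY=30 (first retry ~30s after the T+93 shunt):
print(resolution_time(123, MQ_READY, 30))   # 513 -> 09:55:45
```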
Action 3: Reduce KEYINTV from 90 to 60 seconds
SIT Override (all AORs):
KEYINTV=60
Faster keypoints reduce the log scan time during emergency restart, slightly reducing the overall recovery window.
Action 4: Configure MQ shared queue for indoubt resolution
Diane worked with the MQ team to implement shared queues across PHQA and PHQB. With shared queues, indoubt resolution for MQ can proceed through either queue manager — similar to DB2's GROUPRESYNC capability.
This was the most significant architectural change. It required:
- Coupling facility structures for shared queue data
- Queue redefinition to specify QSGDISP(SHARED)
- Testing of the shared queue failover path
- Updated operational procedures
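The queue-sharing-group definitions might look like the following MQSC sketch; the CF structure and queue names are hypothetical, not Pinnacle's actual objects:

```mqsc
* Hedged sketch: CF structure and queue names are illustrative
DEFINE CFSTRUCT(CLAIMSTR) CFLEVEL(5) RECOVER(YES)
DEFINE QLOCAL(CLAIMS.NOTIFY) QSGDISP(SHARED) CFSTRUCT(CLAIMSTR)
```

With QSGDISP(SHARED), the queue lives in the coupling facility structure and is addressable from any queue manager in the queue sharing group, which is what allows indoubt resolution to proceed through a surviving member.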
Action 5: Add storage monitoring alerts
A CICS monitoring exit was configured to alert operations when 64-bit storage utilization exceeds 70% of MEMLIMIT. This provides early warning of memory leaks before they cause region failures.
CICS Storage Alert Thresholds:
70% MEMLIMIT → INFO alert (investigate within 24 hours)
80% MEMLIMIT → WARNING alert (investigate within 4 hours)
90% MEMLIMIT → CRITICAL alert (recycle region at next maintenance window)
95% MEMLIMIT → EMERGENCY (consider immediate controlled shutdown and restart)
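These thresholds translate directly into a monitoring check. A hedged Python sketch (the 8 GB MEMLIMIT in the example is an assumed value, not Pinnacle's actual setting):

```python
ALERT_LEVELS = [   # (fraction of MEMLIMIT, alert, response); highest first
    (0.95, "EMERGENCY", "consider immediate controlled shutdown and restart"),
    (0.90, "CRITICAL", "recycle region at next maintenance window"),
    (0.80, "WARNING", "investigate within 4 hours"),
    (0.70, "INFO", "investigate within 24 hours"),
]

def classify(used_bytes, memlimit_bytes):
    """Map 64-bit storage utilization to the alert table above."""
    util = used_bytes / memlimit_bytes
    for threshold, level, action in ALERT_LEVELS:
        if util >= threshold:
            return level, action
    return "OK", "no action required"

# A region deep into the leak: 7.4 GB used of an assumed 8 GB MEMLIMIT.
print(classify(7.4 * 2**30, 8 * 2**30))  # ('CRITICAL', 'recycle region at next maintenance window')
```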
Action 6: Establish indoubt UOW escalation procedure
Ahmad Rashidi's manual intervention at 09:55:32 was effective but ad hoc. Diane formalized the escalation:
| Threshold | Action | Authority |
|---|---|---|
| Shunted UOW detected | Alert Level 1 (CICS team) | Automatic |
| Shunted > 2 minutes | Alert Level 2 (Operations manager) | CICS team lead |
| Shunted > 5 minutes | Evaluate FORCE resolution | Operations manager + CICS lead |
| Shunted > 10 minutes | FORCE resolution mandatory | Operations manager |
| Any shunted UOW affecting patient care | Immediate FORCE evaluation | Compliance officer (Ahmad) |
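The escalation table can be encoded as a simple decision function. A hedged sketch; the `affects_patient_care` signal is an assumption about how the transaction or locked resource would be classified, not part of the documented procedure:

```python
def escalation(shunt_minutes, affects_patient_care=False):
    """Return the actions due for a shunted UOW, per the escalation table.

    affects_patient_care would be derived from the transaction ID or the
    locked resource (e.g. a MEMBER_ELIGIBILITY row); that mapping is an
    assumption for illustration.
    """
    actions = ["Alert Level 1 (CICS team)"]   # any shunted UOW, automatic
    if affects_patient_care:
        actions.append("Immediate FORCE evaluation (compliance officer)")
    if shunt_minutes > 2:
        actions.append("Alert Level 2 (Operations manager)")
    if shunt_minutes > 5:
        actions.append("Evaluate FORCE resolution (Operations manager + CICS lead)")
    if shunt_minutes > 10:
        actions.append("FORCE resolution mandatory (Operations manager)")
    return actions

# The incident's two ELIG UOWs at the 6-minute mark, locking member rows:
for action in escalation(6, affects_patient_care=True):
    print(action)
```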
The Deeper Lesson: Recovery Architecture Is Holistic
Diane Okoye's summary to Pinnacle's IT leadership:
"This incident taught us that recovery architecture cannot be designed in silos. Our CICS recovery was well-designed — emergency restart worked, indoubt detection worked, shunting worked. Our MQ configuration was well-designed — it recovered from the session storm within 8 minutes. But the interaction between CICS recovery and MQ recovery created a 7-minute gap that neither team had anticipated.
"The problem was not that either system failed to recover. The problem was that the systems recovered independently instead of cooperatively. CICS shunted UOWs because MQ was unavailable. MQ was unavailable because of a cascade from the CICS failure. Each system was following its own recovery procedure correctly. But the combined effect was a 7-minute lock contention window that affected patient care.
"The fix — shared queues for MQ indoubt resolution — is the architectural equivalent of DB2's GROUPRESYNC. It allows CICS to resolve its MQ-related indoubt transactions through any available MQ queue manager, not just the one on the same LPAR. This transforms the recovery from serial (CICS waits for MQ) to parallel (CICS resolves through the surviving MQ instance).
"But the deeper lesson is this: every time we add a resource manager to a transaction's two-phase commit scope, we add a recovery dependency. DB2. MQ. VSAM. Each is a potential point where indoubt resolution can be blocked. The architectural question isn't just 'does this resource need to be in the 2PC?' — it's 'what happens to recovery if this resource manager is unavailable at resolution time?'"
Ahmad Rashidi's Compliance Perspective
Ahmad added his own section to the post-incident report:
"From a compliance standpoint, the 7-minute eligibility verification outage for two specific members is within our HIPAA operational tolerance. But it revealed a systemic risk: our recovery architecture can create targeted, per-member outages where specific patients lose access to eligibility verification while the system as a whole appears healthy.
"This is worse than a system-wide outage in some ways. A system-wide outage triggers disaster recovery procedures, management notification, and visible incident management. A per-member outage — two locked rows affecting two patients — is invisible to our monitoring dashboards, invisible to our SLA metrics, and invisible to management. It only becomes visible when a pharmacist calls the help desk because a patient can't fill a prescription.
"We need monitoring that detects per-resource lock contention, not just system-level availability. I've added this to the compliance requirements for the next architecture review."
Discussion Questions
1. The 23 indoubt transactions were caused by memory pressure widening the indoubt window. How would you design monitoring to detect this widening before it leads to an incident?
2. Ahmad's observation about "invisible per-member outages" highlights a gap in traditional availability monitoring. Design a monitoring approach that detects lock contention affecting specific business entities (patients, accounts, claims) rather than just system-level metrics.
3. Diane's corrective actions include MQ shared queues. What are the trade-offs of shared queues vs. dedicated queues for recovery? (Consider performance, complexity, and failure modes.)
4. The CEMT SET UOW(*) FORCE command that Ahmad triggered forces immediate resolution of all shunted UOWs. What would happen if PHQA was still unavailable when this command was issued? Is FORCE always safe?
5. Pinnacle's RMRETRY was at the default 300 seconds for years. Why do you think this was never changed? What does this tell you about the importance of reviewing default configurations for recovery parameters?
6. If Pinnacle migrated the VSAM audit journal from the FOR to a DB2 table (eliminating the FOR from the 2PC scope), how would the recovery profile change? Would it be faster, slower, or about the same? Why?