Case Study 2: Pinnacle Health Insurance's CICS Storage Crisis
Background
Pinnacle Health Insurance operates a 2-LPAR Parallel Sysplex processing 50 million claims per month. Their CICS environment handles real-time claims adjudication, provider inquiries, member eligibility checks, and prior authorization processing. The online system serves 3,200 concurrent users during peak hours (9:00 AM to 3:00 PM Eastern) — a mix of claims processors, customer service representatives, and provider office staff using TN3270 terminal emulation and an increasingly popular web front-end.
Diane Okoye, the systems architect, manages the CICS topology: 4 TORs (Terminal Owning Regions), 8 AORs (Application Owning Regions) distributed across 2 LPARs, and 2 FORs (File Owning Regions). CICSPlex SM handles workload routing. Each AOR's ECDSA (Extended CICS Dynamic Storage Area) is configured at 256 MB — a setting that had worked for three years.
Ahmad Rashidi, the compliance officer, would become involved because the resulting outage triggered HIPAA breach notification review procedures.
The Change That Started It All
In January 2022, Pinnacle's development team deployed a new claims adjudication program: PNCLAJ25. The program replaced an older version (PNCLAJ20) that had been in production for eight years. The new version added support for Pinnacle's expanded network of value-based care providers — a business requirement driven by a new payer contract with three major hospital systems.
The key difference: PNCLAJ20 had 180 KB of working storage per task. PNCLAJ25 had 2.4 MB of working storage per task.
The increase came from four sources:
| Working Storage Component | PNCLAJ20 | PNCLAJ25 | Reason for Increase |
|---|---|---|---|
| Provider rate table | 60 KB | 1.2 MB | Value-based care requires per-procedure rates for 800 providers |
| Adjudication rules table | 40 KB | 800 KB | New contract terms: 340 rules vs. previous 45 |
| Member eligibility cache | 30 KB | 200 KB | Extended family coverage structures |
| Communication areas | 50 KB | 200 KB | New API integration fields |
| Total | 180 KB | 2.4 MB | 13.3x increase |
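To see which components drive the growth, the ratios can be worked out from the table. The following is a minimal sketch; the dictionary layout and function name are illustrative, and it follows the table's convention that 1 MB = 1,000 KB.

```python
# Working storage per task in KB, copied from the table above.
# Assumption: 1 MB = 1,000 KB, which is how the table's totals add up.
COMPONENTS_KB = {
    "Provider rate table": (60, 1200),
    "Adjudication rules table": (40, 800),
    "Member eligibility cache": (30, 200),
    "Communication areas": (50, 200),
}


def growth_report(components):
    """Return per-component growth factors plus the old and new totals in KB."""
    factors = {name: new / old for name, (old, new) in components.items()}
    old_total = sum(old for old, _ in components.values())
    new_total = sum(new for _, new in components.values())
    return factors, old_total, new_total


factors, old_total, new_total = growth_report(COMPONENTS_KB)
for name, factor in factors.items():
    print(f"{name}: {factor:.1f}x")
print(f"Total: {old_total} KB -> {new_total} KB ({new_total / old_total:.1f}x)")
```

The two reference tables each grow 20x and together account for 2,000 of the 2,400 KB, which is why they become the focus of the later redesign.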
The developer who wrote PNCLAJ25 came from a batch programming background. He had joined Pinnacle six months earlier from Federal Benefits Administration (where Sandra Chen had been his manager). He was an excellent COBOL programmer. He had never sized CICS working storage.
"In batch, 2.4 MB is nothing," the developer later explained. "My batch programs at FBA routinely used 500 MB. Nobody ever talked about working storage per task."
The Incident
Monday Morning, January 24, 2022
The deployment happened at 6:00 AM on Sunday, January 23 — during the maintenance window. PNCLAJ25 was new-copied into all 8 AORs. The deployment passed all standard checks: the program compiled cleanly, link-edited successfully, ran through unit tests in the QA CICS region, and processed 50 test claims correctly.
Nobody checked working storage size.
Monday morning, claims processing began at 8:00 AM as usual. Transaction volume ramped up gradually.
- 08:00-09:30 — Normal processing. PNCLAJ25 handled approximately 40-60 concurrent tasks per AOR. Working storage consumption: 60 × 2.4 MB = 144 MB per AOR. Within the 256 MB ECDSA — tight but functional.
- 09:45 — Volume increases. Claims processors from the West Coast join. Peak concurrent PNCLAJ25 tasks per AOR: 95.
- 09:48 — AOR PNCAO01 on LPAR PNCPROD1: DFHSM0131 message — ECDSA Short On Storage (SOS) condition.
The CICS SOS condition is not an immediate abend. CICS suspends new task creation to prevent storage exhaustion. Existing tasks continue. But from the users' perspective, the system has frozen — new transactions hang, screens don't respond, phone queues build.
```text
DFHSM0131 I PNCAO01 STORAGE CUSHION HAS BEEN REACHED FOR ECDSA.
    CURRENT SIZE: 268435456  IN USE: 256901120  CUSHION: 11534336
```
Translation: ECDSA total is 256 MB. In use: 245 MB. The cushion (reserved for CICS internal use) is 11 MB. CICS entered SOS because the remaining free ECDSA (256 - 245 = 11 MB) has hit the cushion threshold.
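The translation can be reproduced with a few lines of arithmetic. This is a sketch only: the function name is hypothetical, and the rule "SOS when free storage is at or below the cushion" is a simplification taken from the explanation above, not a statement of how CICS decides internally.

```python
MIB = 1024 * 1024  # the byte counts in the message are binary megabytes


def decode_sos(current_size, in_use, cushion):
    """Convert the message's byte counts to MB and compare free space to the cushion."""
    free = current_size - in_use
    return {
        "total_mb": current_size / MIB,
        "in_use_mb": in_use / MIB,
        "free_mb": free / MIB,
        "cushion_mb": cushion / MIB,
        # Assumed rule, per the text: SOS once free storage reaches the cushion.
        "short_on_storage": free <= cushion,
    }


status = decode_sos(268435456, 256901120, 11534336)
print(status)
```

With the values from the message, free storage equals the cushion exactly (11 MB), so the region is flagged as short on storage.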
- 09:52 — Three more AORs enter SOS: PNCAO02, PNCAO05, PNCAO06.
- 09:55 — Diane Okoye receives automated alerts. She begins investigation.
- 10:05 — Diane issues EXEC CICS INQUIRE SYSTEM commands and reviews CICS statistics. She sees:
```text
AOR:                   PNCAO01
ECDSA current:         256 MB
ECDSA in use:          248 MB (97%)
Tasks active:          112
Tasks suspended (SOS): 38
Largest program WS:    PNCLAJ25 — 2,457,600 bytes per task
```
- 10:08 — Diane's reaction (later described to Kwame Mensah at a cross-company architecture forum): "I saw 2.4 megabytes per task and I knew exactly what happened. One hundred and twelve tasks times 2.4 MB is 269 megabytes. Our ECDSA is 256 megabytes. The program doesn't fit."
- 10:12 — Diane's emergency action: disable PNCLAJ25 in all AORs. EXEC CICS SET PROGRAM(PNCLAJ25) STATUS(DISABLED). Claims adjudication stops entirely.
- 10:15 — SOS condition clears in all AORs within 3 minutes as tasks complete and working storage is freed.
- 10:20 — Diane re-enables the old program PNCLAJ20 (still installed in the AORs from the previous deployment; the NEWCOPY had loaded PNCLAJ25 over it but the old version was retained in DFHRPL). Claims processing resumes with the old program.
- 10:45 — All queued transactions have been processed. System fully recovered.
Total outage for claims adjudication: approximately 55 minutes.
Impact Assessment
Ahmad Rashidi's compliance team conducted the impact assessment:
| Impact Category | Details |
|---|---|
| Claims processing interruption | 55 minutes, affecting ~4,200 in-flight and queued claims |
| Customer service impact | 320+ callers received "system unavailable" messages |
| Provider portal | 15-minute delay cascade; auto-adjudication queue backed up by 8,700 claims |
| Financial impact | $45,000 in overtime for claims processors to clear backlog |
| HIPAA review | Required because SOS condition caused 38 transactions to suspend mid-processing; investigation confirmed no PHI exposure |
| State regulatory filing | Required in 3 states (processing delays exceeding 30 minutes trigger filing) |
Root Cause Analysis
Diane convened a root cause analysis (RCA) meeting that afternoon. The findings:
Finding 1: No Working Storage Size Gate in the Deployment Process
The deployment checklist included program compilation verification, link-edit success, QA test pass, and change management approval. It did not include any check of per-task working storage size. A program with 2.4 MB per task was deployed without anyone calculating the ECDSA impact.
Finding 2: QA Environment Did Not Replicate Production Concurrency
The QA CICS environment processed 50 test claims — sequentially. Peak concurrency in QA: 1 task. At 1 task × 2.4 MB = 2.4 MB, the program ran perfectly. At 112 tasks × 2.4 MB = 269 MB, it exceeded ECDSA. The QA environment could not have caught this — it was never designed to test storage under concurrency.
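The gap between QA and production comes down to one multiplication. The sketch below uses the case study's round figures (2.4 MB per task, 256 MB ECDSA) and assumes, unrealistically, that PNCLAJ25 working storage is the only consumer of ECDSA; the function names are illustrative.

```python
import math


def ecdsa_demand_mb(tasks, ws_mb_per_task):
    """Working storage demand when every concurrent task holds its own copy."""
    return tasks * ws_mb_per_task


def max_tasks(ecdsa_mb, ws_mb_per_task, cushion_mb=0.0):
    """Largest task count that fits in ECDSA after setting aside a cushion."""
    return math.floor((ecdsa_mb - cushion_mb) / ws_mb_per_task)


WS_MB = 2.4      # PNCLAJ25 working storage per task
ECDSA_MB = 256   # per-AOR ECDSA at the time of the incident

for label, tasks in (("QA", 1), ("Production peak", 112)):
    demand = ecdsa_demand_mb(tasks, WS_MB)
    verdict = "fits" if demand <= ECDSA_MB else "exceeds ECDSA"
    print(f"{label}: {tasks} task(s) -> {demand:.1f} MB ({verdict})")

print("Ceiling, no cushion:", max_tasks(ECDSA_MB, WS_MB))
print("Ceiling, 11 MB cushion:", max_tasks(ECDSA_MB, WS_MB, cushion_mb=11))
```

Under these assumptions the ceiling is 106 tasks per AOR with the full 256 MB and 102 once an 11 MB cushion is set aside, both below the 112 active tasks Diane observed.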
Finding 3: Batch-to-CICS Knowledge Gap
The developer understood batch storage (where per-job working storage up to 1+ GB is routine) but not CICS storage (where per-task working storage must be multiplied by concurrent tasks). This is a training gap, not a competence gap. Nobody explained the multiplication effect.
Finding 4: No CICS Storage Monitoring Threshold
Pinnacle monitored CICS for transaction response time, CPU utilization, and DB2 thread usage. There was no alert for ECDSA utilization trending upward. The SOS condition was the first notification.
Remediation
Short-Term: Redesign PNCLAJ25
Diane worked with the developer to reduce per-task working storage. The redesign:
| Component | Original | Redesigned | Approach |
|---|---|---|---|
| Provider rate table | 1.2 MB | 0 KB | Moved to CICS shared data table (DFHSDT) — single copy shared by all tasks |
| Adjudication rules table | 800 KB | 0 KB | Moved to VSAM file read via EXEC CICS READ — rules looked up on demand, not preloaded |
| Member eligibility cache | 200 KB | 200 KB | Kept in working storage (small enough, frequently modified per-task) |
| Communication areas | 200 KB | 200 KB | Kept (per-task data, can't be shared) |
| Total | 2.4 MB | 400 KB | 6x reduction |
The redesigned PNCLAJ25 was deployed the following Sunday. At 112 concurrent tasks: 112 × 400 KB = 44.8 MB, or about 45 MB of ECDSA. Comfortable.
The shared data table approach was key: instead of each task loading its own copy of the 1.2 MB provider rate table, CICS loaded one copy into a shared data table accessible to all tasks. The storage cost went from 1.2 MB × N tasks to 1.2 MB × 1 = 1.2 MB total. Diane called this "the CICS architect's first lesson: if the data doesn't change per task, don't put it in working storage."
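The difference between the two storage models can be sketched as two cost functions. The function names are illustrative, and the model only counts the table itself; it says nothing about where CICS actually places a shared data table.

```python
def per_task_copy_mb(table_mb, tasks):
    """Every task keeps its own copy in working storage: cost scales with tasks."""
    return table_mb * tasks


def shared_copy_mb(table_mb, tasks):
    """One copy per region shared by all tasks: cost is independent of tasks."""
    return table_mb


RATE_TABLE_MB = 1.2  # provider rate table size from the case study

for tasks in (1, 60, 112):
    private = per_task_copy_mb(RATE_TABLE_MB, tasks)
    shared = shared_copy_mb(RATE_TABLE_MB, tasks)
    print(f"{tasks:>3} tasks: per-task copies {private:6.1f} MB, shared copy {shared:.1f} MB")
```

At 112 tasks the per-task model needs roughly 134 MB for the rate table alone, while the shared model stays at 1.2 MB regardless of load.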
Medium-Term: Process and Monitoring Improvements
1. Working Storage Size Gate:
Diane added a mandatory step to the CICS deployment checklist:
```text
CICS DEPLOYMENT CHECKLIST — STORAGE REVIEW (MANDATORY)

Program name: _______________
Per-task working storage (from compile listing Data Division Map): _____ KB
Maximum concurrent tasks (from CICSPlex SM statistics, peak): _____
Total ECDSA requirement (WS × tasks): _____ MB
Current ECDSA allocation: _____ MB
Current ECDSA in-use at peak: _____ MB
Available ECDSA headroom: _____ MB
Headroom after deployment: _____ MB (must be > 20% of ECDSA allocation)

Reviewer: _______________ Date: _______________
```
Any program with per-task working storage exceeding 500 KB requires architect review.
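The checklist arithmetic could be automated along the following lines. This is a hedged sketch, not Pinnacle's actual tooling: the class and field names are hypothetical, the baseline peak usage is assumed to exclude the program under review, and the 0 MB baseline in the example is a placeholder because the case study does not give that figure.

```python
from dataclasses import dataclass


@dataclass
class StorageReview:
    program: str
    ws_kb_per_task: float
    peak_tasks: int
    ecdsa_alloc_mb: float
    baseline_in_use_mb: float  # assumed: peak usage excluding this program

    def requirement_mb(self):
        # 1 MB = 1,000 KB, matching the arithmetic used in the case study.
        return self.ws_kb_per_task * self.peak_tasks / 1000

    def headroom_after_mb(self):
        return self.ecdsa_alloc_mb - self.baseline_in_use_mb - self.requirement_mb()

    def passes(self, min_headroom_fraction=0.20):
        """Headroom after deployment must exceed 20% of the ECDSA allocation."""
        return self.headroom_after_mb() > min_headroom_fraction * self.ecdsa_alloc_mb

    def needs_architect_review(self, limit_kb=500):
        """Programs above 500 KB per task go to the architect."""
        return self.ws_kb_per_task > limit_kb


# Placeholder baseline of 0 MB: even an otherwise empty ECDSA fails the original.
original = StorageReview("PNCLAJ25", 2400, 112, 256, 0)
redesigned = StorageReview("PNCLAJ25", 400, 112, 256, 0)

for review in (original, redesigned):
    print(review.ws_kb_per_task, "KB/task:",
          f"needs {review.requirement_mb():.1f} MB,",
          "PASS" if review.passes() else "FAIL",
          "(architect review)" if review.needs_architect_review() else "")
```

Run against the original design, the gate fails on headroom and also trips the 500 KB architect-review rule; the 400 KB redesign clears both.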
2. ECDSA Monitoring:
Diane implemented CICS statistics monitoring with three alert thresholds:
| Threshold | ECDSA Utilization | Action |
|---|---|---|
| Warning | > 60% | Email to Diane and on-call CICS admin |
| Critical | > 75% | Page to on-call CICS admin; investigate immediately |
| Emergency | > 85% | Auto-alert to operations center; prepare for SOS |
The monitoring uses CICS statistics collected every 60 seconds via the CICS statistics API and fed to their monitoring tool.
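The threshold table maps directly onto a small classification function. The sketch below is illustrative: the function name and the "Normal" label for utilization at or below 60% are assumptions, and how the in-use figure is obtained from CICS statistics is left out.

```python
def classify_ecdsa(in_use_mb, alloc_mb):
    """Map ECDSA utilization to the alert levels in the threshold table."""
    pct = 100.0 * in_use_mb / alloc_mb
    if pct > 85:
        return "Emergency"
    if pct > 75:
        return "Critical"
    if pct > 60:
        return "Warning"
    return "Normal"  # assumed label; the table defines no level below Warning


# Incident-day figures, counting only PNCLAJ25 working storage (2.4 MB per task).
for tasks in (60, 95):
    in_use = tasks * 2.4
    print(f"{tasks} tasks, {in_use:.0f} MB in use: {classify_ecdsa(in_use, 256)}")
```

Applied to the incident-day numbers, 60 tasks (144 MB, about 56%) would have raised no alert, while 95 tasks (228 MB, about 89%) would have gone straight to Emergency, which illustrates how quickly utilization moved once volume ramped up.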
3. Developer Training:
Pinnacle implemented mandatory CICS storage training for all developers who deploy to CICS. The training, developed by Diane, covers:
- Working storage per-task multiplication
- ECDSA sizing and SOS conditions
- Alternatives to large per-task working storage (shared data tables, TSQs, file reads)
- CICS storage monitoring and how to read CICS statistics
The batch-background developer who wrote PNCLAJ25 became one of the strongest advocates for the training. "I'd been writing COBOL for twelve years and I didn't know about the multiplication effect," he told the training class. "If you come from batch, CICS storage is a completely different world."
Long-Term: Architecture Standards
Diane established CICS storage architecture standards for Pinnacle:
| Standard | Value | Rationale |
|---|---|---|
| Maximum per-task working storage | 500 KB | At 200 concurrent tasks, 500 KB × 200 = 100 MB, which is 39% of the original 256 MB ECDSA (26% of the new 384 MB allocation) |
| ECDSA allocation per AOR | 384 MB (increased from 256 MB) | Provides headroom for growth; still well within below-bar limits |
| Shared data tables for reference data > 100 KB | Mandatory | Eliminates per-task multiplication for read-only reference data |
| RPTSTG equivalent for CICS | CICS statistics collection every 60 seconds | Continuous storage monitoring |
| Quarterly ECDSA capacity review | Mandatory | Compare peak usage trend to ECDSA allocation |
Connections to Other Anchor Examples
After the incident, Diane contacted Kwame Mensah at Continental National Bank through the SHARE user group. Kwame had dealt with the below-bar crisis in batch; Diane had dealt with the per-task multiplication crisis in CICS. They realized the root cause was the same: architects who don't understand virtual storage constraints build systems that fail at scale.
"Batch architects hit the bar," Kwame said. "CICS architects hit the DSA. The numbers are different but the lesson is identical: know your container's capacity, calculate your program's requirements, and do the arithmetic before you deploy."
Sandra Chen at Federal Benefits Administration attended the SHARE session where Kwame and Diane presented their joint case study. She left with a list of 15 CICS programs at FBA that had never had their per-task working storage calculated against DSA capacity. Four of them were over 1 MB per task.
Discussion Questions
- The developer who wrote PNCLAJ25 was experienced in batch COBOL but new to CICS. Should the code review process have caught the 2.4 MB working storage? What code review checklist item would have flagged it?
- Diane's emergency fix was to disable PNCLAJ25 and re-enable the old program. This caused a 55-minute outage. Could she have taken a different emergency action that maintained partial service? What are the trade-offs?
- The shared data table solution eliminated the per-task multiplication for the provider rate table. What are the limitations of shared data tables? When would this approach NOT work?
- Diane increased ECDSA from 256 MB to 384 MB. Why not 512 MB or 1 GB? What constrains the maximum ECDSA allocation? (Hint: think about what else lives below the bar in a CICS region.)
- Ahmad Rashidi's compliance team had to investigate whether the SOS condition caused any HIPAA violations. Why would a storage exhaustion condition trigger a HIPAA review? What could have happened to PHI if the SOS had been more severe?
- Compare the batch storage problem (CNB's S80A abend, Case Study 1) and the CICS storage problem (Pinnacle's SOS, this case study). Both were caused by growing data volumes. Why did the batch problem manifest as an S80A abend while the CICS problem manifested as an SOS condition? What does this tell you about the different storage management models?