Case Study 2: Pinnacle Health Insurance's CICS Storage Crisis

Background

Pinnacle Health Insurance operates a 2-LPAR Parallel Sysplex processing 50 million claims per month. Their CICS environment handles real-time claims adjudication, provider inquiries, member eligibility checks, and prior authorization processing. The online system serves 3,200 concurrent users during peak hours (9:00 AM to 3:00 PM Eastern) — a mix of claims processors, customer service representatives, and provider office staff using TN3270 terminal emulation and an increasingly popular web front-end.

Diane Okoye, the systems architect, manages the CICS topology: 4 TORs (Terminal Owning Regions), 8 AORs (Application Owning Regions) distributed across 2 LPARs, and 2 FORs (File Owning Regions). CICSPlex SM handles workload routing. Each AOR's ECDSA (Extended CICS Dynamic Storage Area) is configured at 256 MB — a setting that had worked for three years.

Ahmad Rashidi, the compliance officer, would become involved because the resulting outage triggered HIPAA breach notification review procedures.

The Change That Started It All

In January 2022, Pinnacle's development team deployed a new claims adjudication program: PNCLAJ25. The program replaced an older version (PNCLAJ20) that had been in production for eight years. The new version added support for Pinnacle's expanded network of value-based care providers — a business requirement driven by a new payer contract with three major hospital systems.

The key difference: PNCLAJ20 had 180 KB of working storage per task. PNCLAJ25 had 2.4 MB of working storage per task.

The increase came from four sources:

Working Storage Component	PNCLAJ20	PNCLAJ25	Reason for Increase
Provider rate table	60 KB	1.2 MB	Value-based care requires per-procedure rates for 800 providers
Adjudication rules table	40 KB	800 KB	New contract terms: 340 rules vs. previous 45
Member eligibility cache	30 KB	200 KB	Extended family coverage structures
Communication areas	50 KB	200 KB	New API integration fields
Total	180 KB	2.4 MB	13.3x increase
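The totals in the table can be rechecked with a few lines of arithmetic. The sketch below is only an illustration: it assumes 1 MB is treated as 1,000 KB, which matches the table's rounding, and the dictionary and function names are invented for this example.

```python
# Working-storage components in KB, taken from the table above.
# Assumption: 1 MB is treated as 1,000 KB, matching the table's rounding.
COMPONENTS_KB = {
    "Provider rate table": (60, 1200),
    "Adjudication rules table": (40, 800),
    "Member eligibility cache": (30, 200),
    "Communication areas": (50, 200),
}

def totals(components):
    """Return (old total KB, new total KB, growth factor)."""
    old = sum(old_kb for old_kb, _ in components.values())
    new = sum(new_kb for _, new_kb in components.values())
    return old, new, new / old

old_kb, new_kb, ratio = totals(COMPONENTS_KB)
print(f"PNCLAJ20: {old_kb} KB, PNCLAJ25: {new_kb} KB, growth: {ratio:.1f}x")
```

The two reference tables (provider rates and adjudication rules) account for 2,000 of the 2,400 KB, which is why the later redesign targets them first.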

The developer who wrote PNCLAJ25 came from a batch programming background. He had joined Pinnacle six months earlier from Federal Benefits Administration (where Sandra Chen had been his manager). He was an excellent COBOL programmer. He had never sized CICS working storage.

"In batch, 2.4 MB is nothing," the developer later explained. "My batch programs at FBA routinely used 500 MB. Nobody ever talked about working storage per task."

The Incident

Monday Morning, January 24, 2022

The deployment happened at 6:00 AM on Sunday, January 23 — during the maintenance window. PNCLAJ25 was new-copied into all 8 AORs. The deployment passed all standard checks: the program compiled cleanly, link-edited successfully, ran through unit tests in the QA CICS region, and processed 50 test claims correctly.

Nobody checked working storage size.

Monday morning, claims processing began at 8:00 AM as usual. Transaction volume ramped up gradually.

08:00-09:30 — Normal processing. PNCLAJ25 handled approximately 40-60 concurrent tasks per AOR. Working storage consumption: 60 × 2.4 MB = 144 MB per AOR. Within the 256 MB ECDSA — tight but functional.
09:45 — Volume increases. Claims processors from the West Coast join. Peak concurrent PNCLAJ25 tasks per AOR: 95.
09:48 — AOR PNCAO01 on LPAR PNCPROD1: DFHSM0131 message — ECDSA Short On Storage (SOS) condition.

The CICS SOS condition is not an immediate abend. CICS suspends new task creation to prevent storage exhaustion. Existing tasks continue. But from the users' perspective, the system has frozen — new transactions hang, screens don't respond, phone queues build.

DFHSM0131 I PNCAO01 STORAGE CUSHION HAS BEEN REACHED FOR ECDSA.
             CURRENT SIZE: 268435456  IN USE: 256901120  CUSHION: 11534336

Translation: the ECDSA total is 256 MB (268,435,456 bytes), of which 245 MB is in use. The cushion, reserved for CICS internal use, is 11 MB. CICS entered SOS because the remaining free ECDSA (256 - 245 = 11 MB) had shrunk to the cushion threshold.
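The byte counts in the message convert to the megabyte figures as follows. This is a minimal sketch: the function name is hypothetical, and the rule "free storage at or below the cushion means SOS" is a simplification of the explanation above, not a description of CICS internals.

```python
MIB = 1024 * 1024  # the message values are exact multiples of 1,048,576 bytes

def sos_snapshot(current_size, in_use, cushion):
    """Convert the byte counts from the message into MB and compare
    free storage against the cushion (simplified SOS rule)."""
    free = current_size - in_use
    return {
        "total_mb": current_size / MIB,
        "in_use_mb": in_use / MIB,
        "free_mb": free / MIB,
        "cushion_mb": cushion / MIB,
        "short_on_storage": free <= cushion,
    }

snap = sos_snapshot(268435456, 256901120, 11534336)
for name, value in snap.items():
    print(f"{name}: {value}")
```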

09:52 — Three more AORs enter SOS: PNCAO02, PNCAO05, PNCAO06.
09:55 — Diane Okoye receives automated alerts. She begins investigation.
10:05 — Diane issues EXEC CICS INQUIRE SYSTEM commands and reviews CICS statistics. She sees:

AOR: PNCAO01
  ECDSA current: 256 MB
  ECDSA in use:  248 MB (97%)
  Tasks active:  112
  Tasks suspended (SOS): 38
  Largest program WS: PNCLAJ25 — 2,457,600 bytes per task

10:08 — Diane's reaction (later described to Kwame Mensah at a cross-company architecture forum): "I saw 2.4 megabytes per task and I knew exactly what happened. One hundred and twelve tasks times 2.4 MB is 269 megabytes. Our ECDSA is 256 megabytes. The program doesn't fit."
10:12 — Diane's emergency action: disable PNCLAJ25 in all AORs. EXEC CICS SET PROGRAM(PNCLAJ25) STATUS(DISABLED). Claims adjudication stops entirely.
10:15 — SOS condition clears in all AORs within 3 minutes as tasks complete and working storage is freed.
10:20 — Diane re-enables the old program PNCLAJ20. Its program definition was still installed in the AORs from the previous deployment, and its load module had been retained in the DFHRPL concatenation when PNCLAJ25 was deployed. Claims processing resumes with the old program.
10:45 — All queued transactions have been processed. System fully recovered.

Total outage for claims adjudication: approximately 55 minutes.

Impact Assessment

Ahmad Rashidi's compliance team conducted the impact assessment:

Impact Category	Details
Claims processing interruption	55 minutes, affecting ~4,200 in-flight and queued claims
Customer service impact	320+ callers received "system unavailable" messages
Provider portal	15-minute delay cascade; auto-adjudication queue backed up by 8,700 claims
Financial impact	$45,000 in overtime for claims processors to clear backlog
HIPAA review	Required because SOS condition caused 38 transactions to suspend mid-processing; investigation confirmed no PHI exposure
State regulatory filing	Required in 3 states (processing delays exceeding 30 minutes trigger filing)

Root Cause Analysis

Diane convened a root cause analysis (RCA) meeting that afternoon. The findings:

Finding 1: No Working Storage Size Gate in the Deployment Process

The deployment checklist included program compilation verification, link-edit success, QA test pass, and change management approval. It did not include any check of per-task working storage size. A program with 2.4 MB per task was deployed without anyone calculating the ECDSA impact.

Finding 2: QA Environment Did Not Replicate Production Concurrency

The QA CICS environment processed 50 test claims — sequentially. Peak concurrency in QA: 1 task. At 1 task × 2.4 MB = 2.4 MB, the program ran perfectly. At 112 tasks × 2.4 MB = 269 MB, it exceeded ECDSA. The QA environment could not have caught this — it was never designed to test storage under concurrency.
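The gap between QA and production is purely a matter of multiplication. The sketch below reruns the case study's figures at several concurrency levels. It assumes working storage is the only consumer of ECDSA, which understates real demand, and the function names are illustrative.

```python
ECDSA_MB = 256   # per-AOR ECDSA allocation from the case study
WS_MB = 2.4      # PNCLAJ25 working storage per task

def ws_demand_mb(tasks, ws_mb_per_task):
    """Working storage needed when `tasks` run concurrently."""
    return tasks * ws_mb_per_task

def max_tasks_that_fit(ecdsa_mb, ws_mb_per_task):
    """Upper bound on concurrent tasks, ignoring all other ECDSA users."""
    return int(ecdsa_mb // ws_mb_per_task)

for tasks in (1, 60, 95, 112):  # QA, morning, West Coast ramp-up, incident
    demand = ws_demand_mb(tasks, WS_MB)
    verdict = "fits" if demand <= ECDSA_MB else "exceeds ECDSA"
    print(f"{tasks:>3} tasks: {demand:6.1f} MB ({verdict})")

print("Ceiling:", max_tasks_that_fit(ECDSA_MB, WS_MB), "tasks")
```

Even under this optimistic assumption the ceiling is 106 concurrent tasks, below the 112 that were active during the incident.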

Finding 3: Batch-to-CICS Knowledge Gap

The developer understood batch storage (where per-job working storage up to 1+ GB is routine) but not CICS storage (where per-task working storage must be multiplied by concurrent tasks). This is a training gap, not a competence gap. Nobody explained the multiplication effect.

Finding 4: No CICS Storage Monitoring Threshold

Pinnacle monitored CICS for transaction response time, CPU utilization, and DB2 thread usage. There was no alert for ECDSA utilization trending upward. The SOS condition was the first notification.

Remediation

Short-Term: Redesign PNCLAJ25

Diane worked with the developer to reduce per-task working storage. The redesign:

Component	Original	Redesigned	Approach
Provider rate table	1.2 MB	0 KB	Moved to CICS shared data table (DFHSDT) — single copy shared by all tasks
Adjudication rules table	800 KB	0 KB	Moved to VSAM file read via EXEC CICS READ — rules looked up on demand, not preloaded
Member eligibility cache	200 KB	200 KB	Kept in working storage (small enough, frequently modified per-task)
Communication areas	200 KB	200 KB	Kept (per-task data, can't be shared)
Total	2.4 MB	400 KB	6x reduction

The redesigned PNCLAJ25 was deployed the following Sunday. At 112 concurrent tasks, working storage demand is 112 × 400 KB ≈ 44 MB of ECDSA, about 17% of the 256 MB allocation.

The shared data table approach was key: instead of each task loading its own copy of the 1.2 MB provider rate table, CICS loaded one copy into a shared data table accessible to all tasks. The storage cost went from 1.2 MB × N tasks to 1.2 MB × 1 = 1.2 MB total. Diane called this "the CICS architect's first lesson: if the data doesn't change per task, don't put it in working storage."
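The difference between per-task and shared storage can be expressed in one function. This is a sketch of the arithmetic only: it says nothing about where CICS physically holds a shared data table, and the function name is hypothetical.

```python
def table_storage_mb(table_mb, tasks, shared):
    """Storage cost of a reference table.

    Per-task copies scale with concurrency; a shared copy is counted once.
    """
    return table_mb if shared else table_mb * tasks

RATE_TABLE_MB = 1.2
TASKS = 112  # concurrency observed during the incident

per_task = table_storage_mb(RATE_TABLE_MB, TASKS, shared=False)
shared = table_storage_mb(RATE_TABLE_MB, TASKS, shared=True)
print(f"Per-task copies: {per_task:.1f} MB")
print(f"Shared copy:     {shared:.1f} MB")
```

At the incident's concurrency, the provider rate table alone accounted for roughly half of the 256 MB ECDSA when copied per task.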

Medium-Term: Process and Monitoring Improvements

1. Working Storage Size Gate:

Diane added a mandatory step to the CICS deployment checklist:

CICS DEPLOYMENT CHECKLIST — STORAGE REVIEW (MANDATORY)

Program name: _______________
Per-task working storage (from compile listing Data Division Map): _____ KB
Maximum concurrent tasks (from CICSPlex SM statistics, peak): _____
Total ECDSA requirement (WS × tasks): _____ MB
Current ECDSA allocation: _____ MB
Current ECDSA in-use at peak: _____ MB
Available ECDSA headroom: _____ MB
Headroom after deployment: _____ MB  (must be > 20% of ECDSA allocation)

Reviewer: _______________  Date: _______________

Any program with per-task working storage exceeding 500 KB requires architect review.
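The checklist arithmetic is simple enough to automate. The sketch below is one possible reading of the form, not Pinnacle's actual tooling: all names are hypothetical, the 100 MB peak in-use figure is an illustrative value that does not appear in the case study, and 1 MB is taken as 1,024 KB.

```python
from dataclasses import dataclass

ARCHITECT_REVIEW_KB = 500      # per-task threshold from the checklist
MIN_HEADROOM_FRACTION = 0.20   # headroom after deployment must exceed 20%

@dataclass
class StorageReview:
    program: str
    ws_kb_per_task: float
    peak_tasks: int
    ecdsa_mb: float
    peak_in_use_mb: float      # current peak usage, before this deployment

def evaluate(review: StorageReview) -> dict:
    """Fill in the calculated lines of the checklist."""
    requirement_mb = review.ws_kb_per_task * review.peak_tasks / 1024
    headroom_mb = review.ecdsa_mb - review.peak_in_use_mb
    headroom_after_mb = headroom_mb - requirement_mb
    return {
        "requirement_mb": requirement_mb,
        "headroom_after_mb": headroom_after_mb,
        "passes": headroom_after_mb > MIN_HEADROOM_FRACTION * review.ecdsa_mb,
        "needs_architect_review": review.ws_kb_per_task > ARCHITECT_REVIEW_KB,
    }

# Illustrative inputs: 100 MB peak in-use is assumed, not from the source.
original = evaluate(StorageReview("PNCLAJ25", 2400, 112, 256, 100))
redesigned = evaluate(StorageReview("PNCLAJ25", 400, 112, 256, 100))
print("Original:  ", original)
print("Redesigned:", redesigned)
```

Under these assumed inputs the original program fails the gate and triggers architect review, while the redesigned version passes both checks.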

2. ECDSA Monitoring:

Diane implemented CICS statistics monitoring with three alert thresholds:

Threshold	ECDSA Utilization	Action
Warning	> 60%	Email to Diane and on-call CICS admin
Critical	> 75%	Page to on-call CICS admin; investigate immediately
Emergency	> 85%	Auto-alert to operations center; prepare for SOS

The monitoring uses CICS statistics collected every 60 seconds via the CICS statistics API and fed to their monitoring tool.
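The threshold table maps directly onto a small classification function. This sketch assumes the monitoring tool already supplies in-use and total ECDSA figures; the function name and the "OK" level for utilization at or below 60% are additions for illustration.

```python
def alert_level(in_use_mb, ecdsa_mb):
    """Classify ECDSA utilization against the three alert thresholds."""
    utilization = in_use_mb / ecdsa_mb
    if utilization > 0.85:
        return "Emergency"
    if utilization > 0.75:
        return "Critical"
    if utilization > 0.60:
        return "Warning"
    return "OK"  # assumed label; the table defines no level below Warning

# Samples: 144 MB (morning of the incident), 160 and 200 MB (illustrative),
# 248 MB (Diane's reading at 10:05).
for in_use in (144, 160, 200, 248):
    print(f"{in_use} MB of 256 MB: {alert_level(in_use, 256)}")
```

With these thresholds in place, the ramp from 144 MB toward the SOS point would have raised a Warning and then a Critical alert before the cushion was reached.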

3. Developer Training:

Pinnacle implemented mandatory CICS storage training for all developers who deploy to CICS. The training, developed by Diane, covers:

- Working storage per-task multiplication
- ECDSA sizing and SOS conditions
- Alternatives to large per-task working storage (shared data tables, TSQs, file reads)
- CICS storage monitoring and how to read CICS statistics

The batch-background developer who wrote PNCLAJ25 became one of the strongest advocates for the training. "I'd been writing COBOL for twelve years and I didn't know about the multiplication effect," he told the training class. "If you come from batch, CICS storage is a completely different world."

Long-Term: Architecture Standards

Diane established CICS storage architecture standards for Pinnacle:

Standard	Value	Rationale
Maximum per-task working storage	500 KB	At 200 concurrent tasks and 256 MB ECDSA, 500 KB × 200 = 100 MB = 39% of ECDSA
ECDSA allocation per AOR	384 MB (increased from 256 MB)	Provides headroom for growth while staying well below the 2 GB ceiling of below-the-bar storage
Shared data tables for reference data > 100 KB	Mandatory	Eliminates per-task multiplication for read-only reference data
RPTSTG equivalent for CICS	CICS statistics collection every 60 seconds	Continuous storage monitoring
Quarterly ECDSA capacity review	Mandatory	Compare peak usage trend to ECDSA allocation

Connections to Other Anchor Examples

After the incident, Diane contacted Kwame Mensah at Continental National Bank through the SHARE user group. Kwame had dealt with the below-bar crisis in batch; Diane had dealt with the per-task multiplication crisis in CICS. They realized the root cause was the same: architects who don't understand virtual storage constraints build systems that fail at scale.

"Batch architects hit the bar," Kwame said. "CICS architects hit the DSA. The numbers are different but the lesson is identical: know your container's capacity, calculate your program's requirements, and do the arithmetic before you deploy."

Sandra Chen at Federal Benefits Administration attended the SHARE session where Kwame and Diane presented their joint case study. She left with a list of 15 CICS programs at FBA that had never had their per-task working storage calculated against DSA capacity. Four of them were over 1 MB per task.

Discussion Questions

1. The developer who wrote PNCLAJ25 was experienced in batch COBOL but new to CICS. Should the code review process have caught the 2.4 MB working storage? What code review checklist item would have flagged it?
2. Diane's emergency fix was to disable PNCLAJ25 and re-enable the old program. This caused a 55-minute outage. Could she have taken a different emergency action that maintained partial service? What are the trade-offs?
3. The shared data table solution eliminated the per-task multiplication for the provider rate table. What are the limitations of shared data tables? When would this approach NOT work?
4. Diane increased ECDSA from 256 MB to 384 MB. Why not 512 MB or 1 GB? What constrains the maximum ECDSA allocation? (Hint: think about what else lives below the bar in a CICS region.)
5. Ahmad Rashidi's compliance team had to investigate whether the SOS condition caused any HIPAA violations. Why would a storage exhaustion condition trigger a HIPAA review? What could have happened to PHI if the SOS had been more severe?
6. Compare the batch storage problem (CNB's S80A abend in Case Study 1) and the CICS storage problem (Pinnacle's SOS in this case study). Both were caused by growing data volumes. Why did the batch problem manifest as an S80A abend while the CICS problem manifested as an SOS condition? What does this tell you about the different storage management models?