Case Study 1: Continental National Bank's 80A Crisis and Above-the-Bar Migration
Background
Continental National Bank processes 500 million transactions per day across four LPARs. Their batch processing window — 11:00 PM to 5:00 AM Eastern — runs approximately 340 batch jobs in dependency chains managed by CA-7 scheduling. The most critical chain is the End-of-Day (EOD) sequence: 47 jobs that must complete in order, processing the day's transactions, posting to master files, producing regulatory reports, and preparing the next business day's opening balances.
In September 2021, CNB completed the acquisition of Southeastern Community Bancorp (SCB), a regional bank with 2.3 million accounts. The integration project — led by Kwame Mensah with Lisa Tran managing the DB2 migration — merged SCB's chart of accounts into CNB's master reference tables. The chart of accounts table grew from 1.6 million entries to 2.8 million entries.
Nobody thought about virtual storage.
The Incident
Timeline
September 6, 2021 — Labor Day (observed). No batch processing.
September 7, 2021 (Tuesday) — First business day after acquisition merge.
- 23:30 — CA-7 triggers the EOD batch chain. Jobs 1-31 complete normally.
- 01:32 — Job 32: CNBGL300 (General Ledger Reconciliation) starts.
- 01:47 — CNBGL300 abends S80A RC=04 in step STEP010.
Rob Calloway, the batch operations lead, was monitoring the first post-acquisition batch run. He had been watching for DB2 issues (new tables, new indexes, unfamiliar access paths). Storage was not on his radar.
"I looked at the 80A and my first instinct was REGION," Rob recalls. "I checked the JCL — REGION=0M. Already at maximum. So I tried REGION=2000M, thinking maybe 0M wasn't working right. Same result."
- 01:55 — Rob restarts CNBGL300 with REGION=2000M. Abends S80A RC=04 at 01:58.
- 02:05 — Rob escalates to Kwame Mensah.
Kwame's Diagnosis
Kwame's first question: "What changed in the program or its data since the last successful run?"
Rob: "The SCB integration went live this weekend. The chart of accounts grew."
Kwame's second question: "How big is the working storage in CNBGL300?"
Nobody on the call knew. Kwame pulled the compile listing from the source management system. The Data Division Map showed:
DATA DIVISION MAP
01 WS-CHART-OF-ACCOUNTS-TABLE
OCCURS 3000000 TIMES
(entry size: 472 bytes)
Total size: 1,416,000,000 bytes (1,351 MB)
01 WS-TRANSACTION-ACCUM
(entry size: 128 bytes, 50,000 entries)
Total size: 6,400,000 bytes (6.1 MB)
01 WS-ERROR-TABLE
OCCURS DEPENDING ON WS-ERROR-COUNT
Maximum: 100,000 entries × 256 bytes
Maximum size: 25,600,000 bytes (24.4 MB)
Total Working Storage: 1,512,847,232 bytes (1,442 MB)
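The entry sizes in the map multiply out directly; a quick check in Python (binary megabytes, as the listing uses). Note that the three items shown sum to 1,448,000,000 bytes, so the map's grand total of 1,512,847,232 bytes presumably includes smaller working-storage items omitted from this excerpt.

```python
MB = 1024 * 1024  # the listing reports binary megabytes

# (entry size in bytes, maximum entry count) for the three items shown
items = {
    "WS-CHART-OF-ACCOUNTS-TABLE": (472, 3_000_000),
    "WS-TRANSACTION-ACCUM":       (128, 50_000),
    "WS-ERROR-TABLE":             (256, 100_000),  # ODO maximum
}

sizes = {name: entry * count for name, (entry, count) in items.items()}
for name, size in sizes.items():
    print(f"{name}: {size:,} bytes ({size / MB:,.1f} MB)")
```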
Kwame did the arithmetic in his head:
Available user region (approx): 1,640 MB
Working storage required: -1,442 MB
Remaining for everything else: 198 MB
LE runtime + load module: -33 MB
VSAM buffers (5 files): -95 MB
DB2 thread storage: -30 MB
LSQA + SWA overhead: -20 MB
LE heap + stack: -25 MB
────────────────────────────────────────────
Net available: -5 MB ← NEGATIVE
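Kwame's ledger is the "ten-second arithmetic" the post-incident policy later mandates. A minimal sketch of the same calculation (Python, values in MB as above):

```python
region_mb = 1640            # approximate below-the-bar user region
working_storage_mb = 1442   # total from the Data Division Map

# Everything else that must also fit below the bar (from Kwame's ledger)
overheads_mb = {
    "LE runtime + load module": 33,
    "VSAM buffers (5 files)":   95,
    "DB2 thread storage":       30,
    "LSQA + SWA overhead":      20,
    "LE heap + stack":          25,
}

net_mb = region_mb - working_storage_mb - sum(overheads_mb.values())
print(f"Net available: {net_mb} MB")   # Net available: -5 MB
```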
"It's five megabytes short on a good day," Kwame told the call. "The table was dimensioned for 3 million entries — enough for the merged data — but the total working storage exceeds what's available below the bar. REGION can't fix this. The 2 GB bar is an architectural limit of 31-bit addressing; no REGION value can move it."
The Fix
Kwame outlined three approaches, ranked by implementation speed:
Option A (fastest — 2-4 hours): Reduce the table dimension. The merged chart of accounts has 2.8 million entries; the table is dimensioned for 3 million. Reduce to 2,800,000 and the working storage drops by 200,000 × 472 = 94,400,000 bytes (about 90 MB), putting the total required at roughly 1,352 MB. This fits — barely.
Option B (4-6 hours): Recompile with LP(64) and add MEMLIMIT. The large table moves above the bar. This is the architecturally correct fix.
Option C (quick hack — 30 minutes): Split the program into two steps. Step 1 loads the first half of the chart of accounts and processes transactions A-M. Step 2 loads the second half and processes N-Z. Each step uses ~750 MB of working storage.
The team implemented Option A at 02:30 as an emergency fix. The dimension was reduced from 3,000,000 to 2,850,000 (leaving a 50,000-entry buffer), the program was recompiled, and the new load module was promoted to production. CNBGL300 restarted at 03:15 and completed at 04:47.
Option B was implemented the following weekend as the permanent fix.
The LP(64) Migration
Technical Implementation
Lisa Tran's team recompiled CNBGL300 with the following changes:
Compiler options (before):
RENT,AMODE(31),RMODE(ANY),OPT(2),NOSSRANGE,
LIST,MAP,OFFSET,XREF
Compiler options (after):
RENT,AMODE(64),RMODE(ANY),OPT(2),NOSSRANGE,
LIST,MAP,OFFSET,XREF,LP(64)
The substantive change: adding LP(64), which generates AMODE 64 code (the old AMODE(31) marking no longer applies). The COBOL source required no modification.
JCL changes:
//* BEFORE:
//CNBGL300 JOB (ACCT001),'GL RECONCILIATION',
// REGION=0M
//STEP010 EXEC PGM=CNBGL300
//* AFTER:
//CNBGL300 JOB (ACCT001),'GL RECONCILIATION',
// REGION=0M,MEMLIMIT=6G
//STEP010 EXEC PGM=CNBGL300
(No PARM runtime options: an AMODE 64 program does not pick up LE options passed via PARM='/...', so the heap configuration was supplied through a linked runtime-options module instead.)
LE runtime option changes:
Because an AMODE 64 program uses the 64-bit LE initialization path, the team created a CELQUOPT module (the AMODE 64 counterpart of CEEUOPT) linked with CNBGL300 to standardize the LE options:
HEAP64(67108864,33554432,KEEP,1048576,1048576,KEEP,32768,32768,FREE)
STACK64(1048576,1048576,134217728)
STORAGE(00,FE,00)
RPTSTG(ON)
HEAP64 replaces HEAP: the first triplet (64 MB initial, 32 MB increment, KEEP) governs the above-the-bar heap, the second the 31-bit heap, the third the 24-bit heap. ALL31(ON) was dropped from the shop default; it is meaningful only to 31-bit applications.
Testing
The testing revealed a surprise: CNBGL300's elapsed time improved by 12% after the LP(64) recompile.
| Metric | Before LP(64) | After LP(64) | Change |
|---|---|---|---|
| Elapsed time | 3 hr 12 min | 2 hr 49 min | -12% |
| CPU time | 48 min | 45 min | -6% |
| Page fault rate (avg) | 340/sec | 85/sec | -75% |
| Below-bar storage used | 1,442 MB | 210 MB | -85% |
| Above-bar storage used | 0 | 1,310 MB | (new) |
| Total storage used | 1,442 MB | 1,520 MB | +5% |
The performance improvement came from reduced page fault pressure. When 1.4 GB of working storage competed for real storage frames with file buffers, DB2 thread storage, and LE overhead, RSM was constantly stealing pages and forcing page-in I/O. With only 210 MB below the bar, the program's working set fit comfortably in real storage, page faults dropped by 75%, and the program spent less time waiting for paging I/O.
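The percentage changes in the table are easy to verify from the raw before/after values:

```python
# (before, after) pairs taken from the testing table
before_after = {
    "elapsed (min)":      (192, 169),   # 3 hr 12 min vs 2 hr 49 min
    "CPU (min)":          (48, 45),
    "page faults (/sec)": (340, 85),
    "below-bar (MB)":     (1442, 210),
    "total storage (MB)": (1442, 1520),
}

for metric, (before, after) in before_after.items():
    pct = 100 * (after - before) / before
    print(f"{metric:20s} {pct:+.0f}%")
```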
Kwame's reaction: "The LP(64) migration didn't just prevent the 80A — it made the program faster. When you reduce below-bar contention, you reduce paging, and reduced paging is free performance. We should have done this two years ago."
RPTSTG Report (Post-Migration)
Language Environment Storage Report for CNBGL300
Heap statistics (below bar):
Initial size: 1,048,576 bytes
Increment size: 1,048,576 bytes
Total heap storage used: 18,874,368 bytes (18 MB)
Number of segments: 18
Largest segment in use: 1,048,576 bytes
Heap statistics (above bar):
Initial size: 67,108,864 bytes
Increment size: 33,554,432 bytes
Total heap storage used: 1,372,585,984 bytes (1,309 MB)
Number of segments: 22
Largest segment in use: 67,108,864 bytes
Stack statistics:
Initial size: 524,288 bytes
Increment size: 524,288 bytes
Total stack storage used: 1,572,864 bytes
Number of segments: 3
User Region Summary:
Region size: 1,702,887,424 bytes (1,624 MB)
Below-bar maximum used: 220,200,960 bytes (210 MB)
Above-bar maximum used: 1,372,585,984 bytes (1,309 MB)
Below-bar headroom: 1,482,686,464 bytes (1,414 MB) = 87%
87% below-bar headroom. The program went from running on fumes (-5 MB deficit) to having over 1.4 GB of free space below the bar.
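The headroom figure follows directly from the report's region summary:

```python
region = 1_702_887_424   # "Region size" from the report
below  = 220_200_960     # "Below-bar maximum used"

headroom = region - below
print(f"{headroom:,} bytes free below the bar "
      f"({100 * headroom / region:.0f}% of the region)")
```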
Aftermath and Policy Changes
New Storage Standards at CNB
Kwame established the following standards after the incident:
1. Mandatory LP(64) for programs with working storage > 500 MB: Any COBOL program with total working storage exceeding 500 MB must be compiled with LP(64) and must specify MEMLIMIT in JCL. No exceptions.
2. Quarterly RPTSTG reviews: All critical batch programs (the 47 EOD chain jobs plus 30 additional high-priority jobs) run with RPTSTG(ON) on the first business day of each quarter. Rob's operations team extracts the "Maximum storage used" and "Region size" values and feeds them into a storage growth tracking spreadsheet.
3. Working storage growth alerts: If any program's below-bar storage usage grows more than 10% quarter-over-quarter, Kwame's architecture team investigates. This early warning system catches the next CNBGL300 before it becomes a 2 AM crisis.
4. IEFUSI exit standardization: CNB's IEFUSI exit was updated to report (via WTO message) any batch job that uses more than 80% of its available below-bar user region. This produces an alert message in the system log that operations monitors automatically.
5. Pre-deployment storage review: The deployment checklist now includes a storage capacity check: calculate required below-bar storage from the compile listing, compare to available user region, and verify MEMLIMIT if LP(64) is in use. The check must be signed off by a level-3 developer or architect.
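Standards 2 through 4 lend themselves to light automation. A minimal sketch of the quarterly check (Python; the report line formats are assumed to match the RPTSTG excerpt shown earlier, and the thresholds come from the standards above — this is an illustration, not CNB's actual tooling):

```python
import re

def extract_bytes(report: str, label: str) -> int:
    """Pull 'label: 1,234,567 bytes' out of an RPTSTG-style report."""
    m = re.search(rf"{re.escape(label)}:\s+([\d,]+) bytes", report)
    if not m:
        raise ValueError(f"label not found: {label}")
    return int(m.group(1).replace(",", ""))

def storage_alerts(report: str, prev_below_bar: int) -> list[str]:
    """Flag the conditions from standards 3 and 4."""
    region = extract_bytes(report, "Region size")
    below  = extract_bytes(report, "Below-bar maximum used")
    alerts = []
    if below > 0.80 * region:          # standard 4: >80% of user region
        alerts.append("over 80% of below-bar user region")
    if below > 1.10 * prev_below_bar:  # standard 3: >10% QoQ growth
        alerts.append("below-bar usage grew >10% quarter-over-quarter")
    return alerts
```

Run against CNBGL300's post-migration report with last quarter's figure, this returns no alerts; feed it a prior-quarter value 10% lower and the growth alert fires.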
Cost of the Incident
| Impact | Details |
|---|---|
| Batch window overrun | 1 hour 47 minutes past 5:00 AM deadline |
| Delayed market opening | Wire transfer processing delayed; Asian markets impacted |
| Regulatory notifications | 3 notifications to OCC (batch window breach, delayed settlement, data integrity concern) |
| Staff hours | 14 person-hours (Kwame, Rob, Lisa, 2 developers, 1 systems programmer) |
| Revenue impact | Estimated $180,000 in delayed wire transfer fees and customer compensations |
| Reputation | CTO briefed the board of directors on the incident |
"One hundred and eighty thousand dollars and a board-level briefing," Kwame says, "because nobody checked whether 1.4 GB fits in 1.6 GB. The arithmetic takes ten seconds. Do the arithmetic."
Discussion Questions
- Could the 80A abend have been predicted and prevented before the SCB integration went live? What process should have caught it?
- Kwame's emergency fix (Option A — reduce table dimension) introduced a risk: if the chart of accounts grows beyond 2.85 million entries, the job will abend with a different error. Was this the right emergency decision? What would you have done differently?
- The LP(64) migration improved performance by 12%. This was unexpected. Why didn't the team migrate to LP(64) proactively? What organizational factors cause architects to leave performance improvements undiscovered?
- Sandra Chen at Federal Benefits Administration has 3 programs exceeding 1.5 GB of working storage. Unlike CNB, her programs are 35 years old and may contain below-the-line dependencies. What additional risks does she face that Kwame did not? How should her migration approach differ?
- The storage growth tracking spreadsheet is a manual process — Rob's team extracts RPTSTG data quarterly. Design an automated alternative using SMF records and z/OS automation. What would it measure, how often, and who would receive the alerts?