Case Study: GlobalBank — Optimizing the Nightly Batch from 4 Hours to 90 Minutes
The Crisis
On a Friday night in October 2025, GlobalBank's nightly batch cycle overran its window for the first time. The cycle — scheduled to complete by 5:00 AM — didn't finish until 5:47 AM. Online banking was unavailable for 47 minutes during a period when early-rising customers on the East Coast were checking balances and initiating transfers.
The incident made the local news. The CIO called an emergency meeting on Monday.
The Data
Maria Chen pulled historical batch completion data:
| Month | Batch Duration | Trend |
|---|---|---|
| Jan 2024 | 2h 50m | |
| Apr 2024 | 3h 05m | +15m |
| Jul 2024 | 3h 22m | +17m |
| Oct 2024 | 3h 35m | +13m |
| Jan 2025 | 3h 48m | +13m |
| Apr 2025 | 3h 55m | +7m |
| Jul 2025 | 4h 02m | +7m |
| Oct 2025 | 4h 47m | +45m (overrun!) |
The trend was clear: batch duration was growing at approximately 4-5 minutes per month, driven by account growth (12% year-over-year) and new regulatory feeds. The October spike was caused by a quarter-end regulatory report that processed additional data.
At the current growth rate, even after optimization, the batch would need to accommodate 15-20% annual data growth for the foreseeable future.
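The trend Maria cited can be checked with a simple least-squares fit over the table's data points, excluding the anomalous quarter-end spike (a sketch; durations transcribed from the table into minutes):

```python
# Batch duration (minutes) keyed by months since Jan 2024, excluding the
# Oct 2025 quarter-end spike, fitted with ordinary least squares.
durations = {0: 170, 3: 185, 6: 202, 9: 215, 12: 228, 15: 235, 18: 242}

n = len(durations)
xs, ys = list(durations), list(durations.values())
x_bar, y_bar = sum(xs) / n, sum(ys) / n
slope = sum((x - x_bar) * (y - y_bar) for x, y in durations.items()) / \
        sum((x - x_bar) ** 2 for x in xs)

print(f"growth: {slope:.1f} min/month")  # roughly 4 min/month, as stated
```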
The Profiling Phase
Maria ran Strobe profiling on each job in the batch cycle over five consecutive nights (Monday-Friday) to get a representative sample:
BAL-CALC (72 minutes average)
The interest calculation program was the single longest job. Strobe revealed:
- 36.9% of CPU in 3110-COMPOUND-DAILY — the hot paragraph, called 2.3 million times
- All arithmetic fields were DISPLAY format — 9 bytes per field instead of 5 (COMP-3)
- Daily rate was recomputed inside the main loop for every record, despite being constant for each account type
- VSAM control interval size was 4,096 bytes — adequate but not optimal
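The third finding — the loop-invariant rate computation — can be sketched in Python (hypothetical field and function names; the COBOL recomputed the daily rate from the annual rate on every one of the 2.3 million loop iterations):

```python
# Sketch of the BAL-CALC hot loop (hypothetical names). The daily rate
# depends only on the account type's annual rate, yet the original code
# recomputed it for every record.

ANNUAL_RATES = {"SAV": 0.035, "CHK": 0.001, "MMA": 0.042}

def post_interest_slow(accounts):
    for acct in accounts:
        # BEFORE: loop-invariant division performed per record
        daily_rate = ANNUAL_RATES[acct["type"]] / 365
        acct["balance"] += acct["balance"] * daily_rate

def post_interest_fast(accounts):
    # AFTER: hoist the division out of the loop, one entry per account type
    daily = {t: r / 365 for t, r in ANNUAL_RATES.items()}
    for acct in accounts:
        acct["balance"] += acct["balance"] * daily[acct["type"]]
```

Both versions produce identical results; only the redundant work disappears.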
TXN-POST (55 minutes average)
The transaction posting program read a sequential transaction file and updated the VSAM account master. Strobe showed almost no CPU consumption — the program was I/O bound, spending 95% of its elapsed time waiting on random VSAM reads.
Investigation revealed that the transaction file was not sorted by account key. Each transaction required a random VSAM read. With 800,000 transactions against 2.3 million accounts, most reads were cache misses.
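A back-of-envelope model shows why 800,000 random reads dominate elapsed time. The latencies below are illustrative assumptions, not measurements from GlobalBank's hardware:

```python
# Rough elapsed-time model for TXN-POST's 800,000 account lookups.
# Per-read latencies are assumptions for illustration only.
TXNS = 800_000
RANDOM_READ_MS = 4.0      # assumed: cache miss resolved by a disk read
BUFFERED_READ_MS = 0.05   # assumed: sequential access with read-ahead

random_min = TXNS * RANDOM_READ_MS / 1000 / 60
sequential_min = TXNS * BUFFERED_READ_MS / 1000 / 60
print(f"random: {random_min:.0f} min, sequential: {sequential_min:.1f} min")
# → random: 53 min, sequential: 0.7 min
```

The model is crude, but the order-of-magnitude gap matches the Strobe finding: the job's runtime was almost entirely read latency.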
RPT-DAILY (45 minutes average)
The daily report job spent 70% of its time in a DFSORT step. The sort processed 5 million records. Only one SORTWK DD was defined (the JCL had been copied from a template that predated sort optimization guidelines).
REG-FEED (38 minutes average)
The regulatory feed job wrote a 50-million-record file to sequential output. Maria noticed the block size was 800 bytes — the same as the logical record length. Records were completely unblocked. Every WRITE was a separate I/O operation.
"Someone set BLKSIZE equal to LRECL," Maria said. "That means every single record is its own block. Fifty million I/Os instead of 1.25 million."
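The arithmetic behind Maria's remark is straightforward: with the 32,000-byte block size used in the fix, 40 records (32,000 / 800) fit per block:

```python
# I/O count for the 50-million-record REG-FEED file at two block sizes.
LRECL = 800
RECORDS = 50_000_000

def block_writes(blksize):
    per_block = blksize // LRECL      # records per block (RECFM=FB)
    return -(-RECORDS // per_block)   # ceiling division

print(block_writes(800))     # unblocked: 50,000,000 writes
print(block_writes(32000))   # 40 records/block: 1,250,000 writes
```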
The Optimization Plan
Maria prioritized optimizations by expected impact:
| Optimization | Target Job | Expected Savings | Risk | Effort |
|---|---|---|---|---|
| Block size fix | REG-FEED | 30 min | Very low | 1 hour |
| Sort optimization | RPT-DAILY | 25 min | Low | 2 hours |
| Sort + sequential read | TXN-POST | 35 min | Low | 4 hours |
| Data type conversion | BAL-CALC | 8 min | Medium | 8 hours |
| Loop optimization | BAL-CALC | 6 min | Low | 2 hours |
| Compiler options | BAL-CALC | 4 min | Low | 1 hour |
| VSAM buffer tuning | BAL-CALC | 17 min | Low | 1 hour |
| DB2 query rewrite | ACCT-MAINT | 14 min | Medium | 6 hours |
Total expected savings: 139 minutes. Target: 150 minutes.
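The prioritization can be reproduced mechanically from the table, and a savings-per-effort ranking (one plausible way to weigh risk-free quick wins) reaches the same first pick:

```python
# Optimization plan transcribed from the table:
# (name, expected savings in minutes, effort in hours)
plan = [
    ("Block size fix", 30, 1), ("Sort optimization", 25, 2),
    ("Sort + sequential read", 35, 4), ("Data type conversion", 8, 8),
    ("Loop optimization", 6, 2), ("Compiler options", 4, 1),
    ("VSAM buffer tuning", 17, 1), ("DB2 query rewrite", 14, 6),
]

total = sum(saving for _, saving, _ in plan)
print(total)  # 139 minutes, as stated

# Rank by savings per hour of effort: cheap, high-impact fixes float up
by_roi = sorted(plan, key=lambda p: p[1] / p[2], reverse=True)
print(by_roi[0][0])  # Block size fix (30 min for 1 hour)
```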
Implementation
The One-Line Fix (REG-FEED)
The single most impactful change was one line of JCL:
```jcl
//* BEFORE:
//OUTPUT DD DSN=PROD.REG.FEED,DISP=(NEW,CATLG),
// DCB=(RECFM=FB,LRECL=800,BLKSIZE=800)
//* AFTER:
//OUTPUT DD DSN=PROD.REG.FEED,DISP=(NEW,CATLG),
// DCB=(RECFM=FB,LRECL=800,BLKSIZE=32000)
```
Result: 38 minutes down to 8 minutes. No code changes. No testing required — the output file's content was identical, just blocked differently. The downstream consumer read the file with unchanged JCL, since the access method deblocks records transparently.
The Sort Pre-Step (TXN-POST)
Maria added a sort step before TXN-POST:
```jcl
//SORT EXEC PGM=SORT
//SORTIN DD DSN=PROD.TXN.DAILY,DISP=SHR
//SORTOUT DD DSN=&&SORTED,DISP=(NEW,PASS),
// SPACE=(CYL,(50,10)),
// DCB=(RECFM=FB,LRECL=100,BLKSIZE=32000)
//SORTWK01 DD UNIT=SYSDA,SPACE=(CYL,(30))
//SORTWK02 DD UNIT=SYSDA,SPACE=(CYL,(30))
//SORTWK03 DD UNIT=SYSDA,SPACE=(CYL,(30))
//SYSIN DD *
  SORT FIELDS=(1,10,CH,A)
  OPTION MAINSIZE=MAX
/*
```
The sort took 3 minutes. But because TXN-POST could now read the VSAM master sequentially (skipping non-matching records) instead of randomly, processing dropped from 55 minutes to 15 minutes. Net savings: 37 minutes.
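The reason the sorted pass is so much cheaper can be sketched as a two-pointer merge (hypothetical record shapes): with both files in account-key order, every transaction is applied in a single forward sweep, skipping non-matching master records, instead of issuing one random read per transaction.

```python
# Sketch of skip-sequential matching after the pre-sort: both inputs are
# sorted ascending by account key, so one forward pass suffices.

def post_sorted(transactions, master):
    """Apply sorted transactions to a sorted master, updating in place."""
    m = iter(master)
    acct = next(m, None)
    for txn in transactions:
        while acct is not None and acct["key"] < txn["key"]:
            acct = next(m, None)       # skip non-matching master records
        if acct is not None and acct["key"] == txn["key"]:
            acct["balance"] += txn["amount"]
```

Each file is read at most once, so total I/O is bounded by the two file sizes rather than by the transaction count times random-read latency.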
The BAL-CALC Overhaul
This was the most complex change. Maria converted 23 DISPLAY fields to COMP-3, moved the daily rate computation outside the main loop, changed compiler options to OPTIMIZE(FULL) and NUMPROC(PFD), and increased VSAM buffers.
The unit test suite from Chapter 34 was essential — every change was verified against the 64-test regression suite before promotion.
Post-Optimization Results
After all optimizations were implemented and tested over a one-week parallel run period:
| Job | Before | After | Savings |
|---|---|---|---|
| BAL-CALC | 72 min | 25 min | 65% |
| TXN-POST | 55 min | 15 min | 73% |
| RPT-DAILY | 45 min | 12 min | 73% |
| REG-FEED | 38 min | 8 min | 79% |
| ACCT-MAINT | 22 min | 8 min | 64% |
| Other | 8 min | 7 min | 13% |
| Total | 240 min | 75 min | 69% |
The batch cycle now completed by 12:15 AM — nearly five hours before the deadline.
Six-Month Follow-Up
Six months later, with continued 12% annual account growth, the batch cycle had grown to 82 minutes — still well within the 6-hour window. The optimization had bought approximately 8 years of growth headroom at current rates.
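The headroom figure can be sanity-checked with a compound-growth model (assuming runtime scales with data volume at roughly 20% per year against the 360-minute window; slower growth stretches the horizon further):

```python
# Years until a 75-minute batch outgrows a 6-hour (360-minute) window,
# assuming runtime compounds at ~20%/year with data volume.
import math

minutes, window, growth = 75, 360, 0.20
years = math.log(window / minutes) / math.log(1 + growth)
print(f"{years:.1f} years")  # about 8.6 years at 20% annual growth
```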
Discussion Questions
1. The REG-FEED block size fix saved 30 minutes with a one-line JCL change. How could this misconfiguration have been prevented? What process would catch similar issues in the future?
2. Maria prioritized optimizations by expected savings. If she had instead prioritized by effort (easiest first), would the order change? Would the outcome be different?
3. The sort pre-step for TXN-POST added 3 minutes but saved 40. Under what circumstances would adding a sort step NOT be beneficial?
4. Maria used NUMPROC(PFD) for BAL-CALC. What testing would you require before allowing this change in production? What would happen if the data contained non-preferred signs?