Case Study: GlobalBank — Optimizing the Nightly Batch from 4 Hours to 90 Minutes
The Crisis
On a Friday night in October 2025, GlobalBank's nightly batch cycle overran its window for the first time. The cycle — scheduled to complete by 5:00 AM — didn't finish until 5:47 AM. Online banking was unavailable for 47 minutes during a period when early-rising customers on the East Coast were checking balances and initiating transfers.
The incident made the local news. The CIO called an emergency meeting on Monday.
The Data
Maria Chen pulled historical batch completion data:
| Month | Batch Duration | Trend |
|---|---|---|
| Jan 2024 | 2h 50m | |
| Apr 2024 | 3h 05m | +15m |
| Jul 2024 | 3h 22m | +17m |
| Oct 2024 | 3h 35m | +13m |
| Jan 2025 | 3h 48m | +13m |
| Apr 2025 | 3h 55m | +7m |
| Jul 2025 | 4h 02m | +7m |
| Oct 2025 | 4h 47m | +45m (overrun!) |
The trend was clear: batch duration was growing at approximately 4-5 minutes per month, driven by account growth (12% year-over-year) and new regulatory feeds. The October spike was caused by a quarter-end regulatory report that processed additional data.
At the current growth rate, even after optimization, the batch would need to accommodate 15-20% annual data growth for the foreseeable future.
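The trend Maria cited can be checked with a simple least-squares fit over the table's data points, excluding the anomalous quarter-end spike (a sketch; durations transcribed from the table into minutes):

```python
# Batch duration (minutes) keyed by months since Jan 2024, excluding the
# Oct 2025 quarter-end spike, fitted with ordinary least squares.
durations = {0: 170, 3: 185, 6: 202, 9: 215, 12: 228, 15: 235, 18: 242}

n = len(durations)
xs, ys = list(durations), list(durations.values())
x_bar, y_bar = sum(xs) / n, sum(ys) / n
slope = sum((x - x_bar) * (y - y_bar) for x, y in durations.items()) / \
        sum((x - x_bar) ** 2 for x in xs)

print(f"growth: {slope:.1f} min/month")  # roughly 4 min/month, as stated
```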
The Profiling Phase
Maria ran Strobe profiling on each job in the batch cycle over five consecutive nights (Monday-Friday) to get a representative sample:
BAL-CALC (72 minutes average)
The interest calculation program was the single longest job. Strobe revealed:
- 36.9% of CPU in 3110-COMPOUND-DAILY — the hot paragraph, called 2.3 million times
- All arithmetic fields were DISPLAY format — 9 bytes per field instead of 5 (COMP-3)
- Daily rate was recomputed inside the main loop for every record, despite being constant for each account type
- VSAM control interval size was 4,096 bytes — adequate but not optimal
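The third finding — the loop-invariant rate computation — can be sketched in Python (hypothetical field and function names; the COBOL recomputed the daily rate from the annual rate on every one of the 2.3 million loop iterations):

```python
# Sketch of the BAL-CALC hot loop (hypothetical names). The daily rate
# depends only on the account type's annual rate, yet the original code
# recomputed it for every record.

ANNUAL_RATES = {"SAV": 0.035, "CHK": 0.001, "MMA": 0.042}

def post_interest_slow(accounts):
    for acct in accounts:
        # BEFORE: loop-invariant division performed per record
        daily_rate = ANNUAL_RATES[acct["type"]] / 365
        acct["balance"] += acct["balance"] * daily_rate

def post_interest_fast(accounts):
    # AFTER: hoist the division out of the loop, one entry per account type
    daily = {t: r / 365 for t, r in ANNUAL_RATES.items()}
    for acct in accounts:
        acct["balance"] += acct["balance"] * daily[acct["type"]]
```

Both versions produce identical results; only the redundant work disappears.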
TXN-POST (55 minutes average)
The transaction posting program read a sequential transaction file and updated the VSAM account master. Strobe showed almost no CPU consumption — the program was I/O bound, spending 95% of its elapsed time waiting on random VSAM reads.
Investigation revealed that the transaction file was not sorted by account key. Each transaction required a random VSAM read. With 800,000 transactions against 2.3 million accounts, most reads were cache misses.
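A back-of-envelope model shows why 800,000 random reads dominate elapsed time. The latencies below are illustrative assumptions, not measurements from GlobalBank's hardware:

```python
# Rough elapsed-time model for TXN-POST's 800,000 account lookups.
# Per-read latencies are assumptions for illustration only.
TXNS = 800_000
RANDOM_READ_MS = 4.0      # assumed: cache miss resolved by a disk read
BUFFERED_READ_MS = 0.05   # assumed: sequential access with read-ahead

random_min = TXNS * RANDOM_READ_MS / 1000 / 60
sequential_min = TXNS * BUFFERED_READ_MS / 1000 / 60
print(f"random: {random_min:.0f} min, sequential: {sequential_min:.1f} min")
# → random: 53 min, sequential: 0.7 min
```

The model is crude, but the order-of-magnitude gap matches the Strobe finding: the job's runtime was almost entirely read latency.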
RPT-DAILY (45 minutes average)
The daily report job spent 70% of its time in a DFSORT step. The sort processed 5 million records. Only one SORTWK DD was defined (the JCL had been copied from a template that predated sort optimization guidelines).
REG-FEED (38 minutes average)
The regulatory feed job wrote a 50-million-record file to sequential output. Maria noticed the block size was 800 bytes — the same as the logical record length. Records were completely unblocked. Every WRITE was a separate I/O operation.
"Someone set BLKSIZE equal to LRECL," Maria said. "That means every single record is its own block. Fifty million I/Os instead of 1.25 million."
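The arithmetic behind Maria's remark is straightforward: with the 32,000-byte block size used in the fix, 40 records (32,000 / 800) fit per block:

```python
# I/O count for the 50-million-record REG-FEED file at two block sizes.
LRECL = 800
RECORDS = 50_000_000

def block_writes(blksize):
    per_block = blksize // LRECL      # records per block (RECFM=FB)
    return -(-RECORDS // per_block)   # ceiling division

print(block_writes(800))     # unblocked: 50,000,000 writes
print(block_writes(32000))   # 40 records/block: 1,250,000 writes
```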
The Optimization Plan
Maria prioritized optimizations by expected impact:
| Optimization | Target Job | Expected Savings | Risk | Effort |
|---|---|---|---|---|
| Block size fix | REG-FEED | 30 min | Very low | 1 hour |
| Sort optimization | RPT-DAILY | 25 min | Low | 2 hours |
| Sort + sequential read | TXN-POST | 35 min | Low | 4 hours |
| Data type conversion | BAL-CALC | 8 min | Medium | 8 hours |
| Loop optimization | BAL-CALC | 6 min | Low | 2 hours |
| Compiler options | BAL-CALC | 4 min | Low | 1 hour |
| VSAM buffer tuning | BAL-CALC | 17 min | Low | 1 hour |
| DB2 query rewrite | ACCT-MAINT | 14 min | Medium | 6 hours |
Total expected savings: 139 minutes. Target: 150 minutes.
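The prioritization can be reproduced mechanically from the table, and a savings-per-effort ranking (one plausible way to weigh risk-free quick wins) reaches the same first pick:

```python
# Optimization plan transcribed from the table:
# (name, expected savings in minutes, effort in hours)
plan = [
    ("Block size fix", 30, 1), ("Sort optimization", 25, 2),
    ("Sort + sequential read", 35, 4), ("Data type conversion", 8, 8),
    ("Loop optimization", 6, 2), ("Compiler options", 4, 1),
    ("VSAM buffer tuning", 17, 1), ("DB2 query rewrite", 14, 6),
]

total = sum(saving for _, saving, _ in plan)
print(total)  # 139 minutes, as stated

# Rank by savings per hour of effort: cheap, high-impact fixes float up
by_roi = sorted(plan, key=lambda p: p[1] / p[2], reverse=True)
print(by_roi[0][0])  # Block size fix (30 min for 1 hour)
```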
Implementation
The One-Line Fix (REG-FEED)
The single most impactful change was one line of JCL:
```jcl
//* BEFORE:
//OUTPUT DD DSN=PROD.REG.FEED,DISP=(NEW,CATLG),
// DCB=(RECFM=FB,LRECL=800,BLKSIZE=800)
//* AFTER:
//OUTPUT DD DSN=PROD.REG.FEED,DISP=(NEW,CATLG),
// DCB=(RECFM=FB,LRECL=800,BLKSIZE=32000)
```
Result: 38 minutes down to 8 minutes. No code changes. No testing required — the output file's content was identical, just blocked differently. The downstream consumer read the file with unchanged JCL, since the access method deblocks records transparently.
The Sort Pre-Step (TXN-POST)
Maria added a sort step before TXN-POST:
```jcl
//SORT EXEC PGM=SORT
//SORTIN DD DSN=PROD.TXN.DAILY,DISP=SHR
//SORTOUT DD DSN=&&SORTED,DISP=(NEW,PASS),
// SPACE=(CYL,(50,10)),
// DCB=(RECFM=FB,LRECL=100,BLKSIZE=32000)
//SORTWK01 DD UNIT=SYSDA,SPACE=(CYL,(30))
//SORTWK02 DD UNIT=SYSDA,SPACE=(CYL,(30))
//SORTWK03 DD UNIT=SYSDA,SPACE=(CYL,(30))
//SYSIN DD *
  SORT FIELDS=(1,10,CH,A)
  OPTION MAINSIZE=MAX
/*
```
The sort took 3 minutes. But because TXN-POST could now read the VSAM master sequentially (skipping non-matching records) instead of randomly, processing dropped from 55 minutes to 15 minutes. Net savings: 37 minutes.
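The reason the sorted pass is so much cheaper can be sketched as a two-pointer merge (hypothetical record shapes): with both files in account-key order, every transaction is applied in a single forward sweep, skipping non-matching master records, instead of issuing one random read per transaction.

```python
# Sketch of skip-sequential matching after the pre-sort: both inputs are
# sorted ascending by account key, so one forward pass suffices.

def post_sorted(transactions, master):
    """Apply sorted transactions to a sorted master, updating in place."""
    m = iter(master)
    acct = next(m, None)
    for txn in transactions:
        while acct is not None and acct["key"] < txn["key"]:
            acct = next(m, None)       # skip non-matching master records
        if acct is not None and acct["key"] == txn["key"]:
            acct["balance"] += txn["amount"]
```

Each file is read at most once, so total I/O is bounded by the two file sizes rather than by the transaction count times random-read latency.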
The BAL-CALC Overhaul
This was the most complex change. Maria converted 23 DISPLAY fields to COMP-3, moved the daily rate computation outside the main loop, changed compiler options to OPTIMIZE(FULL) and NUMPROC(PFD), and increased VSAM buffers.
The unit test suite from Chapter 34 was essential — every change was verified against the 64-test regression suite before promotion.
Post-Optimization Results
After all optimizations were implemented and tested over a one-week parallel run period:
| Job | Before | After | Savings |
|---|---|---|---|
| BAL-CALC | 72 min | 25 min | 65% |
| TXN-POST | 55 min | 15 min | 73% |
| RPT-DAILY | 45 min | 12 min | 73% |
| REG-FEED | 38 min | 8 min | 79% |
| ACCT-MAINT | 22 min | 8 min | 64% |
| Other | 8 min | 7 min | 13% |
| Total | 240 min | 75 min | 69% |
The batch cycle now completed by 12:15 AM — nearly five hours before the deadline.
Six-Month Follow-Up
Six months later, with continued 12% annual account growth, the batch cycle had grown to 82 minutes — still well within the 6-hour window. The optimization had bought approximately 8 years of growth headroom at current rates.
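The headroom figure can be sanity-checked with a compound-growth model (assuming runtime scales with data volume at roughly 20% per year against the 360-minute window; slower growth stretches the horizon further):

```python
# Years until a 75-minute batch outgrows a 6-hour (360-minute) window,
# assuming runtime compounds at ~20%/year with data volume.
import math

minutes, window, growth = 75, 360, 0.20
years = math.log(window / minutes) / math.log(1 + growth)
print(f"{years:.1f} years")  # about 8.6 years at 20% annual growth
```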
Discussion Questions
1. The REG-FEED block size fix saved 30 minutes with a one-line JCL change. How could this misconfiguration have been prevented? What process would catch similar issues in the future?
2. Maria prioritized optimizations by expected savings. If she had instead prioritized by effort (easiest first), would the order change? Would the outcome be different?
3. The sort pre-step for TXN-POST added 3 minutes but saved 40. Under what circumstances would adding a sort step NOT be beneficial?
4. Maria used NUMPROC(PFD) for BAL-CALC. What testing would you require before allowing this change in production? What would happen if the data contained non-preferred signs?