Case Study 1: GlobalBank Nightly Batch Recovery
The Situation
It was a Wednesday night in November — peak transaction season because of the upcoming holiday. GlobalBank processed 3.1 million transactions that day, roughly 35% above the typical daily volume. Maria Chen was not on call, but she had set an alert on her phone for any batch job returning RC > 4.
At 1:23 AM, her phone buzzed: GBPOST — RETURN CODE 12 — BATCH WINDOW AT RISK.
The transaction posting step — the most critical job in the nightly cycle — had aborted after processing 1,847,233 of 3.1 million transactions. The extract, sort, and validation steps had all completed successfully. The posting step hit a VSAM file error (file status 93 — insufficient virtual storage for VSAM buffers) because the higher-than-normal transaction volume required more buffer space than the region size allowed.
The Challenge
Maria had to make a decision quickly. The batch window closed at 6:00 AM. It was 1:23 AM. She had 4 hours and 37 minutes to:
- Fix the VSAM buffer issue
- Restart the posting step from checkpoint
- Complete the remaining steps: interest calculation, fee assessment, GL reconciliation, statement generation, and archival
The posting step typically ran for 90 minutes at normal volume. With the higher volume and the remaining 1.25 million transactions, she estimated 50 minutes to complete from the restart point — but only if the restart worked cleanly.
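Maria's time budget and her 50-minute estimate can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes that posting throughput per record is constant and that "normal volume" is the night's volume divided by 1.35, neither of which the case study states outright. The date is a placeholder, and all names are hypothetical.

```python
from datetime import datetime

# Figures taken from the case study; the calendar date is a placeholder.
FAILURE_TIME = datetime(2000, 11, 1, 1, 23)
WINDOW_CLOSE = datetime(2000, 11, 1, 6, 0)
TOTAL_RECORDS = 3_100_000
RECORDS_DONE = 1_847_233
NORMAL_RUNTIME_MIN = 90   # posting step runtime at normal volume
VOLUME_FACTOR = 1.35      # "roughly 35% above the typical daily volume"


def minutes_remaining(now, deadline):
    """Whole minutes left between now and the batch window deadline."""
    return int((deadline - now).total_seconds() // 60)


def estimate_restart_minutes(total, done, normal_runtime_min, volume_factor):
    """Estimate minutes to post the remaining records.

    Assumes constant throughput, derived from the normal-volume runtime.
    """
    normal_volume = total / volume_factor
    records_per_minute = normal_volume / normal_runtime_min
    return (total - done) / records_per_minute


budget = minutes_remaining(FAILURE_TIME, WINDOW_CLOSE)
estimate = estimate_restart_minutes(
    TOTAL_RECORDS, RECORDS_DONE, NORMAL_RUNTIME_MIN, VOLUME_FACTOR
)
print(f"Time budget: {budget // 60} h {budget % 60} min")
print(f"Estimated restart runtime: {estimate:.1f} min")
```

Under these assumptions the estimate lands just over 49 minutes, which is consistent with the 50 minutes Maria allowed for.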
The Recovery
Step 1: Diagnose (1:23 AM – 1:35 AM)
Maria checked the SYSOUT from GBPOST. The last messages were:
GBPOST: CHECKPOINT WRITTEN - RECORDS PROCESSED: 1847000
GBPOST: PROCESSING RECORD 1847234
GBPOST: VSAM FILE ERROR - STATUS 93 ON ACCT-MASTER
GBPOST: ATTEMPTING GRACEFUL SHUTDOWN
GBPOST: FINAL CHECKPOINT WRITTEN - RECORDS: 1847233
GBPOST: CONTROL TOTALS AT FAILURE:
RECORDS READ: 1,847,233
RECORDS POSTED: 1,843,891
RECORDS REJECTED: 3,342
HASH TOTAL: 387,421,339,841
FINANCIAL TOTAL: $412,887,193.44
GBPOST: RETURN CODE 12 - VSAM RESOURCE ERROR
The critical observation: the program had written a final checkpoint at the exact failure point. This was possible because Maria had coded a VSAM error handler that caught file status 93, wrote one final checkpoint, and then terminated gracefully with RC 12 rather than letting the system ABEND. An ABEND would have left only the periodic checkpoint at record 1,847,000, with the 233 records processed after it outside any checkpoint.
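The shape of that error handler can be sketched in a few lines. This is not GBPOST's COBOL source, which the case study does not show; it is a language-neutral illustration in Python. The exception class, function names, and callback parameters are all hypothetical stand-ins, and the demo uses a tiny record count and interval so the behavior is easy to follow.

```python
class VsamResourceError(Exception):
    """Hypothetical stand-in for file status 93 on a VSAM file."""


def post_transactions(records, post_one, write_checkpoint, interval):
    """Post records, checkpointing every `interval` records.

    On a resource error, write one final checkpoint at the exact count
    processed so far and return RC 12 instead of crashing.
    Returns (return_code, records_processed).
    """
    processed = 0
    try:
        for record in records:
            post_one(record)
            processed += 1
            if processed % interval == 0:
                write_checkpoint(processed)
    except VsamResourceError:
        # Graceful shutdown: the checkpoint now matches the failure point.
        write_checkpoint(processed)
        return 12, processed
    return 0, processed


# Demo: ten records, the eighth fails, checkpoint every five records.
checkpoints = []


def flaky_post(record):
    if record == 8:
        raise VsamResourceError("status 93")


rc, done = post_transactions(range(1, 11), flaky_post, checkpoints.append, interval=5)
print(rc, done, checkpoints)  # 12 7 [5, 7]
```

Without the `except` branch, the last recorded checkpoint in the demo would be 5, and records 6 and 7 would be in the same position as the 233 records in Maria's scenario.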
Step 2: Fix the Resource Issue (1:35 AM – 1:50 AM)
Maria increased the region size for the GBPOST step from 256M to 512M in the JCL:
//GBPOST EXEC PGM=GBPOST,REGION=512M
She also adjusted the VSAM buffer allocation:
//ACCTMSTR DD DSN=GBANK.ACCT.MASTER,DISP=OLD,
// AMP=('BUFNI=30,BUFND=60')
Step 3: Restart from Checkpoint (1:50 AM – 2:42 AM)
Maria restarted the GBPOST step. The program detected the checkpoint file:
GBPOST: CHECKPOINT DETECTED - RESTART MODE
GBPOST: CHECKPOINT DATA:
RECORDS PROCESSED: 1,847,233
HASH TOTAL: 387,421,339,841
GBPOST: REPOSITIONING INPUT FILE...
GBPOST: SKIPPED 1,847,233 RECORDS
GBPOST: RESUMING PROCESSING AT RECORD 1,847,234
The program skipped 1,847,233 already-processed input records (this took about 3 minutes for sequential read-through) and resumed processing. The remaining 1,252,767 records were processed in 49 minutes.
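The repositioning logic amounts to reading past the checkpointed record count and resuming at the next record. A minimal sketch, assuming a purely sequential input and a checkpoint that stores only a record count (the function name and the sample records are hypothetical):

```python
from itertools import islice


def resume_from_checkpoint(records, checkpoint_count):
    """Yield (record_number, record) for records not yet processed.

    Sequentially reads and discards the first `checkpoint_count` records,
    mirroring a read-through of a sequential input file.
    """
    reader = iter(records)
    skipped = sum(1 for _ in islice(reader, checkpoint_count))
    if skipped != checkpoint_count:
        raise ValueError("input holds fewer records than the checkpoint count")
    for number, record in enumerate(reader, start=checkpoint_count + 1):
        yield number, record


# Demo with ten records and a checkpoint taken after record 7.
sample = [f"TXN{n:03d}" for n in range(1, 11)]
resumed = list(resume_from_checkpoint(sample, 7))
print(resumed)  # [(8, 'TXN008'), (9, 'TXN009'), (10, 'TXN010')]

# The same arithmetic for the case study's figures.
remaining = 3_100_000 - 1_847_233
print(remaining)  # 1252767
```

The skip is linear in the checkpoint count, which is why repositioning past 1,847,233 records cost about 3 minutes in the recovery.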
Step 4: Verify Control Totals (2:42 AM)
The program's completion message confirmed a clean restart:
GBPOST: PROCESSING COMPLETE
TOTAL RECORDS READ: 3,100,000
TOTAL RECORDS POSTED: 3,091,847
TOTAL RECORDS REJECTED: 8,153
HASH TOTAL: 648,291,847,223
FINANCIAL TOTAL: $847,291,433.27
GBPOST: CONTROL TOTALS VERIFIED - BALANCED
GBPOST: RETURN CODE 0
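One check that can be reproduced directly from the logged figures is the record-count balance: every record read must be either posted or rejected, both at the failure point and at completion. The sketch below covers only that check. How GBPOST validates the hash and financial totals is not described in the case study, so those are left out. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlTotals:
    read: int
    posted: int
    rejected: int

    def balanced(self) -> bool:
        """Every record read must be accounted for as posted or rejected."""
        return self.read == self.posted + self.rejected

    def minus(self, earlier: "ControlTotals") -> "ControlTotals":
        """Totals accumulated since an earlier snapshot."""
        return ControlTotals(
            self.read - earlier.read,
            self.posted - earlier.posted,
            self.rejected - earlier.rejected,
        )


# Figures from the GBPOST messages in the case study.
at_failure = ControlTotals(read=1_847_233, posted=1_843_891, rejected=3_342)
at_completion = ControlTotals(read=3_100_000, posted=3_091_847, rejected=8_153)
after_restart = at_completion.minus(at_failure)

print(at_failure.balanced(), at_completion.balanced())  # True True
print(after_restart)
```

The difference between the two snapshots shows that the restarted run read 1,252,767 records, matching the count of records remaining at the failure point, so no record was skipped or processed twice as far as the counts can show.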
Step 5: Complete the Batch Window (2:42 AM – 5:48 AM)
The remaining steps ran without incident:
| Step | Start | End | RC |
|---|---|---|---|
| Interest Calculation | 2:44 AM | 3:21 AM | 0 |
| Fee Assessment | 3:22 AM | 3:38 AM | 0 |
| GL Reconciliation | 3:39 AM | 4:02 AM | 0 |
| Statement Generation | 4:03 AM | 5:14 AM | 0 |
| Archive to GDG | 5:15 AM | 5:31 AM | 0 |
| Housekeeping | 5:32 AM | 5:48 AM | 0 |
The batch cycle finished at 5:48 AM, 12 minutes before the 6:00 AM window closed.
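The step durations and the final margin follow directly from the table. A small sketch of that arithmetic, using only the times listed above (the helper name is hypothetical):

```python
# (step, start, end) as listed in the table; all times are AM on the same night.
SCHEDULE = [
    ("Interest Calculation", "2:44", "3:21"),
    ("Fee Assessment", "3:22", "3:38"),
    ("GL Reconciliation", "3:39", "4:02"),
    ("Statement Generation", "4:03", "5:14"),
    ("Archive to GDG", "5:15", "5:31"),
    ("Housekeeping", "5:32", "5:48"),
]
WINDOW_CLOSE = "6:00"


def to_minutes(clock):
    """Convert 'H:MM' to minutes after midnight."""
    hours, minutes = clock.split(":")
    return int(hours) * 60 + int(minutes)


durations = {name: to_minutes(end) - to_minutes(start) for name, start, end in SCHEDULE}
longest_step = max(durations, key=durations.get)
margin = to_minutes(WINDOW_CLOSE) - to_minutes(SCHEDULE[-1][2])

for name, minutes in durations.items():
    print(f"{name:<22}{minutes:>3} min")
print(f"Longest step: {longest_step}; margin: {margin} min")
```

Statement generation accounts for 71 of the post-restart minutes, far more than any other remaining step.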
Key Lessons
- Graceful error handling saved the night. Because GBPOST caught the VSAM error and wrote a clean checkpoint before terminating, the restart was straightforward. If the program had ABENDed, the checkpoint would have been stale (from record 1,847,000), and 233 records would have needed special handling.
- Checkpoint frequency mattered. Checkpoints every 5,000 records meant the maximum reprocessing after any failure was 5,000 records, about 12 seconds of processing time at the throughput seen in the restart run (1,252,767 records in 49 minutes, roughly 426 records per second). The cost of this safety: approximately 620 extra I/O operations during the full run (3.1M / 5,000 = 620 checkpoints).
- Holiday volume planning is essential. The region size should have been proactively increased for the holiday period. Maria added this to the pre-holiday checklist for future years.
- The batch window margin was too thin. With 12 minutes to spare after an incident, the system was at risk. Maria proposed extending the batch window by 30 minutes or optimizing the statement generation step (the longest remaining step) to provide more margin.
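The checkpoint-frequency lesson above is a trade-off between extra checkpoint writes and worst-case rework, and it can be tabulated for a few candidate intervals. This is a rough sketch: the throughput is an assumption derived from the restart run (1,252,767 records in 49 minutes), the intervals other than 5,000 are arbitrary, and the function name is hypothetical.

```python
def checkpoint_tradeoff(total_records, interval, records_per_second):
    """Return (checkpoint writes, worst-case seconds of rework) for an interval.

    Worst case applies when a failure leaves only the last periodic
    checkpoint, with no final checkpoint at the failure point.
    """
    checkpoint_writes = total_records // interval
    worst_case_rework_seconds = interval / records_per_second
    return checkpoint_writes, worst_case_rework_seconds


# Assumed throughput, taken from the restart run in the case study.
ASSUMED_RATE = 1_252_767 / (49 * 60)

for interval in (1_000, 5_000, 25_000):
    writes, rework = checkpoint_tradeoff(3_100_000, interval, ASSUMED_RATE)
    print(f"interval={interval:>6}  checkpoints={writes:>5}  worst-case rework={rework:6.1f}s")
```

Shorter intervals reduce rework linearly but add checkpoint I/O at the same rate, so the right setting depends on how expensive each checkpoint write is relative to reprocessing time.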
Discussion Questions
- What would have happened if the program had ABENDed instead of terminating gracefully? How would the recovery have been different?
- The input file repositioning (skipping 1.8M records) took 3 minutes. Could this be optimized? What if the input file were VSAM KSDS instead of sequential?
- Should the checkpoint interval be decreased during high-volume periods? What is the trade-off?
- How would this recovery scenario change if the posting step used DB2 instead of VSAM? Consider the role of COMMIT/ROLLBACK.