Case Study 1: GlobalBank Nightly Batch Recovery

The Situation

It was a Wednesday night in November — peak transaction season because of the upcoming holiday. GlobalBank processed 3.1 million transactions that day, roughly 35% above the typical daily volume. Maria Chen was not on call, but she had set an alert on her phone for any batch job returning RC > 4.

At 1:23 AM, her phone buzzed: GBPOST — RETURN CODE 12 — BATCH WINDOW AT RISK.

The transaction posting step, the most critical job in the nightly cycle, had stopped with return code 12 after processing 1,847,233 of 3.1 million transactions. The extract, sort, and validation steps had all completed successfully. The posting step hit a VSAM file error (file status 93: insufficient virtual storage for VSAM buffers) because the higher-than-normal transaction volume required more buffer space than the region size allowed.

The Challenge

Maria had to make a decision quickly. The batch window closed at 6:00 AM. It was 1:23 AM. She had 4 hours and 37 minutes to:

  1. Fix the VSAM buffer issue
  2. Restart the posting step from checkpoint
  3. Complete the remaining steps: interest calculation, fee assessment, GL reconciliation, statement generation, and archival

The posting step typically ran for 90 minutes at normal volume. With the higher volume and the remaining 1.25 million transactions, she estimated 50 minutes to complete from the restart point — but only if the restart worked cleanly.
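Maria's estimate can be reproduced with back-of-the-envelope arithmetic. The sketch below is written in Python purely for illustration (GBPOST itself is not shown in this case study). It assumes that per-record throughput is the same as on a normal night and that "35% above typical" is exact; both are simplifying assumptions, not figures taken from the job logs.

```python
# Rough restart estimate from the figures quoted in the case study.
TOTAL_TXNS = 3_100_000        # transactions in tonight's run
PROCESSED = 1_847_233         # records covered by the final checkpoint
NORMAL_RUNTIME_MIN = 90       # posting runtime at normal volume
VOLUME_UPLIFT = 1.35          # assumed exact: tonight is 35% above normal

normal_volume = TOTAL_TXNS / VOLUME_UPLIFT
rate_per_min = normal_volume / NORMAL_RUNTIME_MIN   # assumed constant throughput
remaining = TOTAL_TXNS - PROCESSED
estimate_min = remaining / rate_per_min

# Time left in the batch window: 1:23 AM to 6:00 AM, in minutes.
window_left_min = 6 * 60 - (1 * 60 + 23)

print(f"remaining records:      {remaining:,}")
print(f"estimated posting time: {estimate_min:.0f} min")
print(f"window left:            {window_left_min // 60} h {window_left_min % 60} min")
```

Under these assumptions the estimate lands within a minute or so of Maria's 50-minute figure, which leaves most of the window for the downstream steps.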

The Recovery

Step 1: Diagnose (1:23 AM – 1:35 AM)

Maria checked the SYSOUT from GBPOST. The last messages were:

GBPOST: CHECKPOINT WRITTEN - RECORDS PROCESSED: 1847000
GBPOST: PROCESSING RECORD 1847234
GBPOST: VSAM FILE ERROR - STATUS 93 ON ACCT-MASTER
GBPOST: ATTEMPTING GRACEFUL SHUTDOWN
GBPOST: FINAL CHECKPOINT WRITTEN - RECORDS: 1847233
GBPOST: CONTROL TOTALS AT FAILURE:
  RECORDS READ:      1,847,233
  RECORDS POSTED:    1,843,891
  RECORDS REJECTED:      3,342
  HASH TOTAL:    387,421,339,841
  FINANCIAL TOTAL:  $412,887,193.44
GBPOST: RETURN CODE 12 - VSAM RESOURCE ERROR

The critical observation: the program had written a final checkpoint at the exact failure point. This was possible because Maria had coded a VSAM error handler that caught the file status 93, wrote one final checkpoint, and then terminated gracefully with RC 12 rather than letting the system ABEND (which would have lost the in-flight data).
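The pattern behind this handler can be sketched in a few lines. The Python below is an illustrative model, not the GBPOST source: ResourceError, write_checkpoint, run_posting, and the JSON checkpoint layout are all hypothetical names chosen for the example. The point it shows is the ordering: the error is caught before the failing record is counted, a final checkpoint is written for the records that did complete, and the step ends with return code 12.

```python
import json
import os
import tempfile

class ResourceError(Exception):
    """Hypothetical stand-in for a VSAM file status 93 condition."""

def write_checkpoint(path, state):
    # Write to a temporary file and rename it, so a failure mid-write
    # never leaves a truncated checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def run_posting(records, post, ckpt_path, interval=5000):
    """Post records; return 0 on success or 12 after a handled resource error."""
    state = {"records": 0, "posted": 0, "rejected": 0}
    for rec in records:
        try:
            accepted = post(rec)
        except ResourceError:
            # The failing record was not applied, so the final checkpoint
            # covers only the records completed before it.
            write_checkpoint(ckpt_path, state)
            return 12
        state["records"] += 1
        state["posted" if accepted else "rejected"] += 1
        if state["records"] % interval == 0:
            write_checkpoint(ckpt_path, state)   # routine checkpoint
    return 0

# Tiny demonstration: the eighth record triggers the simulated error.
def demo_post(rec):
    if rec == 8:
        raise ResourceError("simulated status 93")
    return rec % 3 != 0      # arbitrary rule: every third record is rejected

ckpt = os.path.join(tempfile.mkdtemp(), "gbpost.ckpt")
rc = run_posting(range(1, 21), demo_post, ckpt, interval=5)
with open(ckpt) as f:
    final_state = json.load(f)
print(rc, final_state)
```

In the demonstration the routine checkpoint is taken at record 5, yet the checkpoint on disk after the failure reflects record 7. That is the same relationship as the 1,847,000 and 1,847,233 figures in the SYSOUT above.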

Step 2: Fix the Resource Issue (1:35 AM – 1:50 AM)

Maria increased the region size for the GBPOST step from 256M to 512M in the JCL:

//GBPOST   EXEC PGM=GBPOST,REGION=512M

She also adjusted the VSAM buffer allocation:

//ACCTMSTR DD  DSN=GBANK.ACCT.MASTER,DISP=OLD,
//             AMP=('BUFNI=30,BUFND=60')

Step 3: Restart from Checkpoint (1:50 AM – 2:42 AM)

Maria restarted the GBPOST step. The program detected the checkpoint file:

GBPOST: CHECKPOINT DETECTED - RESTART MODE
GBPOST: CHECKPOINT DATA:
  RECORDS PROCESSED: 1,847,233
  HASH TOTAL:    387,421,339,841
GBPOST: REPOSITIONING INPUT FILE...
GBPOST: SKIPPED 1,847,233 RECORDS
GBPOST: RESUMING PROCESSING AT RECORD 1,847,234

The program skipped the 1,847,233 already-processed input records, which took about 3 minutes because a sequential file must be read through from the start, and then resumed processing. The remaining 1,252,767 records were processed in 49 minutes.
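The repositioning logic amounts to "read and discard N records, then carry on". A minimal Python sketch of that idea follows; the function name resume and its parameters are hypothetical, and the real program would also restore its control totals from the checkpoint, which is omitted here.

```python
from itertools import islice

def resume(records, checkpoint_count, post):
    """Skip records already covered by the checkpoint, then process the rest."""
    it = iter(records)
    # A sequential input cannot be positioned directly, so the records that
    # were already processed are read and discarded one by one.
    skipped = sum(1 for _ in islice(it, checkpoint_count))
    if skipped != checkpoint_count:
        raise RuntimeError("input is shorter than the checkpoint count")
    processed = 0
    for rec in it:
        post(rec)
        processed += 1
    return skipped, processed

# Small demonstration: ten records, checkpoint taken after the fourth.
posted = []
skipped, processed = resume(range(1, 11), 4, posted.append)
print(skipped, processed, posted[0])
```

The guard on the skipped count is a cheap sanity check: if the input file is shorter than the checkpoint says it should be, the restart is probably pointed at the wrong file and should stop rather than post anything.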

Step 4: Verify Control Totals (2:42 AM)

The program's completion message confirmed a clean restart:

GBPOST: PROCESSING COMPLETE
  TOTAL RECORDS READ:      3,100,000
  TOTAL RECORDS POSTED:    3,091,847
  TOTAL RECORDS REJECTED:      8,153
  HASH TOTAL:    648,291,847,223
  FINANCIAL TOTAL:  $847,291,433.27
GBPOST: CONTROL TOTALS VERIFIED - BALANCED
GBPOST: RETURN CODE 0
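The record counts in the two sets of control totals can be cross-checked by hand or with a short script. The Python sketch below assumes the simplest possible balancing rule (records read equals records posted plus records rejected). The case study does not say what GBPOST compares its hash and financial totals against, so those are left out.

```python
def balanced(read, posted, rejected):
    """Simplest balancing rule: every record read was either posted or rejected."""
    return read == posted + rejected

# Figures copied from the SYSOUT excerpts in this case study.
at_failure = {"read": 1_847_233, "posted": 1_843_891, "rejected": 3_342}
final = {"read": 3_100_000, "posted": 3_091_847, "rejected": 8_153}

for label, totals in (("at failure", at_failure), ("final", final)):
    print(label, "balanced" if balanced(**totals) else "OUT OF BALANCE")

# Work done after the restart is the difference between the two sets of totals.
after_restart = {key: final[key] - at_failure[key] for key in final}
print(after_restart)
```

The difference in records read matches the 1,252,767 records processed after the restart, which is a useful independent confirmation that no record was skipped or posted twice.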

Step 5: Complete the Batch Window (2:42 AM – 5:48 AM)

The remaining steps ran without incident:

Step                    Start     End       RC
Interest Calculation    2:44 AM   3:21 AM   0
Fee Assessment          3:22 AM   3:38 AM   0
GL Reconciliation       3:39 AM   4:02 AM   0
Statement Generation    4:03 AM   5:14 AM   0
Archive to GDG          5:15 AM   5:31 AM   0
Housekeeping            5:32 AM   5:48 AM   0

The batch cycle completed at 5:48 AM, 12 minutes before the 6:00 AM deadline.
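Step durations and the remaining margin follow directly from the start and end times in the schedule. The short Python sketch below recomputes them; the helper name minutes is hypothetical, and the times are copied from the schedule above.

```python
from datetime import datetime

def minutes(start, end):
    """Whole minutes between two same-night clock times such as '2:44 AM'."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

steps = [
    ("Interest Calculation", "2:44 AM", "3:21 AM"),
    ("Fee Assessment", "3:22 AM", "3:38 AM"),
    ("GL Reconciliation", "3:39 AM", "4:02 AM"),
    ("Statement Generation", "4:03 AM", "5:14 AM"),
    ("Archive to GDG", "5:15 AM", "5:31 AM"),
    ("Housekeeping", "5:32 AM", "5:48 AM"),
]

durations = {name: minutes(start, end) for name, start, end in steps}
longest = max(durations, key=durations.get)
margin = minutes("5:48 AM", "6:00 AM")

for name, mins in durations.items():
    print(f"{name:<22}{mins:>4} min")
print(f"longest step: {longest}; margin before deadline: {margin} min")
```

Laying the durations side by side makes it easy to see which step dominates the tail of the cycle and therefore where tuning effort would buy the most margin.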

Key Lessons

  1. Graceful error handling saved the night. Because GBPOST caught the VSAM error and wrote a clean checkpoint before terminating, the restart was straightforward. If the program had ABENDed, the checkpoint would have been stale (from record 1,847,000), and 233 records would have needed special handling.

  2. Checkpoint frequency mattered. Checkpoints every 5,000 records meant the maximum reprocessing after any failure was 5,000 records, about 12 seconds of processing at the roughly 426 records per second observed after the restart. The cost of this safety: approximately 620 extra I/O operations during the full run (3.1M / 5,000 = 620 checkpoints).

  3. Holiday volume planning is essential. The region size should have been proactively increased for the holiday period. Maria added this to the pre-holiday checklist for future years.

  4. The batch window margin was too thin. With 12 minutes to spare after an incident, the system was at risk. Maria proposed extending the batch window by 30 minutes or optimizing the statement generation step (the longest remaining step) to provide more margin.
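The checkpoint trade-off in lesson 2 can be explored numerically. The Python sketch below is a rough model under two stated assumptions: throughput equals the rate observed after the restart (1,252,767 records in 49 minutes), and each checkpoint costs one extra write. The candidate intervals other than 5,000 are hypothetical values chosen for comparison.

```python
TOTAL_RECORDS = 3_100_000
# Assumed throughput: the rate observed after the restart.
RATE_PER_SEC = 1_252_767 / (49 * 60)

def tradeoff(interval):
    """Return (number of checkpoints, worst-case seconds of reprocessing)."""
    checkpoints = TOTAL_RECORDS // interval
    worst_case_sec = interval / RATE_PER_SEC
    return checkpoints, worst_case_sec

for interval in (1_000, 5_000, 25_000):
    count, seconds = tradeoff(interval)
    print(f"interval {interval:>6}: {count:>5} checkpoints, "
          f"up to {seconds:.1f} s of reprocessing")
```

A smaller interval shrinks the reprocessing exposure but multiplies the checkpoint writes. Whether that overhead matters depends on how expensive a checkpoint is in the real program, which this model does not capture.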

Discussion Questions

  1. What would have happened if the program had ABENDed instead of terminating gracefully? How would the recovery have been different?

  2. The input file repositioning (skipping 1.8M records) took 3 minutes. Could this be optimized? What if the input file were VSAM KSDS instead of sequential?

  3. Should the checkpoint interval be decreased during high-volume periods? What is the trade-off?

  4. How would this recovery scenario change if the posting step used DB2 instead of VSAM? Consider the role of COMMIT/ROLLBACK.