Case Study 1: CNB's Checkpoint/Restart Redesign After the Six-Hour Batch Rerun
Background
Central National Bank's nightly batch cycle processes transactions across 3.2 million customer accounts. The cycle includes 47 batch jobs comprising 186 job steps, running in a 6-hour batch window from 11:00 PM to 5:00 AM. The critical path — the longest chain of dependent jobs that determines the minimum possible runtime — takes 4 hours and 20 minutes on a typical night.
On the night of March 14, 2019, that critical path expanded to 10 hours and 47 minutes. The cause: a single program, CBNC4500, that had no checkpoint/restart logic.
The Incident
CBNC4500 — the daily account reconciliation program — was step 2 of a 5-step job (CBNNIGHT). It read 14.2 million transaction records from DB2 table DAILY_TRANS, matched each record against three VSAM master files (ACCT_MASTER, BRANCH_MASTER, GL_MASTER), wrote adjustment records to a sequential output dataset (RECON.ADJUST), and updated two DB2 tables (RECON_RESULTS, RECON_EXCEPTIONS).
The program was written in 1997 by a contractor named Dave Phelps. It processed all 14.2 million records as a single unit of work — no COMMIT statements anywhere in the 4,200-line source. At 2:47 AM, after 3 hours of processing, a storage controller firmware bug caused a brief I/O interruption. DB2 detected a timeout on a page lock and abended the thread with SQLCODE -911.
Rob Calloway was the on-call infrastructure lead. He got the page at 2:51 AM.
"I looked at the abend and knew immediately," Rob recalled during the post-mortem. "No checkpoint, no restart. CBNC4500 had to go back to record one. But first, DB2 had to roll back every single update it had made in the last four hours. That rollback took 48 minutes by itself."
The timeline:
| Time | Event |
|---|---|
| 11:00 PM | Batch cycle starts |
| 11:47 PM | CBNC4500 begins (step 2 of CBNNIGHT) |
| 2:47 AM | Storage controller I/O error; DB2 -911 abend |
| 2:51 AM | Rob paged |
| 2:55 AM | Rob assesses the situation, notifies Kwame |
| 3:35 AM | DB2 rollback completes (48 minutes) |
| 3:40 AM | Rob resubmits CBNNIGHT from STEP020 |
| 3:42 AM | CBNC4500 restarts from record 1 |
| 6:29 AM | CBNC4500 completes (2:47 elapsed — faster due to lighter system load) |
| 6:32 AM | Steps 3-5 of CBNNIGHT begin |
| 7:15 AM | CBNNIGHT completes |
| 9:47 AM | All downstream jobs complete |
The batch window was blown by 4 hours and 47 minutes. Three downstream jobs missed their SLA deadlines. The most critical: the daily wire transfer extract (job CBNWIRE), which was due to the Federal Reserve by 7:00 AM. It completed at 8:12 AM. CNB received a regulatory notice.
The Post-Mortem
Kwame convened the post-mortem two days later. Rob, Lisa Park (batch architecture lead), and four developers attended.
Root cause analysis:
- Immediate cause: Storage controller firmware bug causing I/O interruption. The vendor acknowledged the bug and delivered a fix within a week.
- Contributing cause: CBNC4500 had no checkpoint/restart logic. A transient I/O error that should have resulted in a 5-minute recovery instead caused a 6+ hour delay.
- Systemic cause: No standard requiring checkpoint/restart for long-running batch programs. Of CNB's 186 batch job steps, only 23 had any form of checkpoint/restart.
Kwame's direction:
"The storage controller bug was bad luck. But we made our own bad luck by running a 4-hour program with no checkpoints. Fix the program. Then fix the standard. I don't want to have this conversation again."
The Redesign
Lisa led the redesign. She established three goals:
- CBNC4500 must support restart from its last checkpoint with a maximum recovery time of 10 minutes.
- A reusable checkpoint/restart framework must be created for all CNB batch programs.
- All batch programs running longer than 30 minutes or processing more than 100,000 records must implement checkpoint/restart within 6 months.
Phase 1: CBNC4500 Redesign
Lisa and Rob analyzed CBNC4500's data access patterns:
| Resource | Access Type | Records | Checkpoint Impact |
|---|---|---|---|
| DAILY_TRANS (DB2) | Read via cursor | 14.2M | Cursor repositioning on restart |
| ACCT_MASTER (VSAM KSDS) | Random read | ~14.2M lookups | No checkpoint needed (read-only) |
| BRANCH_MASTER (VSAM KSDS) | Random read | ~14.2M lookups | No checkpoint needed (read-only) |
| GL_MASTER (VSAM KSDS) | Random read | ~2.1M lookups | No checkpoint needed (read-only) |
| RECON_RESULTS (DB2) | Insert | ~14.2M | Committed with checkpoint |
| RECON_EXCEPTIONS (DB2) | Insert | ~180K | Committed with checkpoint |
| RECON.ADJUST (Sequential) | Write | ~320K | Regenerated on restart |
Design decisions:
Commit frequency: 5,000 records. Lisa ran tests at 1,000, 2,500, 5,000, 10,000, and 25,000:
| Commit Freq | Elapsed Time | CPU Time | Max Lock Hold | Concurrent Impact |
|---|---|---|---|---|
| 1,000 | 3h 12m | 58 min | 1.8 sec | None measured |
| 2,500 | 2h 58m | 52 min | 4.5 sec | None measured |
| 5,000 | 2h 51m | 49 min | 9.1 sec | None measured |
| 10,000 | 2h 48m | 48 min | 18.2 sec | Minor (2 online timeouts) |
| 25,000 | 2h 46m | 47 min | 45.5 sec | Significant (17 online timeouts) |
At 10,000, two online banking transactions timed out during the test — RECON_RESULTS shared a tablespace with a table accessed by the online system. At 25,000, the problem was severe. Lisa chose 5,000: the elapsed time penalty was minimal (4 minutes over the no-checkpoint baseline of 2:47), and lock hold time stayed under 10 seconds.
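The commit-interval pattern behind these tests can be sketched in Python standing in for the COBOL main loop. The function and the `db` object's methods here are illustrative assumptions, not CNB code; the point is that each COMMIT both bounds the rollback on failure and records a restart position.

```python
COMMIT_INTERVAL = 5_000  # the value Lisa chose from her tests

def process_daily_trans(records, db):
    """Illustrative main loop: checkpoint and commit every COMMIT_INTERVAL records.

    `records` yields (acct_num, trans_seq, payload) tuples in cursor order;
    `db` is a stand-in for the DB2 interface (apply/checkpoint/commit are
    hypothetical method names).
    """
    since_commit = 0
    last_key = None
    for acct_num, trans_seq, payload in records:
        db.apply(acct_num, trans_seq, payload)   # e.g. inserts to RECON_RESULTS
        last_key = (acct_num, trans_seq)
        since_commit += 1
        if since_commit == COMMIT_INTERVAL:
            db.checkpoint(last_key)              # update the restart-control row
            db.commit()                          # releases locks, bounds rollback
            since_commit = 0
    if last_key is not None:
        db.checkpoint(last_key)                  # final partial batch
        db.commit()
```

Lowering COMMIT_INTERVAL shortens the maximum rollback and lock-hold time at the cost of more commit overhead, which is exactly the trade-off Lisa's table quantifies.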
Restart table design:
CREATE TABLE CNB_RESTART_CONTROL (
PROGRAM_NAME CHAR(8) NOT NULL,
JOB_NAME CHAR(8) NOT NULL,
STEP_NAME CHAR(8) NOT NULL,
LAST_ACCT_NUM CHAR(10) NOT NULL,
LAST_TRANS_SEQ INTEGER NOT NULL,
RECORDS_READ INTEGER NOT NULL,
RECORDS_MATCHED INTEGER NOT NULL,
RECORDS_EXCEPT INTEGER NOT NULL,
RECORDS_ADJUST INTEGER NOT NULL,
TOTAL_DEBIT_AMT DECIMAL(15,2) NOT NULL,
TOTAL_CREDIT_AMT DECIMAL(15,2) NOT NULL,
EXCEPT_AMOUNT DECIMAL(15,2) NOT NULL,
CHECKPOINT_TS TIMESTAMP NOT NULL,
RUN_STATUS CHAR(1) NOT NULL,
PRIMARY KEY (PROGRAM_NAME, JOB_NAME, STEP_NAME)
) IN CNBDB01.CNBTS01;
Lisa included both LAST_ACCT_NUM (the account number) and LAST_TRANS_SEQ (the transaction sequence within that account) because multiple transactions could exist for the same account. The restart cursor needed both values to position correctly.
Cursor for restart:
DECLARE CSR_DAILY_TRANS CURSOR FOR
SELECT ACCT_NUM, TRANS_SEQ, TRANS_DATE, TRANS_AMT,
TRANS_TYPE, BRANCH_ID, GL_CODE
FROM DAILY_TRANS
WHERE (ACCT_NUM > :restart-acct
OR (ACCT_NUM = :restart-acct
AND TRANS_SEQ > :restart-seq))
OR (:restart-acct = ' ')
ORDER BY ACCT_NUM, TRANS_SEQ
FOR FETCH ONLY
The composite key restart condition — (ACCT_NUM > :restart-acct OR (ACCT_NUM = :restart-acct AND TRANS_SEQ > :restart-seq)) — ensures the cursor skips all records processed before the checkpoint, including multiple transactions for the same account.
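The same predicate can be checked as a lexicographic tuple comparison. This Python helper is a hypothetical sketch, not CNB code; it mirrors the cursor's WHERE clause, including the blank-key fresh-start branch.

```python
def after_checkpoint(acct_num: str, trans_seq: int,
                     restart_acct: str, restart_seq: int) -> bool:
    """True if (acct_num, trans_seq) sorts strictly after the checkpoint key.

    A blank restart_acct means a fresh start: every record qualifies,
    mirroring the OR (:restart-acct = ' ') branch of the cursor. Python's
    tuple comparison matches the ORDER BY ACCT_NUM, TRANS_SEQ collation
    for fixed-width character keys.
    """
    if restart_acct.strip() == "":
        return True
    return (acct_num, trans_seq) > (restart_acct, restart_seq)
```

Note that records exactly at the checkpoint key are excluded (strict `>`): they were already processed and committed before the checkpoint was written.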
Sequential output handling:
The RECON.ADJUST sequential file was handled using the regeneration strategy. The step's JCL disposition, DISP=(NEW,CATLG,DELETE), deleted the partial output automatically on abend, so a restart could allocate the dataset fresh and rewrite it from committed data. Since the adjustments were derived from RECON_EXCEPTIONS (which was committed to DB2), the output could be regenerated from committed state. Rob wrote a short utility step that ran before the CBNC4500 restart to recreate the adjustment file from the RECON_EXCEPTIONS table — but Lisa overruled this approach. Instead, she modified CBNC4500 to write the adjustment file after all processing was complete, as a final pass over the committed RECON_EXCEPTIONS table. This eliminated the coordination problem entirely.
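Lisa's final-pass design can be sketched as follows. The function name and the fixed-width record layout are illustrative assumptions, but the principle matches the case study: the file is produced only from committed rows, after the main loop finishes, so a restart can simply run the pass again with no checkpoint coordination.

```python
def write_adjustment_file(committed_exceptions, out_path):
    """Regenerable sequential output, written only from committed rows.

    `committed_exceptions` stands in for a read of the committed
    RECON_EXCEPTIONS table, yielding (acct_num, amount) pairs. If the
    program abends and restarts, this pass runs again from committed
    state - the partial file is simply discarded and rewritten.
    """
    with open(out_path, "w") as out:
        for acct_num, amount in committed_exceptions:
            # fixed-width record, loosely modeled on a mainframe flat file:
            # 10-char account number, 15-char signed amount
            out.write(f"{acct_num:<10}{amount:>15.2f}\n")
```

Because the file is a pure function of committed data, the hard question ("which output records correspond to uncommitted updates?") never arises.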
Phase 2: The CNB Checkpoint/Restart Framework
Lisa built a reusable framework consisting of four COPY members:
| COPY Member | Contents |
|---|---|
| CNBCKWS | Working storage for checkpoint/restart (restart area, control fields, SQL host variables) |
| CNBCKINIT | Initialization paragraphs (read restart table, determine fresh start vs. restart) |
| CNBCKPT | Checkpoint paragraph (update restart table, commit, log) |
| CNBCKTERM | Termination paragraphs (set RUN_STATUS='E', commit, write final totals) |
A developer implementing checkpoint/restart in a new program would:
- Add COPY CNBCKWS to WORKING-STORAGE
- PERFORM CNB-CHKPT-INIT in the initialization section
- PERFORM CNB-CHKPT-TAKE at the appropriate point in the processing loop
- PERFORM CNB-CHKPT-TERM at normal end-of-job
- Populate the program-specific fields (key values, accumulators) before each checkpoint
The framework handled all generic logic: reading the restart table, determining start mode, writing checkpoints, committing, logging, error handling, and termination. The developer only needed to provide the business-specific key values and accumulators.
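The CNBCKINIT / CNBCKPT / CNBCKTERM lifecycle might be modeled roughly as follows. This is a Python sketch under stated assumptions: the RUN_STATUS value 'R' for an in-flight run is a guess (only 'E' appears in the case study), and the class and dict stand in for the COPY-member paragraphs and the DB2 restart table.

```python
from dataclasses import dataclass

@dataclass
class RestartControl:
    """In-memory stand-in for one CNB_RESTART_CONTROL row."""
    last_acct_num: str = " "
    last_trans_seq: int = 0
    records_read: int = 0
    run_status: str = "E"  # 'E' = ended normally (per CNBCKTERM); 'R' = running (assumed)

class CheckpointFramework:
    """Sketch of the CNBCKINIT / CNBCKPT / CNBCKTERM flow."""

    def __init__(self, table: dict, key: tuple):
        # key is (PROGRAM_NAME, JOB_NAME, STEP_NAME), the table's primary key
        self.table, self.key = table, key
        self.row = table.setdefault(key, RestartControl())

    def init(self) -> bool:
        """CNBCKINIT: returns True for a restart, False for a fresh start."""
        restart = self.row.run_status == "R"   # prior run died mid-flight
        if not restart:
            # fresh start: reset counters and mark the run as in flight
            self.row = self.table[self.key] = RestartControl(run_status="R")
        return restart

    def take(self, acct: str, seq: int, read: int):
        """CNBCKPT: record restart position and counters (COMMIT elided here)."""
        self.row.last_acct_num, self.row.last_trans_seq = acct, seq
        self.row.records_read = read

    def term(self):
        """CNBCKTERM: mark normal end so the next run starts fresh."""
        self.row.run_status = "E"
```

The key behavior: a run that abends never reaches term(), so RUN_STATUS stays 'R' and the next init() detects a restart and resumes from the stored key values.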
Phase 3: The Retrofit
Over the following six months, CNB retrofitted checkpoint/restart into 41 batch programs that met the criteria (running > 30 minutes or processing > 100,000 records). The framework reduced the per-program effort from an estimated 3–5 days to 1–2 days.
Rob prioritized the critical path jobs first. Within the first month, all 12 critical-path programs had checkpoint/restart. The remaining 29 programs were completed over the next five months.
Results
Immediate impact (CBNC4500):
- Elapsed time: 2:47 (unchanged from pre-checkpoint; the 4-minute overhead was offset by the better commit behavior reducing lock contention)
- Recovery time on restart: tested at 4 minutes (vs. 6+ hours previously)
- Online system impact: zero timeouts (down from occasional timeouts caused by the long-running uncommitted UR)

Six-month impact (all 41 programs):
- Total batch failures requiring restart: 14 incidents
- Average recovery time: 6 minutes
- Maximum recovery time: 18 minutes (a program with a very large commit interval that was later adjusted)
- SLA misses due to batch restarts: zero

Three-year impact:
- CBNC4500 has been restarted 9 times. Average recovery: 4 minutes. Zero downstream impact.
- The CNB framework is now used by 67 batch programs.
- Two additional major incidents occurred where checkpoint/restart prevented SLA misses — both involved hardware failures similar to the original CBNC4500 incident.
Lessons Learned
Rob documented five lessons from the CBNC4500 incident:
1. The cost of not checkpointing is hidden until the failure. CBNC4500 ran successfully for 22 years without checkpoints. The risk was invisible until it materialized. "Just because it hasn't failed doesn't mean it won't."
2. Commit frequency affects more than just recovery. The uncommitted UR in CBNC4500 was causing low-grade lock contention against the online system for years. Adding commits improved the online system's response time during the batch window — an unexpected benefit.
3. Reusable frameworks pay for themselves. The COPY member framework reduced the per-program retrofit effort by 60–70%. Without it, the 6-month retrofit timeline would have been 12–18 months.
4. Test the restart, not just the checkpoint. Three of the 41 retrofitted programs had bugs in their restart logic that were only found during end-to-end restart testing. Code review missed all three.
5. Sequential file handling is the hardest part. Every program with sequential output required a specific strategy (regeneration, GDG, or post-processing). There is no one-size-fits-all solution for sequential files.
Discussion Questions
1. Lisa chose a commit frequency of 5,000 for CBNC4500. Given the test data shown above, do you agree with this choice? What commit frequency would you recommend, and why?
2. Lisa modified CBNC4500 to write the sequential adjustment file as a final pass over committed data, rather than writing it during the main processing loop. What are the advantages and disadvantages of this approach? Under what circumstances would you choose differently?
3. The CNB framework uses COPY members for reusable checkpoint/restart logic. An alternative approach is to use a called subprogram (a separate COBOL program called via CALL). Compare the two approaches. Which would you recommend for a shop with 200+ batch programs?
4. Of the 186 job steps in CNB's nightly batch, only 41 met the criteria for checkpoint/restart (> 30 minutes or > 100,000 records). Is this threshold appropriate? What would you set as the threshold, and why?
5. Rob's post-mortem identified "no standard requiring checkpoint/restart" as a systemic cause. How would you enforce a checkpoint/restart standard for new development? What would the code review checklist look like?
6. The CBNC4500 incident was ultimately caused by a storage controller firmware bug — an event completely outside the application team's control. What does this tell you about the relationship between application design and infrastructure reliability? How should this influence your approach to defensive programming?