Case Study 1: CNB's Checkpoint/Restart Redesign After the Six-Hour Batch Rerun
Background
Central National Bank's nightly batch cycle processes transactions across 3.2 million customer accounts. The cycle includes 47 batch jobs comprising 186 job steps, running in a 6-hour batch window from 11:00 PM to 5:00 AM. The critical path — the longest chain of dependent jobs that determines the minimum possible runtime — takes 4 hours and 20 minutes on a typical night.
On the night of March 14, 2019, that critical path expanded to 10 hours and 47 minutes. The cause: a single program, CBNC4500, that had no checkpoint/restart logic.
The Incident
CBNC4500 — the daily account reconciliation program — was step 2 of a 5-step job (CBNNIGHT). It read 14.2 million transaction records from DB2 table DAILY_TRANS, matched each record against three VSAM master files (ACCT_MASTER, BRANCH_MASTER, GL_MASTER), wrote adjustment records to a sequential output dataset (RECON.ADJUST), and updated two DB2 tables (RECON_RESULTS, RECON_EXCEPTIONS).
The program was written in 1997 by a contractor named Dave Phelps. It processed all 14.2 million records as a single unit of work — no COMMIT statements anywhere in the 4,200-line source. At 2:47 AM, after 3 hours of processing, a storage controller firmware bug caused a brief I/O interruption. DB2 detected a timeout on a page lock and abended the thread with SQLCODE -911.
Rob Calloway was the on-call infrastructure lead. He got the page at 2:51 AM.
"I looked at the abend and knew immediately," Rob recalled during the post-mortem. "No checkpoint, no restart. CBNC4500 had to go back to record one. But first, DB2 had to roll back every single update it had made in the last four hours. That rollback took 48 minutes by itself."
The timeline:
| Time | Event |
|---|---|
| 11:00 PM | Batch cycle starts |
| 11:47 PM | CBNC4500 begins (step 2 of CBNNIGHT) |
| 2:47 AM | Storage controller I/O error; DB2 -911 abend |
| 2:51 AM | Rob paged |
| 2:55 AM | Rob assesses the situation, notifies Kwame |
| 3:35 AM | DB2 rollback completes (48 minutes) |
| 3:40 AM | Rob resubmits CBNNIGHT from STEP020 |
| 3:42 AM | CBNC4500 restarts from record 1 |
| 6:29 AM | CBNC4500 completes (2:47 elapsed — faster due to lighter system load) |
| 6:32 AM | Steps 3-5 of CBNNIGHT begin |
| 7:15 AM | CBNNIGHT completes |
| 9:47 AM | All downstream jobs complete |
The batch window was blown by 4 hours and 47 minutes. Three downstream jobs missed their SLA deadlines. The most critical: the daily wire transfer extract (job CBNWIRE), which was due to the Federal Reserve by 7:00 AM. It completed at 8:12 AM. CNB received a regulatory notice.
The Post-Mortem
Kwame convened the post-mortem two days later. Rob, Lisa Park (batch architecture lead), and four developers attended.
Root cause analysis:
- Immediate cause: Storage controller firmware bug causing I/O interruption. The vendor acknowledged the bug and delivered a fix within a week.
- Contributing cause: CBNC4500 had no checkpoint/restart logic. A transient I/O error that should have resulted in a 5-minute recovery instead caused a 6+ hour delay.
- Systemic cause: No standard requiring checkpoint/restart for long-running batch programs. Of CNB's 186 batch job steps, only 23 had any form of checkpoint/restart.
Kwame's direction:
"The storage controller bug was bad luck. But we made our own bad luck by running a 4-hour program with no checkpoints. Fix the program. Then fix the standard. I don't want to have this conversation again."
The Redesign
Lisa led the redesign. She established three goals:
- CBNC4500 must support restart from its last checkpoint with a maximum recovery time of 10 minutes.
- A reusable checkpoint/restart framework must be created for all CNB batch programs.
- All batch programs running longer than 30 minutes or processing more than 100,000 records must implement checkpoint/restart within 6 months.
Phase 1: CBNC4500 Redesign
Lisa and Rob analyzed CBNC4500's data access patterns:
| Resource | Access Type | Records | Checkpoint Impact |
|---|---|---|---|
| DAILY_TRANS (DB2) | Read via cursor | 14.2M | Cursor repositioning on restart |
| ACCT_MASTER (VSAM KSDS) | Random read | ~14.2M lookups | No checkpoint needed (read-only) |
| BRANCH_MASTER (VSAM KSDS) | Random read | ~14.2M lookups | No checkpoint needed (read-only) |
| GL_MASTER (VSAM KSDS) | Random read | ~2.1M lookups | No checkpoint needed (read-only) |
| RECON_RESULTS (DB2) | Insert | ~14.2M | Committed with checkpoint |
| RECON_EXCEPTIONS (DB2) | Insert | ~180K | Committed with checkpoint |
| RECON.ADJUST (Sequential) | Write | ~320K | Regenerated on restart |
Design decisions:
Commit frequency: 5,000 records. Lisa ran tests at 1,000, 2,500, 5,000, 10,000, and 25,000:
| Commit Freq | Elapsed Time | CPU Time | Max Lock Hold | Concurrent Impact |
|---|---|---|---|---|
| 1,000 | 3h 12m | 58 min | 1.8 sec | None measured |
| 2,500 | 2h 58m | 52 min | 4.5 sec | None measured |
| 5,000 | 2h 51m | 49 min | 9.1 sec | None measured |
| 10,000 | 2h 48m | 48 min | 18.2 sec | Minor (2 online timeouts) |
| 25,000 | 2h 46m | 47 min | 45.5 sec | Significant (17 online timeouts) |
At 10,000, two online banking transactions timed out during the test — RECON_RESULTS shared a tablespace with a table accessed by the online system. At 25,000, the problem was severe. Lisa chose 5,000: the elapsed time penalty was minimal (4 minutes over the no-checkpoint baseline of 2:47), and lock hold time stayed under 10 seconds.
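The commit-interval pattern behind these tests can be sketched in Python standing in for the COBOL main loop. The function and the `db` object's methods here are illustrative assumptions, not CNB code; the point is that each COMMIT both bounds the rollback on failure and records a restart position.

```python
COMMIT_INTERVAL = 5_000  # the value Lisa chose from her tests

def process_daily_trans(records, db):
    """Illustrative main loop: checkpoint and commit every COMMIT_INTERVAL records.

    `records` yields (acct_num, trans_seq, payload) tuples in cursor order;
    `db` is a stand-in for the DB2 interface (apply/checkpoint/commit are
    hypothetical method names).
    """
    since_commit = 0
    last_key = None
    for acct_num, trans_seq, payload in records:
        db.apply(acct_num, trans_seq, payload)   # e.g. inserts to RECON_RESULTS
        last_key = (acct_num, trans_seq)
        since_commit += 1
        if since_commit == COMMIT_INTERVAL:
            db.checkpoint(last_key)              # update the restart-control row
            db.commit()                          # releases locks, bounds rollback
            since_commit = 0
    if last_key is not None:
        db.checkpoint(last_key)                  # final partial batch
        db.commit()
```

Lowering COMMIT_INTERVAL shortens the maximum rollback and lock-hold time at the cost of more commit overhead, which is exactly the trade-off Lisa's table quantifies.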
Restart table design:
CREATE TABLE CNB_RESTART_CONTROL (
PROGRAM_NAME CHAR(8) NOT NULL,
JOB_NAME CHAR(8) NOT NULL,
STEP_NAME CHAR(8) NOT NULL,
LAST_ACCT_NUM CHAR(10) NOT NULL,
LAST_TRANS_SEQ INTEGER NOT NULL,
RECORDS_READ INTEGER NOT NULL,
RECORDS_MATCHED INTEGER NOT NULL,
RECORDS_EXCEPT INTEGER NOT NULL,
RECORDS_ADJUST INTEGER NOT NULL,
TOTAL_DEBIT_AMT DECIMAL(15,2) NOT NULL,
TOTAL_CREDIT_AMT DECIMAL(15,2) NOT NULL,
EXCEPT_AMOUNT DECIMAL(15,2) NOT NULL,
CHECKPOINT_TS TIMESTAMP NOT NULL,
RUN_STATUS CHAR(1) NOT NULL,
PRIMARY KEY (PROGRAM_NAME, JOB_NAME, STEP_NAME)
) IN CNBDB01.CNBTS01;
Lisa included both LAST_ACCT_NUM (the account number) and LAST_TRANS_SEQ (the transaction sequence within that account) because multiple transactions could exist for the same account. The restart cursor needed both values to position correctly.
Cursor for restart:
DECLARE CSR_DAILY_TRANS CURSOR FOR
SELECT ACCT_NUM, TRANS_SEQ, TRANS_DATE, TRANS_AMT,
TRANS_TYPE, BRANCH_ID, GL_CODE
FROM DAILY_TRANS
WHERE (ACCT_NUM > :restart-acct
OR (ACCT_NUM = :restart-acct
AND TRANS_SEQ > :restart-seq))
OR (:restart-acct = ' ')
ORDER BY ACCT_NUM, TRANS_SEQ
FOR FETCH ONLY
The composite key restart condition — (ACCT_NUM > :restart-acct OR (ACCT_NUM = :restart-acct AND TRANS_SEQ > :restart-seq)) — ensures the cursor skips all records processed before the checkpoint, including multiple transactions for the same account.
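The same predicate can be checked as a lexicographic tuple comparison. This Python helper is a hypothetical sketch, not CNB code; it mirrors the cursor's WHERE clause, including the blank-key fresh-start branch.

```python
def after_checkpoint(acct_num: str, trans_seq: int,
                     restart_acct: str, restart_seq: int) -> bool:
    """True if (acct_num, trans_seq) sorts strictly after the checkpoint key.

    A blank restart_acct means a fresh start: every record qualifies,
    mirroring the OR (:restart-acct = ' ') branch of the cursor. Python's
    tuple comparison matches the ORDER BY ACCT_NUM, TRANS_SEQ collation
    for fixed-width character keys.
    """
    if restart_acct.strip() == "":
        return True
    return (acct_num, trans_seq) > (restart_acct, restart_seq)
```

Note that records exactly at the checkpoint key are excluded (strict `>`): they were already processed and committed before the checkpoint was written.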
Sequential output handling:
The RECON.ADJUST sequential file was handled using the regeneration strategy. The step's JCL disposition, DISP=(NEW,CATLG,DELETE), deleted the partial output automatically on abend, so a restart could allocate the dataset fresh and rewrite it from committed data. Since the adjustments were derived from RECON_EXCEPTIONS (which was committed to DB2), the output could be regenerated from committed state. Rob wrote a short utility step that ran before the CBNC4500 restart to recreate the adjustment file from the RECON_EXCEPTIONS table — but Lisa overruled this approach. Instead, she modified CBNC4500 to write the adjustment file after all processing was complete, as a final pass over the committed RECON_EXCEPTIONS table. This eliminated the coordination problem entirely.
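Lisa's final-pass design can be sketched as follows. The function name and the fixed-width record layout are illustrative assumptions, but the principle matches the case study: the file is produced only from committed rows, after the main loop finishes, so a restart can simply run the pass again with no checkpoint coordination.

```python
def write_adjustment_file(committed_exceptions, out_path):
    """Regenerable sequential output, written only from committed rows.

    `committed_exceptions` stands in for a read of the committed
    RECON_EXCEPTIONS table, yielding (acct_num, amount) pairs. If the
    program abends and restarts, this pass runs again from committed
    state - the partial file is simply discarded and rewritten.
    """
    with open(out_path, "w") as out:
        for acct_num, amount in committed_exceptions:
            # fixed-width record, loosely modeled on a mainframe flat file:
            # 10-char account number, 15-char signed amount
            out.write(f"{acct_num:<10}{amount:>15.2f}\n")
```

Because the file is a pure function of committed data, the hard question ("which output records correspond to uncommitted updates?") never arises.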
Phase 2: The CNB Checkpoint/Restart Framework
Lisa built a reusable framework consisting of four COPY members:
| COPY Member | Contents |
|---|---|
| CNBCKWS | Working storage for checkpoint/restart (restart area, control fields, SQL host variables) |
| CNBCKINIT | Initialization paragraphs (read restart table, determine fresh start vs. restart) |
| CNBCKPT | Checkpoint paragraph (update restart table, commit, log) |
| CNBCKTERM | Termination paragraphs (set RUN_STATUS='E', commit, write final totals) |
A developer implementing checkpoint/restart in a new program would:
- Add COPY CNBCKWS to WORKING-STORAGE
- PERFORM CNB-CHKPT-INIT in the initialization section
- PERFORM CNB-CHKPT-TAKE at the appropriate point in the processing loop
- PERFORM CNB-CHKPT-TERM at normal end-of-job
- Populate the program-specific fields (key values, accumulators) before each checkpoint
The framework handled all generic logic: reading the restart table, determining start mode, writing checkpoints, committing, logging, error handling, and termination. The developer only needed to provide the business-specific key values and accumulators.
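The CNBCKINIT / CNBCKPT / CNBCKTERM lifecycle might be modeled roughly as follows. This is a Python sketch under stated assumptions: the RUN_STATUS value 'R' for an in-flight run is a guess (only 'E' appears in the case study), and the class and dict stand in for the COPY-member paragraphs and the DB2 restart table.

```python
from dataclasses import dataclass

@dataclass
class RestartControl:
    """In-memory stand-in for one CNB_RESTART_CONTROL row."""
    last_acct_num: str = " "
    last_trans_seq: int = 0
    records_read: int = 0
    run_status: str = "E"  # 'E' = ended normally (per CNBCKTERM); 'R' = running (assumed)

class CheckpointFramework:
    """Sketch of the CNBCKINIT / CNBCKPT / CNBCKTERM flow."""

    def __init__(self, table: dict, key: tuple):
        # key is (PROGRAM_NAME, JOB_NAME, STEP_NAME), the table's primary key
        self.table, self.key = table, key
        self.row = table.setdefault(key, RestartControl())

    def init(self) -> bool:
        """CNBCKINIT: returns True for a restart, False for a fresh start."""
        restart = self.row.run_status == "R"   # prior run died mid-flight
        if not restart:
            # fresh start: reset counters and mark the run as in flight
            self.row = self.table[self.key] = RestartControl(run_status="R")
        return restart

    def take(self, acct: str, seq: int, read: int):
        """CNBCKPT: record restart position and counters (COMMIT elided here)."""
        self.row.last_acct_num, self.row.last_trans_seq = acct, seq
        self.row.records_read = read

    def term(self):
        """CNBCKTERM: mark normal end so the next run starts fresh."""
        self.row.run_status = "E"
```

The key behavior: a run that abends never reaches term(), so RUN_STATUS stays 'R' and the next init() detects a restart and resumes from the stored key values.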
Phase 3: The Retrofit
Over the following six months, CNB retrofitted checkpoint/restart into 41 batch programs that met the criteria (running > 30 minutes or processing > 100,000 records). The framework reduced the per-program effort from an estimated 3–5 days to 1–2 days.
Rob prioritized the critical path jobs first. Within the first month, all 12 critical-path programs had checkpoint/restart. The remaining 29 programs were completed over the next five months.
Results
Immediate impact (CBNC4500):
- Elapsed time: 2:47 (unchanged from pre-checkpoint; the 4-minute overhead was offset by the better commit behavior reducing lock contention)
- Recovery time on restart: tested at 4 minutes (vs. 6+ hours previously)
- Online system impact: zero timeouts (down from occasional timeouts caused by the long-running uncommitted UR)

Six-month impact (all 41 programs):
- Total batch failures requiring restart: 14 incidents
- Average recovery time: 6 minutes
- Maximum recovery time: 18 minutes (a program with a very large commit interval that was later adjusted)
- SLA misses due to batch restarts: zero

Three-year impact:
- CBNC4500 has been restarted 9 times. Average recovery: 4 minutes. Zero downstream impact.
- The CNB framework is now used by 67 batch programs.
- Two additional major incidents occurred where checkpoint/restart prevented SLA misses — both involved hardware failures similar to the original CBNC4500 incident.
Lessons Learned
Rob documented five lessons from the CBNC4500 incident:
1. The cost of not checkpointing is hidden until the failure. CBNC4500 ran successfully for 22 years without checkpoints. The risk was invisible until it materialized. "Just because it hasn't failed doesn't mean it won't."
2. Commit frequency affects more than just recovery. The uncommitted UR in CBNC4500 was causing low-grade lock contention against the online system for years. Adding commits improved the online system's response time during the batch window — an unexpected benefit.
3. Reusable frameworks pay for themselves. The COPY member framework reduced the per-program retrofit effort by 60–70%. Without it, the 6-month retrofit timeline would have been 12–18 months.
4. Test the restart, not just the checkpoint. Three of the 41 retrofitted programs had bugs in their restart logic that were only found during end-to-end restart testing. Code review missed all three.
5. Sequential file handling is the hardest part. Every program with sequential output required a specific strategy (regeneration, GDG, or post-processing). There is no one-size-fits-all solution for sequential files.
Discussion Questions
1. Lisa chose a commit frequency of 5,000 for CBNC4500. Given the test data shown above, do you agree with this choice? What commit frequency would you recommend, and why?
2. Lisa modified CBNC4500 to write the sequential adjustment file as a final pass over committed data, rather than writing it during the main processing loop. What are the advantages and disadvantages of this approach? Under what circumstances would you choose differently?
3. The CNB framework uses COPY members for reusable checkpoint/restart logic. An alternative approach is to use a called subprogram (a separate COBOL program called via CALL). Compare the two approaches. Which would you recommend for a shop with 200+ batch programs?
4. Of the 186 job steps in CNB's nightly batch, only 41 met the criteria for checkpoint/restart (> 30 minutes or > 100,000 records). Is this threshold appropriate? What would you set as the threshold, and why?
5. Rob's post-mortem identified "no standard requiring checkpoint/restart" as a systemic cause. How would you enforce a checkpoint/restart standard for new development? What would the code review checklist look like?
6. The CBNC4500 incident was ultimately caused by a storage controller firmware bug — an event completely outside the application team's control. What does this tell you about the relationship between application design and infrastructure reliability? How should this influence your approach to defensive programming?