Learning Objectives
- Design checkpoint/restart logic for COBOL batch programs that coordinate across DB2, VSAM, and sequential files
- Implement application-level checkpointing using DB2 COMMIT frequency and restart table patterns
- Configure JCL for automatic restart using the z/OS checkpoint/restart facility and RD parameter
- Analyze checkpoint frequency tradeoffs (commit interval, recovery time, lock duration, log volume)
- Design checkpoint/restart for the HA banking system's batch processing pipeline
In This Chapter
- 24.1 Why Checkpoints Matter
- 24.2 The z/OS Checkpoint/Restart Facility
- 24.3 Application-Level Checkpointing in COBOL
- 24.4 DB2 Commit Frequency Analysis
- 24.5 VSAM and Sequential File Checkpoint/Restart
- 24.6 Multi-Step Job Checkpoint Strategy
- 24.7 Testing Checkpoint/Restart
- 24.8 Checkpoint/Restart in the HA Banking System
- 24.9 Spaced Review: Connecting to Prior Chapters
- 24.10 Common Mistakes and How to Avoid Them
- Chapter Summary
Chapter 24: Checkpoint/Restart Design — Building Batch Programs That Survive Any Failure
"You don't build checkpoint/restart because you think your job will fail. You build it because the one time it does, at 2:47 AM on a Saturday before a holiday Monday, you want to go back to sleep instead of rerunning six hours of processing." — Rob Calloway, CNB Infrastructure Lead
24.1 Why Checkpoints Matter
Rob Calloway has been running batch systems at Central National Bank for nineteen years. He remembers every major outage. Not because he keeps a log — though he does — but because each one carved a lesson into his operational instincts the way a chisel cuts stone.
The one he tells new hires about happened on a Thursday in March, 2019. CNB's nightly batch cycle included a job called CBNC4500 — the daily account reconciliation program. It read 14.2 million transaction records from a DB2 table, matched them against VSAM master files, wrote adjustment records to a sequential output dataset, and updated three DB2 tables along the way. On a good night, it ran in 2 hours and 47 minutes. On that Thursday, it had been running for 3 hours and 52 minutes when a storage controller hiccupped. The DB2 address space took an abnormal termination. The job abended with a -911 SQLCODE — deadlock/timeout.
The program had no checkpoint logic. Zero. It had been written in 1997 by a contractor who was long gone, and it processed all 14.2 million records as a single unit of work. When it failed, DB2 rolled back every update it had made in those 3 hours and 52 minutes. The rollback itself took 48 minutes. Then the entire job had to be restarted from the beginning. Total elapsed time before the batch window closed: 7 hours and 27 minutes. Three downstream jobs missed their SLA. The wire transfer file was late. The Fed noticed.
Rob's post-mortem had one recommendation: implement checkpoint/restart for every batch program that processes more than 100,000 records or runs longer than 30 minutes.
That recommendation is the foundation of this chapter.
The Cost of Not Checkpointing
Let's be precise about what checkpoint/restart gives you, because vague benefits don't survive budget meetings:
Recovery time. Without checkpointing, a failed job restarts from record one. With checkpointing every 5,000 records, a job that fails at record 4,200,000 restarts from record 4,200,000 — not from zero. If processing takes 0.001 seconds per record, you save 4,200 seconds (70 minutes) of reprocessing.
Batch window protection. Chapter 23 covered the reality of shrinking batch windows. A 4-hour job that fails at the 3-hour mark and has to restart from scratch needs 7 hours total — if nothing else goes wrong. With checkpoint/restart, it needs 4 hours and perhaps 15 minutes. That's the difference between making the window and blowing it.
Lock duration. A program that commits every 5,000 records holds DB2 locks for the time it takes to process 5,000 records — maybe 5 seconds. A program that never commits holds locks for the entire run. Other programs that need those rows wait. Batch throughput degrades. Online systems that share those tables may timeout. This is not theoretical — we covered the mechanics of lock escalation in Chapter 8.
Log volume. DB2 writes before-images of every changed row to the active log. A single unit of work that updates 14.2 million rows generates enormous log volume. If the active log fills before the commit, DB2 forces an archive, and the entire system slows. Frequent commits keep the active log manageable.
Operational confidence. Operations staff who know that a failing job can be restarted from its last checkpoint handle incidents differently than staff who know that failure means a complete rerun. The first group follows procedure. The second group panics.
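The recovery-time arithmetic above is worth sanity-checking. Here is a small, illustrative model — the record counts and per-record timing are the chapter's example figures, not measurements:

```python
def reprocessing_saved_seconds(failure_record, commit_interval, sec_per_record):
    """Seconds of rework avoided by restarting from the last checkpoint
    instead of from record one."""
    last_checkpoint = (failure_record // commit_interval) * commit_interval
    without_checkpoint = failure_record * sec_per_record
    with_checkpoint = (failure_record - last_checkpoint) * sec_per_record
    return without_checkpoint - with_checkpoint

# Failure at record 4,200,000, checkpoints every 5,000 records,
# 0.001 seconds per record: about 4,200 seconds (70 minutes) saved.
saved = reprocessing_saved_seconds(4_200_000, 5_000, 0.001)
print(round(saved / 60))
```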
The Threshold Concept: Design for Recovery, Not Prevention
Here is the mental shift that separates junior mainframe developers from senior ones:
Checkpoint/restart is not about preventing failure. It is about making failure recovery fast and automatic.
You cannot prevent all failures. Hardware fails. Software has bugs. Network connections drop. Databases run out of space. The question is not "will this job ever fail?" The answer to that is always yes. The question is "when this job fails, how fast can we recover?"
The programs that survive in production for decades are not the ones that never fail. They are the ones that fail gracefully — that leave the system in a known state, that can be restarted without manual intervention, that pick up exactly where they left off.
This is the principle we will apply throughout this chapter: design every batch program so that failure at any point results in fast, automatic recovery.
Checkpoint/Restart Terminology
Before we go further, let's nail down the vocabulary. You will hear these terms used loosely in conversation. In this chapter, they have precise meanings:
| Term | Definition |
|---|---|
| Checkpoint | A recorded point in a program's execution from which processing can resume after a failure. Includes saving the program's position in its input files, the state of its counters and accumulators, and committing database changes. |
| Restart | The process of resuming a program's execution from a previously recorded checkpoint rather than from the beginning. |
| Commit | A database operation that makes all changes since the last commit (or program start) permanent. In DB2, this is EXEC SQL COMMIT. |
| Rollback | A database operation that undoes all changes since the last commit. In DB2, EXEC SQL ROLLBACK. |
| Unit of Recovery (UR) | The set of database changes between two consecutive commits. If a failure occurs, the current UR is rolled back. |
| Restart Table | An application-maintained DB2 table that stores checkpoint information — record counts, key values, timestamps — so the program knows where to resume. |
| Commit Frequency / Commit Interval | The number of records processed between commits. A commit frequency of 5,000 means the program commits after every 5,000 records. |
| Forward Recovery | Reapplying committed changes from a log to bring a database forward to a consistent state after a media failure. |
| Backward Recovery | Undoing uncommitted changes (rollback) to return a database to its last consistent state after a program failure. |
| Checkpoint Dataset | A z/OS dataset written by the CHKPT macro to save program state for the z/OS checkpoint/restart facility. |
| RD Parameter | A JCL parameter coded on the JOB or EXEC statement that controls automatic restart and checkpoint behavior at the job or step level. |
24.2 The z/OS Checkpoint/Restart Facility
z/OS provides a system-level checkpoint/restart facility that has been part of the operating system since the MVS days. It works at the job step level and is primarily useful for programs that process sequential datasets. Let's understand what it offers and where it falls short.
The CHKPT Macro
A program can issue a checkpoint by calling the z/OS CHKPT macro. In COBOL, this is typically requested indirectly — through the RERUN clause in the I-O-CONTROL paragraph or by calling an assembler interface routine. When CHKPT executes, the operating system:
- Writes the contents of the program's working storage to a checkpoint dataset
- Records the position of all open sequential datasets (by block count)
- Records the status of all open QSAM/BSAM files
- Writes a checkpoint record that can be used for restart
The checkpoint dataset is specified in JCL using the SYSCHK DD statement:
//SYSCHK DD DSN=PROD.BATCH.CHKPT.DATA,
// DISP=(NEW,KEEP,KEEP),
// SPACE=(CYL,(5,5)),
// UNIT=SYSDA
The RD Parameter
The RD (Restart Definition) parameter on the JOB or EXEC statement controls checkpoint/restart behavior. It has four possible values:
| RD Value | Meaning |
|---|---|
| RD=R | Automatic restart is allowed. Checkpoints are allowed. |
| RD=RNC | Automatic restart is allowed. Checkpoints are suppressed (Not Checkpoint). |
| RD=NR | Automatic restart is suppressed (Not Restart). Checkpoints are allowed. |
| RD=NC | Neither restart nor checkpoints are allowed. |
In practice, you specify RD on the EXEC statement for the step you want to protect:
//STEP010 EXEC PGM=CBNC4500,RD=R
When a step with RD=R abends, the operator can restart the job from the last checkpoint using the RESTART parameter on the JOB statement:
//CBNC4500 JOB (ACCT),'RECON',CLASS=A,
// RESTART=(STEP010,chkptname)
SYSCKEOV — Checkpoint at End of Volume
For programs that process multi-volume sequential datasets, the SYSCKEOV DD statement tells the system to take an automatic checkpoint every time an input dataset reaches the end of a volume:
//SYSCKEOV DD DSN=PROD.BATCH.CHKPT.EOV,
// DISP=(NEW,KEEP,KEEP),
// SPACE=(CYL,(2,2)),
// UNIT=SYSDA
This was more useful in the tape era when volumes were physical tape reels. A 100-reel input dataset would get 99 automatic checkpoints. Today, with DASD datasets, SYSCKEOV is less commonly used, but it still functions with multi-volume DASD datasets.
Limitations of the System Facility
The z/OS checkpoint/restart facility has significant limitations that you need to understand:
It does not checkpoint DB2 state. The facility saves file positions and working storage, but DB2 transactions are separate. If your program updates DB2 tables and you restart from a system checkpoint, the DB2 changes made since the last DB2 COMMIT are already rolled back. Your program's working storage says you processed 500,000 records, but DB2 only has the first 495,000 committed. You have a mismatch.
It does not reposition VSAM files. VSAM datasets are not managed by the same I/O subsystem as sequential files. The checkpoint facility does not record VSAM file positions.
It requires operator intervention. The RESTART parameter must be coded on the JOB statement when resubmitting. This means someone has to modify JCL and resubmit. At 2:47 AM, that someone may not be immediately available.
It does not handle application state beyond working storage. If your program maintains state in external files, temporary datasets, or cross-memory structures, the checkpoint facility doesn't know about them.
These limitations are why most modern mainframe shops use application-level checkpointing instead of or in addition to the system facility. The system facility is a safety net. Application-level checkpointing is the primary strategy.
💡 Practitioner Note: I've worked in shops that relied entirely on the z/OS checkpoint/restart facility, and shops that used purely application-level checkpointing. The shops that had the smoothest operations used both — application-level checkpointing as the primary mechanism, with the system facility as a fallback for programs that didn't have application checkpointing yet.
24.3 Application-Level Checkpointing in COBOL
Application-level checkpointing means your COBOL program manages its own checkpoint and restart logic. The program decides when to checkpoint, what to save, and how to restart. This gives you complete control — and complete responsibility.
The Restart Table Pattern
The restart table pattern is the most widely used approach for application-level checkpointing in DB2/COBOL batch programs. Here's how it works:
- You create a DB2 table specifically to hold checkpoint information
- Your program writes its checkpoint state to this table every N records
- The checkpoint write is part of the same COMMIT that commits the business data
- On restart, the program reads the restart table to determine where to resume
The restart table typically looks like this:
CREATE TABLE RESTART_CONTROL (
PROGRAM_NAME CHAR(8) NOT NULL,
JOB_NAME CHAR(8) NOT NULL,
STEP_NAME CHAR(8) NOT NULL,
LAST_KEY_VALUE VARCHAR(100) NOT NULL,
RECORDS_READ INTEGER NOT NULL,
RECORDS_WRITTEN INTEGER NOT NULL,
RECORDS_UPDATED INTEGER NOT NULL,
RECORDS_ERROR INTEGER NOT NULL,
CHECKPOINT_TS TIMESTAMP NOT NULL,
RUN_STATUS CHAR(1) NOT NULL,
ACCUM_AMOUNT_1 DECIMAL(15,2) NOT NULL WITH DEFAULT 0,
ACCUM_AMOUNT_2 DECIMAL(15,2) NOT NULL WITH DEFAULT 0,
ACCUM_AMOUNT_3 DECIMAL(15,2) NOT NULL WITH DEFAULT 0,
USER_DATA VARCHAR(500),
PRIMARY KEY (PROGRAM_NAME, JOB_NAME, STEP_NAME)
);
The key columns:
- PROGRAM_NAME / JOB_NAME / STEP_NAME: Uniquely identify the running instance. This matters when the same program runs in multiple jobs.
- LAST_KEY_VALUE: The key of the last record successfully processed. This is how the program knows where to resume.
- RECORDS_READ / WRITTEN / UPDATED / ERROR: Counters that must be preserved across restart so that end-of-job totals are correct.
- CHECKPOINT_TS: When the checkpoint was taken. Useful for monitoring and debugging.
- RUN_STATUS: 'S' for started, 'C' for checkpointed, 'E' for ended. Tells the restart logic whether a previous run completed or was interrupted.
- ACCUM_AMOUNT_1/2/3: Accumulators for running totals (e.g., total debit amount, total credit amount). These must be preserved so that control totals at end-of-job are correct.
The Program Flow
Here is the complete flow of a checkpoint/restart-enabled COBOL batch program:
Initialization (performed once at program start):
1. Read the restart table for this program/job/step
2. IF RUN_STATUS = 'E' (previous run completed normally)
OR no row exists (first run ever)
THEN this is a fresh start:
- Initialize all counters to zero
- Set starting position to beginning of input
- INSERT or UPDATE restart table with RUN_STATUS = 'S'
- COMMIT
3. IF RUN_STATUS = 'S' or 'C' (previous run did not complete)
THEN this is a restart:
- Load counters from restart table
- Set starting position to LAST_KEY_VALUE
- Position input file/cursor past LAST_KEY_VALUE
- Log: "RESTARTING FROM KEY: [value], RECORDS PREVIOUSLY PROCESSED: [count]"
Main Processing Loop:
FOR each input record:
1. Process the record (updates, inserts, writes)
2. Increment counters
3. IF records-since-last-commit >= COMMIT-FREQUENCY
THEN take a checkpoint:
a. UPDATE restart table with current counters and last key
b. EXEC SQL COMMIT
c. Reset records-since-last-commit counter
d. Log: "CHECKPOINT AT KEY: [value], RECORDS: [count]"
Termination (performed once at normal end):
1. Process any remaining records
2. UPDATE restart table with final counters and RUN_STATUS = 'E'
3. EXEC SQL COMMIT
4. Write control totals to report
5. Close all files
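The COBOL sections that follow implement this flow for real. As a language-neutral sketch first (illustrative only — a dict stands in for the RESTART_CONTROL row, and in real code the row is persisted in the same COMMIT as the business data):

```python
def run(restart_row, records, commit_frequency, fail_at=None):
    """Process sorted record keys; optionally abend at key `fail_at`.
    Only checkpointed state lands in restart_row, mimicking the
    rollback of an uncommitted unit of recovery."""
    if restart_row.get("RUN_STATUS") in ("S", "C"):          # restart
        start_after = restart_row["LAST_KEY_VALUE"]
        records_read = restart_row["RECORDS_READ"]
    else:                                                    # fresh start
        restart_row.update(LAST_KEY_VALUE="", RECORDS_READ=0,
                           RUN_STATUS="S")
        start_after, records_read = "", 0
    since_commit = 0
    for key in records:
        if key <= start_after:       # cursor skips committed work
            continue
        if key == fail_at:           # abend: uncommitted work is lost
            raise RuntimeError("abend at " + key)
        records_read += 1            # "process" the record
        since_commit += 1
        if since_commit >= commit_frequency:   # checkpoint + COMMIT
            restart_row.update(LAST_KEY_VALUE=key,
                               RECORDS_READ=records_read,
                               RUN_STATUS="C")
            since_commit = 0
    restart_row.update(RECORDS_READ=records_read, RUN_STATUS="E")
    return restart_row

row, keys = {}, ["%04d" % i for i in range(1, 11)]
try:
    run(row, keys, commit_frequency=3, fail_at="0008")
except RuntimeError:
    pass
print(row["LAST_KEY_VALUE"], row["RECORDS_READ"])  # 0006 6
run(row, keys, commit_frequency=3)                 # restart completes
print(row["RECORDS_READ"], row["RUN_STATUS"])      # 10 E
```

Note how the failed run's seventh record is counted again on restart: it was read in the failed unit of recovery, but only the checkpoint at record six survived.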
COBOL Implementation
Here is the WORKING-STORAGE section for checkpoint/restart support:
01 WS-RESTART-AREA.
05 WS-RESTART-PROGRAM PIC X(08) VALUE 'CBNC4500'.
05 WS-RESTART-JOB PIC X(08).
05 WS-RESTART-STEP PIC X(08).
05 WS-RESTART-KEY PIC X(100).
05 WS-RESTART-REC-READ PIC S9(09) COMP VALUE ZERO.
05 WS-RESTART-REC-WRIT PIC S9(09) COMP VALUE ZERO.
05 WS-RESTART-REC-UPD PIC S9(09) COMP VALUE ZERO.
05 WS-RESTART-REC-ERR PIC S9(09) COMP VALUE ZERO.
05 WS-RESTART-STATUS PIC X(01).
88 RESTART-STARTED VALUE 'S'.
88 RESTART-CHECKPOINTED VALUE 'C'.
88 RESTART-ENDED VALUE 'E'.
05 WS-RESTART-ACCUM-1 PIC S9(13)V99 COMP-3
VALUE ZERO.
05 WS-RESTART-ACCUM-2 PIC S9(13)V99 COMP-3
VALUE ZERO.
05 WS-RESTART-ACCUM-3 PIC S9(13)V99 COMP-3
VALUE ZERO.
01 WS-CHECKPOINT-CONTROL.
05 WS-COMMIT-FREQUENCY PIC S9(09) COMP VALUE 5000.
05 WS-RECORDS-SINCE-CMT PIC S9(09) COMP VALUE ZERO.
05 WS-IS-RESTART PIC X(01) VALUE 'N'.
88 IS-RESTART VALUE 'Y'.
88 IS-FRESH-START VALUE 'N'.
05 WS-CHECKPOINT-COUNT PIC S9(09) COMP VALUE ZERO.
The initialization paragraph:
2000-INITIALIZE-RESTART.
MOVE SPACES TO WS-RESTART-KEY
EXEC SQL
SELECT LAST_KEY_VALUE,
RECORDS_READ,
RECORDS_WRITTEN,
RECORDS_UPDATED,
RECORDS_ERROR,
RUN_STATUS,
ACCUM_AMOUNT_1,
ACCUM_AMOUNT_2,
ACCUM_AMOUNT_3
INTO :WS-RESTART-KEY,
:WS-RESTART-REC-READ,
:WS-RESTART-REC-WRIT,
:WS-RESTART-REC-UPD,
:WS-RESTART-REC-ERR,
:WS-RESTART-STATUS,
:WS-RESTART-ACCUM-1,
:WS-RESTART-ACCUM-2,
:WS-RESTART-ACCUM-3
FROM RESTART_CONTROL
WHERE PROGRAM_NAME = :WS-RESTART-PROGRAM
AND JOB_NAME = :WS-RESTART-JOB
AND STEP_NAME = :WS-RESTART-STEP
END-EXEC
EVALUATE SQLCODE
WHEN 0
IF RESTART-ENDED
PERFORM 2100-FRESH-START
ELSE
PERFORM 2200-RESTART-FROM-CHECKPOINT
END-IF
WHEN +100
PERFORM 2100-FRESH-START
WHEN OTHER
DISPLAY 'RESTART TABLE READ FAILED, SQLCODE='
SQLCODE
MOVE 16 TO WS-RETURN-CODE
PERFORM 9000-ABEND-HANDLER
END-EVALUATE.
2100-FRESH-START.
SET IS-FRESH-START TO TRUE
MOVE ZEROS TO WS-RESTART-REC-READ
WS-RESTART-REC-WRIT
WS-RESTART-REC-UPD
WS-RESTART-REC-ERR
WS-RESTART-ACCUM-1
WS-RESTART-ACCUM-2
WS-RESTART-ACCUM-3
MOVE SPACES TO WS-RESTART-KEY
SET RESTART-STARTED TO TRUE
EXEC SQL
MERGE INTO RESTART_CONTROL RC
USING (VALUES (:WS-RESTART-PROGRAM,
:WS-RESTART-JOB,
:WS-RESTART-STEP))
AS SRC(PGM, JOB, STP)
ON RC.PROGRAM_NAME = SRC.PGM
AND RC.JOB_NAME = SRC.JOB
AND RC.STEP_NAME = SRC.STP
WHEN MATCHED THEN
UPDATE SET RUN_STATUS = 'S',
LAST_KEY_VALUE = ' ',
RECORDS_READ = 0,
RECORDS_WRITTEN = 0,
RECORDS_UPDATED = 0,
RECORDS_ERROR = 0,
CHECKPOINT_TS = CURRENT TIMESTAMP,
ACCUM_AMOUNT_1 = 0,
ACCUM_AMOUNT_2 = 0,
ACCUM_AMOUNT_3 = 0
WHEN NOT MATCHED THEN
INSERT (PROGRAM_NAME, JOB_NAME, STEP_NAME,
LAST_KEY_VALUE, RECORDS_READ,
RECORDS_WRITTEN, RECORDS_UPDATED,
RECORDS_ERROR, CHECKPOINT_TS,
RUN_STATUS, ACCUM_AMOUNT_1,
ACCUM_AMOUNT_2, ACCUM_AMOUNT_3)
VALUES (:WS-RESTART-PROGRAM,
:WS-RESTART-JOB,
:WS-RESTART-STEP,
' ', 0, 0, 0, 0,
CURRENT TIMESTAMP,
'S', 0, 0, 0)
END-EXEC
EXEC SQL COMMIT END-EXEC
DISPLAY 'CBNC4500 - FRESH START INITIATED'
.
2200-RESTART-FROM-CHECKPOINT.
SET IS-RESTART TO TRUE
DISPLAY 'CBNC4500 - RESTARTING FROM KEY: '
WS-RESTART-KEY
DISPLAY ' RECORDS PREVIOUSLY READ: '
WS-RESTART-REC-READ
DISPLAY ' RECORDS PREVIOUSLY WRITTEN: '
WS-RESTART-REC-WRIT
DISPLAY ' RECORDS PREVIOUSLY UPDATED: '
WS-RESTART-REC-UPD
DISPLAY ' RECORDS PREVIOUSLY IN ERROR:'
WS-RESTART-REC-ERR
.
The checkpoint paragraph:
5000-TAKE-CHECKPOINT.
ADD 1 TO WS-CHECKPOINT-COUNT
SET RESTART-CHECKPOINTED TO TRUE
EXEC SQL
UPDATE RESTART_CONTROL
SET LAST_KEY_VALUE = :WS-RESTART-KEY,
RECORDS_READ = :WS-RESTART-REC-READ,
RECORDS_WRITTEN = :WS-RESTART-REC-WRIT,
RECORDS_UPDATED = :WS-RESTART-REC-UPD,
RECORDS_ERROR = :WS-RESTART-REC-ERR,
CHECKPOINT_TS = CURRENT TIMESTAMP,
RUN_STATUS = 'C',
ACCUM_AMOUNT_1 = :WS-RESTART-ACCUM-1,
ACCUM_AMOUNT_2 = :WS-RESTART-ACCUM-2,
ACCUM_AMOUNT_3 = :WS-RESTART-ACCUM-3
WHERE PROGRAM_NAME = :WS-RESTART-PROGRAM
AND JOB_NAME = :WS-RESTART-JOB
AND STEP_NAME = :WS-RESTART-STEP
END-EXEC
IF SQLCODE NOT = 0
DISPLAY 'CHECKPOINT UPDATE FAILED, SQLCODE='
SQLCODE
MOVE 16 TO WS-RETURN-CODE
PERFORM 9000-ABEND-HANDLER
END-IF
EXEC SQL COMMIT END-EXEC
DISPLAY 'CHECKPOINT #' WS-CHECKPOINT-COUNT
' AT KEY: ' WS-RESTART-KEY
' RECORDS: ' WS-RESTART-REC-READ
MOVE ZERO TO WS-RECORDS-SINCE-CMT
.
The Critical Atomicity Requirement
Notice that the restart table UPDATE and the COMMIT happen together. The business data updates and the restart table update are all part of the same unit of recovery. This is not optional — it is the fundamental guarantee that makes checkpoint/restart work.
If you update the restart table in a separate commit from the business data, you create a window where the restart table says "I processed up to record 500,000" but DB2 has only committed changes through record 495,000. On restart, you'd skip 5,000 records. Or worse: the restart table commit succeeds but the business data commit fails, and you skip records that were never processed.
⚠️ Critical Rule: The restart table update and the business data updates MUST be committed in the same COMMIT. They must be in the same unit of recovery. This is non-negotiable.
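The guarantee is easy to demonstrate outside DB2. In this sketch, SQLite stands in for DB2 (the transaction semantics at issue are the same): the business update and the restart-row update share one unit of recovery, so a rollback removes both together.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE balances (acct TEXT PRIMARY KEY, amt INTEGER)")
con.execute("CREATE TABLE restart_control (pgm TEXT PRIMARY KEY, last_key TEXT)")
con.execute("INSERT INTO balances VALUES ('A1', 100)")
con.execute("INSERT INTO restart_control VALUES ('CBNC4500', '')")
con.commit()

# One unit of recovery: business change plus checkpoint row together.
con.execute("UPDATE balances SET amt = amt + 50 WHERE acct = 'A1'")
con.execute("UPDATE restart_control SET last_key = 'A1' WHERE pgm = 'CBNC4500'")
con.rollback()   # simulate an abend before the COMMIT

# Both updates vanish as a unit: the restart table never claims
# progress that the business data does not have.
amt = con.execute("SELECT amt FROM balances").fetchone()[0]
key = con.execute("SELECT last_key FROM restart_control").fetchone()[0]
print(amt, repr(key))   # 100 ''
```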
Retrieving Job and Step Names
Your program needs to know its own job name and step name to use as keys in the restart table. Enterprise COBOL has no intrinsic that returns them, so shops typically call a small assembler or Language Environment subroutine that reads the names from the system control blocks:
2050-GET-JOB-INFO.
*> GETJBNM is a placeholder for your site's utility that
*> returns the current job and step names.
    CALL 'GETJBNM' USING WS-RESTART-JOB
                         WS-RESTART-STEP
    .
Whatever mechanism your shop provides, the point is the same: get the values dynamically, don't hardcode them. The same program may run in different jobs with different step names.
Handling the Input Cursor on Restart
When restarting, the program must skip past all records that were already processed. For a DB2 cursor, this means adding the restart key to the WHERE clause:
3000-OPEN-INPUT-CURSOR.
*> WS-RESTART-KEY holds spaces on a fresh start and the
*> checkpoint key on a restart. The cursor's WHERE clause
*> handles both cases, so one unconditional OPEN suffices.
    EXEC SQL
        OPEN CSR-TRANSACTIONS
    END-EXEC
    .
Where the cursor is declared with the restart key as a host variable:
DECLARE CSR-TRANSACTIONS CURSOR FOR
    SELECT ACCT_NUM, TRANS_DATE, TRANS_AMT, TRANS_TYPE
    FROM DAILY_TRANSACTIONS
    WHERE ACCT_NUM > :WS-RESTART-KEY
       OR :WS-RESTART-KEY = ' '
    ORDER BY ACCT_NUM
    FOR FETCH ONLY
The OR :WS-RESTART-KEY = ' ' clause makes this a single cursor that handles both fresh start (the key is spaces, so every row qualifies) and restart (only rows after the checkpoint key qualify). Some shops use two separate cursors instead — one for fresh start, one for restart. Either approach works. The single-cursor approach is cleaner, but the optimizer may not produce an ideal access path for both cases; profile your specific queries.
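The two behaviors of that predicate can be seen with any SQL engine. A sketch using SQLite, with parameter markers standing in for the COBOL host variable:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily_transactions (acct_num TEXT)")
con.executemany("INSERT INTO daily_transactions VALUES (?)",
                [("A100",), ("A200",), ("A300",)])

q = ("SELECT acct_num FROM daily_transactions "
     "WHERE acct_num > ? OR ? = ' ' ORDER BY acct_num")

fresh = [r[0] for r in con.execute(q, (" ", " "))]          # key is spaces
restart = [r[0] for r in con.execute(q, ("A100", "A100"))]  # after checkpoint
print(fresh)    # ['A100', 'A200', 'A300']
print(restart)  # ['A200', 'A300']
```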
The WITH HOLD Cursor Consideration
By default, DB2 closes all open cursors when you issue a COMMIT. This means after every checkpoint COMMIT, your input cursor is closed and you must reopen it.
You can avoid this by declaring the cursor WITH HOLD:
DECLARE CSR-TRANSACTIONS CURSOR WITH HOLD FOR
    SELECT ACCT_NUM, TRANS_DATE, TRANS_AMT, TRANS_TYPE
    FROM DAILY_TRANSACTIONS
    WHERE ACCT_NUM > :WS-RESTART-KEY
       OR :WS-RESTART-KEY = ' '
    ORDER BY ACCT_NUM
    FOR FETCH ONLY
A WITH HOLD cursor survives a COMMIT — it stays open, and the next FETCH returns the next row as if the COMMIT hadn't happened. This eliminates the overhead of closing and reopening the cursor at every checkpoint.
However, WITH HOLD cursors have tradeoffs:
Advantages:
- No cursor close/reopen overhead at each checkpoint (can be significant for complex cursor queries)
- Simpler code — no need to reposition the cursor after each COMMIT
- DB2 can maintain internal positioning, potentially avoiding index lookups on each reopen
Disadvantages:
- WITH HOLD cursors retain certain resources across COMMITs, potentially affecting DB2 memory utilization
- If the underlying tablespace is reorganized or the index is rebuilt between commits (unlikely during batch, but possible during a long run), the cursor position may be lost
- On restart after an abend, the cursor is NOT preserved — it was open in the failed thread, which no longer exists. You still need the restart key logic for the initial cursor open on restart.
The critical point: WITH HOLD eliminates cursor reopen overhead during normal processing, but does not eliminate the need for restart key positioning. You still need the WHERE clause with the restart key for the restart case. WITH HOLD helps between checkpoints within a single run; the restart key helps when restarting a failed run.
At CNB, Lisa Park uses WITH HOLD for all checkpoint/restart cursors. The elapsed time improvement is measurable — about 2% for CBNC4500, which takes 2,840 checkpoint COMMITs during a full run. For programs with fewer commits, the benefit is negligible.
Idempotency and the Duplicate Processing Problem
A well-designed checkpoint/restart program must be aware of idempotency — the property that processing a record more than once produces the same result as processing it once.
Consider what happens if the program fails between processing a record and reaching the next checkpoint. On restart, that record is the first one fetched by the cursor (because it is after the last committed checkpoint key). But it was already processed in the failed UR, and those changes were rolled back. So the record is correctly reprocessed.
Now consider an INSERT operation. If the program inserts a row into a results table for each input record, and the program fails after the insert but before the commit, the insert is rolled back. On restart, the insert executes again — no problem, because the first insert was rolled back.
But what if the program writes to an external system — sends a message to IBM MQ, calls a web service, or writes to a non-DB2 file? Those operations are not rolled back by DB2. On restart, the program sends the message again or writes the record again. This is the duplicate processing problem.
The solution is to ensure that all side effects that cannot be rolled back are either:
1. Deferred until after COMMIT — perform the non-reversible action only after the data is committed
2. Made idempotent — design the receiving system to detect and ignore duplicates (using a unique identifier like a transaction ID)
3. Performed within the same unit of recovery — use two-phase commit with IBM MQ or another XA-compliant resource manager
For most COBOL batch programs that only touch DB2 and VSAM, idempotency is natural — DB2 rollback handles it. The problem surfaces when the program has external side effects. Be aware of this when adding checkpoint/restart to programs that interact with external systems.
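An idempotent receiver is the most common fix when the receiving system can be changed. A minimal sketch (names are illustrative): the receiver remembers the transaction IDs it has applied and ignores replays.

```python
def make_receiver():
    """Receiver that deduplicates on a unique transaction ID."""
    seen, applied = set(), []

    def receive(txn_id, payload):
        if txn_id in seen:        # replay from a restarted batch run
            return False
        seen.add(txn_id)
        applied.append(payload)   # apply the side effect exactly once
        return True

    return receive, applied

receive, applied = make_receiver()
receive("TXN-0001", "debit 50.00")
receive("TXN-0001", "debit 50.00")   # resent after restart: ignored
print(len(applied))                  # 1
```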
24.4 DB2 Commit Frequency Analysis
Choosing the right commit frequency is an engineering decision, not a guess. It involves tradeoffs among four competing concerns:
- Recovery time — How long does restart take?
- Lock duration — How long are rows locked?
- Log volume — How much log space does each UR consume?
- Commit overhead — How much CPU does each COMMIT cost?
The Tradeoff Matrix
| Commit Frequency | Recovery Time | Lock Duration | Log per UR | Commit Overhead |
|---|---|---|---|---|
| Every 100 records | Minimal (seconds) | Very short | Very small | Very high |
| Every 1,000 records | Low (seconds) | Short | Small | High |
| Every 5,000 records | Low (minutes) | Moderate | Moderate | Moderate |
| Every 10,000 records | Moderate (minutes) | Moderate | Moderate | Low |
| Every 50,000 records | High (tens of minutes) | Long | Large | Very low |
| Never (end of job) | Maximum (full rerun) | Entire run | Entire run | None |
Understanding Commit Overhead
Each DB2 COMMIT is not free. It involves:
- Writing the log buffer to the active log dataset (synchronous I/O)
- Releasing all page and row locks held by the UR
- Internal DB2 bookkeeping for the new UR
On modern z/OS systems with zHyperLink-attached storage, a single COMMIT takes approximately 0.1–0.3 milliseconds of CPU time and 0.5–2 milliseconds of elapsed time. For a program processing 10 million records:
| Commit Frequency | Number of COMMITs | Commit CPU Overhead | Commit Elapsed Overhead |
|---|---|---|---|
| 100 | 100,000 | 10–30 seconds | 50–200 seconds |
| 1,000 | 10,000 | 1–3 seconds | 5–20 seconds |
| 5,000 | 2,000 | 0.2–0.6 seconds | 1–4 seconds |
| 10,000 | 1,000 | 0.1–0.3 seconds | 0.5–2 seconds |
At a commit frequency of 5,000, the overhead is negligible — a fraction of a second of CPU for a job that runs for hours. Even at 1,000, the overhead is minimal. Below 500, you start to notice it, but even then it's usually acceptable.
The practical guideline: For most batch programs, a commit frequency between 1,000 and 10,000 gives a good balance. Start with 5,000 and adjust based on measurement.
Lock Duration and Concurrency
The commit frequency directly controls how long your program holds DB2 locks. Between commits, every row your program updates (or reads with a lock) remains locked. Other programs that need those rows must wait.
Consider a program that updates account balances. With a commit frequency of 50,000, it locks 50,000 account rows at a time. If an online transaction needs one of those rows, it waits. If the wait exceeds the lock timeout threshold (typically 30–60 seconds), the online transaction gets SQLCODE -911 and the user sees an error.
This is why Rob Calloway's original CBNC4500 — with zero commits — was particularly dangerous. It locked every row it touched for the entire 3+ hour run. During that time, no online system could update those rows.
Log Volume and Active Log Sizing
Each unit of recovery generates log records. DB2 writes before-images (for rollback) and after-images (for forward recovery) of every changed row. If a single UR updates 1 million rows and each row is 200 bytes, the log volume for that UR is approximately:
- Before-images: 1,000,000 x 200 bytes = 200 MB
- After-images: 1,000,000 x 200 bytes = 200 MB
- Log record headers and control records: ~50 MB
- Total: ~450 MB for one UR
If your active log datasets are sized at 1 GB each (a common configuration), a single UR that generates 450 MB of log data uses nearly half an active log. If two such programs run concurrently, the active logs fill, forcing an archive switch. During the archive switch, all DB2 logging activity stalls. Every application waiting to write a log record waits.
With a commit frequency of 5,000, the same program generates approximately:
- 5,000 x 200 bytes x 2 (before + after) + overhead = ~2.3 MB per UR
This is trivial. The active log handles it without breaking a sweat.
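The arithmetic generalizes. In this sketch, the 12.5% allowance for log record headers and control records is an assumption back-fitted to the chapter's ~450 MB figure, not a DB2-documented constant:

```python
def log_bytes_per_ur(rows, row_bytes, overhead_factor=0.125):
    """Before-images plus after-images of every changed row, plus a
    rough allowance for log headers and control records (assumed)."""
    return rows * row_bytes * 2 * (1 + overhead_factor)

MB = 1_000_000
print(log_bytes_per_ur(1_000_000, 200) / MB)  # 450.0 -> one huge UR
print(log_bytes_per_ur(5_000, 200) / MB)      # 2.25  -> per 5,000-record UR
```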
📊 CNB's Standard: After the CBNC4500 incident, Kwame established a standard: no batch UR may generate more than 100 MB of log data. This translates to a commit frequency that depends on the row size, but 5,000–10,000 records is typical for CNB's transaction tables.
Making the Commit Frequency Configurable
Hard-coding the commit frequency is a maintenance headache. Different environments (development, QA, production) may need different values. Production may need to change the value during a particularly busy night.
The best practice is to read the commit frequency from a control table or a parameter:
01 WS-PARM-AREA.
05 WS-PARM-LENGTH PIC S9(04) COMP.
05 WS-PARM-DATA PIC X(100).
01 WS-PARM-FIELDS.
05 WS-COMMIT-FREQ-ALPHA PIC X(10).
05 WS-OTHER-PARM PIC X(90).
PROCEDURE DIVISION USING WS-PARM-AREA.
...
1500-PARSE-PARAMETERS.
IF WS-PARM-LENGTH > 0
UNSTRING WS-PARM-DATA DELIMITED BY ','
INTO WS-COMMIT-FREQ-ALPHA
WS-OTHER-PARM
END-UNSTRING
MOVE FUNCTION NUMVAL(WS-COMMIT-FREQ-ALPHA)
TO WS-COMMIT-FREQUENCY
IF WS-COMMIT-FREQUENCY < 100
OR WS-COMMIT-FREQUENCY > 100000
DISPLAY 'INVALID COMMIT FREQ: '
WS-COMMIT-FREQUENCY
' - USING DEFAULT 5000'
MOVE 5000 TO WS-COMMIT-FREQUENCY
END-IF
ELSE
MOVE 5000 TO WS-COMMIT-FREQUENCY
END-IF
.
The JCL passes the parameter via PARM:
//STEP010 EXEC PGM=CBNC4500,PARM='5000'
Now operations can change the commit frequency without a program recompile. If the batch window is tight one night and they want more frequent commits (shorter lock duration, letting concurrent jobs run faster), they lower the interval. If commit overhead is a concern, they raise it. This flexibility is worth the ten extra lines of code.
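The parse-validate-default rule in the COBOL above is compact enough to restate as a sketch (bounds 100–100,000 and default 5,000, matching the paragraph's checks):

```python
def parse_commit_frequency(parm, default=5000):
    """First comma-separated field of the PARM string, validated
    against sane bounds; anything unusable falls back to the default."""
    try:
        value = int(parm.split(",")[0])
    except ValueError:
        return default
    return value if 100 <= value <= 100_000 else default

print(parse_commit_frequency("5000"))       # 5000
print(parse_commit_frequency("50,XYZ"))     # out of range -> 5000
print(parse_commit_frequency(""))           # empty PARM  -> 5000
```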
24.5 VSAM and Sequential File Checkpoint/Restart
DB2 tables are easy to checkpoint — you commit and the data is safe. VSAM and sequential files are harder because they don't participate in DB2's transaction management. You need explicit strategies for each file type.
VSAM File Repositioning
VSAM KSDS (Key-Sequenced Data Sets) support random access by key. This makes restart repositioning straightforward:
On restart, reposition the VSAM file using the last checkpoint key:
3100-REPOSITION-VSAM-INPUT.
MOVE WS-RESTART-KEY TO VSAM-KEY-FIELD
START VSAM-INPUT-FILE
KEY IS GREATER THAN VSAM-KEY-FIELD
INVALID KEY
DISPLAY 'VSAM REPOSITION FAILED AT KEY: '
VSAM-KEY-FIELD
MOVE 16 TO WS-RETURN-CODE
PERFORM 9000-ABEND-HANDLER
END-START
.
After the START, subsequent READ NEXT operations return records after the checkpoint position.
For VSAM RRDS (Relative Record Data Sets), you store the relative record number in the restart table and use it to reposition.
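As a sketch of the RRDS case, assuming the relative record number saved at checkpoint is restored into WS-RESTART-RRN (file, ddname, and field names are illustrative):

```cobol
       SELECT RRDS-INPUT-FILE ASSIGN TO RRDSIN
           ORGANIZATION IS RELATIVE
           ACCESS MODE IS DYNAMIC
           RELATIVE KEY IS WS-RRN
           FILE STATUS IS WS-RRDS-STATUS.

       3150-REPOSITION-RRDS-INPUT.
           MOVE WS-RESTART-RRN TO WS-RRN
           START RRDS-INPUT-FILE
               KEY IS GREATER THAN WS-RRN
               INVALID KEY
                   DISPLAY 'RRDS REPOSITION FAILED AT RRN: ' WS-RRN
                   MOVE 16 TO WS-RETURN-CODE
                   PERFORM 9000-ABEND-HANDLER
           END-START
           .
```

After the START, READ NEXT resumes with the first record past the checkpoint, exactly as in the KSDS case.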
For VSAM ESDS (Entry-Sequenced Data Sets), repositioning is more complex because ESDS does not support keyed access. You have two options:
- Store the RBA (Relative Byte Address) in the restart table and use it to reposition. This requires low-level access that standard COBOL doesn't provide directly.
- Skip forward on restart by reading and discarding records until you reach the checkpoint count. This works but is slow for large files.
Most shops avoid ESDS for input files that need checkpoint/restart. Use KSDS instead.
VSAM Output File Challenges
VSAM output files present a different challenge. If your program writes records to a VSAM KSDS output file and then fails, the records written since the last checkpoint are already physically in the VSAM file — but the corresponding DB2 changes have been rolled back. You have orphaned VSAM records.
There are three approaches to handle this:
Approach 1: Delete on restart. On restart, delete all VSAM output records written since the last checkpoint. This requires knowing which records were written after the checkpoint — typically by using a timestamp or sequence number stored in the VSAM record.
3200-CLEANUP-VSAM-OUTPUT.
MOVE WS-RESTART-KEY TO VSAM-OUT-KEY
START VSAM-OUTPUT-FILE
KEY IS GREATER THAN VSAM-OUT-KEY
INVALID KEY
GO TO 3200-CLEANUP-DONE
END-START
PERFORM UNTIL WS-VSAM-EOF = 'Y'
READ VSAM-OUTPUT-FILE NEXT
AT END
MOVE 'Y' TO WS-VSAM-EOF
NOT AT END
DELETE VSAM-OUTPUT-FILE
END-READ
END-PERFORM
.
3200-CLEANUP-DONE.
EXIT.
Approach 2: Write to a temporary file first. Write all output to a temporary sequential file, then copy to VSAM in a separate step after the main program completes successfully. This eliminates the VSAM inconsistency problem but adds a step.
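Approach 2 can be sketched in JCL: the main program (MAINPGM here, a placeholder name, with illustrative dataset names) writes only the temporary sequential file, and an IDCAMS REPRO step loads the KSDS only if the main step ended cleanly:

```jcl
//STEP010  EXEC PGM=MAINPGM
//TEMPOUT  DD DSN=PROD.TEMP.OUTPUT,
//            DISP=(NEW,CATLG,DELETE),
//            SPACE=(CYL,(50,10)),
//            DCB=(RECFM=FB,LRECL=200,BLKSIZE=0)
//*
//* COPY TO VSAM ONLY IF THE MAIN STEP SUCCEEDED
// IF (STEP010.RC LE 4) THEN
//STEP020  EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SEQIN    DD DSN=PROD.TEMP.OUTPUT,DISP=SHR
//VSAMOUT  DD DSN=PROD.MASTER.KSDS,DISP=OLD
//SYSIN    DD *
  REPRO INFILE(SEQIN) OUTFILE(VSAMOUT)
/*
// ENDIF
```

If STEP010 fails and is restarted, the KSDS has not been touched, so there are no orphans to clean up; the temporary file is simply regenerated.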
Approach 3: Use CICS File Control for transactional VSAM. If the VSAM file is managed by CICS, you can use CICS recoverable file support, which participates in two-phase commit with DB2. This is the cleanest solution but requires CICS infrastructure.
Sequential Output File Strategies
Sequential output files are the trickiest for checkpoint/restart because you cannot easily "un-write" records from a sequential file. Once a record is written, it's written.
Strategy 1: Generation Data Groups (GDGs). Write each checkpoint's output to a new generation of a GDG. On restart, delete the partial generation and start a new one from the checkpoint position. On successful completion, concatenate all generations into the final output.
//OUTPUT DD DSN=PROD.RECON.OUTPUT(+1),
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(50,10)),
// DCB=(RECFM=FB,LRECL=200,BLKSIZE=0)
Strategy 2: Rewrite from checkpoint. On restart, reallocate the output dataset with DISP=(NEW,CATLG,DELETE) and rewrite all output from the beginning — but only process input records from the checkpoint key forward. This works when the output is a subset transformation of the input. You lose previously written records, but those will be regenerated from the committed data.
Strategy 3: Track byte position. Store the byte offset of the sequential file in the restart table. On restart, position to that offset and continue writing. This requires low-level I/O manipulation and is fragile — avoid it unless no other option works.
💡 Practitioner Note: At CNB, Lisa Park standardized on Strategy 2 for most sequential output files. The rationale: sequential output files are almost always consumed by a downstream job, not directly by users. Rewriting the output from committed DB2 data is safe and simple. The downstream job gets a complete, consistent file regardless of how many times the producing job restarted.
Coordinating Across All Three: DB2 + VSAM + Sequential
The hardest checkpoint/restart scenarios involve programs that read from DB2, update VSAM, and write sequential output — all in the same job step. Each resource type has different transactional capabilities:
| Resource | Participates in DB2 COMMIT? | Can be repositioned on restart? | Can be "rolled back"? |
|---|---|---|---|
| DB2 tables | Yes | Yes (cursor with key > restart_key) | Yes (automatic rollback) |
| VSAM KSDS | No | Yes (START with key) | Manual (delete orphans) |
| Sequential output | No | No (append-only) | No (must rewrite) |
The coordination strategy:
- COMMIT handles DB2. The restart table and all business table updates are committed together.
- VSAM updates use the same key range as DB2. On restart, delete VSAM records written after the last checkpoint key.
- Sequential output is regenerated. On restart, rewrite the output file from committed data.
- The restart table is the single source of truth. It records the last committed key, which tells you exactly where DB2 is consistent, where VSAM cleanup starts, and what sequential output to regenerate.
This three-layer coordination is why application-level checkpointing is more complex than it first appears — and why it's worth investing in a reusable framework rather than coding it ad hoc in every program.
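As a concrete reference point, here is one plausible DDL for the restart table this chapter describes. The column names follow the chapter's examples; the data types, lengths, and the database and tablespace names are illustrative:

```sql
CREATE TABLE RESTART_CONTROL
      (PROGRAM_NAME    CHAR(8)    NOT NULL,
       JOB_NAME        CHAR(8)    NOT NULL,
       STEP_NAME       CHAR(8)    NOT NULL,
       RUN_STATUS      CHAR(1)    NOT NULL,
       LAST_KEY_VALUE  CHAR(32)   NOT NULL WITH DEFAULT,
       RECORDS_READ    INTEGER    NOT NULL WITH DEFAULT,
       RECORDS_WRITTEN INTEGER    NOT NULL WITH DEFAULT,
       RECORDS_UPDATED INTEGER    NOT NULL WITH DEFAULT,
       RECORDS_ERROR   INTEGER    NOT NULL WITH DEFAULT,
       LAST_COMMIT_TS  TIMESTAMP  NOT NULL WITH DEFAULT,
       PRIMARY KEY (PROGRAM_NAME, JOB_NAME, STEP_NAME))
    IN DBRECON.TSRESTRT;
```

RUN_STATUS carries the states used throughout the chapter: 'S' (started), 'C' (checkpointed), 'E' (ended). Any accumulator that feeds end-of-job control totals belongs here as well.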
24.6 Multi-Step Job Checkpoint Strategy
Real batch jobs are not single steps. They are multi-step JCL jobs where each step depends on the output of the previous step. Checkpoint/restart must work at the job level, not just the step level.
Step-Level Restart with COND and IF/THEN/ELSE
JCL provides the COND parameter and IF/THEN/ELSE/ENDIF constructs for conditional step execution. When combined with restart, these control which steps execute on a restart run.
Consider a three-step job:
//JOBRECON JOB (ACCT),'DAILY RECON',CLASS=A,MSGCLASS=X
//*
//STEP010 EXEC PGM=EXTRACT,RD=R
//INPUT DD DSN=PROD.DAILY.TRANS,DISP=SHR
//OUTPUT DD DSN=PROD.EXTRACT.DATA,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(100,20))
//*
//STEP020 EXEC PGM=MATCH,RD=R
//INPUT DD DSN=PROD.EXTRACT.DATA,DISP=SHR
//MASTER DD DSN=PROD.ACCT.MASTER,DISP=SHR
//OUTPUT DD DSN=PROD.MATCHED.DATA,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(50,10))
//*
//STEP030 EXEC PGM=REPORT,RD=R
//INPUT DD DSN=PROD.MATCHED.DATA,DISP=SHR
//REPORT DD SYSOUT=*
If STEP020 abends, you want to restart from STEP020 — not from STEP010. You can specify this with the RESTART parameter:
//JOBRECON JOB (ACCT),'DAILY RECON',CLASS=A,MSGCLASS=X,
// RESTART=STEP020
But here's the problem: STEP020's input (PROD.EXTRACT.DATA) was created by STEP010. If STEP020 is restarted, STEP010 doesn't run, so the input dataset must already exist from the previous run. The DISP on STEP020's input DD must be SHR or OLD, not NEW.
For STEP010's output dataset, if it was created successfully in the first run, it still exists. The restart run skips STEP010, so the DISP=(NEW,...) on STEP010 is not executed. This works correctly.
The Passed Dataset Problem
If STEP010 passes the dataset to STEP020 using DISP=(NEW,PASS), a restart from STEP020 fails: passed datasets exist only for the life of the original job execution, and a resubmitted restart is a new job as far as the system is concerned. On restart, the passed dataset is gone.
Solution: For jobs that need checkpoint/restart, use cataloged datasets instead of passed datasets. The small overhead of cataloging is irrelevant compared to the restart capability you gain.
Multi-Step Restart Table Coordination
When multiple steps in a job all use application-level checkpointing with a restart table, the restart table must record state per step. This is why the restart table has a STEP_NAME column.
The job-level restart strategy:
- Each step reads its own row from the restart table using PROGRAM_NAME + JOB_NAME + STEP_NAME as the key.
- On fresh start, each step initializes its row to RUN_STATUS = 'S'.
- On restart, the scheduler restarts from the failed step. Previous steps' restart table rows still show RUN_STATUS = 'E' (completed), so if those steps accidentally re-execute, they quickly determine they already finished and exit with RC=0.
- On successful completion, each step sets RUN_STATUS = 'E'.
This means each step should include logic like:
2000-CHECK-IF-ALREADY-DONE.
PERFORM 2010-READ-RESTART-TABLE
IF RESTART-ENDED
DISPLAY 'STEP ALREADY COMPLETED - SKIPPING'
MOVE 0 TO RETURN-CODE
STOP RUN
END-IF
.
This is a safety net. The JCL restart should skip completed steps, but defense-in-depth means the program also checks.
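A sketch of what the 2010-READ-RESTART-TABLE paragraph might contain, using the three-part key (host variable names are illustrative):

```sql
EXEC SQL
    SELECT RUN_STATUS, LAST_KEY_VALUE,
           RECORDS_READ, RECORDS_WRITTEN,
           RECORDS_UPDATED, RECORDS_ERROR
      INTO :WS-RUN-STATUS, :WS-RESTART-KEY,
           :WS-RESTART-REC-READ, :WS-RESTART-REC-WRITTEN,
           :WS-RESTART-REC-UPDATED, :WS-RESTART-REC-ERROR
      FROM RESTART_CONTROL
     WHERE PROGRAM_NAME = :WS-PROGRAM-NAME
       AND JOB_NAME     = :WS-JOB-NAME
       AND STEP_NAME    = :WS-STEP-NAME
END-EXEC
```

A not-found SQLCODE (+100) here means the step has no registered row, which should be treated as a setup error rather than silently starting fresh.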
Conditional Execution and Restart
Modern JCL uses IF/THEN/ELSE for conditional execution:
// IF (STEP010.RC <= 4) THEN
//STEP020 EXEC PGM=MATCH
// ...
// ENDIF
On restart from STEP020, the IF condition is not re-evaluated — JES skips directly to the restart step. This is usually what you want. But be aware: if STEP010's return code influenced which path the job took, and you restart from a step inside a conditional block, the condition is assumed to be true.
The Job Completion Marker
At CNB, every multi-step batch job ends with a "completion marker" step:
//STEPFIN EXEC PGM=IEFBR14
//MARKER DD DSN=PROD.JOBRECON.COMPLETE.D&LYYMMDD,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(TRK,0)
This creates a zero-length dataset whose existence proves the job completed successfully. Downstream jobs check for this dataset before starting. If the job failed and was restarted, the marker is only created when all steps complete. This prevents downstream jobs from running on incomplete data.
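One way a downstream job can test for the marker, assuming it is not handled by the scheduler, is an IDCAMS LISTCAT step; LISTCAT sets a nonzero condition code when the entry is not cataloged. The dataset name and downstream program name below are illustrative:

```jcl
//CHKMARK  EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  LISTCAT ENTRIES(PROD.JOBRECON.COMPLETE.D240315)
/*
// IF (CHKMARK.RC = 0) THEN
//STEP010  EXEC PGM=DOWNSTRM
//* ... downstream processing ...
// ENDIF
```

In practice most shops let the scheduler enforce this dependency; the JCL check is a defense-in-depth backstop.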
24.7 Testing Checkpoint/Restart
You cannot trust checkpoint/restart logic that has never been tested. And yet, testing it is one of the most commonly skipped activities in mainframe development. The reason is simple: it's hard. You have to simulate failures, verify recovery, and confirm data consistency — all in an environment where failures are, by definition, abnormal.
The Test Plan
Every checkpoint/restart implementation needs a test plan that covers these scenarios:
Scenario 1: Normal completion — fresh start.
- Run the program from the beginning.
- Verify all records are processed.
- Verify the restart table shows RUN_STATUS = 'E'.
- Verify control totals are correct.

Scenario 2: Normal completion — restart after completion.
- Run the program again without resetting the restart table.
- Verify it detects the previous run completed (RUN_STATUS = 'E') and starts fresh.
- Verify results match Scenario 1.

Scenario 3: Failure after first checkpoint.
- Run the program, and after the first or second checkpoint, simulate a failure.
- Verify the restart table shows RUN_STATUS = 'C' with the correct key and counts.
- Restart the program.
- Verify it resumes from the checkpoint position.
- Verify final control totals match Scenario 1.

Scenario 4: Failure before first checkpoint.
- Run the program and simulate a failure before the first commit.
- Verify the restart table shows RUN_STATUS = 'S'.
- Restart the program.
- Verify it starts from the beginning (no checkpoint to resume from).
- Verify final results match Scenario 1.

Scenario 5: Multiple failures.
- Run the program, simulate failure, restart, simulate another failure, restart again.
- Verify the program handles consecutive restarts correctly.
- Verify final results match Scenario 1.

Scenario 6: Failure with VSAM and sequential coordination.
- Run the program, let it write to VSAM and sequential output, simulate failure.
- Verify VSAM cleanup occurs on restart (orphaned records deleted).
- Verify sequential output is correct after restart.
- Verify final results match Scenario 1.
Simulating Failures
There are several ways to simulate failures in a test environment:
Method 1: ABEND code in the program. Add a testing hook that abends after a configurable number of records:
01 WS-TEST-ABORT-AFTER PIC S9(09) COMP VALUE ZERO.
01 WS-ABEND-CODE PIC S9(09) COMP VALUE 1000.
01 WS-TIMING PIC S9(09) COMP VALUE 1.
...
4500-CHECK-TEST-ABORT.
IF WS-TEST-ABORT-AFTER > ZERO
AND WS-RESTART-REC-READ >= WS-TEST-ABORT-AFTER
DISPLAY 'TEST ABORT AFTER ' WS-RESTART-REC-READ
' RECORDS'
EXEC SQL ROLLBACK END-EXEC
CALL 'CEE3ABD' USING WS-ABEND-CODE WS-TIMING
END-IF
.
Pass the abort-after count via PARM: PARM='5000,7500' (commit frequency 5000, abort after 7500 records). In production, the second parameter is zero or omitted.
Method 2: DB2 DSNTEP2 to update the restart table. Between runs, use a DB2 utility to manipulate the restart table to simulate a mid-run state:
UPDATE RESTART_CONTROL
SET RUN_STATUS = 'C',
LAST_KEY_VALUE = '00050000',
RECORDS_READ = 50000,
RECORDS_WRITTEN = 48500,
RECORDS_UPDATED = 50000,
RECORDS_ERROR = 1500
WHERE PROGRAM_NAME = 'CBNC4500'
AND JOB_NAME = 'JOBRECON'
AND STEP_NAME = 'STEP010';
COMMIT;
Then run the program and verify it restarts from the simulated checkpoint.
Method 3: Cancel the job. Submit the job and cancel it while it's running. This simulates the most realistic failure mode: an unexpected termination. The downside is timing — you may not cancel it at the exact point you want.
Verifying Data Consistency
After every restart test, you must verify that the final results are identical to a clean run. This means:
- Record counts match. The total records processed (read, written, updated, error) must be identical whether the job ran cleanly or restarted five times.
- Control totals match. Accumulated amounts, hash totals, and balance figures must be identical.
- DB2 data matches. Run a query to compare the final state of all updated tables against a baseline from a clean run.
- Output files match. Compare the sequential output from a restart run against the output from a clean run. They should be byte-for-byte identical.
//VERIFY EXEC PGM=IEBCOMPR
//SYSUT1 DD DSN=PROD.CLEAN.RUN.OUTPUT,DISP=SHR
//SYSUT2 DD DSN=PROD.RESTART.RUN.OUTPUT,DISP=SHR
//SYSPRINT DD SYSOUT=*
//SYSIN DD DUMMY
If IEBCOMPR reports any differences, the checkpoint/restart logic has a bug.
Rob's Testing Rule
"If you haven't tested your checkpoint/restart by actually killing the job mid-run and restarting it, you haven't tested it. A code review doesn't count. A desk check doesn't count. Kill it. Restart it. Verify every number." — Rob Calloway
At CNB, no batch program with checkpoint/restart goes into production without a sign-off from operations that the restart was tested end-to-end. This is part of the production readiness checklist that Kwame instituted after the CBNC4500 incident.
Automated Restart Testing
For ongoing regression testing, CNB uses a testing harness that:
- Loads a known test dataset into DB2 and VSAM
- Runs the program cleanly to establish a baseline
- Runs the program with TEST-ABORT-AFTER set to various values (10%, 25%, 50%, 75%, 90% of input)
- Restarts after each abort
- Compares final results against the baseline
- Reports any discrepancies
This harness runs monthly as part of the batch regression test suite. It has caught three bugs since it was implemented — all in edge cases where the restart key handling was slightly wrong for boundary records.
24.8 Checkpoint/Restart in the HA Banking System
Now let's apply everything we've learned to the Progressive Project: the HA Banking Transaction Processing System. This section designs the checkpoint/restart strategy for the banking batch pipeline.
The HA Banking Batch Pipeline
The HA system processes daily banking transactions in a batch pipeline with these steps:
| Step | Program | Input | Output | DB2 Tables |
|---|---|---|---|---|
| STEP010 | HAEXTRACT | DB2 TRANS_STAGING | SEQ: TRANS.EXTRACT | Reads TRANS_STAGING |
| STEP020 | HAVALIDATE | SEQ: TRANS.EXTRACT | SEQ: VALID.TRANS + SEQ: REJECT.TRANS | Reads ACCT_MASTER (VSAM) |
| STEP030 | HAPOSTING | SEQ: VALID.TRANS | SEQ: POST.AUDIT | Updates ACCT_MASTER (VSAM), ACCT_BALANCE (DB2) |
| STEP040 | HAREPORT | DB2 ACCT_BALANCE, SEQ: POST.AUDIT | Report (SYSOUT) | Reads ACCT_BALANCE |
Each step has different checkpoint/restart requirements based on its resource access patterns.
STEP010: HAEXTRACT — DB2 to Sequential
This step reads from DB2 and writes to a sequential file. It does not update DB2.
Checkpoint strategy:
- Commit frequency: 10,000 (reading only, no lock concerns)
- Restart: rewrite the sequential output file from the checkpoint position
- Restart table: stores the last account number extracted
Since HAEXTRACT only reads DB2 (no updates), the commit frequency controls checkpoint interval, not lock duration. We can use a higher value.
On restart, the sequential output must be rewritten. Strategy: use DISP=(MOD,...) with careful byte-position tracking, or (simpler) regenerate the entire output from committed data. Since the input is a stable DB2 table and the extract is fast, regeneration is acceptable.
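The restartable extract cursor might be declared like this. TRANS_STAGING comes from the pipeline table above; the ACCT_NO and TRAN_DATA columns and the host variable are assumptions:

```sql
EXEC SQL
    DECLARE EXTRACT_CSR CURSOR FOR
    SELECT ACCT_NO, TRAN_DATA
      FROM TRANS_STAGING
     WHERE ACCT_NO > :WS-RESTART-KEY
     ORDER BY ACCT_NO
END-EXEC
```

On a fresh start, WS-RESTART-KEY is initialized below the lowest possible key, so the same cursor serves both fresh-start and restart runs.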
STEP020: HAVALIDATE — Sequential + VSAM Read
This step reads sequential input and reads (but does not write) VSAM. It writes two sequential output files.
Checkpoint strategy:
- This step is read-only for persistent stores — it reads sequential input and VSAM, writes sequential output.
- Commit frequency: N/A (no DB2 updates). Use the restart table for positioning only, committed every 5,000 records.
- Restart: re-read input from the last checkpoint position (skip forward), regenerate output files.
- The restart table stores the last input record sequence number.
For sequential input repositioning on restart:
3100-SKIP-TO-RESTART-POINT.
MOVE ZERO TO WS-SKIP-COUNT
PERFORM UNTIL WS-SKIP-COUNT >=
WS-RESTART-REC-READ
READ INPUT-FILE INTO WS-INPUT-RECORD
AT END
DISPLAY 'UNEXPECTED EOF DURING SKIP'
MOVE 16 TO WS-RETURN-CODE
PERFORM 9000-ABEND-HANDLER
END-READ
ADD 1 TO WS-SKIP-COUNT
END-PERFORM
DISPLAY 'SKIPPED ' WS-SKIP-COUNT
' RECORDS TO RESTART POINT'
.
STEP030: HAPOSTING — The Critical Step
This is the most complex step. It reads sequential input, updates VSAM (account master), and updates DB2 (account balance). It must coordinate all three resource types.
Checkpoint strategy:
- Commit frequency: 2,000 (updates DB2 and VSAM — lock duration matters)
- Lower commit frequency than other steps because this step updates both DB2 and VSAM, and the account balance rows are also accessed by online banking.
- Restart table stores: last transaction key processed, running totals for debits and credits, record counts.

VSAM coordination:
- On restart, the VSAM account master may have partial updates. Since HAPOSTING updates account balances by adding/subtracting amounts, the VSAM updates since the last checkpoint must be reversed.
- Strategy: store the keys and amounts of all VSAM updates since the last checkpoint in a DB2 staging table (committed with each checkpoint). On restart, reverse those updates before resuming.
Alternatively (and this is what the HA system uses):
- The VSAM account master stores a "last update timestamp."
- On restart, for any record updated after the last checkpoint timestamp, the program reverses the update using the before-image stored in the DB2 audit trail.
Sequential input repositioning uses the same skip-forward approach as STEP020. DB2 coordination is automatic — the COMMIT/ROLLBACK handles it.
STEP040: HAREPORT — Read-Only
This step only reads data and produces a report. No checkpoint/restart is needed — if it fails, rerun it from the beginning. It runs in under 10 minutes and produces no persistent output other than a report.
Decision: No checkpoint/restart for STEP040. This is a legitimate design choice. Not every step needs checkpointing. If a step is fast, read-only, and produces no persistent state changes, the overhead of checkpoint/restart logic is not justified.
The Complete JCL
//HABATCH JOB (ACCT),'HA DAILY BATCH',CLASS=A,MSGCLASS=X,
// NOTIFY=&SYSUID
//*
//* ---- STEP 1: EXTRACT TRANSACTIONS FROM DB2 ----
//*
//STEP010 EXEC PGM=HAEXTRACT,RD=R,
// PARM='10000'
//STEPLIB DD DSN=PROD.HA.LOADLIB,DISP=SHR
//SYSPRINT DD SYSOUT=*
//OUTPUT DD DSN=PROD.HA.TRANS.EXTRACT.D&LYYMMDD,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(200,50)),
// DCB=(RECFM=FB,LRECL=500,BLKSIZE=27500)
//SYSCHK DD DSN=PROD.HA.CHKPT.STEP010,
// DISP=(NEW,KEEP,KEEP),
// SPACE=(CYL,(2,2)),
// UNIT=SYSDA
//*
//* ---- STEP 2: VALIDATE TRANSACTIONS ----
//*
//STEP020 EXEC PGM=HAVALIDATE,RD=R,
// PARM='5000'
//STEPLIB DD DSN=PROD.HA.LOADLIB,DISP=SHR
//SYSPRINT DD SYSOUT=*
//INPUT DD DSN=PROD.HA.TRANS.EXTRACT.D&LYYMMDD,DISP=SHR
//ACCTMSTR DD DSN=PROD.HA.ACCT.MASTER,DISP=SHR
//VALIDOUT DD DSN=PROD.HA.VALID.TRANS.D&LYYMMDD,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(150,30)),
// DCB=(RECFM=FB,LRECL=500,BLKSIZE=27500)
//REJECTS DD DSN=PROD.HA.REJECT.TRANS.D&LYYMMDD,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(10,5)),
// DCB=(RECFM=FB,LRECL=600,BLKSIZE=27000)
//*
//* ---- STEP 3: POST TRANSACTIONS ----
//*
//STEP030 EXEC PGM=HAPOSTING,RD=R,
// PARM='2000'
//STEPLIB DD DSN=PROD.HA.LOADLIB,DISP=SHR
//SYSPRINT DD SYSOUT=*
//INPUT DD DSN=PROD.HA.VALID.TRANS.D&LYYMMDD,DISP=SHR
//ACCTMSTR DD DSN=PROD.HA.ACCT.MASTER,DISP=OLD
//POSTAUDT DD DSN=PROD.HA.POST.AUDIT.D&LYYMMDD,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(100,20)),
// DCB=(RECFM=FB,LRECL=400,BLKSIZE=27600)
//*
//* ---- STEP 4: GENERATE REPORTS ----
//*
//STEP040 EXEC PGM=HAREPORT,RD=NC
//STEPLIB DD DSN=PROD.HA.LOADLIB,DISP=SHR
//SYSPRINT DD SYSOUT=*
//INPUT DD DSN=PROD.HA.POST.AUDIT.D&LYYMMDD,DISP=SHR
//REPORT DD SYSOUT=*
//*
//* ---- COMPLETION MARKER ----
//*
//STEPFIN EXEC PGM=IEFBR14
//MARKER DD DSN=PROD.HA.BATCH.COMPLETE.D&LYYMMDD,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(TRK,0)
Note the RD parameter values:
- STEP010, STEP020, STEP030: RD=R (checkpoint and restart enabled)
- STEP040: RD=NC (no checkpoint, no restart — read-only report step)
Recovery Scenarios
Scenario A: STEP030 abends at record 150,000 of 500,000.
1. DB2 automatically rolls back the current UR (records 150,001 to the failure point).
2. The restart table shows LAST_KEY_VALUE for the last committed checkpoint. With a commit frequency of 2,000, that is the nearest multiple of 2,000 at or below the failure point: 150,000 if the commit at that record completed, otherwise 148,000.
3. Operator restarts with RESTART=STEP030.
4. HAPOSTING reads the restart table, finds RUN_STATUS='C', resumes from the checkpoint.
5. VSAM cleanup: reverses any VSAM updates made after the last checkpoint.
6. Processing continues from record 150,001 (approximately).
7. Recovery time: minutes, not hours.
Scenario B: STEP010 abends due to DB2 space issue.
1. DBA resolves the space issue.
2. Operator restarts with RESTART=STEP010.
3. HAEXTRACT resumes from its last checkpoint.
4. STEP020, STEP030, STEP040 run after STEP010 completes.
Scenario C: STEP020 abends, but the operator doesn't notice until STEP030 has started (impossible with standard JCL, but consider automation errors).
1. The completion marker dataset does not exist.
2. Downstream jobs wait.
3. The operator investigates and finds STEP020 failed.
4. Restart from STEP020.
5. STEP030 re-executes because it depends on STEP020 output.
6. STEP030's restart table detects the fresh input and reinitializes.
24.9 Spaced Review: Connecting to Prior Chapters
This chapter builds on foundations laid in three earlier chapters. Let's explicitly connect them.
Chapter 4: Datasets — File Positioning
In Chapter 4, you learned about sequential and VSAM dataset organization — QSAM buffering, VSAM KSDS key access, and how z/OS manages file I/O. That knowledge is directly applied here:
- Sequential file repositioning on restart depends on understanding how QSAM reads work (Section 24.5).
- VSAM START and READ NEXT operations for restart repositioning use the KSDS keyed access path you learned in Chapter 4.
- The choice between KSDS, RRDS, and ESDS for checkpoint/restart compatibility depends on the access patterns covered in Chapter 4.
Review question: Why is a VSAM ESDS problematic for checkpoint/restart, while a KSDS handles it naturally? (Answer: ESDS has no key — you cannot directly position to a specific record. KSDS supports START with a key, enabling direct repositioning.)
Chapter 8: Locking — Commit Frequency vs. Lock Duration
Chapter 8 covered DB2 locking: lock modes (S, X, U, IS, IX), lock escalation, deadlock detection, and timeout handling. The commit frequency analysis in Section 24.4 is a direct application:
- Lock duration equals the time between commits. Short commit intervals mean short lock hold times.
- Lock escalation from row to page to tablespace occurs when too many individual locks are held. Frequent commits release locks and prevent escalation.
- SQLCODE -911 (deadlock/timeout) is exactly what killed Rob's CBNC4500. Frequent commits reduce the window for deadlocks.
Review question: If a batch program commits every 5,000 records and processes 200 records per second, what is the maximum lock hold time? (Answer: 5,000 / 200 = 25 seconds. Any row locked by this program is released within 25 seconds.)
Chapter 23: Batch Window — Restart Impact
Chapter 23 analyzed batch window management — scheduling, critical path analysis, and the consequences of overruns. Checkpoint/restart directly protects the batch window:
- A 4-hour job that fails at 3 hours without checkpointing needs 7+ hours total. With checkpointing, it needs ~4 hours 15 minutes.
- Restart recovery time is bounded by the commit interval: maximum recovery overhead is the time to reprocess one commit interval's worth of records.
- Multi-step job restart (Section 24.6) avoids re-running completed steps, further protecting the batch window.
Review question: If the batch window is 6 hours and a critical 4-hour job fails at the 3-hour mark, can it still complete within the window? (Answer: Without checkpointing, no — it needs 7 hours. With checkpointing at 5,000-record intervals, yes — it needs approximately 4 hours plus restart overhead of a few minutes.)
24.10 Common Mistakes and How to Avoid Them
Twenty-five years of mainframe batch has shown me these mistakes repeatedly. Learn from other people's failures.
Mistake 1: Committing the Restart Table Separately from Business Data
The bug: The program updates DB2 business tables, commits, then updates the restart table, then commits again. If the program fails between the two commits, the restart table doesn't reflect the committed business data. On restart, the program reprocesses records that were already committed — creating duplicates.
The fix: One COMMIT that covers both business data and the restart table. Always.
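A sketch of the correct shape: the business-table updates made earlier in the same unit of recovery and the restart-table update are sealed by a single COMMIT (paragraph and host variable names are illustrative):

```cobol
       4000-TAKE-CHECKPOINT.
           EXEC SQL
               UPDATE RESTART_CONTROL
                  SET RUN_STATUS      = 'C',
                      LAST_KEY_VALUE  = :WS-LAST-KEY,
                      RECORDS_READ    = :WS-RESTART-REC-READ,
                      RECORDS_WRITTEN = :WS-RESTART-REC-WRITTEN
                WHERE PROGRAM_NAME = :WS-PROGRAM-NAME
                  AND JOB_NAME     = :WS-JOB-NAME
                  AND STEP_NAME    = :WS-STEP-NAME
           END-EXEC
           EXEC SQL COMMIT END-EXEC
           .
```

There is exactly one COMMIT per checkpoint, and the restart-table UPDATE executes immediately before it, inside the same unit of recovery as the business updates.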
Mistake 2: Not Testing Restart After the Last Checkpoint
The bug: The program checkpoints at records 5000, 10000, 15000, and the last record is 17500. Testing only covers failure at exact checkpoint boundaries. No one tests failure at record 16200 — after the last checkpoint but before completion. The restart logic has a subtle bug in this case (e.g., it doesn't handle the partial batch of 1200 records correctly).
The fix: Test failure at non-checkpoint boundaries. Specifically test: before first checkpoint, at a checkpoint, between checkpoints, and after the last checkpoint but before completion.
Mistake 3: Forgetting to Preserve Accumulators
The bug: The program maintains running totals — total debit amount, total credit amount, transaction counts. On restart, it restores the record-processed count from the restart table but reinitializes the accumulators to zero. The end-of-job control totals are wrong — they only reflect records processed since the last restart.
The fix: Store all accumulators in the restart table. Every counter, every running total, every hash value that contributes to end-of-job reporting.
Mistake 4: Hardcoding the Commit Frequency
The bug: The commit frequency is a literal in the COBOL source: IF FUNCTION MOD(WS-RECORD-COUNT, 5000) = 0. To change it, you must modify source, compile, link-edit, and promote. On a night when the batch window is tight and you need a different commit frequency, you're stuck.
The fix: Read the commit frequency from PARM or a control table. Validate it within a reasonable range (100–100,000). Default to a sensible value if not provided.
Mistake 5: Not Handling the "Already Completed" Case
The bug: The restart table shows RUN_STATUS = 'E' (completed), but someone accidentally submits the job again. The program doesn't check — it processes all records again, creating duplicates.
The fix: On startup, if RUN_STATUS = 'E', either (a) treat it as a fresh start (reset everything and reprocess — appropriate if the input is idempotent) or (b) skip processing and exit with RC=0 (appropriate if the job should only run once per day). Choose based on your business rules, but always handle this case explicitly.
Mistake 6: Not Logging Checkpoint Information
The bug: The program takes checkpoints but doesn't write any messages to SYSPRINT or the job log. When the program restarts, operations has no way to confirm that it's actually resuming from a checkpoint. When something goes wrong, there's no audit trail of when checkpoints were taken and what state was saved.
The fix: Log every checkpoint: checkpoint number, key value, record counts, and timestamp. Log the restart detection at startup: fresh start or restart, and if restart, what key and counts are being restored. This logging is invaluable for production debugging and operator confidence.
Mistake 7: Sequential Output Without Regeneration Strategy
The bug: The program writes 100,000 records to a sequential output file, then fails. On restart, it appends the remaining records to the same file. The file now has 100,000 records that correspond to rolled-back DB2 changes, followed by the correct records from the restart point. The downstream job processes all records, including the orphaned first 100,000.
The fix: On restart, delete and recreate the sequential output file, then regenerate output from committed data. Or use a GDG approach where each checkpoint writes to a new generation.
Chapter Summary
Checkpoint/restart is not an optional enhancement for serious batch programs. It is a fundamental design requirement. The key principles:
- Design for recovery, not prevention. Accept that failures will happen. Design your program so that recovery is fast and automatic.
- Use application-level checkpointing. The z/OS checkpoint/restart facility is a useful safety net, but application-level checkpointing with a restart table gives you full control over DB2, VSAM, and sequential file coordination.
- Commit the restart table with the business data. One COMMIT, one unit of recovery. This is the atomicity guarantee that makes restart correct.
- Choose commit frequency based on tradeoffs. Balance recovery time, lock duration, log volume, and commit overhead. Start with 5,000 records and adjust based on measurement.
- Coordinate across all resource types. DB2 handles itself via COMMIT/ROLLBACK. VSAM needs explicit cleanup. Sequential files need regeneration. The restart table is the single source of truth.
- Test by actually killing the job. Code review is not enough. Run the program, kill it, restart it, and verify every number matches a clean run.
- Make it configurable. Commit frequency, test abort points, and restart behavior should be parameters, not compiled-in constants.
Rob's CBNC4500 incident cost CNB a late wire transfer, a conversation with the Fed, and three missed SLAs. The redesigned program — with application-level checkpointing at 5,000-record intervals, a restart table, and tested recovery procedures — has been restarted nine times in the seven years since. Average recovery time: 4 minutes. Zero missed SLAs.
That is the difference checkpoint/restart makes.