Learning Objectives
- Design checkpoint/restart logic for COBOL batch programs that coordinate across DB2, VSAM, and sequential files
- Implement application-level checkpointing using DB2 COMMIT frequency and restart table patterns
- Configure JCL for automatic restart using the z/OS checkpoint/restart facility and RD parameter
- Analyze checkpoint frequency tradeoffs (commit interval, recovery time, lock duration, log volume)
- Design checkpoint/restart for the HA banking system's batch processing pipeline
In This Chapter
- 24.1 Why Checkpoints Matter
- 24.2 The z/OS Checkpoint/Restart Facility
- 24.3 Application-Level Checkpointing in COBOL
- 24.4 DB2 Commit Frequency Analysis
- 24.5 VSAM and Sequential File Checkpoint/Restart
- 24.6 Multi-Step Job Checkpoint Strategy
- 24.7 Testing Checkpoint/Restart
- 24.8 Checkpoint/Restart in the HA Banking System
- 24.9 Spaced Review: Connecting to Prior Chapters
- 24.10 Common Mistakes and How to Avoid Them
- Chapter Summary
Chapter 24: Checkpoint/Restart Design — Building Batch Programs That Survive Any Failure
"You don't build checkpoint/restart because you think your job will fail. You build it because the one time it does, at 2:47 AM on a Saturday before a holiday Monday, you want to go back to sleep instead of rerunning six hours of processing." — Rob Calloway, CNB Infrastructure Lead
24.1 Why Checkpoints Matter
Rob Calloway has been running batch systems at Central National Bank for nineteen years. He remembers every major outage. Not because he keeps a log — though he does — but because each one carved a lesson into his operational instincts the way a chisel cuts stone.
The one he tells new hires about happened on a Thursday in March, 2019. CNB's nightly batch cycle included a job called CBNC4500 — the daily account reconciliation program. It read 14.2 million transaction records from a DB2 table, matched them against VSAM master files, wrote adjustment records to a sequential output dataset, and updated three DB2 tables along the way. On a good night, it ran in 2 hours and 47 minutes. On that Thursday, it had been running for 3 hours and 52 minutes when a storage controller hiccupped. The DB2 address space took an abnormal termination. The job abended with a -911 SQLCODE — deadlock/timeout.
The program had no checkpoint logic. Zero. It had been written in 1997 by a contractor who was long gone, and it processed all 14.2 million records as a single unit of work. When it failed, DB2 rolled back every update it had made in those 3 hours and 52 minutes. The rollback itself took 48 minutes. Then the entire job had to be restarted from the beginning. Total elapsed time before the batch window closed: 7 hours and 27 minutes. Three downstream jobs missed their SLA. The wire transfer file was late. The Fed noticed.
Rob's post-mortem had one recommendation: implement checkpoint/restart for every batch program that processes more than 100,000 records or runs longer than 30 minutes.
That recommendation is the foundation of this chapter.
The Cost of Not Checkpointing
Let's be precise about what checkpoint/restart gives you, because vague benefits don't survive budget meetings:
Recovery time. Without checkpointing, a failed job restarts from record one. With checkpointing every 5,000 records, a job that fails at record 4,200,000 restarts from record 4,200,000 — not from zero. If processing takes 0.001 seconds per record, you save 4,200 seconds (70 minutes) of reprocessing.
Batch window protection. Chapter 23 covered the reality of shrinking batch windows. A 4-hour job that fails at the 3-hour mark and has to restart from scratch needs 7 hours total — if nothing else goes wrong. With checkpoint/restart, it needs 4 hours and perhaps 15 minutes. That's the difference between making the window and blowing it.
Lock duration. A program that commits every 5,000 records holds DB2 locks for the time it takes to process 5,000 records — maybe 5 seconds. A program that never commits holds locks for the entire run. Other programs that need those rows wait. Batch throughput degrades. Online systems that share those tables may timeout. This is not theoretical — we covered the mechanics of lock escalation in Chapter 8.
Log volume. DB2 writes before-images of every changed row to the active log. A single unit of work that updates 14.2 million rows generates enormous log volume. If the active log fills before the commit, DB2 forces an archive, and the entire system slows. Frequent commits keep the active log manageable.
Operational confidence. Operations staff who know that a failing job can be restarted from its last checkpoint handle incidents differently than staff who know that failure means a complete rerun. The first group follows procedure. The second group panics.
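The recovery-time arithmetic above is worth sanity-checking. Here is a small, illustrative model — the record counts and per-record timing are the chapter's example figures, not measurements:

```python
def reprocessing_saved_seconds(failure_record, commit_interval, sec_per_record):
    """Seconds of rework avoided by restarting from the last checkpoint
    instead of from record one."""
    last_checkpoint = (failure_record // commit_interval) * commit_interval
    without_checkpoint = failure_record * sec_per_record
    with_checkpoint = (failure_record - last_checkpoint) * sec_per_record
    return without_checkpoint - with_checkpoint

# Failure at record 4,200,000, checkpoints every 5,000 records,
# 0.001 seconds per record: about 4,200 seconds (70 minutes) saved.
saved = reprocessing_saved_seconds(4_200_000, 5_000, 0.001)
print(round(saved / 60))
```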
The Threshold Concept: Design for Recovery, Not Prevention
Here is the mental shift that separates junior mainframe developers from senior ones:
Checkpoint/restart is not about preventing failure. It is about making failure recovery fast and automatic.
You cannot prevent all failures. Hardware fails. Software has bugs. Network connections drop. Databases run out of space. The question is not "will this job ever fail?" The answer to that is always yes. The question is "when this job fails, how fast can we recover?"
The programs that survive in production for decades are not the ones that never fail. They are the ones that fail gracefully — that leave the system in a known state, that can be restarted without manual intervention, that pick up exactly where they left off.
This is the principle we will apply throughout this chapter: design every batch program so that failure at any point results in fast, automatic recovery.
Checkpoint/Restart Terminology
Before we go further, let's nail down the vocabulary. You will hear these terms used loosely in conversation. In this chapter, they have precise meanings:
| Term | Definition |
|---|---|
| Checkpoint | A recorded point in a program's execution from which processing can resume after a failure. Includes saving the program's position in its input files, the state of its counters and accumulators, and committing database changes. |
| Restart | The process of resuming a program's execution from a previously recorded checkpoint rather than from the beginning. |
| Commit | A database operation that makes all changes since the last commit (or program start) permanent. In DB2, this is EXEC SQL COMMIT. |
| Rollback | A database operation that undoes all changes since the last commit. In DB2, EXEC SQL ROLLBACK. |
| Unit of Recovery (UR) | The set of database changes between two consecutive commits. If a failure occurs, the current UR is rolled back. |
| Restart Table | An application-maintained DB2 table that stores checkpoint information — record counts, key values, timestamps — so the program knows where to resume. |
| Commit Frequency / Commit Interval | The number of records processed between commits. A commit frequency of 5,000 means the program commits after every 5,000 records. |
| Forward Recovery | Reapplying committed changes from a log to bring a database forward to a consistent state after a media failure. |
| Backward Recovery | Undoing uncommitted changes (rollback) to return a database to its last consistent state after a program failure. |
| Checkpoint Dataset | A z/OS dataset written by the CHKPT macro to save program state for the z/OS checkpoint/restart facility. |
| RD Parameter | A JCL parameter coded on the JOB or EXEC statement that controls automatic restart and checkpoint behavior at the job or step level. |
24.2 The z/OS Checkpoint/Restart Facility
z/OS provides a system-level checkpoint/restart facility that has been part of the operating system since the MVS days. It works at the job step level and is primarily useful for programs that process sequential datasets. Let's understand what it offers and where it falls short.
The CHKPT Macro
A program can issue a checkpoint by calling the z/OS CHKPT macro. In COBOL, this is typically requested indirectly — through the RERUN clause in the I-O-CONTROL paragraph or by calling an assembler interface routine. When CHKPT executes, the operating system:
- Writes the contents of the program's working storage to a checkpoint dataset
- Records the position of all open sequential datasets (by block count)
- Records the status of all open QSAM/BSAM files
- Writes a checkpoint record that can be used for restart
The checkpoint dataset is specified in JCL using the SYSCHK DD statement:
//SYSCHK DD DSN=PROD.BATCH.CHKPT.DATA,
// DISP=(NEW,KEEP,KEEP),
// SPACE=(CYL,(5,5)),
// UNIT=SYSDA
The RD Parameter
The RD (Restart Definition) parameter on the JOB or EXEC statement controls checkpoint/restart behavior. It has four possible values:
| RD Value | Meaning |
|---|---|
| RD=R | Automatic restart is allowed. Checkpoints are allowed. |
| RD=RNC | Automatic restart is allowed. Checkpoints are suppressed (Not Checkpoint). |
| RD=NR | Automatic restart is suppressed (Not Restart). Checkpoints are allowed. |
| RD=NC | Neither restart nor checkpoints are allowed. |
In practice, you specify RD on the EXEC statement for the step you want to protect:
//STEP010 EXEC PGM=CBNC4500,RD=R
When a step with RD=R abends, the operator can restart the job from the last checkpoint using the RESTART parameter on the JOB statement:
//CBNC4500 JOB (ACCT),'RECON',CLASS=A,
// RESTART=(STEP010,chkptname)
SYSCKEOV — Checkpoint at End of Volume
For programs that process multi-volume sequential datasets, the SYSCKEOV DD statement tells the system to take an automatic checkpoint every time an input dataset reaches the end of a volume:
//SYSCKEOV DD DSN=PROD.BATCH.CHKPT.EOV,
// DISP=(NEW,KEEP,KEEP),
// SPACE=(CYL,(2,2)),
// UNIT=SYSDA
This was more useful in the tape era when volumes were physical tape reels. A 100-reel input dataset would get 99 automatic checkpoints. Today, with DASD datasets, SYSCKEOV is less commonly used, but it still functions with multi-volume DASD datasets.
Limitations of the System Facility
The z/OS checkpoint/restart facility has significant limitations that you need to understand:
It does not checkpoint DB2 state. The facility saves file positions and working storage, but DB2 transactions are separate. If your program updates DB2 tables and you restart from a system checkpoint, the DB2 changes made since the last DB2 COMMIT are already rolled back. Your program's working storage says you processed 500,000 records, but DB2 only has the first 495,000 committed. You have a mismatch.
It does not reposition VSAM files. VSAM datasets are not managed by the same I/O subsystem as sequential files. The checkpoint facility does not record VSAM file positions.
It requires operator intervention. The RESTART parameter must be coded on the JOB statement when resubmitting. This means someone has to modify JCL and resubmit. At 2:47 AM, that someone may not be immediately available.
It does not handle application state beyond working storage. If your program maintains state in external files, temporary datasets, or cross-memory structures, the checkpoint facility doesn't know about them.
These limitations are why most modern mainframe shops use application-level checkpointing instead of or in addition to the system facility. The system facility is a safety net. Application-level checkpointing is the primary strategy.
💡 Practitioner Note: I've worked in shops that relied entirely on the z/OS checkpoint/restart facility, and shops that used purely application-level checkpointing. The shops that had the smoothest operations used both — application-level checkpointing as the primary mechanism, with the system facility as a fallback for programs that didn't have application checkpointing yet.
24.3 Application-Level Checkpointing in COBOL
Application-level checkpointing means your COBOL program manages its own checkpoint and restart logic. The program decides when to checkpoint, what to save, and how to restart. This gives you complete control — and complete responsibility.
The Restart Table Pattern
The restart table pattern is the most widely used approach for application-level checkpointing in DB2/COBOL batch programs. Here's how it works:
- You create a DB2 table specifically to hold checkpoint information
- Your program writes its checkpoint state to this table every N records
- The checkpoint write is part of the same COMMIT that commits the business data
- On restart, the program reads the restart table to determine where to resume
The restart table typically looks like this:
CREATE TABLE RESTART_CONTROL (
PROGRAM_NAME CHAR(8) NOT NULL,
JOB_NAME CHAR(8) NOT NULL,
STEP_NAME CHAR(8) NOT NULL,
LAST_KEY_VALUE VARCHAR(100) NOT NULL,
RECORDS_READ INTEGER NOT NULL,
RECORDS_WRITTEN INTEGER NOT NULL,
RECORDS_UPDATED INTEGER NOT NULL,
RECORDS_ERROR INTEGER NOT NULL,
CHECKPOINT_TS TIMESTAMP NOT NULL,
RUN_STATUS CHAR(1) NOT NULL,
ACCUM_AMOUNT_1 DECIMAL(15,2) NOT NULL WITH DEFAULT 0,
ACCUM_AMOUNT_2 DECIMAL(15,2) NOT NULL WITH DEFAULT 0,
ACCUM_AMOUNT_3 DECIMAL(15,2) NOT NULL WITH DEFAULT 0,
USER_DATA VARCHAR(500),
PRIMARY KEY (PROGRAM_NAME, JOB_NAME, STEP_NAME)
);
The key columns:
- PROGRAM_NAME / JOB_NAME / STEP_NAME: Uniquely identify the running instance. This matters when the same program runs in multiple jobs.
- LAST_KEY_VALUE: The key of the last record successfully processed. This is how the program knows where to resume.
- RECORDS_READ / WRITTEN / UPDATED / ERROR: Counters that must be preserved across restart so that end-of-job totals are correct.
- CHECKPOINT_TS: When the checkpoint was taken. Useful for monitoring and debugging.
- RUN_STATUS: 'S' for started, 'C' for checkpointed, 'E' for ended. Tells the restart logic whether a previous run completed or was interrupted.
- ACCUM_AMOUNT_1/2/3: Accumulators for running totals (e.g., total debit amount, total credit amount). These must be preserved so that control totals at end-of-job are correct.
The Program Flow
Here is the complete flow of a checkpoint/restart-enabled COBOL batch program:
Initialization (performed once at program start):
1. Read the restart table for this program/job/step
2. IF RUN_STATUS = 'E' (previous run completed normally)
OR no row exists (first run ever)
THEN this is a fresh start:
- Initialize all counters to zero
- Set starting position to beginning of input
- INSERT or UPDATE restart table with RUN_STATUS = 'S'
- COMMIT
3. IF RUN_STATUS = 'S' or 'C' (previous run did not complete)
THEN this is a restart:
- Load counters from restart table
- Set starting position to LAST_KEY_VALUE
- Position input file/cursor past LAST_KEY_VALUE
- Log: "RESTARTING FROM KEY: [value], RECORDS PREVIOUSLY PROCESSED: [count]"
Main Processing Loop:
FOR each input record:
1. Process the record (updates, inserts, writes)
2. Increment counters
3. IF records-since-last-commit >= COMMIT-FREQUENCY
THEN take a checkpoint:
a. UPDATE restart table with current counters and last key
b. EXEC SQL COMMIT
c. Reset records-since-last-commit counter
d. Log: "CHECKPOINT AT KEY: [value], RECORDS: [count]"
Termination (performed once at normal end):
1. Process any remaining records
2. UPDATE restart table with final counters and RUN_STATUS = 'E'
3. EXEC SQL COMMIT
4. Write control totals to report
5. Close all files
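The COBOL sections that follow implement this flow for real. As a language-neutral sketch first (illustrative only — a dict stands in for the RESTART_CONTROL row, and in real code the row is persisted in the same COMMIT as the business data):

```python
def run(restart_row, records, commit_frequency, fail_at=None):
    """Process sorted record keys; optionally abend at key `fail_at`.
    Only checkpointed state lands in restart_row, mimicking the
    rollback of an uncommitted unit of recovery."""
    if restart_row.get("RUN_STATUS") in ("S", "C"):          # restart
        start_after = restart_row["LAST_KEY_VALUE"]
        records_read = restart_row["RECORDS_READ"]
    else:                                                    # fresh start
        restart_row.update(LAST_KEY_VALUE="", RECORDS_READ=0,
                           RUN_STATUS="S")
        start_after, records_read = "", 0
    since_commit = 0
    for key in records:
        if key <= start_after:       # cursor skips committed work
            continue
        if key == fail_at:           # abend: uncommitted work is lost
            raise RuntimeError("abend at " + key)
        records_read += 1            # "process" the record
        since_commit += 1
        if since_commit >= commit_frequency:   # checkpoint + COMMIT
            restart_row.update(LAST_KEY_VALUE=key,
                               RECORDS_READ=records_read,
                               RUN_STATUS="C")
            since_commit = 0
    restart_row.update(RECORDS_READ=records_read, RUN_STATUS="E")
    return restart_row

row, keys = {}, ["%04d" % i for i in range(1, 11)]
try:
    run(row, keys, commit_frequency=3, fail_at="0008")
except RuntimeError:
    pass
print(row["LAST_KEY_VALUE"], row["RECORDS_READ"])  # 0006 6
run(row, keys, commit_frequency=3)                 # restart completes
print(row["RECORDS_READ"], row["RUN_STATUS"])      # 10 E
```

Note how the failed run's seventh record is counted again on restart: it was read in the failed unit of recovery, but only the checkpoint at record six survived.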
COBOL Implementation
Here is the WORKING-STORAGE section for checkpoint/restart support:
01 WS-RESTART-AREA.
05 WS-RESTART-PROGRAM PIC X(08) VALUE 'CBNC4500'.
05 WS-RESTART-JOB PIC X(08).
05 WS-RESTART-STEP PIC X(08).
05 WS-RESTART-KEY PIC X(100).
05 WS-RESTART-REC-READ PIC S9(09) COMP VALUE ZERO.
05 WS-RESTART-REC-WRIT PIC S9(09) COMP VALUE ZERO.
05 WS-RESTART-REC-UPD PIC S9(09) COMP VALUE ZERO.
05 WS-RESTART-REC-ERR PIC S9(09) COMP VALUE ZERO.
05 WS-RESTART-STATUS PIC X(01).
88 RESTART-STARTED VALUE 'S'.
88 RESTART-CHECKPOINTED VALUE 'C'.
88 RESTART-ENDED VALUE 'E'.
05 WS-RESTART-ACCUM-1 PIC S9(13)V99 COMP-3
VALUE ZERO.
05 WS-RESTART-ACCUM-2 PIC S9(13)V99 COMP-3
VALUE ZERO.
05 WS-RESTART-ACCUM-3 PIC S9(13)V99 COMP-3
VALUE ZERO.
01 WS-CHECKPOINT-CONTROL.
05 WS-COMMIT-FREQUENCY PIC S9(09) COMP VALUE 5000.
05 WS-RECORDS-SINCE-CMT PIC S9(09) COMP VALUE ZERO.
05 WS-IS-RESTART PIC X(01) VALUE 'N'.
88 IS-RESTART VALUE 'Y'.
88 IS-FRESH-START VALUE 'N'.
05 WS-CHECKPOINT-COUNT PIC S9(09) COMP VALUE ZERO.
The initialization paragraph:
2000-INITIALIZE-RESTART.
MOVE SPACES TO WS-RESTART-KEY
EXEC SQL
SELECT LAST_KEY_VALUE,
RECORDS_READ,
RECORDS_WRITTEN,
RECORDS_UPDATED,
RECORDS_ERROR,
RUN_STATUS,
ACCUM_AMOUNT_1,
ACCUM_AMOUNT_2,
ACCUM_AMOUNT_3
INTO :WS-RESTART-KEY,
:WS-RESTART-REC-READ,
:WS-RESTART-REC-WRIT,
:WS-RESTART-REC-UPD,
:WS-RESTART-REC-ERR,
:WS-RESTART-STATUS,
:WS-RESTART-ACCUM-1,
:WS-RESTART-ACCUM-2,
:WS-RESTART-ACCUM-3
FROM RESTART_CONTROL
WHERE PROGRAM_NAME = :WS-RESTART-PROGRAM
AND JOB_NAME = :WS-RESTART-JOB
AND STEP_NAME = :WS-RESTART-STEP
END-EXEC
EVALUATE SQLCODE
WHEN 0
IF RESTART-ENDED
PERFORM 2100-FRESH-START
ELSE
PERFORM 2200-RESTART-FROM-CHECKPOINT
END-IF
WHEN +100
PERFORM 2100-FRESH-START
WHEN OTHER
DISPLAY 'RESTART TABLE READ FAILED, SQLCODE='
SQLCODE
MOVE 16 TO WS-RETURN-CODE
PERFORM 9000-ABEND-HANDLER
END-EVALUATE.
2100-FRESH-START.
SET IS-FRESH-START TO TRUE
MOVE ZEROS TO WS-RESTART-REC-READ
WS-RESTART-REC-WRIT
WS-RESTART-REC-UPD
WS-RESTART-REC-ERR
WS-RESTART-ACCUM-1
WS-RESTART-ACCUM-2
WS-RESTART-ACCUM-3
MOVE SPACES TO WS-RESTART-KEY
SET RESTART-STARTED TO TRUE
EXEC SQL
MERGE INTO RESTART_CONTROL RC
USING (VALUES (:WS-RESTART-PROGRAM,
:WS-RESTART-JOB,
:WS-RESTART-STEP))
AS SRC(PGM, JOB, STP)
ON RC.PROGRAM_NAME = SRC.PGM
AND RC.JOB_NAME = SRC.JOB
AND RC.STEP_NAME = SRC.STP
WHEN MATCHED THEN
UPDATE SET RUN_STATUS = 'S',
LAST_KEY_VALUE = ' ',
RECORDS_READ = 0,
RECORDS_WRITTEN = 0,
RECORDS_UPDATED = 0,
RECORDS_ERROR = 0,
CHECKPOINT_TS = CURRENT TIMESTAMP,
ACCUM_AMOUNT_1 = 0,
ACCUM_AMOUNT_2 = 0,
ACCUM_AMOUNT_3 = 0
WHEN NOT MATCHED THEN
INSERT (PROGRAM_NAME, JOB_NAME, STEP_NAME,
LAST_KEY_VALUE, RECORDS_READ,
RECORDS_WRITTEN, RECORDS_UPDATED,
RECORDS_ERROR, CHECKPOINT_TS,
RUN_STATUS, ACCUM_AMOUNT_1,
ACCUM_AMOUNT_2, ACCUM_AMOUNT_3)
VALUES (:WS-RESTART-PROGRAM,
:WS-RESTART-JOB,
:WS-RESTART-STEP,
' ', 0, 0, 0, 0,
CURRENT TIMESTAMP,
'S', 0, 0, 0)
END-EXEC
EXEC SQL COMMIT END-EXEC
DISPLAY 'CBNC4500 - FRESH START INITIATED'
.
2200-RESTART-FROM-CHECKPOINT.
SET IS-RESTART TO TRUE
DISPLAY 'CBNC4500 - RESTARTING FROM KEY: '
WS-RESTART-KEY
DISPLAY ' RECORDS PREVIOUSLY READ: '
WS-RESTART-REC-READ
DISPLAY ' RECORDS PREVIOUSLY WRITTEN: '
WS-RESTART-REC-WRIT
DISPLAY ' RECORDS PREVIOUSLY UPDATED: '
WS-RESTART-REC-UPD
DISPLAY ' RECORDS PREVIOUSLY IN ERROR:'
WS-RESTART-REC-ERR
.
The checkpoint paragraph:
5000-TAKE-CHECKPOINT.
ADD 1 TO WS-CHECKPOINT-COUNT
SET RESTART-CHECKPOINTED TO TRUE
EXEC SQL
UPDATE RESTART_CONTROL
SET LAST_KEY_VALUE = :WS-RESTART-KEY,
RECORDS_READ = :WS-RESTART-REC-READ,
RECORDS_WRITTEN = :WS-RESTART-REC-WRIT,
RECORDS_UPDATED = :WS-RESTART-REC-UPD,
RECORDS_ERROR = :WS-RESTART-REC-ERR,
CHECKPOINT_TS = CURRENT TIMESTAMP,
RUN_STATUS = 'C',
ACCUM_AMOUNT_1 = :WS-RESTART-ACCUM-1,
ACCUM_AMOUNT_2 = :WS-RESTART-ACCUM-2,
ACCUM_AMOUNT_3 = :WS-RESTART-ACCUM-3
WHERE PROGRAM_NAME = :WS-RESTART-PROGRAM
AND JOB_NAME = :WS-RESTART-JOB
AND STEP_NAME = :WS-RESTART-STEP
END-EXEC
IF SQLCODE NOT = 0
DISPLAY 'CHECKPOINT UPDATE FAILED, SQLCODE='
SQLCODE
MOVE 16 TO WS-RETURN-CODE
PERFORM 9000-ABEND-HANDLER
END-IF
EXEC SQL COMMIT END-EXEC
DISPLAY 'CHECKPOINT #' WS-CHECKPOINT-COUNT
' AT KEY: ' WS-RESTART-KEY
' RECORDS: ' WS-RESTART-REC-READ
MOVE ZERO TO WS-RECORDS-SINCE-CMT
.
The Critical Atomicity Requirement
Notice that the restart table UPDATE and the COMMIT happen together. The business data updates and the restart table update are all part of the same unit of recovery. This is not optional — it is the fundamental guarantee that makes checkpoint/restart work.
If you update the restart table in a separate commit from the business data, you create a window where the restart table says "I processed up to record 500,000" but DB2 has only committed changes through record 495,000. On restart, you'd skip 5,000 records. Or worse: the restart table commit succeeds but the business data commit fails, and you skip records that were never processed.
⚠️ Critical Rule: The restart table update and the business data updates MUST be committed in the same COMMIT. They must be in the same unit of recovery. This is non-negotiable.
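The guarantee is easy to demonstrate outside DB2. In this sketch, SQLite stands in for DB2 (the transaction semantics at issue are the same): the business update and the restart-row update share one unit of recovery, so a rollback removes both together.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE balances (acct TEXT PRIMARY KEY, amt INTEGER)")
con.execute("CREATE TABLE restart_control (pgm TEXT PRIMARY KEY, last_key TEXT)")
con.execute("INSERT INTO balances VALUES ('A1', 100)")
con.execute("INSERT INTO restart_control VALUES ('CBNC4500', '')")
con.commit()

# One unit of recovery: business change plus checkpoint row together.
con.execute("UPDATE balances SET amt = amt + 50 WHERE acct = 'A1'")
con.execute("UPDATE restart_control SET last_key = 'A1' WHERE pgm = 'CBNC4500'")
con.rollback()   # simulate an abend before the COMMIT

# Both updates vanish as a unit: the restart table never claims
# progress that the business data does not have.
amt = con.execute("SELECT amt FROM balances").fetchone()[0]
key = con.execute("SELECT last_key FROM restart_control").fetchone()[0]
print(amt, repr(key))   # 100 ''
```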
Retrieving Job and Step Names
Your program needs to know its own job name and step name to use as keys in the restart table. Enterprise COBOL has no intrinsic that returns them, so shops typically call a small assembler or Language Environment subroutine that reads the names from the system control blocks:
2050-GET-JOB-INFO.
*> GETJBNM is a placeholder for your site's utility that
*> returns the current job and step names.
    CALL 'GETJBNM' USING WS-RESTART-JOB
                         WS-RESTART-STEP
    .
Whatever mechanism your shop provides, the point is the same: get the values dynamically, don't hardcode them. The same program may run in different jobs with different step names.
Handling the Input Cursor on Restart
When restarting, the program must skip past all records that were already processed. For a DB2 cursor, this means adding the restart key to the WHERE clause:
3000-OPEN-INPUT-CURSOR.
*> WS-RESTART-KEY holds spaces on a fresh start and the
*> checkpoint key on a restart. The cursor's WHERE clause
*> handles both cases, so one unconditional OPEN suffices.
    EXEC SQL
        OPEN CSR-TRANSACTIONS
    END-EXEC
    .
Where the cursor is declared with the restart key as a host variable:
DECLARE CSR-TRANSACTIONS CURSOR FOR
    SELECT ACCT_NUM, TRANS_DATE, TRANS_AMT, TRANS_TYPE
    FROM DAILY_TRANSACTIONS
    WHERE ACCT_NUM > :WS-RESTART-KEY
       OR :WS-RESTART-KEY = ' '
    ORDER BY ACCT_NUM
    FOR FETCH ONLY
The OR :WS-RESTART-KEY = ' ' clause makes this a single cursor that handles both fresh start (the key is spaces, so every row qualifies) and restart (only rows after the checkpoint key qualify). Some shops use two separate cursors instead — one for fresh start, one for restart. Either approach works. The single-cursor approach is cleaner, but the optimizer may not produce an ideal access path for both cases; profile your specific queries.
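The two behaviors of that predicate can be seen with any SQL engine. A sketch using SQLite, with parameter markers standing in for the COBOL host variable:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily_transactions (acct_num TEXT)")
con.executemany("INSERT INTO daily_transactions VALUES (?)",
                [("A100",), ("A200",), ("A300",)])

q = ("SELECT acct_num FROM daily_transactions "
     "WHERE acct_num > ? OR ? = ' ' ORDER BY acct_num")

fresh = [r[0] for r in con.execute(q, (" ", " "))]          # key is spaces
restart = [r[0] for r in con.execute(q, ("A100", "A100"))]  # after checkpoint
print(fresh)    # ['A100', 'A200', 'A300']
print(restart)  # ['A200', 'A300']
```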
The WITH HOLD Cursor Consideration
By default, DB2 closes all open cursors when you issue a COMMIT. This means after every checkpoint COMMIT, your input cursor is closed and you must reopen it.
You can avoid this by declaring the cursor WITH HOLD:
DECLARE CSR-TRANSACTIONS CURSOR WITH HOLD FOR
    SELECT ACCT_NUM, TRANS_DATE, TRANS_AMT, TRANS_TYPE
    FROM DAILY_TRANSACTIONS
    WHERE ACCT_NUM > :WS-RESTART-KEY
       OR :WS-RESTART-KEY = ' '
    ORDER BY ACCT_NUM
    FOR FETCH ONLY
A WITH HOLD cursor survives a COMMIT — it stays open, and the next FETCH returns the next row as if the COMMIT hadn't happened. This eliminates the overhead of closing and reopening the cursor at every checkpoint.
However, WITH HOLD cursors have tradeoffs:
Advantages:
- No cursor close/reopen overhead at each checkpoint (can be significant for complex cursor queries)
- Simpler code — no need to reposition the cursor after each COMMIT
- DB2 can maintain internal positioning, potentially avoiding index lookups on each reopen
Disadvantages:
- WITH HOLD cursors retain certain resources across COMMITs, potentially affecting DB2 memory utilization
- If the underlying tablespace is reorganized or the index is rebuilt between commits (unlikely during batch, but possible during a long run), the cursor position may be lost
- On restart after an abend, the cursor is NOT preserved — it was open in the failed thread, which no longer exists. You still need the restart key logic for the initial cursor open on restart.
The critical point: WITH HOLD eliminates cursor reopen overhead during normal processing, but does not eliminate the need for restart key positioning. You still need the WHERE clause with the restart key for the restart case. WITH HOLD helps between checkpoints within a single run; the restart key helps when restarting a failed run.
At CNB, Lisa Park uses WITH HOLD for all checkpoint/restart cursors. The elapsed time improvement is measurable — about 2% for CBNC4500, which takes 2,840 checkpoint COMMITs during a full run. For programs with fewer commits, the benefit is negligible.
Idempotency and the Duplicate Processing Problem
A well-designed checkpoint/restart program must be aware of idempotency — the property that processing a record more than once produces the same result as processing it once.
Consider what happens if the program fails between processing a record and reaching the next checkpoint. On restart, that record is the first one fetched by the cursor (because it is after the last committed checkpoint key). But it was already processed in the failed UR, and those changes were rolled back. So the record is correctly reprocessed.
Now consider an INSERT operation. If the program inserts a row into a results table for each input record, and the program fails after the insert but before the commit, the insert is rolled back. On restart, the insert executes again — no problem, because the first insert was rolled back.
But what if the program writes to an external system — sends a message to IBM MQ, calls a web service, or writes to a non-DB2 file? Those operations are not rolled back by DB2. On restart, the program sends the message again or writes the record again. This is the duplicate processing problem.
The solution is to ensure that all side effects that cannot be rolled back are either:
1. Deferred until after COMMIT — perform the non-reversible action only after the data is committed
2. Made idempotent — design the receiving system to detect and ignore duplicates (using a unique identifier like a transaction ID)
3. Performed within the same unit of recovery — use two-phase commit with IBM MQ or another XA-compliant resource manager
For most COBOL batch programs that only touch DB2 and VSAM, idempotency is natural — DB2 rollback handles it. The problem surfaces when the program has external side effects. Be aware of this when adding checkpoint/restart to programs that interact with external systems.
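An idempotent receiver is the most common fix when the receiving system can be changed. A minimal sketch (names are illustrative): the receiver remembers the transaction IDs it has applied and ignores replays.

```python
def make_receiver():
    """Receiver that deduplicates on a unique transaction ID."""
    seen, applied = set(), []

    def receive(txn_id, payload):
        if txn_id in seen:        # replay from a restarted batch run
            return False
        seen.add(txn_id)
        applied.append(payload)   # apply the side effect exactly once
        return True

    return receive, applied

receive, applied = make_receiver()
receive("TXN-0001", "debit 50.00")
receive("TXN-0001", "debit 50.00")   # resent after restart: ignored
print(len(applied))                  # 1
```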
24.4 DB2 Commit Frequency Analysis
Choosing the right commit frequency is an engineering decision, not a guess. It involves tradeoffs among four competing concerns:
- Recovery time — How long does restart take?
- Lock duration — How long are rows locked?
- Log volume — How much log space does each UR consume?
- Commit overhead — How much CPU does each COMMIT cost?
The Tradeoff Matrix
| Commit Frequency | Recovery Time | Lock Duration | Log per UR | Commit Overhead |
|---|---|---|---|---|
| Every 100 records | Minimal (seconds) | Very short | Very small | Very high |
| Every 1,000 records | Low (seconds) | Short | Small | High |
| Every 5,000 records | Low (minutes) | Moderate | Moderate | Moderate |
| Every 10,000 records | Moderate (minutes) | Moderate | Moderate | Low |
| Every 50,000 records | High (tens of minutes) | Long | Large | Very low |
| Never (end of job) | Maximum (full rerun) | Entire run | Entire run | None |
Understanding Commit Overhead
Each DB2 COMMIT is not free. It involves:
- Writing the log buffer to the active log dataset (synchronous I/O)
- Releasing all page and row locks held by the UR
- Internal DB2 bookkeeping for the new UR
On modern z/OS systems with zHyperLink-attached storage, a single COMMIT takes approximately 0.1–0.3 milliseconds of CPU time and 0.5–2 milliseconds of elapsed time. For a program processing 10 million records:
| Commit Frequency | Number of COMMITs | Commit CPU Overhead | Commit Elapsed Overhead |
|---|---|---|---|
| 100 | 100,000 | 10–30 seconds | 50–200 seconds |
| 1,000 | 10,000 | 1–3 seconds | 5–20 seconds |
| 5,000 | 2,000 | 0.2–0.6 seconds | 1–4 seconds |
| 10,000 | 1,000 | 0.1–0.3 seconds | 0.5–2 seconds |
At a commit frequency of 5,000, the overhead is negligible — a fraction of a second of CPU for a job that runs for hours. Even at 1,000, the overhead is minimal. Below 500, you start to notice it, but even then it's usually acceptable.
The practical guideline: For most batch programs, a commit frequency between 1,000 and 10,000 gives a good balance. Start with 5,000 and adjust based on measurement.
Lock Duration and Concurrency
The commit frequency directly controls how long your program holds DB2 locks. Between commits, every row your program updates (or reads with a lock) remains locked. Other programs that need those rows must wait.
Consider a program that updates account balances. With a commit frequency of 50,000, it locks 50,000 account rows at a time. If an online transaction needs one of those rows, it waits. If the wait exceeds the lock timeout threshold (typically 30–60 seconds), the online transaction gets SQLCODE -911 and the user sees an error.
This is why Rob Calloway's original CBNC4500 — with zero commits — was particularly dangerous. It locked every row it touched for the entire 3+ hour run. During that time, no online system could update those rows.
Log Volume and Active Log Sizing
Each unit of recovery generates log records. DB2 writes before-images (for rollback) and after-images (for forward recovery) of every changed row. If a single UR updates 1 million rows and each row is 200 bytes, the log volume for that UR is approximately:
- Before-images: 1,000,000 x 200 bytes = 200 MB
- After-images: 1,000,000 x 200 bytes = 200 MB
- Log record headers and control records: ~50 MB
- Total: ~450 MB for one UR
If your active log datasets are sized at 1 GB each (a common configuration), a single UR that generates 450 MB of log data uses nearly half an active log. If two such programs run concurrently, the active logs fill, forcing an archive switch. During the archive switch, all DB2 logging activity stalls. Every application waiting to write a log record waits.
With a commit frequency of 5,000, the same program generates approximately:
- 5,000 x 200 bytes x 2 (before + after) + overhead = ~2.3 MB per UR
This is trivial. The active log handles it without breaking a sweat.
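The arithmetic generalizes. In this sketch, the 12.5% allowance for log record headers and control records is an assumption back-fitted to the chapter's ~450 MB figure, not a DB2-documented constant:

```python
def log_bytes_per_ur(rows, row_bytes, overhead_factor=0.125):
    """Before-images plus after-images of every changed row, plus a
    rough allowance for log headers and control records (assumed)."""
    return rows * row_bytes * 2 * (1 + overhead_factor)

MB = 1_000_000
print(log_bytes_per_ur(1_000_000, 200) / MB)  # 450.0 -> one huge UR
print(log_bytes_per_ur(5_000, 200) / MB)      # 2.25  -> per 5,000-record UR
```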
📊 CNB's Standard: After the CBNC4500 incident, Kwame established a standard: no batch UR may generate more than 100 MB of log data. This translates to a commit frequency that depends on the row size, but 5,000–10,000 records is typical for CNB's transaction tables.
Making the Commit Frequency Configurable
Hard-coding the commit frequency is a maintenance headache. Different environments (development, QA, production) may need different values. Production may need to change the value during a particularly busy night.
The best practice is to read the commit frequency from a control table or a parameter:
01 WS-PARM-AREA.
05 WS-PARM-LENGTH PIC S9(04) COMP.
05 WS-PARM-DATA PIC X(100).
01 WS-PARM-FIELDS.
05 WS-COMMIT-FREQ-ALPHA PIC X(10).
05 WS-OTHER-PARM PIC X(90).
PROCEDURE DIVISION USING WS-PARM-AREA.
...
1500-PARSE-PARAMETERS.
IF WS-PARM-LENGTH > 0
UNSTRING WS-PARM-DATA DELIMITED BY ','
INTO WS-COMMIT-FREQ-ALPHA
WS-OTHER-PARM
END-UNSTRING
MOVE FUNCTION NUMVAL(WS-COMMIT-FREQ-ALPHA)
TO WS-COMMIT-FREQUENCY
IF WS-COMMIT-FREQUENCY < 100
OR WS-COMMIT-FREQUENCY > 100000
DISPLAY 'INVALID COMMIT FREQ: '
WS-COMMIT-FREQUENCY
' - USING DEFAULT 5000'
MOVE 5000 TO WS-COMMIT-FREQUENCY
END-IF
ELSE
MOVE 5000 TO WS-COMMIT-FREQUENCY
END-IF
.
The JCL passes the parameter via PARM:
//STEP010 EXEC PGM=CBNC4500,PARM='5000'
Now operations can change the commit frequency without a program recompile. If the batch window is tight one night and they want more frequent commits (shorter lock duration, letting concurrent jobs run faster), they lower the interval. If commit overhead is a concern, they raise it. This flexibility is worth the ten extra lines of code.
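The parse-validate-default rule in the COBOL above is compact enough to restate as a sketch (bounds 100–100,000 and default 5,000, matching the paragraph's checks):

```python
def parse_commit_frequency(parm, default=5000):
    """First comma-separated field of the PARM string, validated
    against sane bounds; anything unusable falls back to the default."""
    try:
        value = int(parm.split(",")[0])
    except ValueError:
        return default
    return value if 100 <= value <= 100_000 else default

print(parse_commit_frequency("5000"))       # 5000
print(parse_commit_frequency("50,XYZ"))     # out of range -> 5000
print(parse_commit_frequency(""))           # empty PARM  -> 5000
```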
24.5 VSAM and Sequential File Checkpoint/Restart
DB2 tables are easy to checkpoint — you commit and the data is safe. VSAM and sequential files are harder because they don't participate in DB2's transaction management. You need explicit strategies for each file type.
VSAM File Repositioning
VSAM KSDS (Key-Sequenced Data Sets) support random access by key. This makes restart repositioning straightforward:
On restart, reposition the VSAM file using the last checkpoint key:
3100-REPOSITION-VSAM-INPUT.
MOVE WS-RESTART-KEY TO VSAM-KEY-FIELD
START VSAM-INPUT-FILE
KEY IS GREATER THAN VSAM-KEY-FIELD
INVALID KEY
DISPLAY 'VSAM REPOSITION FAILED AT KEY: '
VSAM-KEY-FIELD
MOVE 16 TO WS-RETURN-CODE
PERFORM 9000-ABEND-HANDLER
END-START
.
After the START, subsequent READ NEXT operations return records after the checkpoint position.
For VSAM RRDS (Relative Record Data Sets), you store the relative record number in the restart table and use it to reposition.
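As a sketch of the RRDS case, assuming the relative record number saved at checkpoint is restored into WS-RESTART-RRN (file, ddname, and field names are illustrative):

```cobol
       SELECT RRDS-INPUT-FILE ASSIGN TO RRDSIN
           ORGANIZATION IS RELATIVE
           ACCESS MODE IS DYNAMIC
           RELATIVE KEY IS WS-RRN
           FILE STATUS IS WS-RRDS-STATUS.

       3150-REPOSITION-RRDS-INPUT.
           MOVE WS-RESTART-RRN TO WS-RRN
           START RRDS-INPUT-FILE
               KEY IS GREATER THAN WS-RRN
               INVALID KEY
                   DISPLAY 'RRDS REPOSITION FAILED AT RRN: ' WS-RRN
                   MOVE 16 TO WS-RETURN-CODE
                   PERFORM 9000-ABEND-HANDLER
           END-START
           .
```

After the START, READ NEXT resumes with the first record past the checkpoint, exactly as in the KSDS case.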
For VSAM ESDS (Entry-Sequenced Data Sets), repositioning is more complex because ESDS does not support keyed access. You have two options:
- Store the RBA (Relative Byte Address) in the restart table and use it to reposition. This requires low-level access that standard COBOL doesn't provide directly.
- Skip forward on restart by reading and discarding records until you reach the checkpoint count. This works but is slow for large files.
Most shops avoid ESDS for input files that need checkpoint/restart. Use KSDS instead.
VSAM Output File Challenges
VSAM output files present a different challenge. If your program writes records to a VSAM KSDS output file and then fails, the records written since the last checkpoint are already physically in the VSAM file — but the corresponding DB2 changes have been rolled back. You have orphaned VSAM records.
There are three approaches to handle this:
Approach 1: Delete on restart. On restart, delete all VSAM output records written since the last checkpoint. This requires knowing which records were written after the checkpoint — typically by using a timestamp or sequence number stored in the VSAM record.
3200-CLEANUP-VSAM-OUTPUT.
MOVE WS-RESTART-KEY TO VSAM-OUT-KEY
START VSAM-OUTPUT-FILE
KEY IS GREATER THAN VSAM-OUT-KEY
INVALID KEY
GO TO 3200-CLEANUP-DONE
END-START
PERFORM UNTIL WS-VSAM-EOF = 'Y'
READ VSAM-OUTPUT-FILE NEXT
AT END
MOVE 'Y' TO WS-VSAM-EOF
NOT AT END
DELETE VSAM-OUTPUT-FILE
END-READ
END-PERFORM
.
3200-CLEANUP-DONE.
EXIT.
Approach 2: Write to a temporary file first. Write all output to a temporary sequential file, then copy to VSAM in a separate step after the main program completes successfully. This eliminates the VSAM inconsistency problem but adds a step.
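Approach 2 can be sketched in JCL: the main program (MAINPGM here, a placeholder name, with illustrative dataset names) writes only the temporary sequential file, and an IDCAMS REPRO step loads the KSDS only if the main step ended cleanly:

```jcl
//STEP010  EXEC PGM=MAINPGM
//TEMPOUT  DD DSN=PROD.TEMP.OUTPUT,
//            DISP=(NEW,CATLG,DELETE),
//            SPACE=(CYL,(50,10)),
//            DCB=(RECFM=FB,LRECL=200,BLKSIZE=0)
//*
//* COPY TO VSAM ONLY IF THE MAIN STEP SUCCEEDED
// IF (STEP010.RC LE 4) THEN
//STEP020  EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SEQIN    DD DSN=PROD.TEMP.OUTPUT,DISP=SHR
//VSAMOUT  DD DSN=PROD.MASTER.KSDS,DISP=OLD
//SYSIN    DD *
  REPRO INFILE(SEQIN) OUTFILE(VSAMOUT)
/*
// ENDIF
```

If STEP010 fails and is restarted, the KSDS has not been touched, so there are no orphans to clean up; the temporary file is simply regenerated.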
Approach 3: Use CICS File Control for transactional VSAM. If the VSAM file is managed by CICS, you can use CICS recoverable file support, which participates in two-phase commit with DB2. This is the cleanest solution but requires CICS infrastructure.
Sequential Output File Strategies
Sequential output files are the trickiest for checkpoint/restart because you cannot easily "un-write" records from a sequential file. Once a record is written, it's written.
Strategy 1: Generation Data Groups (GDGs). Write each checkpoint's output to a new generation of a GDG. On restart, delete the partial generation and start a new one from the checkpoint position. On successful completion, concatenate all generations into the final output.
//OUTPUT DD DSN=PROD.RECON.OUTPUT(+1),
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(50,10)),
// DCB=(RECFM=FB,LRECL=200,BLKSIZE=0)
Strategy 2: Rewrite from checkpoint. On restart, reallocate the output dataset with DISP=(NEW,CATLG,DELETE) and rewrite all output from the beginning — but only process input records from the checkpoint key forward. This works when the output is a subset transformation of the input. You lose previously written records, but those will be regenerated from the committed data.
Strategy 3: Track byte position. Store the byte offset of the sequential file in the restart table. On restart, position to that offset and continue writing. This requires low-level I/O manipulation and is fragile — avoid it unless no other option works.
💡 Practitioner Note: At CNB, Lisa Park standardized on Strategy 2 for most sequential output files. The rationale: sequential output files are almost always consumed by a downstream job, not directly by users. Rewriting the output from committed DB2 data is safe and simple. The downstream job gets a complete, consistent file regardless of how many times the producing job restarted.
Coordinating Across All Three: DB2 + VSAM + Sequential
The hardest checkpoint/restart scenarios involve programs that read from DB2, update VSAM, and write sequential output — all in the same job step. Each resource type has different transactional capabilities:
| Resource | Participates in DB2 COMMIT? | Can be repositioned on restart? | Can be "rolled back"? |
|---|---|---|---|
| DB2 tables | Yes | Yes (cursor with key > restart_key) | Yes (automatic rollback) |
| VSAM KSDS | No | Yes (START with key) | Manual (delete orphans) |
| Sequential output | No | No (append-only) | No (must rewrite) |
The coordination strategy:
- COMMIT handles DB2. The restart table and all business table updates are committed together.
- VSAM updates use the same key range as DB2. On restart, delete VSAM records written after the last checkpoint key.
- Sequential output is regenerated. On restart, rewrite the output file from committed data.
- The restart table is the single source of truth. It records the last committed key, which tells you exactly where DB2 is consistent, where VSAM cleanup starts, and what sequential output to regenerate.
This three-layer coordination is why application-level checkpointing is more complex than it first appears — and why it's worth investing in a reusable framework rather than coding it ad hoc in every program.
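As a concrete reference point, here is one plausible DDL for the restart table this chapter describes. The column names follow the chapter's examples; the data types, lengths, and the database and tablespace names are illustrative:

```sql
CREATE TABLE RESTART_CONTROL
      (PROGRAM_NAME    CHAR(8)    NOT NULL,
       JOB_NAME        CHAR(8)    NOT NULL,
       STEP_NAME       CHAR(8)    NOT NULL,
       RUN_STATUS      CHAR(1)    NOT NULL,
       LAST_KEY_VALUE  CHAR(32)   NOT NULL WITH DEFAULT,
       RECORDS_READ    INTEGER    NOT NULL WITH DEFAULT,
       RECORDS_WRITTEN INTEGER    NOT NULL WITH DEFAULT,
       RECORDS_UPDATED INTEGER    NOT NULL WITH DEFAULT,
       RECORDS_ERROR   INTEGER    NOT NULL WITH DEFAULT,
       LAST_COMMIT_TS  TIMESTAMP  NOT NULL WITH DEFAULT,
       PRIMARY KEY (PROGRAM_NAME, JOB_NAME, STEP_NAME))
    IN DBRECON.TSRESTRT;
```

RUN_STATUS carries the states used throughout the chapter: 'S' (started), 'C' (checkpointed), 'E' (ended). Any accumulator that feeds end-of-job control totals belongs here as well.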
24.6 Multi-Step Job Checkpoint Strategy
Real batch jobs are not single steps. They are multi-step JCL jobs where each step depends on the output of the previous step. Checkpoint/restart must work at the job level, not just the step level.
Step-Level Restart with COND and IF/THEN/ELSE
JCL provides the COND parameter and IF/THEN/ELSE/ENDIF constructs for conditional step execution. When combined with restart, these control which steps execute on a restart run.
Consider a three-step job:
//JOBRECON JOB (ACCT),'DAILY RECON',CLASS=A,MSGCLASS=X
//*
//STEP010 EXEC PGM=EXTRACT,RD=R
//INPUT DD DSN=PROD.DAILY.TRANS,DISP=SHR
//OUTPUT DD DSN=PROD.EXTRACT.DATA,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(100,20))
//*
//STEP020 EXEC PGM=MATCH,RD=R
//INPUT DD DSN=PROD.EXTRACT.DATA,DISP=SHR
//MASTER DD DSN=PROD.ACCT.MASTER,DISP=SHR
//OUTPUT DD DSN=PROD.MATCHED.DATA,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(50,10))
//*
//STEP030 EXEC PGM=REPORT,RD=R
//INPUT DD DSN=PROD.MATCHED.DATA,DISP=SHR
//REPORT DD SYSOUT=*
If STEP020 abends, you want to restart from STEP020 — not from STEP010. You can specify this with the RESTART parameter:
//JOBRECON JOB (ACCT),'DAILY RECON',CLASS=A,MSGCLASS=X,
// RESTART=STEP020
But here's the problem: STEP020's input (PROD.EXTRACT.DATA) was created by STEP010. If STEP020 is restarted, STEP010 doesn't run, so the input dataset must already exist from the previous run. The DISP on STEP020's input DD must be SHR or OLD, not NEW.
For STEP010's output dataset, if it was created successfully in the first run, it still exists. The restart run skips STEP010, so the DISP=(NEW,...) on STEP010 is not executed. This works correctly.
The Passed Dataset Problem
If STEP010 passes the dataset to STEP020 using DISP=(NEW,PASS), a restart from STEP020 fails: passed datasets exist only for the life of the original job execution, and a resubmitted restart is a new job as far as the system is concerned. On restart, the passed dataset is gone.
Solution: For jobs that need checkpoint/restart, use cataloged datasets instead of passed datasets. The small overhead of cataloging is irrelevant compared to the restart capability you gain.
Multi-Step Restart Table Coordination
When multiple steps in a job all use application-level checkpointing with a restart table, the restart table must record state per step. This is why the restart table has a STEP_NAME column.
The job-level restart strategy:
- Each step reads its own row from the restart table using PROGRAM_NAME + JOB_NAME + STEP_NAME as the key.
- On fresh start, each step initializes its row to RUN_STATUS = 'S'.
- On restart, the scheduler restarts from the failed step. Previous steps' restart table rows still show RUN_STATUS = 'E' (completed), so if those steps accidentally re-execute, they quickly determine they already finished and exit with RC=0.
- On successful completion, each step sets RUN_STATUS = 'E'.
This means each step should include logic like:
2000-CHECK-IF-ALREADY-DONE.
PERFORM 2010-READ-RESTART-TABLE
IF RESTART-ENDED
DISPLAY 'STEP ALREADY COMPLETED - SKIPPING'
MOVE 0 TO RETURN-CODE
STOP RUN
END-IF
.
This is a safety net. The JCL restart should skip completed steps, but defense-in-depth means the program also checks.
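A sketch of what the 2010-READ-RESTART-TABLE paragraph might contain, using the three-part key (host variable names are illustrative):

```sql
EXEC SQL
    SELECT RUN_STATUS, LAST_KEY_VALUE,
           RECORDS_READ, RECORDS_WRITTEN,
           RECORDS_UPDATED, RECORDS_ERROR
      INTO :WS-RUN-STATUS, :WS-RESTART-KEY,
           :WS-RESTART-REC-READ, :WS-RESTART-REC-WRITTEN,
           :WS-RESTART-REC-UPDATED, :WS-RESTART-REC-ERROR
      FROM RESTART_CONTROL
     WHERE PROGRAM_NAME = :WS-PROGRAM-NAME
       AND JOB_NAME     = :WS-JOB-NAME
       AND STEP_NAME    = :WS-STEP-NAME
END-EXEC
```

A not-found SQLCODE (+100) here means the step has no registered row, which should be treated as a setup error rather than silently starting fresh.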
Conditional Execution and Restart
Modern JCL uses IF/THEN/ELSE for conditional execution:
// IF (STEP010.RC <= 4) THEN
//STEP020 EXEC PGM=MATCH
// ...
// ENDIF
On restart from STEP020, the IF condition is not re-evaluated — JES skips directly to the restart step. This is usually what you want. But be aware: if STEP010's return code influenced which path the job took, and you restart from a step inside a conditional block, the condition is assumed to be true.
The Job Completion Marker
At CNB, every multi-step batch job ends with a "completion marker" step:
//STEPFIN EXEC PGM=IEFBR14
//MARKER DD DSN=PROD.JOBRECON.COMPLETE.D&LYYMMDD,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(TRK,0)
This creates a zero-length dataset whose existence proves the job completed successfully. Downstream jobs check for this dataset before starting. If the job failed and was restarted, the marker is only created when all steps complete. This prevents downstream jobs from running on incomplete data.
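One way a downstream job can test for the marker, assuming it is not handled by the scheduler, is an IDCAMS LISTCAT step; LISTCAT sets a nonzero condition code when the entry is not cataloged. The dataset name and downstream program name below are illustrative:

```jcl
//CHKMARK  EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  LISTCAT ENTRIES(PROD.JOBRECON.COMPLETE.D240315)
/*
// IF (CHKMARK.RC = 0) THEN
//STEP010  EXEC PGM=DOWNSTRM
//* ... downstream processing ...
// ENDIF
```

In practice most shops let the scheduler enforce this dependency; the JCL check is a defense-in-depth backstop.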
24.7 Testing Checkpoint/Restart
You cannot trust checkpoint/restart logic that has never been tested. And yet, testing it is one of the most commonly skipped activities in mainframe development. The reason is simple: it's hard. You have to simulate failures, verify recovery, and confirm data consistency — all in an environment where failures are, by definition, abnormal.
The Test Plan
Every checkpoint/restart implementation needs a test plan that covers these scenarios:
Scenario 1: Normal completion — fresh start.
- Run the program from the beginning.
- Verify all records are processed.
- Verify the restart table shows RUN_STATUS = 'E'.
- Verify control totals are correct.

Scenario 2: Normal completion — restart after completion.
- Run the program again without resetting the restart table.
- Verify it detects the previous run completed (RUN_STATUS = 'E') and starts fresh.
- Verify results match Scenario 1.

Scenario 3: Failure after first checkpoint.
- Run the program, and after the first or second checkpoint, simulate a failure.
- Verify the restart table shows RUN_STATUS = 'C' with the correct key and counts.
- Restart the program.
- Verify it resumes from the checkpoint position.
- Verify final control totals match Scenario 1.

Scenario 4: Failure before first checkpoint.
- Run the program and simulate a failure before the first commit.
- Verify the restart table shows RUN_STATUS = 'S'.
- Restart the program.
- Verify it starts from the beginning (no checkpoint to resume from).
- Verify final results match Scenario 1.

Scenario 5: Multiple failures.
- Run the program, simulate failure, restart, simulate another failure, restart again.
- Verify the program handles consecutive restarts correctly.
- Verify final results match Scenario 1.

Scenario 6: Failure with VSAM and sequential coordination.
- Run the program, let it write to VSAM and sequential output, simulate failure.
- Verify VSAM cleanup occurs on restart (orphaned records deleted).
- Verify sequential output is correct after restart.
- Verify final results match Scenario 1.
Simulating Failures
There are several ways to simulate failures in a test environment:
Method 1: ABEND code in the program. Add a testing hook that abends after a configurable number of records:
01 WS-TEST-ABORT-AFTER PIC S9(09) COMP VALUE ZERO.
01 WS-ABEND-CODE PIC S9(09) COMP VALUE 1000.
01 WS-TIMING PIC S9(09) COMP VALUE 1.
...
4500-CHECK-TEST-ABORT.
IF WS-TEST-ABORT-AFTER > ZERO
AND WS-RESTART-REC-READ >= WS-TEST-ABORT-AFTER
DISPLAY 'TEST ABORT AFTER ' WS-RESTART-REC-READ
' RECORDS'
EXEC SQL ROLLBACK END-EXEC
CALL 'CEE3ABD' USING WS-ABEND-CODE WS-TIMING
END-IF
.
Pass the abort-after count via PARM: PARM='5000,7500' (commit frequency 5000, abort after 7500 records). In production, the second parameter is zero or omitted.
Method 2: DB2 DSNTEP2 to update the restart table. Between runs, use a DB2 utility to manipulate the restart table to simulate a mid-run state:
UPDATE RESTART_CONTROL
SET RUN_STATUS = 'C',
LAST_KEY_VALUE = '00050000',
RECORDS_READ = 50000,
RECORDS_WRITTEN = 48500,
RECORDS_UPDATED = 50000,
RECORDS_ERROR = 1500
WHERE PROGRAM_NAME = 'CBNC4500'
AND JOB_NAME = 'JOBRECON'
AND STEP_NAME = 'STEP010';
COMMIT;
Then run the program and verify it restarts from the simulated checkpoint.
Method 3: Cancel the job. Submit the job and cancel it while it's running. This simulates the most realistic failure mode: an unexpected termination. The downside is timing — you may not cancel it at the exact point you want.
Verifying Data Consistency
After every restart test, you must verify that the final results are identical to a clean run. This means:
- Record counts match. The total records processed (read, written, updated, error) must be identical whether the job ran cleanly or restarted five times.
- Control totals match. Accumulated amounts, hash totals, and balance figures must be identical.
- DB2 data matches. Run a query to compare the final state of all updated tables against a baseline from a clean run.
- Output files match. Compare the sequential output from a restart run against the output from a clean run. They should be byte-for-byte identical.
//VERIFY EXEC PGM=IEBCOMPR
//SYSUT1 DD DSN=PROD.CLEAN.RUN.OUTPUT,DISP=SHR
//SYSUT2 DD DSN=PROD.RESTART.RUN.OUTPUT,DISP=SHR
//SYSPRINT DD SYSOUT=*
//SYSIN DD DUMMY
If IEBCOMPR reports any differences, the checkpoint/restart logic has a bug.
Rob's Testing Rule
"If you haven't tested your checkpoint/restart by actually killing the job mid-run and restarting it, you haven't tested it. A code review doesn't count. A desk check doesn't count. Kill it. Restart it. Verify every number." — Rob Calloway
At CNB, no batch program with checkpoint/restart goes into production without a sign-off from operations that the restart was tested end-to-end. This is part of the production readiness checklist that Kwame instituted after the CBNC4500 incident.
Automated Restart Testing
For ongoing regression testing, CNB uses a testing harness that:
- Loads a known test dataset into DB2 and VSAM
- Runs the program cleanly to establish a baseline
- Runs the program with TEST-ABORT-AFTER set to various values (10%, 25%, 50%, 75%, 90% of input)
- Restarts after each abort
- Compares final results against the baseline
- Reports any discrepancies
This harness runs monthly as part of the batch regression test suite. It has caught three bugs since it was implemented — all in edge cases where the restart key handling was slightly wrong for boundary records.
24.8 Checkpoint/Restart in the HA Banking System
Now let's apply everything we've learned to the Progressive Project: the HA Banking Transaction Processing System. This section designs the checkpoint/restart strategy for the banking batch pipeline.
The HA Banking Batch Pipeline
The HA system processes daily banking transactions in a batch pipeline with these steps:
| Step | Program | Input | Output | DB2 Tables |
|---|---|---|---|---|
| STEP010 | HAEXTRACT | DB2 TRANS_STAGING | SEQ: TRANS.EXTRACT | Reads TRANS_STAGING |
| STEP020 | HAVALIDATE | SEQ: TRANS.EXTRACT | SEQ: VALID.TRANS + SEQ: REJECT.TRANS | Reads ACCT_MASTER (VSAM) |
| STEP030 | HAPOSTING | SEQ: VALID.TRANS | SEQ: POST.AUDIT | Updates ACCT_MASTER (VSAM), ACCT_BALANCE (DB2) |
| STEP040 | HAREPORT | DB2 ACCT_BALANCE, SEQ: POST.AUDIT | Report (SYSOUT) | Reads ACCT_BALANCE |
Each step has different checkpoint/restart requirements based on its resource access patterns.
STEP010: HAEXTRACT — DB2 to Sequential
This step reads from DB2 and writes to a sequential file. It does not update DB2.
Checkpoint strategy:
- Commit frequency: 10,000 (reading only, no lock concerns)
- Restart: rewrite the sequential output file from the checkpoint position
- Restart table: stores the last account number extracted
Since HAEXTRACT only reads DB2 (no updates), the commit frequency controls checkpoint interval, not lock duration. We can use a higher value.
On restart, the sequential output must be rewritten. Strategy: use DISP=(MOD,...) with careful byte-position tracking, or (simpler) regenerate the entire output from committed data. Since the input is a stable DB2 table and the extract is fast, regeneration is acceptable.
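The restartable extract cursor might be declared like this. TRANS_STAGING comes from the pipeline table above; the ACCT_NO and TRAN_DATA columns and the host variable are assumptions:

```sql
EXEC SQL
    DECLARE EXTRACT_CSR CURSOR FOR
    SELECT ACCT_NO, TRAN_DATA
      FROM TRANS_STAGING
     WHERE ACCT_NO > :WS-RESTART-KEY
     ORDER BY ACCT_NO
END-EXEC
```

On a fresh start, WS-RESTART-KEY is initialized below the lowest possible key, so the same cursor serves both fresh-start and restart runs.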
STEP020: HAVALIDATE — Sequential + VSAM Read
This step reads sequential input and reads (but does not write) VSAM. It writes two sequential output files.
Checkpoint strategy:
- This step is read-only for persistent stores — it reads sequential input and VSAM, writes sequential output.
- Commit frequency: N/A (no DB2 updates). Use the restart table for positioning only, committed every 5,000 records.
- Restart: re-read input from the last checkpoint position (skip forward), regenerate output files.
- The restart table stores the last input record sequence number.
For sequential input repositioning on restart:
3100-SKIP-TO-RESTART-POINT.
MOVE ZERO TO WS-SKIP-COUNT
PERFORM UNTIL WS-SKIP-COUNT >=
WS-RESTART-REC-READ
READ INPUT-FILE INTO WS-INPUT-RECORD
AT END
DISPLAY 'UNEXPECTED EOF DURING SKIP'
MOVE 16 TO WS-RETURN-CODE
PERFORM 9000-ABEND-HANDLER
END-READ
ADD 1 TO WS-SKIP-COUNT
END-PERFORM
DISPLAY 'SKIPPED ' WS-SKIP-COUNT
' RECORDS TO RESTART POINT'
.
STEP030: HAPOSTING — The Critical Step
This is the most complex step. It reads sequential input, updates VSAM (account master), and updates DB2 (account balance). It must coordinate all three resource types.
Checkpoint strategy:
- Commit frequency: 2,000 (updates DB2 and VSAM — lock duration matters)
- Lower commit frequency than other steps because this step updates both DB2 and VSAM, and the account balance rows are also accessed by online banking.
- Restart table stores: last transaction key processed, running totals for debits and credits, record counts.

VSAM coordination:
- On restart, the VSAM account master may have partial updates. Since HAPOSTING updates account balances by adding/subtracting amounts, the VSAM updates since the last checkpoint must be reversed.
- Strategy: store the keys and amounts of all VSAM updates since the last checkpoint in a DB2 staging table (committed with each checkpoint). On restart, reverse those updates before resuming.
Alternatively (and this is what the HA system uses):
- The VSAM account master stores a "last update timestamp."
- On restart, for any record updated after the last checkpoint timestamp, the program reverses the update using the before-image stored in the DB2 audit trail.
Sequential input repositioning uses the same skip-forward approach as STEP020. DB2 coordination is automatic — the COMMIT/ROLLBACK handles it.
STEP040: HAREPORT — Read-Only
This step only reads data and produces a report. No checkpoint/restart is needed — if it fails, rerun it from the beginning. It runs in under 10 minutes and produces no persistent output other than a report.
Decision: No checkpoint/restart for STEP040. This is a legitimate design choice. Not every step needs checkpointing. If a step is fast, read-only, and produces no persistent state changes, the overhead of checkpoint/restart logic is not justified.
The Complete JCL
//HABATCH JOB (ACCT),'HA DAILY BATCH',CLASS=A,MSGCLASS=X,
// NOTIFY=&SYSUID
//*
//* ---- STEP 1: EXTRACT TRANSACTIONS FROM DB2 ----
//*
//STEP010 EXEC PGM=HAEXTRACT,RD=R,
// PARM='10000'
//STEPLIB DD DSN=PROD.HA.LOADLIB,DISP=SHR
//SYSPRINT DD SYSOUT=*
//OUTPUT DD DSN=PROD.HA.TRANS.EXTRACT.D&LYYMMDD,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(200,50)),
// DCB=(RECFM=FB,LRECL=500,BLKSIZE=27500)
//SYSCHK DD DSN=PROD.HA.CHKPT.STEP010,
// DISP=(NEW,KEEP,KEEP),
// SPACE=(CYL,(2,2)),
// UNIT=SYSDA
//*
//* ---- STEP 2: VALIDATE TRANSACTIONS ----
//*
//STEP020 EXEC PGM=HAVALIDATE,RD=R,
// PARM='5000'
//STEPLIB DD DSN=PROD.HA.LOADLIB,DISP=SHR
//SYSPRINT DD SYSOUT=*
//INPUT DD DSN=PROD.HA.TRANS.EXTRACT.D&LYYMMDD,DISP=SHR
//ACCTMSTR DD DSN=PROD.HA.ACCT.MASTER,DISP=SHR
//VALIDOUT DD DSN=PROD.HA.VALID.TRANS.D&LYYMMDD,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(150,30)),
// DCB=(RECFM=FB,LRECL=500,BLKSIZE=27500)
//REJECTS DD DSN=PROD.HA.REJECT.TRANS.D&LYYMMDD,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(10,5)),
// DCB=(RECFM=FB,LRECL=600,BLKSIZE=27000)
//*
//* ---- STEP 3: POST TRANSACTIONS ----
//*
//STEP030 EXEC PGM=HAPOSTING,RD=R,
// PARM='2000'
//STEPLIB DD DSN=PROD.HA.LOADLIB,DISP=SHR
//SYSPRINT DD SYSOUT=*
//INPUT DD DSN=PROD.HA.VALID.TRANS.D&LYYMMDD,DISP=SHR
//ACCTMSTR DD DSN=PROD.HA.ACCT.MASTER,DISP=OLD
//POSTAUDT DD DSN=PROD.HA.POST.AUDIT.D&LYYMMDD,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(100,20)),
// DCB=(RECFM=FB,LRECL=400,BLKSIZE=27600)
//*
//* ---- STEP 4: GENERATE REPORTS ----
//*
//STEP040 EXEC PGM=HAREPORT,RD=NC
//STEPLIB DD DSN=PROD.HA.LOADLIB,DISP=SHR
//SYSPRINT DD SYSOUT=*
//INPUT DD DSN=PROD.HA.POST.AUDIT.D&LYYMMDD,DISP=SHR
//REPORT DD SYSOUT=*
//*
//* ---- COMPLETION MARKER ----
//*
//STEPFIN EXEC PGM=IEFBR14
//MARKER DD DSN=PROD.HA.BATCH.COMPLETE.D&LYYMMDD,
// DISP=(NEW,CATLG,DELETE),
// SPACE=(TRK,0)
Note the RD parameter values:
- STEP010, STEP020, STEP030: RD=R (checkpoint and restart enabled)
- STEP040: RD=NC (no checkpoint, no restart — read-only report step)
Recovery Scenarios
Scenario A: STEP030 abends at record 150,000 of 500,000.
1. DB2 automatically rolls back the current UR (records 150,001 to the failure point).
2. The restart table shows LAST_KEY_VALUE for the last committed checkpoint. With a commit frequency of 2,000, that is the nearest multiple of 2,000 at or below the failure point: 150,000 if the commit at that record completed, otherwise 148,000.
3. Operator restarts with RESTART=STEP030.
4. HAPOSTING reads the restart table, finds RUN_STATUS='C', resumes from the checkpoint.
5. VSAM cleanup: reverses any VSAM updates made after the last checkpoint.
6. Processing continues from record 150,001 (approximately).
7. Recovery time: minutes, not hours.
Scenario B: STEP010 abends due to DB2 space issue.
1. DBA resolves the space issue.
2. Operator restarts with RESTART=STEP010.
3. HAEXTRACT resumes from its last checkpoint.
4. STEP020, STEP030, STEP040 run after STEP010 completes.
Scenario C: STEP020 abends, but the operator doesn't notice until STEP030 has started (impossible with standard JCL, but consider automation errors).
1. The completion marker dataset does not exist.
2. Downstream jobs wait.
3. The operator investigates and finds STEP020 failed.
4. Restart from STEP020.
5. STEP030 re-executes because it depends on STEP020 output.
6. STEP030's restart table detects the fresh input and reinitializes.
24.9 Spaced Review: Connecting to Prior Chapters
This chapter builds on foundations laid in three earlier chapters. Let's explicitly connect them.
Chapter 4: Datasets — File Positioning
In Chapter 4, you learned about sequential and VSAM dataset organization — QSAM buffering, VSAM KSDS key access, and how z/OS manages file I/O. That knowledge is directly applied here:
- Sequential file repositioning on restart depends on understanding how QSAM reads work (Section 24.5).
- VSAM START and READ NEXT operations for restart repositioning use the KSDS keyed access path you learned in Chapter 4.
- The choice between KSDS, RRDS, and ESDS for checkpoint/restart compatibility depends on the access patterns covered in Chapter 4.
Review question: Why is a VSAM ESDS problematic for checkpoint/restart, while a KSDS handles it naturally? (Answer: ESDS has no key — you cannot directly position to a specific record. KSDS supports START with a key, enabling direct repositioning.)
Chapter 8: Locking — Commit Frequency vs. Lock Duration
Chapter 8 covered DB2 locking: lock modes (S, X, U, IS, IX), lock escalation, deadlock detection, and timeout handling. The commit frequency analysis in Section 24.4 is a direct application:
- Lock duration equals the time between commits. Short commit intervals mean short lock hold times.
- Lock escalation from row to page to tablespace occurs when too many individual locks are held. Frequent commits release locks and prevent escalation.
- SQLCODE -911 (deadlock/timeout) is exactly what killed Rob's CBNC4500. Frequent commits reduce the window for deadlocks.
Review question: If a batch program commits every 5,000 records and processes 200 records per second, what is the maximum lock hold time? (Answer: 5,000 / 200 = 25 seconds. Any row locked by this program is released within 25 seconds.)
Chapter 23: Batch Window — Restart Impact
Chapter 23 analyzed batch window management — scheduling, critical path analysis, and the consequences of overruns. Checkpoint/restart directly protects the batch window:
- A 4-hour job that fails at 3 hours without checkpointing needs 7+ hours total. With checkpointing, it needs ~4 hours 15 minutes.
- Restart recovery time is bounded by the commit interval: maximum recovery overhead is the time to reprocess one commit interval's worth of records.
- Multi-step job restart (Section 24.6) avoids re-running completed steps, further protecting the batch window.
Review question: If the batch window is 6 hours and a critical 4-hour job fails at the 3-hour mark, can it still complete within the window? (Answer: Without checkpointing, no — it needs 7 hours. With checkpointing at 5,000-record intervals, yes — it needs approximately 4 hours plus restart overhead of a few minutes.)
24.10 Common Mistakes and How to Avoid Them
Twenty-five years of mainframe batch has shown me these mistakes repeatedly. Learn from other people's failures.
Mistake 1: Committing the Restart Table Separately from Business Data
The bug: The program updates DB2 business tables, commits, then updates the restart table, then commits again. If the program fails between the two commits, the restart table doesn't reflect the committed business data. On restart, the program reprocesses records that were already committed — creating duplicates.
The fix: One COMMIT that covers both business data and the restart table. Always.
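A sketch of the correct shape: the business-table updates made earlier in the same unit of recovery and the restart-table update are sealed by a single COMMIT (paragraph and host variable names are illustrative):

```cobol
       4000-TAKE-CHECKPOINT.
           EXEC SQL
               UPDATE RESTART_CONTROL
                  SET RUN_STATUS      = 'C',
                      LAST_KEY_VALUE  = :WS-LAST-KEY,
                      RECORDS_READ    = :WS-RESTART-REC-READ,
                      RECORDS_WRITTEN = :WS-RESTART-REC-WRITTEN
                WHERE PROGRAM_NAME = :WS-PROGRAM-NAME
                  AND JOB_NAME     = :WS-JOB-NAME
                  AND STEP_NAME    = :WS-STEP-NAME
           END-EXEC
           EXEC SQL COMMIT END-EXEC
           .
```

There is exactly one COMMIT per checkpoint, and the restart-table UPDATE executes immediately before it, inside the same unit of recovery as the business updates.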
Mistake 2: Not Testing Restart After the Last Checkpoint
The bug: The program checkpoints at records 5000, 10000, 15000, and the last record is 17500. Testing only covers failure at exact checkpoint boundaries. No one tests failure at record 16200 — after the last checkpoint but before completion. The restart logic has a subtle bug in this case (e.g., it doesn't handle the partial batch of 1200 records correctly).
The fix: Test failure at non-checkpoint boundaries. Specifically test: before first checkpoint, at a checkpoint, between checkpoints, and after the last checkpoint but before completion.
Mistake 3: Forgetting to Preserve Accumulators
The bug: The program maintains running totals — total debit amount, total credit amount, transaction counts. On restart, it restores the record-processed count from the restart table but reinitializes the accumulators to zero. The end-of-job control totals are wrong — they only reflect records processed since the last restart.
The fix: Store all accumulators in the restart table. Every counter, every running total, every hash value that contributes to end-of-job reporting.
Mistake 4: Hardcoding the Commit Frequency
The bug: The commit frequency is a literal in the COBOL source: IF FUNCTION MOD(WS-RECORD-COUNT, 5000) = 0. To change it, you must modify source, compile, link-edit, and promote. On a night when the batch window is tight and you need a different commit frequency, you're stuck.
The fix: Read the commit frequency from PARM or a control table. Validate it within a reasonable range (100–100,000). Default to a sensible value if not provided.
Mistake 5: Not Handling the "Already Completed" Case
The bug: The restart table shows RUN_STATUS = 'E' (completed), but someone accidentally submits the job again. The program doesn't check — it processes all records again, creating duplicates.
The fix: On startup, if RUN_STATUS = 'E', either (a) treat it as a fresh start (reset everything and reprocess — appropriate if the input is idempotent) or (b) skip processing and exit with RC=0 (appropriate if the job should only run once per day). Choose based on your business rules, but always handle this case explicitly.
Mistake 6: Not Logging Checkpoint Information
The bug: The program takes checkpoints but doesn't write any messages to SYSPRINT or the job log. When the program restarts, operations has no way to confirm that it's actually resuming from a checkpoint. When something goes wrong, there's no audit trail of when checkpoints were taken and what state was saved.
The fix: Log every checkpoint: checkpoint number, key value, record counts, and timestamp. Log the restart detection at startup: fresh start or restart, and if restart, what key and counts are being restored. This logging is invaluable for production debugging and operator confidence.
Mistake 7: Sequential Output Without Regeneration Strategy
The bug: The program writes 100,000 records to a sequential output file, then fails. On restart, it appends the remaining records to the same file. The file now has 100,000 records that correspond to rolled-back DB2 changes, followed by the correct records from the restart point. The downstream job processes all records, including the orphaned first 100,000.
The fix: On restart, delete and recreate the sequential output file, then regenerate output from committed data. Or use a GDG approach where each checkpoint writes to a new generation.
Chapter Summary
Checkpoint/restart is not an optional enhancement for serious batch programs. It is a fundamental design requirement. The key principles:
- Design for recovery, not prevention. Accept that failures will happen. Design your program so that recovery is fast and automatic.
- Use application-level checkpointing. The z/OS checkpoint/restart facility is a useful safety net, but application-level checkpointing with a restart table gives you full control over DB2, VSAM, and sequential file coordination.
- Commit the restart table with the business data. One COMMIT, one unit of recovery. This is the atomicity guarantee that makes restart correct.
- Choose commit frequency based on tradeoffs. Balance recovery time, lock duration, log volume, and commit overhead. Start with 5,000 records and adjust based on measurement.
- Coordinate across all resource types. DB2 handles itself via COMMIT/ROLLBACK. VSAM needs explicit cleanup. Sequential files need regeneration. The restart table is the single source of truth.
- Test by actually killing the job. Code review is not enough. Run the program, kill it, restart it, and verify every number matches a clean run.
- Make it configurable. Commit frequency, test abort points, and restart behavior should be parameters, not compiled-in constants.
Rob's CBNC4500 incident cost CNB a late wire transfer, a conversation with the Fed, and three missed SLAs. The redesigned program — with application-level checkpointing at 5,000-record intervals, a restart table, and tested recovery procedures — has been restarted nine times in the seven years since. Average recovery time: 4 minutes. Zero missed SLAs.
That is the difference checkpoint/restart makes.