Exercises — Chapter 24: Checkpoint/Restart Design

Section 24.1 — Why Checkpoints Matter

Exercise 1: Cost of Failure Calculation

A batch program processes 8 million records in 3 hours and 20 minutes. It has no checkpoint/restart logic. The program fails at the 2-hour-45-minute mark.

a) How many records had been processed before the failure (assuming uniform processing rate)?
b) What is the total elapsed time from initial start to successful completion, assuming the rerun succeeds without errors?
c) If the batch window is 5 hours, will the job complete within the window?
d) If the program checkpointed every 10,000 records, approximately how long would recovery take?
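As a setup hint rather than a worked answer, the quantities in parts (a), (b), and (d) can be written symbolically. The symbols are introduced here for illustration only: N is the total record count, T the failure-free run time, t_f the elapsed time at failure, and c the checkpoint interval in records. The sketch assumes a uniform processing rate and ignores rollback time and job-resubmission delay.

```latex
\begin{align*}
r &= \frac{N}{T} && \text{(processing rate)} \\
n_f &= r \, t_f && \text{(records processed at the point of failure)} \\
T_{\text{total}} &= t_f + T && \text{(no checkpoints: failed run plus full rerun)} \\
t_{\text{redo}} &\le \frac{c}{r} && \text{(with checkpoints: work since the last checkpoint)}
\end{align*}
```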

Exercise 2: Failure Mode Analysis

List five distinct failure modes that could cause a batch COBOL/DB2 program to terminate abnormally. For each, explain:

a) Whether DB2 performs automatic rollback
b) Whether the z/OS checkpoint/restart facility can help
c) Whether application-level checkpointing can help

Exercise 3: Business Impact Assessment

You are asked to justify the development effort for adding checkpoint/restart to a batch program. The program runs nightly, processes 2 million records in 90 minutes, and has failed 4 times in the past year. Each failure required a complete rerun.

Write a one-page justification that quantifies the business value of checkpoint/restart. Include: time saved per failure, SLA impact, operational staff impact, and risk reduction.

Exercise 4: Checkpoint vs. No Checkpoint Comparison

Create a table comparing two versions of the same program — one with checkpoint/restart every 5,000 records and one without. The program processes 5 million records, takes 2 hours, and updates a DB2 table. Compare: maximum lock hold time, maximum rollback time on failure, maximum reprocessing time on failure, log volume per unit of recovery, and operator intervention needed for restart.

Section 24.2 — z/OS Checkpoint/Restart Facility

Exercise 5: RD Parameter Configuration

For each of the following scenarios, specify the appropriate RD parameter value and explain your reasoning:

a) A program that processes a 50-volume tape dataset and should checkpoint at each volume boundary
b) A program that manages its own application-level checkpointing and should not use system checkpoints
c) A program that should allow both system checkpoints and automatic restart
d) A test program where you want to disable all checkpoint/restart behavior

Exercise 6: CHKPT and RESTART JCL

Write the JCL for:

a) A job step that enables system-level checkpointing with a checkpoint dataset
b) The JOB statement to restart the same job from the third checkpoint of step STEP020
c) A job with SYSCKEOV for a multi-volume input dataset

Exercise 7: System Facility Limitations

A colleague proposes using only the z/OS checkpoint/restart facility (CHKPT macro, RD parameter) for a program that reads a sequential file, updates a DB2 table, and writes to a VSAM KSDS. Explain why this approach is insufficient. Identify specifically which resources are not protected and what inconsistencies could result on restart.

Exercise 8: System vs. Application Checkpointing

Create a comparison matrix showing the capabilities of system-level checkpointing (CHKPT/RD) versus application-level checkpointing (restart table) across these dimensions: DB2 transaction coordination, VSAM file repositioning, sequential file repositioning, working storage preservation, automatic restart capability, operator intervention required, and cross-step coordination.

Section 24.3 — Application-Level Checkpointing

Exercise 9: Restart Table Design

Design a restart table for a program that:

- Reads from two DB2 input tables (ORDERS and ORDER_LINES)
- Writes to a VSAM output file (ORDER_SUMMARY)
- Maintains three running totals: total order amount, total line item count, total discount amount
- Processes orders by ORDER_DATE, then by ORDER_ID within each date

Include all columns, data types, and the primary key. Explain why you chose each column.

Exercise 10: Restart Logic Flowchart

Draw a detailed flowchart (or write pseudocode) for the initialization logic of a checkpoint/restart-enabled program. The flowchart must handle all four cases:

a) First run ever (no row in restart table)
b) Previous run completed successfully (RUN_STATUS = 'E')
c) Previous run was interrupted after at least one checkpoint (RUN_STATUS = 'C')
d) Previous run was interrupted before any checkpoint (RUN_STATUS = 'S')

Exercise 11: Cursor Repositioning

Write the DB2 cursor declaration and OPEN logic for a program that reads from an EMPLOYEE table ordered by DEPT_ID, EMPLOYEE_ID. The cursor must support both fresh start (read all rows) and restart (read rows after the last checkpoint key). The restart key is a composite: DEPT_ID + EMPLOYEE_ID.
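One common repositioning pattern is sketched below as a starting point, not as the only acceptable answer. The names EMPCSR, WS-RESTART-FLAG, WS-RST-DEPT, WS-RST-EMP, WS-SQLCODE-DISP, and 9999-ABEND are hypothetical, as are the column sizes. The sketch assumes both key columns are character columns, that the restart key has already been loaded from the restart table when WS-RESTART-FLAG is 'Y', and that the SQLCA is included elsewhere in the program.

```cobol
       WORKING-STORAGE SECTION.
      * Hypothetical names and sizes; match them to the real table.
       01  WS-RESTART-FLAG         PIC X     VALUE 'N'.
       01  WS-RST-DEPT             PIC X(4).
       01  WS-RST-EMP              PIC X(6).
       01  WS-SQLCODE-DISP         PIC -9(9).

      * WITH HOLD keeps the cursor open across each COMMIT.
           EXEC SQL
             DECLARE EMPCSR CURSOR WITH HOLD FOR
               SELECT DEPT_ID, EMPLOYEE_ID
               FROM EMPLOYEE
               WHERE DEPT_ID > :WS-RST-DEPT
                  OR (DEPT_ID = :WS-RST-DEPT
                      AND EMPLOYEE_ID > :WS-RST-EMP)
               ORDER BY DEPT_ID, EMPLOYEE_ID
           END-EXEC.

       PROCEDURE DIVISION.
       2000-OPEN-CURSOR.
      * Fresh start: LOW-VALUES compares below every real key, so
      * the predicate selects all rows. On restart the host
      * variables already hold the last committed key.
           IF WS-RESTART-FLAG = 'N'
               MOVE LOW-VALUES TO WS-RST-DEPT
               MOVE LOW-VALUES TO WS-RST-EMP
           END-IF
           EXEC SQL OPEN EMPCSR END-EXEC
           IF SQLCODE NOT = 0
               MOVE SQLCODE TO WS-SQLCODE-DISP
               DISPLAY 'OPEN EMPCSR FAILED, SQLCODE=' WS-SQLCODE-DISP
      *        9999-ABEND is an assumed shop-standard abend routine.
               PERFORM 9999-ABEND
           END-IF.
```

A single cursor with a sentinel key keeps the fresh-start and restart paths identical after the OPEN. A design with two separate cursors is an equally valid answer and avoids relying on a sentinel value.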

Exercise 12: Atomicity Violation Bug

The following code has a critical bug related to checkpoint atomicity. Identify the bug and write the corrected version:

       5000-TAKE-CHECKPOINT.
           EXEC SQL COMMIT END-EXEC.

           EXEC SQL
             UPDATE RESTART_CONTROL
             SET LAST_KEY_VALUE = :WS-LAST-KEY,
                 RECORDS_READ = :WS-REC-READ,
                 RUN_STATUS = 'C'
             WHERE PROGRAM_NAME = :WS-PGM-NAME
           END-EXEC.

           EXEC SQL COMMIT END-EXEC.

Exercise 13: Complete Checkpoint Paragraph

Write a complete COBOL paragraph (5000-TAKE-CHECKPOINT) that:

a) Updates the restart table with current state
b) Checks the SQLCODE
c) Issues COMMIT
d) Checks the COMMIT SQLCODE
e) Logs the checkpoint information to SYSPRINT
f) Resets the records-since-commit counter
g) Handles errors by calling an abend routine

Exercise 14: Multi-Key Restart

A program processes records keyed by a composite key: REGION (CHAR(2)), BRANCH (CHAR(4)), ACCOUNT (CHAR(10)). Design the restart table columns and write the cursor declaration that supports restart from a composite key. Handle the case where the restart is at a region or branch boundary.

Exercise 15: Job Name Retrieval

Write the COBOL code to dynamically retrieve the current job name and step name at runtime. Show two methods: (a) using ACCEPT FROM, and (b) using the LE callable service CEE3GRN. Explain when you would use each approach.

Section 24.4 — DB2 Commit Frequency Analysis

Exercise 16: Commit Frequency Calculation

A batch program updates rows in a table with an average row length of 350 bytes. The DB2 active log datasets are 2 GB each (two copies, dual logging). The shop standard requires that no single unit of recovery use more than 5% of an active log.

a) Calculate the maximum number of rows that can be updated in a single UR.
b) What commit frequency satisfies this requirement?
c) If the program processes 500 records per second, what is the maximum lock hold time at this commit frequency?
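The relationships involved can be set up symbolically before plugging in numbers. The symbols are illustrative and not from the chapter: L is the size of one active log dataset in bytes, p the permitted fraction per unit of recovery, b the log bytes written per updated row, and r the processing rate in rows per second. Note that b is an assumption you must state in your answer, since it depends on how much of each row is logged plus log record overhead; it is not simply the row length.

```latex
\begin{align*}
n_{\max} &= \left\lfloor \frac{p \, L}{b} \right\rfloor && \text{(maximum rows per unit of recovery)} \\
t_{\text{lock}} &\approx \frac{n_{\text{commit}}}{r} && \text{(longest time a lock is held between commits)}
\end{align*}
```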

Exercise 17: Commit Overhead Measurement Plan

You need to determine the optimal commit frequency for a specific program. Design a measurement plan that:

a) Tests at least five different commit frequencies
b) Measures CPU time, elapsed time, number of I/O operations, and lock wait time for each
c) Runs on a representative dataset
d) Accounts for concurrent workload

Specify the exact metrics you would collect, the tools you would use (DB2 accounting trace, SMF records, RMF), and how you would analyze the results.

Exercise 18: Lock Escalation Prevention

A program commits every 25,000 records. Each record update acquires a row-level X lock. The DB2 LOCKMAX parameter for the tablespace is set to 10,000 locks. What happens when the program reaches record 10,001? How would you prevent this? Provide two solutions.

Exercise 19: Configurable Commit Frequency

Write the COBOL PROCEDURE DIVISION code to:

a) Accept a commit frequency from the PARM field
b) Validate it is between 500 and 50,000
c) Default to 5,000 if no PARM is provided or if validation fails
d) Display the effective commit frequency to SYSPRINT
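The following is a minimal sketch of one way to approach this, assuming the program is invoked directly from an EXEC PGM= statement so that the PARM arrives as a halfword length followed by text. The program name PARMFREQ and all data names are hypothetical. DISPLAY output goes to the SYSOUT DD by default; routing it to SYSPRINT would need the OUTDD compiler option or an explicit print file, which this sketch does not show.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. PARMFREQ.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * The default applies unless a valid PARM overrides it.
       01  WS-COMMIT-FREQ          PIC 9(5)  VALUE 5000.
       01  WS-PARM-NUM             PIC 9(5)  VALUE 0.
       LINKAGE SECTION.
      * z/OS passes the PARM as a halfword length followed by text.
       01  LS-PARM.
           05  LS-PARM-LEN         PIC S9(4) COMP.
           05  LS-PARM-TEXT        PIC X(100).
       PROCEDURE DIVISION USING LS-PARM.
       0000-MAIN.
           PERFORM 1000-SET-COMMIT-FREQ
           GOBACK.

       1000-SET-COMMIT-FREQ.
      * Replace the default only when the PARM is 1 to 5 digits
      * and the value falls within the allowed range.
           IF LS-PARM-LEN > 0 AND LS-PARM-LEN < 6
               IF LS-PARM-TEXT(1:LS-PARM-LEN) IS NUMERIC
                   COMPUTE WS-PARM-NUM =
                       FUNCTION NUMVAL(LS-PARM-TEXT(1:LS-PARM-LEN))
                   IF WS-PARM-NUM >= 500 AND WS-PARM-NUM <= 50000
                       MOVE WS-PARM-NUM TO WS-COMMIT-FREQ
                   END-IF
               END-IF
           END-IF
           DISPLAY 'EFFECTIVE COMMIT FREQUENCY: ' WS-COMMIT-FREQ.
```

Initializing WS-COMMIT-FREQ to the default and overwriting it only on success means every failure path (missing PARM, non-numeric text, out-of-range value) falls through to the same default without separate error branches.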

Exercise 20: Log Volume Estimation

A program reads 12 million rows from TABLE_A and for each row, inserts one row into TABLE_B (avg 200 bytes) and updates one row in TABLE_C (avg 400 bytes, but only 50 bytes of each row change). Estimate the total log volume generated at commit frequencies of 1,000, 5,000, and 25,000 records. Consider both before-images and after-images.

Section 24.5 — VSAM and Sequential File Checkpoint/Restart

Exercise 21: VSAM KSDS Restart

Write the COBOL code for a paragraph that repositions a VSAM KSDS input file to the record after the last checkpoint key. Include error handling for the case where the key is not found (the record was deleted between the checkpoint and restart).

Exercise 22: VSAM Output Cleanup

A program writes records to a VSAM KSDS output file. Each output record contains a PROCESS_TIMESTAMP field. On restart, records written after the last checkpoint must be deleted.

Write the COBOL paragraphs to:

a) Read the checkpoint timestamp from the restart table
b) Position to the first record written after that timestamp
c) Delete all records from that point forward
d) Log the number of orphaned records deleted

Exercise 23: Sequential Output Strategy Selection

For each of the following scenarios, recommend the best sequential output restart strategy (regeneration, GDG, or byte-position tracking) and justify your choice:

a) A report generation program that writes a 500,000-line report
b) A file transfer program that creates a 2 GB transmission file for an external partner
c) A data warehouse load program that creates extract files consumed by an ETL tool
d) A regulatory filing program that creates a precisely formatted file with header and trailer records containing record counts

Exercise 24: Three-Resource Coordination

A program reads from a DB2 cursor, updates a VSAM KSDS master file, and writes to a sequential output file. On restart after failure:

a) Which resource is automatically rolled back? Explain why.
b) Which resource may have orphaned records? Describe how to clean them up.
c) Which resource needs to be regenerated? Describe the strategy.
d) Write pseudocode for the complete restart initialization sequence that coordinates all three resources.

Exercise 25: ESDS Workaround

Your shop has a legacy VSAM ESDS file that is used as input to a batch program requiring checkpoint/restart. You cannot change the file type. Design a checkpoint/restart strategy that handles the ESDS limitation. Consider using a record counter, and discuss the performance implications of the skip-forward approach for large files.

Section 24.6 — Multi-Step Job Checkpoint Strategy

Exercise 26: Multi-Step JCL Design

Design a four-step batch job with proper checkpoint/restart configuration:

Step 1: Extract (reads DB2, writes sequential)
Step 2: Sort (utility sort of extract file)
Step 3: Process (reads sorted file, updates DB2 and VSAM)
Step 4: Report (reads DB2, writes report)

For each step, specify: the RD parameter, the DISP parameters for input/output datasets, whether the step needs application-level checkpointing, and the restart strategy if that step fails.

Exercise 27: Passed Dataset Conversion

Convert the following JCL fragment from passed datasets to cataloged datasets, maintaining the ability to restart from any step:

//STEP010  EXEC PGM=EXTRACT
//OUTPUT   DD DSN=&&TEMPEXT,DISP=(NEW,PASS),
//            SPACE=(CYL,(50,10))
//STEP020  EXEC PGM=PROCESS
//INPUT    DD DSN=&&TEMPEXT,DISP=(OLD,DELETE)
//OUTPUT   DD DSN=&&TEMPOUT,DISP=(NEW,PASS),
//            SPACE=(CYL,(30,10))
//STEP030  EXEC PGM=REPORT
//INPUT    DD DSN=&&TEMPOUT,DISP=(OLD,DELETE)

Exercise 28: Step Completion Guard

Write a COBOL paragraph that checks the restart table at the beginning of a program to determine if this step already completed successfully in a prior run of the same job. If it did, the program should display a message and exit with return code 0 without processing any records.

Exercise 29: Completion Marker Design

Design a completion marker strategy for a job with five steps, where steps 3, 4, and 5 can run in parallel (using JES2 job networking or a scheduler). The downstream process must not begin until all five steps are complete. Show the JCL and explain how partial completion is detected.

Section 24.7 — Testing Checkpoint/Restart

Exercise 30: Test Scenario Matrix

Create a complete test scenario matrix for a checkpoint/restart-enabled program. Include at least 10 scenarios, covering: fresh start, restart at various points, multiple restarts, failure before first checkpoint, failure after last checkpoint, failure during COMMIT processing, concurrent job conflict, and full regression comparison.

For each scenario, specify: preconditions, test steps, expected results, and verification method.

Exercise 31: Test Abort Hook

Write the COBOL code for a test abort hook that:

a) Reads an abort-after record count from the PARM field (second parameter, comma-separated)
b) Checks after each record whether the threshold has been reached
c) If the threshold is reached, issues ROLLBACK and calls CEE3ABD to abend with user abend code 999
d) Is completely inactive when the parameter is zero or not provided

Exercise 32: Automated Restart Verification

Design a JCL procedure (PROC) that automates restart testing:

Step 1: Load test data into DB2 and VSAM
Step 2: Run the program cleanly to establish a baseline
Step 3: Run the program with abort at 25% of input
Step 4: Restart the program
Step 5: Compare final DB2 data against baseline (using DSNTEP2 queries)
Step 6: Compare output files against baseline (using IEBCOMPR or SUPERC)
Step 7: Report results

Show the JCL skeleton and explain how each verification step works.

Exercise 33: Edge Case Identification

A program processes records where the key is ACCOUNT_NUMBER (10-digit numeric, ascending). The program commits every 5,000 records. Identify five edge cases that could cause restart to fail or produce incorrect results. For each, describe the scenario, explain why it's problematic, and propose a fix.

Exercise 34: Restart Table Forensics

You are troubleshooting a batch program that failed and restarted but produced incorrect results. The restart table shows:

PROGRAM_NAME: CBNC8100
JOB_NAME:     NIGHTBAT
STEP_NAME:    STEP030
LAST_KEY:     0005847200
REC_READ:     584720
REC_WRITTEN:  571903
REC_UPDATED:  584720
REC_ERROR:    12817
RUN_STATUS:   C
ACCUM_1:      125847293.45
CHECKPOINT_TS: 2026-03-14-03.27.45.123456

The program was restarted and completed, but the end-of-job report shows a total record count of 1,584,720 instead of the expected 1,000,000. What went wrong? How would you investigate? What fix would prevent this in the future?

Exercise 35: Performance Impact Assessment

A program currently runs in 2 hours with no checkpoint/restart logic. Estimate the performance impact of adding checkpoint/restart with a commit frequency of 5,000 records. The program processes 3 million records and updates one DB2 table per record. Consider: COMMIT overhead, restart table UPDATE overhead, log write frequency, and any changes to the access path. Provide your estimate as a percentage increase in CPU time and elapsed time.

Comprehensive Exercises

Exercise 36: Full Program Design

Design (write pseudocode or COBOL outline for) a complete checkpoint/restart-enabled batch program that:

- Reads a sequential input file of customer payments (1.5 million records)
- Validates each payment against a VSAM KSDS customer master file
- Updates the customer balance in a DB2 table
- Writes accepted payments to a sequential output file
- Writes rejected payments to a separate sequential output file
- Maintains running totals of: accepted count/amount, rejected count/amount, total count

Include: restart table design, initialization logic, main processing loop, checkpoint logic, VSAM repositioning on restart, sequential file handling on restart, and termination logic.

Exercise 37: Checkpoint Frequency Optimization

You have a program that can run with commit frequencies of 1,000, 2,500, 5,000, 10,000, and 25,000. You've collected the following test data:

Commit Freq   CPU Time   Elapsed Time   Max Lock Time   Avg Concurrent Wait   Restarts in 1yr
1,000         47 min     2h 35m         3 sec           0.1 sec               4
2,500         44 min     2h 22m         8 sec           0.3 sec               4
5,000         43 min     2h 18m         15 sec          0.8 sec               4
10,000        42 min     2h 15m         30 sec          2.1 sec               4
25,000        41 min     2h 12m         75 sec          8.5 sec               4

The batch window is 4 hours. Online transactions time out after 30 seconds. The program has historically failed 4 times per year. Which commit frequency would you recommend? Show your analysis, including recovery time for each option.

Exercise 38: Legacy Program Assessment

You inherit a 3,000-line COBOL batch program written in 1994 that has no checkpoint/restart logic. It processes 10 million records nightly, runs for 4 hours, updates two DB2 tables and one VSAM file, and writes one sequential output file. The batch window is 5 hours.

Create a detailed plan to add checkpoint/restart. Include: restart table design, where to add checkpoint calls (which paragraph), how to handle each file type on restart, what COBOL changes are needed, what JCL changes are needed, what testing is required, and an estimate of the development effort.

Exercise 39: Disaster Recovery Scenario

Your data center has a planned failover to the DR site. The nightly batch cycle was in progress when the failover was initiated. Three jobs were running:

- Job A: at step 3 of 5, had checkpointed 12 times
- Job B: at step 1 of 3, had checkpointed 3 times
- Job C: at step 2 of 4, had not yet checkpointed (still in first commit interval)

All three jobs have application-level checkpointing with restart tables in DB2. DB2 data was replicated to the DR site via GDPS with an RPO of 2 minutes. Describe the recovery procedure for each job at the DR site. What assumptions are you making about the restart table data?

Exercise 40: Checkpoint/Restart Framework Design

Design a reusable checkpoint/restart framework that can be INCLUDEd (via COPY) into any COBOL batch program. The framework should provide:

- Standard restart table layout and SQL
- Initialization paragraph (handles fresh start and restart)
- Checkpoint paragraph (commits business data and restart table atomically)
- Termination paragraph (marks completion)
- Error handling paragraph
- Configurable commit frequency via PARM

Write the COPY member(s) and the standard paragraphs. Show how a developer would integrate the framework into a new program by coding only the business-specific logic.