Further Reading — Chapter 24: Checkpoint/Restart Design

IBM Official Documentation

z/OS MVS JCL Reference — Checkpoint/Restart

  • Publication: SA23-1385 (z/OS V2R5)
  • Relevant sections: RESTART parameter on the JOB statement; RD parameter on JOB and EXEC statements; CHKPT parameter on the DD statement; SYSCKEOV and SYSCHK DD statements
  • Why read it: The definitive reference for system-level checkpoint/restart: RD parameter values, RESTART syntax, and the interaction between checkpoint data sets and job restart. Read the examples carefully; the edge cases around conditional execution and restart are subtle.

z/OS MVS Programming: Authorized Assembler Services Guide

  • Publication: SA23-1372
  • Relevant sections: Chapter on Checkpoint/Restart services; CHKPT macro parameters and return codes
  • Why read it: If you need to understand what the system checkpoint facility actually saves and restores — and its precise limitations — this is the source. The sections on what is NOT saved (DB2 state, VSAM positions, cross-memory structures) are as important as what is saved.

DB2 for z/OS Administration Guide — Recovery

  • Publication: SC27-8844 (DB2 13)
  • Relevant sections: Unit of recovery concepts; COMMIT and ROLLBACK processing; Active log management; Forward and backward recovery
  • Why read it: Understanding how DB2 manages units of recovery is essential for designing commit frequency and restart logic. The sections on active log sizing and the relationship between UR size and log volume are directly applicable to Section 24.4's commit frequency analysis.

DB2 for z/OS Application Programming and SQL Guide

  • Publication: SC27-8845 (DB2 13)
  • Relevant sections: COMMIT and ROLLBACK in application programs; Cursor positioning after COMMIT; Lock duration and commit scope; Distributed unit of work
  • Why read it: The application programming guide explains how COMMIT affects open cursors (they are closed unless declared WITH HOLD), how locks are released, and how to design programs that commit efficiently. The section on WITH HOLD cursors is particularly relevant for checkpoint/restart programs that want to avoid cursor close/reopen overhead.

z/OS DFSMS Using Data Sets — Checkpoint/Restart with VSAM

  • Publication: SC23-6855
  • Relevant sections: VSAM processing considerations for checkpoint/restart; KSDS repositioning; ESDS limitations
  • Why read it: This manual explains the specific challenges of VSAM files in a checkpoint/restart context. The sections on why VSAM does not participate in system-level checkpoint and how to handle VSAM repositioning programmatically are directly relevant.

Books

z/OS JCL by Gary DeWard Brown (5th Edition)

  • Publisher: Wiley
  • Relevant chapters: Chapter on restart and recovery; JCL parameter reference for RD and RESTART
  • Why read it: Brown's book is the standard reference for z/OS JCL. His treatment of checkpoint/restart is practical and example-driven. The chapter on restart includes real-world scenarios and common mistakes. If you maintain JCL for production batch systems, this book should be on your desk.

DB2 for z/OS and OS/390: Ready for Java by Sloan, Hernandez

  • Note: While focused on Java/DB2, the chapters on transaction management, commit scope, and recovery concepts apply equally to COBOL/DB2 batch programs.

Enterprise COBOL for z/OS: Programming Guide

  • Publisher: IBM (SC27-8714)
  • Relevant sections: File I/O considerations; ACCEPT statement for retrieving job information; LE callable services for abend handling
  • Why read it: The official COBOL programming guide covers the language features used in checkpoint/restart: the ACCEPT statement for retrieving job and step names, the START statement for VSAM repositioning, and LE callable services (CEE3ABD) for controlled abends during testing.

IBM Redbooks

DB2 for z/OS: Data Sharing in a Nutshell (SG24-8481)

  • Relevant sections: Commit frequency in a data sharing environment; Lock contention across members; Group buffer pool considerations
  • Why read it: If your shop runs DB2 data sharing (multiple DB2 members sharing the same data), commit frequency has additional implications. Locks are global, and commit frequency affects not just your program but programs running on other DB2 members. This Redbook explains the data sharing context for commit strategy.

Batch Modernization on z/OS (SG24-7779)

  • Relevant sections: Checkpoint/restart patterns; Modern batch architectures; Parallel batch processing with restart
  • Why read it: This Redbook covers modern approaches to batch processing, including checkpoint/restart in the context of parallel batch, multi-step coordination, and integration with workload automation tools. The patterns described align with the application-level checkpointing approach in Section 24.3.

DB2 for z/OS: Diagnosed Faster Batch (SG24-8336)

  • Relevant sections: Commit frequency tuning; Log volume management; Batch performance diagnosis
  • Why read it: Focused specifically on DB2 batch performance. The sections on commit frequency tuning include measurement techniques, analysis frameworks, and real-world examples that extend the commit frequency analysis in Section 24.4.

Technical Articles and Papers

"Designing Restartable Batch Applications" — IBM Developer

  • Availability: IBM Developer (developer.ibm.com)
  • Why read it: A practical article with code examples for restartable batch COBOL programs. Covers the restart table pattern, cursor repositioning, and VSAM handling. Includes a downloadable sample program.
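
The restart table pattern mentioned above can be sketched in a few lines. This is a minimal illustration only, written in Python against an in-memory SQLite database rather than COBOL and DB2, and it is not taken from the article. The table names (accounts, restart_ctl), the job name, and the fail_after test hook are all hypothetical. The point it shows is that the restart position is updated in the same unit of work as the business data, so a rollback can never leave the two out of step.

```python
import sqlite3


def make_demo_db(n_rows):
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (acct_id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("CREATE TABLE restart_ctl (job_name TEXT PRIMARY KEY, last_key INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, 0)",
                     [(i,) for i in range(1, n_rows + 1)])
    conn.commit()
    return conn


def run_batch(conn, job_name, commit_every=3, fail_after=None):
    """Process accounts in key order; return the rows processed in this run.

    fail_after is a hypothetical test hook: after that many rows the
    uncommitted work is rolled back and an exception simulates an abend.
    """
    cur = conn.cursor()

    # Restart logic: resume after the last committed key, or start at 0.
    row = cur.execute("SELECT last_key FROM restart_ctl WHERE job_name = ?",
                      (job_name,)).fetchone()
    if row is None:
        cur.execute("INSERT INTO restart_ctl (job_name, last_key) VALUES (?, 0)",
                    (job_name,))
        conn.commit()
        last_key = 0
    else:
        last_key = row[0]

    # Fetch the remaining keys up front so the loop does not rely on a
    # cursor staying open across commits.
    pending = cur.execute(
        "SELECT acct_id FROM accounts WHERE acct_id > ? ORDER BY acct_id",
        (last_key,)).fetchall()

    processed = 0
    for (acct_id,) in pending:
        if fail_after is not None and processed == fail_after:
            conn.rollback()
            raise RuntimeError("simulated abend")
        # Business update and restart position share one unit of work.
        cur.execute("UPDATE accounts SET balance = balance + 1 WHERE acct_id = ?",
                    (acct_id,))
        cur.execute("UPDATE restart_ctl SET last_key = ? WHERE job_name = ?",
                    (acct_id, job_name))
        processed += 1
        if processed % commit_every == 0:
            conn.commit()

    conn.commit()  # commit the final partial interval
    return processed
```

Running the job once with fail_after set, then again without it, should leave every row updated exactly once: the second run starts after the last committed key and skips nothing that was rolled back.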

"Commit Frequency: Finding the Sweet Spot" — IDUG (International DB2 Users Group)

  • Availability: IDUG conference proceedings and online content library
  • Why read it: An in-depth analysis of commit frequency tradeoffs with empirical data from production systems. Includes formulas for calculating optimal commit frequency based on row size, processing rate, lock timeout thresholds, and active log capacity.
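
As a rough illustration of how such inputs combine, the sketch below computes an upper bound on rows per commit from two of the constraints listed above. It is a back-of-envelope assumption, not the formula from the IDUG material: the function name, the parameter names, and the idea of taking the tighter of a lock-hold limit and a log-volume limit are all hypothetical.

```python
def max_rows_per_commit(rows_per_second, max_lock_hold_seconds,
                        log_bytes_per_row, log_budget_bytes):
    """Return an assumed upper bound on rows processed between commits.

    Two hypothetical constraints are applied:
      - lock time: rows processed before locks have been held longer than
        max_lock_hold_seconds (chosen below the lock timeout threshold)
      - log volume: rows processed before one unit of recovery writes more
        than log_budget_bytes of log data
    The tighter constraint wins; the result is never less than 1.
    """
    by_lock_time = int(rows_per_second * max_lock_hold_seconds)
    by_log_volume = int(log_budget_bytes // log_bytes_per_row)
    return max(1, min(by_lock_time, by_log_volume))
```

For example, at 2,000 rows per second with a 5-second lock-hold limit, 400 bytes of log per row, and an 8 MB log budget, the lock-time limit (10,000 rows) is tighter than the log limit (20,000 rows). Any real value should be validated by measurement.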

"Checkpoint/Restart Patterns for Enterprise Batch" — IEEE Computer Society

  • Why read it: An academic treatment of checkpoint/restart that covers the theoretical foundations: consistency models, idempotency requirements, and the relationship between checkpoint frequency and expected recovery time under various failure distributions.
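
One widely cited result in this area is Young's first-order approximation, which puts the checkpoint interval that minimizes expected lost time at roughly sqrt(2 × checkpoint cost × mean time between failures). The sketch below computes it. It is offered as a general illustration under that approximation's assumptions (checkpoint cost small relative to the time between failures), not as the model used in the paper above; the function and parameter names are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Approximate optimal time between checkpoints, in seconds.

    Uses Young's first-order approximation: sqrt(2 * C * M), where C is
    the cost of taking one checkpoint and M is the mean time between
    failures. Both inputs must be positive.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)
```

The square-root shape is the useful intuition: if checkpoints become four times as expensive, the suggested interval only doubles.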

Vendor Documentation for Workload Automation

IBM Workload Automation (IWS/TWS) Documentation

  • Relevant sections: Automatic restart and cleanup; Conditional job stream processing; Step-level restart integration
  • Why read it: Modern batch environments use workload automation tools that can automatically restart failed jobs from the appropriate step. Understanding how these tools interact with application-level checkpoint/restart is essential for production operations. The documentation explains how to configure automatic restart policies, how the scheduler determines the restart point, and how to integrate application restart logic with scheduler restart logic.

CA AutoSys / BMC Control-M Documentation

  • Relevant sections: Job restart policies; Recovery options; Step restart configuration
  • Why read it: If your shop uses a third-party scheduler, the restart integration may differ from IBM's native tools. The vendor documentation explains how the scheduler detects failures, how it determines restart points, and how to configure cleanup actions that run before a restart.

Forward Recovery and Point-in-Time Recovery

  • The IBM manual DB2 for z/OS Utility Guide and Reference (SC27-8846) covers the RECOVER utility, which performs forward recovery using log records. Understanding forward recovery complements the backward recovery (rollback) concepts covered in this chapter.

Two-Phase Commit and Distributed Transactions

  • If your batch programs access multiple resource managers (e.g., DB2 and IBM MQ, or DB2 and IMS), the commit/restart logic must account for two-phase commit coordination. The DB2 for z/OS Administration Guide sections on Resource Recovery Services (RRS) and distributed units of work explain how z/OS coordinates commit across multiple resource managers.

CICS Batch and Transactional VSAM

  • For shops running batch under CICS, the CICS Transaction Server Application Programming Guide covers SYNCPOINT (the CICS equivalent of COMMIT), recoverable VSAM files, and transactional coordination between DB2 and VSAM through the CICS transaction manager. This is the cleanest solution for VSAM transactional integrity but requires CICS infrastructure.

Parallel Sysplex and GDPS

  • For disaster recovery considerations, the IBM Parallel Sysplex Operations manual and GDPS (Geographically Dispersed Parallel Sysplex) documentation explain how checkpoint/restart data is replicated to disaster recovery sites and how batch jobs are restarted after a site failover. The restart table approach described in this chapter integrates naturally with GDPS because the restart table is a DB2 table that participates in DB2 data replication.