Key Takeaways — Chapter 24: Checkpoint/Restart Design

Threshold Concept

Checkpoint/restart is not about preventing failure — it is about making failure recovery fast and automatic. The mental shift from "don't let it fail" to "design for fast recovery" is the foundation of resilient batch architecture. Every long-running batch program will eventually fail. The question is whether recovery takes minutes or hours.

Core Principles

  1. Application-level checkpointing is the primary strategy. The z/OS checkpoint/restart facility (CHKPT macro, RD parameter) handles sequential file positions and working storage but cannot coordinate DB2 transactions or VSAM files. Application-level checkpointing with a restart table gives you full control over all resource types.

  2. The restart table and business data must be committed in the same COMMIT. This is the atomicity guarantee that makes checkpoint/restart correct. If they are committed separately, a failure between the two commits creates inconsistency — records appear processed but weren't, or records are skipped on restart.

  3. Commit frequency is an engineering tradeoff, not a guess. Balance four competing concerns: recovery time (a larger commit interval means more records to reprocess after a failure), lock duration (a larger interval holds locks longer), log volume (a larger interval means a larger log per unit of recovery), and commit overhead (a smaller interval spends more CPU on commits). Start with a 5,000-record interval and adjust based on measurement.

  4. Different resource types require different restart strategies. DB2 handles itself via COMMIT/ROLLBACK. VSAM KSDS files support keyed repositioning but may need orphan cleanup. Sequential output files must be regenerated on restart. The restart table is the single source of truth that coordinates all three.

  5. Multi-step jobs require step-level restart coordination. Use cataloged datasets instead of passed datasets. Each step checks its own restart table row. A completion marker dataset proves the entire job finished. The pipeline control table (for complex pipelines) provides cross-step awareness.
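Principles 1 and 2 can be sketched as a checkpoint loop. This is an illustrative Python sketch, not chapter code: sqlite3 stands in for DB2, and all table and column names (restart_table, input_work, output_work, last_key, run_status) are hypothetical.

```python
import sqlite3

def run_batch(conn: sqlite3.Connection, job_name: str,
              commit_frequency: int = 5_000) -> None:
    """Application-level checkpoint loop (sqlite3 stands in for DB2).

    The restart-table row and the business-data rows are updated in the
    SAME unit of work, so a failure leaves both at the last checkpoint:
    no records are skipped or duplicated on restart.
    Assumes the restart_table row for job_name already exists.
    """
    cur = conn.cursor()
    last_key, records_done = cur.execute(
        "SELECT last_key, records_done FROM restart_table "
        "WHERE job_name = ?", (job_name,)).fetchone()

    # One cursor declaration serves both fresh start (last_key = 0)
    # and restart: resume strictly after the last committed key.
    work = cur.execute(
        "SELECT id, amount FROM input_work WHERE id > ? ORDER BY id",
        (last_key,)).fetchall()

    in_interval = 0
    for key, amount in work:
        cur.execute("INSERT INTO output_work (id, amount) VALUES (?, ?)",
                    (key, amount))                 # business data
        records_done += 1                          # preserved accumulator
        in_interval += 1
        if in_interval == commit_frequency:
            cur.execute(                           # restart row, same UR
                "UPDATE restart_table SET last_key = ?, records_done = ? "
                "WHERE job_name = ?", (key, records_done, job_name))
            conn.commit()                          # atomic checkpoint
            in_interval = 0

    if work:
        last_key = work[-1][0]
    cur.execute("UPDATE restart_table SET last_key = ?, records_done = ?, "
                "run_status = 'C' WHERE job_name = ?",
                (last_key, records_done, job_name))
    conn.commit()                                  # completion marker
```

On restart after a failure, re-running the loop resumes from the last committed key; uncommitted work in the failed interval rolled back together with the restart row, so it is simply reprocessed.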

Practical Guidelines

  • Minimum viable checkpointing: Any batch program processing more than 100,000 records or running longer than 30 minutes should have checkpoint/restart.
  • Configurable commit frequency: Read from PARM or a control table, not hardcoded. Default to 5,000. Validate within 100–100,000.
  • Restart key design: Use the same key that orders the input cursor. For composite keys, store all key components. The cursor WHERE clause should handle both fresh start and restart with a single declaration.
  • Counter preservation: Store all accumulators, running totals, and counts in the restart table. End-of-job control totals must be correct regardless of how many times the program restarted.
  • Sequential output: Prefer the regeneration strategy — delete and rewrite from committed data. It is the simplest and most reliable approach.
  • VSAM output: For simple inserts, delete orphans on restart. For accumulator updates, use before-image logging to restore VSAM to checkpoint-consistent state.

Testing Requirements

  • Test at four failure points: Before the first checkpoint, at a checkpoint boundary, between checkpoints, and after the last checkpoint but before completion.
  • Verify data equivalence: Final results after restart must be byte-for-byte identical to a clean run. Compare record counts, control totals, DB2 table contents, and output files.
  • Test multiple restarts: Fail, restart, fail again, restart again. Verify the program handles consecutive restarts correctly.
  • Test the VSAM cleanup: Verify that orphaned VSAM records are properly removed on restart.

Common Mistakes to Avoid

  • Separate COMMIT for restart table and business data. Consequence: data inconsistency on restart (skipped or duplicated records). Fix: a single COMMIT covering both.
  • Hardcoded commit frequency. Consequence: cannot tune without a recompile; no operational flexibility. Fix: read from PARM with validation.
  • Not preserving accumulators in the restart table. Consequence: incorrect end-of-job control totals after restart. Fix: store all accumulators in the restart table.
  • Not handling RUN_STATUS = 'E' on re-submission. Consequence: duplicate processing if the job is accidentally resubmitted. Fix: check the status and reinitialize or skip.
  • Appending to sequential output on restart. Consequence: orphaned records from rolled-back processing in the output. Fix: delete and regenerate from committed data.
  • Not testing restart end-to-end. Consequence: bugs in restart logic discovered in production at 3 AM. Fix: kill the job, restart it, verify every number.

Key Formulas

  • Maximum lock hold time = Commit frequency / Processing rate (records/second)
  • Log volume per UR ≈ Commit frequency × Average row size × 2 (before + after images) + overhead
  • Recovery time on restart ≈ Time to reprocess one commit interval + restart overhead (typically under 1 minute)
  • Number of COMMITs = Total records / Commit frequency
  • Total commit overhead = Number of COMMITs × Cost per COMMIT (roughly 0.1–0.3 ms CPU each)
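Plugged into Python, the formulas read as follows. This is an illustrative sketch; the 0.2 ms per-COMMIT default is simply a mid-range assumption from the figures above.

```python
import math

def max_lock_hold_s(commit_frequency: int, records_per_s: float) -> float:
    # Maximum lock hold time = commit frequency / processing rate
    return commit_frequency / records_per_s

def log_bytes_per_ur(commit_frequency: int, avg_row_bytes: int,
                     overhead_bytes: int = 0) -> int:
    # Before + after images double the logged row data.
    return commit_frequency * avg_row_bytes * 2 + overhead_bytes

def commit_count(total_records: int, commit_frequency: int) -> int:
    # Rounded up: a final partial interval still takes one COMMIT.
    return math.ceil(total_records / commit_frequency)

def total_commit_overhead_ms(n_commits: int,
                             cost_per_commit_ms: float = 0.2) -> float:
    return n_commits * cost_per_commit_ms

# Example: 1,000,000 records at 500 records/second with a 5,000-record
# interval -> locks held at most 10 s, 200 COMMITs, ~40 ms of commit CPU.
```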

Connection to Other Chapters

  • Chapter 4 (Datasets): File positioning concepts — VSAM KSDS keyed access, sequential file I/O — are directly applied in restart repositioning.
  • Chapter 8 (Locking): Commit frequency determines lock duration. Understanding lock escalation thresholds helps you choose a commit frequency that avoids escalation.
  • Chapter 23 (Batch Window): Checkpoint/restart protects the batch window by converting a full rerun (hours) into a partial reprocessing (minutes). This is the most direct way to reduce batch window risk.