Quiz — Chapter 24: Checkpoint/Restart Design
Question 1
What is the primary purpose of checkpoint/restart in batch processing?
A) To prevent batch programs from failing
B) To make failure recovery fast and automatic
C) To improve the CPU performance of batch programs
D) To reduce the storage requirements of batch datasets
Answer: B
Checkpoint/restart does not prevent failures; it makes recovery from failures fast and automatic. The threshold concept of this chapter is the mental shift from "don't let it fail" to "design for fast recovery."
Question 2
In the restart table pattern, why must the restart table UPDATE and the business data updates be committed in the same COMMIT?
A) To reduce the number of COMMIT operations for performance
B) To ensure the restart table and business data are always consistent
C) Because DB2 only allows one COMMIT per program execution
D) To prevent lock escalation on the restart table
Answer: B
If the restart table and business data are committed separately, a failure between the two commits creates an inconsistency: the restart table may say records were processed that were actually rolled back, or vice versa. A single COMMIT ensures atomicity.
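The single-COMMIT rule can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for DB2, and the table names (account, restart_ctl), column names, and job name 'DEMO' are hypothetical. It is only meant to show that a failure before the COMMIT leaves the restart row and the business rows in step with each other.

```python
import sqlite3

def open_demo_db():
    """Create an in-memory database with one business table and one restart table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute(
        "CREATE TABLE restart_ctl (job TEXT PRIMARY KEY, last_key INTEGER, rec_count INTEGER)"
    )
    conn.executemany("INSERT INTO account VALUES (?, 0)", [(i,) for i in range(1, 7)])
    conn.execute("INSERT INTO restart_ctl VALUES ('DEMO', 0, 0)")
    conn.commit()
    return conn

def process_interval(conn, keys, fail_before_commit=False):
    """Apply one commit interval: business updates plus the restart row, then one COMMIT."""
    for key in keys:
        conn.execute("UPDATE account SET balance = balance + 10 WHERE id = ?", (key,))
    conn.execute(
        "UPDATE restart_ctl SET last_key = ?, rec_count = rec_count + ? WHERE job = 'DEMO'",
        (keys[-1], len(keys)),
    )
    if fail_before_commit:
        conn.rollback()  # stands in for the rollback that follows an abend
        return False
    conn.commit()  # business data and restart position become durable together
    return True

def snapshot(conn):
    """Return (last_key, rec_count, number of account rows actually updated)."""
    last_key, rec_count = conn.execute(
        "SELECT last_key, rec_count FROM restart_ctl WHERE job = 'DEMO'"
    ).fetchone()
    updated = conn.execute("SELECT COUNT(*) FROM account WHERE balance > 0").fetchone()[0]
    return last_key, rec_count, updated
```

After a successful interval over keys 1 to 3 and a failed interval over keys 4 to 6, snapshot() still reports key 3 as the restart position and exactly three updated rows, so the restart table never claims work that was rolled back.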
Question 3
What does the RD parameter value RD=R on an EXEC statement specify?
A) Restart only, no checkpoints
B) Read-only mode for the step
C) Both automatic restart and checkpoints are allowed
D) Recovery mode with deferred restart
Answer: C
RD=R allows both checkpoint operations and automatic restart for the step. RD=RNC allows restart but suppresses checkpoints. RD=NR allows checkpoints but suppresses restart. RD=NC suppresses both.
Question 4
A batch program processes 6 million records in 3 hours with a commit frequency of 5,000 records. The program fails at record 4,500,000. Approximately how long will recovery take after restart?
A) 3 hours (complete rerun)
B) 1.5 hours (from the midpoint)
C) Less than 5 minutes (from the last checkpoint)
D) Zero (no reprocessing needed)
Answer: C
With a commit frequency of 5,000, the last checkpoint was at most 5,000 records before the failure point. Reprocessing 5,000 records out of 6 million at the same processing rate (2 million records/hour) takes about 9 seconds. Including restart overhead, total recovery is well under 5 minutes.
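The 9-second figure can be checked with a few lines of arithmetic. This is a minimal sketch with a hypothetical function name; it assumes a constant processing rate and ignores restart overhead such as job resubmission and file repositioning.

```python
def worst_case_reprocess_seconds(total_records, run_hours, commit_frequency):
    """Seconds needed to redo at most one commit interval at the job's average rate."""
    run_seconds = run_hours * 3600
    # Equivalent to commit_frequency / (total_records / run_seconds).
    return commit_frequency * run_seconds / total_records

seconds = worst_case_reprocess_seconds(
    total_records=6_000_000, run_hours=3, commit_frequency=5_000
)
print(f"Worst-case reprocessing time: {seconds:.1f} seconds")
```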
Question 5
Which of the following resources does the z/OS checkpoint/restart facility (CHKPT macro) NOT save?
A) Working storage contents
B) Sequential file positions
C) DB2 transaction state
D) QSAM buffer positions
Answer: C
The z/OS checkpoint/restart facility saves working storage and sequential file positions, but it does not coordinate with DB2 transactions. DB2 manages its own commit/rollback independently.
Question 6
What is a unit of recovery (UR) in DB2?
A) The entire batch program execution from start to finish
B) The set of database changes between two consecutive COMMITs
C) A single SQL statement and its effects
D) The data written to one checkpoint dataset
Answer: B
A unit of recovery is the set of all database changes made between two consecutive COMMIT operations (or between program start and the first COMMIT). If a failure occurs, the current UR is rolled back.
Question 7
A program updates rows averaging 300 bytes each. The DB2 active log is 1 GB. What is the approximate maximum number of rows that can be updated in a single unit of recovery before risking active log exhaustion?
A) 100,000
B) 500,000
C) 1,500,000
D) 5,000,000
Answer: C
Each row update generates approximately 300 bytes of before-image + 300 bytes of after-image + overhead, roughly 700 bytes of log data per row. 1 GB / 700 bytes = approximately 1.5 million rows. In practice, you would want to stay well below this to leave room for other concurrent work.
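The log estimate above can be reproduced as follows. The function name is hypothetical, the 100-byte per-row overhead is simply the figure that makes the total come to about 700 bytes, and 1 GB is taken as 2^30 bytes. Real log record sizes vary with the table and the type of update.

```python
def max_rows_per_unit_of_recovery(active_log_bytes, row_bytes, overhead_bytes=100):
    """Rough upper bound on rows updated in one UR before the active log fills."""
    # Before-image plus after-image plus per-row overhead (an assumed figure).
    log_bytes_per_row = 2 * row_bytes + overhead_bytes
    return active_log_bytes // log_bytes_per_row

rows = max_rows_per_unit_of_recovery(active_log_bytes=2**30, row_bytes=300)
print(f"Approximate ceiling: {rows:,} rows")
```

The result is a ceiling for a log used by this program alone, which is why a practical commit interval sits far below it.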
Question 8
Which approach is recommended for handling sequential output files on restart?
A) Append new records to the existing partial output file
B) Use byte-position tracking to continue writing from the exact failure point
C) Delete and regenerate the output file from committed data
D) Leave the partial file and create a supplementary file for the remaining records
Answer: C
Regenerating the output file from committed data is the safest and simplest approach. Appending creates orphaned records from the rolled-back portion. Byte-position tracking is fragile. Supplementary files complicate downstream processing.
Question 9
What is the SYSCKEOV DD statement used for?
A) System checkpoint at end of volume for multi-volume datasets
B) Sequential checkpoint error override
C) System checkpoint at end of every record
D) Checkpoint verification at end of job
Answer: A
The SYSCKEOV DD statement defines the checkpoint dataset that receives the system-level checkpoints taken at each end-of-volume point of a multi-volume sequential dataset. The checkpoints themselves are requested with CHKPT=EOV on the DD statement of the dataset being processed. This was particularly useful in the tape era with multi-reel datasets.
Question 10
A batch program commits every 10,000 records and processes 400 records per second. What is the maximum duration that any single row lock is held?
A) 2.5 seconds
B) 10 seconds
C) 25 seconds
D) 40 seconds
Answer: C
Lock duration = commit interval / processing rate = 10,000 / 400 = 25 seconds. A row locked at the beginning of a unit of recovery is held until the next COMMIT, which occurs after 10,000 records at 400 records/second = 25 seconds.
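The same formula shows how lock duration scales with the commit interval. This is a small sketch with a hypothetical function name, assuming a steady processing rate and locks that are held until COMMIT.

```python
def max_lock_hold_seconds(commit_interval, records_per_second):
    """Longest time a lock taken at the start of a unit of recovery is held."""
    return commit_interval / records_per_second

# Compare a few commit intervals at the quiz's rate of 400 records/second.
hold_times = {
    interval: max_lock_hold_seconds(interval, 400)
    for interval in (1_000, 5_000, 10_000)
}
for interval, seconds in hold_times.items():
    print(f"commit every {interval:>6,} records -> locks held up to {seconds} s")
```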
Question 11
Why should you avoid using passed datasets (DISP=(NEW,PASS)) in jobs that require step-level restart?
A) Passed datasets consume too much storage
B) Passed datasets exist only for the duration of the original job, so they don't exist on restart
C) Passed datasets cannot be read by DB2
D) Passed datasets cause JCL errors on restart
Answer: B
Passed datasets exist only for the duration of the job. On restart, if the passing step is skipped, the dataset no longer exists, and the receiving step fails. Cataloged datasets persist across job restarts.
Question 12
In a VSAM KSDS output file, what problem occurs when a program fails between checkpoints?
A) The VSAM file becomes corrupted and unreadable
B) Records written since the last checkpoint exist in VSAM but the corresponding DB2 changes were rolled back
C) The VSAM file automatically rolls back to the last checkpoint
D) All records in the VSAM file are deleted
Answer: B
VSAM does not participate in DB2 COMMIT/ROLLBACK. Records written to VSAM between the last checkpoint and the failure are physically present, but the DB2 changes from the same period were rolled back. These are orphaned records that must be cleaned up on restart.
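One possible cleanup strategy is sketched below, using a Python dict as a stand-in for the KSDS. It assumes that output records are written in ascending key order and that the restart table holds the last committed key; neither assumption comes from the quiz, and the function name is hypothetical.

```python
def remove_orphans(ksds, last_committed_key):
    """Delete records written after the last checkpoint; return the keys removed."""
    # Assumes keys were written in ascending order, so anything above the
    # last committed key belongs to the rolled-back unit of recovery.
    orphan_keys = sorted(key for key in ksds if key > last_committed_key)
    for key in orphan_keys:
        del ksds[key]
    return orphan_keys

# Simulated file contents at restart: keys 1-7 present, checkpoint taken at key 5.
ksds = {key: f"record-{key}" for key in range(1, 8)}
removed = remove_orphans(ksds, last_committed_key=5)
print("Removed orphaned keys:", removed)
```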
Question 13
What should a program do if it reads the restart table at startup and finds RUN_STATUS = 'E'?
A) Immediately abend with an error message
B) Roll back the previous run's data and start over
C) Treat it as a fresh start; the previous run completed successfully
D) Resume from the last checkpoint position
Answer: C
RUN_STATUS = 'E' means the previous run ended successfully. The program should treat this as a fresh start: reinitialize all counters and process from the beginning.
Question 14
What is the recommended range for commit frequency in most batch programs?
A) 10–100 records
B) 100–500 records
C) 1,000–10,000 records
D) 100,000–1,000,000 records
Answer: C
A commit frequency between 1,000 and 10,000 provides a good balance among recovery time, lock duration, log volume, and commit overhead. The specific value depends on the program's characteristics, but 5,000 is a common starting point.
Question 15
Which of the following is NOT a valid reason to make the commit frequency configurable rather than hardcoded?
A) Different environments (dev/QA/prod) may need different values
B) Operations can adjust it during batch window pressure without recompiling
C) It allows the program to automatically increase commit frequency when locks are detected
D) Tuning can be done through JCL changes rather than program changes
Answer: C
Making commit frequency configurable allows environment-specific tuning and operational flexibility without recompilation. However, the program does not typically auto-adjust commit frequency at runtime based on lock detection; that would add significant complexity and is not a standard practice.
Question 16
In a multi-step job, what is the purpose of a "completion marker" dataset created in the final step?
A) To trigger automatic job scheduling for the next day
B) To prove the job completed all steps successfully, preventing downstream jobs from running on incomplete data
C) To store checkpoint information for all steps
D) To signal the operator that the batch window can close
Answer: B
The completion marker is a zero-length dataset created only when all steps complete successfully. Downstream jobs check for its existence before starting, ensuring they don't process incomplete data from a partially completed job.
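The completion-marker idea can be sketched with ordinary files standing in for datasets. The function names and the marker name NIGHTLY.COMPLETE are hypothetical, and the sketch assumes that only return code 0 counts as success; a real job might accept other codes.

```python
import tempfile
from pathlib import Path

def write_completion_marker(marker_path, step_return_codes):
    """Create a zero-length marker only if every step ended with return code 0."""
    if all(rc == 0 for rc in step_return_codes):
        Path(marker_path).touch()
        return True
    return False

def downstream_may_start(marker_path):
    """Downstream jobs check for the marker before reading the job's output."""
    return Path(marker_path).exists()

def demo():
    """Run one failed attempt and one clean attempt against a temporary directory."""
    with tempfile.TemporaryDirectory() as workdir:
        marker = Path(workdir) / "NIGHTLY.COMPLETE"
        failed_run = write_completion_marker(marker, [0, 8, 0])
        blocked = downstream_may_start(marker)
        clean_run = write_completion_marker(marker, [0, 0, 0])
        released = downstream_may_start(marker)
        size = marker.stat().st_size
    return failed_run, blocked, clean_run, released, size
```

In the demo, the run with a non-zero step leaves no marker, so the downstream check fails until a clean run creates the empty file.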
Question 17
A program's restart table shows RUN_STATUS = 'S' (started) after a failure. What does this indicate?
A) The program completed successfully
B) The program failed before taking its first checkpoint
C) The program failed during the checkpoint write
D) The restart table is corrupted
Answer: B
RUN_STATUS = 'S' is set during initialization. It changes to 'C' only when the first checkpoint is taken. If the program fails before the first checkpoint, the status remains 'S'. On restart, the program should start from the beginning since no checkpoint data exists.
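The startup decision implied by the three status values ('S', 'C', 'E') can be sketched as a small function. The function name and the returned labels are hypothetical, and treating any other status as an error is an assumption made for the sketch.

```python
def decide_start(run_status, last_key):
    """Map the restart table's RUN_STATUS to a startup action and start key."""
    if run_status == "C":
        # At least one checkpoint was committed: resume after the saved key.
        return ("RESUME", last_key)
    if run_status == "S":
        # Failed before the first checkpoint: nothing was committed, start over.
        return ("FRESH", None)
    if run_status == "E":
        # Previous run ended successfully: this is a new run.
        return ("FRESH", None)
    raise ValueError(f"Unexpected RUN_STATUS: {run_status!r}")

print(decide_start("C", "ACCT004500"))
print(decide_start("S", None))
print(decide_start("E", "ACCT999999"))
```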
Question 18
When designing checkpoint/restart for a program that reads from DB2 using a cursor, how should the cursor support restart?
A) Close and reopen the cursor on restart
B) Use a WHERE clause that filters rows after the restart key value
C) Skip rows by fetching and discarding until reaching the restart position
D) Use a scrollable cursor to position directly to the restart row
Answer: B
The cursor's WHERE clause should include a condition like WHERE key_column > :restart-key OR :restart-key = ' '. This efficiently positions the cursor after the last processed record on restart, and returns all rows on fresh start.
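The restartable WHERE clause can be tried out with Python's built-in sqlite3 module standing in for DB2. The table name work_items and the sample keys are hypothetical, and the ORDER BY is an addition made for the sketch: a restart key only identifies "everything already processed" if rows are fetched in key order.

```python
import sqlite3

# A single blank restart key means "fresh start", mirroring the predicate above.
RESTARTABLE_SELECT = """
    SELECT key_column
    FROM work_items
    WHERE key_column > :restart_key OR :restart_key = ' '
    ORDER BY key_column
"""

def build_demo():
    """Create an in-memory table with four keyed rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE work_items (key_column TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO work_items VALUES (?)",
        [("A001",), ("A002",), ("A003",), ("A004",)],
    )
    conn.commit()
    return conn

def fetch_keys(conn, restart_key=" "):
    """Return the keys the cursor would deliver for a given restart key."""
    rows = conn.execute(RESTARTABLE_SELECT, {"restart_key": restart_key})
    return [row[0] for row in rows]
```

With the blank key the query returns all four rows; with restart key 'A002' it returns only 'A003' and 'A004', so the same statement serves both a fresh start and a restart.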
Question 19
During checkpoint/restart testing, what is the minimum set of failure points that must be tested?
A) Only at exact checkpoint boundaries
B) At the beginning and end of the program
C) Before the first checkpoint, at a checkpoint, between checkpoints, and after the last checkpoint before completion
D) At every 100th record
Answer: C
Testing must cover all four critical zones: before any checkpoint is taken, exactly at a checkpoint boundary, between two checkpoints (the most common failure point), and after the last checkpoint but before program completion. Each zone exercises different restart logic paths.
Question 20
A batch program with checkpoint/restart processes 10 million records with a commit frequency of 5,000. The program has been restarted 12 times over 3 years with an average recovery time of 3 minutes each. What is the total time saved compared to a program without checkpoint/restart that would require complete reruns? Assume the full job runs in 4 hours.
A) 36 minutes
B) 4 hours
C) 47 hours and 24 minutes
D) 48 hours
Answer: C
Without checkpoint/restart, each failure requires a 4-hour rerun. With checkpoint/restart, each failure requires approximately 3 minutes. Time saved per failure: 4 hours - 3 minutes = 3 hours 57 minutes. Over 12 failures: 12 × 3 hours 57 minutes = 47 hours 24 minutes.
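The total can be verified by working in minutes. This is a minimal sketch with hypothetical function names; it assumes every failure would otherwise cost one full rerun.

```python
def time_saved_minutes(failures, full_run_minutes, recovery_minutes):
    """Total minutes saved when each failure costs a short recovery, not a full rerun."""
    return failures * (full_run_minutes - recovery_minutes)

def as_hours_minutes(total_minutes):
    """Split a minute count into (hours, minutes)."""
    return divmod(total_minutes, 60)

saved = time_saved_minutes(failures=12, full_run_minutes=4 * 60, recovery_minutes=3)
hours, minutes = as_hours_minutes(saved)
print(f"Time saved: {hours} hours {minutes} minutes")
```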