Chapter 18 Quiz

Instructions

Select the best answer for each question. Questions are designed to test understanding at the Evaluate/Create level of Bloom's taxonomy — you'll need to analyze recovery scenarios, evaluate architectural trade-offs, and design recovery strategies, not merely recall definitions.


Question 1

A CICS AOR fails with 200 active tasks, 50 of which hold DB2 locks. What happens to those DB2 locks during the 60-second emergency restart?

A) DB2 releases the locks immediately when it detects the CICS thread failure
B) DB2 holds the locks until CICS restarts and resolves each unit of work via the recovery manager
C) DB2 holds the locks for the IRLMRWT timeout period, then releases them automatically
D) The coupling facility detects the failure and coordinates lock release across DB2 members

Answer: B

Explanation: DB2 locks held by CICS transactions remain in place until CICS resolves each UOW, either committing (releasing locks after applying changes) or backing out (releasing locks after undoing changes). DB2 does not unilaterally release locks owned by a connected CICS region; it waits for the coordinator (CICS) to make the commit/backout decision during emergency restart. This is why fast CICS recovery is critical for the entire system: those locked rows are unavailable to all other transactions until resolution. The IRLMRWT timeout applies only to other transactions waiting for those locks, not to the lock holder.


Question 2

What is the PRIMARY purpose of the CICS activity keypoint?

A) To synchronize data between CICS regions in an MRO topology
B) To limit how far back the recovery manager must read the system log during emergency restart
C) To create a checkpoint that enables warm start after a planned shutdown
D) To provide a consistent backup point for the CICS region's VSAM files

Answer: B

Explanation: The activity keypoint is a snapshot of all active UOWs written periodically to the system log. During emergency restart, the recovery manager reads backward from the log end only to the most recent activity keypoint, not to the beginning of the log. This bounds the recovery time. A KEYINTV of 60 seconds means recovery reads at most ~60 seconds of log records. Without activity keypoints, recovery would need to scan the entire log, which in a high-volume region could take minutes instead of seconds.
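As a rough illustration of how a keypoint bounds the scan, the following Python sketch models the system log as an in-memory list and counts how many records a backward read touches. The record layout and the function name records_to_scan are hypothetical, chosen only for this example; this is not how CICS stores or reads its log.

```python
def records_to_scan(log):
    """Return the records a backward read touches: from the end of the log
    back to the most recent keypoint (inclusive), or the whole log if the
    log contains no keypoint."""
    scanned = []
    for record in reversed(log):
        scanned.append(record)
        if record["type"] == "KEYPOINT":
            break
    return scanned


# Hypothetical log, oldest record first: 1000 updates, one keypoint,
# then two records written after the keypoint.
log = (
    [{"type": "UPDATE", "uow": i} for i in range(1000)]
    + [{"type": "KEYPOINT", "active_uows": [998, 999]}]
    + [{"type": "UPDATE", "uow": 999}, {"type": "COMMIT", "uow": 998}]
)

print(len(log))                   # 1003 records in total
print(len(records_to_scan(log)))  # 3 records actually read
```

The keypoint carries the list of active UOWs, which is why the records before it do not need to be read in this simplified model.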


Question 3

Why does CNB set LOGDEFER=NO on all production CICS regions processing financial transactions?

A) LOGDEFER=YES causes higher CPU utilization due to batching overhead
B) LOGDEFER=YES creates a window where committed transactions are not yet on the system log, risking data loss on region failure
C) LOGDEFER=YES is incompatible with two-phase commit
D) LOGDEFER=YES prevents activity keypoints from being written

Answer: B

Explanation: LOGDEFER=YES batches system log writes for performance (typically 2–5% CPU reduction). But this creates a window where a transaction has been committed (from the application's perspective) but the commit record hasn't been physically written to the log stream. If the region fails during this window, the recovery manager has no commit record and backs out the transaction, resulting in data loss for transactions the application believed were committed. For financial transactions where data loss is unacceptable, LOGDEFER=NO ensures every commit record is written synchronously.


Question 4

In a two-phase commit involving CICS, DB2, and MQ, what is the EXACT moment that determines whether a distributed transaction is committed or backed out?

A) When all resource managers respond YES to PREPARE
B) When CICS writes the commit record to its system log
C) When CICS sends the COMMIT signal to all resource managers
D) When the last resource manager acknowledges the COMMIT signal

Answer: B

Explanation: This is the critical distinction in 2PC. Phase 1 (PREPARE) establishes that all participants can commit, but the transaction is not yet committed. The commit record on CICS's system log is the point of no return: once written, the transaction is committed regardless of what happens next. Phase 2 (sending COMMIT to participants) is merely executing the already-made decision. If CICS fails after writing the commit record but before sending COMMIT, emergency restart reads the commit record and sends COMMIT during recovery. The coordinator's log is the source of truth.
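The sequence can be sketched in Python as a toy coordinator. All names here (Participant, run_two_phase_commit, emergency_restart, the crash_after_log flag) are hypothetical and exist only to show that the logged commit record, not the Phase 2 messages, carries the decision; real resource managers and the CICS log are far more involved.

```python
class Participant:
    """Toy resource manager that always votes YES."""

    def __init__(self, name):
        self.name = name
        self.state = "ACTIVE"

    def prepare(self):
        self.state = "PREPARED"  # votes YES and keeps holding its locks
        return True

    def commit(self):
        self.state = "COMMITTED"

    def backout(self):
        self.state = "BACKED_OUT"


def run_two_phase_commit(uow_id, participants, log, crash_after_log=False):
    # Phase 1: ask every participant to prepare.
    if not all(p.prepare() for p in participants):
        for p in participants:
            p.backout()
        return
    # Decision point: the commit record reaches the coordinator's log.
    log.append(("COMMIT", uow_id))
    if crash_after_log:
        return  # simulated coordinator failure before Phase 2
    # Phase 2: carry out the decision that is already on the log.
    for p in participants:
        p.commit()


def emergency_restart(uow_id, participants, log):
    # Recovery consults only the log to learn the decision.
    decided_commit = ("COMMIT", uow_id) in log
    for p in participants:
        if decided_commit:
            p.commit()
        else:
            p.backout()


log = []
participants = [Participant("DB2"), Participant("MQ")]
run_two_phase_commit("UOW1", participants, log, crash_after_log=True)
print([p.state for p in participants])  # ['PREPARED', 'PREPARED']
emergency_restart("UOW1", participants, log)
print([p.state for p in participants])  # ['COMMITTED', 'COMMITTED']
```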


Question 5

An indoubt transaction occurs when:

A) A resource manager responds NO to PREPARE during two-phase commit
B) The coordinator (CICS) fails after all participants respond YES to PREPARE but before the commit record is written
C) The network between CICS and DB2 is temporarily lost during normal processing
D) A CICS transaction exceeds its DTIMOUT (deadlock timeout) value

Answer: B

Explanation: The "indoubt window" is the time between successful PREPARE (all participants said YES) and the commit record write. During this window, participants have agreed to commit but the coordinator hasn't yet recorded its decision. If the coordinator fails, participants are "indoubt": they don't know whether the coordinator decided COMMIT or BACKOUT. They hold their locks and wait for the coordinator to restart and inform them. Answer A describes a normal ROLLBACK (not indoubt). Answer C is a communication failure (different recovery path). Answer D is a deadlock, not an indoubt.


Question 6

RESYNCMEMBER(GROUPRESYNC) on the DB2CONN definition is critical for Sysplex recovery because:

A) It allows CICS to use DB2 group bufferpool flushing during restart
B) It allows CICS to resolve indoubt transactions through any available DB2 data sharing group member, not just the original member
C) It enables DB2 to automatically restart when CICS restarts
D) It synchronizes CICS's system log with DB2's recovery log

Answer: B

Explanation: In a Sysplex, if both CICS and its DB2 member are on the same LPAR and the LPAR fails, CICS cannot resolve indoubt transactions with the failed DB2 member. GROUPRESYNC tells CICS to contact any surviving member of the DB2 data sharing group. Since all members share the same data via the coupling facility, any member can apply the commit or backout decision on behalf of the failed member. Without GROUPRESYNC, indoubt resolution is blocked until the specific failed DB2 member restarts.


Question 7

What is the correct recovery action for a unit of work that has no commit record on the CICS system log after emergency restart?

A) Commit the UOW, because the absence of a backout record means it should be committed
B) Back out the UOW, because without a commit record the transaction was never committed
C) Shunt the UOW for manual resolution
D) Ignore the UOW, because it was a read-only transaction

Answer: B

Explanation: The recovery rule is definitive: no commit record on the log means the transaction was not committed. The recovery manager backs out the UOW by coordinating with all participating resource managers to undo their changes. This is the correct behavior even if the application was "about to commit": the commit record is the sole evidence of commitment. Shunting occurs only when a participant resource manager is unavailable for backout, not as a default action.
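The rule can be written down as a small decision function. This is a simplified sketch, not the recovery manager's actual logic: the function resolve_after_restart, the UOW identifiers, and the RESOLVED/SHUNTED status labels are hypothetical, and the sketch assumes a UOW is shunted whenever any of its participants is unreachable.

```python
def resolve_after_restart(uows, commit_records, available_participants):
    """Map each UOW to (decision, status).

    decision: COMMIT only if a commit record exists, otherwise BACKOUT.
    status:   RESOLVED if every participant is reachable, otherwise
              SHUNTED (the decision stands but is applied later).
    """
    results = {}
    for uow_id, participants in uows.items():
        decision = "COMMIT" if uow_id in commit_records else "BACKOUT"
        if set(participants) <= set(available_participants):
            status = "RESOLVED"
        else:
            status = "SHUNTED"
        results[uow_id] = (decision, status)
    return results


# Hypothetical state found at emergency restart; MQ is still down.
uows = {"U1": ["DB2"], "U2": ["DB2", "MQ"], "U3": ["DB2"]}
commit_records = {"U1"}
available = {"DB2"}

for uow_id, outcome in resolve_after_restart(uows, commit_records, available).items():
    print(uow_id, outcome)
# U1 ('COMMIT', 'RESOLVED')
# U2 ('BACKOUT', 'SHUNTED')
# U3 ('BACKOUT', 'RESOLVED')
```

Note that shunting changes when the decision is applied, never what the decision is.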


Question 8

A shunted UOW at CNB holds a DB2 lock on account 00047291. This account processes 50 transactions per hour. RMRETRY is set to 30 seconds. The MQ queue manager (the unavailable participant) takes 5 minutes to restart. What is the approximate number of transactions that will fail due to lock timeout for this account?

A) 4 transactions (50/hr * 5min/60min)
B) 0 — other transactions will wait, not fail
C) 4 transactions will timeout if IRLMRWT < 5 minutes, fewer if lock wait is enabled
D) 50 transactions — all transactions for the next hour

Answer: C

Explanation: With 50 transactions per hour, approximately 4 transactions will attempt to access account 00047291 during the 5-minute shunt window. Whether they fail (timeout) or succeed (wait) depends on DB2's IRLMRWT setting. If IRLMRWT is less than 5 minutes (which it typically is; common values are 30–60 seconds), those 4 transactions will wait until IRLMRWT expires and then receive a -911 timeout. CICS's RMRETRY at 30 seconds means CICS retries resolution every 30 seconds, so as soon as MQ is available (~5 minutes), the next retry resolves the shunted UOW and releases the lock.
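The arithmetic behind this answer can be checked with a few lines of Python. The helper names are hypothetical, and the times_out function rests on a simplifying assumption that is not stated in the question: a waiter fails only if its IRLMRWT expires before the lock is released at the end of the shunt window.

```python
import math


def expected_arrivals(rate_per_hour, window_minutes):
    """Average number of requests arriving during the window."""
    return rate_per_hour * window_minutes / 60


def resolution_retries(outage_seconds, rmretry_seconds):
    """Number of retry intervals needed to cover the outage."""
    return math.ceil(outage_seconds / rmretry_seconds)


def times_out(arrival_second, window_seconds, irlmrwt_seconds):
    """True if the waiter's timeout expires before the window ends."""
    return arrival_second + irlmrwt_seconds < window_seconds


print(round(expected_arrivals(50, 5), 2))  # 4.17, i.e. about 4 transactions
print(resolution_retries(300, 30))         # 10 retries at 30-second intervals
print(times_out(0, 300, 60))               # True: arrives early, waits 60 s, fails
print(times_out(270, 300, 60))             # False: lock is freed before the timeout
```

Under that assumption, a request that arrives late in the window may get the lock instead of timing out, which is one way the count of failures could come in below 4.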


Question 9

Why is a cold start in production considered a "data integrity event" at CNB?

A) Cold start erases all VSAM files in the region
B) Cold start does not replay the system log, so in-flight transactions are not backed out, leaving resources potentially inconsistent
C) Cold start resets all DB2 connections, causing DB2 to cold start as well
D) Cold start invalidates all CICSPlex SM routing tables

Answer: B

Explanation: A cold start initializes CICS from scratch with an empty system log. No recovery occurs. If there were in-flight transactions modifying DB2 tables or VSAM files when the region failed, those modifications may be partially applied. DB2 tables may have partial updates (debit applied, credit not applied). VSAM files may have uncommitted records. This is why CNB requires a mandatory 4–6 hour data reconciliation procedure after any cold start: every recoverable resource must be verified for consistency.


Question 10

Which z/OS facility handles automatic CICS region restart after an abnormal termination?

A) CICSPlex SM (CPSM)
B) z/OS Automatic Restart Management (ARM)
C) CICS recovery manager (DFHRM)
D) z/OS Workload Manager (WLM)

Answer: B

Explanation: ARM is the z/OS facility that monitors registered elements (like CICS regions) and automatically restarts them after failure. CICSPlex SM operates at a higher level: it manages routing and health monitoring but relies on ARM for the actual restart. The CICS recovery manager operates within the CICS region during emergency restart (log scan, UOW resolution). WLM manages workload dispatching, not restart. ARM answers the question "is it running?"; CPSM answers "is it healthy?"


Question 11

CICSPlex SM's HEALTHCHK feature adds value beyond ARM's basic restart capability because:

A) HEALTHCHK restarts the region faster than ARM
B) HEALTHCHK verifies that the region can actually process transactions (DB2 connected, MQ available, critical files open) before routing work to it
C) HEALTHCHK prevents the region from failing in the first place
D) HEALTHCHK provides SNMP traps that ARM cannot

Answer: B

Explanation: ARM considers a restart "successful" when the CICS address space is running. But a running CICS region is not necessarily a healthy one: the DB2 attachment may not have reconnected, MQ may not have initialized, or a critical VSAM file may not have reopened. HEALTHCHK runs a lightweight transaction that verifies actual processing capability. Only when the health check passes does CPSM add the region back to the routing table. This prevents the scenario where ARM restarts the region, CPSM routes work to it, and transactions immediately fail because DB2 isn't connected.
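The difference between "running" and "healthy" can be shown as a simple routing gate. This is an illustrative sketch only: the dictionary keys and the function eligible_for_routing are hypothetical and do not correspond to any CICSPlex SM interface.

```python
def eligible_for_routing(region):
    """A region receives work only if it is up AND every dependency
    it needs to process transactions is ready."""
    dependencies_ready = all((
        region["db2_connected"],
        region["mq_available"],
        region["critical_files_open"],
    ))
    return region["address_space_up"] and dependencies_ready


# Region just restarted: the address space is up, DB2 has not reconnected.
restarted = {
    "address_space_up": True,
    "db2_connected": False,
    "mq_available": True,
    "critical_files_open": True,
}
print(eligible_for_routing(restarted))  # False: running, but not healthy

restarted["db2_connected"] = True
print(eligible_for_routing(restarted))  # True: safe to route work
```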


Question 12

What architectural decision does CNB make to ensure each LPAR can handle 100% of the transaction workload during a failover?

A) Running each LPAR at approximately 40% CPU capacity under normal conditions
B) Using CICS MAXTASK values that are double the normal requirement
C) Configuring DB2 with double the buffer pool allocation
D) Maintaining a hot standby LPAR with no normal workload

Answer: A

Explanation: In the active-active cross-LPAR pattern, each LPAR must be sized to handle 100% of the workload alone. CNB achieves this by running each LPAR at ~40% capacity under normal conditions, leaving 60% headroom for failover. This is not waste; it's the cost of true HA. If each LPAR ran at 80% capacity, a failover would push the surviving LPAR to 160%, causing cascading performance degradation. The 40% target is a deliberate architectural decision that trades cost efficiency for recovery capability.
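The capacity figures follow from a one-line calculation, sketched here in Python. The function name is hypothetical, and the default of two LPARs is an assumption inferred from the 80%-to-160% figure in the explanation.

```python
def surviving_utilization(per_lpar_utilization, lpar_count=2):
    """Utilization of each surviving LPAR after one LPAR fails,
    assuming the total workload is spread evenly over the survivors."""
    total_workload = per_lpar_utilization * lpar_count
    return total_workload / (lpar_count - 1)


for normal in (0.40, 0.80):
    after = surviving_utilization(normal)
    print(f"{normal:.0%} normal -> {after:.0%} after failover")
# 40% normal -> 80% after failover
# 80% normal -> 160% after failover
```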


Question 13

An idempotent transaction design prevents which specific recovery problem?

A) Indoubt transactions
B) Shunted units of work
C) Duplicate processing when a transaction is retried after a communication failure
D) DB2 lock contention during emergency restart

Answer: C

Explanation: Idempotency addresses the end-to-end recovery gap: the scenario where CICS processes a transaction successfully, but the response is lost (network failure, TOR failure, client timeout). The client retries, and without idempotency, the transaction is processed twice: a duplicate debit, a duplicate payment, a duplicate order. Idempotent design uses a unique request ID to detect duplicates and return the previous result. Indoubt transactions, shunted UOWs, and lock contention are platform-level recovery concerns that idempotency does not address.


Question 14

Why should retry logic include exponential backoff rather than immediate retry?

A) Immediate retry violates CICS's TRANCLASS limits
B) Immediate retry on a contention-caused failure adds to the contention, potentially making the problem worse
C) CICS's dispatcher does not allow immediate task restarts
D) Exponential backoff is required by the XA protocol specification

Answer: B

Explanation: If a failure is caused by resource contention (DB2 lock timeout, MQ queue full, CICS storage shortage), immediate retry adds another contender for the same resource. With thousands of transactions retrying immediately, the contention amplifies and can cascade into a system-wide performance degradation. Exponential backoff (500ms, 1s, 2s, 4s) gives the underlying contention time to resolve. The first retry at 500ms handles transient issues; later retries with longer delays handle more persistent conditions.
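A minimal Python sketch of the 500ms, 1s, 2s, 4s schedule follows. The names backoff_delays, retry_with_backoff, and TransientError are hypothetical; the sleep function is passed in as a parameter so the example runs instantly, and a production version would usually add jitter, which the explanation above does not mention.

```python
import time


class TransientError(Exception):
    """Stand-in for a contention failure such as a lock timeout."""


def backoff_delays(base_seconds=0.5, factor=2.0, retries=4):
    """Delay before each retry: base, base*factor, base*factor^2, ..."""
    return [base_seconds * factor ** n for n in range(retries)]


def retry_with_backoff(operation, delays, sleep=time.sleep):
    """Try once, then retry after each delay; the last failure propagates."""
    for delay in delays:
        try:
            return operation()
        except TransientError:
            sleep(delay)
    return operation()


print(backoff_delays())  # [0.5, 1.0, 2.0, 4.0]

# Usage: an operation that fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise TransientError("lock timeout")
    return "ok"


waited = []
print(retry_with_backoff(flaky, backoff_delays(), sleep=waited.append))  # ok
print(waited)  # [0.5, 1.0]
```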


Question 15

In the Pinnacle Health incident described in section 18.4, what was the root cause of the 7-minute service impact?

A) The CICS AOR took 7 minutes to complete emergency restart
B) Two shunted UOWs held DB2 locks on member eligibility records while waiting for MQ recovery
C) The DB2 data sharing group was unavailable for 7 minutes
D) CICSPlex SM took 7 minutes to update its routing table

Answer: B

Explanation: The AOR restarted within 90 seconds. Of 23 indoubt transactions, 21 resolved automatically. Two could not resolve because MQ was still restarting. Those two UOWs were shunted (set aside for later resolution), but they held DB2 locks on two member eligibility records. For 7 minutes (until MQ completed its recovery and CICS's retry resolved the shunted UOWs), any eligibility verification for those two members timed out. The fix: RESYNCMEMBER for MQ connections and reducing RMRETRY from 300 seconds to 30 seconds.


Question 16

A CICS transaction processes a funds transfer in two units of work. UOW1 (debit + audit record) commits successfully. UOW2 (credit + notification) fails. Why can't the recovery manager automatically fix this?

A) The recovery manager cannot coordinate across multiple UOWs
B) UOW1 has already committed — its changes are permanent and cannot be backed out by the recovery manager
C) Two-unit-of-work transactions are not supported by CICS
D) The recovery manager only handles DB2, not MQ

Answer: B

Explanation: Once UOW1 commits (SYNCPOINT), its changes are permanent. The CICS recovery manager can only back out uncommitted work. UOW1's debit is committed and irrevocable. When UOW2 fails, the recovery manager backs out UOW2's uncommitted changes, but UOW1's debit stands. This creates a data inconsistency (money debited but not credited) that requires a compensating transaction to reverse. This is why architects minimize the number of commit points in a business process: fewer commits mean fewer opportunities for partial completion.


Question 17

You are designing the recovery architecture for a CICS AOR. Which combination of settings provides the fastest automatic recovery from a region failure?

A) START=COLD, no ARM policy, RMRETRY=300, LOGDEFER=YES
B) START=AUTO, ARM with RESTART_ATTEMPTS(3), RMRETRY=30, KEYINTV=60, system log on coupling facility
C) START=WARM, ARM with RESTART_ATTEMPTS(1), RMRETRY=60, KEYINTV=300
D) START=AUTO, no ARM policy, RMRETRY=30, KEYINTV=30, system log on DASD

Answer: B

Explanation: B provides: START=AUTO (attempts warm start, falls back to emergency; correct for abnormal termination recovery); ARM with 3 attempts (automatic restart with loop protection); RMRETRY=30 (aggressive retry for shunted UOW resolution); KEYINTV=60 (limits log scan to ~60 seconds of records); and system log on coupling facility (survives LPAR failure, enables emergency restart). Answer A uses cold start (no recovery) and deferred logging (data loss risk). Answer C uses warm start (only valid after planned shutdown) and long keypoint interval. Answer D uses DASD logging (cannot survive LPAR failure).


Question 18

Marcus Whitfield at FBA describes a seventh failure scenario for CICS-IMS recovery that has never been tested. Why is this untested scenario concerning?

A) IMS recovery is fundamentally incompatible with CICS TS 5.6
B) Complex multi-subsystem failures can interact in ways that documented procedures don't cover, and the judgment to handle unexpected variations resides in Marcus's experience, which will be lost when he retires
C) FBA's test environment cannot simulate IMS failures
D) IBM has confirmed that the seventh scenario is impossible in practice

Answer: B

Explanation: This connects to the "Knowledge is retiring" theme. Marcus has 30 years of IMS recovery experience, a combination of documented procedures and undocumented judgment calls accumulated through production incidents. The untested seventh scenario (simultaneous CICS and IMS failure across LPARs with a coupling facility transient error) is concerning not because the technology can't handle it, but because the human expertise to diagnose and resolve unexpected complications during recovery lives in Marcus's head. When he retires, that expertise goes with him unless it's captured in test cases, runbooks, and training.


Question 19

SecureFirst's mobile banking API uses idempotency keys to handle retry scenarios. Why is the idempotency check (SELECT from TRANSFER_AUDIT) placed in the same unit of work as the business logic (account updates)?

A) To improve performance by reducing the number of DB2 round-trips
B) To ensure that the duplicate check and the audit insert commit atomically, preventing orphaned audit records or unrecorded transfers
C) Because CICS requires all SQL statements to be in the same unit of work
D) To avoid holding DB2 locks across multiple syncpoints

Answer: B

Explanation: If the duplicate check and the audit insert were in separate units of work, a failure between them could create inconsistency: an audit record without the corresponding balance update (or vice versa). By placing the SELECT check, the balance updates, and the audit INSERT in the same UOW, they all commit or all back out together. If the UOW is backed out and retried, the duplicate check finds no audit record (it was backed out too) and processes normally. This is the foundation of correct idempotent design.


Question 20

You are evaluating whether to add a VSAM audit journal to the 2PC scope of a high-volume transaction that currently involves only DB2. The VSAM journal is in a FOR connected via MRO. What is the MOST important architectural trade-off to evaluate?

A) The VSAM journal's disk I/O throughput vs. the DB2 audit table's disk I/O throughput
B) The increase in syncpoint overhead (from single-phase to multi-phase 2PC with MRO round-trip) vs. the recovery benefit of the VSAM journal
C) The FOR's MAXTASK limit vs. the AOR's MAXTASK limit
D) The VSAM file's CI size vs. the DB2 table's page size

Answer: B

Explanation: Adding a VSAM journal in a FOR to the 2PC scope changes the syncpoint from single-phase (DB2 only, ~0.1ms) to multi-phase (DB2 + VSAM via MRO, ~1.0ms). At high volume, this 10x increase in syncpoint cost translates to significant CPU consumption. The architect must evaluate whether the VSAM journal provides recovery or audit value that cannot be achieved through other means (DB2 audit table, asynchronous journal write after the 2PC, MQ-based audit feed). Often, the audit requirement can be met without including the journal in the 2PC scope, avoiding the performance penalty while maintaining the audit trail.
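To get a feel for the scale, the per-syncpoint figures can be multiplied out over an hour. The transaction rate of 1,000 per second is an assumed value chosen for illustration, as is the simplification that each transaction takes exactly one syncpoint and that the quoted milliseconds can be summed as cost; only the ~0.1ms and ~1.0ms figures come from the explanation above.

```python
def syncpoint_seconds_per_hour(transactions_per_second, cost_ms):
    """Total syncpoint cost accumulated in one hour, in seconds,
    assuming one syncpoint per transaction."""
    syncpoints_per_hour = transactions_per_second * 3600
    return syncpoints_per_hour * cost_ms / 1000


ASSUMED_TPS = 1000  # hypothetical volume, not taken from the chapter

single_phase = syncpoint_seconds_per_hour(ASSUMED_TPS, 0.1)
with_journal = syncpoint_seconds_per_hour(ASSUMED_TPS, 1.0)

print(f"DB2 only:         {single_phase:.0f} s per hour")
print(f"DB2 + VSAM (MRO): {with_journal:.0f} s per hour")
print(f"Ratio:            {with_journal / single_phase:.0f}x")
```

At the assumed rate this works out to roughly 6 minutes versus a full hour of syncpoint cost in every hour of operation, which is the kind of difference the architect has to weigh against the journal's recovery value.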