Chapter 18 Exercises

DataField.Dev

Section 18.1 — Failure Taxonomy

Exercise 1: Failure Classification

Classify each of the following scenarios into the appropriate failure category (Transaction, Task-Level, Region, System, or Sysplex-Wide). For each, describe the recovery mechanism and the expected recovery time.

a) A COBOL program executing in an AOR receives an ASRA abend due to a subscript out of range
b) A CICS task enters an infinite loop consuming 100% of a TCB
c) A storage overlay in a COBOL program corrupts CICS kernel storage, causing an S0C4 in the CICS dispatcher
d) The z/OS LPAR loses connectivity to the coupling facility
e) A DB2 -911 SQLCODE causes a CICS AEYB condition in a funds transfer transaction
f) An operator issues CANCEL on the CICS address space during a maintenance window
g) The coupling facility loses power, affecting all LPARs in the Sysplex

Exercise 2: Failure Impact Analysis

CNB's CNBAORA1 fails at 14:22 on a Tuesday afternoon. At the moment of failure, the region has:
- 847 active tasks
- 312 tasks holding DB2 locks
- 45 tasks with active MRO sessions to CNBFORA1
- 18 tasks with pending MQ puts

For each category of active work, describe:
a) What happens to the work at the moment of failure
b) What resources are locked/held and by whom
c) The sequence of recovery events
d) The impact on other regions and subsystems

Exercise 3: Recovery Cost Calculation

Calculate the business impact of a 90-second CICS AOR outage at CNB, given:
- The AOR processes 3,200 TPS under normal conditions
- Each failed transaction has a 15% probability of requiring manual reconciliation
- Manual reconciliation costs $4.50 per transaction in staff time
- Customer-facing impact costs an estimated $0.02 per transaction in goodwill loss
- The AOR pair (CNBAORA1 + CNBAORA2) means CNBAORA2 absorbs 100% during recovery

a) How many transactions are directly affected (in-flight at failure time)?
b) How many transactions are delayed (during the 90-second recovery window)?
c) What is the estimated monetary cost of the outage?
d) Why does Kwame argue that the indirect cost (delayed transactions on CNBAORA2 due to increased load) exceeds the direct cost?

Section 18.2 — Recovery Architecture

Exercise 4: System Log Configuration

You are configuring the CICS system log for a new production AOR that will process 2,000 TPS with an average of 3 log records per transaction.

a) Calculate the log record volume per second and per minute
b) Determine an appropriate KEYINTV value. Show your reasoning based on the trade-off between recovery time and log I/O overhead
c) Explain why LOGDEFER=NO is required for financial transactions. Under what circumstances (if any) would LOGDEFER=YES be acceptable?
d) Write the z/OS Logger DEFINE LOGSTREAM statement for a coupling-facility-resident log stream with appropriate sizing
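
Part (a) is plain arithmetic on the two figures given in the exercise. A minimal Python sketch for checking your numbers, assuming the stated 2,000 TPS and 3 log records per transaction hold as steady-state averages (the variable names are illustrative, not CICS parameters):

```python
# Figures from the exercise statement, treated as steady-state averages.
TPS = 2_000              # transactions per second
RECORDS_PER_TXN = 3      # average log records per transaction

records_per_second = TPS * RECORDS_PER_TXN
records_per_minute = records_per_second * 60

print(records_per_second)  # 6000
print(records_per_minute)  # 360000
```

Peak-hour volume will be higher than this average, which matters when you size the log stream in part (d).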

Exercise 5: Recovery Manager Coordination

Draw a sequence diagram showing the recovery manager's interactions during emergency restart when the region had 5 in-flight UOWs at failure time:
- UOW1: DB2-only, commit record found on log
- UOW2: DB2 + VSAM, no commit record
- UOW3: DB2 + MQ, PREPARE complete, no commit record (indoubt)
- UOW4: DB2 + MRO to FOR, in progress (no PREPARE yet)
- UOW5: DB2-only, no commit record

For each UOW, show the recovery manager's classification, the recovery action, and the messages exchanged with each resource manager.

Exercise 6: Recoverable Resource Decision Matrix

Pinnacle Health Insurance has the following resources in their CICS claims processing region:

Resource	Type	Usage
CLAIM-MASTER	VSAM KSDS	Permanent claim records
CLAIM-WORK	VSAM KSDS	In-progress claim work area
ELIG-CACHE	VSAM RRDS	Cached eligibility data (rebuilt nightly from DB2)
AUDIT-LOG	VSAM ESDS	Regulatory audit trail
SCRATCH-Q	TS queue	Temporary calculation workspace
CLAIM-NOTIFY	TD queue	Claim status notification messages

For each resource, recommend whether it should be defined as RECOVERABLE or NOT RECOVERABLE. Justify each decision in terms of data integrity risk, recovery overhead, and operational impact.

Exercise 7: Journal Design

Design the journal record structure for a funds transfer transaction at CNB. Your journal record must support:
- Forward recovery (rebuilding the accounts table from a DB2 backup + journal)
- Regulatory audit (who transferred what, when, from where)
- Reconciliation (detecting and resolving discrepancies between accounts and the journal)

Include: the COBOL record layout, the EXEC CICS WRITE JOURNALNAME call, and an explanation of how forward recovery would use these records.

Section 18.3 — XA and Two-Phase Commit

Exercise 8: Two-Phase Commit Scenario Analysis

A CICS transaction involves three resource managers: DB2 (account balance update), MQ (notification message), and VSAM RLS (audit journal). Trace the two-phase commit protocol for each of the following scenarios:

a) Normal commit: all participants say YES to PREPARE
b) DB2 says NO to PREPARE (lock timeout on the account row)
c) All participants say YES to PREPARE, but the CICS region fails before writing the commit record
d) All participants say YES to PREPARE, the commit record is written, but the CICS region fails before sending COMMIT to MQ
e) MQ is unavailable at PREPARE time

For each scenario, show the message flow, the final state of each resource manager, and the recovery action.

Exercise 9: Performance Impact Calculation

A CICS AOR at Pinnacle Health processes 1,500 TPS for eligibility verification. Currently, each transaction involves only DB2 (single-phase commit, 0.1ms overhead). The compliance team wants to add an MQ notification and a VSAM audit record to every transaction, which would require full two-phase commit (estimated 1.0ms overhead).

a) Calculate the additional CPU time per day for the 2PC overhead
b) At $0.15 per CPU-second (typical mainframe MSU cost), what is the annual cost increase?
c) Propose an alternative architecture that provides the notification and audit capability without adding MQ and VSAM to every transaction's 2PC scope. Explain the trade-offs.
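
One way to set up parts (a) and (b) is sketched below in Python. It rests on assumptions the exercise does not state: the 1,500 TPS rate is sustained for all 24 hours, the quoted overhead is entirely CPU time, and a year has 365 days. Times are kept in integer microseconds and money in integer cents to avoid floating-point rounding.

```python
# Figures from the exercise; units converted to integers.
TPS = 1_500
ONE_PHASE_US = 100        # 0.1 ms single-phase commit overhead
TWO_PHASE_US = 1_000      # 1.0 ms two-phase commit overhead
SECONDS_PER_DAY = 86_400  # ASSUMPTION: constant load around the clock
CENTS_PER_CPU_SECOND = 15 # $0.15 per CPU-second

# Only the difference between the two commit styles is "additional".
extra_us_per_txn = TWO_PHASE_US - ONE_PHASE_US

extra_cpu_seconds_per_day = (
    TPS * extra_us_per_txn * SECONDS_PER_DAY // 1_000_000
)
annual_cost_dollars = (
    extra_cpu_seconds_per_day * CENTS_PER_CPU_SECOND * 365 // 100
)

print(extra_cpu_seconds_per_day)  # 116640
print(annual_cost_dollars)        # 6386040
```

If your workload has a pronounced daily peak, replace the constant-rate assumption with a per-hour profile before drawing conclusions for part (c).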

Exercise 10: DB2CONN Configuration Review

Review the following DB2CONN definition and identify three configuration problems that would affect recovery:

DEFINE DB2CONN(PHDB2)
  GROUP(PHDB2G)
  DB2ID(DB2P)
  RESYNCMEMBER(NORESYNC)
  STANDBYMODE(NOCONNECT)
  THREADERROR(ABEND)
  THREADLIMIT(50)
  THREADWAIT(NO)

For each problem, explain: what will happen during a failure, what the correct setting should be, and why.

Section 18.4 — Indoubt Resolution

Exercise 11: Indoubt Probability Calculation

At CNB, the average time between a successful PREPARE and the commit record write is 0.05ms. The CICS AOR processes 3,200 TPS, and every transaction goes through two-phase commit.

a) What is the probability that a random region failure (occurring at any millisecond) will catch at least one transaction in the indoubt window?
b) If the region fails once per month on average, how many indoubt transactions per year should CNB expect?
c) Why does Kwame say this calculation is "optimistic in all the wrong ways"? (Hint: consider what happens to the indoubt window under log stream contention.)
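
A possible model for parts (a) and (b), sketched in Python. The modelling choices are assumptions, not part of the exercise: every transaction passes through the indoubt window exactly once, transactions arrive independently (so the number inside the window at a random instant is Poisson-distributed), and "once per month" means 12 failures per year.

```python
import math

TPS = 3_200
WINDOW_SECONDS = 0.05 / 1_000   # 0.05 ms between PREPARE and commit record
FAILURES_PER_YEAR = 12          # ASSUMPTION: one failure per calendar month

# Average number of transactions inside the window at any instant
# (arrival rate multiplied by time spent in the window).
expected_in_window = TPS * WINDOW_SECONDS

# Poisson ASSUMPTION: P(at least one) = 1 - P(none).
p_at_least_one = 1 - math.exp(-expected_in_window)

expected_indoubt_per_year = FAILURES_PER_YEAR * expected_in_window

print(f"{expected_in_window:.2f}")         # 0.16
print(f"{p_at_least_one:.4f}")             # 0.1479
print(f"{expected_indoubt_per_year:.2f}")  # 1.92
```

The model treats the 0.05ms window as fixed, which is exactly the assumption part (c) asks you to challenge.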

Exercise 12: Indoubt Lock Impact Analysis

A CICS AOR fails with one indoubt transaction. The indoubt transaction holds a DB2 lock on the ACCOUNTS table row for account 00047291. The account holder is a corporate treasury account that processes 50 transactions per hour.

a) During the 90-second emergency restart, how many transactions for this account will fail due to lock timeout?
b) If DB2's IRLMRWT (lock wait timeout) is set to 30 seconds, what is the maximum elapsed time before the first timeout occurs?
c) If the CICS region restarts but MQ is still unavailable, causing the UOW to be shunted, how does the impact change?
d) Design a monitoring alert that detects indoubt-caused lock contention within 15 seconds.

Exercise 13: Manual Resolution Procedure

Write a step-by-step operational procedure for manually resolving an indoubt transaction using DFHRMUTL. Your procedure must include:
- Prerequisites (who can authorize, what information is needed)
- The DFHRMUTL commands to list indoubt UOWs
- Decision criteria for choosing COMMIT vs. BACKOUT
- The DFHRMUTL commands to execute the resolution
- Post-resolution verification steps
- Documentation requirements

Assume the indoubt transaction is a funds transfer where the source account debit may or may not have been applied.

Exercise 14: Shunted UOW Escalation

Design an escalation procedure for shunted UOWs at CNB. The procedure should define:
- Time thresholds (e.g., shunted > 5 minutes → Level 1 alert, > 15 minutes → Level 2)
- Who is notified at each level
- Decision criteria for forcing resolution vs. waiting for automatic resolution
- Documentation and audit trail requirements

Section 18.5 — Region Recovery

Exercise 15: Startup Type Decision Tree

Create a decision tree (or flowchart) that a CICS system programmer can use to determine the correct startup type (COLD, WARM, or EMERGENCY) based on the circumstances:
- Was the previous shutdown normal or abnormal?
- Is the system log intact and accessible?
- Is the coupling facility available?
- Are there known indoubt transactions?
- Has the CICS software been upgraded since the last run?
- Have CSD definitions changed since the last warm shutdown?

Exercise 16: ARM Policy Design

Design the ARM policy for the HA banking system's CICS topology (from your Chapter 13 project checkpoint). For each region type (TOR, AOR, FOR, CMAS), specify:
- RESTART_ATTEMPTS
- RESTART_INTERVAL
- RESTART_TIMEOUT
- RESTART_METHOD
- Cross-system restart requirements

Justify each parameter value based on the region's role and the recovery time objective.

Exercise 17: Emergency Restart Timing Analysis

CNB's CNBAORA1 processes 3,200 TPS with KEYINTV=60. During emergency restart:
- Log scan rate: 50,000 records per second
- Average log records per transaction: 3
- DB2 thread re-establishment: 2ms per thread, 150 threads
- MRO session re-establishment: 10ms per session, 20 sessions

a) How many log records (maximum) must be scanned?
b) How long does the log scan take?
c) How long does DB2 reconnection take?
d) How long does MRO reconnection take?
e) What is the total estimated emergency restart time?
f) What is the single largest contributor? How would you reduce it?
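
The timing questions can be set up as below. This Python sketch makes two assumptions that the exercise leaves open: KEYINTV=60 is read as a 60-second keypoint interval (so the worst case is a failure just before the next keypoint), and the three recovery phases run one after another rather than overlapping.

```python
TPS = 3_200
RECORDS_PER_TXN = 3
KEYPOINT_INTERVAL_S = 60     # ASSUMPTION: KEYINTV=60 interpreted as seconds
SCAN_RATE = 50_000           # log records scanned per second
DB2_THREADS, DB2_MS_PER_THREAD = 150, 2
MRO_SESSIONS, MRO_MS_PER_SESSION = 20, 10

# Worst case: everything logged since the last keypoint must be scanned.
max_records = TPS * RECORDS_PER_TXN * KEYPOINT_INTERVAL_S

scan_s = max_records / SCAN_RATE
db2_s = DB2_THREADS * DB2_MS_PER_THREAD / 1000
mro_s = MRO_SESSIONS * MRO_MS_PER_SESSION / 1000

# ASSUMPTION: phases are strictly sequential.
total_s = scan_s + db2_s + mro_s

print(max_records)         # 576000
print(f"{scan_s:.2f}")     # 11.52
print(f"{db2_s:.2f}")      # 0.30
print(f"{mro_s:.2f}")      # 0.20
print(f"{total_s:.2f}")    # 12.02
```

Changing `KEYPOINT_INTERVAL_S` and rerunning shows how strongly the log scan dominates, which is a useful starting point for part (f).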

Section 18.6 — Designing for Automatic Recovery

Exercise 18: Idempotent Transaction Design

Redesign the following non-idempotent transaction to be idempotent. The transaction processes a customer payment:

       EXEC SQL
           UPDATE LOAN_BALANCE
           SET AMOUNT_OWED = AMOUNT_OWED - :WS-PAYMENT
           WHERE LOAN_ID = :WS-LOAN-ID
       END-EXEC
       EXEC SQL
           INSERT INTO PAYMENT_HISTORY
           (LOAN_ID, PAYMENT_AMOUNT, PAYMENT_DATE)
           VALUES (:WS-LOAN-ID, :WS-PAYMENT, CURRENT DATE)
       END-EXEC
       EXEC CICS SYNCPOINT END-EXEC

Show the modified COBOL code, explain what makes the original non-idempotent, and describe how your redesign handles the retry scenario.
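
The exercise asks for COBOL, but the underlying pattern is language-neutral. The Python sketch below uses an in-memory SQLite database purely for illustration of one common approach: an idempotency key. The `payment_id` column, the table layout, and the function names are hypothetical and do not appear in the original program; your COBOL answer would need an equivalent unique key supplied by the caller.

```python
import sqlite3


def make_demo_db():
    """Create a throwaway database with one loan owing 100000 cents."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE loan_balance (loan_id TEXT PRIMARY KEY, amount_owed INTEGER)"
    )
    conn.execute(
        "CREATE TABLE payment_history ("
        " payment_id TEXT PRIMARY KEY,"   # hypothetical idempotency key
        " loan_id TEXT, payment_amount INTEGER)"
    )
    conn.execute("INSERT INTO loan_balance VALUES ('LOAN-1', 100000)")
    conn.commit()
    return conn


def apply_payment(conn, payment_id, loan_id, amount_cents):
    """Apply a payment at most once. Returns False on a duplicate payment_id."""
    try:
        with conn:  # one transaction: both statements commit, or neither does
            # The INSERT runs first so a retried payment fails on the primary
            # key before the balance is touched a second time.
            conn.execute(
                "INSERT INTO payment_history VALUES (?, ?, ?)",
                (payment_id, loan_id, amount_cents),
            )
            conn.execute(
                "UPDATE loan_balance SET amount_owed = amount_owed - ?"
                " WHERE loan_id = ?",
                (amount_cents, loan_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # already applied; the retry is a harmless no-op


def balance(conn, loan_id):
    row = conn.execute(
        "SELECT amount_owed FROM loan_balance WHERE loan_id = ?", (loan_id,)
    ).fetchone()
    return row[0]
```

Calling `apply_payment` twice with the same `payment_id` reduces the balance only once; the second call returns `False` and leaves both tables unchanged. The key has to come from the requester (or be derived deterministically from the request), otherwise a retry would generate a fresh key and defeat the check.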

Exercise 19: Retry Logic Design

Design a retry framework for the HA banking system that handles these five failure types:
- DB2 lock timeout (SQLCODE -911, reason code 00C9008E)
- DB2 deadlock (SQLCODE -911, reason code 00C90088)
- MQ connection failure (MQCC=2, MQRC=2009)
- MRO session failure (SYSIDERR)
- CICS storage shortage (NOSTG)

For each, specify: retryable (yes/no), maximum retries, backoff strategy, and escalation action if retries are exhausted.
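
The four attributes you are asked to specify map naturally onto a small policy record plus a bounded retry loop. The Python sketch below shows that shape only; `RetryPolicy`, `backoff_delays`, and `run_with_retry` are hypothetical names, exponential backoff is one strategy among several, and choosing the actual values for each failure type remains the exercise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    retryable: bool
    max_retries: int
    base_delay_s: float


def backoff_delays(policy):
    """Delay before each retry: base, 2*base, 4*base, ... (exponential)."""
    if not policy.retryable:
        return []
    return [policy.base_delay_s * (2 ** i) for i in range(policy.max_retries)]


def run_with_retry(operation, policy, sleep):
    """Run operation(); retry on any exception until the policy is exhausted.

    When retries run out the last exception is re-raised, which is where an
    escalation action (alert, abend, compensating transaction) would hook in.
    The sleep function is passed in so tests can avoid real waiting.
    """
    delays = backoff_delays(policy)
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            if attempt >= len(delays):
                raise
            sleep(delays[attempt])
            attempt += 1
```

In production code you would pass `time.sleep` as the `sleep` argument and catch only the specific error types a policy covers, rather than every exception.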

Exercise 20: Compensating Transaction Design

CNB's international wire transfer process involves five steps across two units of work:

UOW1: Debit customer account (DB2) + Record pending transfer (DB2)
UOW2: Send SWIFT message (external) + Update transfer status to SENT (DB2) + Send customer notification (MQ)

If UOW1 succeeds but UOW2 fails, design the compensating transaction. Address:
a) What must be reversed?
b) How do you handle the case where the SWIFT message was sent but the DB2 status update failed?
c) How do you make the compensation idempotent?
d) What audit trail does the compensation create?

Section 18.7 — Testing Recovery

Exercise 21: Recovery Test Plan

Write a recovery test plan for the HA banking system that covers:
- Transaction abend recovery (weekly automated)
- Region failure recovery (monthly)
- Indoubt resolution (quarterly)
- Multi-region cascade (semi-annual)

For each test, specify: setup steps, failure injection method, expected behavior, verification steps, pass/fail criteria, and rollback procedure.

Exercise 22: Failure Injection Techniques

Describe three different techniques for injecting a CICS region failure in a test environment without affecting other regions on the same LPAR. For each technique, explain:
- How it works
- What type of failure it simulates (immediate crash vs. gradual degradation)
- Advantages and limitations
- Whether it tests ARM auto-restart

Exercise 23: Indoubt Window Testing

Explain how you would construct a test that creates an indoubt transaction, that is, a transaction that is in the PREPARED state when the region fails. This is the hardest recovery test to construct. Describe:
- How to create a delay between PREPARE and COMMIT (hint: CICS user exits)
- How to time the region failure to fall within the delay
- How to verify the UOW is actually in the PREPARED state
- What you verify after the region restarts

Exercise 24: Recovery Documentation Review

You inherit a CICS environment where the recovery documentation was last updated 18 months ago. The environment has since undergone:
- CICS TS upgrade from 5.5 to 5.6
- DB2 upgrade from V12 to V13
- Migration of system logs from DASD to coupling facility
- Addition of two new AOR regions

List the top 10 documentation elements that are most likely to be incorrect, and explain why each matters for recovery.

Exercise 25: Recovery Metrics Dashboard

Design a monitoring dashboard for CICS recovery health. Include:
- Real-time metrics (log stream utilization, active UOWs, shunted UOWs, DB2 thread status)
- Historical metrics (recovery time trend, indoubt frequency, ARM restart count)
- Alert thresholds for each metric
- The CICS and DB2 commands needed to collect each metric

Exercise 26: Cross-Reference Exercise — Locking and Recovery

Explain the interaction between DB2 locking strategy (Chapter 8) and CICS recovery. Specifically:

a) How does the choice of page-level vs. row-level locking affect the blast radius of a CICS region failure?
b) If a CICS region fails with 500 active DB2 transactions using row-level locking, approximately how many DB2 rows are locked? With page-level locking, how many rows are effectively locked?
c) How does LOCK AVOIDANCE (from Chapter 8) interact with CICS recovery? Can lock avoidance help or hurt during the recovery window?

Exercise 27: Cross-Reference Exercise — Region Topology and Recovery

Using the topology you designed in Chapter 13's project checkpoint, analyze:

a) What is the maximum number of simultaneous region failures your topology can survive while maintaining 99.99% availability?
b) If your topology has a single FOR, what is the FOR's contribution to the overall recovery risk? How would migrating FOR-resident data to DB2 with data sharing change the recovery profile?
c) Design a CICSPlex SM health check transaction for your topology that verifies DB2, MQ, and VSAM connectivity.