Chapter 18 Key Takeaways
The Core Principle
Recovery behavior is a design decision, not a side effect. Every aspect of how your CICS system recovers from failure is determined by choices you make in advance — SIT parameters, resource definitions, transaction design, log configuration, and operational procedures. If you don't make these choices explicitly, CICS makes them for you with conservative defaults that prioritize safety over speed and manual intervention over automatic recovery.
Failure Taxonomy (Section 18.1)
- Five categories of failure, each with different recovery mechanisms. Transaction abends (automatic backout), task-level failures (purge or intervention), region failures (emergency restart), system failures (LPAR restart + subsystem recovery), and Sysplex-wide failures (DR procedures). Design your recovery architecture to handle categories 1–3 automatically before worrying about categories 4–5.
- The first architectural decision is which failures to automate. At CNB, categories 1–3 are fully automatic. Category 4 requires minimal operator intervention. Category 5 invokes the DR plan. Most shops make the mistake of designing for category 5 while leaving category 3 partially manual.
Recovery Architecture (Section 18.2)
- The system log is the single most important recovery artifact. It records every change to every recoverable resource. Place it on a coupling facility log stream, not DASD. Set LOGDEFER=NO for financial transactions. If you lose the system log, you lose the ability to do emergency restart.
- Activity keypoints bound recovery time. KEYINTV controls how far back the recovery manager must scan the log. At 3,200 TPS, KEYINTV=300 (default) means potentially scanning 960,000 records. KEYINTV=60 limits it to 192,000 records. Tune KEYINTV based on your transaction volume and recovery time target.
- The recovery manager coordinates, not just replays. It coordinates backout with every participating resource manager (DB2, MQ, VSAM). It manages the two-phase commit resolution for distributed transactions. It shunts UOWs when participants are unavailable. Understanding the recovery manager's coordination role is essential for diagnosing recovery problems.
- Not every resource should be recoverable. Recoverable resources add overhead (log records, syncpoint coordination, recovery processing). If losing the data during a region failure would cost money or violate a regulation, make it recoverable. If it would just cost time, don't.
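The keypoint figures above follow from simple multiplication, sketched here in Python. The function names are hypothetical, and the model assumes one log record per transaction and an interval measured in seconds, which is what the quoted numbers imply rather than a statement of how CICS counts records.

```python
def worst_case_scan_records(tps: int, keypoint_interval_s: int,
                            records_per_txn: int = 1) -> int:
    """Upper bound on log records written between two activity keypoints.

    Assumption: records_per_txn=1 reproduces the chapter's figures; a real
    workload may write more than one record per transaction.
    """
    if tps < 0 or keypoint_interval_s < 0 or records_per_txn < 0:
        raise ValueError("inputs must be non-negative")
    return tps * keypoint_interval_s * records_per_txn


def max_interval_for_budget(tps: int, record_budget: int,
                            records_per_txn: int = 1) -> int:
    """Largest whole-second interval whose worst-case scan fits the budget."""
    per_second = tps * records_per_txn
    if per_second <= 0:
        raise ValueError("tps and records_per_txn must be positive")
    return record_budget // per_second


if __name__ == "__main__":
    for interval in (300, 60):
        print(interval, worst_case_scan_records(3200, interval))
```

Working backward with max_interval_for_budget is one way to turn a recovery time target into a candidate interval, provided you first measure how many records per second your recovery manager can actually scan.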
XA and Two-Phase Commit (Section 18.3)
- Two-phase commit separates the decision from the act. Phase 1 (PREPARE) ensures all participants can commit. The commit record on the coordinator's log is the commit point. Phase 2 (COMMIT) executes the decision. The coordinator's log is the source of truth.
- CICS is the coordinator; resource managers are participants. In the event of disagreement or failure, the coordinator's log wins. If the log says committed, all participants commit. If the log has no commit record, all participants back out.
- Every resource manager in the 2PC scope adds syncpoint overhead. Single-phase (DB2 only): ~0.1ms. Two-phase (DB2 + MQ): ~0.6ms. Cross-region (DB2 + MQ + MRO): ~1.0ms. Design transactions with the minimum set of resource managers necessary for data integrity.
- RESYNCMEMBER(GROUPRESYNC) is mandatory for Sysplex environments. Without it, CICS can only resolve indoubt transactions through the specific DB2 member it was connected to. If that member is also down, resolution is blocked.
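The rule that the coordinator's log decides the outcome can be illustrated with a toy in-memory model. This is a sketch only: the Participant and Coordinator classes, the fail_at switch, and the state names are hypothetical teaching devices, not CICS or resource manager interfaces.

```python
class Participant:
    """Stand-in for a resource manager such as DB2 or MQ."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1 vote: a yes vote is a promise that commit will succeed.
        if self.can_commit:
            self.state = "prepared"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def backout(self):
        self.state = "backed_out"


class Coordinator:
    """Stand-in for the coordinator and its log."""

    def __init__(self):
        self.log = set()  # UOW ids that have a commit record

    def syncpoint(self, uow_id, participants, fail_at=None):
        votes = [p.prepare() for p in participants]
        if not all(votes):
            for p in participants:
                p.backout()
            return "backed_out"
        if fail_at == "before_log":   # simulated crash inside the indoubt window
            return "indoubt"
        self.log.add(uow_id)          # the commit point
        if fail_at == "after_log":    # simulated crash before phase 2 finishes
            return "indoubt"
        for p in participants:
            p.commit()
        return "committed"

    def resolve(self, uow_id, participants):
        # After restart the log alone decides: record present means commit,
        # record absent means back out.
        committed = uow_id in self.log
        for p in participants:
            if committed:
                p.commit()
            else:
                p.backout()
        return "committed" if committed else "backed_out"
```

Calling syncpoint with fail_at="after_log" and then resolve drives every participant to committed, while fail_at="before_log" followed by resolve drives them all to backed out, which matches the rule stated in the bullets.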
Indoubt Resolution (Section 18.4)
- Indoubt transactions are rare but their impact is disproportionate. The indoubt window (between PREPARE and commit record) is normally microseconds. But at high volume, even microsecond windows will eventually be hit. The real damage is not the indoubt transaction itself but the locks it holds while waiting for resolution.
- Automatic resolution depends on participant availability. If all participants are available at restart, resolution is automatic. If any participant is unavailable, the UOW is shunted. Shunted UOWs hold locks. The Pinnacle Health incident demonstrates how a 7-minute MQ recovery window translates to 7 minutes of DB2 lock contention affecting patient care.
- RMRETRY controls shunted UOW retry frequency. The default (300 seconds) is almost always too long. CNB uses 30 seconds. After a participant recovers, the maximum delay before resolution is one RMRETRY interval. Reduce it.
- Manual indoubt resolution (DFHRMUTL) requires two-person authorization. A wrong COMMIT/BACKOUT decision corrupts data. The CICS system programmer determines the technical state; a business operations manager confirms the business decision.
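The effect of the retry interval on lock hold time can be worked through with a small calculation. The sketch below assumes a simplified model in which retries fire at fixed multiples of the interval, counted from the moment the UOW is shunted; the function names are hypothetical and real retry timing may differ.

```python
def first_retry_at_or_after(recovery_time_s: int, rmretry_s: int) -> int:
    """Time (seconds after shunting) of the first retry that can succeed.

    Assumption: retries fire at rmretry_s, 2*rmretry_s, ... after the shunt,
    and a retry succeeds only once the participant has recovered.
    """
    if rmretry_s <= 0:
        raise ValueError("rmretry_s must be positive")
    if recovery_time_s < 0:
        raise ValueError("recovery_time_s must be non-negative")
    # Integer ceiling division, then scale back to seconds.
    return -(-recovery_time_s // rmretry_s) * rmretry_s


def extra_lock_hold_s(recovery_time_s: int, rmretry_s: int) -> int:
    """Seconds that locks stay held after the participant is already back."""
    return first_retry_at_or_after(recovery_time_s, rmretry_s) - recovery_time_s


if __name__ == "__main__":
    seven_minutes = 7 * 60
    for interval in (300, 30):
        print(interval, extra_lock_hold_s(seven_minutes, interval))
```

Under this model, a participant that takes 7 minutes to recover leaves locks held for a further 180 seconds with a 300-second interval and for no extra time with a 30-second interval. In every case the extra delay is less than one interval.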
Region Recovery (Section 18.5)
- Emergency restart is the workhorse. It handles abnormal termination by scanning the log, classifying UOWs, coordinating backout and resolution, reopening resources, and resuming transaction processing. Typical recovery time: 30–90 seconds for a well-configured region.
- Cold start is a data integrity event. No log replay, no recovery. In-flight transactions leave resources inconsistent. Use only as a last resort. CNB's cold start triggers a mandatory 4–6 hour data reconciliation.
- ARM handles restart; CICSPlex SM handles health. ARM answers "is the region running?" CPSM answers "is the region healthy?" Both are needed. A running-but-unhealthy region (DB2 not connected, MQ not initialized) that receives routed transactions will fail every one.
- RESTART_ATTEMPTS prevents restart loops. ARM stops trying after the configured number of failures within the interval. This prevents a persistent defect (corrupt load module, bad SIT parameter) from causing an infinite restart loop.
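The restart-loop guard described in the last bullet can be sketched as a sliding-window counter. This is a simplified model of the idea, not ARM's implementation; the RestartLimiter class and its parameters are hypothetical.

```python
from collections import deque


class RestartLimiter:
    """Allow at most max_attempts restarts within any interval_s window."""

    def __init__(self, max_attempts: int, interval_s: float):
        if max_attempts < 1 or interval_s <= 0:
            raise ValueError("max_attempts >= 1 and interval_s > 0 required")
        self.max_attempts = max_attempts
        self.interval_s = interval_s
        self._failures = deque()  # failure timestamps still inside the window

    def should_restart(self, failure_time_s: float) -> bool:
        """Record a failure and report whether another restart is allowed."""
        self._failures.append(failure_time_s)
        # Forget failures that have aged out of the window.
        cutoff = failure_time_s - self.interval_s
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures) <= self.max_attempts


if __name__ == "__main__":
    limiter = RestartLimiter(max_attempts=3, interval_s=300)
    for t in (0, 10, 20, 30, 1000):
        print(t, limiter.should_restart(t))
```

With three attempts per 300 seconds, the fourth failure in quick succession is refused, which is the point where an operator should look for a persistent defect. A failure long after the window has passed is treated as a fresh incident.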
Designing for Automatic Recovery (Section 18.6)
- Idempotent transactions make retry safe. A unique request ID plus an atomic duplicate check within the same UOW ensures that retried transactions produce the same result, not duplicate results. Idempotency reduces recovery complexity by making retry a safe default action.
- Classify errors before retrying. Lock timeouts and connection failures are transient (retryable). Program errors and security failures are permanent (not retryable). Unknown errors should default to not-retryable to avoid amplifying problems.
- Exponential backoff prevents contention amplification. Never retry immediately. Start at 500ms, double with each attempt, cap at 3 retries. This gives the underlying condition time to resolve.
- Compensating transactions handle committed-but-partially-complete business processes. When a multi-UOW process fails after the first UOW commits, a compensating transaction reverses the committed work. Compensating transactions must be idempotent, auditable, and handle their own failure scenarios.
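The first three bullets combine naturally into one retry wrapper. The sketch below is a minimal illustration under stated assumptions: the error codes, TransactionError, and the in-memory Ledger are hypothetical stand-ins, and in a real system the duplicate check would be an atomic write inside the same UOW as the update rather than a dictionary lookup.

```python
import time

# Hypothetical codes for the transient conditions named above.
TRANSIENT_CODES = {"LOCK_TIMEOUT", "CONNECTION_FAILURE"}


class TransactionError(Exception):
    def __init__(self, code):
        super().__init__(code)
        self.code = code


def is_retryable(code):
    # Anything not known to be transient, including unknown codes, is not retried.
    return code in TRANSIENT_CODES


def backoff_delays_ms(initial_ms=500, max_retries=3):
    # 500 ms, then doubled for each later retry.
    return [initial_ms * (2 ** n) for n in range(max_retries)]


def run_with_retry(operation, sleep=time.sleep, initial_ms=500, max_retries=3):
    delays = backoff_delays_ms(initial_ms, max_retries)
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransactionError as err:
            if not is_retryable(err.code) or attempt == max_retries:
                raise
            sleep(delays[attempt] / 1000.0)  # never retry immediately


class Ledger:
    """In-memory stand-in for an idempotent posting transaction."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # request_id -> result already produced

    def post(self, request_id, amount):
        if request_id in self._results:      # duplicate: return the same result
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance
```

Because Ledger.post returns the stored result for a repeated request ID, wrapping it in run_with_retry cannot apply the same posting twice, which is what makes retry a safe default.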
Testing Recovery (Section 18.7)
- Recovery testing is not optional. Many shops never test their recovery procedures. When a failure then occurs, the procedure doesn't work because configurations have drifted. Test regularly: transaction abends weekly, region failures monthly, indoubt resolution quarterly.
- The most valuable outcome of recovery testing is discovering that your procedures are wrong. Every test uncovers documentation errors, parameter drift, or invalid assumptions. The test itself is the product.
- Test in an environment that mirrors production. Same SIT parameters, same ARM policies, same DB2CONN definitions, same log stream configurations. A recovery test against different settings proves nothing about production recovery.
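One way to catch drift before a recovery test is to compare the recovery-relevant settings of the two environments. The sketch below assumes the settings have already been extracted into plain dictionaries by some other means; the find_drift function and the sample values are illustrative only.

```python
MISSING = "<missing>"


def find_drift(production: dict, test: dict) -> dict:
    """Return {setting: (production_value, test_value)} for every mismatch."""
    drift = {}
    for key in sorted(set(production) | set(test)):
        prod_value = production.get(key, MISSING)
        test_value = test.get(key, MISSING)
        if prod_value != test_value:
            drift[key] = (prod_value, test_value)
    return drift


if __name__ == "__main__":
    # Illustrative values only, not recommendations.
    production = {"KEYINTV": 60, "RMRETRY": 30, "RESYNCMEMBER": "GROUPRESYNC"}
    test = {"KEYINTV": 60, "RMRETRY": 300}
    for name, (prod_value, test_value) in find_drift(production, test).items():
        print(f"{name}: production={prod_value} test={test_value}")
```

An empty result is a reasonable precondition for running a recovery test; any entry means the test would exercise settings that production does not use.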
The Big Picture
CICS failure and recovery is an architecture, not a feature. It spans the system log, the recovery manager, two-phase commit, indoubt resolution, ARM auto-restart, CICSPlex SM health monitoring, idempotent transaction design, retry logic, and compensating transactions. Each component addresses a different failure mode. Together, they create a system that recovers from failures automatically, quickly, and correctly.
The technology is rock-solid. Twenty-five years of CICS production experience shows that the failure modes are documented and the recovery mechanisms work. The failures are almost always human: configurations not set, procedures not tested, knowledge not transferred. The architect's job is to close those human gaps by making deliberate choices, documenting them explicitly, testing them regularly, and training the team to execute them under pressure.