

Learning Objectives

  • Design CICS recovery procedures for transaction, region, and system failures
  • Explain XA (two-phase commit) transactions across CICS, DB2, and MQ
  • Resolve indoubt transactions after CICS or DB2 failures
  • Implement automatic recovery mechanisms including CICS auto-restart and transaction retry
  • Architect the failure/recovery strategy for the HA banking system

Chapter 18: CICS Failure and Recovery

XA Transactions, Indoubt Resolution, and Designing for Automatic Recovery

"In twenty-five years of running CICS in production, I have never once been surprised that a failure happened. I have been surprised by how we failed to recover from it. The failure is always the easy part. Recovery is where careers are made or ended." — Kwame Mensah, Chief Mainframe Architect, Continental National Bank


This is the chapter that separates programmers from architects.

Everything you've built in Part III — the region topologies of Chapter 13, the web services of Chapter 14, the channels and containers of Chapter 15, the security models of Chapter 16, the performance tuning of Chapter 17 — all of it assumes the system is running. This chapter is about what happens when it stops running. And then starts again. And whether anything was lost in between.

At Continental National Bank, "anything lost" could mean a customer's $50,000 wire transfer. At Pinnacle Health Insurance, it could mean a claim adjudication that determines whether a patient gets their surgery approved. At Federal Benefits Administration, it could mean a Social Security payment that 80 million Americans depend on.

The stakes are not academic. The recovery architecture you design determines whether failures are invisible blips or front-page disasters.

⚠️ PREREQUISITE CHECK — This chapter requires solid understanding of CICS region topology and MRO (Chapter 13), DB2 locking and commit scope (Chapter 8), and z/OS system architecture (Chapter 1). If you don't understand what a syncpoint is, what a two-phase commit does, or how MRO connects regions, stop and review those chapters first. The material here builds directly on those foundations.


18.1 Failure Is Not If But When

Let me share a story that Kwame tells every new architect who joins CNB.

In 2016, CNB's primary CICS AOR — CNBAORA1 — suffered an abrupt failure at 14:17 on a Tuesday afternoon. Peak transaction volume. 3,200 transactions per second flowing through the region. The cause: a storage overlay in a rarely-executed foreign exchange program corrupted the CICS kernel's dispatch control area. The region went down hard. No graceful shutdown. No warning. One moment processing transactions, the next moment gone.

Here's what happened in the next 47 seconds:

  • CICS's auto-restart facility detected the region failure
  • CICSPlex SM's workload management stopped routing new transactions to CNBAORA1
  • Remaining AORs (CNBAORA2, CNBAORB1, CNBAORB2) absorbed the redistributed workload
  • CICS auto-restart initiated emergency restart of CNBAORA1
  • The recovery manager replayed the system log, backing out 847 in-flight transactions
  • DB2's lock manager resolved locks held by the failed region's threads
  • 47 seconds after failure, CNBAORA1 was back online and accepting transactions

847 in-flight transactions were rolled back. Zero transactions were lost. Zero data corruption. The ATM network never noticed. Branch tellers experienced a 2-second delay on transactions that were mid-flight. The customer impact was, functionally, zero.

That is what a properly designed recovery architecture delivers.

Now let me tell you about a bank that didn't have one.

The Taxonomy of CICS Failures

Before you can design recovery, you need to understand what can fail. CICS failures fall into five categories, each with different characteristics, different recovery mechanisms, and different architectural implications.

Category 1: Transaction Failure (Abend)

A single transaction abends — ASRA (program check, such as a data exception), APCT (program not found or disabled), ATNI (terminal I/O error). The failing transaction is backed out. All resources it modified are restored to their pre-transaction state. Other transactions in the region are unaffected.

This is CICS doing its job as a transaction manager. The recovery is automatic, immediate, and invisible to other users. You don't design for this; CICS handles it. What you design is the notification — making sure operations knows the transaction failed, logging the details for diagnosis, and ensuring the user gets a meaningful error message rather than a cryptic abend code.

Category 2: Task-Level Failure

A task enters an error state but doesn't abend the transaction cleanly. Examples: a looping task consuming CPU, a task waiting indefinitely on a resource, a task holding an enqueue that blocks other tasks. The individual task is broken, but the region is still running.

Recovery: CICS's runaway task detection (the ICVR SIT parameter) can purge looping tasks. Deadlock detection can break task-level deadlocks. Operations can purge individual tasks with CEMT SET TASK(nnn) PURGE. The region continues operating throughout.

Category 3: Region Failure

The entire CICS region fails. Causes include: storage overlays corrupting the CICS kernel, z/OS initiating an abend of the CICS address space (S0C4, S878), operator CANCEL command, or z/OS resource exhaustion forcing region termination. Every transaction in the region is affected. Every resource the region owns becomes unavailable until recovery.

This is where recovery architecture matters. The region must restart, replay its logs, resolve its resources, and re-enter the topology — ideally without human intervention.

Category 4: System Failure

The z/OS image (LPAR) fails. Every CICS region on that LPAR fails simultaneously. All DB2 members on that LPAR fail. All MQ queue managers on that LPAR fail. If you're running a Sysplex, the coupling facility detects the LPAR departure and initiates cross-system recovery.

This is disaster recovery territory. The recovery involves LPAR restart, z/OS IPL, subsystem initialization, and CICS region startup — a sequence that takes minutes, not seconds.

Category 5: Sysplex-Wide Failure

Multiple LPARs fail, or the coupling facility fails, or the network fails. This is the catastrophic scenario that Chapter 30 (Disaster Recovery) addresses in depth. For this chapter, we focus on categories 1–4.

💡 INSIGHT — The first architectural decision in recovery design is determining which failure categories your system must survive automatically and which are acceptable to handle with operator intervention. At CNB, categories 1–3 are fully automatic. Category 4 requires minimal operator intervention (confirm LPAR restart). Category 5 invokes the DR plan with site failover. Most shops make the mistake of designing for category 5 while leaving category 3 partially manual. Get category 3 right first.
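The tiering in the insight above can be written down as a small decision table. A sketch in Python (category numbers from this section; the action labels are ours, not CICS terms):

```python
# CNB's recovery-automation tiers, one per failure category (illustrative labels).
RECOVERY_POLICY = {
    1: "automatic",           # transaction abend: CICS backs out the UOW itself
    2: "automatic",           # task-level failure: runaway purge, deadlock break
    3: "automatic",           # region failure: auto-restart plus emergency restart
    4: "operator-confirmed",  # LPAR failure: operator confirms restart, then automation
    5: "dr-plan",             # sysplex-wide failure: invoke the DR plan (Chapter 30)
}

def recovery_action(category: int) -> str:
    """Return the recovery posture required for a failure category."""
    return RECOVERY_POLICY[category]
```

The point of writing it down: if category 3 ever maps to anything other than "automatic", the design review should ask why.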

Recovery Is Architecture, Not Accident

Here's the principle that governs this entire chapter: recovery behavior is a design decision, not a side effect. Every aspect of how your CICS system recovers from failure is determined by choices you make in advance — in your SIT parameters, your resource definitions, your transaction design, your log configuration, and your operational procedures.

If you don't make these choices explicitly, CICS makes them for you with defaults. And the defaults are conservative. They prioritize safety over speed, manual intervention over automatic recovery. For a shop processing 500 million transactions per day, conservative defaults translate to longer outages, more manual work, and more risk of human error during the most stressful moments.

🔗 CROSS-REFERENCE — Chapter 13, Section 13.1 introduced the concept that CICS is a transaction manager, not an application server. That distinction is critical here. An application server that fails simply restarts and begins accepting new work. A transaction manager that fails must also resolve the work that was in progress when it failed — completing committed work, backing out uncommitted work, and resolving work that was between phases of a two-phase commit. This resolution process is what makes CICS recovery complex and what makes it powerful.


18.2 CICS Recovery Architecture

Every CICS region maintains a set of structures that collectively enable recovery. Understanding these structures — what they record, when they record it, and how they're used during recovery — is the foundation for everything else in this chapter.

The System Log

The CICS system log is the single most important recovery artifact. It records every change to every recoverable resource in the region. When CICS starts a unit of work, the system log gets a record. When the UOW modifies a recoverable VSAM file, the system log gets a before-image and after-image. When the UOW issues a syncpoint, the system log gets a commit record. When the UOW abends, the system log gets a backout record.

The system log is implemented as a z/OS log stream. In a Sysplex environment, it can be directed to the coupling facility for hardware-level duplexing, which means the log survives even if the LPAR hosting the CICS region fails. This is not optional for production. If your system log is on DASD and your LPAR fails, you cannot do emergency restart. You've lost the log. You're doing a cold start, which means losing all in-flight work with no recovery.

CICS System Log Configuration (SIT Parameters):
  LOGDEFER=NO         Do not defer log writes (production setting)
  AKPFREQ=4000        Activity keypoint frequency (log writes between keypoints)
  LGDFINT=5           Log defer interval (5ms) if LOGDEFER=YES

z/OS Logger Configuration (for coupling facility log stream):
  DEFINE LOGSTREAM NAME(CNB.CICS.CNBAORA1.DFHLOG)
    STRUCTNAME(CICS_LOG_001)
    LOWOFFLOAD(40)
    HIGHOFFLOAD(80)
    MAXBUFSIZE(65532)
    STG_DUPLEX(YES)
    DUPLEXMODE(COND)
    LS_DATACLAS(CICSLOG)

⚠️ PRODUCTION RULE — Never set LOGDEFER=YES in a CICS region that processes financial transactions. LOGDEFER batches log writes for performance, but it creates a window where committed work is not yet on the log. If the region fails during that window, committed transactions may be lost. The performance benefit (typically 2–5% CPU reduction) is not worth the data loss risk. CNB learned this the hard way during a 2014 migration — three committed wire transfers worth $2.1M total were lost during a region failure because the log write was deferred. They set LOGDEFER=NO across every production region within 24 hours.

Activity Keypoints

CICS periodically writes an activity keypoint to the system log. The activity keypoint is a snapshot of every active unit of work in the region — their resource modifications, their lock holdings, their current state. During recovery, CICS reads the log backward from the end, but it only needs to read back to the most recent activity keypoint. Without keypoints, CICS would need to read the entire log from the beginning of the region's life.

The frequency of activity keypoints is controlled by the AKPFREQ SIT parameter: the number of write requests to the log stream buffer between keypoints. The default is 4000. Letting keypoints drift too far apart stretches the span of log that emergency restart must re-read; at CNB's volume, that could mean replaying 50,000+ records.

SIT Override for Activity Keypoint Frequency:
  AKPFREQ=1000        Activity keypoint every 1,000 log write requests

CNB sets AKPFREQ=1000 for AORs and AKPFREQ=4000 for TORs. The AOR keypoint frequency is higher because AORs have more recoverable resources, and recovery time is more critical for AORs (they hold the application state).
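Why keypoint frequency bounds recovery time can be seen in a toy model. A Python sketch (record shapes invented for illustration; a real log stream is far richer):

```python
# Toy system log: oldest record first, newest last.
LOG = [
    {"type": "begin",    "urid": "A"},
    {"type": "keypoint"},                  # snapshot of all in-flight work
    {"type": "begin",    "urid": "B"},
    {"type": "commit",   "urid": "A"},
    {"type": "begin",    "urid": "C"},
]

def records_to_scan(log):
    """Count records read backward from the end to the latest keypoint."""
    n = 0
    for rec in reversed(log):
        n += 1
        if rec["type"] == "keypoint":
            break                          # the keypoint snapshot covers the rest
    return n
```

Recovery here reads 4 records; append a fresh keypoint and the same scan reads 1. More frequent keypoints mean a shorter backward scan, paid for with keypoint overhead during normal running.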

The Recovery Manager

The CICS recovery manager (DFHRM) is the component that coordinates recovery. It doesn't merely replay the log — it coordinates with every resource manager that participated in the unit of work.

During normal operation, the recovery manager:

  • Assigns a Unit of Recovery ID (URID) to each unit of work
  • Tracks which resource managers are participating in each UOR
  • Coordinates syncpoints across all participating resource managers
  • Manages the two-phase commit protocol for distributed transactions

During recovery, the recovery manager:

  • Reads the system log backward from the end to the most recent activity keypoint
  • Identifies all in-flight units of work (started but not committed or backed out)
  • For each in-flight UOW, coordinates backout with every participating resource manager
  • For committed UOWs where after-images haven't been applied, coordinates redo
  • For distributed transactions that were in the prepared state (phase 1 complete, phase 2 not started), enters indoubt resolution (see section 18.4)
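The restart-time decisions in the list above come down to one question per unit of work: what is the last state record on the log? A minimal sketch (log record layout assumed for illustration):

```python
def classify_uows(log):
    """Decide the restart action for each UOW from its last log record.
    Presumed abort: no commit record on the coordinator's log means back out;
    'prepared' with no commit record is the indoubt case (section 18.4)."""
    last_state = {}
    for rec in log:                        # forward scan; the last record wins
        last_state[rec["urid"]] = rec["type"]
    actions = {}
    for urid, state in last_state.items():
        if state == "commit":
            actions[urid] = "redo"         # ensure phase-2 COMMIT reaches participants
        elif state == "prepared":
            actions[urid] = "indoubt"      # phase 1 done, commit record never written
        else:
            actions[urid] = "backout"      # in flight: back out at every participant
    return actions
```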

This is where the recovery manager's relationship with DB2 becomes critical. When a CICS region fails while DB2 transactions are in flight, DB2 still holds the locks. DB2 doesn't release those locks until CICS tells it to — either through normal commit/backout during restart, or through indoubt resolution if CICS doesn't restart promptly.

🔗 SPACED REVIEW — Chapter 8 — Recall from Chapter 8 that DB2 locks are held for the duration of the unit of work. In a CICS environment, the unit of work boundary is the EXEC CICS SYNCPOINT. If a CICS region fails with 500 active tasks, each holding DB2 locks, those locks remain held until CICS restarts and resolves each UOW. Until then, other CICS regions (or batch jobs) attempting to access the same DB2 rows will wait — or time out. This is why fast CICS recovery is not just about the failing region; it's about the entire system's throughput during the recovery window.

Journals

CICS journals (also known as user journals) are distinct from the system log. The system log is CICS's internal recovery mechanism. Journals are application-directed log records that your programs write explicitly using EXEC CICS WRITE JOURNALNAME.

Journals serve two purposes:

  1. Forward recovery. If a VSAM file is damaged beyond what backout can repair (disk failure, accidental deletion), you can replay journal records to reconstruct the file from a backup. The journal contains the after-images of every update.

  2. Audit trail. Journal records provide a chronological record of every change, with timestamps, transaction IDs, and terminal IDs. For regulatory compliance (SOX, HIPAA, PCI-DSS), journals are evidence of who changed what and when.

      * Writing a journal record for audit and forward recovery
           EXEC CICS WRITE JOURNALNAME('DFHJ01')
                JTYPEID('TX')
                FROM(WS-JOURNAL-RECORD)
                FLENGTH(WS-JOURNAL-LENGTH)
                WAIT
                RESP(WS-RESP)
           END-EXEC

The WAIT option is critical. Without WAIT, the journal write is asynchronous — your program continues before the journal record is physically written. If the region fails before the write completes, the journal record is lost. For audit-critical transactions, always specify WAIT.

💡 INSIGHT — The system log and journals serve different masters. The system log serves the recovery manager — it exists to enable backout and redo during restart. Journals serve the application — they exist for forward recovery and audit. You need both. The system log handles the "undo" (backing out failed transactions). The journal handles the "redo" (replaying changes to reconstruct lost data from a backup).
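The "redo" half can be shown in miniature: start from a backup copy and reapply journaled after-images in order. A sketch (field names invented for illustration):

```python
def forward_recover(backup: dict, journal: list) -> dict:
    """Rebuild a keyed file from a backup plus chronological after-images."""
    rebuilt = dict(backup)                           # never mutate the backup itself
    for rec in journal:
        rebuilt[rec["key"]] = rec["after_image"]     # later images overwrite earlier
    return rebuilt
```

This is also why the WAIT option matters: an after-image that never reached the journal is a change that forward recovery silently loses.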

Recoverable Resource Definitions

Recoverability is declared on each resource definition: the RECOVERY attribute of a FILE definition, the RECOVSTATUS attribute of a transient data queue, the RECOVERY attribute of a TSMODEL. These attributes determine which VSAM files, transient data queues, and temporary storage queues will have their changes backed out if a transaction fails or a region crashes.

A common mistake is making everything recoverable. Recoverable resources add overhead: every modification generates system log records, every syncpoint coordinates with the resource manager, and every recovery replays those log records. If a resource doesn't need transactional integrity — a scratch file, a temporary work queue, a cache — don't make it recoverable.

At CNB, approximately 30% of VSAM files are defined as recoverable. The rest are either read-only reference files or scratch files that can be recreated. Lisa Tran's rule: "If losing the data during a region failure would cost us money or violate a regulation, it's recoverable. If it would just cost us time, it's not."

Recovery Architecture in a Multi-Region Topology

In a multi-region topology (Chapter 13), recovery becomes more complex because a single transaction may involve resources in multiple regions.

Consider a funds transfer at CNB:

  1. TOR receives the request and routes it to AOR via MRO
  2. AOR reads the source account from DB2 (via CICS-DB2 attachment)
  3. AOR updates the source account balance in DB2
  4. AOR function-ships a VSAM write to FOR for the transaction journal
  5. AOR updates the destination account balance in DB2
  6. AOR issues EXEC CICS SYNCPOINT

The syncpoint in step 6 must coordinate across three participants: DB2 (for the account balance updates), the FOR (for the VSAM journal write), and CICS itself (for the system log). If any participant cannot commit, all must back out. This is the two-phase commit protocol — which brings us to section 18.3.


18.3 XA Transactions and Two-Phase Commit

What XA Actually Is

XA is the X/Open standard for distributed transaction processing. It defines a protocol by which a transaction manager (CICS) coordinates commit and backout across multiple resource managers (DB2, MQ, VSAM RLS, etc.). The protocol has two phases, hence "two-phase commit" (2PC).

In CICS terms, here's what happens during a syncpoint that involves multiple resource managers:

Phase 1 — Prepare

CICS sends a PREPARE signal to every resource manager participating in the unit of work.

  • DB2 receives PREPARE, writes its log records, and responds "YES, I can commit" or "NO, I need to back out."
  • MQ receives PREPARE, writes its log records, and responds YES or NO.
  • VSAM RLS receives PREPARE, writes its log records, and responds YES or NO.

If all resource managers respond YES, CICS proceeds to Phase 2. If any resource manager responds NO, CICS sends ROLLBACK to all resource managers. The transaction is backed out. End of story.

Phase 2 — Commit

CICS writes a commit record to its own system log. This is the commit point — the moment of truth. Once the commit record is on the log, the transaction is committed. Even if the system fails immediately after, the commit record on the log ensures that during recovery, the transaction will be completed (not backed out).

CICS then sends COMMIT to every resource manager. Each resource manager applies its changes and releases its locks.

Two-Phase Commit Timeline:

Application:  EXEC CICS SYNCPOINT
                    |
CICS:              [Begin 2PC]
                    |
Phase 1:    ┌──────────────────────────────┐
            │ PREPARE ──► DB2   ──► YES    │
            │ PREPARE ──► MQ    ──► YES    │
            │ PREPARE ──► VSAM  ──► YES    │
            └──────────────────────────────┘
                    |
            [All YES — proceed]
                    |
CICS Log:   [Write COMMIT record]  ◄── THE COMMIT POINT
                    |
Phase 2:    ┌──────────────────────────────┐
            │ COMMIT  ──► DB2              │
            │ COMMIT  ──► MQ               │
            │ COMMIT  ──► VSAM             │
            └──────────────────────────────┘
                    |
            [Locks released, work complete]
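The timeline above reduces to a few lines of coordinator logic. A minimal sketch in Python (participants and the log are stand-ins, not CICS interfaces):

```python
class Participant:
    """Stand-in resource manager that votes in phase 1."""
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "in-flight"
    def prepare(self):
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit
    def commit(self):   self.state = "committed"
    def rollback(self): self.state = "backed-out"

def syncpoint(participants, log):
    votes = [p.prepare() for p in participants]   # Phase 1: PREPARE round
    if all(votes):
        log.append("COMMIT")      # the commit point: on the log means committed
        for p in participants:
            p.commit()            # Phase 2: apply changes, release locks
        return "committed"
    for p in participants:
        p.rollback()              # any NO vote: everyone backs out
    return "backed out"
```

Note the asymmetry: the outcome is decided by the single log append, not by the phase-2 messages that follow it.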

💡 INSIGHT — The key insight of two-phase commit is that the decision to commit is separated from the act of committing. Phase 1 ensures everyone can commit. Between phase 1 and phase 2, there is a brief window — the "window of doubt" — where participants have agreed to commit but the commit record hasn't been written yet. If a failure occurs in this window, you have an indoubt transaction. This window is typically microseconds. But it exists, and at 500 million transactions per day, microseconds add up.

CICS as the XA Coordinator

CICS plays a specific role in XA: it is the coordinator (sometimes called the transaction manager or sync-point manager). The resource managers — DB2, MQ, VSAM RLS — are the participants.

This distinction matters because the coordinator's log is the source of truth. If there's a disagreement during recovery about whether a transaction was committed or backed out, the coordinator's log wins. If the coordinator's log says "committed," every participant must commit. If the coordinator's log says "backed out" or has no record of the transaction, every participant must back out.

The CICS-DB2 attachment facility implements the XA interface between CICS and DB2. When you configure the CICS-DB2 attachment (DB2CONN resource definition), you establish the thread management and two-phase commit coordination:

DEFINE DB2CONN(CNBDB2)
  GROUP(CNBDB2G)
  DB2GROUPID(DB2P)
  MSGQUEUE1(CSSL)
  NONTERMREL(YES)
  RESYNCMEMBER(GROUPRESYNC)
  STANDBYMODE(RECONNECT)
  THREADERROR(N906D)
  THREADLIMIT(150)
  THREADWAIT(YES)
  ACCOUNTREC(TASK)
  AUTHTYPE(USERID)
  PLAN(CNBPLAN1)
  PRIORITY(HIGH)
  COMAUTHTYPE(USERID)
  STATSQUEUE(NONE)
  TCBLIMIT(200)

The critical parameter here is RESYNCMEMBER(GROUPRESYNC). This tells CICS that during recovery, it should attempt to resynchronize with any available member of the DB2 data sharing group — not just the specific DB2 member it was connected to before the failure. In a Sysplex, this means CICS can resolve indoubt transactions even if the original DB2 member is still down, as long as another member of the data sharing group is available.

Two-Phase Commit Across CICS Regions (MRO)

When a transaction spans multiple CICS regions via MRO, the region that owns the transaction (usually the AOR where the application program runs) is the coordinator. The remote regions (FOR, other AORs reached via DPL) are participants.

The coordinator extends the two-phase commit across MRO connections:

  1. Application in AOR1 issues EXEC CICS SYNCPOINT
  2. AOR1's recovery manager sends PREPARE to FOR1 (via MRO) and to DB2 (via attachment)
  3. FOR1 prepares its VSAM changes and responds YES
  4. DB2 prepares its SQL changes and responds YES
  5. AOR1 writes its commit record
  6. AOR1 sends COMMIT to FOR1 and DB2

If FOR1 or DB2 responds NO to PREPARE, AOR1 sends ROLLBACK to all participants. The entire distributed transaction is backed out atomically.

⚠️ PRODUCTION RULE — In a multi-region MRO topology, always ensure that the coordinating region (the AOR) has its system log on a coupling facility log stream. If the AOR fails and its system log is on DASD, recovery requires that specific DASD to be accessible. If the DASD is damaged, the indoubt transactions cannot be resolved automatically. Coupling facility log streams survive LPAR failures, making automatic recovery possible.

Two-Phase Commit with MQ

IBM MQ participates in CICS two-phase commit as a resource manager. When your COBOL program executes MQPUT or MQGET within a CICS transaction, MQ registers as a participant in the unit of work. At syncpoint, CICS coordinates the commit across DB2, VSAM, and MQ.

This creates a powerful but complex interaction. Consider a funds transfer that also sends a notification message:

      * Within a CICS transaction:
      * 1. Debit source account (DB2)
      * 2. Credit destination account (DB2)
      * 3. Send notification to message queue (MQ)
      * 4. Write audit journal (VSAM)
      * 5. SYNCPOINT — all four resources commit or all back out

           EXEC SQL
               UPDATE ACCOUNTS
               SET BALANCE = BALANCE - :WS-AMOUNT
               WHERE ACCOUNT_NO = :WS-SOURCE-ACCT
           END-EXEC

           EXEC SQL
               UPDATE ACCOUNTS
               SET BALANCE = BALANCE + :WS-AMOUNT
               WHERE ACCOUNT_NO = :WS-DEST-ACCT
           END-EXEC

           CALL 'MQPUT' USING WS-MQ-HCONN
                                WS-MQ-HOBJ
                                WS-MQ-MD
                                WS-MQ-PMO
                                WS-MSG-LENGTH
                                WS-NOTIFY-MSG
                                WS-MQ-CC
                                WS-MQ-RC

           EXEC CICS WRITE JOURNALNAME('DFHJ01')
                JTYPEID('TX')
                FROM(WS-AUDIT-RECORD)
                FLENGTH(WS-AUDIT-LENGTH)
                WAIT
           END-EXEC

           EXEC CICS SYNCPOINT
                RESP(WS-RESP)
           END-EXEC

At the SYNCPOINT, CICS coordinates phase 1 PREPARE to DB2, MQ, and the VSAM journal. All three must agree before the commit record is written and phase 2 COMMIT signals are sent.

The MQ integration is configured through the CICS-MQ adapter (controlled with the CKQC transaction), whose startup options are passed in the CSQCPARM initialization parameter:

INITPARM=(CSQCPARM='SN=CNBQ,TN=001,IQ=SYSTEM.CICS.INITQ')

Where SN is the MQ queue manager (subsystem) name, TN is the trace number, and IQ is the initiation queue. Recent CICS TS releases define this connection with an MQCONN resource definition instead.

🔗 CROSS-REFERENCE — Chapter 19 covers MQ queue design in depth, including persistent vs. non-persistent messages. For two-phase commit to work with MQ, messages must be persistent. Non-persistent messages are not recoverable — they're lost if MQ restarts. If your transaction puts a non-persistent message, MQ does not participate in 2PC for that message. This is sometimes acceptable (notifications, alerts) but never acceptable for transaction data.

Performance Implications of Two-Phase Commit

Two-phase commit is not free. Each syncpoint with N resource managers requires:

  • N PREPARE messages (and N responses)
  • 1 commit record write to the system log
  • N COMMIT messages

For a simple DB2-only transaction, 2PC adds approximately 0.1–0.3ms to the syncpoint. When MQ and VSAM RLS are added, the overhead increases to 0.5–1.0ms. For cross-region 2PC via MRO, add the MRO round-trip time (typically 0.1–0.5ms per region).

At CNB's volume (500M transactions/day), even 0.5ms per transaction is 69 CPU-hours per day. Kwame's team made a deliberate decision: transactions that only modify DB2 use the CICS-DB2 attachment's optimized single-phase commit protocol (no explicit PREPARE/COMMIT exchange — DB2 handles it internally). Only transactions that span multiple resource managers use full two-phase commit.

Single-Phase Commit (DB2 only):
  SYNCPOINT → DB2 commit → done
  Overhead: ~0.1ms

Two-Phase Commit (DB2 + MQ):
  SYNCPOINT → PREPARE(DB2) → PREPARE(MQ) → commit log → COMMIT(DB2) → COMMIT(MQ)
  Overhead: ~0.6ms

Two-Phase Commit (DB2 + MQ + MRO to FOR):
  SYNCPOINT → PREPARE(DB2) → PREPARE(MQ) → PREPARE(FOR/MRO) → commit log →
  COMMIT(DB2) → COMMIT(MQ) → COMMIT(FOR/MRO)
  Overhead: ~1.0ms
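The cost figures above scale linearly with volume, which is easy to check. A quick sketch (the 500M/day volume is from the text; the function is ours):

```python
def cpu_hours_per_day(tx_per_day: float, overhead_ms: float) -> float:
    """Daily CPU cost of per-transaction syncpoint overhead, in hours."""
    return tx_per_day * overhead_ms / 1000 / 3600

VOLUME = 500e6   # CNB's stated 500M transactions/day
```

At 0.5 ms of overhead this gives roughly 69 CPU-hours per day, matching Kwame's figure; the full three-way 2PC at 1.0 ms doubles it.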

💡 INSIGHT — The decision about which resource managers to include in a unit of work is an architectural decision with measurable performance impact. Every resource manager you add to the 2PC increases syncpoint cost linearly. Design your transactions to include the minimum set of resource managers necessary for data integrity. If a journal write can be deferred to a subsequent transaction (because you have other recovery mechanisms), consider removing it from the 2PC scope.


18.4 Indoubt Resolution

This is the section that separates theory from reality. Two-phase commit works beautifully when all components are running. The hard part — the part that wakes you up at 2 AM — is what happens when the coordinator fails between phase 1 and phase 2.

What "Indoubt" Means

An indoubt transaction is one where:

  1. Phase 1 (PREPARE) completed successfully — all resource managers said YES
  2. The coordinator (CICS) has not yet written the commit record to its system log
  3. The coordinator fails

At this point, the resource managers are in limbo. They've prepared their changes (written log records, acquired locks) but they don't know whether the coordinator decided to commit or back out. They can't commit unilaterally (the coordinator might have decided to back out). They can't back out unilaterally (the coordinator might have decided to commit). They are indoubt.

And here's the nasty part: while they're indoubt, they're holding locks. DB2 locks. MQ message locks. VSAM record locks. Those locks block other transactions that need access to the same data. The longer the indoubt state persists, the wider the impact.

The Indoubt Window

The indoubt window is the time between all participants responding YES to PREPARE and the coordinator writing the commit record. In normal operation, this window is microseconds — the time it takes to write a single log record.

But microseconds multiplied by millions of transactions per day means it will happen. Kwame's estimate: at CNB's volume, some transaction is inside the prepare-to-commit window during almost any given millisecond. Sooner or later a region failure lands on one of them, and you have an indoubt transaction.

In practice, the indoubt window is wider than microseconds because of log I/O. If the log stream is experiencing contention or the coupling facility is under load, the commit record write can take milliseconds. During those milliseconds, any failure creates an indoubt.
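How exposed a shop is to the window can be estimated with Little's law: the expected number of transactions inside the window at any instant is the arrival rate times the window width. Illustrative arithmetic (window widths assumed, not CNB measurements):

```python
def expected_in_window(tx_per_day: float, window_seconds: float) -> float:
    """Expected transactions inside the prepare-to-commit window at an instant."""
    tx_per_second = tx_per_day / 86_400
    return tx_per_second * window_seconds   # Little's law: L = lambda * W

# At 500M tx/day: a 10-microsecond window keeps ~0.06 transactions exposed at
# any instant; log contention stretching it to 2 ms raises that to ~11.6.
```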

Indoubt Resolution Mechanisms

CICS provides several mechanisms for resolving indoubt transactions:

Mechanism 1: Automatic Resolution via Emergency Restart

When CICS restarts after a failure, the recovery manager reads the system log. For each in-flight UOW:

  • If the commit record is on the log → the transaction was committed → send COMMIT to all participants
  • If no commit record exists → the transaction was not committed → send BACKOUT to all participants

This resolves most indoubt situations automatically. The CICS region restarts, reads its log, and tells every participant what to do. Simple, clean, automatic.

The critical dependency: the system log must be intact and accessible. If it's on a coupling facility log stream (as it should be), this is almost always the case — the coupling facility survives LPAR failures.

Mechanism 2: RESYNCMEMBER for DB2 Data Sharing

When CICS restarts and attempts to resolve indoubt transactions with DB2, it normally reconnects to the same DB2 member it was using before the failure. But if that DB2 member is also down, resolution is blocked.

The RESYNCMEMBER(GROUPRESYNC) parameter on the DB2CONN definition tells CICS to attempt resolution with any available member of the DB2 data sharing group. Since all members share the same data (via the coupling facility), any member can apply the commit or backout on behalf of the failed member.

This is essential for Sysplex environments. If both CICS and DB2 on the same LPAR fail (a common scenario — if the LPAR fails, everything on it fails), CICS on a different LPAR can restart and resolve indoubt transactions through a DB2 member on a surviving LPAR.

Mechanism 3: Manual Resolution

If automatic resolution fails (the system log is damaged, the DB2 data sharing group is entirely unavailable, or the MQ queue manager is down), an operator must manually resolve indoubt transactions.

On the CICS side, CEMT displays the unresolved units of work:

CEMT INQUIRE UOW
  Uow(0000000000A3F710) Tra(XFER) Ind

CEMT INQUIRE UOWLINK shows, for each unit of work, the links to the resource managers still waiting on the commit decision. On the DB2 side, the indoubt threads left behind by the failed region are visible with the DB2 command:

-DISPLAY THREAD(*) TYPE(INDOUBT)

The operator must then decide: commit or backout. This decision requires understanding the business context. Was the funds transfer supposed to go through? Is there evidence from the application log that the transfer was valid?

Once the decision is made, force the resolution from CICS:

CEMT SET UOW(0000000000A3F710) COMMIT

Or:

CEMT SET UOW(0000000000A3F710) BACKOUT

If the CICS log itself is gone and CICS can never deliver a decision, resolve directly at the participant instead. For DB2:

-RECOVER INDOUBT(CNBAORA1) ACTION(COMMIT) ID(correlation-id)

⚠️ PRODUCTION RULE — Manual indoubt resolution is the most dangerous operation in CICS recovery. A wrong decision (committing a transaction that should have been backed out, or vice versa) corrupts data. At CNB, manual indoubt resolution requires two-person authorization: the CICS system programmer determines the technical state, and a business operations manager confirms the business decision. No exceptions. No shortcuts. Not even at 2 AM.

Shunted Units of Work

When CICS cannot automatically resolve an indoubt UOW during restart (because a participant resource manager is unavailable), it shunts the UOW. A shunted UOW is set aside — it doesn't block the restart, and CICS becomes available for new work. But the resources locked by the shunted UOW remain locked.

You can view shunted UOWs with CEMT:

CEMT INQUIRE UOWENQ
  UOWID(0000000000A3F710)
  TRANSID(XFER)
  QUALIFIER(SHUNTED)
  ENQNAME(ACCT.LOCK.00047291)
  ENQTYPE(CICSENQ)
  STATUS(WAITING)

Shunted UOWs are CICS's way of saying: "I can't resolve this now, but I haven't forgotten about it." CICS periodically retries resolution (controlled by the PSTYPE and RMRETRY parameters). When the unavailable resource manager comes back online, CICS resolves the shunted UOW automatically.

The danger of shunted UOWs is lock holding. If a shunted UOW holds a lock on a critical DB2 row — say, the account balance row for a high-value customer — every transaction that tries to access that row will wait or timeout. The impact fans out quickly.

SIT Parameters for Shunted UOW Management:
  RMRETRY=60          Retry resolution every 60 seconds
  UOWNETQL=FORCE      Force UOW resolution on region shutdown

💡 INSIGHT — The most common production incident involving indoubt transactions is not the indoubt itself — it's the lock contention caused by the indoubt. A single shunted UOW holding a lock on a busy DB2 table can cause hundreds of transaction timeouts per second. The operational priority is not "resolve the indoubt" — it's "identify what's locked and assess the business impact." Sometimes the right answer is to FORCE the resolution (accepting potential data inconsistency) rather than allowing the lock contention to cascade.

The Pinnacle Health Incident

Let me give you a concrete example. In 2022, Pinnacle Health Insurance experienced a CICS AOR failure during their busiest period — open enrollment. The AOR was processing eligibility verification transactions, each of which involved:

  1. A DB2 read of the member's eligibility record
  2. A DB2 update of the verification timestamp
  3. An MQ put to the claims notification queue
  4. A VSAM write to the audit journal

When the AOR failed, 23 transactions were in the indoubt state — phase 1 complete, phase 2 not started. The AOR restarted via emergency restart within 90 seconds. Of the 23 indoubt transactions, 21 were resolved automatically — the system log had enough information, and DB2 and MQ were available.

Two transactions could not be resolved because the MQ queue manager on the same LPAR had also restarted and its log was not yet available. Those two transactions were shunted. The shunted UOWs held DB2 locks on two member eligibility records.

For 7 minutes — until the MQ queue manager completed its own recovery and CICS's retry loop resolved the shunted UOWs — any eligibility verification for those two members timed out. Ahmad Rashidi's compliance team logged it as a service availability incident, and the post-incident review led to two changes:

  1. RESYNCMEMBER(GROUPRESYNC) was configured for the MQ connection (allowing resolution via an MQ queue manager on a different LPAR)
  2. RMRETRY was reduced from 300 seconds (the default) to 30 seconds

Diane Okoye's observation: "Seven minutes of lock contention for two records sounds minor. But one of those records was a patient in the middle of a cancer treatment pre-authorization. Her oncologist couldn't verify eligibility. Seven minutes felt like an eternity."


18.5 Region Recovery: Cold, Warm, and Emergency Start

When a CICS region starts, it must determine what state to initialize from. The startup type determines how much of the previous execution is carried forward — and how much is lost.

Cold Start

A cold start initializes CICS from scratch. The system log is empty. No recovery occurs. All in-flight transactions from the previous execution are lost. All recoverable resources are in whatever state the resource managers left them in.

When to cold start:

  • Initial CICS installation (no previous state exists)
  • After a deliberate decision to abandon all previous state (rare)
  • After catastrophic log damage where emergency restart is impossible

When NOT to cold start:

  • After a region failure (use emergency restart)
  • After a planned shutdown (use warm start)
  • Ever, in production, unless you have no other option

SIT Override for Cold Start:
  START=COLD

⚠️ PRODUCTION RULE — A cold start in production is a data integrity event. If there were in-flight transactions modifying recoverable resources, those resources may be in an inconsistent state. DB2 tables may have partial updates. VSAM files may have uncommitted records. After a cold start, you must verify data integrity across every recoverable resource. At CNB, a cold start triggers a mandatory data reconciliation procedure that takes 4–6 hours. They have cold-started exactly twice in the last decade, both times due to coupling facility failures that destroyed the system log.

Warm Start

A warm start initializes CICS from a controlled shutdown's checkpoint. When you shut down CICS with CEMT PERFORM SHUTDOWN (normal shutdown), CICS writes a final keypoint to the system log, completes all in-flight transactions (or backs them out), and records the clean shutdown state.

On warm start, CICS reads the shutdown keypoint and restores the region to its pre-shutdown state: open files, installed resources, started transactions, everything. No recovery is needed because the shutdown was clean.

SIT Override for Warm Start:
  START=AUTO     (AUTO attempts warm start, falls back to emergency)

Emergency Restart

Emergency restart is the workhorse of CICS recovery. It's what happens when CICS starts after an abnormal termination — a crash, a cancel, a system failure. CICS reads the system log, identifies all in-flight work, and drives recovery:

  1. Log scan. CICS reads the system log backward from the end to the most recent activity keypoint. This identifies every UOW that was active at the time of failure.

  2. Classification. Each UOW is classified:
     • Committed: Commit record found on the log. No action needed — changes were already applied.
     • In-flight: Started but no commit or backout record. Must be backed out.
     • Indoubt: Prepared (phase 1 complete) but no commit record. Must be resolved (see 18.4).

  3. Backout. For each in-flight UOW, CICS coordinates backout with all participating resource managers. DB2 rolls back its changes. VSAM restores before-images. MQ removes uncommitted puts and restores uncommitted gets.

  4. Indoubt resolution. For each indoubt UOW, CICS attempts automatic resolution. If the resource manager is available, resolution proceeds. If not, the UOW is shunted.

  5. Resource reopen. CICS reopens files, restarts transient data triggers, re-establishes DB2 and MQ connections.

  6. Ready for work. The region begins accepting new transactions.
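
The classification rule in step 2 can be sketched in a few lines. This Python fragment is a platform-neutral illustration (the record kinds and function name are invented for the sketch), not actual CICS internals:

```python
# Decision rule applied to each unit of work found during the backward
# log scan. A real system log carries far more detail than these kinds.

def classify_uow(records):
    """records: set of log-record kinds seen for one UOW,
    e.g. {"BEGIN", "PREPARE", "COMMIT"}."""
    if "COMMIT" in records or "BACKOUT" in records:
        return "completed"   # outcome already on the log; no action needed
    if "PREPARE" in records:
        return "indoubt"     # phase 1 done, outcome unknown: resolve or shunt
    return "in-flight"       # no prepare record: must be backed out

assert classify_uow({"BEGIN", "PREPARE", "COMMIT"}) == "completed"
assert classify_uow({"BEGIN", "PREPARE"}) == "indoubt"
assert classify_uow({"BEGIN"}) == "in-flight"
```

The ordering of the checks matters: a UOW with both PREPARE and COMMIT records is complete, so the commit/backout test must come first.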

The elapsed time for emergency restart depends on the volume of log records to process. At CNB, with KEYINTV=60 and typical AOR workload, emergency restart completes in 30–90 seconds. The dominant factor is DB2 thread re-establishment (the CICS-DB2 attachment must reconnect and resolve each thread).

SIT Overrides for Emergency Restart Optimization:
  START=AUTO            Attempt warm start; fall back to emergency
  KEYINTV=60            Activity keypoints every 60 seconds
  RMRETRY=30            Retry shunted UOW resolution every 30 seconds
  DUMPDS=AUTO           Auto-switch dump datasets on restart
  OFFSITE=NO            Don't wait for offsite backup confirmation
  PARMERR=INTERACT      Interact with operator on parameter errors
  GRPLIST=(CNBAOR,CNBCOMM,CNBREC)
  AUTCONN=0             Autoconnect sessions immediately on restart

Auto-Restart

CICS auto-restart is the mechanism that automatically restarts a CICS region after an abnormal termination. Without auto-restart, a region failure requires operator intervention to restart it — someone must submit the CICS startup JCL or issue a START command.

Auto-restart is implemented through z/OS Automatic Restart Management (ARM). When a CICS region registers with ARM at startup, z/OS monitors the region's health. If the region fails, ARM restarts it automatically according to a restart policy.

ARM Policy for CICS AOR:
  ELEMENT(CNBAORA1)
    TYPE(CICS)
    RESTART_GROUP(CNBCICS)
    RESTART_ATTEMPTS(3)
    RESTART_INTERVAL(600)
    RESTART_TIMEOUT(120)
    RESTART_METHOD(STC)

Key ARM parameters:

  • RESTART_ATTEMPTS(3): Try up to 3 restarts within the interval. If all 3 fail, give up and alert the operator. This prevents restart loops where a persistent defect causes repeated failures.
  • RESTART_INTERVAL(600): The 3-attempt counter resets after 600 seconds. This allows for transient failures (which resolve after one restart) while preventing restart loops from persistent failures.
  • RESTART_TIMEOUT(120): If the restart doesn't complete within 120 seconds, ARM considers it failed and counts it against the attempt limit.
  • RESTART_METHOD(STC): Restart as a started task (the standard CICS startup method).
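
The attempt-counting behavior described above can be modeled as a sliding window: restart as long as fewer than RESTART_ATTEMPTS failures have landed inside RESTART_INTERVAL. This Python sketch is an approximation of the policy semantics (class and method names are invented; the real accounting lives inside z/OS ARM):

```python
import time

class RestartPolicy:
    """Sliding-window approximation of ARM restart accounting."""

    def __init__(self, attempts=3, interval=600):
        self.attempts = attempts
        self.interval = interval
        self.failures = []   # timestamps of recent failures

    def should_restart(self, now=None):
        now = time.time() if now is None else now
        # failures older than the interval age out of the window
        self.failures = [t for t in self.failures if now - t < self.interval]
        if len(self.failures) >= self.attempts:
            return False     # give up; alert the operator instead
        self.failures.append(now)
        return True

p = RestartPolicy(attempts=3, interval=600)
assert [p.should_restart(now=t) for t in (0, 10, 20)] == [True, True, True]
assert p.should_restart(now=30) is False    # 3 failures inside 600s: stop
assert p.should_restart(now=700) is True    # early failures aged out
```

This captures the design intent of L593–L595: transient failures (one restart, then a long quiet period) keep restarting forever; persistent failures exhaust the window and stop.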

💡 INSIGHT — ARM is the first layer of automatic recovery, and it's the simplest. It answers the question "is the region running?" If not, start it. But ARM doesn't know why the region failed. If the failure was caused by a transient condition (a brief storage spike, a one-time data exception), ARM's restart will succeed and the region will resume normal operation. If the failure was caused by a persistent condition (a corrupt load module, a misconfigured SIT parameter), ARM will restart the region, the region will immediately fail again, and ARM will eventually stop trying after RESTART_ATTEMPTS. This is by design — ARM prevents restart loops, but it cannot fix the underlying problem.

CICSPlex SM and Region Health

CICSPlex SM adds a management layer above ARM. While ARM handles the basic restart, CPSM handles the topology-level response:

  1. Detection. CPSM detects the region failure within seconds through its heartbeat monitoring.
  2. Routing update. CPSM removes the failed region from the routing table. New transactions are not routed to a dead region.
  3. Workload redistribution. CPSM adjusts routing weights for surviving regions to absorb the failed region's workload.
  4. Alerting. CPSM sends WTO messages, SNMP traps, or email notifications to the operations team.
  5. Health verification. After ARM restarts the region, CPSM verifies that the region is healthy (accepting transactions, meeting WLM goals) before adding it back to the routing table.

The CPSM health check is critical. ARM considers a restart "successful" when the CICS address space is running. But CICS can be running without being healthy — the DB2 connection might not re-establish, MQ might not reconnect, a critical file might not reopen. CPSM checks actual transaction processing capability before routing work to the region.

CICSPlex SM Health Check Definition:
  DEFINE WLMSPEC(CNBWL01)
    TRANGRP(CNBCORE)
    ALGORITHM(GOAL)
    HEALTHCHK(YES)
    HEALTHTRAN(HCHK)
    HEALTHINT(15)
    ABENDCRIT(3)
    ABENDWARN(1)
    ABENDTIME(60)

HEALTHCHK(YES) enables CPSM to run a health-check transaction (HCHK) against the region every HEALTHINT seconds. The health-check transaction is a lightweight program that verifies DB2 connectivity, MQ connectivity, and critical file availability. If the health check fails, CPSM removes the region from routing even though the region is technically running.
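
The routing decision CPSM derives from these checks reduces to a simple predicate: route work only to regions that are both running and passing their health check. A minimal sketch (region names follow the CNB convention; the data structure is invented for illustration):

```python
# Route only to regions that are up AND healthy. "Running" comes from
# ARM's view of the address space; "healthy" from the health-check
# transaction (DB2/MQ connectivity, critical files open).

def routable(regions):
    """regions: dict name -> {"running": bool, "healthy": bool}."""
    return [name for name, s in regions.items()
            if s["running"] and s["healthy"]]

regions = {
    "CNBAORA1": {"running": True,  "healthy": True},
    "CNBAORA2": {"running": True,  "healthy": False},  # up, DB2 conn down
    "CNBAORA3": {"running": False, "healthy": False},  # ARM restart pending
}
assert routable(regions) == ["CNBAORA1"]
```

The second region is the important case: ARM considers it recovered, but CPSM keeps it out of the routing table until the health check passes.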


18.6 Designing for Automatic Recovery

Everything in sections 18.2–18.5 is about recovering the platform — getting CICS regions back online, resolving indoubt transactions, re-establishing resource manager connections. But platform recovery is only half the story. The other half is application-level recovery — ensuring that the business transactions themselves recover correctly.

Idempotent Transaction Design

An idempotent transaction is one that produces the same result whether executed once or multiple times. In the context of recovery, idempotency means that if a transaction is retried after a failure, it doesn't produce duplicate results.

Consider a simple funds transfer. If the customer submits a $500 transfer, CICS processes it, but the response is lost due to a communication failure, the customer might retry. If the transfer is not designed for idempotency, the customer gets debited $1,000.

Idempotent design requires a unique transaction identifier:

       WORKING-STORAGE SECTION.
       01  WS-TRANSFER-REQUEST.
           05  WS-REQUEST-ID        PIC X(32).
           05  WS-SOURCE-ACCT       PIC X(12).
           05  WS-DEST-ACCT         PIC X(12).
           05  WS-AMOUNT            PIC S9(13)V99 COMP-3.
           05  WS-REQUEST-TIMESTAMP PIC X(26).

       01  WS-DUPLICATE-FLAG        PIC X VALUE 'N'.

       PROCEDURE DIVISION.
       A000-MAIN-LOGIC.
      *    Check if this request has already been processed
           EXEC SQL
               SELECT 'Y' INTO :WS-DUPLICATE-FLAG
               FROM TRANSFER_AUDIT
               WHERE REQUEST_ID = :WS-REQUEST-ID
           END-EXEC

           EVALUATE SQLCODE
               WHEN 0
      *            Duplicate request — return previous result
                   PERFORM B000-RETURN-PREVIOUS-RESULT
               WHEN +100
      *            New request — process the transfer
                   PERFORM C000-PROCESS-TRANSFER
               WHEN OTHER
                   PERFORM Z000-SQL-ERROR
           END-EVALUATE
            EXEC CICS RETURN
            END-EXEC.

       C000-PROCESS-TRANSFER.
      *    Debit source account
           EXEC SQL
               UPDATE ACCOUNTS
               SET BALANCE = BALANCE - :WS-AMOUNT
               WHERE ACCOUNT_NO = :WS-SOURCE-ACCT
                 AND BALANCE >= :WS-AMOUNT
           END-EXEC

           IF SQLCODE NOT = 0
               PERFORM Z100-INSUFFICIENT-FUNDS
           END-IF

      *    Credit destination account
           EXEC SQL
               UPDATE ACCOUNTS
               SET BALANCE = BALANCE + :WS-AMOUNT
               WHERE ACCOUNT_NO = :WS-DEST-ACCT
           END-EXEC

      *    Record in audit table (within same UOW)
           EXEC SQL
               INSERT INTO TRANSFER_AUDIT
               (REQUEST_ID, SOURCE_ACCT, DEST_ACCT, AMOUNT,
                PROCESS_TIMESTAMP, STATUS)
               VALUES
               (:WS-REQUEST-ID, :WS-SOURCE-ACCT, :WS-DEST-ACCT,
                :WS-AMOUNT, CURRENT TIMESTAMP, 'COMPLETED')
           END-EXEC

           EXEC CICS SYNCPOINT
                RESP(WS-RESP)
           END-EXEC
           .

The key: the duplicate check and the audit insert are in the same unit of work as the business logic. The TRANSFER_AUDIT insert commits atomically with the balance updates. If the transaction abends or the region fails, both the audit record and the balance updates are backed out together. No orphaned audit records. No unrecorded transfers.

💡 INSIGHT — Idempotency is not just about preventing duplicates. It's about making retry safe. If your transaction is idempotent, your recovery architecture can be simpler: retry failed transactions without worrying about side effects. If your transaction is NOT idempotent, your recovery architecture must include deduplication logic, compensating transactions, or manual reconciliation. Idempotency pays for itself in reduced recovery complexity.

Transaction Retry Logic

When a CICS transaction fails, the question is: should it be retried automatically?

Not all failures are retryable. An ASRA (data exception) caused by a programming error will fail every time you retry it. An APCT (program not found or disabled) won't magically find the program on the second attempt. But some failures are transient: a temporary DB2 lock timeout, a momentary MQ connection failure, a brief storage shortage. These failures may succeed on retry.

CICS provides automatic transaction restart through the RESTART attribute on the transaction definition:

DEFINE TRANSACTION(XFER)
  GROUP(CNBTXN)
  PROGRAM(CNBXFER)
  RESTART(YES)
  RESTARTCOUNT(2)
  DTIMOUT(30)
  TCLASS(TC01)
  TASKDATALOC(ANY)
  TASKDATAKEY(USER)
  PRIORITY(200)

RESTART(YES) tells CICS that if the transaction abends, CICS should automatically restart it. RESTARTCOUNT(2) limits the restarts to 2 — if the transaction fails 3 times, CICS gives up and reports the failure.

But RESTART has limitations:

  • It only works for terminal-initiated transactions (not START-ed tasks or DPL calls)
  • It restarts the entire transaction from the beginning (no checkpoint/resume)
  • It doesn't distinguish retryable from non-retryable abends

For production systems, application-level retry logic gives you more control:

       01  WS-RETRY-COUNT           PIC 9(2) VALUE 0.
       01  WS-MAX-RETRIES           PIC 9(2) VALUE 3.
       01  WS-RETRY-DELAY-MS        PIC 9(5) VALUE 500.
       01  WS-RETRYABLE-ABEND       PIC X VALUE 'N'.

        A000-MAIN-LOGIC.
            PERFORM C000-PROCESS-TRANSFER
            PERFORM D000-CLASSIFY-ERROR
            PERFORM UNTIL WS-RESP = DFHRESP(NORMAL)
                       OR WS-RETRYABLE-ABEND = 'N'
                       OR WS-RETRY-COUNT >= WS-MAX-RETRIES
                ADD 1 TO WS-RETRY-COUNT
      *        Exponential backoff: 500ms, 1000ms, 2000ms
                COMPUTE WS-RETRY-DELAY-MS =
                    500 * (2 ** (WS-RETRY-COUNT - 1))
                EXEC CICS DELAY
                     MILLISECS(WS-RETRY-DELAY-MS)
                END-EXEC
                PERFORM C000-PROCESS-TRANSFER
                PERFORM D000-CLASSIFY-ERROR
            END-PERFORM
            .

       D000-CLASSIFY-ERROR.
      *    Determine if the error is retryable
           EVALUATE TRUE
               WHEN WS-RESP = DFHRESP(LOCKED)
      *            Resource locked (e.g. by a shunted UOW): retryable
                   MOVE 'Y' TO WS-RETRYABLE-ABEND
               WHEN WS-RESP = DFHRESP(SYSIDERR)
      *            Remote system unavailable — might recover
                   MOVE 'Y' TO WS-RETRYABLE-ABEND
               WHEN WS-RESP = DFHRESP(NOTAUTH)
      *            Security failure — not retryable
                   MOVE 'N' TO WS-RETRYABLE-ABEND
               WHEN WS-RESP = DFHRESP(PGMIDERR)
      *            Program not found — not retryable
                   MOVE 'N' TO WS-RETRYABLE-ABEND
               WHEN OTHER
      *            Unknown — assume not retryable
                   MOVE 'N' TO WS-RETRYABLE-ABEND
           END-EVALUATE
           .

⚠️ PRODUCTION RULE — Never retry without exponential backoff. If a failure is caused by resource contention (DB2 lock timeouts, MQ queue full), immediate retry adds to the contention. Exponential backoff (500ms, 1s, 2s, 4s...) gives the system time to resolve the underlying contention. Also cap your retry count. An uncapped retry loop on a non-transient failure will consume CICS task slots and amplify the problem. CNB caps all retries at 3 attempts with exponential backoff starting at 500ms.
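
The backoff schedule in the rule above is easy to express directly. This Python sketch adds a small random jitter, a common refinement not shown in the COBOL, so that many tasks failing at the same instant don't all retry in lockstep (the function name and jitter range are illustrative):

```python
import random

# Capped exponential backoff matching the schedule in the text
# (500 ms, 1 s, 2 s), plus a small jitter.

def backoff_ms(attempt, base_ms=500, max_attempts=3):
    """attempt is 1-based; returns None when retries are exhausted."""
    if attempt > max_attempts:
        return None                       # cap: give up and report
    delay = base_ms * 2 ** (attempt - 1)  # 500, 1000, 2000, ...
    return delay + random.randint(0, base_ms // 10)

assert backoff_ms(4) is None
assert 500 <= backoff_ms(1) <= 550
assert 2000 <= backoff_ms(3) <= 2050
```

The cap is as important as the curve: an uncapped loop turns one non-transient failure into many occupied task slots.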

Compensating Transactions

Some failures cannot be simply retried. When a multi-step business process fails partway through, and the completed steps cannot be rolled back through the normal 2PC mechanism (because they've already committed), you need a compensating transaction.

Example: A CNB customer initiates an international wire transfer. The process involves:

  1. Debit the customer's account (CICS/DB2 — committed)
  2. Send SWIFT message to correspondent bank (external system — sent)
  3. Record the transfer in the regulatory reporting system (CICS/DB2 — fails)

Step 3 fails, but steps 1 and 2 have already committed (they were in separate units of work, as required by the external SWIFT interface). You can't uncommit the SWIFT message. You need a compensating transaction:

       E000-COMPENSATE-FAILED-TRANSFER.
      *    Reverse the debit from step 1
           EXEC SQL
               UPDATE ACCOUNTS
               SET BALANCE = BALANCE + :WS-AMOUNT
               WHERE ACCOUNT_NO = :WS-SOURCE-ACCT
           END-EXEC

      *    Send SWIFT cancellation message for step 2
           PERFORM F000-SEND-SWIFT-CANCEL

      *    Record the compensation in audit trail
           EXEC SQL
               INSERT INTO TRANSFER_AUDIT
               (REQUEST_ID, SOURCE_ACCT, DEST_ACCT, AMOUNT,
                PROCESS_TIMESTAMP, STATUS, COMP_REASON)
               VALUES
               (:WS-REQUEST-ID, :WS-SOURCE-ACCT, :WS-DEST-ACCT,
                :WS-AMOUNT, CURRENT TIMESTAMP, 'COMPENSATED',
                'REG_REPORT_FAILURE')
           END-EXEC

           EXEC CICS SYNCPOINT
                RESP(WS-RESP)
           END-EXEC
           .

Compensating transactions are inherently more complex than simple retries. They must:

  1. Be idempotent — if the compensation itself fails and is retried, it must not double-compensate
  2. Be auditable — every compensation must be logged with the reason, the original transaction, and the compensation details
  3. Handle the compensation failure — what happens if the SWIFT cancellation fails? You need a secondary compensation mechanism (manual intervention, escalation queue)
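
Requirement 1 can be sketched concretely: the compensation records its own execution in the same unit of work that reverses the debit, so a retried compensation becomes a no-op. This Python fragment uses an in-memory set as a stand-in for the TRANSFER_AUDIT table (all names are illustrative):

```python
# Idempotent compensation: the guard and the reversal commit together,
# so retrying a completed compensation does nothing.

compensated = set()          # stand-in for a COMPENSATED audit row
balances = {"SRC": 500.0}

def compensate(request_id, account, amount):
    if request_id in compensated:
        return "already-compensated"   # safe to retry
    balances[account] += amount        # reverse the original debit
    compensated.add(request_id)        # audit record, same unit of work
    return "compensated"

assert compensate("REQ-1", "SRC", 100.0) == "compensated"
assert compensate("REQ-1", "SRC", 100.0) == "already-compensated"
assert balances["SRC"] == 600.0
```

In the real system the guard is a SELECT against TRANSFER_AUDIT and the two updates sit inside one SYNCPOINT scope, exactly as in the idempotent transfer earlier in this section.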

🔗 SPACED REVIEW — Chapter 13 — Recall from Chapter 13 that MRO connections create cross-region dependencies. In a multi-region topology, a compensating transaction might need to execute across multiple regions — compensating a DB2 update in one AOR, a VSAM update in a FOR, and an MQ message in a third. The compensating transaction itself is a distributed transaction that requires 2PC coordination. Design your compensation logic to be as simple as possible — ideally a single-region, single-resource-manager operation.

Recovery Queues and Dead-Letter Handling

For asynchronous processing (MQ-triggered transactions, START-ed tasks), recovery requires a queue-based approach. When an asynchronous transaction fails, the message that triggered it must not be lost. The standard pattern is:

  1. Retry queue. Failed messages are moved to a retry queue with a retry count header. A triggered program reads the retry queue after a delay and resubmits the message.

  2. Dead-letter queue. Messages that exceed the retry limit are moved to a dead-letter queue for manual investigation. The dead-letter queue is monitored 24/7.

  3. Poison message handling. Messages that cause transaction abends (malformed data, invalid business rules) are detected and removed from the processing queue immediately — they'll never succeed regardless of retries.

       G000-HANDLE-MQ-FAILURE.
           ADD 1 TO WS-RETRY-COUNT
           IF WS-RETRY-COUNT > WS-MAX-RETRIES
      *        Exceeded retry limit — send to dead-letter queue
               MOVE 'CNBQ.XFER.DEADLETTER' TO WS-DLQ-NAME
               PERFORM H000-PUT-TO-QUEUE
               PERFORM H100-ALERT-OPERATIONS
           ELSE
      *        Put back on retry queue with incremented count
               MOVE WS-RETRY-COUNT TO WS-MSG-RETRY-HDR
               MOVE 'CNBQ.XFER.RETRY' TO WS-RETRY-QUEUE-NAME
               PERFORM H000-PUT-TO-QUEUE
           END-IF
           .
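
The routing in the COBOL fragment, combined with the poison-message rule from the numbered list, reduces to a three-way decision. A Python sketch (queue names mirror the fragment above; the poison flag is assumed to come from abend detection upstream):

```python
# Three-way routing for a failed asynchronous message:
# poison -> dead-letter immediately; retries exhausted -> dead-letter;
# otherwise -> retry queue with an incremented count.

MAX_RETRIES = 3

def route_failed_message(retry_count, poison):
    if poison:
        return "CNBQ.XFER.DEADLETTER"   # will never succeed; don't retry
    if retry_count >= MAX_RETRIES:
        return "CNBQ.XFER.DEADLETTER"   # retries exhausted
    return "CNBQ.XFER.RETRY"

assert route_failed_message(0, poison=True) == "CNBQ.XFER.DEADLETTER"
assert route_failed_message(1, poison=False) == "CNBQ.XFER.RETRY"
assert route_failed_message(3, poison=False) == "CNBQ.XFER.DEADLETTER"
```

The poison short-circuit is the piece the COBOL fragment omits: a malformed message routed back to the retry queue will burn all its retries and still end up in the dead-letter queue, just later and at higher cost.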

18.7 Testing Recovery

Here's an uncomfortable truth: most shops never test their recovery procedures. They design them, document them, and pray they never need them. Then a failure occurs, and the procedure doesn't work because the log stream configuration has drifted, or the ARM policy hasn't been updated since the last CICS upgrade, or the DFHRMUTL JCL references a dataset that was renamed six months ago.

Recovery testing is not optional. It's as critical as functional testing.

Controlled Failure Injection

Testing recovery means deliberately causing failures and verifying that recovery works correctly. This must be done in a test environment that mirrors production configuration — same SIT parameters, same ARM policies, same DB2CONN definitions, same log stream configurations.

Test 1: Transaction Abend Recovery

Inject a transaction abend by coding a deliberate data exception or by using CEDF to force an abend.

Verification:

  • The transaction is backed out
  • DB2 changes from the failed transaction are rolled back
  • VSAM changes from the failed transaction are restored
  • The abend is logged to the CICS dump dataset
  • Operations monitoring detects the abend

Test 2: Region Failure Recovery

Kill the CICS region with CANCEL or by deliberately corrupting a kernel structure (in test only).

Verification:

  • ARM detects the failure and initiates restart
  • Emergency restart completes within the expected timeframe
  • In-flight transactions are backed out
  • DB2 locks from in-flight transactions are released
  • MRO connections re-establish
  • CICSPlex SM removes and then re-adds the region to routing
  • New transactions are processed successfully after recovery

Test 3: Indoubt Transaction Resolution

This is the hardest test to construct because you need to create a failure during the indoubt window — between phase 1 PREPARE completion and phase 2 COMMIT.

One approach: inject a delay in the commit path using a CICS global user exit in the resource manager interface path (XRMIIN, driven on entry to RMI requests). During the delay, kill the region. The transaction will be in the prepared-but-not-committed state.

Verification:

  • Emergency restart detects the indoubt UOW
  • If the participant resource manager is available, resolution is automatic
  • If the participant resource manager is unavailable, the UOW is shunted
  • Shunted UOW retry resolves when the participant becomes available
  • Locked resources are released after resolution
  • Business data is consistent (no partial updates)

Test 4: Auto-Restart Loop Detection

Kill the CICS region immediately after restart (simulating a persistent failure condition).

Verification:

  • ARM counts restart attempts correctly
  • After RESTART_ATTEMPTS failures, ARM stops trying and alerts operations
  • Operations receives the alert with sufficient information to diagnose the problem

Test 5: Multi-Region Recovery Cascade

In a multi-region topology, kill the coordinating AOR while it has active MRO connections to a FOR and distributed transactions to DB2.

Verification:

  • FOR handles the MRO connection failure gracefully (no FOR failure cascade)
  • DB2 handles the thread failure (locks held, waiting for CICS resolution)
  • When the AOR restarts, it resolves distributed UOWs across MRO and DB2
  • The FOR's VSAM resources are consistent after resolution
  • DB2's data is consistent after resolution

Recovery Validation Checklist

After every recovery test, validate with this checklist:

RECOVERY VALIDATION CHECKLIST
================================
Region: _____________  Date: _____________  Tester: _____________

PLATFORM RECOVERY
[ ] ARM detected failure within _____ seconds (target: <10s)
[ ] Emergency restart completed in _____ seconds (target: <120s)
[ ] DB2 reconnection successful
[ ] MQ reconnection successful
[ ] MRO connections re-established
[ ] CICSPlex SM updated routing table
[ ] Health check transaction passed
[ ] Region accepting new transactions

TRANSACTION RECOVERY
[ ] In-flight transactions backed out: _____
[ ] Indoubt transactions resolved: _____ (commit: _____, backout: _____)
[ ] Shunted UOWs: _____ (resolved: _____, pending: _____)
[ ] DB2 locks released after resolution
[ ] MQ messages consistent (no duplicates, no lost messages)
[ ] VSAM files consistent (VERIFY run clean)

DATA INTEGRITY
[ ] Account balances reconcile (sum of credits = sum of debits)
[ ] Audit trail complete (no gaps in sequence numbers)
[ ] Journal records recoverable (test forward recovery from backup)
[ ] No orphaned lock entries in DB2 DISPLAY THREAD output

OPERATIONAL PROCEDURES
[ ] Alert sent to operations team within _____ seconds
[ ] Runbook steps match actual recovery sequence
[ ] Manual resolution procedure tested (DFHRMUTL)
[ ] Two-person authorization for manual indoubt resolution verified
[ ] Post-recovery data reconciliation procedure executed

Recovery Testing Schedule

At CNB, recovery testing follows this schedule:

Test                  Frequency            Environment      Duration
--------------------------------------------------------------------
Transaction abend     Weekly (automated)   Test             15 minutes
Region failure        Monthly              QA               2 hours
Indoubt resolution    Quarterly            Pre-production   4 hours
Multi-region cascade  Semi-annually        Pre-production   Full day
LPAR failure          Annually             DR test          2 days
Sysplex failure       Annually             DR test          3 days

The weekly transaction abend test is automated — a test harness submits transactions designed to abend, and a verification program checks that recovery occurred correctly. The monthly region failure test is semi-automated — an operator kills the region, and a verification script checks the recovery.

💡 INSIGHT — The most valuable outcome of recovery testing is not verifying that recovery works. It's discovering that your procedures are wrong. Every recovery test CNB runs uncovers at least one documentation error, one parameter drift, one assumption that is no longer valid. The test itself is the product, not just the result. This is why Kwame insists on fresh test plans for every recovery test — forcing the team to re-examine the procedures, not just re-execute them.

The Sandra Chen Perspective

At Federal Benefits Administration, Sandra Chen faces a unique challenge with recovery testing. FBA's CICS environment includes IMS-dependent transactions — programs that access IMS databases through the CICS-IMS Database Control interface. IMS adds another resource manager to the 2PC scope, with its own recovery log, its own indoubt resolution protocol, and its own restart procedure.

Sandra's recovery testing must validate that CICS emergency restart coordinates with IMS database recovery. When a CICS region fails with active IMS database transactions, IMS's Program Isolation lock manager holds locks on database segments until CICS resolves the UOW. The indoubt resolution between CICS and IMS is more complex than CICS-DB2 resolution because IMS uses a different log format and a different resynchronization protocol.

Marcus Whitfield's observation: "I've been doing IMS recovery for 30 years. The CICS-IMS recovery path has seven distinct failure points. I've seen six of them in production. The seventh one keeps me up at night because we've never tested it — the scenario where CICS and IMS both fail on different LPARs at the same time and the coupling facility has a transient error during the recovery sequence."

That seventh scenario is on Sandra's test plan for this year's DR exercise. Marcus will be there to guide the team through it. After his retirement, the procedure will exist only in the runbook he's writing — unless someone on Sandra's team internalizes it deeply enough to handle the unexpected variations that runbooks can't anticipate.

🔗 THEME: Knowledge is retiring — Marcus Whitfield has 30 years of IMS recovery experience that exists nowhere in documentation. The procedures he follows are a combination of documented steps and undocumented judgment calls accumulated over decades of production incidents. Sandra's modernization effort includes not just technology modernization but knowledge preservation — capturing Marcus's judgment in test cases, runbooks, and training materials before he retires.


18.8 Putting It All Together: Recovery Architecture Patterns

Pattern 1: The Self-Healing Region

The simplest recovery architecture. A single CICS region with ARM auto-restart, CICSPlex SM health monitoring, and application-level retry logic.

                    ┌─────────────────────────┐
                    │      ARM Policy          │
                    │  RESTART_ATTEMPTS(3)     │
                    │  RESTART_INTERVAL(600)   │
                    └─────────┬───────────────┘
                              │ monitors
                              ▼
                    ┌─────────────────────────┐
                    │    CICS AOR (CNBAORA1)   │
                    │                          │
                    │  Emergency Restart ──┐   │
                    │  System Log (CF) ────┤   │
                    │  Activity Keypoints ─┤   │
                    │  DB2CONN(GROUPRESYNC)─┤   │
                    │  RMRETRY(30) ────────┘   │
                    └─────────────────────────┘
                              │
                    ┌─────────▼───────────────┐
                    │   CICSPlex SM (CMAS)     │
                    │  Health Check: 15s       │
                    │  Remove/re-add routing   │
                    │  Alert operations        │
                    └─────────────────────────┘

This pattern handles Category 1–3 failures automatically. No operator intervention. Typical recovery time: 30–90 seconds for a region failure.
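The ARM side of this pattern is essentially a restart budget: restart automatically, but stop if restarts are happening too often, because a region that crashes repeatedly needs a human, not another restart. The sketch below models that sliding-window logic in Python. It is illustrative only — `ArmRestartPolicy` and its method names are invented for this example, not a real ARM or CICS API; the real policy lives in the ARM couple data set.

```python
import time

class ArmRestartPolicy:
    """Toy model of an ARM-style restart policy (class and method names
    are illustrative, not a real ARM interface).

    RESTART_ATTEMPTS(3) / RESTART_INTERVAL(600) means: allow at most
    three automatic restarts within any 600-second window. Beyond that,
    stop restarting and escalate to operations — the guard against a
    "restart storm" where a broken region crash-loops forever.
    """

    def __init__(self, max_attempts=3, interval=600, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.interval = interval
        self.clock = clock
        self._restarts = []  # timestamps of recent automatic restarts

    def should_restart(self):
        now = self.clock()
        # Drop restarts that have aged out of the sliding window.
        self._restarts = [t for t in self._restarts if now - t < self.interval]
        if len(self._restarts) >= self.max_attempts:
            return False  # budget exhausted: hand off to a human
        self._restarts.append(now)
        return True
```

The fourth failure inside the 600-second window returns `False`; once the window has rolled past the earlier failures, automatic restart resumes.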

Pattern 2: Active-Active Pair with Failover

Two CICS AORs processing the same workload. CICSPlex SM distributes transactions across both. If one fails, the other absorbs the full workload while the failed region recovers.

     CICSPlex SM
     ┌──────────────────────────────────────┐
     │  Workload: CNBCORE                    │
     │  Algorithm: GOAL                      │
     │  Targets: CNBAORA1, CNBAORA2         │
     └────────┬────────────────┬────────────┘
              │ 50% traffic    │ 50% traffic
              ▼                ▼
     ┌──────────────┐  ┌──────────────┐
     │  CNBAORA1    │  │  CNBAORA2    │
     │  MAXTASK=250 │  │  MAXTASK=250 │
     │  ARM: YES    │  │  ARM: YES    │
     └──────┬───────┘  └──────┬───────┘
            │                  │
            └───────┬──────────┘
                    ▼
            ┌───────────────┐
            │  DB2 (shared) │
            │  MQ (shared)  │
            └───────────────┘

When CNBAORA1 fails:

  1. CPSM detects the failure and removes CNBAORA1 from routing (0% / 100%)
  2. CNBAORA2 absorbs 100% of the workload (and must be sized for this — MAXTASK=250 provides headroom)
  3. ARM restarts CNBAORA1
  4. After the health check passes, CPSM re-adds CNBAORA1 (50% / 50%)

Recovery time: zero for new transactions (instant failover). 30–90 seconds for the failed region itself.
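The routing half of this pattern reduces to a simple rule: split traffic evenly across healthy targets, send nothing to failed ones. A minimal Python sketch of that recomputation — illustrative only; real CPSM GOAL-mode routing also weighs load, health, and abend rates, and `route_weights` is an invented name:

```python
def route_weights(targets, healthy):
    """Recompute routing weights the way CPSM removes and re-adds targets:
    traffic splits evenly across healthy regions, 0% to failed ones.
    (Sketch only — real CPSM routing is far more nuanced.)"""
    up = [t for t in targets if healthy.get(t, False)]
    if not up:
        raise RuntimeError("no healthy targets for workload")
    share = 1.0 / len(up)
    return {t: (share if t in up else 0.0) for t in targets}
```

With CNBAORA1 down, `route_weights(["CNBAORA1", "CNBAORA2"], {"CNBAORA1": False, "CNBAORA2": True})` yields 0% / 100%; once health checks pass again, both targets return to 50% / 50%.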

Pattern 3: Cross-LPAR Active-Active with Full Isolation

The production pattern at CNB. AOR pairs span LPARs. Each LPAR is a separate failure domain. DB2 data sharing and CF-based shared TS provide data access from any LPAR.

     SYSA                              SYSB
     ┌──────────────────┐             ┌──────────────────┐
     │  CNBTORA1 (TOR)  │◄───IPIC───►│  CNBTORB1 (TOR)  │
     │  CNBAORA1 (AOR)  │◄───IPIC───►│  CNBAORB1 (AOR)  │
     │  CNBAORA2 (AOR)  │◄───IPIC───►│  CNBAORB2 (AOR)  │
     │  DB2 Member A    │             │  DB2 Member B    │
     │  MQ QMgr A       │             │  MQ QMgr B       │
     └────────┬─────────┘             └────────┬─────────┘
              │                                 │
              └─────────┬───────────────────────┘
                        ▼
              ┌──────────────────┐
              │ Coupling Facility │
              │  DB2 lock struct  │
              │  CICS shared TS   │
              │  System logs      │
              │  Named counters   │
              └──────────────────┘

When SYSA fails entirely (LPAR failure):

  1. CPSM on SYSB detects the failure of all SYSA regions
  2. All routing shifts to the SYSB AORs (CNBAORB1, CNBAORB2)
  3. DB2 Member B serves all data (data sharing provides access)
  4. MQ QMgr B serves all queues (shared queue pattern)
  5. The SYSA LPAR restarts (IPL), its CICS regions restart (ARM), and CPSM re-adds them to routing

Recovery time for new transactions: seconds (CPSM routing update). Recovery time for SYSA: minutes (LPAR IPL + CICS restart).

⚠️ PRODUCTION RULE — In Pattern 3, each LPAR must be sized to handle 100% of the workload alone. If your normal distribution is 50/50, each LPAR must have enough CPU, memory, and I/O capacity for 100%. This is the cost of true HA — you're paying for double the capacity to ensure that a single LPAR failure doesn't degrade service. At CNB, each LPAR runs at approximately 40% capacity under normal conditions, leaving 60% headroom for failover. This is not waste — it's insurance.
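The sizing rule has a compact arithmetic form: with n equal-capacity LPARs, the total workload must fit into n − 1 of them, or a single failure overloads the survivors. A tiny Python check (the function name is invented for this illustration):

```python
def survives_one_lpar_failure(loads):
    """True if the surviving LPARs can absorb the full workload after
    any single LPAR failure. loads[i] is LPAR i's current utilization
    as a fraction of its own capacity; all LPARs are assumed equal-sized.

    With n equal LPARs, the total work sum(loads) must fit into the
    n - 1 machines that remain after the worst-case failure.
    """
    if len(loads) < 2:
        return False  # a single LPAR has no failover target
    return sum(loads) <= len(loads) - 1
```

CNB's two LPARs at ~40% each pass the check (`0.8 <= 1`); the same pair run "comfortably busy" at 60% each fails it — a configuration that looks fine every day and silently forfeits failover.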

The SecureFirst Perspective

Yuki Nakamura at SecureFirst Retail Bank faces a different recovery challenge. SecureFirst's CICS environment fronts a mobile banking API, and the mobile app has its own retry logic. When the CICS backend fails, the mobile app retries the request. If the original request actually completed — the transfer committed, but the response never reached the app — the retry creates a duplicate debit.

Carlos Vega, who designed the mobile API, initially assumed CICS handled all recovery: "I thought if the backend failed, everything was rolled back. I didn't realize there were cases where the first request committed and the response was lost."

Yuki and Carlos redesigned the API with an idempotency key:

{
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "source_account": "001-12345678",
  "dest_account": "001-87654321",
  "amount": 500.00,
  "currency": "USD"
}

The CICS program checks the request_id against the TRANSFER_AUDIT table before processing. If the ID already exists, it returns the previous result. If not, it processes the transfer and records the audit entry atomically.
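The shape of that check-then-act logic can be sketched in Python. This is a stand-in only: the real implementation is a CICS program against a DB2 TRANSFER_AUDIT table, while the sketch uses an in-memory SQLite database so it is runnable; the `transfer` function and `transfer_audit` schema are invented for illustration. The essential property is that the idempotency-key lookup, the business update, and the audit insert all commit in one unit of work.

```python
import sqlite3

def transfer(conn, request):
    """Idempotent transfer handler (sketch of the pattern, not the real
    CICS/DB2 code). The key check and the audit insert share one UOW."""
    cur = conn.cursor()
    cur.execute("BEGIN IMMEDIATE")           # start the unit of work
    try:
        row = cur.execute(
            "SELECT result FROM transfer_audit WHERE request_id = ?",
            (request["request_id"],)).fetchone()
        if row is not None:
            cur.execute("COMMIT")
            return row[0]                    # duplicate: replay prior result
        # ... the debit and credit would execute here, in the same UOW ...
        cur.execute(
            "INSERT INTO transfer_audit (request_id, result) "
            "VALUES (?, 'COMPLETED')",
            (request["request_id"],))
        cur.execute("COMMIT")                # audit entry commits atomically
        return "COMPLETED"
    except Exception:
        cur.execute("ROLLBACK")              # never record a failed attempt
        raise

# In-memory stand-in for the TRANSFER_AUDIT table (the real one is DB2).
conn = sqlite3.connect(":memory:", isolation_level=None)  # we manage the UOW
conn.execute(
    "CREATE TABLE transfer_audit (request_id TEXT PRIMARY KEY, result TEXT)")
```

Calling `transfer` twice with the same `request_id` performs the transfer once and replays the stored result the second time — exactly the behavior a retrying mobile client needs.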

This is the bridge between mainframe recovery and distributed system recovery. The CICS side handles transactional integrity (2PC, backout, emergency restart). The API side handles communication integrity (idempotency keys, retry with backoff). Together, they provide end-to-end recovery.
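The client half of that contract — retry with backoff — is only safe because the request carries an idempotency key. A minimal Python sketch of the pattern (the function name and failure model are invented; a real client would also distinguish retryable from non-retryable errors):

```python
import random
import time

def call_with_retry(send, request, attempts=4, base_delay=0.5):
    """Retry a request with exponential backoff and jitter (sketch).

    Safe only because `request` carries an idempotency key: if the first
    attempt committed but its response was lost, the retry replays the
    stored result instead of double-debiting the account.
    """
    for attempt in range(attempts):
        try:
            return send(request)
        except ConnectionError:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the error
            # Exponential backoff with jitter to avoid retry stampedes.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

The jitter matters at scale: thousands of mobile clients retrying on the same schedule would hammer a recovering region at exactly the moment it is least able to absorb load.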

🔗 THEME: The best architects understand both worlds — Carlos's initial assumption ("the backend handles recovery") is common among distributed-systems architects encountering mainframe transaction processing for the first time. The mainframe does handle recovery — but only within its transaction boundaries. When the transaction boundary ends at an API response, the last mile of recovery (ensuring the response reaches the client) requires distributed-systems patterns. The architect who understands both worlds designs end-to-end recovery. The architect who understands only one world leaves gaps.


Spaced Review Integration

Review: Chapter 8 — DB2 Locking and Recovery

In Chapter 8, you learned that DB2 locks are held for the duration of the unit of work. In the context of CICS recovery, this creates a direct relationship between CICS recovery time and DB2 lock duration:

  • A CICS region failure leaves DB2 locks held by the failed region's threads
  • Those locks persist until CICS restarts and resolves each UOW (commit or backout)
  • During the recovery window, other transactions accessing the locked rows will wait or timeout
  • Fast CICS recovery directly reduces the DB2 lock contention window

The architecture decision you made in Chapter 8 — row-level locking vs. page-level locking — is amplified here. Page-level locking means a single in-flight transaction locks an entire page of data (typically 40–100 rows). During a CICS failure, all those rows are unavailable. Row-level locking limits the impact to the specific rows modified by in-flight transactions. This is another argument for row-level locking in high-availability environments.

Review: Chapter 13 — CICS Region Topology and MRO

In Chapter 13, you designed a multi-region topology with TOR/AOR/FOR separation and MRO connections. The recovery implications of that topology include:

  • AOR failure is isolated to one application group. TORs reroute to surviving AORs. FORs continue serving other AORs. This is the fundamental benefit of topology separation for recovery.
  • FOR failure affects all AORs that depend on it for VSAM access. This is why CNB migrated high-volume data from VSAM to DB2 with data sharing — eliminating the FOR as a single point of failure.
  • MRO connection failure between regions triggers session re-establishment. AUTOCONNECT(YES) on the CONNECTION definition automates this. Without autoconnect, an operator must manually acquire the MRO sessions after a region restart.

Chapter Summary

CICS failure and recovery is not a single mechanism — it's an architecture. The system log provides the recovery data. The recovery manager coordinates the recovery process. Two-phase commit ensures atomicity across resource managers. Indoubt resolution handles the edge case where the coordinator fails. Emergency restart brings the region back. ARM automates the restart. CICSPlex SM manages the topology-level response. Idempotent transaction design ensures application-level consistency. And recovery testing validates that all of these mechanisms work together.

The common thread: every recovery behavior is a design decision. The system log location (DASD vs. coupling facility), the keypoint interval, the ARM restart policy, the RESYNCMEMBER setting, the RMRETRY interval, the retry logic in your application code — each is a choice that determines how your system behaves when something fails. Make these choices deliberately, document them explicitly, and test them regularly.

Kwame's closing observation: "I've never seen a CICS production failure that surprised me technically. Every failure mode is documented. Every recovery mechanism is proven. What surprises me is how often shops haven't configured them, haven't tested them, or haven't trained their people to use them. The technology is rock-solid. The failures are always human."


What's Next

Chapter 19 shifts to messaging and integration with IBM MQ. Where this chapter focused on recovering from failures within the CICS transaction boundary, Chapter 19 introduces a fundamentally different model: asynchronous message processing where the sender and receiver are temporally decoupled. The recovery patterns change — instead of two-phase commit across synchronous participants, you'll design for guaranteed message delivery, dead-letter handling, and exactly-once processing semantics. The XA and indoubt resolution concepts from this chapter carry directly into MQ's transactional model.

But first: the project checkpoint. You're about to design the failure and recovery architecture for the HA banking system — the most critical component of the entire system design. A banking system that can't recover from failures is not a banking system. It's a liability.