Case Study 28.2: Member Failure Recovery — A Saturday Night Story

Background

Pacific Trust Bank operates a 3-member DB2 data sharing group (PTB1, PTB2, PTB3) on two z15 CPCs. The group processes online banking, ATM transactions, and wire transfers 24/7. On a typical Saturday evening:

  • PTB1 handles 4,000 TPS from CICS online banking
  • PTB2 handles 2,500 TPS from ATM and mobile app transactions
  • PTB3 is running the weekly settlement batch, updating 8 million loan accounts

The on-call DBA is Sarah Chen. Her pager goes off at 9:47 PM.

The Incident

9:47 PM — Alert Fires

Sarah's monitoring system generates a critical alert:

CRITICAL: DB2 member PTB2 - XCF group membership lost
CRITICAL: Retained locks detected in PTBGRP_LOCK1
WARNING:  GBP0 castout owner change for 12 pagesets

Sarah logs into the operations console from home and begins assessment.

9:48 PM — Initial Assessment

Sarah runs the following commands from PTB1:

-PTB1 DISPLAY GROUP

Output:

DSN7100I -PTB1 DSN7GCMD - CURRENT GROUP LEVEL IS V13
MEMBER  SUBSYS  STATUS    CMDPREF
------  ------  --------  -------
PTB1    PTB1    ACTIVE    -PTB1
PTB2    PTB2    FAILED    -PTB2
PTB3    PTB3    ACTIVE    -PTB3

PTB2 has failed. The z/OS operations team confirms that the LPAR running PTB2 experienced a hardware fault — a memory board failure caused a machine check, and the LPAR terminated.

9:49 PM — Retained Lock Assessment

Sarah checks for retained locks:

-PTB1 DISPLAY DATABASE(*) SPACENAM(*) RESTRICT

The output shows 47 tablespaces with retained locks from PTB2. The most critical:

TABLESPACE  DBNAME      STATUS
----------  ----------  ------
ACCOUNTS    PTBBANK     LPL,GRECP
TRNHIST     PTBBANK     LPL,GRECP
CUSTDATA    PTBBANK     GRECP
WIRETRAN    PTBBANK     GRECP

LPL = Logical Page List (individual pages that must be recovered before they can be accessed). GRECP = Group Buffer Pool Recovery Pending (the pageset depends on changed data that was lost from the group buffer pool and must be recovered from the logs before use).
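When dozens of pagesets are restricted at once, it helps to turn the display output into a prioritized worklist. The sketch below parses the sample output shown above; the column layout is an assumption based on this case study's output, and real DISPLAY DATABASE message formats vary by DB2 version.

```python
# Sketch: turn the RESTRICT display above into a worklist, LPL cases first.
# The fixed three-column layout is assumed from this case study's sample
# output; real DISPLAY DATABASE output differs by DB2 version.

SAMPLE = """\
TABLESPACE  DBNAME      STATUS
----------  ----------  ------
ACCOUNTS    PTBBANK     LPL,GRECP
TRNHIST     PTBBANK     LPL,GRECP
CUSTDATA    PTBBANK     GRECP
WIRETRAN    PTBBANK     GRECP
"""

def restricted_pagesets(display_output: str):
    """Return (tablespace, dbname, set-of-states) tuples, worst first."""
    rows = []
    for line in display_output.splitlines()[2:]:   # skip header + rule line
        ts, db, status = line.split()
        rows.append((ts, db, set(status.split(","))))
    # LPL entries also need page-level recovery, so surface them first
    return sorted(rows, key=lambda r: "LPL" not in r[2])

for ts, db, states in restricted_pagesets(SAMPLE):
    print(f"{db}.{ts}: {sorted(states)}")
```

Sorting by "LPL membership" is a stable sort, so the original display order is preserved within each severity group.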

9:50 PM — Peer Recovery Begins

PTB1 has already initiated automatic peer recovery for PTB2. Sarah monitors progress:

-PTB1 DISPLAY LOG
DSNJ100I -PTB1 PEER RECOVERY IN PROGRESS FOR MEMBER PTB2
         LOG RANGE: RBA 0000A21F34000000 TO 0000A21F89FFFFFF
         UNDO PHASE: PROCESSING 1,847 UNDO RECORDS

The peer recovery process:

  1. PTB1 reads PTB2's active log datasets (on shared DASD).
  2. It identifies all transactions in flight at the time of failure: 1,847 uncommitted units of work.
  3. It begins the UNDO phase, rolling back each uncommitted transaction.
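The three steps above can be sketched conceptually. Every data structure here is invented for illustration; real DB2 log records, RBA-addressed logs, and lock structures are far more involved.

```python
# Conceptual sketch of peer recovery: find in-flight units of work on a
# failed member's log, undo their updates newest-first, release their
# retained locks. All structures are illustrative, not real DB2 formats.

from dataclasses import dataclass

@dataclass
class LogRecord:
    rba: int               # relative byte address within the member's log
    uow: int               # unit-of-work (transaction) identifier
    kind: str              # "UPDATE", "COMMIT", or "ABORT"
    undo_action: str = ""  # description of how to reverse the change

def peer_recover(log, retained_locks, apply_undo):
    """Roll back every unit of work with no COMMIT/ABORT on the log,
    then release the retained locks it held."""
    completed = {r.uow for r in log if r.kind in ("COMMIT", "ABORT")}
    in_flight = {r.uow for r in log} - completed
    for rec in sorted(log, key=lambda r: r.rba, reverse=True):  # UNDO phase
        if rec.uow in in_flight and rec.kind == "UPDATE":
            apply_undo(rec.undo_action)          # reverse the change
    for uow in in_flight:
        retained_locks.pop(uow, None)            # locks now released
    return in_flight

# Tiny worked example: UoW 1 committed, UoW 2 was in flight at failure.
log = [
    LogRecord(100, 1, "UPDATE", "restore row A"),
    LogRecord(200, 2, "UPDATE", "restore row B"),
    LogRecord(300, 1, "COMMIT"),
    LogRecord(400, 2, "UPDATE", "restore row C"),
]
locks = {1: {"ACCOUNTS"}, 2: {"ACCOUNTS", "WIRETRAN"}}
undone = []
rolled_back = peer_recover(log, locks, undone.append)
print(rolled_back, undone, locks)
```

Note the reverse-RBA order: undo is applied newest change first, exactly as a rollback must replay compensations.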

9:52 PM — Impact on PTB1 and PTB3

During peer recovery, PTB1 continues processing its online banking workload, but with increased response times:

  • Before failure: Average CICS transaction response time on PTB1 = 12 ms
  • During recovery: Average response time = 45 ms

The slowdown has three causes:

  1. PTB1 is consuming CPU for peer recovery processing.
  2. Some of PTB1's transactions are blocked by retained locks held by PTB2.
  3. GBP castout processing has increased as PTB1 takes over castout ownership of PTB2's pagesets.

PTB3's batch job pauses temporarily when it encounters a retained lock on the ACCOUNTS tablespace. The batch controller enters a wait state.

9:55 PM — Peer Recovery Completes

DSNJ200I -PTB1 PEER RECOVERY COMPLETE FOR MEMBER PTB2
         TRANSACTIONS ROLLED BACK: 1,847
         RETAINED LOCKS RELEASED: 23,481
         ELAPSED TIME: 4 MINUTES 52 SECONDS

All retained locks are released. PTB1's response times return to normal. PTB3's batch job resumes.

9:56 PM — Customer Impact Assessment

Sarah reviews the impact:

  • PTB1 (online banking): Continued operating throughout. Response time degraded for ~5 minutes during peer recovery. No transactions lost.
  • PTB2 (ATM/mobile): All 2,500 TPS halted at 9:47 PM. ATM transactions in flight (approximately 450) were rolled back. ATM machines displayed "Transaction could not be completed, please try again." Mobile app users received timeout errors.
  • PTB3 (batch): Paused for 3 minutes while retained locks blocked access to ACCOUNTS. Resumed automatically after peer recovery. Will complete on schedule.
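The "approximately 450" in-flight figure is consistent with Little's Law: the number of transactions in the system equals the arrival rate times the average time each transaction spends in flight. The 180 ms residence time below is an assumed value chosen to match; the incident data does not state it.

```python
# Little's Law sanity check for the "~450 in-flight" ATM figure.
#   in_flight = arrival_rate * average_residence_time
# The 180 ms residence time is an assumption, not given in the incident
# data; at 2,500 TPS it reproduces the reported number.

arrival_rate_tps = 2500    # PTB2's ATM/mobile throughput before failure
avg_residence_s = 0.180    # assumed average time a transaction is in flight

in_flight = arrival_rate_tps * avg_residence_s
print(f"Estimated in-flight transactions at failure: {in_flight:.0f}")
```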

10:15 PM — PTB2 Recovery

The hardware team replaces the faulty memory board and IPLs (boots) the LPAR. Sarah restarts DB2 on PTB2:

-PTB2 START DB2

PTB2 performs its local restart, reconciles with the group, and rejoins:

DSNL200I -PTB2 MEMBER PTB2 HAS JOINED GROUP PTBGRP

Sarah reconfigures the sysplex distributor to resume routing ATM and mobile traffic to PTB2.

10:32 PM — Service Fully Restored

All three members are active. Sarah confirms:

-PTB1 DISPLAY GROUP DETAIL
MEMBER  STATUS   CONNECTIONS  TPS     CF_SVC_TIME
------  -------  -----------  ------  -----------
PTB1    ACTIVE   2,450        4,100   11 us
PTB2    ACTIVE   1,800        2,200   12 us
PTB3    ACTIVE   25           850     14 us

PTB2 is ramping back up as connections are re-established.

Timeline Summary

Time          Event                                            Duration
------------  -----------------------------------------------  ------------
9:47 PM       PTB2 LPAR crashes
9:47 PM       XCF detects failure, alerts fire                 <10 seconds
9:47 PM       Peer recovery initiated automatically by PTB1
9:50-9:55 PM  PTB1 response time degraded                      ~5 minutes
9:52-9:55 PM  PTB3 batch paused on retained locks              3 minutes
9:55 PM       Peer recovery complete, retained locks released  4 min 52 sec
10:15 PM      LPAR restarted after hardware repair             28 minutes
10:20 PM      PTB2 restarted and rejoins the group             5 minutes
10:32 PM      Full service restored to pre-failure levels

  • Total customer-facing outage for PTB2's workload: 45 minutes (hardware repair, DB2 restart, and traffic ramp-up)
  • Total customer-facing outage for PTB1's workload: 0 minutes (~5 minutes of degraded response times, but no outage)
  • Data loss: zero

Post-Incident Analysis

What Went Right

  1. Automatic peer recovery worked flawlessly. No DBA intervention was needed for the recovery itself.
  2. PTB1 continued serving customers throughout the incident, absorbing some of PTB2's workload.
  3. Data integrity was preserved. All 1,847 uncommitted transactions were cleanly rolled back.
  4. The batch job self-recovered. After retained locks were released, PTB3's batch resumed without manual intervention.

What Could Be Improved

  1. ATM failover was slow. ATM machines should have been configured to retry transactions on PTB1 via the sysplex distributor, but some ATM controllers had hardcoded connections to PTB2's specific IP address.

  2. Mobile app error handling was poor. The mobile app showed generic "Service Unavailable" errors instead of a user-friendly "Please try again in a moment" message with automatic retry.

  3. Hardware repair took 28 minutes. If the spare LPAR on CPC-2 had been pre-configured with a DB2 image, PTB2 could have been restarted on the spare LPAR within minutes, without waiting for hardware repair.

  4. Response time degradation on PTB1 was not anticipated. The capacity planning model did not account for the CPU overhead of peer recovery. Additional LPAR CPU capacity should be provisioned for this scenario.

Recommendations

Configure all ATM controllers to use the group's distributed DVIPA (the sysplex distributor address) rather than member-specific IP addresses.
  2. Add retry logic to the mobile application with exponential backoff and user-friendly error messages.
  3. Pre-configure a warm spare LPAR on CPC-2 with DB2 installed, ready to take over PTB2's identity within minutes.
  4. Increase PTB1's LPAR CPU allocation by 15% to absorb peer recovery overhead without impacting response times.
  5. Conduct quarterly failure simulation drills to ensure operational readiness.
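Recommendation 2 can be sketched as a simple retry policy. The example is in Python for brevity; the real mobile app would implement the same policy in its own language, and `submit_txn` is a hypothetical stand-in for the app's transaction call.

```python
# Sketch of client-side retry with exponential backoff and full jitter.
# submit_txn is a hypothetical stand-in for the mobile app's transaction
# call; ConnectionError stands in for a transient "member down" failure.

import random
import time

def with_retries(submit_txn, max_attempts=5, base_delay=0.5, cap=8.0):
    """Retry submit_txn on transient failure, doubling the delay each
    attempt (with full jitter) up to `cap` seconds."""
    for attempt in range(max_attempts):
        try:
            return submit_txn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                               # give up after last try
            delay = min(cap, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))    # full jitter
```

While retrying, the UI can show "Please try again in a moment" instead of a generic error; the jitter prevents thousands of clients from hammering the surviving member in synchronized waves.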

Discussion Questions

  1. If PTB2 had been running long-running utility jobs (REORG, COPY) at the time of failure, how would that have affected peer recovery?
  2. What would have happened if PTB1 had also failed during the peer recovery process (i.e., only PTB3 remained)?
  3. If the coupling facility had failed instead of the LPAR, how would the scenario differ? What if both CFs failed?
  4. Calculate the approximate financial impact of the 45-minute outage for PTB2's workload, assuming $0.15 revenue per ATM transaction and 2,500 TPS.