Case Study 28.2: Member Failure Recovery — A Saturday Night Story
Background
Pacific Trust Bank operates a 3-member DB2 data sharing group (PTB1, PTB2, PTB3) on two z15 CPCs. The group processes online banking, ATM transactions, and wire transfers 24/7. On a typical Saturday evening:
- PTB1 handles 4,000 TPS from CICS online banking
- PTB2 handles 2,500 TPS from ATM and mobile app transactions
- PTB3 is running the weekly settlement batch, updating 8 million loan accounts
The on-call DBA is Sarah Chen. Her pager goes off at 9:47 PM.
The Incident
9:47 PM — Alert Fires
Sarah's monitoring system generates a critical alert:
CRITICAL: DB2 member PTB2 - XCF group membership lost
CRITICAL: Retained locks detected in PTBGRP_LOCK1
WARNING: GBP0 castout owner change for 12 pagesets
Sarah logs into the operations console from home and begins assessment.
9:48 PM — Initial Assessment
Sarah runs the following commands from PTB1:
-PTB1 DISPLAY GROUP
Output:
DSN7100I -PTB1 DSN7GCMD - CURRENT GROUP LEVEL IS V13
MEMBER SUBSYS STATUS CMDPREF
------ ------ -------- -------
PTB1 PTB1 ACTIVE -PTB1
PTB2 PTB2 FAILED -PTB2
PTB3 PTB3 ACTIVE -PTB3
PTB2 has failed. The z/OS operations team confirms that the LPAR running PTB2 experienced a hardware fault — a memory board failure caused a machine check, and the LPAR terminated.
9:49 PM — Retained Lock Assessment
Sarah checks for retained locks:
-PTB1 DISPLAY DATABASE(*) SPACENAM(*) RESTRICT
The output shows 47 tablespaces with retained locks from PTB2. The most critical:
TABLESPACE DBNAME STATUS
---------- ---------- ------
ACCOUNTS PTBBANK LPL,GRECP
TRNHIST PTBBANK LPL,GRECP
CUSTDATA PTBBANK GRECP
WIRETRAN PTBBANK GRECP
LPL = Logical Page List (pages needing recovery from the logs). GRECP = Group buffer pool RECovery Pending (the pageset must be recovered because changed pages in the group buffer pool are no longer reliable).
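Displays like the one above are fixed-column text, so operations teams often script a quick triage pass over them. A minimal sketch in Python, assuming the three-column layout of the sample output (this is not a general parser for real DB2 display messages):

```python
# Group tablespaces from a RESTRICT display by restricted state.
# Assumes three whitespace-separated columns: TABLESPACE, DBNAME, STATUS,
# where STATUS may be a comma-separated list such as "LPL,GRECP".

def parse_restrict(lines):
    by_state = {}
    for line in lines:
        parts = line.split()
        # Skip the header row and the dashed divider row.
        if len(parts) != 3 or parts[0] in ("TABLESPACE", "-" * 10):
            continue
        tablespace, dbname, states = parts
        for state in states.split(","):
            by_state.setdefault(state, []).append(tablespace)
    return by_state

sample = [
    "TABLESPACE DBNAME     STATUS",
    "---------- ---------- ------",
    "ACCOUNTS   PTBBANK    LPL,GRECP",
    "CUSTDATA   PTBBANK    GRECP",
]
# parse_restrict(sample) groups ACCOUNTS under both LPL and GRECP,
# and CUSTDATA under GRECP only.
```

A script like this lets the on-call DBA immediately see which pagesets need log-apply recovery (LPL) versus group-buffer-pool recovery (GRECP) without scanning 47 rows by eye.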
9:50 PM — Peer Recovery Begins
PTB1 has already initiated automatic peer recovery for PTB2. Sarah monitors progress:
-PTB1 DISPLAY LOG
DSNJ100I -PTB1 PEER RECOVERY IN PROGRESS FOR MEMBER PTB2
LOG RANGE: RBA 0000A21F34000000 TO 0000A21F89FFFFFF
UNDO PHASE: BACKING OUT 1,847 IN-FLIGHT UNITS OF WORK
The peer recovery process:
1. PTB1 reads PTB2's active log datasets (on shared DASD).
2. It identifies all in-flight transactions at the time of failure: 1,847 uncommitted units of work.
3. It begins the UNDO phase, rolling back each uncommitted transaction.
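The identify-then-undo logic can be sketched in miniature. The log record layout below is invented purely for illustration; DB2's real log format and backout processing are far richer:

```python
# Toy model of the peer-recovery UNDO phase: find units of work (UOWs)
# that have a BEGIN but no COMMIT/ABORT, then collect their updates in
# reverse log order so before-images can be reapplied newest-first.

def find_inflight_uows(log_records):
    """Return the set of UOW ids still open at end of log."""
    open_uows = set()
    for rec in log_records:
        if rec["type"] == "BEGIN":
            open_uows.add(rec["urid"])
        elif rec["type"] in ("COMMIT", "ABORT"):
            open_uows.discard(rec["urid"])
    return open_uows

def undo_phase(log_records):
    """List (urid, page, before_image) to restore, newest change first."""
    inflight = find_inflight_uows(log_records)
    return [
        (rec["urid"], rec["page"], rec["before_image"])
        for rec in reversed(log_records)
        if rec["type"] == "UPDATE" and rec["urid"] in inflight
    ]
```

Committed work is untouched; only updates belonging to in-flight UOWs are backed out, which is why peer recovery preserves data integrity without losing completed transactions.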
9:52 PM — Impact on PTB1 and PTB3
During peer recovery, PTB1 continues processing its online banking workload, but with increased response times:
- Before failure: Average CICS transaction response time on PTB1 = 12 ms
- During recovery: Average response time = 45 ms
The slowdown is caused by:
1. PTB1 consuming CPU for peer recovery processing.
2. Some of PTB1's transactions being blocked by retained locks held by PTB2.
3. Increased GBP castout processing as PTB1 takes over castout ownership for PTB2's pagesets.
PTB3's batch job pauses temporarily when it encounters a retained lock on the ACCOUNTS tablespace. The batch controller enters a wait state.
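The batch job's pause-and-resume behavior can be modeled as a bounded wait. In reality the lock manager (IRLM) suspends the requester and wakes it when the retained lock is freed; the polling loop below is only an illustration of the observable behavior, with invented names and timeouts:

```python
import time

def acquire_with_wait(is_lock_retained, timeout_s=600.0, poll_s=5.0):
    """Block until the retained lock clears, or raise after timeout_s.

    is_lock_retained: callable returning True while the lock is held.
    Mirrors the batch job's behavior: suspend, then resume automatically
    once peer recovery releases the retained locks.
    """
    waited = 0.0
    while is_lock_retained():
        if waited >= timeout_s:
            raise TimeoutError("retained lock not released in time")
        time.sleep(poll_s)
        waited += poll_s
    return True
```

The key operational point is the same as in the story: no manual restart is needed, because the waiter resumes on its own the moment the lock is released, and a timeout bounds how long the batch window can be held hostage.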
9:55 PM — Peer Recovery Completes
DSNJ200I -PTB1 PEER RECOVERY COMPLETE FOR MEMBER PTB2
TRANSACTIONS ROLLED BACK: 1,847
RETAINED LOCKS RELEASED: 23,481
ELAPSED TIME: 4 MINUTES 52 SECONDS
All retained locks are released. PTB1's response times return to normal. PTB3's batch job resumes.
9:56 PM — Customer Impact Assessment
Sarah reviews the impact:
- PTB1 (online banking): Continued operating throughout. Response time degraded for ~5 minutes during peer recovery. No transactions lost.
- PTB2 (ATM/mobile): All traffic (2,500 TPS) halted at 9:47 PM. Approximately 450 in-flight ATM transactions were rolled back. ATM machines displayed "Transaction could not be completed, please try again." Mobile app users received timeout errors.
- PTB3 (batch): Paused for 3 minutes while retained locks blocked access to ACCOUNTS. Resumed automatically after peer recovery. Will complete on schedule.
10:15 PM — PTB2 Recovery
The hardware team replaces the faulty memory board and IPLs (boots) the LPAR. Sarah restarts DB2 on PTB2:
-PTB2 START DB2
PTB2 performs its local restart, reconciles with the group, and rejoins:
DSNL200I -PTB2 MEMBER PTB2 HAS JOINED GROUP PTBGRP
Sarah reconfigures the sysplex distributor to resume routing ATM and mobile traffic to PTB2.
10:32 PM — Service Fully Restored
All three members are active. Sarah confirms:
-PTB1 DISPLAY GROUP DETAIL
MEMBER STATUS CONNECTIONS TPS CF_SVC_TIME
------ ------- ----------- ------ -----------
PTB1 ACTIVE 2,450 4,100 11 us
PTB2 ACTIVE 1,800 2,200 12 us
PTB3 ACTIVE 25 850 14 us
PTB2 is ramping back up as connections are re-established.
Timeline Summary
| Time | Event | Duration |
|---|---|---|
| 9:47 PM | PTB2 LPAR crashes | — |
| 9:47 PM | XCF detects failure, alerts fire | <10 seconds |
| 9:47 PM | Peer recovery initiated by PTB1 | Automatic |
| 9:50-9:55 PM | PTB1 response time degraded | ~5 minutes |
| 9:52-9:55 PM | PTB3 batch paused on retained locks | 3 minutes |
| 9:55 PM | Peer recovery complete, retained locks released | 4 min 52 sec total |
| 10:15 PM | LPAR restarted after hardware repair | — |
| 10:20 PM | PTB2 DB2 restarted and rejoins group | 5 minutes |
| 10:32 PM | Full service restored to pre-failure levels | — |
- Total customer-facing outage for PTB2's workload: 45 minutes (dominated by the 28-minute hardware repair, plus DB2 restart and traffic ramp-up)
- Total customer-facing outage for PTB1's workload: 0 minutes (5 minutes of degraded performance, but no outage)
- Data loss: zero
Post-Incident Analysis
What Went Right
- Automatic peer recovery worked flawlessly. No DBA intervention was needed for the recovery itself.
- PTB1 continued serving customers throughout the incident, absorbing some of PTB2's workload.
- Data integrity was preserved. All 1,847 uncommitted transactions were cleanly rolled back.
- The batch job self-recovered. After retained locks were released, PTB3's batch resumed without manual intervention.
What Could Be Improved
- ATM failover was slow. ATM machines should have been configured to retry transactions on PTB1 via the sysplex distributor, but some ATM controllers had hardcoded connections to PTB2's specific IP address.
- Mobile app error handling was poor. The mobile app showed generic "Service Unavailable" errors instead of a user-friendly "Please try again in a moment" message with automatic retry.
- Hardware repair took 28 minutes. If the spare LPAR on CPC-2 had been pre-configured with a DB2 image, PTB2 could have been restarted on the spare LPAR within minutes, without waiting for hardware repair.
- Response time degradation on PTB1 was not anticipated. The capacity planning model did not account for the CPU overhead of peer recovery. Additional LPAR CPU capacity should be provisioned for this scenario.
Recommendations
- Configure all ATM controllers to use the DVIPA (group address) rather than member-specific IP addresses.
- Add retry logic to the mobile application with exponential backoff and user-friendly error messages.
- Pre-configure a warm spare LPAR on CPC-2 with DB2 installed, ready to take over PTB2's identity within minutes.
- Increase PTB1's LPAR CPU allocation by 15% to absorb peer recovery overhead without impacting response times.
- Conduct quarterly failure simulation drills to ensure operational readiness.
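The mobile-app retry recommendation above can be sketched as a standard exponential-backoff wrapper. Function and parameter names are illustrative, not from the bank's actual codebase:

```python
import random
import time

def call_with_backoff(request_fn, max_attempts=5, base_delay_s=0.5):
    """Retry a failing service call with exponential backoff plus jitter.

    request_fn: callable performing the service request; raises
    ConnectionError on failure. Only after all attempts fail should the
    app surface a user-friendly "please try again" message.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exhausted; caller shows the friendly error
            delay = base_delay_s * (2 ** attempt)
            # Jitter spreads retries out so thousands of clients do not
            # hammer the surviving member in lockstep after a failure.
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

With delays of roughly 0.5, 1, 2, and 4 seconds, a transient blip like the 9:47 PM failover window becomes invisible to most users, while the jitter avoids a retry stampede against PTB1.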
Discussion Questions
- If PTB2 had been running long-running utility jobs (REORG, COPY) at the time of failure, how would that have affected peer recovery?
- What would have happened if PTB1 had also failed during the peer recovery process (i.e., only PTB3 remained)?
- If the coupling facility had failed instead of the LPAR, how would the scenario differ? What if both CFs failed?
- Calculate the approximate financial impact of the 45-minute outage for PTB2's workload, assuming $0.15 revenue per ATM transaction and 2,500 TPS.