Case Study 36.2: Disaster Recovery Drill — Lessons Learned

Background

Continental Insurance Group (CIG) is a large property and casualty insurer processing $4.2 billion in annual premiums. Their claims processing system runs on DB2 13 for z/OS in a three-member data sharing group. The policy administration system runs on DB2 11.5 for LUW with an HADR pair. State insurance regulators require annual proof of disaster recovery capability, with a maximum Recovery Time Objective (RTO) of 4 hours and zero data loss (RPO = 0).

CIG had been conducting DR drills annually for seven years. Every drill had been "successful" — the team had achieved the RTO and RPO targets each time. Then came Year 8.

The DR Plan (As Documented)

CIG's documented DR plan for the DB2 environment specified:

z/OS Data Sharing Group:

  1. Detect failure of the primary data center.
  2. Activate the remote data sharing member (DB2D) at the DR site (connected via GDPS — Geographically Dispersed Parallel Sysplex).
  3. Verify coupling facility structures at the DR site.
  4. Resume application workload within 2 hours.

LUW HADR Pair:

  1. Detect failure of the primary LUW server.
  2. Execute TAKEOVER HADR on the standby (NEARSYNC mode).
  3. Redirect application connections via Automatic Client Reroute.
  4. Verify data integrity.
  5. Resume application workload within 30 minutes.

The Year 8 Drill

The drill was scheduled for a Saturday in March. The DBA team had four members: a senior z/OS DBA (20 years experience), a senior LUW DBA (14 years experience), a mid-level DBA (5 years experience), and a junior DBA (18 months experience, first DR drill).

07:00 — Drill Begins

The operations center declared the simulated disaster: "Primary data center power failure. All systems in Data Center 1 are offline. Execute DR Plan."

07:04 — z/OS Activation Begins

The senior z/OS DBA began activating the remote data sharing member. The first step — verifying coupling facility structures at the DR site — immediately revealed a problem.

Problem 1: Coupling Facility Structure Sizes Were Wrong

The DR site's coupling facility had been upgraded six months ago, and the structure sizes had been reconfigured to match the new hardware. However, the DB2 data sharing group definitions still referenced the old structure sizes. When DB2D attempted to connect to the coupling facility, the lock structure allocation failed because the defined size was larger than the available space in the new CF partition layout.

Root Cause: The z/OS systems programming team had updated the CF policies during the hardware upgrade but had not notified the DBA team. The CFRM (Coupling Facility Resource Management) policy entries for the DB2 structures no longer matched the actual CF configuration.

Resolution: The senior z/OS DBA manually adjusted the CFRM policy to match the actual CF partition sizes. This required:

  1. Reviewing the CF partition layout with the systems programmer on call.
  2. Updating the CFRM policy in the CFRM couple data set using the IXCMIAPU administrative data utility.
  3. Activating the new policy with SETXCF START,POLICY.
  4. Retrying the DB2D connection.

Time Lost: 47 minutes.

07:08 — LUW HADR Takeover Begins

Meanwhile, the senior LUW DBA initiated the HADR takeover on the standby server.

TAKEOVER HADR ON DATABASE CIGPOLICY BY FORCE

The command completed with a success message: the standby was promoted to primary in 14 seconds.

07:10 — Application Team Begins Connection Testing

The application team attempted to connect to the new primary database. Connections from the claims processing application succeeded immediately — Automatic Client Reroute worked as designed.

Problem 2: One Application Did Not Have ACR Configured

The policy rating engine — a newer application deployed 10 months ago — had been configured with a direct connection string pointing to the original primary server. It did not use the DB2 connection pool that had ACR configured. When the primary went "offline" (simulated), the rating engine's connections failed and it had no alternate server defined.

Root Cause: The rating engine was deployed by a different development team that was not aware of the ACR configuration requirement. The deployment checklist did not include a DB2 connection configuration review.

Resolution: The application team manually updated the rating engine's connection string to include the alternate server:

jdbc:db2://dr-server:50000/CIGPOLICY:currentSchema=CIGPOL;
  clientRerouteAlternateServerName=primary-server;
  clientRerouteAlternatePortNumber=50000;
  retryIntervalForClientReroute=10;
  maxRetriesForClientReroute=5;

Then they restarted the rating engine's application server.

Time Lost: 22 minutes (including the application server restart).
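Conceptually, the reroute parameters in that connection string drive a simple retry loop. The following is a simplified Python model of that behavior — not the actual JCC driver implementation — using the server names from the example above:

```python
import time

def connect_with_reroute(connect, primary, alternate,
                         retry_interval=10, max_retries=5):
    """Try the primary target, then the alternate, sleeping
    retry_interval seconds between rounds, for up to max_retries
    rounds -- mirroring retryIntervalForClientReroute and
    maxRetriesForClientReroute in the connection string."""
    last_error = None
    for attempt in range(max_retries):
        for server in (primary, alternate):
            try:
                return connect(server)
            except ConnectionError as exc:
                last_error = exc
        time.sleep(retry_interval)
    raise last_error

# Simulated connect: the primary data center is down, the DR
# server is reachable.
def fake_connect(server):
    if server == "dr-server":
        return "connected to dr-server"
    raise ConnectionError(server + " unreachable")

print(connect_with_reroute(fake_connect, "primary-server", "dr-server",
                           retry_interval=0))
# -> connected to dr-server
```

The rating engine's original configuration amounted to the same loop with no alternate defined — when the only server in the list went away, every retry failed.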

07:51 — z/OS Recovery Continues

After the CFRM policy fix, DB2D connected to the coupling facility successfully. But the next step — recovering the databases to a consistent state — uncovered another problem.

Problem 3: Archive Log Tapes Were Not at the DR Site

The data sharing group's recovery process required applying archive logs from the most recent image copies to the current point. The image copies were replicated to the DR site via PPRC (Peer-to-Peer Remote Copy). However, the archive logs generated between the last image copy and the simulated failure were on tape — and the tape management system had not yet replicated them to the DR site's tape library.

The archive logs were generated between 02:00 AM and 07:00 AM (the current time). The replication schedule moved tapes to the DR site in a nightly batch at 04:00 AM, but the 04:00 AM batch only replicated tapes created before midnight. The 02:00-07:00 AM logs would not arrive until the next night's batch.

Root Cause: The tape replication schedule had a 28-hour lag for recent archive logs. This had never been an issue in previous drills because those drills had been scheduled after the replication window — by design (though the DBA team had not realized this was a critical dependency).
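The 28-hour figure can be checked with a little arithmetic. A sketch of the schedule described above (simplified: one batch per day at 04:00, replicating only tapes created before the previous midnight, and ignoring mount and transport time):

```python
def replication_lag_hours(created_hour):
    """Hours until an archive log tape created at created_hour
    (hours after midnight, 0-24) reaches the DR site under the
    nightly-batch schedule."""
    # A tape created at or after midnight misses that day's 04:00
    # batch and ships in the next day's batch, at hour 24 + 4.
    return (24 + 4) - created_hour

# Worst case: a log archived just after midnight waits ~28 hours.
print(replication_lag_hours(0))   # 28
# The 02:00-07:00 AM logs from the drill would wait 21-26 hours.
print(replication_lag_hours(2))   # 26
print(replication_lag_hours(7))   # 21
```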

Resolution: The senior z/OS DBA had two options:

  1. Wait for the tapes to be physically shipped from the primary data center (simulated as available — in a real disaster, they might not be).
  2. Recover to the last image copy point and accept data loss for the 02:00-07:00 AM window.

Neither option met the RPO = 0 requirement.

In the real drill, the team chose Option 2, recovering to the last image copy (taken at 01:30 AM). This meant 5.5 hours of data was lost — primarily overnight batch processing that could be re-run.

Time for z/OS recovery: 2 hours 14 minutes (including the CFRM fix).

08:15 — LUW Data Integrity Verification

Problem 4: Data Integrity Check Revealed a Discrepancy

The LUW DBA ran data integrity checks after the HADR takeover:

SELECT COUNT(*) FROM CIGPOL.POLICY_MASTER;           -- Expected: 4,234,891
SELECT COUNT(*) FROM CIGPOL.CLAIM_HISTORY;           -- Expected: 12,847,223
SELECT MAX(TRANSACTION_ID) FROM CIGPOL.POLICY_TRANS; -- Expected: 98,234,567

The POLICY_MASTER and CLAIM_HISTORY counts matched. But the POLICY_TRANS maximum transaction ID was 98,234,412 — 155 transactions fewer than expected.

Root Cause: HADR was configured in NEARSYNC mode, not SYNC mode. In SYNC mode, the primary does not acknowledge a commit until the standby confirms the log records have been written to disk at the standby site; in NEARSYNC, it waits only for confirmation that the records have reached the standby's main memory. More importantly, under heavy load the log-shipping connection had become congested and the pair had briefly fallen out of peer state, during which the primary continued committing without any standby confirmation. Those 155 transactions had been committed on the primary but had never been received by the standby.
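The discrepancy the integrity check found can be expressed as a toy model — a deliberately simplified Python sketch, not DB2 internals: the primary commits transaction IDs in order, the standby's receive position trails behind, and a forced takeover promotes the standby at its last received position.

```python
def takeover_after_failure(committed_ids, shipped_through):
    """committed_ids: transaction IDs committed on the primary,
    in order. shipped_through: the last ID the standby had
    received when the primary failed. Returns the standby's max
    ID after takeover and the list of lost transactions."""
    lost = [t for t in committed_ids if t > shipped_through]
    return shipped_through, lost

# Figures from the drill: MAX(TRANSACTION_ID) was 98,234,567 on
# the failed primary, 98,234,412 on the promoted standby.
standby_max, lost = takeover_after_failure(
    committed_ids=range(98_234_400, 98_234_568),
    shipped_through=98_234_412)
print(standby_max, len(lost))  # 98234412 155
```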

Impact: 155 policy transactions were lost. In insurance terms, these were mostly premium endorsements and mid-term policy changes. They could be recreated from the application's source documents, but it required manual reprocessing.

Note: This was technically a violation of the RPO = 0 requirement. The architecture team had specified SYNC mode during the original HADR design, but a network latency issue six months ago had caused the DBA team to switch to NEARSYNC to avoid application timeouts. This change had not been communicated to the DR planning team.

09:14 — z/OS Data Sharing Group Fully Operational

After recovering to the last image copy and restarting DB2D:

  - All databases were accessible.
  - Application connections were established.
  - The first test transactions completed successfully.

Total z/OS RTO: 2 hours 14 minutes (within the 4-hour target).

09:30 — Full DR Environment Operational

All applications were running on the DR infrastructure:

  - z/OS claims processing: operational, with data loss from 01:30-07:00 AM.
  - LUW policy administration: operational, with loss of 155 transactions.
  - Both systems were within their RTO targets but failed their RPO targets.

Post-Drill Analysis

What Went Right

  1. HADR takeover was fast (14 seconds) and reliable.
  2. Automatic Client Reroute worked for applications that had it configured.
  3. The senior DBAs were able to diagnose and resolve unexpected problems under pressure.
  4. The junior DBA gained invaluable experience observing a real (simulated) crisis.

What Went Wrong

  1. CF structure size mismatch (HIGH). Root cause: CF upgrade without DBA notification. Remediation: add DB2 CFRM validation to the hardware change checklist.

  2. Rating engine without ACR (MEDIUM). Root cause: new application deployed without a DB2 connection review. Remediation: add a DB2 connection audit to the deployment checklist.

  3. Archive log tapes not at the DR site (CRITICAL). Root cause: 28-hour tape replication lag. Remediation: implement real-time archive log replication via z/OS GDPS log shipping.

  4. NEARSYNC data loss of 155 transactions (HIGH). Root cause: mode changed from SYNC without DR team review. Remediation: revert to SYNC mode and solve the network latency issue properly.

The Hard Conversations

Conversation 1: With the CTO

The DBA team had to report that seven years of "successful" DR drills had been masking a critical gap. The archive log replication lag (Problem 3) had existed since the DR site was commissioned, but previous drills had been inadvertently scheduled to avoid exposing it.

The CTO's response was measured but serious: "We have been telling regulators we can achieve zero data loss. That is not true. How do we fix it, and how do we make sure we don't have other gaps we haven't found?"

Conversation 2: With the Development Team

The rating engine's missing ACR configuration (Problem 2) exposed a broader issue: there was no process to ensure new applications met DR requirements before deployment. The development team had followed their own deployment checklist, which did not include database-level DR validation.

Conversation 3: With the z/OS Systems Programming Team

The CF structure mismatch (Problem 1) was a communication failure. The systems programming team had followed their own change management process for the CF upgrade, which did not include notifying database administrators. The existing change management categories did not recognize that a CF hardware change could affect DB2 data sharing.

Remediations Implemented

  1. Real-time archive log shipping: Replaced the nightly tape replication with GDPS-based continuous log shipping. Archive logs are now replicated to the DR site within seconds of creation.

  2. HADR mode reverted to SYNC: The network latency issue was addressed by upgrading the interconnect between data centers from 10 Gbps to 25 Gbps. HADR returned to SYNC mode with no application timeout issues.

  3. Change management integration: Hardware changes to coupling facilities, networks, and storage now require DBA sign-off when DB2 components are involved.

  4. Application deployment checklist updated: All new applications must demonstrate Automatic Client Reroute configuration before production deployment.

  5. DR drill methodology revised: Future drills will be scheduled at random times (not pre-planned weekends) and will include scenarios that stress the replication pipeline.

  6. Quarterly mini-drills: In addition to the annual full drill, quarterly mini-drills test individual components (HADR takeover, CF failover, log recovery) to maintain readiness.

Key Lessons

  1. A DR drill that always succeeds is not testing hard enough. If every drill passes, you may be unconsciously avoiding the scenarios that would expose weaknesses. Vary the timing, the failure mode, and the team composition.

  2. DR is a cross-team discipline. The failures in this drill were not DBA failures — they were communication failures between the DBA team, the systems programming team, the network team, and the development team. DR planning must involve all teams that touch the data path.

  3. NEARSYNC is not SYNC. The name sounds almost the same, but the behavior under failure is different. If your RPO requirement is truly zero, you must use SYNC mode — and you must solve whatever performance issue drove you away from it, rather than downgrading the HA mode.

  4. Document assumptions, not just procedures. The DR plan documented the steps. It did not document the assumptions — that tapes would be replicated, that CF policies would be consistent, that all applications would use ACR. Undocumented assumptions are the most dangerous kind.

  5. The junior DBA's observation was the most valuable outcome. After the drill, the junior DBA said: "I always thought DR was just about having a backup and a standby. I didn't realize how many things have to be exactly right for recovery to work." That understanding — that DR is a system of interdependent components, not a single technology — is the most important lesson in this case study.

Discussion Questions

  1. If the primary data center had experienced a real (not simulated) disaster, would the 28-hour tape replication lag have been recoverable? What data would have been permanently lost?

  2. The CTO asked: "How do we make sure we don't have other gaps?" Propose a methodology for identifying hidden DR gaps that only appear under specific failure conditions.

  3. The decision to switch HADR from SYNC to NEARSYNC was made by the DBA team without consulting the DR planning team. Design a change management process that would prevent this type of unilateral change to DR-critical configurations.

  4. How would you design a DR drill for a system that cannot tolerate any downtime — where even a "test" failover affects real users? Consider the role of isolated testing environments, runbook validation, and tabletop exercises.