Case Study 1: Disaster Recovery Drill — Simulating a Data Center Failure
Background
Meridian National Bank runs its core banking system on IBM DB2 for z/OS in a primary data center in Chicago. A secondary data center in Indianapolis, 290 kilometers away, maintains a synchronized copy of all DB2 data through IBM GDPS with Metro Mirror. The bank's regulatory framework (OCC guidance on business continuity) requires semi-annual disaster recovery testing with documented results.
This case study follows the spring DR drill, conducted on a Saturday in March. The objective: simulate a complete loss of the Chicago data center and verify that Indianapolis can assume full production operations within the 15-minute RTO.
The Environment
Chicago (Primary):
- z/OS 2.5 LPAR with DB2 13 for z/OS
- Data sharing group: DB2PCSG (2 members: DB2A, DB2B)
- Total DB2 data: 2.1 TB across 450 tablespaces
- Active log: 8 data sets × 2 GB, dual logged
- Average transaction rate: 4,200 transactions/second during peak
- GDPS controlling Metro Mirror to Indianapolis
Indianapolis (Secondary):
- Identical z/OS LPAR configuration (cold standby; DB2 not running)
- Metro Mirror copies of all DB2 DASD volumes
- Copy of the BSDS on mirrored volumes
- Network connectivity pre-configured for application failover
- Last successful DR test: 6 months ago (fall drill)
The Drill Plan
The DR team consists of six people: two z/OS system programmers, two DB2 DBAs, one network engineer, and a DR coordinator. All participants have copies of the recovery runbook — a 47-page document updated after each test.
Phase 1: Preparation (T-60 minutes to T-0)
At 06:00 AM, the DR coordinator convenes the team. Production workload has been reduced but not eliminated — the bank's ATM network and online banking system continue to operate against the Chicago DB2 subsystem.
The team reviews the runbook and confirms:
- Indianapolis hardware is operational
- Network paths between Indianapolis and the application servers are configured
- The GDPS automation scripts are loaded and current
- Contact information for escalation is verified
At 06:45, the team notifies the operations center that the drill will begin at 07:00.
Phase 2: Simulated Failure (T=0)
At 07:00:00, the DR coordinator declares a simulated disaster. In a real scenario, the Chicago data center would be lost. For the drill, GDPS is instructed to perform a planned site switch — functionally identical to an emergency failover but without actual hardware destruction.
The GDPS operator issues the site switch command:
CLIST: GDPS_SWITCH_SITE TARGET(INDIANAPOLIS)
GDPS begins the automated failover sequence:
- T+0:00 — GDPS halts Metro Mirror replication
- T+0:05 — GDPS reverses the mirror direction (Indianapolis volumes become primary)
- T+0:12 — GDPS initiates z/OS IPL at Indianapolis using the mirrored system volumes
- T+2:30 — z/OS is operational at Indianapolis
- T+3:00 — GDPS starts the DB2 subsystem:
-DB2A START DB2
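The timed sequence above can be modeled as an ordered list of steps with offsets from the disaster declaration. This is an illustrative sketch only; the step names and the runner are invented for this case study and are not a GDPS scripting interface.

```python
from dataclasses import dataclass

# Hypothetical model of the site-switch timeline; offsets (in seconds from
# T=0) mirror the drill narrative, not real GDPS automation.
@dataclass
class Step:
    offset_s: int   # seconds after T=0 when the step occurs
    action: str

FAILOVER_SEQUENCE = [
    Step(0,   "halt Metro Mirror replication"),
    Step(5,   "reverse mirror direction (Indianapolis becomes primary)"),
    Step(12,  "IPL z/OS at Indianapolis from mirrored volumes"),
    Step(150, "z/OS operational at Indianapolis"),
    Step(180, "start DB2 subsystem (-DB2A START DB2)"),
]

def elapsed_to_db2_start(seq):
    """Return seconds from disaster declaration to the last step (DB2 start)."""
    return max(s.offset_s for s in seq)

print(elapsed_to_db2_start(FAILOVER_SEQUENCE))  # 180 -> T+3:00
```

Keeping the sequence as data rather than hard-coded steps makes it easy to diff the automation against the runbook after each drill.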
Phase 3: DB2 Crash Recovery
DB2 starts and immediately enters crash recovery. The Chicago DB2 was processing transactions at the moment of the switch — those in-flight transactions must be resolved.
DSNR001I  -DB2A RESTART INITIATED
DSNR003I  -DB2A RESTART...PRIOR CHECKPOINT RBA=00000003A5000000
DSNR004I  -DB2A RESTART...UR COUNTS - IN COMMIT=0, INDOUBT=0, INFLIGHT=247, IN ABORT=0
DSNR005I  -DB2A RESTART...COUNTS AFTER FORWARD RECOVERY - IN COMMIT=0, INDOUBT=0
DSNR006I  -DB2A RESTART...COUNTS AFTER BACKWARD RECOVERY - INFLIGHT=0, IN ABORT=0
DSNR002I  -DB2A RESTART COMPLETED
The crash recovery completes in 8.4 seconds. The 247 rolled-back transactions represent the in-flight work at the moment of the switch. These transactions will be retried by the application layer when it reconnects.
At T+3:09, DB2 is operational.
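The redo/undo pattern behind this restart is the classic write-ahead-log recovery algorithm: reapply every logged change, then roll back transactions that have no COMMIT on the log. A toy Python model of that pattern (the record layout is invented for illustration and is not DB2's internal log format):

```python
# Toy write-ahead-log crash recovery: log records are either
# (txid, "SET", key, new_value, before_image) or (txid, "COMMIT").
def crash_recover(log):
    committed = {rec[0] for rec in log if rec[1] == "COMMIT"}
    db = {}
    # Redo phase: reapply every logged change in log order.
    for txid, op, *rest in log:
        if op == "SET":
            key, value, _old = rest
            db[key] = value
    # Undo phase: scan backward and restore before-images for
    # in-flight (uncommitted) transactions.
    rolled_back = set()
    for txid, op, *rest in reversed(log):
        if op == "SET" and txid not in committed:
            key, _value, old = rest
            db[key] = old
            rolled_back.add(txid)
    return db, rolled_back

log = [
    ("T1", "SET", "balance:100", 500, 400),  # committed below
    ("T1", "COMMIT"),
    ("T2", "SET", "balance:200", 900, 800),  # in flight at the failure
]
db, rolled_back = crash_recover(log)
print(db)           # {'balance:100': 500, 'balance:200': 800}
print(rolled_back)  # {'T2'}
```

The 247 rolled-back transactions in the drill correspond to the in-flight set here: their before-images are restored, and the application retries them after reconnecting.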
Phase 4: Application Reconnection
The network engineer activates the pre-configured routes that direct application traffic to Indianapolis:
- T+3:30 — CICS regions reconnect to DB2 at Indianapolis
- T+4:15 — Batch job scheduler is pointed to Indianapolis
- T+5:00 — Online banking middleware is reconfigured
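Reconnection logic on the application side typically retries against the new primary with backoff, since routes and listeners come up at slightly different times. A minimal sketch, with a placeholder connect function (the hostname and connect API are hypothetical, not the bank's middleware):

```python
import time

# Illustrative reconnection loop after a site switch: retry with capped
# exponential backoff until the new primary answers.
def reconnect(connect, host, attempts=5, base_delay=0.1):
    """Try connect(host), doubling the delay after each failure."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect(host)
        except ConnectionError:
            if attempt == attempts:
                raise
            time.sleep(delay)
            delay = min(delay * 2, 2.0)  # cap the backoff at 2 seconds

# Simulated connect: the new primary only answers from the third try on.
calls = {"n": 0}
def fake_connect(host):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("route not yet active")
    return f"session to {host}"

print(reconnect(fake_connect, "db2-indianapolis"))  # session to db2-indianapolis
```

A bounded retry like this also surfaces failover problems quickly: if the route never activates, the final attempt re-raises instead of hanging.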
The DB2 DBA verifies database integrity by running a series of canary queries:
-- Verify core tables are accessible
SELECT COUNT(*) FROM MERIDIAN.CUSTOMER; -- Returns 2,847,293
SELECT COUNT(*) FROM MERIDIAN.ACCOUNT; -- Returns 5,123,847
SELECT MAX(TXN_TIMESTAMP) FROM MERIDIAN.TRANSACTION; -- Returns 06:59:58.234
The MAX(TXN_TIMESTAMP) result shows that the last committed transaction was processed at 06:59:58 — just 2 seconds before the simulated failure. The 247 uncommitted transactions from that final 2-second window were rolled back and will be retried. Zero committed transactions were lost. RPO = 0 achieved.
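The RPO sanity check can be reduced to one comparison: the gap between the newest committed timestamp at the secondary and the declared failure instant must fit inside the expected in-flight window. A sketch with illustrative values (the date and the 5-second tolerance are assumptions, not drill parameters):

```python
from datetime import datetime

# Gap between the last committed transaction seen at the secondary and the
# failure instant. A gap larger than the in-flight window would suggest
# committed work was lost (RPO > 0).
def rpo_gap_seconds(last_committed, failure_time):
    return (failure_time - last_committed).total_seconds()

failure_time = datetime(2024, 3, 16, 7, 0, 0)              # simulated disaster, T=0
last_committed = datetime(2024, 3, 16, 6, 59, 58, 234000)  # MAX(TXN_TIMESTAMP)

gap = rpo_gap_seconds(last_committed, failure_time)
print(round(gap, 3))   # 1.766
print(gap <= 5.0)      # within the expected in-flight window -> True
```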
Phase 5: Production Validation
At T+8:00 (07:08:00 AM), the DR coordinator announces that Indianapolis is operational. A subset of production transactions is routed to the Indianapolis system:
- ATM withdrawals: processed successfully
- Online banking login and balance inquiry: successful
- Funds transfer: successful
- New account opening: successful
The team monitors for 30 minutes, watching for:
- DB2 message logs for any errors
- Transaction response times (within 15% of normal; the slight increase is expected due to the geographic distance of the application servers)
- Buffer pool hit ratios (starting low as caches warm up, expected to stabilize within 15-20 minutes)
- Lock contention (normal levels)
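Checks like these are easier to audit when expressed as explicit pass/fail rules. A sketch of two of them; the 15% tolerance comes from the drill criteria, while the sample readings and the 0.90 hit-ratio floor are illustrative assumptions:

```python
# Response time must stay within `tolerance` of the normal baseline.
def response_time_ok(observed_ms, baseline_ms, tolerance=0.15):
    return observed_ms <= baseline_ms * (1 + tolerance)

# Buffer pool hit ratio starts low (cold caches) but should be
# non-decreasing and settle above a floor as the caches warm up.
def hit_ratio_trend_ok(samples, final_floor=0.90):
    non_decreasing = all(b >= a for a, b in zip(samples, samples[1:]))
    return non_decreasing and samples[-1] >= final_floor

print(response_time_ok(observed_ms=112, baseline_ms=100))  # 12% over  -> True
print(response_time_ok(observed_ms=120, baseline_ms=100))  # 20% over  -> False
print(hit_ratio_trend_ok([0.42, 0.71, 0.88, 0.95]))        # warming up -> True
```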
Phase 6: Failback
At 08:00, the team begins the failback to Chicago. This is the reverse of the failover:
- GDPS re-establishes Metro Mirror from Indianapolis to Chicago
- Resynchronization of the mirrored volumes: Metro Mirror recopies only the tracks changed during the drill period, which takes approximately 45 minutes
- Once synchronized, GDPS performs a planned site switch back to Chicago
- DB2 restarts at Chicago — another brief crash recovery for the in-flight work during the switch
- Applications reconnect to Chicago
By 09:30, production is fully restored to Chicago.
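The resynchronization time can be roughly estimated from the volume of changed data and the effective replication bandwidth. Both inputs below are assumptions chosen to illustrate the arithmetic, not measured drill values:

```python
# Back-of-envelope Metro Mirror resync estimate: only tracks changed during
# the drill window must be recopied.
def resync_minutes(changed_gb, link_mb_per_s):
    """Time to recopy `changed_gb` gigabytes over an effective link rate."""
    return (changed_gb * 1024) / link_mb_per_s / 60

# e.g. ~50 GB changed during the drill over an effective 20 MB/s link
print(round(resync_minutes(changed_gb=50, link_mb_per_s=20), 1))  # 42.7 minutes
```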
Issues Discovered
No DR test is perfect. This drill revealed three issues:
Issue 1: Batch Job Scheduler Delay
The batch job scheduler took 90 seconds longer than expected to redirect to Indianapolis. Investigation revealed a DNS cache that was not flushed during the failover automation. The team added a cache flush step to the runbook.
Issue 2: Missing Archive Log Copy
During the drill, the team noticed that one archive log copy from the previous week was missing from Indianapolis. The log had been offloaded to tape at Chicago but the tape was not yet vaulted to Indianapolis. In a real disaster, this would not have affected recovery (the mirrored active logs were sufficient), but it would have complicated any point-in-time recovery to a date before that log range. The team updated the tape vaulting schedule from weekly to daily.
Issue 3: Runbook Page Reference Error
Page 23 of the runbook referenced a CICS region name that had been renamed three months ago. The command worked only because the operator recognized the error and used the correct name. The team updated the runbook and added a quarterly runbook review to the maintenance calendar.
Results Summary
| Metric | Target | Actual | Status |
|---|---|---|---|
| Total failover time | < 15 minutes | 8 minutes | PASS |
| Committed transactions lost | 0 | 0 | PASS |
| Application reconnection | < 10 minutes | 5 minutes | PASS |
| Transaction accuracy | 100% | 100% | PASS |
| Issues discovered | — | 3 (none critical) | DOCUMENTED |
Lessons for the Reader
- DR testing reveals problems that planning alone cannot. The DNS cache issue and the runbook error would never have been found without an actual test. Every organization that tests discovers issues. Every organization that does not test carries unknown risks.
- The 15-minute RTO was achievable because of automation. GDPS automated the complex sequence of mirror reversal, IPL, and DB2 startup. Manual execution of these steps would take 45-60 minutes. Automation is not optional for aggressive RTO targets.
- RPO = 0 was achieved through synchronous mirroring. Every committed transaction that existed at Chicago also existed at Indianapolis, because Metro Mirror wrote every change to both sites before acknowledging the COMMIT. This comes at the cost of a few milliseconds of added COMMIT latency (290 km of fiber is roughly 1.5 ms one way, and a synchronous write needs the round trip), a trade-off Meridian Bank accepted for its core banking system.
- Crash recovery was the fastest phase. At 8.4 seconds, DB2's crash recovery was negligible compared to the infrastructure startup time. The checkpoint interval (LOGLOAD) was tuned to keep the recovery window small: approximately 15,000 log records, or less than 10 seconds of crash recovery processing.
- Failback is as important as failover. The drill included a full failback to Chicago. In a real disaster, the bank might run from Indianapolis for days or weeks. The failback procedure must be tested to ensure a smooth return to the primary site.
- Document everything. The drill report, including all three issues and their resolutions, was filed with the compliance team and shared with the entire IT organization. This documentation satisfies regulatory requirements and builds institutional knowledge.
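The LOGLOAD sizing above can be sanity-checked with simple arithmetic: how often a 15,000-record checkpoint interval fires at the drill's transaction rate, and how long replaying a full interval takes. The records-per-transaction and replay-rate figures are assumptions for illustration only:

```python
# Worst case, crash recovery must reprocess one full LOGLOAD interval of
# log records written since the last checkpoint.
def recovery_window_seconds(logload_records, replay_records_per_s):
    return logload_records / replay_records_per_s

# How many seconds of work accumulate before LOGLOAD triggers a checkpoint.
def seconds_between_checkpoints(logload_records, tx_per_s, records_per_tx):
    return logload_records / (tx_per_s * records_per_tx)

# At 4,200 tx/s and an assumed ~3 log records per transaction,
# checkpoints fire roughly every 1.2 seconds:
print(round(seconds_between_checkpoints(15000, tx_per_s=4200, records_per_tx=3), 2))
# At an assumed replay rate of 2,000 records/s, the recovery window is:
print(recovery_window_seconds(15000, replay_records_per_s=2000))  # 7.5 seconds
```

Under these assumptions the window stays below the 10-second figure quoted above; tightening LOGLOAD shrinks the window further at the cost of more frequent checkpoint overhead.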
This case study is based on real-world DR practices at major financial institutions. The specific numbers and configurations are representative of production environments.