Case Study 1: Continental National Bank's DR Architecture and Annual Test
Background
Continental National Bank's disaster recovery architecture didn't emerge from a single design session. It evolved over seven years, driven by three forces: regulatory pressure, real incidents, and the hard lessons of DR testing.
In 2017, when CNB completed its Parallel Sysplex migration (Chapter 1, Case Study 1), the DR plan was simple — almost embarrassingly so, in retrospect. Nightly tape backups shipped to an offsite vault. A cold standby site in Raleigh with enough hardware to run a reduced-capacity configuration. A recovery procedure that had been tested exactly once, in 2015, and had achieved an RTO of 14 hours.
Fourteen hours. For a Tier-1 bank processing 350 million transactions per day.
"That 14-hour RTO was the number that changed everything," Kwame Mensah recalls. "I put it on a slide and showed it to the board of directors. I said: 'If our Charlotte data center is destroyed at 9 AM on a Monday, we will not be able to process a single transaction until 11 PM that night. No ATMs. No wire transfers. No online banking. For fourteen hours. And that assumes the tape restore goes perfectly, which it won't.' The board approved the GDPS project in that meeting."
The Architecture Evolution
Phase 1: Metro Mirror (2018)
The first phase was straightforward: establish synchronous replication from Charlotte to Raleigh using GDPS/Metro Mirror.
Infrastructure investment:

- Raleigh data center: two z14 LPARs (CNBDR01, CNBDR02) with 60% of Charlotte's processing capacity
- Storage: IBM DS8880 with 47 TB of mirrored DASD
- Network: four 16 Gbps FICON links between Charlotte and Raleigh (43 km), diverse path routing through two separate fiber providers
- GDPS/Metro Mirror licensing and implementation services
Design decisions:
- Which volumes to mirror. Not everything. The team identified three categories:
  - Must mirror (Tier 0/1): DB2 data and log volumes, CICS journals and system datasets, MQ page sets and log datasets, RACF database, system catalogs. Total: 31 TB.
  - Should mirror (Tier 2): Batch input/output datasets, VSAM master files, generation data groups used in the nightly cycle. Total: 12 TB.
  - Don't mirror (Tier 3): Work files, sort work, spool, temporary datasets. Total: 4 TB.

  The "don't mirror" decision saved approximately $180,000/year in storage costs and reduced replication bandwidth requirements by 8%.
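The tiering rule above is essentially a classification by dataset type. A minimal sketch of how such a classification might be scripted, assuming illustrative dataset-name prefixes (CNB's real naming standard is not given in the text):

```python
# Hypothetical sketch: classify datasets into mirroring tiers by name
# prefix. The prefixes below are invented for illustration; a real shop
# would drive this from its own naming standard and dataset inventory.
TIER_RULES = [
    ("MUST_MIRROR", ("DB2P.", "CICSP.JRNL", "MQP.", "SYS1.RACF", "CATALOG.")),
    ("SHOULD_MIRROR", ("BATCH.", "VSAM.MASTER", "GDG.")),
]

def mirror_tier(dataset_name: str) -> str:
    """Return the mirroring tier for a dataset; anything unmatched defaults
    to DONT_MIRROR (work files, sort work, spool, temporary datasets)."""
    for tier, prefixes in TIER_RULES:
        if dataset_name.startswith(prefixes):
            return tier
    return "DONT_MIRROR"
```

The default-to-don't-mirror rule is deliberately conservative in the wrong direction for DR; this is exactly why the monthly reconciliation action item later in the case study matters.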
- Warm standby vs. cold standby. The Raleigh LPARs run z/OS in a minimal state — IPLed, TCP/IP active, GDPS SA monitoring active, but no production subsystems running. This cuts the IPL time from the recovery sequence (saving 8-12 minutes) while avoiding the complexity and licensing cost of running active DB2/CICS/MQ at the DR site.
- Network architecture. The ATM network, wire transfer interfaces (Fedwire, SWIFT), and online banking traffic all route through the Charlotte data center. During failover, DNS and network routing must be updated to point to Raleigh. The team automated this with pre-staged DNS entries and BGP route advertisements that GDPS triggers during failover.
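The pre-staged cutover idea can be sketched as follows. The service names and addresses are invented for illustration; the point is that the Raleigh addresses exist in advance, so failover only selects which set to publish rather than composing records under pressure:

```python
# Hypothetical sketch of pre-staged DNS failover data. Every production
# service name carries both a Charlotte and a Raleigh address; the
# failover automation just picks a column and pushes it to DNS.
PRESTAGED = {
    "atm-gw.cnb.example":  {"charlotte": "10.1.0.10", "raleigh": "10.2.0.10"},
    "wire-gw.cnb.example": {"charlotte": "10.1.0.20", "raleigh": "10.2.0.20"},
    "online.cnb.example":  {"charlotte": "10.1.0.30", "raleigh": "10.2.0.30"},
}

def failover_records(target_site: str) -> list[tuple[str, str]]:
    """Return the (name, address) pairs to publish for the target site."""
    return [(name, sites[target_site]) for name, sites in PRESTAGED.items()]
```

Failback reuses the same table with `target_site="charlotte"`, which is one reason pre-staging beats ad hoc DNS edits in both directions.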
First test (2018-Q4): RTO was 47 minutes. Major issues:

- DB2 lock structure rebuild took 22 minutes (lock structure was not duplexed at Raleigh)
- CICS cold start instead of warm start (CICS system datasets at Raleigh were out of maintenance sync)
- MQ channel definitions had hard-coded Charlotte IP addresses
Phase 2: HyperSwap (2019)
After the first Metro Mirror test revealed the 47-minute RTO, Kwame pushed for HyperSwap to handle the most common failure scenario: storage subsystem failure at the primary site.
What changed:

- GDPS/HyperSwap manager deployed on the GDPS controlling system
- Metro Mirror configuration updated to support HyperSwap (requires specific configuration attributes on the Metro Mirror sessions)
- Testing: simulated primary storage controller failure — HyperSwap completed in 1.8 seconds. Applications didn't notice.
Key learning: HyperSwap and Metro Mirror failover serve different purposes. HyperSwap handles storage failure (transparent, near-zero RTO). Metro Mirror failover handles site failure (requires IPL, 15+ minute RTO). You need both.
Phase 3: XRC to Dallas (2020)
The OCC examiner's question in 2020 was blunt: "What happens if a hurricane takes out both Charlotte and Raleigh?"
Charlotte and Raleigh are 43 km apart. A major hurricane making landfall on the North Carolina coast could conceivably affect both sites — through power grid disruption if not direct damage. The OCC expected a third site at continental distance.
Infrastructure investment:

- Dallas data center (leased colocation space): two z15 LPARs (CNBDR03, CNBDR04) with 40% of Charlotte's processing capacity
- Storage: IBM DS8950 with 47 TB for XRC replication targets
- Network: 1 Gbps dedicated WAN circuit, encrypted, from Charlotte to Dallas
- GDPS/XRC licensing
Design decisions:
- XRC consistency group management. All replicated volumes are in a single consistency group with a consistency timestamp. This ensures that if CNB fails over to Dallas, all volumes are at the same point in time — even though individual write operations arrived at different times due to asynchronous replication.
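The consistency-group behavior can be illustrated with a toy model. Each volume has received writes up to some timestamp; the group is only consistent up to the oldest of those high-water marks, so recovery must discard anything newer (volume names and timestamps here are invented):

```python
# Toy model of the consistency-group idea: with asynchronous replication,
# different volumes are current to different points in time, and the group
# consistency timestamp is the minimum of those per-volume high-water marks.
def consistency_timestamp(last_write_ts: dict[str, float]) -> float:
    """Latest point at which every volume in the group is complete."""
    return min(last_write_ts.values())

def recoverable_writes(writes: list[tuple[str, float]], cutoff: float):
    """Keep only writes at or before the group consistency timestamp;
    newer writes must be discarded to preserve cross-volume consistency."""
    return [(vol, ts) for vol, ts in writes if ts <= cutoff]
```

This is also why the lag monitoring described below watches the consistency timestamp rather than any single volume: the oldest volume defines the recovery point for the whole group.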
- Reduced capacity at Dallas. The Dallas site runs at 40% of Charlotte's capacity. After a regional disaster, CNB would operate in degraded mode:
  - Online transactions: 100% (reduced capacity is still sufficient for the expected reduced volume)
  - Batch processing: Tier 1 and Tier 2 only — Tier 3 batch deferred until failback
  - Regulatory reporting: available but slower
- XRC lag monitoring. A custom monitoring job runs every 5 minutes, checking the XRC consistency group timestamp against current time. If the lag exceeds 30 seconds, it triggers a warning. If it exceeds 60 seconds, it triggers a Sev-2 alert. During the batch window, thresholds are relaxed to 60/120 seconds.
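The threshold logic just described can be sketched as follows, assuming the lag has already been computed from the XRC consistency-group timestamp (reading that timestamp is site-specific and not shown):

```python
# Minimal sketch of CNB's stated lag thresholds: warn above 30 s and
# raise Sev-2 above 60 s in normal operation; relax to 60/120 s during
# the batch window. Only the decision rule is modeled here.
def lag_severity(lag_seconds: float, in_batch_window: bool) -> str:
    """Map replication lag to an alert level per the stated thresholds."""
    warn, sev2 = (60, 120) if in_batch_window else (30, 60)
    if lag_seconds > sev2:
        return "SEV2"
    if lag_seconds > warn:
        return "WARNING"
    return "OK"
```

Making the batch-window relaxation explicit in code, rather than suppressing alerts by hand each night, keeps the monitoring honest during the hours when lag is naturally highest.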
Phase 4: Ransomware Protection (2022)
After several high-profile ransomware attacks on financial institutions in 2021, CNB added a fourth layer of protection specifically targeting data corruption.
What changed:

- FlashCopy snapshots: Before and after each nightly batch cycle, a FlashCopy snapshot of all Tier 0/1 DB2 volumes is created. These snapshots are retained for 30 days. A FlashCopy provides an instant, space-efficient, point-in-time copy of a volume.
- Immutable backups: Weekly full image copies of all DB2 tablespaces written to virtual tape (TS7770) with WORM (Write Once Read Many) policy — the backups cannot be modified or deleted for 90 days, even by a storage administrator.
- Air-gapped tape vault: Monthly full backups on physical tape, transported to a vault facility with no network connectivity to any CNB data center.
- Corruption detection: Nightly batch job that runs DB2 CHECK DATA and CHECK INDEX on critical tablespaces. If corruption is detected, it alerts immediately and blocks the next batch cycle from starting.
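The corruption-detection gate is, at its core, a simple decision rule. A hedged sketch, with `check_results` standing in for the parsed outcome of the DB2 CHECK DATA / CHECK INDEX jobs (the parsing itself is not shown):

```python
# Hypothetical sketch of the gating idea: the next batch cycle is allowed
# only if every critical tablespace passed its integrity check; any
# failures are returned so alerting can name the corrupt objects.
def batch_cycle_allowed(check_results: dict[str, bool]) -> tuple[bool, list[str]]:
    """check_results maps tablespace name -> True if the check was clean.
    Returns (allowed, list of corrupt tablespaces to alert on)."""
    corrupt = [ts for ts, clean in check_results.items() if not clean]
    return (len(corrupt) == 0, corrupt)
```

Blocking the batch cycle, not just alerting, is the important design choice: it stops another night of processing from being layered on top of corrupt data before anyone responds.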
Lisa Tran led this phase: "Replication is your enemy when the threat is data corruption. Metro Mirror faithfully replicates corrupt data to Raleigh. XRC faithfully replicates it to Dallas. Within seconds, all three copies of your data are corrupt. The only defense is a copy that was made before the corruption — and that copy has to be somewhere the attacker can't reach."
The 2023 Annual DR Test
CNB's most comprehensive DR test to date was conducted on September 14, 2023 — a Level 3 planned site failover. This was the test that achieved the 11-minute RTO referenced in Section 30.6.
Pre-Test Preparation (2 weeks before)
Test scope: Full site failover from Charlotte to Raleigh. Charlotte would be treated as completely unavailable. All production workload would run on Raleigh for a 6-hour window.
Notifications:

- OCC: 30 days advance notice (regulatory requirement for a Tier-1 bank DR test)
- ATM network (STAR/Pulse): 14 days advance notice
- Wire transfer networks (Fedwire, SWIFT): 14 days advance notice
- Customer notification: None (test during low-volume window, no expected customer impact)
- Board of directors: Informed at monthly meeting
Team:

- Test director: Kwame Mensah
- DB2 recovery lead: Lisa Tran
- Batch recovery lead: Rob Calloway
- CICS lead: Sarah Kim (CICS systems programmer)
- MQ lead: David Park (MQ administrator)
- Network lead: James Wright (network engineer)
- GDPS operator: Maria Santos (storage/GDPS specialist)
- Observers: Internal audit (2), OCC examiner (1), external auditor (1)
Test Execution
20:00 — Pre-test briefing. All team members in the Charlotte NOC (Network Operations Center). Review of test plan, success criteria, abort criteria, and communication plan. Each team member confirmed understanding of their role and verified access to runbooks.
20:15 — Pre-test verification.

- Metro Mirror: all pairs in-sync (verified by Maria Santos)
- Raleigh LPARs: IPLed and in standby (verified by Kwame)
- XRC to Dallas: lag at 7 seconds (normal)
- All team members' phones tested (call and SMS)
20:30 — Quiesce primary site.

- CICS: drain online transactions. Verify transaction count drops to zero.
- Batch: hold all pending job submissions.
- MQ: stop channels gracefully.

This took 4 minutes (planned: 5 minutes).
20:34 — Initiate GDPS failover.
Maria Santos issued the GDPS failover command from the GDPS controlling system. The automated sequence:
- T+0 sec: GDPS freezes all Metro Mirror pairs
- T+3 sec: GDPS verifies secondary volumes are consistent
- T+5 sec: GDPS issues HMC commands to activate production configuration on Raleigh LPARs
- T+8 sec: z/OS on CNBDR01 and CNBDR02 begins activating production subsystems
20:34 — 20:38: DB2 startup (4 minutes)
DB2 group restart on the Raleigh members. The previous Charlotte members' in-flight work was backed out automatically using the replicated log datasets. Lisa Tran verified:
- -DIS DDF → both Raleigh members showing STARTD
- -DIS THREAD(*) → 0 active threads (expected — no applications connected yet)
- -DIS DATABASE(*) → all databases in RW status
20:38 — 20:42: CICS startup (4 minutes)
CICS regions on Raleigh started via SA z/OS automation:

- TOR: started, listening on production port
- AOR1, AOR2: warm started successfully
- AOR3: warm start failed — CICS journal dataset not found
Issue: AOR3's journal volume had been excluded from Metro Mirror replication in a storage reconfiguration two months earlier. Nobody had noticed because AOR3 was a backup region that rarely processed transactions. Sarah Kim cold-started AOR3, discarding journal data whose updates had already been committed to DB2, so no business data was lost. Total delay: 3 additional minutes.
Post-test action item: Reconcile Metro Mirror volume list against CICS dataset inventory monthly.
20:42 — 20:45: MQ and network (3 minutes)
- MQ queue managers started, shared queues available
- MQ channels started automatically (previous tests had required manual restart — automated in GDPS policy after 2021 test)
- Network team verified ATM routing, Fedwire connectivity, online banking DNS
20:45 — Total RTO: 11 minutes from failover initiation.
Validation Testing (20:45 — 02:00)
Synthetic transaction testing (20:45 — 21:00):

- Balance inquiry (CICS → DB2): 500 transactions, 100% success, avg response 12 ms
- Fund transfer (CICS → DB2 → MQ): 200 transactions, 100% success, avg response 45 ms
- ATM authorization: 100 simulated ATM requests, 100% success, avg response 85 ms
Live transaction testing (21:00 — 22:00):

- Restored online banking access (small user population — Saturday night)
- 3,247 real customer transactions processed successfully
- No customer-reported issues
Batch testing (22:00 — 02:00):

- Rob Calloway submitted a reduced batch cycle (Tier 1 and Tier 2 jobs only)
- 187 of 312 nightly jobs executed
- All completed successfully
- Batch window at Raleigh: 4 hours (same as Charlotte — capacity was sufficient)
Regulatory reporting test (01:00 — 02:00):

- Generated simulated OCC Call Report
- Generated simulated FFIEC 031 report
- Both produced valid output
Failback (02:00 — 04:00)
02:00 — Quiesce DR site. Same process as the original quiesce — drain CICS, hold batch, stop MQ channels.
02:08 — Initiate failback. GDPS reversed the Metro Mirror direction — Raleigh volumes became the source, Charlotte volumes the target. Resynchronization began immediately. Because only 6 hours of changes had accumulated (and the test workload was light), resynchronization completed in 22 minutes.
02:30 — Charlotte production subsystems restarted. DB2, CICS, MQ started on Charlotte LPARs. Verification identical to the failover verification.
02:42 — Failback complete. Charlotte resumed as primary. Metro Mirror resynchronization from Charlotte to Raleigh restarted. Full resync (verifying all volumes) completed at 04:15.
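The 22-minute incremental resync is consistent with a back-of-envelope model: time is roughly the changed data divided by effective link throughput. A sketch with an assumed efficiency derating (the numbers are illustrative, not CNB's capacity model):

```python
# Back-of-envelope estimator, with invented parameters: incremental
# resync time ~= changed data / (link bandwidth * efficiency). The
# efficiency factor lumps together protocol overhead and contention.
def resync_minutes(changed_gb: float, link_gbps: float,
                   efficiency: float = 0.7) -> float:
    """Estimate minutes to copy changed_gb over a link_gbps link,
    derated by an assumed efficiency factor (bits converted to bytes)."""
    effective_gb_per_sec = link_gbps * efficiency / 8
    return changed_gb / effective_gb_per_sec / 60
```

A model like this is also how the post-test action item ("adjust the failback target or reduce the batch workload") can be quantified: more batch I/O during the test means more changed tracks to resync.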
Post-Test Review
Attendees: Full test team plus internal audit, external auditor, and OCC examiner.
Findings:
| Finding | Severity | Owner | Deadline |
|---|---|---|---|
| AOR3 journal volume excluded from Metro Mirror | High | Maria Santos (storage) | Oct 15 |
| Metro Mirror volume list not reconciled against application datasets since June | Medium | Maria Santos + Sarah Kim | Monthly recurring |
| XRC lag to Dallas increased to 45 seconds during failback resync — bandwidth contention | Low | James Wright (network) | Nov 30 |
| Test didn't validate IVR (Interactive Voice Response) system failover | Medium | David Park (MQ/voice) | Next test |
| Observer from OCC noted that runbook step 7.3 references a phone number that's been disconnected | Low | Kwame Mensah | Oct 1 |
Metrics summary:
| Metric | Target | Actual | Status |
|---|---|---|---|
| Failover RTO (Tier 0/1) | < 15 min | 11 min | Pass |
| RPO | Zero | Zero | Pass |
| Synthetic transaction success | 100% | 100% | Pass |
| Live transaction success | 100% | 100% | Pass |
| Batch completion (Tier 1+2) | 100% | 100% | Pass |
| Failback RTO | < 30 min | 34 min | Fail |
| Regulatory report generation | Yes | Yes | Pass |
| Runbook deviations | < 5 | 2 | Pass |
The one failure: Failback took 34 minutes against a 30-minute target. Root cause: the Metro Mirror resynchronization after failback took 22 minutes instead of the expected 15 minutes, because the batch test had generated more I/O than planned. Action: adjust failback time target to 45 minutes (more realistic) or reduce batch workload during test.
OCC examiner's assessment (verbal): "This is a well-run DR program. The AOR3 finding is exactly the kind of issue that testing is supposed to uncover. I expect to see the volume reconciliation process formalized by my next examination."
Lessons Learned (Kwame's Perspective)
"Seven years and fourteen DR tests have taught me five things:
First, every test finds something. If your test finds nothing, either you didn't test hard enough or you're not looking. The AOR3 journal volume issue was hiding for two months. If we hadn't tested, it would have been waiting for us during a real disaster.
Second, the failback is harder than the failover. Everyone focuses on getting to the DR site. Nobody practices getting back. Our failback has missed its target in four of our fourteen tests.
Third, people are the bottleneck, not technology. GDPS failover takes 30 seconds. The human decisions around it take 10 minutes. Improving our RTO from 47 minutes to 11 minutes was mostly about streamlining human processes — pre-authorizations, automated notifications, runbook simplification.
Fourth, the thing that kills you in a real disaster is the thing you forgot to replicate. First it was CICS datasets out of maintenance sync. Then it was MQ channel definitions with hard-coded IP addresses. Then it was the AOR3 journal volume. Every time, it's something small that nobody thought to check. The reconciliation process that validates 'everything at the DR site matches production' is the single most important DR process we have.
Fifth, the DR test is the best training your team will ever get. The junior operators who participate in DR tests learn more about the system in one night than in six months of normal operations. Make DR testing a development opportunity, not a chore."
Discussion Questions
- CNB's DR architecture costs approximately 35% of their total mainframe operational budget. At what point does the cost of DR protection exceed the value of the assets being protected? How would you make this determination for a Tier-1 bank?
- The AOR3 journal volume issue was introduced by a routine storage reconfiguration two months before the DR test. What change management process could have caught this? How do you ensure that every infrastructure change is evaluated for DR impact?
- Kwame says "the failback is harder than the failover." Why? What makes failback operationally more complex than failover? Design a failback testing program that addresses this gap.
- The OCC examiner observed the DR test. How does regulatory observation change the dynamics of a DR test? Does it help or hurt the test's value as a learning exercise?
- CNB uses a multi-target DR architecture (Metro Mirror to Raleigh, XRC to Dallas). Under what scenario would both Charlotte and Raleigh be simultaneously unavailable? How likely is this scenario, and does the XRC/Dallas investment represent good risk management?