
Learning Objectives

  • Design DR architectures using GDPS (HyperSwap, XRC, Metro Mirror) for z/OS environments
  • Analyze Sysplex failure domains and design for survivability at each level
  • Create runbooks for DR scenarios (site failure, LPAR failure, subsystem failure, data corruption)
  • Design and execute DR tests that validate recovery capabilities
  • Architect the DR plan for the HA banking system progressive project

"I've been through three real disasters and about forty DR tests. The disasters taught me more. A DR plan that hasn't been tested is a collection of wishes. A DR plan that's been tested once is a dangerous source of false confidence. A DR plan that's tested quarterly, updated after every test, and audited annually — that's a plan." — Kwame Mensah, Chief Architect, Continental National Bank

Chapter Overview

At 4:17 PM on a Friday in September 2023, a construction crew drove a backhoe through a fiber conduit in a business park outside Charlotte, North Carolina. The conduit carried sixty-four strands of dark fiber connecting Continental National Bank's primary data center to its disaster recovery site forty-three miles away. In less than a second, CNB's GDPS/Metro Mirror lost synchronous replication to every production DASD volume.

The GDPS automation detected the fiber loss in 1.2 seconds. Kwame Mensah's phone buzzed with a Sev-1 alert 3.8 seconds after the cut. By the time he opened the alert, GDPS had already completed its automated response: the production Sysplex continued processing on primary volumes, replication was suspended, and the DR site's volumes were marked stale.

No transactions were lost. No users noticed. The ATMs kept dispensing cash. The wire transfer queue kept processing. The CICS regions never saw a blip.

But Kwame wasn't relieved — he was worried. Because he knew something that the operations team celebrating the "non-event" didn't fully grasp: CNB had just lost its disaster recovery capability. If the primary data center suffered a fire, flood, or power failure in the next 72 hours — the time it took to repair the fiber and resynchronize 47 terabytes of DASD — the bank would be dead. Not "degraded." Not "impacted." Dead. No core banking. No ATMs. No wire transfers. No regulatory reporting. A Tier-1 bank with $380 billion in assets, blind and paralyzed.

That weekend, Kwame rewrote CNB's DR risk communication procedures. Monday morning, he sat in the CTO's office and explained why a severed fiber was a Sev-1 event even though nothing appeared to be broken.

This chapter is about understanding why Kwame was right to be worried — and how to design, document, test, and govern a DR architecture that protects your enterprise when the unthinkable happens.

What you will learn in this chapter:

  1. How to analyze business requirements (RTO/RPO) and work backward to technology choices — not the other way around
  2. GDPS architecture in depth: HyperSwap, XRC, Metro Mirror, and Global Mirror — what each provides, what each costs, and when to use which
  3. Sysplex failure domain analysis: every component that can fail independently, and how to design survivability at each level
  4. How to write DR runbooks that work at 3 AM when the person executing them is exhausted and scared
  5. How to design and execute DR tests that actually validate your recovery capability
  6. DR governance and compliance frameworks that satisfy regulators without drowning your team in paperwork

Learning Path Annotations:

  • 🏃 Fast Track: If you've managed GDPS environments, skim Sections 30.1–30.3 and focus on Sections 30.5 and 30.6 — the runbook and testing sections are where most shops fall short.
  • 🔬 Deep Dive: If DR architecture is new territory, read sequentially. The RTO/RPO analysis in Section 30.2 is the foundation. If you skip it and jump to GDPS technology, you'll make the same mistake most shops make: choosing technology before understanding requirements.

Spaced Review — Concepts from Earlier Chapters:

📊 From Chapter 1 (Parallel Sysplex): Recall the Parallel Sysplex architecture — coupling facilities, XCF signaling, GRS, and data sharing. In this chapter, you'll analyze every one of those components as potential failure points. A coupling facility that enables high availability within a site becomes a failure domain you must account for in your DR design. The Sysplex that protects you from component failure doesn't protect you from site failure — that's what GDPS is for.

📊 From Chapter 13 (CICS Architecture): Remember the CICS region topology — TOR/AOR/FOR/MRO. When we discuss failure domains in Section 30.4, you'll see why the decision to put all AORs on one LPAR vs. spreading them across LPARs is a DR decision as much as a performance decision. Region recovery, which Chapter 13 covered for normal operations, becomes critical during partial DR scenarios.

📊 From Chapter 23 (Batch Architecture): The batch window's dependency graph and critical path analysis from Chapter 23 reappear in DR planning. After a site failover, you don't just restart the batch — you restart it on different hardware, with potentially different performance characteristics, and you need to know which jobs can be skipped and which must be rerun from their last checkpoint. Rob Calloway's critical path analysis becomes the DR batch recovery plan.


30.1 When the Data Center Goes Dark

Let me tell you about the two kinds of DR events most mainframe architects never experience and the one kind they will.

The kind you'll never see: a full site failure caused by a natural disaster. I've been in this business since 1998, and I've known exactly three organizations that had to invoke full site DR for a genuine catastrophe — a hurricane, a flood, and a gas explosion. Three. In twenty-seven years.

The other kind you'll never see: a simultaneous multi-site failure. If both your production site and DR site go down at the same time, your DR plan isn't your biggest problem — the civilization-ending event that took out two geographically separated data centers is.

The kind you will see, and probably already have: the partial failure. An LPAR goes down. A storage subsystem loses a controller. A coupling facility fails. A DB2 member abends. A network switch dies. A fiber gets cut. These happen regularly — quarterly at minimum, monthly at a large shop. And here's the uncomfortable truth: most "disasters" that bring down production systems aren't disasters at all. They're ordinary component failures that cascade because the architecture wasn't designed to contain them.

This is why the chapter opens with a fiber cut, not a hurricane.

The Spectrum of Failure

Failure isn't binary. It exists on a spectrum, and your DR architecture must address every point on it:

| Failure Level | Example | Typical Frequency | Recovery Approach |
|---|---|---|---|
| Component | Single disk, single HBA, single port | Weekly | Hardware redundancy (RAID, multipath) |
| Subsystem | DB2 member abend, CICS AOR failure, MQ queue manager crash | Monthly | Subsystem restart, automatic recovery |
| LPAR | z/OS crash, HMC-initiated deactivation, unrecoverable wait state | Quarterly | Sysplex workload redistribution, LPAR re-IPL |
| Coupling Facility | CF failure, loss of CF connectivity | Annually | Alternate CF structures, structure rebuild |
| Storage Subsystem | DS8900 controller failure, all paths lost to a storage frame | Every 2–5 years | Redundant controllers, PPRC/Metro Mirror |
| Site | Power failure, flood, fire, extended network outage | Every 10–20 years | GDPS site failover |
| Data Corruption | Application bug writes garbage to DB2, VSAM dataset destroyed | Varies | Point-in-time recovery, journal-based repair |

The last entry — data corruption — is the one that keeps architects awake at night. Every other failure on this list destroys hardware or connectivity. Data corruption destroys the data itself, and it replicates. If your application writes corrupt data to the primary site, GDPS faithfully replicates that corruption to the DR site. Synchronous mirroring doesn't protect you from logical errors — it guarantees they propagate instantly.
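Because the mirror faithfully replicates the corruption, recovery from a logical error means going back to a point-in-time copy taken before the bad write. The selection logic can be sketched in a few lines (the timestamps here are hypothetical hours, not a real journal or FlashCopy format):

```python
from bisect import bisect_left

def choose_restore_point(backup_times, corruption_time):
    """Return the latest point-in-time copy taken strictly before the
    corruption occurred, or None if every copy already contains the
    bad data. Times are arbitrary comparable units (hours here)."""
    times = sorted(backup_times)
    i = bisect_left(times, corruption_time)  # first copy at/after corruption
    return times[i - 1] if i > 0 else None

# Hypothetical example: copies every 6 hours, corruption detected at t=14.5h.
copies = [0, 6, 12, 18]
good = choose_restore_point(copies, 14.5)   # the t=12 copy is the last clean one
loss_window = 14.5 - good                    # everything after t=12 must be
                                             # repaired from journals or rerun
```

The `loss_window` is the real cost of logical corruption: even with zero-RPO mirroring, you lose (or must forward-recover) everything between the last clean copy and the corrupting write.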

⚠️ Common Pitfall: The most dangerous DR plan is one that only addresses site failure. Site failure is the rarest event on the spectrum. If your DR plan doesn't cover subsystem failures, LPAR failures, and data corruption, it covers the event least likely to happen and ignores the events most likely to happen.

What "Disaster Recovery" Actually Means

Let's define our terms precisely, because sloppy terminology leads to sloppy architecture:

Disaster Recovery (DR): The technical capability to restore IT services after a disruption. DR is about technology: replication, failover, backup, restore, restart.

Business Continuity (BC): The organizational capability to continue essential business operations during and after a disruption. BC includes DR but extends to people (who does what), processes (how do we communicate), facilities (where do people work), and third parties (what about our vendors).

High Availability (HA): The design of systems to minimize downtime through redundancy and automatic failover. HA operates within a site — Parallel Sysplex, DB2 data sharing, CICS region failover. HA is what keeps you running during component and subsystem failures.

Continuous Availability (CA): The aspiration to eliminate all downtime — planned and unplanned. CA requires HA + DR + the discipline to perform all maintenance (z/OS, DB2, CICS, storage firmware) without an outage.

The relationship: CA ⊃ DR ⊃ HA. You can have HA without DR (single-site Sysplex). You can have DR without CA (if you take outages for planned maintenance). You cannot have CA without both HA and DR.

💡 Key Insight: When an executive says "we need disaster recovery," what they actually mean is "we need to survive bad things happening without losing money or customers." That's business continuity. Your job as an architect is to translate that business requirement into a technical DR architecture — and to do that, you need RTO and RPO.


30.2 RTO/RPO Analysis: Business Requirements Driving Technology

This is the threshold concept for this chapter, so read this section carefully.

Every DR conversation I've ever had that went wrong started with technology. "We need GDPS." "We should implement Metro Mirror." "Let's set up a hot standby site." These are technology statements. They're answers. But nobody asked the question.

The question is: How much can you lose, and how long can you be down?

Defining RTO and RPO

Recovery Time Objective (RTO): The maximum acceptable time between the moment a disruption occurs and the moment IT services are restored to minimum acceptable functionality. RTO answers: "How long can we be down?"

Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in time. RPO answers: "How much data can we lose?" An RPO of zero means no data loss is acceptable. An RPO of one hour means you can lose up to one hour of transactions.

Recovery Level Objective (RLO): The minimum acceptable level of service after recovery. This is the often-forgotten third dimension. After failover, do you need 100% capacity? 50%? Can some functions be deferred? RLO answers: "What do we need to be able to do?"

The RTO/RPO Matrix

Here's how the analysis works in practice. CNB conducted this analysis in 2019, and Kwame walked me through the methodology.

Step 1: Identify business processes. Not IT systems — business processes. "Process wire transfers," not "run CICS." "Generate regulatory reports," not "execute batch jobs."

Step 2: Classify each process by criticality.

| Tier | Definition | Example (Banking) |
|---|---|---|
| Tier 0 | Life/safety or regulatory mandate; cannot be unavailable for any period | Fraud detection, AML screening |
| Tier 1 | Revenue-critical; downtime causes immediate financial loss | ATM/POS authorization, wire transfers, online banking |
| Tier 2 | Operations-critical; downtime degrades but doesn't halt the business | Batch settlement, GL posting, statement generation |
| Tier 3 | Support functions; can tolerate extended outage | Management reporting, data warehouse feeds, archival |

Step 3: Assign RTO/RPO/RLO to each tier.

| Tier | RTO | RPO | RLO | Cost Implication |
|---|---|---|---|---|
| Tier 0 | Near-zero (< 2 min) | Zero | 100% | GDPS/HyperSwap + active-active |
| Tier 1 | < 15 minutes | Zero | 80%+ capacity | GDPS/Metro Mirror + automated failover |
| Tier 2 | < 4 hours | < 15 minutes | 50% capacity | GDPS/XRC or journal-based recovery |
| Tier 3 | < 24 hours | < 4 hours | Minimum viable | Tape backup + manual recovery |

Step 4: Map business processes to IT systems. Now — and only now — you ask which technology supports each tier.

💡 Key Insight — The Threshold Concept: Notice what we just did. We started with business processes, not technology. We asked business stakeholders "how long can wire transfers be down?" — not "should we use synchronous or asynchronous replication?" The answer to the first question determines the answer to the second. An RTO of near-zero with RPO of zero demands synchronous replication and automated failover. An RTO of four hours with an RPO of fifteen minutes allows asynchronous replication with manual failover. The business requirement selects the technology. This is the principle: DR design flows from RTO/RPO backward to technology, never the other way around.
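As a sketch of "requirements select technology," the tier table above can be collapsed into a selection function. The thresholds are this chapter's tier boundaries, not IBM-published limits, and the returned strings are the table's cost-implication column:

```python
def select_replication(rto_minutes, rpo_seconds):
    """Map business RTO/RPO targets to the DR approach from the tier
    table. Input is the business requirement; output is the technology —
    never the other way around."""
    if rpo_seconds == 0 and rto_minutes < 2:
        return "GDPS/HyperSwap + active-active"          # Tier 0
    if rpo_seconds == 0 and rto_minutes <= 15:
        return "GDPS/Metro Mirror + automated failover"  # Tier 1
    if rpo_seconds <= 15 * 60 and rto_minutes <= 4 * 60:
        return "GDPS/XRC or journal-based recovery"      # Tier 2
    return "Tape backup + manual recovery"               # Tier 3

# Wire transfers: zero data loss, restored within 15 minutes.
print(select_replication(15, 0))   # Metro Mirror — synchronous replication
```

Note what the function cannot do: it has no parameter for "which product sounds impressive." If the business stakeholder cannot state an RTO and RPO, the technology question is unanswerable.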

The Cost Curve

Every CTO who has ever funded a DR program has asked: "Can we get a cheaper option?" The honest answer is: yes, but the cost is denominated in RTO and RPO.

RTO    │ Tape/Manual                GDPS/XRC         Metro Mirror    HyperSwap
(hours)│ ████████████████████████
  24   │ ████████████████████████
       │ ████████████████████████
  12   │
       │
   4   │                          ████████████████
       │                          ████████████████
   1   │
       │
  0.25 │                                            ████████████████
       │                                            ████████████████
 ~0    │                                                             ████████████
       └───────────────────────────────────────────────────────────────────────
        $                   $$                $$$             $$$$
                              Annual Cost →

The cost isn't just GDPS licensing (which is significant — IBM doesn't give this away). It's storage for mirrored volumes, network bandwidth for replication, a DR site with equivalent capacity, staff to manage and test the configuration, and the operational complexity tax of running a multi-site Sysplex.

At CNB, Kwame's team estimated the annual cost of their DR architecture at approximately 35% of their total mainframe operational budget. That's not unusual for a Tier-1 bank. The alternative — the cost of being unable to process transactions for 24 hours — was estimated by CNB's risk management team at $47 million in direct losses plus an unquantifiable regulatory penalty that could include consent orders.

The 35% buys you the cheapest insurance policy in the building.

Pinnacle Health's RTO/RPO Dilemma

When Pinnacle Health Insurance conducted their DR analysis, Diane Okoye and Ahmad Rashidi ran into a problem that's common in healthcare and insurance: the regulatory requirements didn't match the business requirements.

HIPAA doesn't specify an RTO. It requires a "data backup plan," a "disaster recovery plan," and an "emergency mode operation plan" — but it doesn't say "you must recover in four hours." This sounds like freedom. It's actually a trap.

"The auditors ask to see our RTO," Ahmad explained to the board. "If we say twenty-four hours, they'll accept it — until we actually have a twenty-four-hour outage. Then they'll ask why claims adjudication was down for a full day when patients needed eligibility verification for emergency procedures. And our answer can't be 'because we told you our RTO was twenty-four hours and you accepted it.' The answer has to be 'because the cost of a shorter RTO exceeded the risk.' And in healthcare, the risk includes a patient who couldn't get a procedure approved."

Pinnacle ended up with a tiered approach: real-time eligibility checking got Tier 1 treatment (RTO < 15 min, RPO zero), claims adjudication batch processing got Tier 2 (RTO < 4 hours), and reporting got Tier 3 (RTO < 24 hours).

⚠️ Common Pitfall: "Our RTO is four hours" is meaningless without specifying which business processes that RTO applies to. A single RTO for the entire enterprise either over-protects low-value functions (wasting money) or under-protects high-value functions (accepting unacceptable risk). Always tier your RTOs.


30.3 GDPS Architecture: HyperSwap, XRC, Metro Mirror

GDPS — Geographically Dispersed Parallel Sysplex — is IBM's framework for multi-site disaster recovery. It's not a single product; it's a family of solutions that combine z/OS system automation, storage replication, and Sysplex management to enable site-level failover.

Let me walk you through the options from most aggressive (lowest RTO/RPO) to most conservative.

GDPS/HyperSwap

What it does: Provides near-zero RTO and zero RPO for planned and unplanned storage failures. When a storage subsystem becomes unavailable, GDPS/HyperSwap automatically swaps all I/O from primary volumes to secondary (mirrored) volumes — with no application interruption, no IPL, and no operator intervention.

How it works:

  1. Production DASD volumes are synchronously mirrored using Metro Mirror (PPRC) to a secondary storage subsystem. The secondary can be in the same data center (for storage failure protection) or at a metro distance (for site failure protection).
  2. GDPS System Automation (SA) monitors the health of all storage subsystems, paths, and replication sessions.
  3. When a failure is detected — loss of paths to primary storage, primary controller failure, or primary site failure — GDPS/HyperSwap executes an automatic swap:
       • All Metro Mirror pairs are frozen (suspending replication)
       • I/O is redirected from primary volumes to secondary volumes
       • Secondary volumes become the new primary
       • Applications continue without interruption
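The freeze-and-swap sequence can be sketched as a toy state machine. This is purely conceptual — the real logic lives in GDPS and SA z/OS automation, not application code — but it captures the one rule that matters: never swap to a stale mirror.

```python
class HyperSwapManager:
    """Conceptual sketch of GDPS freeze-and-swap (illustrative only)."""

    def __init__(self, pairs):
        # pairs: volume -> {"primary": dev, "secondary": dev, "in_sync": bool}
        self.pairs = pairs
        # I/O initially targets every pair's primary device
        self.active = {vol: p["primary"] for vol, p in pairs.items()}

    def primary_failed(self):
        """Handle loss of the primary storage subsystem."""
        # Swap only if every mirrored pair is in sync; a stale secondary
        # would silently lose committed data.
        if not all(p["in_sync"] for p in self.pairs.values()):
            return False
        for vol, p in self.pairs.items():
            p["in_sync"] = False              # freeze: suspend replication
            self.active[vol] = p["secondary"]  # redirect I/O to the mirror
        return True

mgr = HyperSwapManager(
    {"DB2.DATA": {"primary": "DS8950-A", "secondary": "DS8950-B", "in_sync": True}}
)
mgr.primary_failed()          # applications keep running against DS8950-B
```

The all-or-nothing check is why GDPS freezes every pair as a group: swapping some volumes but not others would leave the database and its logs at different points in time.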

Key constraints:

  • Distance: Metro Mirror is synchronous, meaning every write must complete on both the primary and secondary storage before the application receives confirmation. The speed of light imposes a hard limit: at 300 km of fiber, round-trip latency is approximately 3 milliseconds. Most deployments keep Metro Mirror distances under 100 km to keep replication latency under 1 ms. Every microsecond of replication latency adds directly to every write I/O response time.
  • Storage: You need a complete second copy of every mirrored volume. For CNB, that's 47 TB of primary DASD mirrored to 47 TB of secondary DASD — doubling the storage footprint.
  • Bandwidth: Synchronous replication requires enough bandwidth to handle peak write throughput without queuing. CNB's peak write rate is approximately 180 MB/s; they provisioned 10 Gbps FICON links for replication traffic.
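A back-of-the-envelope check of the distance and bandwidth constraints, using the common rule of thumb of roughly 10 microseconds of round-trip propagation per km of fiber (light in glass travels at about two-thirds of c):

```python
def metro_mirror_write_penalty_ms(distance_km, us_per_km_round_trip=10.0):
    """Propagation-only lower bound on the added latency per write I/O.
    Real penalties also include controller and protocol overhead, so
    measured figures will sit somewhat above this floor."""
    return distance_km * us_per_km_round_trip / 1000.0

def link_utilization(peak_write_mb_s, link_gbps):
    """Fraction of the replication link consumed at peak write rate."""
    link_mb_s = link_gbps * 1000 / 8     # Gbit/s -> MB/s (decimal units)
    return peak_write_mb_s / link_mb_s

# CNB's numbers from the text: 43 km to the mirror, 180 MB/s peak writes,
# 10 Gbps FICON replication links.
penalty = metro_mirror_write_penalty_ms(43)   # ~0.43 ms propagation floor
util = link_utilization(180, 10)              # ~14% of the link at peak
```

The 0.43 ms floor sits comfortably inside CNB's measured 0.3–0.8 ms write penalty, and 14% peak utilization leaves generous headroom — synchronous replication queues are where RPO-zero guarantees go to die.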

When to use it: Tier 0 and Tier 1 workloads where any interruption is unacceptable. CNB uses HyperSwap for all core banking DASD — DB2 data, logs, CICS journals, and MQ page sets.

┌──────────────────────────────────────────────────────────────────────────┐
│                     GDPS/HyperSwap Architecture                          │
│                                                                          │
│  ┌─────────── Primary Site ──────────┐  ┌──────── DR Site ──────────┐   │
│  │                                    │  │                           │   │
│  │  ┌──────────┐   ┌──────────┐      │  │  ┌──────────┐            │   │
│  │  │  LPAR 1  │   │  LPAR 2  │      │  │  │  LPAR 3  │ (standby)  │   │
│  │  │  DB2/CICS│   │  DB2/CICS│      │  │  │  DB2/CICS│            │   │
│  │  └────┬─────┘   └────┬─────┘      │  │  └────┬─────┘            │   │
│  │       │              │             │  │       │                   │   │
│  │  ┌────▼──────────────▼────┐        │  │  ┌────▼──────────────┐   │   │
│  │  │  Primary DS8950        │◄══════►│  │  │  Secondary DS8950 │   │   │
│  │  │  (Production Volumes)  │ Metro  │  │  │  (Mirror Volumes)  │   │   │
│  │  │                        │ Mirror │  │  │                    │   │   │
│  │  └────────────────────────┘ (sync) │  │  └────────────────────┘   │   │
│  │                                    │  │                           │   │
│  │  ┌──────────────────────────────┐  │  │                           │   │
│  │  │  GDPS Controlling System     │  │  │                           │   │
│  │  │  (SA z/OS + HyperSwap Mgr)  │  │  │                           │   │
│  │  └──────────────────────────────┘  │  │                           │   │
│  └────────────────────────────────────┘  └───────────────────────────┘   │
│                                                                          │
│  On primary storage failure:                                             │
│  1. GDPS detects failure (< 1 sec)                                       │
│  2. Metro Mirror pairs frozen                                            │
│  3. I/O swapped to secondary volumes                                     │
│  4. Applications continue — no IPL, no restart                           │
│                                                                          │
│  RTO: Near-zero  |  RPO: Zero  |  Distance: < 100 km                    │
└──────────────────────────────────────────────────────────────────────────┘

GDPS/Metro Mirror (PPRC-based)

What it does: Provides automated site failover with zero RPO. Unlike HyperSwap (which swaps storage transparently), GDPS/Metro Mirror performs a managed failover that involves IPLing z/OS at the DR site. RTO is typically 10-30 minutes depending on the number of LPARs and subsystems.

How it works:

  1. Same synchronous Metro Mirror replication as HyperSwap — every write goes to both sites.
  2. The DR site has LPARs defined and ready but not actively running production workload (warm standby).
  3. On failover:
       • GDPS suspends Metro Mirror and makes DR volumes accessible
       • DR site LPARs are IPLed (or, if already IPLed for standby, production subsystems are started)
       • DB2 performs group restart, CICS warm starts, MQ channel recovery
       • Batch scheduling is redirected to the DR site

Key difference from HyperSwap: HyperSwap handles storage failures transparently — no IPL needed. GDPS/Metro Mirror handles site failures by bringing up the DR site — IPL required. Many shops use both: HyperSwap for storage failure within a site, Metro Mirror failover for site failure.

When to use it: Tier 1 workloads that can tolerate a 10-30 minute outage for site failure but require zero data loss.

GDPS/XRC (Extended Remote Copy)

What it does: Provides asynchronous replication to a remote site. RPO is typically 2-30 seconds (the replication lag). RTO is 30-60 minutes.

How it works:

  1. Primary site writes to local DASD normally — no synchronous replication overhead.
  2. z/OS System Data Mover reads the write operations from the primary storage subsystem's cache and transmits them to the secondary storage subsystem asynchronously.
  3. A consistency group timestamp ensures that the secondary volumes are always at a consistent point in time (all volumes reflect the same moment, even if individual updates arrived at slightly different times).
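The consistency-group idea can be illustrated with a toy calculation: the latest safe recovery point is limited by the volume whose replication is furthest behind, because presenting any later state would leave the volumes mutually inconsistent. (Volume names and timestamps below are hypothetical.)

```python
def consistency_point(replicated_updates):
    """Given, per volume, the timestamps of updates already hardened at
    the secondary site, the latest consistent recovery point is the
    minimum across volumes of each volume's newest replicated update."""
    return min(max(times) for times in replicated_updates.values())

# Hypothetical replication state: the data volume is slightly ahead of
# the log volume at the secondary site.
state = {
    "DB2.DATA": [100, 104, 109],
    "DB2.LOG":  [100, 103, 107],
}
# All volumes can only be presented consistently as of t=107; the t=109
# data update must be discarded because its log record hasn't arrived.
```

This is exactly the "consistency window" disadvantage below: every volume agrees on the same instant, but that instant is the lag of the slowest volume behind real time — which is your RPO.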

Key advantages over Metro Mirror:

  • Distance: Because replication is asynchronous, XRC can operate over any distance — continental or intercontinental. There's no speed-of-light latency penalty on application I/O.
  • Performance: Zero impact on primary site write latency. The application writes to local DASD at local speed. Replication happens in the background.
  • Bandwidth: XRC can tolerate bandwidth constraints by queuing — though prolonged bandwidth shortfall increases RPO.

Key disadvantages:

  • RPO > 0: You will lose the transactions that were committed at the primary site but hadn't yet been replicated to the secondary. In practice, this is seconds of data, but "seconds" can mean hundreds of transactions at a high-volume site.
  • Consistency window: The consistency group timestamp means you recover to a point slightly in the past. All volumes are consistent with each other, but you've lost the most recent updates.

When to use it: Tier 2 workloads, or as a long-distance complement to Metro Mirror (see Multi-Target below).

GDPS/Global Mirror

What it does: Provides asynchronous disk-based replication over unlimited distance, typically used for a tertiary recovery site. Think of it as XRC implemented entirely in the storage subsystem firmware rather than in z/OS.

How it works: The primary storage subsystem asynchronously replicates changed tracks to a remote storage subsystem. Consistency groups are managed by the storage controller. z/OS System Data Mover is not involved.

When to use it: Third-site replication for geographically diverse DR. Some regulations (OCC for banking, certain EU directives) require a third recovery site at continental distance. Global Mirror serves this role.

Multi-Target Configurations

The real world isn't one technology or the other. Large enterprises stack GDPS technologies:

┌─────────────────────────────────────────────────────────────────────────┐
│                   CNB's Multi-Target DR Architecture                     │
│                                                                         │
│  Primary Site          Metro Site           Remote Site                  │
│  (Charlotte)           (Raleigh, 43 km)     (Dallas, 1,500 km)          │
│                                                                         │
│  ┌──────────┐          ┌──────────┐         ┌──────────┐               │
│  │Production│══════════│  Mirror  │         │  Remote  │               │
│  │  DASD    │  Metro   │  DASD    │         │  DASD    │               │
│  │          │  Mirror  │          │         │          │               │
│  │  47 TB   │  (sync)  │  47 TB   │         │  47 TB   │               │
│  └────┬─────┘          └──────────┘         └──────────┘               │
│       │                                          ▲                      │
│       │                                          │                      │
│       └──────────────────────────────────────────┘                      │
│                         XRC (async)                                     │
│                         RPO: ~5-10 sec                                  │
│                                                                         │
│  HyperSwap: Charlotte ↔ Raleigh (storage failure, near-zero RTO)       │
│  Metro Mirror Failover: Charlotte → Raleigh (site failure, ~15 min)    │
│  XRC Failover: Charlotte → Dallas (regional disaster, ~45 min)         │
│                                                                         │
│  Total storage: 3 × 47 TB = 141 TB                                     │
│  Network: 10 Gbps FICON (Charlotte-Raleigh), 1 Gbps WAN (to Dallas)    │
└─────────────────────────────────────────────────────────────────────────┘

CNB's configuration gives them three levels of protection:

  1. Storage failure at primary site: HyperSwap to Raleigh mirror — near-zero RTO, zero RPO
  2. Charlotte site failure: Metro Mirror failover to Raleigh — 15 min RTO, zero RPO
  3. Regional disaster (both Charlotte and Raleigh): XRC failover to Dallas — 45 min RTO, ~10 sec RPO
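Those three protection levels can be expressed as the kind of lookup table a site's runbook tooling or automation might encode. The scenario names and dictionary structure here are illustrative, not GDPS syntax; the actions and objectives come straight from CNB's configuration:

```python
# CNB's three-level failover plan as a scenario -> response table.
FAILOVER_PLAN = {
    "primary_storage_failure": {
        "action": "HyperSwap to Raleigh mirror", "rto_min": 0, "rpo_sec": 0},
    "primary_site_failure": {
        "action": "Metro Mirror failover to Raleigh", "rto_min": 15, "rpo_sec": 0},
    "regional_disaster": {
        "action": "XRC failover to Dallas", "rto_min": 45, "rpo_sec": 10},
}

def respond(scenario):
    """Return the least disruptive response defined for the scenario."""
    return FAILOVER_PLAN[scenario]["action"]
```

Encoding the plan as data rather than prose has a practical benefit: the DR test harness can iterate over the same table the operators use, so a scenario can't quietly exist in the runbook without ever being tested.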

The Dallas site runs at reduced capacity (about 60% of production) because a regional disaster that takes out both Charlotte and Raleigh is expected to reduce transaction volume significantly (ATMs in the affected area won't be generating traffic).

📊 By the Numbers — GDPS Replication Performance at CNB:

  • Metro Mirror write penalty: 0.3–0.8 ms additional latency per write I/O (Charlotte to Raleigh, 43 km)
  • XRC replication lag: 5–10 seconds under normal load; up to 30 seconds during batch window peak
  • HyperSwap execution time: 1.8 seconds (last measured during Q3 2023 DR test)
  • Metro Mirror failover time: 14 minutes (IPL + DB2 restart + CICS warm start + MQ channel recovery)
  • XRC failover time: 42 minutes (includes consistency group verification)

GDPS Controlling System

GDPS requires a controlling system — a dedicated LPAR (or pair of LPARs for redundancy) running IBM System Automation for z/OS (SA z/OS). The controlling system:

  • Monitors health of all storage subsystems, replication sessions, and Sysplex components
  • Executes automated failover procedures (HyperSwap, Metro Mirror freeze/failover)
  • Manages the state machine that tracks which site is primary, which volumes are active, and which replication pairs are in-sync
  • Provides operator interfaces for planned failover (site switch for maintenance)

⚠️ Common Pitfall: The GDPS controlling system itself is a single point of failure if not properly redundant. At CNB, GDPS runs on a dedicated LPAR at each site, with the secondary controlling system capable of taking over if the primary fails. I've seen shops that ran GDPS on a shared LPAR alongside production workload. When the LPAR ran out of storage and crashed, they lost both production and the ability to perform an automated failover. Don't do this.


30.4 Sysplex Failure Domains: What Can Fail Independently

A failure domain is a set of resources that share a single point of failure. When that single point fails, everything in the failure domain is affected — and only things in that failure domain are affected. Understanding your failure domains is the foundation of survivability design.

Let me walk through the failure domains in a typical Parallel Sysplex, from smallest to largest.

Level 1: Application Instance Failure

What fails: A single CICS transaction, a single batch job step, a single DB2 thread.

Impact: One user's transaction fails. One batch job needs restart. One DB2 connection is rolled back.

Recovery: Automatic. CICS rolls back the transaction and returns an error. Batch JCL COND codes and checkpoint/restart handle job failure. DB2 backs out the in-flight unit of work.

Design principle: Every COBOL program must handle failure at this level. If your program abends, can the system recover without human intervention? If yes, you've designed for Level 1. This is Chapter 24's checkpoint/restart territory.

Level 2: Subsystem Instance Failure

What fails: An entire CICS AOR region abends. A DB2 member crashes. An MQ queue manager terminates.

Impact: All transactions in that CICS region fail. All DB2 threads connected to that member are rolled back. All MQ connections to that queue manager are broken.

Recovery:

  • CICS: If you've designed your region topology properly (Chapter 13), the TOR detects that the AOR is down and routes transactions to a surviving AOR on the same or a different LPAR. CICS auto-restart brings the failed AOR back. Users see a brief delay but no outage — if you have enough AOR capacity distributed across multiple regions.
  • DB2: In a data sharing group, the surviving DB2 members perform peer recovery — they acquire the failed member's locks, back out its in-flight work, and make the data available. Applications reconnect to surviving members.
  • MQ: Shared queues in a queue-sharing group allow surviving queue managers to process messages. Queue manager restart recovers the failed instance.

Design principle: Never run a single instance of anything. Two AORs, two (or more) DB2 members, two queue managers. And distribute them across LPARs.

Level 3: LPAR Failure

What fails: An entire z/OS image. Everything running on that LPAR — CICS regions, DB2 members, batch initiators, MQ queue managers — is gone.

Impact: All workload on that LPAR stops. In a Sysplex, surviving LPARs continue.

Recovery:

  • WLM detects the missing LPAR and adjusts goals across surviving LPARs
  • CICS TORs on surviving LPARs route transactions to surviving AORs
  • DB2 peer recovery backs out the failed member's work
  • Batch jobs running on the failed LPAR need restart from last checkpoint
  • The failed LPAR is re-IPLed (minutes) and its subsystems restarted

Design principle: Distribute your workload so that any N−1 of your N production LPARs can carry the full load — equivalently, no LPAR normally runs above (N−1)/N of its capacity. CNB's four-LPAR Sysplex is designed so that any three LPARs can carry the full production workload. This means each LPAR normally runs at no more than 75% of its capacity for production — the remaining 25% is headroom for absorbing a failed LPAR's workload.

💡 Key Insight: This is why capacity planning (Chapter 29) and DR planning are inseparable. Your DR design requires capacity headroom. If you're running at 95% CPU utilization across all LPARs, you can't absorb the loss of one. Kwame calls this the "N-1 rule": design every layer of the architecture to survive the loss of one component at that layer.
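The N-1 rule is simple arithmetic, and it's worth making explicit. A quick sketch (function names are mine, not CNB's) shows why 75% is the ceiling for a four-LPAR Sysplex and why 95% utilization makes a single failure unabsorbable:

```python
def max_normal_utilization(n_lpars: int, failures_tolerated: int = 1) -> float:
    """N-1 rule as arithmetic: with a balanced workload, the surviving
    LPARs must absorb a failed LPAR's share, so normal per-LPAR
    utilization cannot exceed (n - k) / n."""
    survivors = n_lpars - failures_tolerated
    if survivors < 1:
        raise ValueError("cannot tolerate that many failures")
    return survivors / n_lpars


def utilization_after_failure(normal_util: float, n_lpars: int,
                              failures: int = 1) -> float:
    """Per-survivor utilization after `failures` LPARs drop out,
    assuming the workload rebalances evenly across survivors."""
    return normal_util * n_lpars / (n_lpars - failures)


# CNB's four-LPAR Sysplex: cap normal utilization at 75%
print(max_normal_utilization(4))            # → 0.75
print(utilization_after_failure(0.75, 4))   # → 1.0 (survivors run flat out)
print(utilization_after_failure(0.95, 4))   # → ~1.27 — cannot absorb the loss
```

The same function generalizes: a shop that wants to survive two simultaneous LPAR failures (`failures_tolerated=2`) in a four-LPAR Sysplex must cap normal utilization at 50%.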

Level 4: Coupling Facility Failure

What fails: A coupling facility — the hardware that hosts shared structures (lock structures, cache structures, list structures) for the Sysplex.

Impact: If the CF hosts the only instance of a structure, that structure is lost. If that structure is the DB2 lock structure, all DB2 data sharing members lose their global lock coordination and must rebuild. If it's a CICS shared temporary storage pool, that data is gone.

Recovery:
- With alternate CF: Structures have been duplexed to an alternate CF (or an alternate structure allocation has been defined). When the primary CF fails, the alternate structure takes over. For DB2 lock structures that were duplexed, this is transparent. For structures that were not duplexed, DB2 must rebuild the lock structure — a process that can take seconds to minutes depending on size.
- Without alternate CF: This is a Sysplex-wide outage. If the coupling facility was a single point of failure, you've lost the coordination layer that enables data sharing. DB2 members will enter single-member mode or crash. CICS XCF-based services will fail. This is catastrophic and recoverable only by rebuilding the CF or reverting to non-data-sharing mode.

Design principle: Always have at least two coupling facilities. Duplex critical structures. Define alternate structure allocations in the CFRM policy. Test CF failure annually.

Coupling Facility Duplexing — CNB Configuration

CF01 (Frame 1)                         CF02 (Frame 2)
┌──────────────────────┐               ┌──────────────────────┐
│ DB2 Lock Structure   │◄═══════════►  │ DB2 Lock Structure   │
│   (primary)          │  (duplexed)   │   (secondary)        │
│                      │               │                      │
│ DB2 GBP0             │◄═══════════►  │ DB2 GBP0             │
│   (primary)          │  (duplexed)   │   (secondary)        │
│                      │               │                      │
│ DB2 SCA              │◄═══════════►  │ DB2 SCA              │
│   (primary)          │  (duplexed)   │   (secondary)        │
│                      │               │                      │
│ CICS TS Shared Pool  │               │ MQ CF Structure      │
│   (not duplexed —    │               │   (not duplexed —    │
│    rebuilt on fail)  │               │    rebuilt on fail)  │
└──────────────────────┘               └──────────────────────┘

Rule: Critical structures (DB2 lock, GBP, SCA) are always duplexed.
Less critical structures occupy the "other" CF and rebuild on failure.
CFs are on separate physical frames — no single hardware failure kills both.

Level 5: Storage Subsystem Failure

What fails: An entire storage frame — all DASD volumes on that subsystem.

Impact: Without mirroring, every dataset on that subsystem is inaccessible. DB2 data, CICS journals, MQ page sets, catalogs, system datasets — everything. This is a full site outage even though the processors and network are fine.

Recovery:
- With GDPS/HyperSwap: Transparent. I/O swaps to mirrored volumes. Applications don't know.
- With GDPS/Metro Mirror (no HyperSwap): Managed failover. Replication pairs are broken, secondary volumes are made accessible, subsystems are restarted against the secondary volumes. Minutes, not hours.
- Without GDPS: Tape restore. Hours to days. Don't be this shop.

Level 6: Site Failure

What fails: The entire data center — all LPARs, all storage, all coupling facilities, all network equipment.

Impact: Everything at that site stops. Period.

Recovery: GDPS site failover to the DR site. The mechanism depends on which GDPS technology is in use (see Section 30.3). The DR site's LPARs are IPLed (or activated from standby), subsystems are started, and production workload resumes.

Design principle: The DR site must be genuinely independent — separate power grid, separate network provider, separate physical location. "Separate" means that a single event (storm, flood, earthquake, power grid failure, network provider outage) cannot affect both sites simultaneously. The minimum distance depends on the threat model: 50 km for most metropolitan risks, 500+ km for regional natural disasters, different continents for geopolitical risk.

Level 7: Data Corruption (the Special Case)

What fails: The data itself. Not hardware, not software, not connectivity — the bits on disk are wrong.

Causes: Application bug (most common), human error (SQL UPDATE without WHERE clause — every DBA's nightmare), storage firmware bug (rare but devastating), malware/ransomware (increasingly common).

Impact: Depends entirely on when the corruption is detected. If detected immediately (within the replication window), you may be able to recover from the secondary before corruption replicates. If detected after replication, both copies are corrupt.

Recovery:
- DB2: Point-in-time recovery using image copies and log apply. Recover the affected tablespace to a point just before the corruption. This requires valid image copies and complete logs — which is why Chapter 9's utility strategy matters.
- VSAM: Restore from backup. If you have FlashCopy snapshots (and you should), restore from the most recent clean snapshot.
- IMS: Journal-based forward/backward recovery.

Design principle: Data corruption is the one failure mode that replication makes worse. Build a separate defense: FlashCopy snapshots taken at known-good points (before and after batch runs), immutable backup copies (air-gapped or WORM media), and monitoring that detects corruption quickly (checksum validation, application-level data integrity checks).

🔴 Critical Warning: Ransomware targeting mainframes is no longer theoretical. Sophisticated attacks have been documented that encrypt VSAM datasets and DB2 tablespaces. Your DR plan must include a recovery scenario for a ransomware attack — which means recovering from backups that predate the attack, on infrastructure you trust hasn't been compromised. If your backups are on the same storage subsystem as your production data and that subsystem is compromised, your backups are worthless. Air-gapped backups (tape vaulted offsite, or immutable cloud storage) are your last line of defense.


30.5 DR Runbooks: Site, LPAR, Subsystem, Data Corruption

A runbook is a step-by-step procedure for recovering from a specific failure scenario. Good runbooks share several characteristics:

  1. They're written for the person who will execute them at 3 AM. Not the person who designed the system. Not the architect. The on-call operator who may have never performed this procedure in production. Clear, unambiguous, numbered steps. No jargon that isn't defined. No assumptions about tribal knowledge.

  2. They include verification steps. After each major action, the runbook tells you how to verify it worked. "Start DB2" is not enough. "Start DB2. Verify: Issue -DIS DDF. Expected output: DSNL081I STATUS=STARTD. If the status is anything else, proceed to step 14 (DDF troubleshooting)."

  3. They include decision points. Not every recovery follows the same path. The runbook must include branches: "If the DB2 member starts successfully, proceed to step 7. If the DB2 member fails to start with reason code 00E30810, proceed to Appendix C (conditional restart)."

  4. They include estimated times. The ops team and management need to know how long recovery will take. Every major step should have an expected duration: "Step 5: IPL LPAR CNBPROD3. Expected duration: 8-12 minutes."

  5. They include escalation paths. If a step fails and the recovery path doesn't cover the failure, who do you call? Name, phone number, role. Not "contact the DB2 DBA." Rather: "Contact Lisa Tran (DB2 lead) at 704-555-0142 (cell). Backup: DB2 on-call pager at 704-555-0199."
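These five characteristics can be enforced mechanically if runbook steps are kept as structured data rather than free prose. A sketch — the field names, the example step, and the lint rules are illustrative, not CNB's actual format:

```python
from dataclasses import dataclass

@dataclass
class RunbookStep:
    """One runbook step carrying the required characteristics."""
    number: int
    action: str                          # the unambiguous instruction
    verify: str                          # how to confirm the action worked
    expected_minutes: tuple[int, int]    # (low, high) duration estimate
    on_failure: str                      # branch target or escalation contact

def lint(step: RunbookStep) -> list[str]:
    """Flag steps missing a verification or a failure path."""
    problems = []
    if not step.verify:
        problems.append(f"step {step.number}: no verification")
    if not step.on_failure:
        problems.append(f"step {step.number}: no failure path")
    return problems

step = RunbookStep(
    number=5,
    action="IPL LPAR CNBPROD3",
    verify="D XCF,SYSPLEX shows CNBPROD3 as an active member",
    expected_minutes=(8, 12),
    on_failure="Escalate to infrastructure on-call (see contact sheet)",
)
print(lint(step))   # → []
```

A review script over all steps catches the "Start DB2, no verify" class of runbook defect before a 3 AM operator does.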

Runbook 1: LPAR Failure Recovery

Here's a condensed version of what a real LPAR failure runbook looks like. (The full version for CNB is 47 pages. This is the skeleton.)

═══════════════════════════════════════════════════════════════════
 CNB RUNBOOK: LPAR FAILURE RECOVERY
 Document: CNB-DR-RB-003  Rev: 2024.03  Classification: Internal
 Scope: Recovery of a single LPAR in the CNB Parallel Sysplex
 Estimated Recovery Time: 15–25 minutes (full LPAR re-IPL)
 or 0 minutes (if workload absorbed by surviving LPARs)
═══════════════════════════════════════════════════════════════════

PREREQUISITES:
 - At least 2 of 4 production LPARs operational
 - Coupling facilities operational (both CF01 and CF02)
 - GDPS controlling system operational
 - Storage subsystems accessible from surviving LPARs

IMMEDIATE ASSESSMENT (0–2 minutes):
─────────────────────────────────────
 1. IDENTIFY the failed LPAR
    - Check HMC: Operating System Messages panel
    - Check GDPS/SA console: INGLIST display
    - Verify: Which LPAR(s) are in "not operating" state?

 2. ASSESS impact on Sysplex
    - Issue from surviving LPAR:
       D XCF,CF             (verify CF connectivity)
       D XCF,SYSPLEX        (verify Sysplex membership)
       -DIS GROUP           (verify DB2 data sharing member status)
       D A,CICS*            (verify CICS region address spaces)
    - Record which DB2 members, CICS regions, and MQ queue
      managers were on the failed LPAR

 3. VERIFY workload absorption
    - WLM will automatically redistribute work to surviving LPARs
    - CICS TORs will stop routing to AORs on the failed LPAR
    - DB2 data sharing members on surviving LPARs will perform
      peer recovery for the failed member
    - Expected: Online transactions resume within 30-90 seconds
      after peer recovery completes

 4. DECISION POINT:
    IF online transactions are processing normally
       on surviving LPARs:
       → Proceed to LPAR RECOVERY (step 5)
       → This is NOT urgent — surviving LPARs are handling load
    IF online transactions are NOT processing
       (indicates Sysplex-wide problem, not single LPAR):
       → ESCALATE to Kwame Mensah (chief architect)
       → Phone: 704-555-0137  Cell: 704-555-0138
       → DO NOT PROCEED without architect guidance

LPAR RECOVERY (5–20 minutes):
──────────────────────────────
 5. CHECK WLM headroom on surviving LPARs
    - Issue: D WLM,SCHENV=PROD_CICS
    - If CPU utilization > 90% on any surviving LPAR:
      → Consider reducing non-critical batch workload
      → Issue: $P JOBxxxxx for Tier 3 batch jobs
    - If CPU utilization < 80%:
      → Surviving LPARs have adequate headroom
      → LPAR re-IPL is not time-critical

 6. INITIATE LPAR re-IPL
    - HMC: Select the failed LPAR
    - Activate → Load → Specify load address and load parm
    - Load address: [per LPAR — see table below]
    - Load parm: [per LPAR — see table below]

    | LPAR      | Load Address | Load Parm |
    |-----------|--------------|-----------|
    | CNBPROD1  | 0A82         | 0A82DB    |
    | CNBPROD2  | 0B82         | 0B82DB    |
    | CNBPROD3  | 0C82         | 0C82DB    |
    | CNBPROD4  | 0D82         | 0D82DB    |

    Expected duration: 8-12 minutes to z/OS ready

 7. VERIFY z/OS initialization
    - Wait for message: IEA101A SPECIFY SYSTEM PARAMETERS
    - Reply with: R xx,OMVS=nn (per LPAR procedure)
    - Wait for message: $HASP426 SPECIFY OPTIONS
    - Reply with: R xx,FORMAT=yy (per JES2 PARM)
    - Wait for message: CNZ4106I TCPIP IS READY
    - Verify Sysplex join: D XCF,SYSPLEX — failed LPAR
      should now appear as active member

 8. START subsystems in order:
    a. DB2: -START DB2
        Verify: -DIS DDF → DSNL081I STATUS=STARTD
       Expected: 2-4 minutes
    b. CICS regions: SA z/OS auto-start per policy
       Verify: CEMT I TASK → tasks processing
       Expected: 3-5 minutes for all regions
    c. MQ: START QMGR
       Verify: DIS QMGR → RUNNING
       Expected: 1-2 minutes
    d. Batch initiators: $S INIT(nn)
       Verify: $D INIT(nn) → ACTIVE
       Expected: immediate

 9. VERIFY full recovery
    - D WLM,SCHENV=PROD_CICS — all LPARs balanced?
    - -DIS THREAD(*) — DB2 threads connecting from all regions?
    - Run validation transaction: CNBV001 (synthetic balance inquiry)
    - If validation succeeds → recovery complete
    - If validation fails → escalate to application team

DOCUMENTATION:
──────────────
10. Record in incident management system:
    - Time of failure detection
    - Time of workload absorption on surviving LPARs
    - Time of LPAR re-IPL initiation
    - Time of full recovery
    - Any anomalies or deviations from runbook

Runbook 2: Site Failure (GDPS Failover)

Site failure recovery is more complex because you're not just restarting a component — you're activating an entire alternate environment. The key sections (condensed):

Phase 1: Declaration (0–5 minutes)
- Confirm site failure (not just a network issue that makes the primary appear down)
- Declare disaster — this is a management decision, not a technical one
- Activate the DR command center (conference bridge, war room)

Phase 2: GDPS Failover (5–20 minutes)
- GDPS controlling system at DR site takes control
- Metro Mirror pairs are broken; DR volumes become primary
- DR site LPARs are IPLed
- Subsystems started in dependency order: TCP/IP → DB2 → CICS → MQ → batch

Phase 3: Validation (20–40 minutes)
- Run validation transactions against each critical subsystem
- Verify data currency (last transaction timestamp on DR matches expected)
- Verify external connectivity (ATM network, wire transfer network, internet banking)
- Declare DR site operational

Phase 4: Stabilization (40 minutes – 4 hours)
- Resume batch processing (assess which jobs need restart from checkpoint)
- Verify regulatory reporting capability
- Test all critical interfaces (ACH, Fedwire, SWIFT)
- Capacity assessment — DR site may run at reduced capacity

💡 Key Insight: Phase 1 is where most real DR invocations stall. The technical team knows the site is down. They're ready to execute the failover. But they can't — because "declare disaster" is a business decision that requires executive authorization. At CNB, the authorization chain is: SVP of Technology → CTO → CFO (for financial impact authorization). If any of those people is unreachable at 3 AM, the declaration stalls. Kwame fought for and got a standing authorization: if the primary site is confirmed down and the on-call SVP of Technology is unreachable within 15 minutes, the on-call director of infrastructure can authorize failover. Without that standing authorization, your RTO includes the time it takes to wake up three executives and get them on a conference bridge.

Runbook 3: Data Corruption Recovery

This is the hardest runbook to write because data corruption doesn't announce itself. By the time you know the data is corrupt, you don't know when the corruption started.

The Critical Question: When did the corruption begin?

If you know the exact timestamp (e.g., an operator ran a bad UPDATE at 14:32:07), you can recover to 14:32:06. If you don't know (e.g., a bug has been silently writing bad data for days), you have a much bigger problem.

For DB2 data corruption — known timestamp:

  1. Identify affected tablespaces
  2. Stop all activity against those tablespaces: -STOP DATABASE(dbname) SPACENAM(tsname)
  3. Determine the best recovery point:
     - Check the image copy inventory: SELECT * FROM SYSIBM.SYSCOPY WHERE DBNAME = 'dbname' AND TSNAME = 'tsname' ORDER BY TIMESTAMP DESC
     - Identify the most recent image copy before the corruption
  4. Recover: RECOVER TABLESPACE dbname.tsname TOLOGPOINT X'xxxxxxxxxxxx' — the log point of the last valid log record before the corruption (TORBA is the equivalent keyword on a non-data-sharing system)
  5. Verify data integrity: run application validation queries
  6. Resume activity: -START DATABASE(dbname) SPACENAM(tsname) ACCESS(RW)

For DB2 data corruption — unknown timestamp:

This is the nightmare scenario. You know the data is wrong but you don't know when it went wrong. Steps:

  1. Stop the bleeding: quiesce all write access to affected objects
  2. Forensics: analyze the corrupted data to estimate when corruption began
     - Compare current data to the most recent known-good FlashCopy snapshot
     - Analyze DB2 logs for suspicious UPDATE patterns
     - Check application logs for error patterns that correlate with the corruption
  3. Determine recovery strategy:
     - If corruption is recent (hours): point-in-time recovery from image copy + logs
     - If corruption is old (days): FlashCopy restore to last known-good snapshot + manual reconciliation
     - If corruption spans multiple weeks: this is a business decision, not a technical one — escalate to executive level
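When the corruption persists once introduced, the snapshot-comparison forensics in step 2 is a binary search: you can find the first corrupted snapshot in O(log n) integrity checks instead of checking every snapshot. A sketch (snapshot names and the corruption check are illustrative):

```python
def first_corrupt_index(snapshots, is_corrupt):
    """Snapshots are time-ordered and corruption, once present,
    persists. Returns the index of the earliest corrupted snapshot,
    or None if all are clean."""
    lo, hi = 0, len(snapshots)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_corrupt(snapshots[mid]):
            hi = mid          # corruption present: look earlier
        else:
            lo = mid + 1      # clean: corruption began later
    return lo if lo < len(snapshots) else None

snaps = ["FC.MON", "FC.TUE", "FC.WED", "FC.THU", "FC.FRI"]
corrupt = {"FC.THU", "FC.FRI"}            # illustrative ground truth
idx = first_corrupt_index(snaps, lambda s: s in corrupt)
print(snaps[idx], "is first corrupt; restore base:", snaps[idx - 1])
# → FC.THU is first corrupt; restore base: FC.WED
```

Each `is_corrupt` probe is expensive (mount a snapshot, run integrity checks), which is exactly why halving the search space matters when the corruption window might span weeks of daily FlashCopies.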

⚠️ Common Pitfall: The biggest mistake in data corruption recovery is recovering too much. If one table is corrupt, recover that table — not the entire tablespace, not the entire database, not the entire site. Every additional object you recover means additional downtime, additional risk, and additional data loss (you'll overwrite valid recent data with older recovered data). Surgical precision matters. This is why Lisa Tran insists on tablespace-level image copies, not database-level — they give her the granularity to recover exactly what's needed.


30.6 DR Testing: Planned Tests, Surprise Tests, Validating Recovery

I'm going to say something controversial: most DR tests are theater.

They're scheduled months in advance. Everyone knows they're coming. The DR team prepares for weeks. They shut down the primary at 8 PM on a Saturday when transaction volume is at 5% of peak. They bring up the DR site. They run a few test transactions. They declare success. They write a report that says "DR test successful — RTO achieved: 22 minutes." Management checks a compliance box. Everyone goes home.

And none of it proves that you can recover from a real disaster.

Here's why: a real disaster doesn't happen at 8 PM on a Saturday. It happens at 2 PM on the last business day of the quarter, when the batch team is running month-end close and online volume is at 130% of normal because every trader is rushing to get their positions settled. The person who runs the DR failover isn't the senior engineer who rehearsed it — it's the junior operator on the night shift because the senior engineer was in a car accident on the way to the office.

I'm not saying you shouldn't do planned DR tests. You absolutely should — they validate the mechanics of your failover procedures. But you must complement planned tests with unannounced tests and realistic simulations.

Types of DR Tests

Level 1: Tabletop Exercise (Quarterly)

No actual failover. The DR team and key stakeholders sit in a room (or on a video call) and walk through a scenario: "It's 3 PM on a Tuesday. The primary site loses power and backup generators fail. What happens?"

The value of tabletop exercises:
- They expose gaps in runbooks without the risk of a live failover
- They test the human processes — escalation, communication, decision-making
- They reveal single-person dependencies: "Oh, only Marcus knows how to restart the IMS gateway"
- They're cheap, fast, and can be done quarterly

Level 2: Component Test (Monthly)

Test individual DR components without a full site failover:
- DB2 recovery: restore a tablespace from image copy and log apply on the DR site
- CICS restart: cold start a CICS region from the DR site's CICS datasets
- Batch restart: rerun a batch job from checkpoint on the DR site
- Network failover: switch ATM network routing to DR site addresses

The value: validates that individual components work without the complexity and risk of a full site switch.

Level 3: Planned Site Failover (Semi-Annually)

Full GDPS site failover during a maintenance window:

  1. Quiesce production workload
  2. Execute GDPS failover to DR site
  3. Start all subsystems at DR site
  4. Process test transactions
  5. Run a batch cycle
  6. Verify all external interfaces
  7. Fail back to primary site
  8. Resynchronize replication

Expected duration: 8-12 hours (including failover, testing, and failback).

Level 4: Unannounced Failover (Annually)

The real test. Scheduled by the CISO or CTO, known only to a handful of executives. The production team learns about it when they get the alert. This tests:
- The actual on-call response time
- Whether the runbooks are accessible and current
- Whether the team can execute without the "A-team" (schedule it when the senior DR engineer is on vacation — deliberately)
- Real RTO under realistic conditions

⚠️ Common Pitfall: Many organizations skip Level 4 because of the risk. "We can't take down production without warning — what if something goes wrong with the failback?" This is circular reasoning. If your failback procedure is too risky to execute unannounced, then your failback procedure is too risky to rely on during a real disaster. Fix the failback procedure, don't skip the test.

Designing an Effective DR Test

Every DR test should answer specific questions. If your test doesn't have explicit success criteria, it's not a test — it's a demo.

Test Plan Template:

DR TEST PLAN
Test ID: CNB-DR-TEST-2024-Q3
Test Type: Level 3 — Planned Site Failover
Date: 2024-09-14 (Saturday)
Test Window: 20:00 – 08:00 (12 hours)
Test Director: Kwame Mensah
Backup Director: Rob Calloway

OBJECTIVES:
 1. Validate GDPS/Metro Mirror failover from Charlotte to Raleigh
 2. Measure actual RTO for Tier 0 and Tier 1 services
 3. Validate batch restart from checkpoint on DR site
 4. Test ATM network reconnection to DR site
 5. Validate regulatory reporting capability from DR site

SUCCESS CRITERIA:
 ┌────────────────────────┬──────────────────────────────┐
 │ Metric                 │ Pass/Fail Threshold          │
 ├────────────────────────┼──────────────────────────────┤
 │ Failover RTO (Tier 0)  │ < 2 minutes                  │
 │ Failover RTO (Tier 1)  │ < 15 minutes                 │
 │ Data loss (RPO)        │ Zero committed transactions  │
 │ ATM reconnection       │ < 30 minutes                 │
 │ Batch restart          │ < 60 minutes to running      │
 │ Regulatory reporting   │ Can generate within 4 hours  │
 │ Failback RTO           │ < 30 minutes                 │
 │ Post-failback data     │ Zero data loss during        │
 │ integrity              │ failback                     │
 └────────────────────────┴──────────────────────────────┘

ABORT CRITERIA:
 - Failover has not completed within 60 minutes
 - Any evidence of data loss or corruption
 - External production impact reported by customers
 - Safety concern raised by any team member

PRE-TEST CHECKLIST:
 □ All Metro Mirror pairs confirmed in-sync (consistency group check)
 □ DR site LPARs verified operational (standby mode)
 □ DR site storage verified accessible
 □ DR site network connectivity verified
 □ All test team members confirmed available
 □ Customer notification sent (if required by SLA)
 □ Regulatory notification sent (if required)
 □ Rollback procedure reviewed by test director

SCHEDULE:
 20:00 — Pre-test briefing and checklist verification
 20:30 — Quiesce primary site online workload
 20:45 — Initiate GDPS failover
 21:00 — Expected failover complete (T+15 min)
 21:00 — Begin validation testing
 22:00 — Initiate batch test cycle
 02:00 — Batch test complete, begin regulatory report test
 04:00 — All testing complete, begin failback
 04:30 — Expected failback complete
 05:00 — Resynchronization verification
 06:00 — Post-test briefing
 08:00 — Final report due to test director

What to Measure During a DR Test

Beyond the pass/fail success criteria, capture these metrics — they're the data that drives your next improvement cycle:

  1. Time to first alert. How long between the simulated failure and the first alert reaching the on-call team? If it's more than 5 minutes, your monitoring needs work.

  2. Time to first action. How long between the alert and the first recovery action? If the on-call person spent 20 minutes figuring out what happened, your runbook needs a better triage section.

  3. Deviation count. How many times did the team deviate from the runbook? Each deviation is either a runbook error (update the runbook) or a training gap (update the training).

  4. Escalation accuracy. Did the right people get called? Were phone numbers current? Did the escalation path work?

  5. Data currency verification. After failover, what's the timestamp of the most recent transaction on the DR site? This validates your actual RPO — which may differ from your theoretical RPO.

  6. Capacity at DR site. What's the CPU utilization, storage utilization, and transaction response time at the DR site? If the DR site runs at 95% CPU, you'll be in trouble during a real disaster when volume may actually increase (everyone trying to check their account status after a publicized outage).
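Most of these metrics are differences between event timestamps, so they're cheap to derive if the test director keeps a raw event log. A sketch with hypothetical timestamps (the event names and values are illustrative):

```python
from datetime import datetime

# Hypothetical event log from a Level 3 test.
events = {
    "failure_injected":  datetime(2024, 9, 14, 20, 45, 0),
    "first_alert":       datetime(2024, 9, 14, 20, 46, 30),
    "first_action":      datetime(2024, 9, 14, 20, 52, 0),
    "recovery_complete": datetime(2024, 9, 14, 21, 3, 0),
    "last_txn_on_dr":    datetime(2024, 9, 14, 20, 44, 58),  # newest txn on DR site
}

def dr_test_metrics(ev):
    """Derive the post-test review numbers from raw timestamps."""
    return {
        "time_to_first_alert":  ev["first_alert"] - ev["failure_injected"],
        "time_to_first_action": ev["first_action"] - ev["first_alert"],
        "actual_rto":           ev["recovery_complete"] - ev["failure_injected"],
        # actual RPO: committed work between last replicated txn and the failure
        "actual_rpo":           ev["failure_injected"] - ev["last_txn_on_dr"],
    }

m = dr_test_metrics(events)
print(m["actual_rto"])   # → 0:18:00
print(m["actual_rpo"])   # → 0:00:02
```

The point of computing `actual_rpo` from the last transaction found on the DR site — rather than quoting the replication technology's theoretical RPO — is that it's the number an auditor (and a real disaster) will hold you to.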

The Post-Test Review

The most valuable part of any DR test is the post-test review. Not the congratulatory "great job, team" session — the honest, sometimes painful, "what went wrong and how do we fix it" session.

At CNB, Kwame runs the post-test review with three rules:

  1. No blame. If someone made an error during the test, the error is a runbook problem or a training problem, not a people problem. The whole point of testing is to find errors before they matter.

  2. Every deviation is documented. Every time someone went off-script, it's recorded with the reason. Some deviations are improvements (the operator found a faster way) — those get incorporated into the runbook. Some are errors — those get addressed with additional training or runbook clarification.

  3. Action items have owners and deadlines. "We should update the runbook" is not an action item. "Lisa Tran will update the DB2 recovery section of CNB-DR-RB-003 by October 15" is an action item.

📊 From the Field — CNB's DR Test History:
- 2020 Q3: First Level 3 test. RTO was 47 minutes (target: 15). DB2 peer recovery took 22 minutes because the lock structure wasn't duplexed. Action: Duplexed the lock structure.
- 2021 Q1: Level 3 test. RTO improved to 19 minutes. CICS warm start failed because the DR site's CICS datasets were from a different maintenance level. Action: Added CICS maintenance synchronization to DR readiness checklist.
- 2021 Q3: Level 3 test. RTO: 14 minutes. First test to meet the target. MQ channel recovery took 11 minutes — channels had to be manually restarted. Action: Automated MQ channel startup in GDPS policy.
- 2022 Q1: First Level 4 (unannounced) test. On-call operator took 12 minutes to respond to the alert. RTO from first action: 16 minutes. Total: 28 minutes. Action: Revised alert escalation to include SMS and automated phone call (not just email).
- 2023 Q3: Level 3 test. RTO: 11 minutes. Fiber cut incident (the one from the chapter opening) occurred the same week, giving the team unexpected real-world data. Action: Revised DR risk communication procedures.


30.7 DR Governance and Compliance

DR architecture doesn't exist in a vacuum. It exists within a regulatory and governance framework that varies by industry, jurisdiction, and organizational risk appetite.

Regulatory Requirements by Industry

Banking (US):
- OCC Bulletin 2023-17 (Third-Party Risk Management): Requires banks to ensure critical third-party services have adequate DR capabilities.
- FFIEC Business Continuity Planning Handbook: The primary guidance for US bank DR. Requires documented BIA (Business Impact Analysis), BCP (Business Continuity Plan), testing, and training. Updated in 2019 to address cyber resilience.
- Reg E (Electronic Fund Transfers): Doesn't specify DR directly but creates implicit RTO requirements — if consumers can't access their funds, Reg E liability exposure begins immediately.
- Key requirement: OCC examiners expect Tier-1 banks to demonstrate active-active or rapid failover capability. A 24-hour RTO for core banking would likely result in a Matter Requiring Attention (MRA) at a large bank.

Healthcare (US):
- HIPAA Security Rule (§164.308(a)(7)): Requires a contingency plan including data backup plan, disaster recovery plan, and emergency mode operation plan. Testing and revision are required but frequency isn't specified.
- HIPAA doesn't specify RTO/RPO — but CMS and state regulators expect healthcare organizations to justify their recovery objectives based on patient safety risk.
- Ahmad Rashidi's approach at Pinnacle Health: "We document our RTO/RPO rationale. If an auditor questions why claims adjudication has a 4-hour RTO, we show them the analysis that says no patient care decision depends on claims adjudication within 4 hours. Eligibility verification — that's different. That's Tier 1."

Federal Government (US):
- FISMA (Federal Information Security Modernization Act): Requires agencies to implement contingency plans per NIST SP 800-34.
- NIST SP 800-34 (Contingency Planning Guide): Detailed guidance including BIA methodology, recovery strategy selection, plan testing, and maintenance. The most prescriptive federal DR guidance available.
- FedRAMP: For cloud components, DR requirements are embedded in FedRAMP control baselines (CP control family).
- Sandra Chen at Federal Benefits: "We have to satisfy FISMA, NIST, and our agency-specific security plan. The NIST SP 800-34 framework is actually well-designed — it forces you to start with the BIA, which is the right starting point. The problem is that the agency hasn't updated the BIA since 2018, and the business processes have changed significantly."

Financial Services (EU):
- DORA (Digital Operational Resilience Act): Effective January 2025, DORA requires EU financial entities to maintain and test ICT (Information and Communication Technology) business continuity and DR plans. Includes mandatory annual DR testing and reporting to regulators.
- ECB (European Central Bank) guidance: Expects significant institutions to demonstrate geographic diversification of DR capability.

DR Governance Framework

At a mature organization, DR governance includes:

1. Business Impact Analysis (BIA) — Reviewed Annually
The BIA identifies critical business processes, their dependencies on IT systems, and the financial/operational/regulatory impact of their unavailability. The BIA is the input to RTO/RPO determination.

2. DR Plan Document — Updated After Every Test and Change
The master DR plan document includes:
- Scope (which systems are covered)
- Recovery objectives (RTO/RPO/RLO by tier)
- Architecture description (GDPS configuration, network, storage)
- Runbooks for each failure scenario
- Roles and responsibilities (who does what)
- Communication plan (who gets notified, how, in what order)
- Escalation procedures
- Vendor contacts (IBM support, storage vendor, network provider)

3. DR Test Calendar — Minimum Semi-Annual Full Test
Schedule of all DR testing activities: tabletops, component tests, planned failovers, unannounced failovers.

4. DR Readiness Dashboard — Monitored Continuously
Real-time indicators of DR readiness:
- Replication status (all Metro Mirror/XRC sessions in-sync?)
- DR site LPAR status (standing by? healthy?)
- Backup currency (last successful image copy, last successful FlashCopy)
- Runbook currency (when was each runbook last reviewed?)
- Team readiness (who is on-call? are contact lists current?)

5. DR Audit Trail — For Regulators
Documentation that proves your DR program is active:
- BIA revision history
- DR plan revision history
- Test results and post-test action items
- Action item completion tracking
- Training records (who was trained, when, on what)

💡 Key Insight: The governance structure is what transforms DR from a one-time project into an ongoing capability. Without governance, DR plans decay. Runbooks become stale. Tested procedures become untested procedures as systems change. The most common finding in DR audit failures isn't "they don't have a DR plan" — it's "they have a DR plan from 2019 that doesn't reflect the systems they're running in 2024."

Federal Benefits' Compliance Challenge

Sandra Chen's team at Federal Benefits Administration faces a unique DR challenge: their production system spans multiple generations of technology.

The core benefits calculation runs on IMS — a 40-year-old codebase with 15 million lines of COBOL. Marcus Whitfield is one of three people who understand the IMS recovery procedures. The newer eligibility verification system runs on CICS and DB2. The modernized web portal runs on z/OS Connect with API Gateway.

Each layer has different DR characteristics:

| Component | Technology | DR Method | RTO | RPO | Key Risk |
| --- | --- | --- | --- | --- | --- |
| Benefits calculation | IMS/COBOL | IMS journal-based recovery + GDPS | 4 hours | < 15 min | Marcus Whitfield retiring — knowledge transfer incomplete |
| Eligibility verification | CICS/DB2 | GDPS/Metro Mirror failover | 15 min | Zero | Dependency on IMS data for eligibility rules |
| Web portal | z/OS Connect | Automated restart | 5 min | Zero | Depends on both IMS and CICS backends |
| Batch reporting | JCL/COBOL | Restart from checkpoint | 8 hours | Last checkpoint | Complex job dependencies — 2,000+ jobs in nightly cycle |

The cascading dependency is the problem: the web portal can't work without CICS, and CICS can't make eligibility decisions without IMS data. Recovery order matters: IMS first, then DB2/CICS, then z/OS Connect.
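That recovery order falls out of the dependency graph mechanically. A sketch using Python's standard-library `graphlib` — the subsystem names mirror the table above, but the dependency map itself is an illustrative assumption, not FBA's actual configuration:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each subsystem maps to what must be up first.
deps = {
    "IMS": set(),                       # benefits data: no upstream dependency
    "DB2": {"IMS"},                     # eligibility rules read IMS data
    "CICS": {"DB2"},                    # eligibility transactions need DB2
    "z/OS Connect": {"IMS", "CICS"},    # portal APIs call both backends
}

# static_order() yields every node after all of its prerequisites.
recovery_order = list(TopologicalSorter(deps).static_order())
print(recovery_order)   # IMS first, z/OS Connect last
```

Encoding the order as data rather than prose means a runbook generator (or Chapter 31's automation) can recompute it when a dependency changes, instead of someone remembering to re-edit every affected runbook.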

"Our biggest DR risk isn't technology," Sandra told me. "It's people. If we have to invoke DR in two years when Marcus has retired and we haven't finished the knowledge transfer for IMS recovery procedures, we're in serious trouble. The runbook exists, but it was written by Marcus. The person executing it in a disaster won't have Marcus to call."

This is why Chapter 40 (Knowledge Transfer) isn't an afterthought — it's a DR requirement.

🔗 Cross-Reference: The knowledge transfer urgency that Sandra describes here is the driving force behind Part VIII's capstone project and Chapter 40's knowledge transfer framework. DR isn't just about technology and procedures — it's about ensuring the people who will execute those procedures have the knowledge to do so. If your DR plan depends on a single person's expertise, your DR plan has a single point of failure.


30.8 Progressive Project: DR Design for the HA Banking System

Your HA Banking Transaction Processing System needs a DR architecture. In this project checkpoint, you'll design the GDPS configuration, analyze the failure domains, and create a DR test plan.

Use the requirements from Chapter 1's project checkpoint as your baseline:
- 100 million transactions per day
- 99.999% availability (five nines)
- 2,000 concurrent online users
- 4-hour nightly batch window
- Regulatory: no single point of failure

Deliverables for this checkpoint:

  1. RTO/RPO Matrix — Classify your system's business processes into tiers and assign RTO/RPO/RLO to each.

  2. GDPS Configuration Design — Select the GDPS technologies for each tier. Design the storage replication topology (how many sites, what distance, what replication method).

  3. Failure Domain Analysis — Map every failure domain in your architecture (from Level 1 through Level 7). For each, document the failure impact and the designed survivability mechanism.

  4. DR Test Plan — Design a Level 3 (planned site failover) test with specific success criteria, schedule, and measurement plan.
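For deliverable 1, a useful sanity check is whether your Tier-0 RTO even fits inside the five-nines budget. A hedged sketch — the tier names, processes, and RTO values are illustrative assumptions, not prescribed answers:

```python
# Five-nines availability allows (1 - 0.99999) of the year as downtime:
# roughly 5.26 minutes. Your BIA sets the real tier values.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

downtime_budget_min = MINUTES_PER_YEAR * (1 - 0.99999)

tiers = {
    "tier0": {"process": "payment authorization", "rto_min": 0.05, "rpo": "zero"},
    "tier1": {"process": "online banking",        "rto_min": 15,   "rpo": "zero"},
    "tier2": {"process": "batch reporting",       "rto_min": 480,  "rpo": "last checkpoint"},
}

# A single Tier-0 failover per year must fit inside the annual budget.
# Lower tiers may exceed it if they are outside the five-nines scope.
assert tiers["tier0"]["rto_min"] <= downtime_budget_min
print(f"annual downtime budget: {downtime_budget_min:.2f} minutes")
```

The arithmetic is the point: a 15-minute RTO is incompatible with five nines for any process inside the availability scope, which is why Tier-0 processes need HyperSwap-class (seconds, not minutes) failover.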

Refer to CNB's architecture throughout this chapter as your reference model. Your system processes 1/5 of CNB's volume — your DR architecture should be proportionally simpler but follow the same principles.

Project Connection: This DR design connects backward to your batch restart strategy (Chapter 24), your CICS region topology (Chapter 13), and your security architecture (Chapter 28). It connects forward to Chapter 31 (Operational Automation), where you'll build the automated recovery procedures that your runbooks reference, and Chapter 37 (Hybrid Architecture), where DR design for cloud-integrated components adds complexity.


Chapter Summary

DR architecture is not about technology — it's about recovery objectives. Start with the business: which processes are critical, how long can they be down, and how much data can you afford to lose? Work backward from those requirements to the technology that satisfies them.

GDPS provides the technology stack for z/OS DR: HyperSwap for near-zero RTO storage failover, Metro Mirror for zero-RPO site failover, XRC for long-distance asynchronous replication, and Global Mirror for third-site protection. Most large enterprises stack multiple technologies in a multi-target configuration.

Sysplex failure domain analysis is the foundation of survivability design. Understand every component that can fail independently — from application instances to entire sites — and design containment at each level. The N-1 rule applies everywhere: every layer of the architecture should survive the loss of one component at that layer.

DR runbooks must be written for the worst case: the person executing them at 3 AM who has never done it in production. Clear steps, verification at each stage, decision points for branches, estimated times, and escalation paths.

DR testing is not an annual checkbox — it's a continuous improvement program. Planned tests validate mechanics. Unannounced tests validate readiness. Post-test reviews drive improvements. The governance framework keeps the whole program from decaying.

And remember what Sandra Chen said: your biggest DR risk might not be technology at all. It might be the person who's retiring in two years with all the recovery knowledge in their head. DR planning is people planning.


Spaced Review Questions

From Chapter 1 (Parallel Sysplex):

  1. In a four-LPAR Parallel Sysplex with DB2 data sharing, if one LPAR fails, describe the sequence of events that allows transactions to continue on the surviving three LPARs. How does the coupling facility participate in this recovery? (If you can't answer this fluently, review Chapter 1, Section 1.4.)

  2. Explain why a coupling facility failure is potentially more severe than an LPAR failure, even though the coupling facility doesn't run any application workload. (Connects to this chapter's Section 30.4, Level 4.)

From Chapter 13 (CICS Architecture):

  1. You have four AOR regions: two on LPAR1 and two on LPAR2. If LPAR1 fails, describe how the TOR routes transactions to the surviving AORs on LPAR2. What CICS mechanism enables this routing, and what happens to in-flight transactions on the failed AORs? (If this isn't automatic for you, review Chapter 13, Section 13.3.)

  2. Why is CICS warm start (rather than cold start) important during DR failover? What data would be lost in a cold start that warm start preserves? (Connects to this chapter's runbooks in Section 30.5.)

From Chapter 23 (Batch Architecture):

  1. After a site failover, your nightly batch cycle was interrupted at step 147 of 312. The first 146 steps completed and committed their work, which was replicated to the DR site. How do you determine which batch jobs need to be rerun from checkpoint versus which can be skipped? What information from Chapter 23's scheduling dependency graph do you need? (If unclear, review Chapter 23, Sections 23.2 and 23.5.)

Next chapter: Chapter 31 (Operational Automation) — where you'll build the REXX-based automation that executes many of the recovery procedures described in this chapter's runbooks. The runbook says "start DB2." Automation makes it happen without a human typing the command.