Chapter 30 Key Takeaways: Disaster Recovery and Business Continuity

Threshold Concept

DR is about recovery time objectives, not technology. DR design starts with business requirements — RTO, RPO, RLO — and works backward to technology choices. The question is not "should we use Metro Mirror or XRC?" The question is "how long can the business be down, and how much data can it lose?" The answer to that question selects the technology.


The RTO/RPO Framework

Business Process Identification
  → Criticality Tiering (Tier 0 through Tier 3)
    → RTO/RPO/RLO Assignment per Tier
      → Technology Selection per Tier
        → Architecture Design
          → Runbook Development
            → Test Plan

Key principle: Never start with technology. Always start with the business.

| Tier | RTO | RPO | Technology | Annual Cost (relative) |
| --- | --- | --- | --- | --- |
| Tier 0 | Near-zero | Zero | GDPS/HyperSwap + active-active | $$$$ |
| Tier 1 | < 15 min | Zero | GDPS/Metro Mirror + automated failover | $$$ |
| Tier 2 | < 4 hours | < 15 min | GDPS/XRC or journal recovery | $$ |
| Tier 3 | < 24 hours | < 4 hours | Tape backup + manual recovery | $ |
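The tier-selection step can be sketched as a lookup that walks from the cheapest tier upward. The thresholds come from the tier table; the function name and minute-based units are illustrative assumptions, not part of any GDPS tooling.

```python
# Sketch: pick the cheapest DR tier whose delivered RTO/RPO meet the
# business requirement. Thresholds mirror the tier table; names and
# units are hypothetical.

def select_tier(required_rto_min: float, required_rpo_min: float) -> str:
    # (tier name, delivered RTO, delivered RPO), all in minutes,
    # ordered cheapest first.
    tiers = [
        ("Tier 3", 24 * 60, 4 * 60),  # tape backup + manual recovery
        ("Tier 2", 4 * 60, 15),       # GDPS/XRC or journal recovery
        ("Tier 1", 15, 0),            # GDPS/Metro Mirror + automated failover
        ("Tier 0", 0, 0),             # GDPS/HyperSwap + active-active
    ]
    # Take the first (cheapest) tier that satisfies both objectives.
    for name, delivered_rto, delivered_rpo in tiers:
        if delivered_rto <= required_rto_min and delivered_rpo <= required_rpo_min:
            return name
    return "Tier 0"  # Tier 0 delivers zero/zero, so the loop always matches
```

A requirement of RTO 4 hours / RPO 15 minutes lands on Tier 2; anything stricter than a 15-minute RTO forces Tier 1 or Tier 0.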

GDPS Technology Comparison

| Technology | Replication | RPO | RTO | Distance | Write Impact |
| --- | --- | --- | --- | --- | --- |
| HyperSwap | Synchronous (Metro Mirror) | Zero | Near-zero (< 2 sec) | < 100 km | 0.3-1 ms/write |
| Metro Mirror Failover | Synchronous (PPRC) | Zero | 10-30 min (IPL required) | < 100 km | 0.3-1 ms/write |
| XRC | Asynchronous | 2-30 seconds | 30-60 min | Unlimited | None |
| Global Mirror | Asynchronous (disk-based) | Minutes | 60+ min | Unlimited | None |

Stacking strategy: Use HyperSwap for storage failure + Metro Mirror for site failure + XRC for regional disaster. Each addresses a different failure scenario at a different distance.
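The distance limit on the synchronous options comes from physics: each mirrored write must round-trip to the secondary before it completes. A back-of-envelope sketch, assuming roughly 5 µs per km one way in fiber and ignoring protocol and switching overhead:

```python
# Rough estimate of the write penalty synchronous mirroring adds.
# 5 us/km one-way fiber latency is an assumed rule of thumb.

def sync_write_penalty_ms(distance_km: float, us_per_km: float = 5.0) -> float:
    """Minimum added latency per write: one round trip to the secondary."""
    return 2 * distance_km * us_per_km / 1000.0
```

At the 100 km limit this works out to about 1 ms per write, consistent with the 0.3-1 ms/write figure in the comparison table; asynchronous XRC and Global Mirror avoid the penalty by acknowledging writes locally, which is why their distance is unlimited.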


Failure Domain Hierarchy

Level 7: Data Corruption      ← MOST DANGEROUS (replication propagates it)
Level 6: Site Failure         ← GDPS site failover
Level 5: Storage Subsystem    ← GDPS/HyperSwap
Level 4: Coupling Facility    ← Duplexed structures, alternate CF
Level 3: LPAR Failure         ← Sysplex workload redistribution
Level 2: Subsystem Failure    ← DB2 peer recovery, CICS region failover
Level 1: Application Failure  ← Transaction rollback, checkpoint/restart

The N-1 Rule: Every layer of the architecture must survive the loss of one component at that layer. Four LPARs → any three carry full load. Two CFs → critical structures duplexed. Two storage frames → all volumes mirrored.
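The N-1 rule can be expressed as an automated check, assuming per-component capacity figures are available. The numbers below are invented for illustration:

```python
# N-1 check: after losing the single largest component at a layer,
# the survivors must still carry peak load. Capacities are illustrative.

def survives_n_minus_1(capacities: list[float], peak_load: float) -> bool:
    """True if every single-component loss leaves enough capacity."""
    # Worst single loss is the largest component.
    return sum(capacities) - max(capacities) >= peak_load

# Four equal LPARs sized so any three carry the full load:
lpars = [400.0, 400.0, 400.0, 400.0]  # hypothetical capacity per LPAR
assert survives_n_minus_1(lpars, peak_load=1200.0)
# Sized with no headroom, the same layer fails the check:
assert not survives_n_minus_1(lpars, peak_load=1300.0)
```

The same check applies at every layer of the hierarchy: CFs, storage frames, network links.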

Data corruption is the special case: Replication makes it worse, not better. Defense requires FlashCopy snapshots, immutable backups, air-gapped copies, and detection mechanisms.


Runbook Essentials

A good runbook has five characteristics:

  1. Written for 3 AM execution — clear, unambiguous, numbered steps
  2. Verification after every major action — "Start DB2. Verify: -DIS DDF shows STATUS=STARTD"
  3. Decision points — "If X succeeds → step 7. If X fails with code Y → Appendix C"
  4. Estimated times — "Step 5: IPL LPAR. Expected: 8-12 minutes"
  5. Escalation paths — names, phone numbers, roles — not "contact the DBA"
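One way to make those five characteristics enforceable is to treat runbook steps as structured data rather than prose. The fields below are a hypothetical sketch; the example step paraphrases the list above, and the escalation contact is invented:

```python
from dataclasses import dataclass

@dataclass
class RunbookStep:
    number: int
    action: str             # imperative, unambiguous instruction
    verify: str             # command or observation proving success
    expected_minutes: tuple[int, int]  # (low, high) time estimate
    on_success: int         # next step number
    on_failure: str         # explicit branch, e.g. an appendix reference
    escalation: str         # named contact, never just a role

step = RunbookStep(
    number=5,
    action="IPL the recovery LPAR",
    verify="SYSLOG shows IPL complete",
    expected_minutes=(8, 12),
    on_success=6,
    on_failure="Appendix C",
    escalation="J. Smith, sysprog on-call, +1-555-0100",  # hypothetical
)
```

A structure like this lets a review script flag any step with an empty verify field or a role-only escalation before the runbook is ever needed at 3 AM.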

DR Testing Levels

| Level | Type | Frequency | Value |
| --- | --- | --- | --- |
| 1 | Tabletop exercise | Quarterly | Exposes gaps without risk |
| 2 | Component test | Monthly | Validates individual recovery mechanisms |
| 3 | Planned site failover | Semi-annually | Validates end-to-end failover mechanics |
| 4 | Unannounced failover | Annually | Validates real-world readiness |

Key insight: Most DR tests are theater. A planned test at 8 PM on Saturday with the A-team assembled proves the mechanics work — it doesn't prove you can recover from a disaster at 2 PM on a business day with whoever happens to be on-call. Level 4 tests close this gap.

Post-test review rules: No blame. Every deviation documented. Action items have owners and deadlines.


DR Governance Components

  1. Business Impact Analysis — reviewed annually
  2. DR Plan Document — updated after every test and significant change
  3. DR Test Calendar — minimum semi-annual full test
  4. DR Readiness Dashboard — monitored continuously (replication status, backup currency, team readiness)
  5. DR Audit Trail — for regulators (revision history, test results, action items, training records)
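The readiness dashboard (item 4) reduces to a handful of comparisons against the tier's objectives. Field names and thresholds in this sketch are assumptions, not from any monitoring product:

```python
# Sketch: derive DR dashboard alerts from current status vs. objectives.
# All field names and thresholds are illustrative.

def readiness_alerts(status: dict, rpo_minutes: float,
                     max_backup_age_hours: float) -> list[str]:
    alerts = []
    if status["replication_lag_minutes"] > rpo_minutes:
        alerts.append("replication lag exceeds RPO")
    if status["last_backup_age_hours"] > max_backup_age_hours:
        alerts.append("backup currency violated")
    if not status["oncall_dr_trained"]:
        alerts.append("on-call staff lacks current DR training")
    return alerts
```

An empty list means green; any entry is a red dashboard item that needs an owner and a deadline, feeding the same post-review discipline as a test.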

Critical Warnings

🔴 Replication ≠ Backup. Synchronous replication protects against hardware failure. It propagates data corruption. You need both replication AND point-in-time backup capability.

🔴 DR plans decay. A DR plan that isn't tested, updated, and governed will become useless within 12-18 months as the production environment changes. The governance framework prevents decay.

🔴 People are failure domains too. If your DR recovery depends on one person's knowledge, your DR plan has a single point of failure. Cross-training and documentation are DR requirements, not HR nice-to-haves.

🔴 The declaration bottleneck. In most organizations, the decision to invoke DR requires executive authorization. If that authorization chain is slow, your actual RTO includes the time to wake up executives and get approval. Pre-authorize DR invocation for specific scenarios.


Connections to Other Chapters

| Chapter | Connection to DR |
| --- | --- |
| Ch 1 (Parallel Sysplex) | Sysplex architecture defines your within-site failure domains |
| Ch 5 (WLM) | WLM adjusts workload after LPAR failure — capacity headroom required |
| Ch 8 (DB2 Locking) | DB2 peer recovery in data sharing depends on lock structure in CF |
| Ch 9 (DB2 Utilities) | Image copy + log apply = your data corruption recovery mechanism |
| Ch 13 (CICS) | Region topology determines CICS survivability during partial failures |
| Ch 23 (Batch) | Batch dependency graph becomes the batch restart plan after failover |
| Ch 24 (Checkpoint/Restart) | Checkpoint data is what makes batch restart possible after DR |
| Ch 28 (Security) | Security controls must be maintained during and after failover |
| Ch 29 (Capacity) | DR site capacity determines RLO — how much you can do after failover |
| Ch 31 (Automation) | Automation executes the recovery procedures in your runbooks |
| Ch 37 (Hybrid Architecture) | Cloud components add new DR complexity — hybrid DR design |