Chapter 30 Key Takeaways: Disaster Recovery and Business Continuity
Threshold Concept
DR is about business recovery objectives, not technology. DR design starts with business requirements — RTO (recovery time objective), RPO (recovery point objective), RLO (recovery level objective) — and works backward to technology choices. The question is not "should we use Metro Mirror or XRC?" The question is "how long can the business be down, and how much data can it lose?" The answer to that question selects the technology.
The RTO/RPO Framework
Business Process Identification
→ Criticality Tiering (Tier 0 through Tier 3)
→ RTO/RPO/RLO Assignment per Tier
→ Technology Selection per Tier
→ Architecture Design
→ Runbook Development
→ Test Plan
Key principle: Never start with technology. Always start with the business.
| Tier | RTO | RPO | Technology | Annual Cost (relative) |
|---|---|---|---|---|
| Tier 0 | Near-zero | Zero | GDPS/HyperSwap + active-active | $$$$ |
| Tier 1 | < 15 min | Zero | GDPS/Metro Mirror + automated failover | $$$ |
| Tier 2 | < 4 hours | < 15 min | GDPS/XRC or journal recovery | $$ |
| Tier 3 | < 24 hours | < 4 hours | Tape backup + manual recovery | $ |
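The work-backward principle in the table can be sketched as a lookup: given what the business can tolerate, pick the cheapest tier whose RTO and RPO limits both fit. This is an illustrative sketch using the thresholds from the table above, not a real GDPS planning tool.

```python
# Illustrative: map business RTO/RPO tolerances (in minutes) to the
# cheapest qualifying tier. Tiers are ordered cheapest-first, so the
# first tier whose limits fit the tolerance wins.
# (rto_limit_min, rpo_limit_min, tier, technology) -- from the tier table
TIERS = [
    (24 * 60, 4 * 60, "Tier 3", "Tape backup + manual recovery"),
    (4 * 60,  15,     "Tier 2", "GDPS/XRC or journal recovery"),
    (15,      0,      "Tier 1", "GDPS/Metro Mirror + automated failover"),
    (0,       0,      "Tier 0", "GDPS/HyperSwap + active-active"),
]

def select_tier(rto_min: float, rpo_min: float) -> tuple[str, str]:
    """Return the cheapest (tier, technology) meeting both objectives.

    rto_min / rpo_min are what the business tolerates; a tier qualifies
    if its guaranteed limits are no looser than that tolerance.
    """
    for rto_limit, rpo_limit, tier, tech in TIERS:
        if rto_min >= rto_limit and rpo_min >= rpo_limit:
            return tier, tech
    _, _, tier, tech = TIERS[-1]   # nothing looser fits: top tier
    return tier, tech

print(select_tier(240, 15))  # tolerates 4 h outage, 15 min data loss
print(select_tier(0, 0))     # tolerates nothing: Tier 0
```

Note the direction of the comparison: a cheaper tier only qualifies when the business can tolerate *at least* that tier's worst-case RTO and RPO — which is the "start with the business" principle expressed in code.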
GDPS Technology Comparison
| Technology | Replication | RPO | RTO | Distance | Write Impact |
|---|---|---|---|---|---|
| HyperSwap | Synchronous (Metro Mirror) | Zero | Near-zero (< 2 sec) | < 100 km | 0.3-1 ms/write |
| Metro Mirror Failover | Synchronous (PPRC) | Zero | 10-30 min (IPL required) | < 100 km | 0.3-1 ms/write |
| XRC | Asynchronous | 2-30 seconds | 30-60 min | Unlimited | None |
| Global Mirror | Asynchronous (disk-based) | Minutes | 60+ min | Unlimited | None |
Stacking strategy: Use HyperSwap for storage failure + Metro Mirror for site failure + XRC for regional disaster. Each addresses a different failure scenario at a different distance.
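Because the asynchronous legs of a stacked design (XRC, Global Mirror) have a nonzero RPO, their observed lag needs continuous comparison against the RPO budget. A minimal sketch, assuming hypothetical leg names and lag figures — in practice the lag would come from GDPS and storage monitoring, not a hard-coded dict:

```python
# Per-leg RPO budgets in seconds, following the comparison table above.
RPO_BUDGET_SEC = {
    "metro_mirror": 0,   # synchronous: zero data loss by design
    "xrc": 30,           # asynchronous: seconds of lag tolerated
}

def rpo_violations(observed_lag_sec: dict[str, float]) -> list[str]:
    """Return the replication legs whose current lag exceeds budget."""
    return [leg for leg, lag in observed_lag_sec.items()
            if lag > RPO_BUDGET_SEC.get(leg, 0)]

print(rpo_violations({"metro_mirror": 0, "xrc": 45}))  # xrc over budget
```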
Failure Domain Hierarchy
Level 7: Data Corruption ← MOST DANGEROUS (replication propagates it)
Level 6: Site Failure ← GDPS site failover
Level 5: Storage Subsystem ← GDPS/HyperSwap
Level 4: Coupling Facility ← Duplexed structures, alternate CF
Level 3: LPAR Failure ← Sysplex workload redistribution
Level 2: Subsystem Failure ← DB2 peer recovery, CICS region failover
Level 1: Application Failure ← Transaction rollback, checkpoint/restart
The N-1 Rule: Every layer of the architecture must survive the loss of one component at that layer. Four LPARs → any three carry full load. Two CFs → critical structures duplexed. Two storage frames → all volumes mirrored.
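The N-1 rule reduces to a simple capacity check: remove the largest component at a layer and verify the survivors still cover the full load. Capacities below are illustrative fractions of peak load, not real sizing figures:

```python
def survives_n_minus_1(capacities: list[float], required_load: float) -> bool:
    """True if losing any single component still leaves enough capacity.

    Worst case is losing the largest component, so checking that case
    covers all single-component failures at this layer.
    """
    return sum(capacities) - max(capacities) >= required_load

# Four LPARs, each sized to carry just over a third of peak load:
lpars = [0.34, 0.34, 0.34, 0.34]
print(survives_n_minus_1(lpars, 1.0))   # any three carry the full load

# Two components each sized at half of peak fail the rule:
print(survives_n_minus_1([0.5, 0.5], 1.0))
```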
Data corruption is the special case: Replication makes it worse, not better. Defense requires FlashCopy snapshots, immutable backups, air-gapped copies, and detection mechanisms.
Runbook Essentials
A good runbook has five characteristics:
- Written for 3 AM execution — clear, unambiguous, numbered steps
- Verification after every major action — "Start DB2. Verify: -DIS DDF → STARTD"
- Decision points — "If X succeeds → step 7. If X fails with code Y → Appendix C"
- Estimated times — "Step 5: IPL LPAR. Expected: 8-12 minutes"
- Escalation paths — names, phone numbers, roles — not "contact the DBA"
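The five characteristics above map naturally onto a structured runbook-step record: each step carries its action, verification, time estimate, and decision points as data rather than prose. A sketch, with hypothetical commands and wording as placeholders:

```python
from dataclasses import dataclass

@dataclass
class RunbookStep:
    number: int
    action: str                        # clear, unambiguous instruction
    verification: str                  # what to confirm after the action
    expected_minutes: tuple[int, int]  # (low, high) time estimate
    on_success: str                    # decision point: where next
    on_failure: str                    # decision point: what instead
    escalation: str = "DR coordinator on-call (see contact appendix)"

step5 = RunbookStep(
    number=5,
    action="IPL the LPAR from the alternate SYSRES",
    verification="Console shows IPL complete and subsystems initializing",
    expected_minutes=(8, 12),
    on_success="Continue to step 6",
    on_failure="If wait state, go to Appendix C; otherwise escalate",
)
print(f"Step {step5.number}: {step5.action} "
      f"(expected {step5.expected_minutes[0]}-{step5.expected_minutes[1]} min)")
```

Keeping steps as data also makes the runbook testable: a tabletop exercise can walk the step list mechanically, and every step missing a verification or a failure branch fails the review.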
DR Testing Levels
| Level | Type | Frequency | Value |
|---|---|---|---|
| 1 | Tabletop exercise | Quarterly | Exposes gaps without risk |
| 2 | Component test | Monthly | Validates individual recovery mechanisms |
| 3 | Planned site failover | Semi-annually | Validates end-to-end failover mechanics |
| 4 | Unannounced failover | Annually | Validates real-world readiness |
Key insight: Most DR tests are theater. A planned test at 8 PM on Saturday with the A-team assembled proves the mechanics work — it doesn't prove you can recover from a disaster at 2 PM on a business day with whoever happens to be on-call. Level 4 tests close this gap.
Post-test review rules: No blame. Every deviation documented. Action items have owners and deadlines.
DR Governance Components
- Business Impact Analysis — reviewed annually
- DR Plan Document — updated after every test and significant change
- DR Test Calendar — minimum semi-annual full test
- DR Readiness Dashboard — monitored continuously (replication status, backup currency, team readiness)
- DR Audit Trail — for regulators (revision history, test results, action items, training records)
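The readiness dashboard boils down to a set of named boolean probes with a single red/green rollup. A minimal sketch with hypothetical check names — in reality each probe would query replication status, the backup catalog, or the on-call roster:

```python
def readiness_report(checks: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return (overall_green, failing_check_names) for the dashboard."""
    failing = [name for name, ok in checks.items() if not ok]
    return (not failing, failing)

ready, failing = readiness_report({
    "replication_in_sync": True,
    "last_backup_within_24h": True,
    "oncall_roster_current": False,   # one stale item turns the board red
})
print(ready, failing)
```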
Critical Warnings
🔴 Replication ≠ Backup. Synchronous replication protects against hardware failure. It propagates data corruption. You need both replication AND point-in-time backup capability.
🔴 DR plans decay. A DR plan that isn't tested, updated, and governed will become useless within 12-18 months as the production environment changes. The governance framework prevents decay.
🔴 People are failure domains too. If your DR recovery depends on one person's knowledge, your DR plan has a single point of failure. Cross-training and documentation are DR requirements, not HR nice-to-haves.
🔴 The declaration bottleneck. In most organizations, the decision to invoke DR requires executive authorization. If that authorization chain is slow, your actual RTO includes the time to wake up executives and get approval. Pre-authorize DR invocation for specific scenarios.
Connections to Other Chapters
| Chapter | Connection to DR |
|---|---|
| Ch 1 (Parallel Sysplex) | Sysplex architecture defines your within-site failure domains |
| Ch 5 (WLM) | WLM adjusts workload after LPAR failure — capacity headroom required |
| Ch 8 (DB2 Locking) | DB2 peer recovery in data sharing depends on lock structure in CF |
| Ch 9 (DB2 Utilities) | Image copy + log apply = your data corruption recovery mechanism |
| Ch 13 (CICS) | Region topology determines CICS survivability during partial failures |
| Ch 23 (Batch) | Batch dependency graph becomes the batch restart plan after failover |
| Ch 24 (Checkpoint/Restart) | Checkpoint data is what makes batch restart possible after DR |
| Ch 28 (Security) | Security controls must be maintained during and after failover |
| Ch 29 (Capacity) | DR site capacity determines RLO — how much you can do after failover |
| Ch 31 (Automation) | Automation executes the recovery procedures in your runbooks |
| Ch 37 (Hybrid Architecture) | Cloud components add new DR complexity — hybrid DR design |