Chapter 30 Key Takeaways: Disaster Recovery and Business Continuity

Threshold Concept

DR is about recovery time objectives, not technology. DR design starts with business requirements — RTO, RPO, RLO — and works backward to technology choices. The question is not "should we use Metro Mirror or XRC?" The question is "how long can the business be down, and how much data can it lose?" The answer to that question selects the technology.


The RTO/RPO Framework

Business Process Identification
  → Criticality Tiering (Tier 0 through Tier 3)
    → RTO/RPO/RLO Assignment per Tier
      → Technology Selection per Tier
        → Architecture Design
          → Runbook Development
            → Test Plan

Key principle: Never start with technology. Always start with the business.

| Tier | RTO | RPO | Technology | Annual Cost (relative) |
| --- | --- | --- | --- | --- |
| Tier 0 | Near-zero | Zero | GDPS/HyperSwap + active-active | $$$$ |
| Tier 1 | < 15 min | Zero | GDPS/Metro Mirror + automated failover | $$$ |
| Tier 2 | < 4 hours | < 15 min | GDPS/XRC or journal recovery | $$ |
| Tier 3 | < 24 hours | < 4 hours | Tape backup + manual recovery | $ |
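The tier-selection step can be sketched as a lookup that walks from the cheapest tier upward. The thresholds come from the tier table; the function name and minute-based units are illustrative assumptions, not part of any GDPS tooling.

```python
# Sketch: pick the cheapest DR tier whose delivered RTO/RPO meet the
# business requirement. Thresholds mirror the tier table; names and
# units are hypothetical.

def select_tier(required_rto_min: float, required_rpo_min: float) -> str:
    # (tier name, delivered RTO, delivered RPO), all in minutes,
    # ordered cheapest first.
    tiers = [
        ("Tier 3", 24 * 60, 4 * 60),  # tape backup + manual recovery
        ("Tier 2", 4 * 60, 15),       # GDPS/XRC or journal recovery
        ("Tier 1", 15, 0),            # GDPS/Metro Mirror + automated failover
        ("Tier 0", 0, 0),             # GDPS/HyperSwap + active-active
    ]
    # Take the first (cheapest) tier that satisfies both objectives.
    for name, delivered_rto, delivered_rpo in tiers:
        if delivered_rto <= required_rto_min and delivered_rpo <= required_rpo_min:
            return name
    return "Tier 0"  # Tier 0 delivers zero/zero, so the loop always matches
```

A requirement of RTO 4 hours / RPO 15 minutes lands on Tier 2; anything stricter than a 15-minute RTO forces Tier 1 or Tier 0.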

GDPS Technology Comparison

| Technology | Replication | RPO | RTO | Distance | Write Impact |
| --- | --- | --- | --- | --- | --- |
| HyperSwap | Synchronous (Metro Mirror) | Zero | Near-zero (< 2 sec) | < 100 km | 0.3-1 ms/write |
| Metro Mirror Failover | Synchronous (PPRC) | Zero | 10-30 min (IPL required) | < 100 km | 0.3-1 ms/write |
| XRC | Asynchronous | 2-30 seconds | 30-60 min | Unlimited | None |
| Global Mirror | Asynchronous (disk-based) | Minutes | 60+ min | Unlimited | None |

Stacking strategy: Use HyperSwap for storage failure + Metro Mirror for site failure + XRC for regional disaster. Each addresses a different failure scenario at a different distance.
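The distance limit on the synchronous options comes from physics: each mirrored write must round-trip to the secondary before it completes. A back-of-envelope sketch, assuming roughly 5 µs per km one way in fiber and ignoring protocol and switching overhead:

```python
# Rough estimate of the write penalty synchronous mirroring adds.
# 5 us/km one-way fiber latency is an assumed rule of thumb.

def sync_write_penalty_ms(distance_km: float, us_per_km: float = 5.0) -> float:
    """Minimum added latency per write: one round trip to the secondary."""
    return 2 * distance_km * us_per_km / 1000.0
```

At the 100 km limit this works out to about 1 ms per write, consistent with the 0.3-1 ms/write figure in the comparison table; asynchronous XRC and Global Mirror avoid the penalty by acknowledging writes locally, which is why their distance is unlimited.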


Failure Domain Hierarchy

Level 7: Data Corruption      ← MOST DANGEROUS (replication propagates it)
Level 6: Site Failure         ← GDPS site failover
Level 5: Storage Subsystem    ← GDPS/HyperSwap
Level 4: Coupling Facility    ← Duplexed structures, alternate CF
Level 3: LPAR Failure         ← Sysplex workload redistribution
Level 2: Subsystem Failure    ← DB2 peer recovery, CICS region failover
Level 1: Application Failure  ← Transaction rollback, checkpoint/restart

The N-1 Rule: Every layer of the architecture must survive the loss of one component at that layer. Four LPARs → any three carry full load. Two CFs → critical structures duplexed. Two storage frames → all volumes mirrored.
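The N-1 rule can be expressed as an automated check, assuming per-component capacity figures are available. The numbers below are invented for illustration:

```python
# N-1 check: after losing the single largest component at a layer,
# the survivors must still carry peak load. Capacities are illustrative.

def survives_n_minus_1(capacities: list[float], peak_load: float) -> bool:
    """True if every single-component loss leaves enough capacity."""
    # Worst single loss is the largest component.
    return sum(capacities) - max(capacities) >= peak_load

# Four equal LPARs sized so any three carry the full load:
lpars = [400.0, 400.0, 400.0, 400.0]  # hypothetical capacity per LPAR
assert survives_n_minus_1(lpars, peak_load=1200.0)
# Sized with no headroom, the same layer fails the check:
assert not survives_n_minus_1(lpars, peak_load=1300.0)
```

The same check applies at every layer of the hierarchy: CFs, storage frames, network links.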

Data corruption is the special case: Replication makes it worse, not better. Defense requires FlashCopy snapshots, immutable backups, air-gapped copies, and detection mechanisms.


Runbook Essentials

A good runbook has five characteristics:

  1. Written for 3 AM execution — clear, unambiguous, numbered steps
  2. Verification after every major action — "Start DB2. Verify: -DIS DDF shows STATUS=STARTD"
  3. Decision points — "If X succeeds → step 7. If X fails with code Y → Appendix C"
  4. Estimated times — "Step 5: IPL LPAR. Expected: 8-12 minutes"
  5. Escalation paths — names, phone numbers, roles — not "contact the DBA"
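One way to make those five characteristics enforceable is to treat runbook steps as structured data rather than prose. The fields below are a hypothetical sketch; the example step paraphrases the list above, and the escalation contact is invented:

```python
from dataclasses import dataclass

@dataclass
class RunbookStep:
    number: int
    action: str             # imperative, unambiguous instruction
    verify: str             # command or observation proving success
    expected_minutes: tuple[int, int]  # (low, high) time estimate
    on_success: int         # next step number
    on_failure: str         # explicit branch, e.g. an appendix reference
    escalation: str         # named contact, never just a role

step = RunbookStep(
    number=5,
    action="IPL the recovery LPAR",
    verify="SYSLOG shows IPL complete",
    expected_minutes=(8, 12),
    on_success=6,
    on_failure="Appendix C",
    escalation="J. Smith, sysprog on-call, +1-555-0100",  # hypothetical
)
```

A structure like this lets a review script flag any step with an empty verify field or a role-only escalation before the runbook is ever needed at 3 AM.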

DR Testing Levels

| Level | Type | Frequency | Value |
| --- | --- | --- | --- |
| 1 | Tabletop exercise | Quarterly | Exposes gaps without risk |
| 2 | Component test | Monthly | Validates individual recovery mechanisms |
| 3 | Planned site failover | Semi-annually | Validates end-to-end failover mechanics |
| 4 | Unannounced failover | Annually | Validates real-world readiness |

Key insight: Most DR tests are theater. A planned test at 8 PM on Saturday with the A-team assembled proves the mechanics work — it doesn't prove you can recover from a disaster at 2 PM on a business day with whoever happens to be on-call. Level 4 tests close this gap.

Post-test review rules: No blame. Every deviation documented. Action items have owners and deadlines.


DR Governance Components

  1. Business Impact Analysis — reviewed annually
  2. DR Plan Document — updated after every test and significant change
  3. DR Test Calendar — minimum semi-annual full test
  4. DR Readiness Dashboard — monitored continuously (replication status, backup currency, team readiness)
  5. DR Audit Trail — for regulators (revision history, test results, action items, training records)
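The readiness dashboard (item 4) reduces to a handful of comparisons against the tier's objectives. Field names and thresholds in this sketch are assumptions, not from any monitoring product:

```python
# Sketch: derive DR dashboard alerts from current status vs. objectives.
# All field names and thresholds are illustrative.

def readiness_alerts(status: dict, rpo_minutes: float,
                     max_backup_age_hours: float) -> list[str]:
    alerts = []
    if status["replication_lag_minutes"] > rpo_minutes:
        alerts.append("replication lag exceeds RPO")
    if status["last_backup_age_hours"] > max_backup_age_hours:
        alerts.append("backup currency violated")
    if not status["oncall_dr_trained"]:
        alerts.append("on-call staff lacks current DR training")
    return alerts
```

An empty list means green; any entry is a red dashboard item that needs an owner and a deadline, feeding the same post-review discipline as a test.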

Critical Warnings

🔴 Replication ≠ Backup. Synchronous replication protects against hardware failure. It propagates data corruption. You need both replication AND point-in-time backup capability.

🔴 DR plans decay. A DR plan that isn't tested, updated, and governed will become useless within 12-18 months as the production environment changes. The governance framework prevents decay.

🔴 People are failure domains too. If your DR recovery depends on one person's knowledge, your DR plan has a single point of failure. Cross-training and documentation are DR requirements, not HR nice-to-haves.

🔴 The declaration bottleneck. In most organizations, the decision to invoke DR requires executive authorization. If that authorization chain is slow, your actual RTO includes the time to wake up executives and get approval. Pre-authorize DR invocation for specific scenarios.


Connections to Other Chapters

| Chapter | Connection to DR |
| --- | --- |
| Ch 1 (Parallel Sysplex) | Sysplex architecture defines your within-site failure domains |
| Ch 5 (WLM) | WLM adjusts workload after LPAR failure — capacity headroom required |
| Ch 8 (DB2 Locking) | DB2 peer recovery in data sharing depends on lock structure in CF |
| Ch 9 (DB2 Utilities) | Image copy + log apply = your data corruption recovery mechanism |
| Ch 13 (CICS) | Region topology determines CICS survivability during partial failures |
| Ch 23 (Batch) | Batch dependency graph becomes the batch restart plan after failover |
| Ch 24 (Checkpoint/Restart) | Checkpoint data is what makes batch restart possible after DR |
| Ch 28 (Security) | Security controls must be maintained during and after failover |
| Ch 29 (Capacity) | DR site capacity determines RLO — how much you can do after failover |
| Ch 31 (Automation) | Automation executes the recovery procedures in your runbooks |
| Ch 37 (Hybrid Architecture) | Cloud components add new DR complexity — hybrid DR design |