Chapter 30 Quiz: Disaster Recovery and Business Continuity

Section 1: Multiple Choice

1. What is the correct order for designing a DR architecture?

a) Select GDPS technology → determine replication method → define RTO/RPO → map to business processes
b) Define business processes → assign RTO/RPO → select GDPS technology → implement replication
c) Implement GDPS → test failover → adjust RTO/RPO based on results → communicate to business
d) Survey available technology → match technology to budget → define RTO/RPO to match technology capability

Answer: b) Define business processes → assign RTO/RPO → select GDPS technology → implement replication

Explanation: This is the chapter's threshold concept. DR design starts with business requirements and works backward to technology. You identify critical business processes, classify them by tier, assign RTO/RPO/RLO based on business impact analysis, and then select the technology that satisfies those objectives. Starting with technology (options a, c, d) leads to either over-investment (buying technology you don't need) or under-protection (choosing technology that doesn't meet business requirements because you never defined them).


2. GDPS/HyperSwap provides near-zero RTO for storage failures because:

a) It uses asynchronous replication, so the secondary is always slightly behind and can be promoted instantly
b) It redirects I/O from primary to secondary volumes transparently — no IPL, no application restart
c) It maintains a hot standby LPAR at the DR site that is always synchronized with production
d) It uses FlashCopy to create instant snapshots that can be mounted within seconds

Answer: b) It redirects I/O from primary to secondary volumes transparently — no IPL, no application restart

Explanation: HyperSwap's key differentiator is transparency. When primary storage fails, GDPS/HyperSwap freezes Metro Mirror pairs and redirects all I/O to the secondary (mirrored) volumes. The z/OS LPARs, DB2, CICS, and all applications continue running without interruption — they don't even know the swap happened. This is fundamentally different from a site failover (which requires IPL) or a backup restore (which requires time). HyperSwap works because Metro Mirror maintains a synchronous, byte-for-byte copy of every volume, and the I/O subsystem can redirect to the copy without application-level intervention.


3. What is the primary distance limitation of GDPS/Metro Mirror (synchronous replication)?

a) Metro Mirror requires dedicated fiber that is only available within a single metropolitan area
b) The speed of light creates replication latency that adds directly to every write I/O response time — exceeding ~100 km makes the latency penalty unacceptable for most workloads
c) IBM only supports Metro Mirror for distances up to 50 km
d) Synchronous replication requires the primary and secondary to be on the same power grid

Answer: b) The speed of light creates replication latency that adds directly to every write I/O response time — exceeding ~100 km makes the latency penalty unacceptable for most workloads

Explanation: Synchronous replication means every write must be acknowledged by both the primary and secondary storage before the application can proceed. At the speed of light in fiber (~200,000 km/s), a 100 km distance creates approximately 1 ms round-trip latency. This 1 ms is added to every write I/O. At 300 km, the penalty is approximately 3 ms, which is significant for high-volume transactional workloads. IBM supports Metro Mirror at various distances, but the physics of light speed is the practical constraint. This is why long-distance DR uses asynchronous replication (XRC or Global Mirror), which eliminates the write penalty at the cost of RPO > 0.
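The latency arithmetic in the explanation can be checked with a short sketch (the fiber speed and distances are the approximate figures quoted above):

```python
# Approximate speed of light in optical fiber (refractive index ~1.5).
FIBER_SPEED_KM_PER_S = 200_000

def sync_write_penalty_ms(distance_km: float) -> float:
    """Round-trip replication latency added to every synchronous write."""
    round_trip_km = 2 * distance_km
    return round_trip_km / FIBER_SPEED_KM_PER_S * 1000  # seconds -> ms

for km in (10, 100, 300):
    print(f"{km:4d} km -> +{sync_write_penalty_ms(km):.1f} ms per write")
```

At 100 km the penalty is ~1 ms and at 300 km ~3 ms, matching the figures above; this is pure physics, before any protocol or controller overhead is added.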


4. In a Parallel Sysplex with four LPARs and DB2 data sharing, what happens when one LPAR fails?

a) All DB2 processing stops until the failed LPAR is recovered
b) The coupling facility detects the failure and shuts down the remaining LPARs to protect data integrity
c) Surviving DB2 members perform peer recovery — they acquire the failed member's locks, back out in-flight work, and continue processing
d) DB2 switches to single-member mode on the surviving LPARs, losing data sharing capability

Answer: c) Surviving DB2 members perform peer recovery — they acquire the failed member's locks, back out in-flight work, and continue processing

Explanation: This is the core value proposition of DB2 data sharing in a Parallel Sysplex. When a DB2 member fails (because its LPAR fails), the surviving members detect the failure through XCF signaling. They perform peer recovery: reading the failed member's log from shared storage, acquiring its locks from the coupling facility lock structure, and backing out any in-flight units of work. Once peer recovery completes (typically seconds to minutes depending on the amount of in-flight work), the data is fully available through the surviving members. Applications reconnect to surviving members — often automatically through workload balancing.


5. Which failure mode is UNIQUELY dangerous because synchronous replication propagates it to the DR site?

a) LPAR failure
b) Coupling facility failure
c) Storage subsystem failure
d) Data corruption

Answer: d) Data corruption

Explanation: Data corruption (application bugs, human error, ransomware) is the one failure mode that replication makes worse. If an application writes corrupt data to the primary site, synchronous replication faithfully copies that corruption to the DR site. Failing over to the DR site gives you a second copy of the same corrupt data — not a clean copy. All other failure modes (LPAR, CF, storage) destroy the ability to process data but don't corrupt the data itself. This is why defense against data corruption requires different mechanisms: FlashCopy snapshots at known-good points, immutable backups, point-in-time recovery from DB2 image copies and logs, and application-level data integrity validation.


6. The "N-1 rule" for Sysplex design means:

a) Always maintain N-1 backup copies of all data, where N is the number of production copies
b) Every layer of the architecture should survive the loss of one component at that layer
c) Keep at least one LPAR in standby mode (N production, 1 standby)
d) The DR site should have N-1 LPARs compared to the production site

Answer: b) Every layer of the architecture should survive the loss of one component at that layer

Explanation: The N-1 rule means that if you have N LPARs, any N-1 should carry the full workload. If you have N coupling facilities, any N-1 should host all required structures. If you have N storage subsystems, any N-1 should serve all required data (through mirroring). This rule has direct capacity implications: each LPAR normally runs at (N-1)/N utilization to leave headroom for absorbing a failed LPAR's workload. In CNB's four-LPAR Sysplex, each LPAR runs at about 75% capacity so that three LPARs can handle 100% of the workload.
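The capacity implication reduces to simple arithmetic; a minimal sketch (the four-LPAR, 75% figure is the one from the explanation above):

```python
def max_normal_utilization(n_lpars: int) -> float:
    """N-1 rule: any N-1 LPARs must absorb the full workload,
    so each LPAR normally runs at no more than (N-1)/N capacity."""
    if n_lpars < 2:
        raise ValueError("N-1 design needs at least two LPARs")
    return (n_lpars - 1) / n_lpars

# CNB's four-LPAR Sysplex: each LPAR capped at 75%
# so any three LPARs can carry 100% of the workload.
print(f"{max_normal_utilization(4):.0%}")
```

The same formula shows why small configurations are expensive to protect: a two-LPAR Sysplex can only run each LPAR at 50%, leaving half the capacity idle in normal operation.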


7. Why does Kwame Mensah consider a severed fiber (that cuts GDPS replication) to be a Sev-1 event even though production continues normally?

a) Because the fiber carries both replication and production network traffic
b) Because Metro Mirror cannot be resynchronized after a fiber cut — the DR volumes must be rebuilt from scratch
c) Because without active replication, the organization has lost its disaster recovery capability — a second failure during the repair window would be unrecoverable
d) Because regulatory compliance requires continuous replication — any interruption must be reported to regulators within 24 hours

Answer: c) Because without active replication, the organization has lost its disaster recovery capability — a second failure during the repair window would be unrecoverable

Explanation: This is the opening scenario of the chapter. When the fiber was cut, CNB's production Sysplex continued operating normally on primary storage — no transactions were lost, no users were affected. But the Metro Mirror replication to the DR site was interrupted. Until the fiber was repaired and all 47 TB of DASD were resynchronized (72 hours), CNB had no disaster recovery capability. If the primary data center had suffered a catastrophic failure during that window, there would have been no way to recover. The fact that "nothing appeared broken" masked the reality that CNB's insurance policy was temporarily void.


8. What is the primary purpose of a Business Impact Analysis (BIA) in DR planning?

a) To inventory all IT systems and their technical dependencies
b) To estimate the cost of DR technology and create a budget proposal
c) To identify critical business processes and quantify the impact of their unavailability, providing the input for RTO/RPO determination
d) To satisfy regulatory audit requirements by documenting that a DR plan exists

Answer: c) To identify critical business processes and quantify the impact of their unavailability, providing the input for RTO/RPO determination

Explanation: The BIA is the business-side foundation of DR planning. It identifies which business processes are critical (and which aren't), quantifies the financial, operational, and regulatory impact of their unavailability over time, and provides the data needed to make rational RTO/RPO decisions. Without a BIA, RTO/RPO assignments are arbitrary — you're guessing. With a BIA, they're risk-based decisions backed by business impact data. While a BIA does involve IT system inventory (option a), that's a secondary output, not the primary purpose. And while regulators do require a BIA (option d), satisfying auditors is a side effect, not the goal.


9. During a DR test at CNB, the actual RTO was 14 minutes but the target was 15 minutes. Which of the following statements best characterizes this result?

a) The test was successful — RTO was met
b) The test was successful, but the 1-minute margin is dangerously thin — investigate what could extend it
c) The test failed because a 1-minute margin doesn't account for real-world variability
d) The test is inconclusive — a single data point doesn't validate the RTO target

Answer: b) The test was successful, but the 1-minute margin is dangerously thin — investigate what could extend it

Explanation: The 14-minute result technically meets the 15-minute target, so the test passes its success criterion. However, a 1-minute margin (7% headroom) provides almost no buffer for real-world variability. In an actual disaster — as opposed to a planned test — conditions will be worse: the on-call person may be slower to respond, network conditions may be degraded, and the emotional stress of a real disaster slows decision-making. A mature DR program aims for test results well below the target (e.g., 10-11 minutes against a 15-minute target) to provide margin for real-world degradation. The 14-minute result should trigger investigation: which steps took longer than expected? Where can time be reduced?
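The margin arithmetic behind "dangerously thin" can be sketched directly (the 15/14 figures are from the scenario; the 10.5-minute comparison value is illustrative):

```python
def rto_headroom(target_min: float, actual_min: float) -> float:
    """Fraction of the RTO target left as buffer for real-world degradation."""
    return (target_min - actual_min) / target_min

print(f"{rto_headroom(15, 14):.0%} headroom")    # thin: 1 minute of buffer
print(f"{rto_headroom(15, 10.5):.0%} headroom")  # comfortable buffer
```

A 14-minute result against a 15-minute target leaves roughly 7% headroom; a 10-11 minute result leaves around 30%, which is the kind of margin a mature program aims for.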


10. GDPS/XRC differs from GDPS/Metro Mirror primarily in that:

a) XRC uses disk-based replication while Metro Mirror uses tape-based replication
b) XRC is synchronous and Metro Mirror is asynchronous
c) XRC is asynchronous (RPO > 0, no write penalty, unlimited distance) while Metro Mirror is synchronous (RPO = 0, write latency penalty, distance-limited)
d) XRC is for DB2 data only while Metro Mirror handles all DASD volumes

Answer: c) XRC is asynchronous (RPO > 0, no write penalty, unlimited distance) while Metro Mirror is synchronous (RPO = 0, write latency penalty, distance-limited)

Explanation: Metro Mirror (based on PPRC) uses synchronous replication: every write must complete on both primary and secondary before the application proceeds. This guarantees RPO = 0 but adds write latency and is constrained by the speed of light to metro distances (~100 km practical limit). XRC uses asynchronous replication: the application writes to local DASD at full speed, and z/OS System Data Mover transmits the changes to the remote site in the background. This means RPO > 0 (typically 2-30 seconds of data lag) but imposes no write penalty on production and works at any distance. Most large enterprises use both: Metro Mirror to a nearby DR site for zero-RPO protection, and XRC to a remote site for geographic diversity.


Section 2: Scenario-Based Questions

11. A coupling facility (CF01) in a two-CF Sysplex fails. The DB2 lock structure was duplexed across CF01 and CF02. What is the expected impact on DB2 data sharing?

a) DB2 data sharing stops — the lock structure is lost
b) DB2 continues using the duplexed copy on CF02 — the failover is transparent to applications
c) DB2 must rebuild the lock structure on CF02 from scratch, causing a 5-10 minute outage
d) DB2 switches to non-data-sharing mode until CF01 is restored

Answer: b) DB2 continues using the duplexed copy on CF02 — the failover is transparent to applications

Explanation: Structure duplexing is precisely for this scenario. When the DB2 lock structure is duplexed across two CFs, both copies are maintained synchronously. When CF01 fails, DB2 seamlessly switches to the surviving copy on CF02. Applications experience no interruption and no delay. This is why CNB duplexes all critical structures (lock, SCA, GBP) across two CFs on separate physical frames. If the lock structure had NOT been duplexed, DB2 would need to rebuild it — which is option c and causes a noticeable interruption.


12. After a GDPS/Metro Mirror site failover, the DR site LPARs are IPLed and DB2 is started. You need to restart the nightly batch cycle, which was 40% complete when the primary site failed. Which statement is correct?

a) All 100% of batch jobs must be rerun from the beginning because the DR site has no batch checkpoint data
b) The first 40% of jobs don't need rerun because Metro Mirror replicated their committed work — you restart from the point of interruption using checkpoint data
c) The batch cycle cannot be restarted at the DR site — batch must wait until failback to the primary
d) Only the currently-running jobs at the time of failure need rerun; all others (completed and not-yet-started) are handled automatically

Answer: b) The first 40% of jobs don't need rerun because Metro Mirror replicated their committed work — you restart from the point of interruption using checkpoint data

Explanation: Metro Mirror provides synchronous, zero-RPO replication. Every DASD write that completed at the primary site was also written to the DR site. This includes DB2 data, DB2 logs, VSAM datasets, checkpoint datasets, and batch output. Completed batch jobs don't need rerun — their work is on the DR site's volumes. Jobs that were in-flight at the time of failure need restart from their last checkpoint (per Chapter 24's checkpoint/restart procedures). Jobs that hadn't started yet are simply submitted at the DR site. The batch scheduler needs to know which jobs completed, which need restart, and which haven't started — this is why Rob Calloway's job scheduling dependency graph (Chapter 23) is part of the DR plan.
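The three-way classification the explanation describes can be sketched as follows; the job states and the action strings are illustrative, not part of any real scheduler's API:

```python
from enum import Enum, auto

class JobState(Enum):
    COMPLETED = auto()    # finished before the failure; work is on DR volumes
    IN_FLIGHT = auto()    # running at the moment of failure
    NOT_STARTED = auto()  # scheduled but never dispatched

def dr_action(state: JobState) -> str:
    """What the batch scheduler does with each job after a zero-RPO failover."""
    if state is JobState.COMPLETED:
        return "skip - committed work was replicated by Metro Mirror"
    if state is JobState.IN_FLIGHT:
        return "restart from last checkpoint"
    return "submit normally at the DR site"

for state in JobState:
    print(f"{state.name:12s} -> {dr_action(state)}")
```

The hard part in practice is not the decision logic but knowing each job's state reliably at failover time, which is why the scheduling dependency graph belongs in the DR plan.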


13. You're designing a DR test for a healthcare organization. Ahmad Rashidi insists that the test must include verification that HIPAA-protected data (PHI) is accessible at the DR site but only to authorized users. Why is this a valid DR test requirement?

a) It's not — DR tests should focus on availability, not security
b) Because HIPAA compliance is a business requirement that must be maintained during and after DR failover — if PHI is accessible to unauthorized users at the DR site, the failover created a HIPAA violation
c) Because the DR site uses different RACF databases and security policies need manual verification
d) Because encrypted data at the primary site may not be readable at the DR site if encryption keys aren't replicated

Answer: b) Because HIPAA compliance is a business requirement that must be maintained during and after DR failover — if PHI is accessible to unauthorized users at the DR site, the failover created a HIPAA violation

Explanation: DR recovery doesn't suspend regulatory requirements. If the DR site's RACF configuration, dataset permissions, or network access controls differ from production, a failover could expose PHI to unauthorized access — creating a HIPAA breach during an already-stressful situation. DR tests must validate not just that systems are available but that they're available with the correct security posture. Option d is also a legitimate concern (encryption key management in DR) but isn't the primary reason — it's a technical detail under the broader requirement that compliance is maintained during DR.


14. CNB's GDPS/XRC replication to Dallas has a typical lag of 5-10 seconds. During the peak of the nightly batch window, the lag increases to 30 seconds. If a regional disaster destroys both Charlotte and Raleigh and CNB fails over to Dallas, what is the RPO?

a) Zero — XRC provides zero data loss
b) 5-10 seconds — the typical lag
c) Up to 30 seconds — the worst-case lag at the time of failure
d) It depends on when the disaster occurs — during normal operations, 5-10 seconds; during batch peak, up to 30 seconds

Answer: d) It depends on when the disaster occurs — during normal operations, 5-10 seconds; during batch peak, up to 30 seconds

Explanation: XRC is asynchronous — the DR site is always slightly behind the primary. The amount of data loss (RPO) equals the replication lag at the moment of failure. If the disaster occurs during normal daytime operations, the lag is 5-10 seconds. If it occurs during the batch window peak, the lag could be 30 seconds — meaning up to 30 seconds of committed transactions at the primary site are not yet on the Dallas volumes. This is why RPO is expressed as a maximum (worst case) for capacity planning and risk assessment, and why CNB monitors XRC lag continuously. If the lag exceeds the acceptable RPO threshold (defined in the DR plan), it triggers an alert for investigation.


15. Which of the following is the strongest argument for conducting unannounced (Level 4) DR tests?

a) They cost less than planned tests because no preparation time is needed
b) They test the real-world response capability — including alert effectiveness, on-call readiness, and the ability to execute without rehearsal
c) They satisfy regulatory requirements that planned tests don't
d) They're more technically rigorous because the systems are under full production load

Answer: b) They test the real-world response capability — including alert effectiveness, on-call readiness, and the ability to execute without rehearsal

Explanation: The core value of unannounced tests is that they simulate real-world conditions. In a planned test, everyone knows it's coming, the A-team is assembled, runbooks are reviewed in advance, and the test happens during a low-volume window. In a real disaster, none of these conditions hold. Unannounced tests reveal how long it actually takes to respond to alerts, whether the on-call person can execute the runbook without advance preparation, and whether the organization can recover without its best engineers (who might be unavailable during a real event). CNB's data shows this clearly: their Level 4 test had an RTO of 31 minutes vs. 14 minutes for the preceding planned test — revealing that real-world conditions add significant time.


Section 3: True/False with Justification

16. True or False: A DR plan that satisfies regulatory requirements is, by definition, adequate to protect the business.

Answer: False

Explanation: Regulatory requirements set a minimum floor, not an optimal target. HIPAA, for example, requires a "disaster recovery plan" but doesn't specify RTO or RPO. An organization could have a regulatory-compliant DR plan with a 72-hour RTO that would be catastrophic for a hospital that needs real-time eligibility verification. Similarly, the FFIEC handbook requires DR testing but doesn't specify that tests must be unannounced. A DR plan can check every regulatory box and still fail to protect the business from a realistic disaster scenario. The business impact analysis — not the regulatory checklist — should drive DR design.


17. True or False: If your production site and DR site are on the same power grid, you have a single point of failure at the site level even though they are geographically separated.

Answer: True

Explanation: Geographic separation protects against localized events (fire, flood, building-specific failures) but not against events that affect the shared infrastructure connecting the sites. A common power grid, common network provider, or common water supply (for cooling) creates a failure domain that spans both sites. The 2003 Northeast Blackout affected 55 million people across eight US states and Ontario — any DR plan relying on two sites within that power grid would have failed. True site independence requires separate power grids (or generator capacity for extended outages), separate network paths from different providers, and separate physical infrastructure.


18. True or False: After a GDPS/Metro Mirror failover, the DR site's data is guaranteed to be identical to the primary site's data at the moment of failure.

Answer: True (with an important caveat)

Explanation: Metro Mirror is synchronous — every write that completed at the primary was also completed at the secondary. Therefore, at the moment replication is frozen during failover, the secondary volumes are byte-for-byte identical to the primary volumes as of the last completed write. The caveat: writes that were in-flight (submitted but not yet acknowledged) at the moment of failure may or may not be on the secondary. However, from the application's perspective, in-flight writes were never confirmed, so they're not considered committed data. DB2, CICS, and MQ recovery processes will handle any in-flight work through their normal transaction recovery (backout) procedures.


19. True or False: The best time to conduct a planned DR test is during a maintenance window when transaction volume is at its lowest.

Answer: False (with an important qualification)

Explanation: Low-volume windows minimize business risk during the test — which is appropriate for initial tests or when validating new DR infrastructure. But if you only test during low-volume windows, you've never validated that your DR site can handle production-level load. A mature DR testing program includes tests at varying load levels: low-volume for initial validation and procedure testing, medium-volume for capacity validation, and (ideally) at least one test during a representative workload to validate that RTO targets can be met under realistic conditions. CNB's unannounced tests are deliberately not scheduled for low-volume periods.


20. True or False: If the GDPS controlling system fails, automated failover capability is lost but manual failover is still possible.

Answer: True

Explanation: GDPS automates the failover process — monitoring storage health, executing HyperSwap, managing Metro Mirror freeze/failover, and coordinating subsystem startup. If the GDPS controlling system fails, this automation is unavailable. However, the underlying storage replication (Metro Mirror, XRC) continues operating independently of GDPS. An experienced storage administrator can manually suspend replication, make secondary volumes accessible, and initiate system recovery — it just takes longer and requires more expertise. This is why GDPS controlling systems should be redundant (dual LPARs, one at each site), and why the DR runbook should include a "manual failover without GDPS" appendix for the worst-case scenario.