Chapter 30 Exercises: Disaster Recovery and Business Continuity

Part A: Conceptual Questions

A1. Define RTO, RPO, and RLO. Explain why all three are necessary to fully specify a disaster recovery requirement. Give an example where two organizations have the same RTO but different RLOs, and explain how this difference affects their DR architecture.

A2. Explain the threshold concept of this chapter: "DR is about recovery time objectives, not technology." Why is it a mistake to start DR planning by selecting a technology (e.g., "we need GDPS/Metro Mirror") rather than starting with business requirements? Describe a scenario where this mistake leads to either over-investment or under-protection.

A3. Compare synchronous replication (Metro Mirror/PPRC) with asynchronous replication (XRC). For each, explain: (a) how it works mechanically, (b) its impact on production write performance, (c) its RPO guarantee, and (d) its distance constraints. Under what business conditions would you choose one over the other?

A4. Explain the difference between Disaster Recovery (DR), Business Continuity (BC), High Availability (HA), and Continuous Availability (CA). Diagram their relationship as a set containment hierarchy. For each one, give an example that belongs to it but not to the next level up.

A5. Describe why data corruption is fundamentally different from all other failure modes in the failure spectrum (Section 30.1). Why does synchronous replication make data corruption worse rather than better? What defense mechanisms exist against data corruption that don't apply to hardware failures?

A6. Explain the "N-1 rule" described by Kwame Mensah. How does this rule apply at the LPAR level, the coupling facility level, and the storage subsystem level? What are the capacity implications of designing for N-1 at every layer?

A7. Why is the GDPS controlling system itself a single point of failure risk? How should it be deployed to avoid this risk? What happens if the GDPS controlling system fails during a disaster at the primary site?

A8. Explain why "declare disaster" is a business decision, not a technical decision. Who should have the authority to declare a disaster? What are the risks of requiring too many approvals before failover can begin? What standing authorization mechanism does CNB use to address this?


Part B: Applied Analysis

B1. RTO/RPO Analysis for a Retail Bank

SecureFirst Retail Bank processes 20 million transactions per day across the following business processes:

| Process | Daily Volume | Revenue Impact (per hour downtime) | Regulatory Implication |
|---|---|---|---|
| Mobile banking login/balance | 5M transactions | $45,000 lost fee revenue | Reg E: consumer access |
| Card authorization (POS/ATM) | 12M transactions | $180,000 lost interchange | Network SLA: 30 min max |
| Wire transfers | 50,000 transactions | $25,000 per hour | Fedwire: operating hours mandatory |
| Batch settlement | N/A (nightly) | $0 immediate; $500K if delayed > 24h | ACH rules: next-day settlement |
| Regulatory reporting | N/A (monthly) | $0 immediate | OCC: monthly deadline |

a) Classify each process into a tier (Tier 0 through Tier 3) based on the information provided. Justify each classification.

b) Assign RTO, RPO, and RLO to each tier. Explain your rationale for each.

c) Based on your RTO/RPO assignments, recommend a GDPS technology for each tier. Identify any tiers where tape backup alone might be acceptable.

d) Yuki Nakamura argues that mobile banking should be Tier 0 because "that's where all our customers are." Carlos Vega argues it should be Tier 2 because "the mobile app can display cached data and show a 'temporarily unavailable' message for transactions." Who is right, and why? Is there a compromise position?
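
One way to sanity-check the tier assignments in B1 is to price each candidate RTO against the hourly impact figures in the table. The sketch below is a minimal illustration, assuming revenue loss accrues linearly over the outage. The function name and the candidate RTO values are hypothetical and do not come from the chapter.

```python
# Hourly revenue impact per process, taken from the B1 table (USD per hour).
HOURLY_IMPACT = {
    "Mobile banking login/balance": 45_000,
    "Card authorization (POS/ATM)": 180_000,
    "Wire transfers": 25_000,
}


def downtime_cost(hourly_impact: float, rto_minutes: float) -> float:
    """Revenue lost if recovery takes exactly rto_minutes.

    Assumes loss accrues linearly with outage duration (a simplification).
    """
    return hourly_impact * rto_minutes / 60


# Hypothetical candidate RTOs to compare, in minutes (illustrative only).
CANDIDATE_RTOS = [5, 30, 240]

for process, impact in HOURLY_IMPACT.items():
    costs = {rto: downtime_cost(impact, rto) for rto in CANDIDATE_RTOS}
    print(process, costs)
```

Batch settlement and regulatory reporting are left out because the table lists no immediate hourly loss for them. Their tiering depends on the deadline-driven penalties and regulatory implications instead.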

B2. Failure Domain Analysis

You manage a Parallel Sysplex with two LPARs (PROD1, PROD2), one coupling facility (CF01), and two storage subsystems (DS01, DS02). DB2 data sharing runs across both LPARs. CICS TOR runs on PROD1, two AORs run on PROD1, one AOR runs on PROD2.

a) Identify every failure domain in this configuration, from Level 1 through Level 5.

b) For each failure domain, describe the impact on the system and the recovery mechanism.

c) Identify the single points of failure in this configuration. (Hint: there are at least three.)

d) Propose a redesigned architecture that eliminates all single points of failure you identified. What additional hardware and software would you need?

e) Estimate the cost of each proposed change qualitatively ("high," "medium," or "low") and rank the changes by risk reduction per unit cost. If you could only make two changes, which two would you make first?

B3. Runbook Critique

The following is an excerpt from a DR runbook at a fictional organization. Identify at least eight problems with it:

DISASTER RECOVERY PROCEDURE
Last Updated: 2021-06-15

1. If the primary site is down, contact the manager.
2. Start the DR servers.
3. Restore the database from backup.
4. Start CICS.
5. Verify the system is working.
6. Notify users that the system is available.

For each problem you identify, explain why it's a problem and provide the corrected version of that step.

B4. DR Test Metrics Analysis

CNB ran four DR tests over two years. Here are the key metrics:

| Metric | Test 1 (2023-Q1) | Test 2 (2023-Q3) | Test 3 (2024-Q1) | Test 4 (2024-Q3) |
|---|---|---|---|---|
| Test type | Level 3 | Level 3 | Level 4 (unannounced) | Level 3 |
| RTO (actual) | 22 min | 14 min | 31 min | 11 min |
| RPO (actual) | 0 | 0 | 0 | 0 |
| Runbook deviations | 7 | 3 | 11 | 2 |
| Alert-to-first-action | 3 min | 2 min | 14 min | 2 min |
| Post-test action items | 12 | 6 | 15 | 3 |
| Escalation failures | 2 | 1 | 4 | 0 |
| DB2 restart time | 8 min | 4 min | 5 min | 3 min |
| CICS warmstart time | 6 min | 5 min | 5 min | 4 min |
| MQ channel recovery | 11 min | 4 min | 6 min | 2 min |

a) What trends do you observe across the four tests? Are they improving?

b) Why was Test 3 (Level 4, unannounced) significantly worse than the preceding and following Level 3 tests? What does this tell you about the difference between planned and unannounced tests?

c) The DB2 restart time improved dramatically between Test 1 (8 min) and Test 2 (4 min). What infrastructure change described in Section 30.6 explains this improvement?

d) MQ channel recovery went from 11 minutes in Test 1 to 2 minutes in Test 4. What change likely caused this improvement?

e) If you were Kwame, what would be your top priority action item after Test 4?
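
When looking for trends in B4, it may help to separate the planned Level 3 tests from the unannounced Level 4 test before comparing numbers. The following is a minimal sketch of that comparison using a subset of the table values. The helper names are hypothetical, and percentage change between the first and last Level 3 test is only one of several reasonable trend measures.

```python
# Metrics copied from the B4 table; each list is ordered Test 1 through Test 4.
TEST_TYPES = ["Level 3", "Level 3", "Level 4", "Level 3"]
METRICS = {
    "RTO (min)": [22, 14, 31, 11],
    "Runbook deviations": [7, 3, 11, 2],
    "Alert-to-first-action (min)": [3, 2, 14, 2],
    "DB2 restart (min)": [8, 4, 5, 3],
    "MQ channel recovery (min)": [11, 4, 6, 2],
}


def pct_change(first: float, last: float) -> float:
    """Percentage change from first to last (negative means a reduction)."""
    return (last - first) / first * 100


def level3_only(values: list) -> list:
    """Keep only the values recorded during Level 3 (planned) tests."""
    return [v for v, kind in zip(values, TEST_TYPES) if kind == "Level 3"]


for name, values in METRICS.items():
    planned = level3_only(values)
    change = pct_change(planned[0], planned[-1])
    print(f"{name}: {change:.0f}% across Level 3 tests")
```

Comparing the Level 4 value for each metric against its Level 3 neighbors is a natural next step for part (b).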


Part C: Design Challenges

C1. Multi-Site GDPS Design

You are the DR architect for a mid-size insurance company with the following requirements:

- Claims adjudication: RTO < 30 min, RPO < 30 seconds
- Real-time eligibility: RTO < 5 min, RPO = 0
- Batch processing: RTO < 8 hours, RPO < 1 hour
- Primary site: Chicago
- Regulatory requirement: DR site must be at least 200 miles from primary
- Budget constraint: cannot afford three sites

Design a GDPS architecture that meets these requirements. Include:

a) DR site location and justification

b) GDPS technology selection for each workload tier

c) Storage replication topology diagram

d) Explanation of why you can't use Metro Mirror for the claims adjudication tier (hint: distance)

e) A creative alternative that achieves RPO < 30 seconds without synchronous replication at 200+ miles
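
The distance hint in part (d) can be made concrete with a rough propagation-delay estimate. The sketch below assumes light travels through fiber at about 200,000 km/s (roughly 5 microseconds per kilometer one way) and that each synchronous write waits for one round trip. Both figures are simplifying assumptions, and the function name is hypothetical. Real links add protocol, switching, and controller overhead on top of this floor.

```python
KM_PER_MILE = 1.609344

# Assumption: about 5 microseconds per kilometer one way in optical fiber.
# This is a physical lower bound, not a measured replication latency.
FIBER_US_PER_KM = 5.0


def min_sync_write_penalty_ms(distance_miles: float, round_trips: int = 1) -> float:
    """Lower bound on latency added to each synchronous write, in milliseconds.

    round_trips is an assumption; some replication protocols need more than
    one exchange per write.
    """
    distance_km = distance_miles * KM_PER_MILE
    round_trip_us = 2 * distance_km * FIBER_US_PER_KM
    return round_trips * round_trip_us / 1000


print(f"{min_sync_write_penalty_ms(200):.2f} ms added per write at 200 miles")
```

Under these assumptions the penalty at 200 miles is a little over 3 ms per write before any other overhead, which is the kind of figure to weigh against the write rates of a claims adjudication workload.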

C2. Data Corruption Recovery Design

At 2:15 PM on a Wednesday, a batch program at Pinnacle Health processes a file of provider rate updates. Due to a bug, the program applies a 10x multiplier to all rates instead of the correct 1.10x multiplier. The corrupt rates are written to the DB2 provider_rates table, which is synchronously replicated to the DR site via Metro Mirror. The error is discovered at 4:30 PM when a claims examiner notices that a routine office visit is being reimbursed at $1,500 instead of $165 (the $150 base rate plus the intended 10% increase).

a) By 4:30 PM, the corrupt data has been on both the primary and DR site for 2 hours and 15 minutes. Why can't you simply fail over to the DR site to recover?

b) Design a step-by-step recovery procedure for the provider_rates table. Assume DB2 image copies are taken daily at 6 AM and log datasets are available.

c) During the 2 hours and 15 minutes of corrupted rates, 12,000 claims were adjudicated using the incorrect rates. Design a procedure to identify and correct these claims.

d) What preventive measures would you recommend to prevent this class of error in the future? Address both application-level and infrastructure-level controls.

e) Ahmad Rashidi (compliance officer) asks: "Do we need to report this to regulators?" Under HIPAA, is a data corruption event that affects payment amounts (but not patient medical data) a reportable breach? Explain your reasoning.
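
For part (c), the first step is usually to isolate the claims adjudicated while the corrupt rates were live. The sketch below shows one way to express that filter. The `Claim` record layout, the sample data, and the calendar date are all hypothetical, since the exercise only says "a Wednesday". Note that the window should close when the rates were actually corrected, which may be later than the 4:30 PM discovery time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Claim:
    # Hypothetical record layout; real claim records will differ.
    claim_id: str
    adjudicated_at: datetime
    paid_amount: float


def claims_in_window(claims, start, end):
    """Return claims adjudicated at or after start and before end."""
    return [c for c in claims if start <= c.adjudicated_at < end]


# Hypothetical date; the exercise does not give one.
DAY = (2024, 5, 15)
corruption_start = datetime(*DAY, 14, 15)  # batch job applied the bad rates
corruption_end = datetime(*DAY, 16, 30)    # replace with the actual fix time

# Illustrative sample data only.
sample = [
    Claim("C-001", datetime(*DAY, 13, 59), 150.00),
    Claim("C-002", datetime(*DAY, 14, 20), 1500.00),
    Claim("C-003", datetime(*DAY, 16, 29), 1500.00),
    Claim("C-004", datetime(*DAY, 16, 45), 165.00),
]

affected = claims_in_window(sample, corruption_start, corruption_end)
print([c.claim_id for c in affected])
```

A time window alone may over-select, because not every claim in that period necessarily used a corrupted rate. A fuller procedure would also check which rate rows each claim referenced.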

C3. Unannounced DR Test Design

Design a Level 4 (unannounced) DR test for Federal Benefits Administration. Consider:

a) What failure scenario will you simulate? (Site failure? LPAR failure? Subsystem failure?)

b) What time of day will you execute it? Justify your choice.

c) Marcus Whitfield is retiring in 6 months. Should you deliberately schedule the test for a day when Marcus is not available? What are the arguments for and against?

d) Sandra Chen is concerned about testing during benefits payment week (first week of each month). How do you balance the need for realistic testing against the risk of disrupting payments to millions of beneficiaries?

e) Write the success criteria for your test: at least 8 specific, measurable criteria.

f) Write the abort criteria: under what conditions would you stop the test and restore normal operations?

C4. DR Budget Justification

You've been asked to present a DR investment proposal to the CFO of a mid-size bank. The current state: backups to tape, stored offsite, tested once per year. The proposed state: GDPS/Metro Mirror with automated failover, tested quarterly. The annual cost increase is $2.8 million.

a) Build a risk-based justification. Research (or estimate) the following:

- Average cost of a banking data center outage per hour (industry data)
- Probability of a site failure in any given year
- Expected annual loss (probability × impact × expected duration)
- Reduction in expected annual loss from the proposed DR architecture

b) Beyond the financial analysis, what qualitative arguments would you make to the CFO? Consider regulatory expectations, competitive positioning, and customer trust.

c) The CFO asks: "Can we get 80% of the benefit for 50% of the cost?" Propose a phased approach that provides meaningful DR improvement at a lower initial investment.
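
The expected-annual-loss formula from part (a) can be sketched in a few lines. Every input below except the $2.8 million cost increase is a hypothetical placeholder, not industry data. The point is only to show how the terms combine, and the recovery durations for each architecture are assumptions to be replaced with researched figures.

```python
def expected_annual_loss(annual_probability: float,
                         hourly_impact: float,
                         outage_hours: float) -> float:
    """Expected annual loss = probability x impact per hour x expected duration."""
    return annual_probability * hourly_impact * outage_hours


# All inputs below are hypothetical placeholders, not industry data.
P_SITE_FAILURE = 0.02       # chance of a site failure in a given year
HOURLY_IMPACT = 1_000_000   # USD lost per hour of outage
HOURS_TAPE_RESTORE = 72     # assumed outage with tape-based recovery
HOURS_GDPS = 1              # assumed outage with automated failover

ANNUAL_COST_INCREASE = 2_800_000  # from the exercise statement

current = expected_annual_loss(P_SITE_FAILURE, HOURLY_IMPACT, HOURS_TAPE_RESTORE)
proposed = expected_annual_loss(P_SITE_FAILURE, HOURLY_IMPACT, HOURS_GDPS)
reduction = current - proposed

print(f"Expected annual loss avoided: ${reduction:,.0f}")
print(f"Annual cost increase:         ${ANNUAL_COST_INCREASE:,.0f}")
```

With these placeholder inputs the avoided loss comes out below the cost increase, which is one reason part (b) asks for qualitative arguments as well. Substitute researched figures, and consider failure modes other than full site loss, before drawing any conclusion.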


Part D: Integration Exercises

D1. End-to-End DR Scenario

It's 11:15 PM on a Tuesday. CNB's nightly batch cycle started at 11:00 PM. Rob Calloway's batch scheduler has submitted the first 47 of 312 jobs. Jobs 1-23 have completed successfully. Jobs 24-47 are running. At 11:15 PM, the Charlotte data center experiences a complete power failure. Backup generators fail to start (fuel contamination, discovered only now).

Trace the complete recovery sequence:

a) GDPS detection and failover to Raleigh (Section 30.3)

b) Subsystem startup at the DR site (Section 30.5)

c) Batch recovery decision: which of the 312 jobs need rerun from checkpoint, which are already committed, and which haven't started yet? (Connect to Chapter 23 batch dependencies and Chapter 24 checkpoint/restart)

d) CICS online transaction recovery for the 2,000 users who were connected at the time of failure (Connect to Chapter 13 CICS region topology)

e) External interface recovery: ATM network, wire transfer network, online banking portal

f) Estimated total RTO for each tier of service

g) Communication: who gets notified, in what order, with what message?
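
For the batch recovery decision in part (c), a useful first pass is to bucket the 312 jobs by their state at the moment of failure. The sketch below does only that, assuming jobs are numbered in submission order. It is a hypothetical helper that ignores job dependencies and checkpoint positions, both of which the full answer needs to address.

```python
TOTAL_JOBS = 312
LAST_COMPLETED = 23   # jobs 1-23 completed successfully
LAST_SUBMITTED = 47   # jobs 24-47 were running at the time of failure


def job_state_at_failure(job_number: int) -> str:
    """Classify a job by its state when power was lost."""
    if not 1 <= job_number <= TOTAL_JOBS:
        raise ValueError(f"job number out of range: {job_number}")
    if job_number <= LAST_COMPLETED:
        return "completed"
    if job_number <= LAST_SUBMITTED:
        return "in-flight"
    return "not-started"


# Tally how many jobs fall into each state.
counts = {}
for job in range(1, TOTAL_JOBS + 1):
    state = job_state_at_failure(job)
    counts[state] = counts.get(state, 0) + 1

print(counts)
```

The in-flight bucket is where the checkpoint/restart analysis applies. The not-started jobs still depend on upstream jobs being recovered before they can be released.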

D2. DR and Security Intersection

Connect Chapter 28 (Mainframe Security) with this chapter. Describe how each of the following security controls interacts with DR:

a) RACF profiles: After failover to the DR site, are RACF profiles available? How are they replicated? What happens if the RACF database on the DR site is out of sync with the primary?

b) Data encryption (ICSF): Encrypted datasets at the primary site are replicated to the DR site. What key management considerations apply? What happens if the DR site doesn't have the correct ICSF keys?

c) Network security: The DR site has different IP addresses. How does this affect firewall rules, VPN tunnels, and external partner connectivity? How do you keep DR network security synchronized with production?

d) Audit trail continuity: During and after a DR failover, how do you maintain a complete audit trail? Are there gaps in SMF data during the failover window?

D3. Capacity Planning and DR Intersection

Connect Chapter 29 (Capacity Planning) with this chapter.

a) If your DR site has 60% of production capacity, what workload management decisions do you need to make after failover? Which workloads do you prioritize and which do you defer?

b) Your capacity plan shows that production will grow 15% per year. Your DR site was sized to handle 100% of today's production workload. In how many years will the DR site be unable to handle a full failover? What are your options?

c) During a DR test, you discover that the DR site's storage subsystem can handle the I/O rate for online transactions but not for online + batch simultaneously. What does this tell you about your DR batch strategy? What alternatives exist?
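
The growth question in part (b) is a compound-growth calculation. The sketch below counts whole years until demand first exceeds DR capacity, assuming demand compounds once per year at a constant rate. The function name and the 50% headroom case are hypothetical additions for comparison.

```python
def years_until_exceeded(capacity_ratio: float, annual_growth: float) -> int:
    """Smallest whole number of years after which demand exceeds DR capacity.

    capacity_ratio is DR capacity divided by today's production demand.
    Demand is assumed to compound at annual_growth per year.
    """
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive")
    demand = 1.0
    years = 0
    while demand <= capacity_ratio:
        demand *= 1 + annual_growth
        years += 1
    return years


print(years_until_exceeded(1.0, 0.15))  # DR sized at 100% of today's load
print(years_until_exceeded(1.5, 0.15))  # hypothetical DR site with 50% headroom
```

A result of 0 means the DR site is already too small for a full failover, which is the situation in part (a) with a capacity ratio of 0.6.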


Part E: Reflection Questions

E1. Kwame Mensah says: "A DR plan that's been tested once is a dangerous source of false confidence." Explain what he means. How many times should a DR plan be tested before you have reasonable confidence in it? Is there a point of diminishing returns?

E2. Sandra Chen says: "Our biggest DR risk isn't technology. It's people." Describe three specific people-related DR risks that technology cannot address. For each, propose a mitigation strategy.

E3. This chapter argues that data corruption is the most dangerous failure mode because it replicates through synchronous mirroring. Design a defense-in-depth strategy against data corruption that includes at least five layers of protection. For each layer, explain what class of corruption it defends against and what it doesn't cover.

E4. Consider the ethical dimension of DR planning. A hospital and a social media company both have a 4-hour RTO. Are these equivalent from a business continuity perspective? From a societal perspective? Should regulatory requirements for DR be proportional to the societal impact of the service being protected?

E5. You're the DR architect, and you've discovered that the organization's DR plan hasn't been tested in three years due to budget constraints and organizational inertia. The plan was valid when last tested, but the production environment has changed significantly. Management is resistant to funding a DR test because "it's disruptive and expensive." Write a memo (one page) to the CTO making the case for an immediate DR test. Include risk quantification, regulatory exposure, and a proposed test approach that minimizes disruption.