Case Study 2: Federal Benefits Administration's DR Compliance Requirements
Background
Federal Benefits Administration processes monthly benefit payments for 22 million Americans. The payments — disability, retirement, survivor benefits — are often the primary or sole income for recipients. A missed payment isn't an inconvenience. For many recipients, it means they can't pay rent, buy medication, or put food on the table.
Sandra Chen has been the modernization lead at FBA for four years. When she arrived, the DR plan was a 200-page document last updated in 2016 that described a recovery process for a system architecture that no longer existed. The document referenced IMS databases that had been migrated to DB2, CICS regions that had been consolidated, and a tape backup process that had been replaced by virtual tape two years earlier.
"The DR plan passed every audit because auditors checked that we had a plan, not that the plan would work," Sandra told me. "It was a compliance artifact, not an operational document. I knew it was useless the first time I read it. But proving it to management required a test — and nobody wanted to test."
Marcus Whitfield — the legacy SME retiring in two years — had participated in FBA's last actual DR test in 2014. His memory of that test was vivid: "It took us 38 hours to recover. Thirty-eight hours. We lost two days of transactions. The FISMA auditors wrote us up, and we spent the next year on a corrective action plan. After that, nobody wanted to do another test. They were afraid of what they'd find."
The Regulatory Landscape
FBA operates under a web of overlapping federal compliance requirements:
FISMA (Federal Information Security Modernization Act)
FISMA requires federal agencies to develop, document, and implement information security programs that include contingency planning. The specific requirements flow through NIST Special Publications:
NIST SP 800-34 Rev. 1 (Contingency Planning Guide for Federal Information Systems):
- Business Impact Analysis (BIA) — identify critical systems and maximum tolerable downtime
- Contingency plan — documented procedures for response, recovery, and restoration
- Plan testing — test the plan at least annually
- Plan maintenance — update after any significant change
- Personnel — identified, trained, and available
NIST SP 800-53 Rev. 5 (Security and Privacy Controls): The CP (Contingency Planning) control family includes:
- CP-1: Policy and procedures
- CP-2: Contingency plan
- CP-3: Contingency training
- CP-4: Contingency plan testing (moderate baseline: annual; high baseline: annual + exercises)
- CP-6: Alternate storage site
- CP-7: Alternate processing site
- CP-8: Telecommunications services
- CP-9: System backup
- CP-10: System recovery and reconstitution
FBA's systems are classified as FISMA High — the most stringent baseline, because a compromise or outage could result in "severe or catastrophic adverse effect on organizational operations, organizational assets, or individuals."
OMB Circular A-130
Requires agencies to ensure continuity of operations for mission-essential functions. FBA's benefits payment processing is classified as a National Essential Function (NEF) — meaning the federal government has determined that this function must continue even during a national-level catastrophe.
Agency-Specific Requirements
FBA's parent department has additional requirements:
- DR site must be operational within 12 hours (department-level RTO mandate)
- Backup site must be at least 100 miles from primary (department COOP directive)
- Annual DR test with documented results (department policy)
- Quarterly tabletop exercises (added after the 2016 audit findings)
The Inspector General
FBA's Inspector General (IG) conducts annual audits of IT security controls, including contingency planning. The IG's findings are published publicly and reported to Congress. A poor DR audit finding becomes a line item in the IG's Semi-Annual Report to Congress — visible to congressional oversight committees, media, and the public.
"The IG finding is the thing that terrifies leadership," Sandra explains. "It's not the risk of a disaster — it's the risk of being publicly reported as unprepared for one. Which is a perverse incentive, because it motivates compliance theater over actual capability. I've spent four years trying to shift the conversation from 'pass the audit' to 'survive a disaster.'"
The Current Architecture
FBA's production environment spans three generations of technology:
Layer 1: IMS Legacy (circa 1983)
The benefits calculation engine runs on IMS DB/DC. This is the 40-year-old codebase with 15 million lines of COBOL that Marcus Whitfield maintains. It contains the definitive business rules for calculating benefit amounts — rules that have been modified by 147 legislative changes over four decades. Many of these rules are embedded in COBOL program logic, not in a rules engine or configuration.
DR characteristics:
- IMS databases reside on VSAM datasets mirrored via GDPS/Metro Mirror
- IMS log datasets are mirrored synchronously
- IMS recovery requires: (a) restart the IMS control region, (b) emergency restart from the log, (c) database recovery if needed
- Key risk: Only Marcus and two contractors understand the IMS recovery procedures. The runbook exists but hasn't been executed by anyone else.
Layer 2: CICS/DB2 Modern Core (circa 2015)
The eligibility verification system, provider interface, and online inquiry functions run on CICS TS 5.6 with DB2 12. This layer was built during a modernization wave that moved selected functions from IMS to CICS/DB2.
DR characteristics:
- DB2 data and logs mirrored via GDPS/Metro Mirror
- CICS system datasets mirrored
- Standard GDPS failover procedures (similar to CNB's approach)
- Key risk: The CICS/DB2 layer depends on IMS data for benefit amount calculations. If IMS doesn't recover, CICS/DB2 can process inquiries but cannot make eligibility determinations that require benefit calculations.
Layer 3: z/OS Connect API Layer (circa 2021)
The web portal and mobile interface for beneficiaries use z/OS Connect to expose CICS transactions as REST APIs. An API gateway (running on distributed Linux systems) fronts the z/OS Connect endpoints.
DR characteristics:
- z/OS Connect is stateless — restart is fast (< 1 minute)
- The API gateway has its own DR configuration (distributed systems DR, outside z/OS)
- Key risk: The API gateway's DR configuration was set up by a contractor who left the project. Documentation is incomplete.
The Dependency Chain
┌────────────────────────────────────────────────────────────┐
│ FBA Recovery Dependency Chain │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ IMS │────►│ CICS/DB2 │────►│ z/OS │────► Users │
│ │ (Layer 1)│ │ (Layer 2)│ │ Connect │ │
│ │ │ │ │ │ (Layer 3)│ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ Recovery: 4 hrs Recovery: 30 min Recovery: 5 min │
│ Risk: HIGH Risk: MEDIUM Risk: LOW │
│ │
│ CRITICAL PATH: IMS must recover first. │
│ Total RTO = IMS recovery + CICS/DB2 + z/OS Connect │
│ = 4 hours + 30 min + 5 min ≈ 4.5 hours │
│ │
│ Department mandate: 12 hours │
│ Actual capability (estimated): 4.5 hrs (optimistic) │
│ 12-18 hrs (pessimistic) │
│ │
│ The pessimistic estimate assumes complications during │
│ IMS recovery — which is likely because IMS recovery is │
│ rarely practiced and depends on Marcus Whitfield. │
└────────────────────────────────────────────────────────────┘
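The diagram's critical-path arithmetic can be restated as a quick sanity check. This sketch simply sums the document's stage estimates and compares the total against the department's 12-hour mandate:

```python
# Critical-path RTO: the three layers recover serially, so total RTO is the
# sum of the stage times. All values are the document's estimates, in minutes.
STAGES_MIN = {"IMS": 4 * 60, "CICS/DB2": 30, "z/OS Connect": 5}
MANDATE_MIN = 12 * 60  # department-level RTO mandate

total = sum(STAGES_MIN.values())      # 275 minutes
print(round(total / 60, 1))           # → 4.6 (the "optimistic" ~4.5-hour estimate)
print(total <= MANDATE_MIN)           # → True, but only if nothing goes wrong
```

The margin looks comfortable only because the optimistic estimate assumes a clean IMS recovery; the pessimistic 12-18 hour range erases it.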
Sandra's DR Modernization Initiative
When Sandra presented the "actual capability" assessment to FBA leadership in 2022, the gap between the department's 12-hour RTO mandate and the pessimistic 18-hour estimate triggered an immediate remediation project.
Step 1: Updated Business Impact Analysis (2022-Q3)
Sandra's team conducted FBA's first comprehensive BIA since 2016. Key findings:
| Business Process | Criticality | RTO (business need) | RPO | Regulatory Driver |
|---|---|---|---|---|
| Monthly benefit payment processing | Tier 0 | < 4 hours | Zero | OMB A-130 NEF; millions depend on payments |
| Eligibility verification | Tier 1 | < 1 hour | Zero | Legislative mandate — must respond within 24 hrs |
| New claim processing | Tier 1 | < 8 hours | < 15 min | Statutory processing deadlines |
| Appeals processing | Tier 2 | < 24 hours | < 1 hour | Statutory but with longer deadlines |
| Overpayment recovery | Tier 3 | < 72 hours | < 24 hours | No immediate impact |
| Management reporting | Tier 3 | < 1 week | < 72 hours | Monthly/quarterly deadlines |
Critical finding: The monthly benefit payment run is a batch process that executes over a 3-day window (the 28th, 29th, and 30th of each month). If a disaster strikes during this window, the recovery must not only restore the system but must also ensure that the in-progress payment run can be completed before the payment date. For 22 million beneficiaries, many of whom have no financial cushion, a late payment is not merely an inconvenience — it's a crisis.
Step 2: IMS Recovery Modernization (2023)
The IMS layer was the critical path bottleneck. Sandra's approach:
a) IMS recovery documentation and cross-training. Marcus Whitfield spent three months documenting every IMS recovery procedure in exhaustive detail. Sandra paired him with two younger systems programmers who had never touched IMS. Together, they rewrote the IMS recovery runbook with the "3 AM test": could someone who has never performed this procedure execute it successfully at 3 AM using only this document?
"Marcus was resistant at first," Sandra recalls. "He'd spent thirty years being the guy everyone called. Being the indispensable expert was part of his identity. I told him: 'Marcus, the best thing you can do for this agency and the people it serves is to make yourself unnecessary. That's not diminishing your value — it's the highest expression of it.' He got it. The runbook he wrote is the best technical document this agency has ever produced."
b) IMS recovery practice. Sandra instituted monthly IMS recovery drills on a test system. Not the production DR site — a separate test environment with a copy of the IMS databases. Each month, a different systems programmer performs the recovery using only the runbook. Marcus observes but does not intervene unless asked.
Results after 12 months of practice:
- Month 1: Recovery took 6 hours. Three calls to Marcus for clarification.
- Month 3: Recovery took 4 hours. One call to Marcus.
- Month 6: Recovery took 2.5 hours. No calls to Marcus.
- Month 12: Recovery took 1.5 hours. No calls to Marcus. The systems programmer who performed the recovery had only been at FBA for 8 months.
c) IMS-to-DB2 migration for recovery speed. Long-term, Sandra's modernization plan includes migrating the benefits calculation rules from IMS to DB2 stored procedures — not for functionality reasons but for DR reasons. DB2 recovery is faster, better-understood, and doesn't depend on scarce IMS expertise. This migration is planned for 2025-2026 and is tracked in FBA's modernization roadmap.
Step 3: GDPS Enhancement (2023-Q4)
FBA upgraded from basic Metro Mirror to GDPS/Metro Mirror with automation:
- Automated failover procedures for all three layers (IMS, CICS/DB2, z/OS Connect)
- Automated network routing changes
- Automated notification cascade (email, SMS, phone tree)
DR site upgrade:
- DR site capacity increased from 50% to 75% of production
- Dedicated IMS recovery LPAR at the DR site (always available for IMS practice drills)
Step 4: Testing Program (2024-Present)
Sandra established a testing calendar that satisfies all regulatory requirements:
| Test Type | Frequency | Scope | Reporting |
|---|---|---|---|
| Tabletop exercise | Quarterly | Full scenario walk-through with leadership | Internal memo + IG file |
| Component test (IMS recovery) | Monthly | IMS recovery drill on test system | Training record |
| Component test (DB2/CICS) | Monthly | DB2 tablespace recovery, CICS restart | Training record |
| Level 3 planned failover | Semi-annually | Full GDPS site failover | Formal test report + IG file + FISMA POA&M |
| Level 4 unannounced test | Annually | Surprise failover (scope varies) | Formal test report + IG file |
The 2024 IG Audit
The Inspector General's 2024 IT security audit specifically examined FBA's contingency planning controls. Key findings:
Satisfactory findings:
- BIA is current (updated 2022, reviewed 2023)
- Contingency plan is current (updated 2023, reviewed quarterly)
- Testing program meets NIST SP 800-34 requirements
- Cross-training program has eliminated single-person dependencies for CICS/DB2 recovery
- GDPS automation reduces reliance on manual procedures
Findings requiring attention:
- IMS recovery still depends on a small team (3 people). While this is an improvement over the previous single-person dependency (Marcus Whitfield), the IG recommended expanding the trained recovery team to at least 5 people.
- API gateway DR documentation is incomplete (the contractor issue)
- The annual unannounced test has not yet been conducted for 2024
Finding closed from prior audit:
- The 2016 finding about the outdated DR plan was formally closed. The IG acknowledged "significant improvement in contingency planning maturity."
Sandra's reaction: "Two findings and a closure. Two years ago we had eleven findings and a qualified opinion. We're not done, but we're moving in the right direction."
The Marcus Whitfield Succession Challenge
The most sensitive aspect of FBA's DR story is Marcus Whitfield's retirement.
Marcus is 63. He plans to retire in June 2027. He has worked at FBA since 1991 — 36 years by his retirement date. He has been the primary (and for most of that time, the only) person who understands the IMS benefits calculation system at the code level.
The benefits calculation COBOL programs contain business rules that reflect 147 legislative changes over 40 years. Some of these rules are documented in program comments. Some are documented in paper files in Marcus's office. Some are documented nowhere — they exist only in Marcus's memory.
"I've been writing these programs since the first Bush administration," Marcus says. "I know why every IF statement is there. I know which rules were changed in 1996 for welfare reform and which were changed in 2010 for the ACA. I know which calculations the GAO questioned in 2003 and how we fixed them. But I never wrote most of that down because there was never time. There's always another legislative change, another production issue, another audit."
DR implications of Marcus's retirement:
- IMS recovery procedures. The cross-training program has mitigated this — three other people can now perform IMS recovery using the runbook Marcus documented. This was the highest-priority risk and has been substantially addressed.
- Business rule knowledge. If data corruption requires not just technical recovery but validation that the recovered data produces correct benefit calculations, someone needs to know what "correct" means. Today, that person is Marcus. After his retirement, that knowledge must be encoded in documentation, in automated test suites, or in the modernized DB2 implementation.
- Undocumented edge cases. During a 2023 DR drill, the recovery team encountered an IMS database that required a specific utility parameter to recover correctly. The parameter wasn't in the runbook. Marcus knew it from memory: "That database was reorganized in 2007 with a non-standard HALDB partition scheme because we ran out of CI/CA splits. The recovery utility needs the PARTITION parameter or it rebuilds the index incorrectly." This is exactly the kind of knowledge that can't be captured in a runbook, because you don't know it needs to be captured until you encounter it.
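One partial mitigation for edge cases like the HALDB anecdote: the moment an exception surfaces in a drill, record it in a machine-readable registry that recovery tooling consults, so the knowledge outlives any one person. A minimal Python sketch; the database name, default parameters, and utility name are all illustrative, not FBA's actual values:

```python
# Registry of per-database recovery overrides, populated as drills surface
# exceptions. Everything below is invented for illustration.
DEFAULT_RECOVERY_PARAMS = {"utility": "DFSURDB0", "rebuild_indexes": True}

DB_OVERRIDES = {
    "BENECALC.HALDB01": {
        "extra_params": ["PARTITION"],  # non-standard HALDB partition scheme (2007 reorg)
        "note": "Reorganized 2007; utility needs PARTITION or the index rebuild is wrong.",
    },
}

def recovery_params(db_name: str) -> dict:
    """Merge the default recovery parameters with any documented override."""
    params = dict(DEFAULT_RECOVERY_PARAMS)
    override = DB_OVERRIDES.get(db_name, {})
    params.update({k: v for k, v in override.items() if k != "note"})
    return params

print(recovery_params("BENECALC.HALDB01"))
```

The point is not the code but the practice: every "Marcus knew it from memory" moment becomes a one-line entry that the next recovery reads automatically.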
Sandra's approach to the succession risk:
Short-term (2024-2025): Intensive documentation sprint. Marcus spends 20% of his time documenting undocumented business rules, utility parameters, and operational procedures. Sandra pairs him with a technical writer to capture not just procedures but rationale — the "why" behind every decision.
Medium-term (2025-2026): Automated validation suite. The modernization team builds a comprehensive test suite that computes benefit amounts for 10,000 representative cases and compares results against known-correct outputs. If the system produces the same results after recovery, the data is valid — regardless of whether anyone understands why the calculations are what they are.
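The validation-suite idea can be sketched simply: capture known-correct ("golden") outputs from the healthy system, recompute them after recovery, and flag any divergence. In this sketch, `calculate_benefit` is a hypothetical stand-in for a call into the real calculation engine, and the cases and amounts are invented:

```python
# Golden-case regression check: if recomputed amounts match the captured
# outputs, the recovered data is valid — whether or not anyone can explain
# why each calculation is what it is.
from decimal import Decimal

GOLDEN_CASES = {
    # case_id: (inputs, expected_amount) — captured from the healthy system
    "C001": ({"base": Decimal("1200.00"), "cola_pct": Decimal("0.032")}, Decimal("1238.40")),
    "C002": ({"base": Decimal("950.00"),  "cola_pct": Decimal("0.032")}, Decimal("980.40")),
}

def calculate_benefit(inputs):
    # Hypothetical placeholder for the real engine (IMS today, DB2 later).
    return (inputs["base"] * (1 + inputs["cola_pct"])).quantize(Decimal("0.01"))

def validate_recovery():
    """Return the case IDs whose recomputed amount disagrees with the golden output."""
    return [cid for cid, (inputs, expected) in GOLDEN_CASES.items()
            if calculate_benefit(inputs) != expected]

assert validate_recovery() == []  # empty list → recovered data computes correctly
```

Scaled to the 10,000 representative cases the plan calls for, the same comparison doubles as a migration safety net: the DB2 implementation must reproduce the IMS outputs case for case.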
Long-term (2026-2027): Migration of benefits calculation from IMS to DB2 stored procedures. The new implementation will be documented, testable, and maintainable by the existing DB2 team. Marcus's knowledge is encoded in the new implementation and its test suite rather than residing in one person's memory.
The Payment Window DR Scenario
The scenario that keeps Sandra awake at night: a disaster during the monthly payment processing window.
The scenario: It's the 29th of the month, 2:00 PM. The 3-day payment run is 60% complete — 13.2 million payments have been calculated and staged for ACH transmission. 8.8 million payments are still being calculated. The primary data center loses power. Backup generators fail.
The stakes: If the remaining 8.8 million payments aren't calculated and transmitted by 11:59 PM on the 30th, 8.8 million people don't get paid on time. For many of them, this means bounced rent checks, missed medication, or worse.
The recovery plan:
- T+0 to T+15 min: GDPS failover to DR site. Metro Mirror ensures zero data loss — all 13.2 million completed payments are on the DR site.
- T+15 min to T+2 hrs: IMS recovery. Start the benefits calculation engine. Verify the payment run checkpoint — determine exactly which payments have been calculated and which haven't.
- T+2 hrs to T+10 hrs: Resume payment calculation for the remaining 8.8 million cases. DR site capacity (75%) means this will take approximately 8 hours instead of the normal 6 hours.
- T+10 hrs: All 22 million payments calculated. Stage for ACH transmission.
- T+10 hrs to T+12 hrs: ACH file generation and transmission to Federal Reserve.
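The checkpoint verification in the second step is the crux of the plan: the resumed run must neither double-pay nor skip anyone. A minimal sketch of the resume-point logic, using an illustrative in-memory status table (the real state lives in mirrored IMS/DB2 datasets):

```python
# After failover, split the staged payments into completed vs. remaining
# based on their checkpoint status. Anything not fully CALCULATED — including
# work that was in progress at the moment of the outage — is redone.
def resume_point(staged: dict) -> tuple:
    """Return (done, remaining) beneficiary IDs from the checkpoint table."""
    done      = sorted(b for b, status in staged.items() if status == "CALCULATED")
    remaining = sorted(b for b, status in staged.items() if status != "CALCULATED")
    return done, remaining

staged = {"B1": "CALCULATED", "B2": "CALCULATED", "B3": "IN_PROGRESS", "B4": "PENDING"}
done, remaining = resume_point(staged)
print(remaining)  # → ['B3', 'B4']
```

Treating in-progress work as "not done" is the conservative choice: recalculating a payment is cheap, while transmitting a duplicate or missing one is not.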
Total time: Approximately 12 hours. If the disaster occurs at 2 PM, payments are transmitted by 2 AM on the 30th — well within the deadline.
But: This assumes everything goes perfectly. If the IMS recovery encounters the kind of undocumented utility parameter that Marcus caught in the 2023 drill, recovery time could extend by hours. If the DR site's reduced capacity isn't sufficient for the payment volume (which has never been tested at full production volume), the calculation phase takes longer. If the ACH interface at the DR site has configuration issues (which was one of Sandra's concerns from the 2024 IG audit findings about incomplete API documentation), transmission is delayed.
Sandra's margin analysis: "On paper, we have a 12-hour plan against a deadline that gives us 34 hours (from 2 PM on the 29th to 11:59 PM on the 30th). That looks comfortable. But if I add realistic contingencies — IMS recovery complications, reduced capacity, interface issues — the estimate grows to 18-24 hours. We still make it, probably. But 'probably' isn't a word I want to use when 8.8 million people are waiting for their checks."
Mitigation: Sandra has proposed a pre-positioned payment strategy: during the payment window, completed payment files are transmitted to the Federal Reserve in incremental batches (every 4 hours) rather than in a single batch at the end. This way, if a disaster interrupts the process, most payments have already been transmitted. Only the most recently calculated batch would need re-processing after recovery. FBA's treasury office is evaluating this proposal.
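The pre-positioned strategy reduces to a "send only what's new" loop at each 4-hour interval. A sketch under the obvious assumption that payment IDs are stable across the run; `plan_batches` and the IDs are illustrative:

```python
# Incremental ACH transmission: every interval, send only payments completed
# since the last transmission. A disaster then strands at most one interval's
# worth of work instead of the entire 3-day run.
def plan_batches(completed_payment_ids, already_sent):
    """Return the payments completed since the last transmission."""
    return [p for p in completed_payment_ids if p not in already_sent]

sent = set()
batch1 = plan_batches(["P1", "P2", "P3"], sent)            # hour 4: first batch
sent.update(batch1)
batch2 = plan_batches(["P1", "P2", "P3", "P4", "P5"], sent)  # hour 8: only new work
print(batch2)  # → ['P4', 'P5']
```

The trade-off is operational: incremental transmission means the Federal Reserve interface must accept partial files, which is precisely the kind of change FBA's treasury office has to sign off on.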
Discussion Questions
- Sandra says the DR plan was "a compliance artifact, not an operational document." How do you distinguish between a DR plan that exists to pass audits and one that exists to survive disasters? What organizational dynamics create compliance-focused rather than capability-focused DR programs?
- Marcus Whitfield's retirement represents a knowledge loss that technology alone cannot solve. The chapter describes three mitigation strategies (documentation, automated testing, migration). Evaluate each strategy's effectiveness. Which is most important? Which is most realistic?
- The monthly payment window creates a time-sensitive DR scenario where delayed recovery causes direct harm to millions of people. How should this scenario influence FBA's DR architecture decisions? Should the architecture be designed for the worst case (disaster during the payment window) or the average case?
- Sandra's pre-positioned payment strategy (incremental ACH transmission) is an application-level DR mitigation, not an infrastructure-level one. Identify three other application-level design changes that would improve FBA's DR resilience without changing the GDPS/infrastructure architecture.
- The IG audit process creates both positive incentives (accountability) and perverse incentives (compliance theater). How should federal agencies balance the need for auditable compliance with the need for genuine disaster preparedness? Is there a way to audit DR capability rather than DR documentation?
- FBA's system spans three technology generations (IMS from 1983, CICS/DB2 from 2015, z/OS Connect from 2021). Each generation has different DR characteristics. Compare this multi-generation DR challenge with CNB's more homogeneous architecture (Case Study 1). Which is harder to protect? Why?
- Sandra estimates that migrating benefits calculations from IMS to DB2 will take 18 months. During that migration, the system will be running on both IMS and DB2 simultaneously (a hybrid state). How does this hybrid state affect DR planning? Is the DR plan during migration simpler or more complex than the current state?