Case Study 2: The Federal Benefits Administration's Recovery from DB2 Corruption

Background

The Federal Benefits Administration processes disability, retirement, and survivor benefits for 68 million beneficiaries. Their DB2 environment hosts 14 databases, 82 tablespaces, and approximately 2.1 TB of active data. Sandra Chen is the lead application architect. Marcus Williams is the senior COBOL developer responsible for the nightly batch processing suite.

The system's DB2 objects have well-defined referential integrity relationships:

BENEFICIARY_MASTER (parent)
  ├── BENEFIT_ENROLLMENT (dependent, FK: BENE_ID)
  │     └── PAYMENT_HISTORY (dependent, FK: ENROLL_ID)
  ├── BENEFIT_ELIGIBILITY (dependent, FK: BENE_ID)
  └── CORRESPONDENCE (dependent, FK: BENE_ID)

BENEFIT_CODES (parent, reference table)
  └── BENEFIT_ENROLLMENT (dependent, FK: BENEFIT_CODE)

PAYMENT_SCHEDULE (parent)
  └── PAYMENT_HISTORY (dependent, FK: SCHED_ID)

Image copy schedule (pre-incident):

  • Full image copies: weekly Sunday, dual copies (DASD + tape)
  • Incremental copies: daily at 01:00, single copy (DASD)
  • Inline copies during REORG: biweekly Saturday

The Incident

Tuesday, 2026-03-10, 23:47. The nightly batch cycle launched on schedule. Job FEDBEN42 — the payment posting program — began processing 340,000 payment records against PAYMENT_HISTORY.

Wednesday, 2026-03-11, 00:23. Marcus received an alert: FEDBEN42 had abended with SQLCODE -904, reason code 00C90081, on tablespace DBFED03.TSPAYMT. The tablespace was in LPL (Logical Page List) status — DB2 had detected I/O errors on specific pages and placed them on the logical page list, making them inaccessible.

00:31. The on-call DBA confirmed: the storage subsystem reported permanent read errors on 47 pages of the TSPAYMT tablespace. A disk in the RAID array had failed, and the rebuild process had encountered a latent error on a second disk — a double-failure scenario that the RAID configuration couldn't handle.

00:38. Sandra joined the bridge call. Her first question: "What's the RI chain? If TSPAYMT is corrupted, what else is at risk?"

The answer: TSPAYMT hosts PAYMENT_HISTORY, which is a child of both BENEFIT_ENROLLMENT and PAYMENT_SCHEDULE. The parent tablespaces were on different volumes and were not affected. But PAYMENT_HISTORY's integrity was compromised — 47 pages meant potentially thousands of corrupted or inaccessible rows.

The Recovery Process

Phase 1: Assessment (00:38 - 01:15)

Sandra directed the team through a systematic assessment.

Step 1: Determine the scope of damage.

-DB2P DISPLAY DATABASE(DBFED03) SPACENAM(TSPAYMT) RESTRICT

Output confirmed:

  • TSPAYMT status: LPL (47 pages)
  • No other tablespaces in the database affected
  • 4 indexes on PAYMENT_HISTORY: status RW (read-write, still accessible)
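To see the specific pages on the logical page list, the team could issue the same DISPLAY command with the LPL keyword:

-DB2P DISPLAY DATABASE(DBFED03) SPACENAM(TSPAYMT) LPL

The output lists the LPL page ranges for each affected space, which helps correlate DB2's view of the damage with the storage subsystem's error reports.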

Step 2: Identify the most recent recoverable copies.

SELECT DSNAME, ICTYPE, SHRLEVEL, DSVOLSER,
       TIMESTAMP, START_RBA
FROM SYSIBM.SYSCOPY
WHERE DBNAME = 'DBFED03'
  AND TSNAME = 'TSPAYMT'
ORDER BY TIMESTAMP DESC
FETCH FIRST 10 ROWS ONLY;

Results:

Copy          Type  Timestamp          START_RBA
Incremental   I     2026-03-10 01:12   X'0000D2E3F40A0000'
Incremental   I     2026-03-09 01:08   X'0000D1A2B30C0000'
Full (DASD)   F     2026-03-08 02:15   X'0000CF81920E0000'
Full (Tape)   F     2026-03-08 02:15   X'0000CF81920E0000'

(The incremental scheduled for 2026-03-11 01:00 had not yet run when the failure occurred.)

The most recent full copy was Sunday, March 8. Two incremental copies were available (Monday and Tuesday). Tuesday night's incremental hadn't run yet; it was scheduled for 01:00, after the failure.
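An alternative to querying SYSCOPY directly is the REPORT RECOVERY utility, which lists the image copies, archive logs, and log ranges needed to recover an object. A minimal sketch, following the same DSNUPROC conventions as the other jobs in this case study (the UID value is assumed):

//REPORT   EXEC DSNUPROC,SYSTEM=DB2P,UID='RPTPAYMT'
//SYSIN    DD *
  REPORT RECOVERY TABLESPACE DBFED03.TSPAYMT
/*

In a runbook, this single job answers Step 2 and part of Step 3: it shows which copies exist and how much log must be applied.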

Step 3: Estimate recovery time.

Sandra calculated:

  • Restore full copy (Sunday): 8 minutes (85 GB tablespace, DASD-to-DASD)
  • Apply Monday incremental: 2 minutes
  • Apply Tuesday incremental: 2 minutes
  • Apply log records (Tuesday 01:12 to Tuesday 23:47): ~22 hours of log activity, approximately 4.5 GB of log data for this tablespace, or 9 minutes at a 500 MB/min apply rate
  • Total estimated: 21 minutes

Step 4: Assess the batch impact.

Marcus reported:

  • FEDBEN42 had processed 127,000 of 340,000 payment records before the abend
  • The program uses COMMIT every 2,000 rows with a restart table
  • 63 commit points completed; 126,000 rows committed to PAYMENT_HISTORY (the last ~1,000 rows were uncommitted and rolled back at the abend)
  • The committed rows are in the log and will be recovered by RECOVER to current
  • After recovery, FEDBEN42 can restart from commit point 64
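The restart-table pattern Marcus describes can be sketched in SQL. The table and column names below are hypothetical (the case study doesn't show the actual DDL); the essential property is that the checkpoint update commits in the same unit of work as the payment rows it describes:

CREATE TABLE FEDADM.BATCH_RESTART
  (JOB_NAME      CHAR(8)    NOT NULL,
   COMMIT_POINT  INTEGER    NOT NULL,
   LAST_ROW_KEY  BIGINT     NOT NULL,
   UPDATED_TS    TIMESTAMP  NOT NULL,
   PRIMARY KEY (JOB_NAME));

-- Issued by FEDBEN42 just before each COMMIT, in the same unit of
-- work as the 2,000 payment inserts it checkpoints:
UPDATE FEDADM.BATCH_RESTART
   SET COMMIT_POINT = COMMIT_POINT + 1,
       LAST_ROW_KEY = :LAST-KEY,
       UPDATED_TS   = CURRENT TIMESTAMP
 WHERE JOB_NAME = 'FEDBEN42';

On restart, the program reads COMMIT_POINT and LAST_ROW_KEY and repositions its input past the rows already committed, which is what makes restarting from commit point 64 safe after a recovery.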

Phase 2: Recovery Execution (01:15 - 01:48)

Step 1: Stop access to the tablespace.

-DB2P STOP DATABASE(DBFED03) SPACENAM(TSPAYMT)

This ensures no application attempts to access the tablespace during recovery. The pending batch jobs (FEDBEN43, FEDBEN44) were already held by the scheduler.

Step 2: Recover to current.

//RECOVER  EXEC DSNUPROC,SYSTEM=DB2P,UID='RCVPAYMT'
//SYSIN    DD *
  RECOVER TABLESPACE DBFED03.TSPAYMT
    PARALLEL(8)
/*

Sandra chose RECOVER to current (no TORBA, no TOCOPY) because the goal was to restore the tablespace to its state immediately before the I/O errors. The last committed data, including Marcus's 126,000 committed payment rows, should all be recoverable from the copies and logs.

Recovery progress (monitored via DISPLAY UTILITY):

01:17 - RESTORE phase: applying full copy from 2026-03-08
01:25 - RESTORE phase: applying incremental from 2026-03-09
01:27 - RESTORE phase: applying incremental from 2026-03-10
01:29 - LOGAPPLY phase: applying log records
01:38 - LOGAPPLY phase complete
01:38 - UTILITY COMPLETE, RC=0

Actual elapsed time: 21 minutes. Sandra's estimate was accurate.

Step 3: Verify index consistency.

//CHKIDX   EXEC DSNUPROC,SYSTEM=DB2P,UID='CIXPAYMT'
//SYSIN    DD *
  CHECK INDEX (ALL)
    TABLESPACE DBFED03.TSPAYMT
    SHRLEVEL REFERENCE
    SORTDEVT SYSDA
    SORTNUM 4
/*

CHECK INDEX completed in 7 minutes. All four indexes consistent. No REBUILD INDEX needed.

Step 4: Verify referential integrity.

//CHKDATA  EXEC DSNUPROC,SYSTEM=DB2P,UID='CDAPAYMT'
//SYSIN    DD *
  CHECK DATA TABLESPACE DBFED03.TSPAYMT
    FOR EXCEPTION IN FEDADM.PAYMENT_HISTORY
      USE FEDADM.PAYMENT_HISTORY_EXCEPT
    SHRLEVEL REFERENCE
    DELETE NO
/*

CHECK DATA completed in 3 minutes. Zero exceptions. Referential integrity intact.

Since this was a RECOVER to current (not point-in-time), the parent tablespaces didn't change, and the recovered child data matched the parents' current state. No cascade recovery was needed.

Step 5: Take a full image copy.

//COPYTS   EXEC DSNUPROC,SYSTEM=DB2P,UID='CPYPAYMT'
//SYSIN    DD *
  COPY TABLESPACE DBFED03.TSPAYMT
    FULL YES
    SHRLEVEL REFERENCE
    COPYDDN (SYSCOPY1, SYSCOPY2)
    PARALLEL(8)
/*
//SYSCOPY1 DD DSN=FED.DB2P.PAYMT.FCOPY1.D260311,
//            DISP=(NEW,CATLG),
//            UNIT=SYSDA,SPACE=(CYL,(2000,400),RLSE)
//SYSCOPY2 DD DSN=FED.DB2P.PAYMT.FCOPY2.D260311,
//            DISP=(NEW,CATLG),
//            UNIT=VTAPE,SPACE=(CYL,(2000,400),RLSE)

The full image copy immediately after recovery is treated as mandatory here: it establishes a fresh recovery base, so a later failure would not force the team to repeat the restore and the long log apply. Dual copies went to DASD and tape.

Step 6: Run RUNSTATS.

//RSTATS   EXEC DSNUPROC,SYSTEM=DB2P,UID='RSPAYMT'
//SYSIN    DD *
  RUNSTATS TABLESPACE DBFED03.TSPAYMT
    TABLE(ALL) USE PROFILE
    SHRLEVEL CHANGE
/*

Using the stored profile ensured comprehensive statistics collection without needing to code the full specification at 1:45 AM.
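For context, such a profile is registered ahead of time with the SET PROFILE option, which stores the specified RUNSTATS options in the catalog for later reuse. A sketch of how the profile for PAYMENT_HISTORY might have been created (the column-group choice is illustrative, not from the source):

RUNSTATS TABLESPACE DBFED03.TSPAYMT
  TABLE(FEDADM.PAYMENT_HISTORY)
    COLUMN(ALL)
    COLGROUP(ENROLL_ID, SCHED_ID)
  SET PROFILE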

Step 7: REBIND affected packages.

//REBIND   EXEC PGM=IKJEFT01
//SYSTSPRT DD SYSOUT=*
//SYSTSIN  DD *
  DSN SYSTEM(DB2P)
  REBIND PACKAGE (FEDCOLL.FEDBEN42.(*)) EXPLAIN(YES)
  REBIND PACKAGE (FEDCOLL.FEDBEN43.(*)) EXPLAIN(YES)
  REBIND PACKAGE (FEDCOLL.FEDBEN44.(*)) EXPLAIN(YES)
  REBIND PACKAGE (FEDCOLL.FEDINQ01.(*)) EXPLAIN(YES)
  REBIND PACKAGE (FEDCOLL.FEDRPT03.(*)) EXPLAIN(YES)
  END
/*

Step 8: Restart the tablespace and resume batch.

-DB2P START DATABASE(DBFED03) SPACENAM(TSPAYMT) ACCESS(RW)

Marcus released FEDBEN42 for restart at 01:52. The program read its restart table, found the last committed checkpoint (row 126,000, commit point 63), and resumed processing from commit point 64. By 02:41, all 340,000 payment records were posted. FEDBEN43 and FEDBEN44 launched on schedule.

Timeline Summary

Time   Event                                               Duration
23:47  Batch cycle launches; I/O errors abend FEDBEN42
00:23  Alert received by Marcus                            36 min (detection)
00:31  DBA confirms hardware failure                       8 min
00:38  Sandra joins, begins assessment                     7 min
01:15  Assessment complete, recovery begins                37 min (assessment)
01:17  RECOVER started
01:38  RECOVER complete                                    21 min
01:38  CHECK INDEX started
01:45  CHECK INDEX complete                                7 min
01:45  CHECK DATA started
01:48  CHECK DATA complete; COPY, RUNSTATS, REBIND begin   3 min
01:52  Tablespace restarted, batch resumes                 4 min
02:41  FEDBEN42 restart complete                           49 min

Total outage: 2 hours 5 minutes (from the onset of I/O errors at 23:47 to the batch restart at 01:52). Recovery execution: 37 minutes (from the start of recovery at 01:15 to the tablespace restart at 01:52).

The batch cycle completed only 54 minutes late — well within the 2-hour contingency buffer.

What Could Have Gone Worse

Sandra conducted a post-incident review and identified four scenarios where the recovery would have been significantly more difficult:

Scenario A: No Incremental Copies

If the daily incremental copies hadn't been running, recovery would have required restoring Sunday's full copy and applying nearly three days of log records instead of one. Estimated additional recovery time: roughly 15 minutes. Not catastrophic, but every minute matters in a batch window.

Scenario B: Corrupted Full Copy

If Sunday's DASD full copy had also been corrupted (e.g., same storage subsystem), they'd have needed the tape copy. Tape mount and restore would have added 15-20 minutes. This is why dual copies to different media exist.

Scenario C: Point-in-Time Recovery Needed

If the corruption had been caused by a software bug (bad data written by an application) rather than hardware failure, they'd need point-in-time recovery to roll back to before the bug was introduced. This would have required:

  1. Identifying the exact RBA of the first bad write
  2. Recovering TSPAYMT to that RBA
  3. Assessing whether BENEFIT_ENROLLMENT and PAYMENT_SCHEDULE needed recovery too (they would not, since they are the parents; PAYMENT_HISTORY is the child, so the exposure is child rows that no longer match the parents' current state)
  4. Running CHECK DATA to verify RI after the point-in-time recovery
  5. Potentially dealing with orphan rows in PAYMENT_HISTORY if any parent rows had been deleted after the recovery point

Estimated total time: 45-60 minutes for recovery execution, plus significant analysis time.
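Such a recovery might look like the following sketch. The RBA shown is the Tuesday incremental's START_RBA from the SYSCOPY listing above, used here purely as an illustrative recovery point; in a data sharing environment, TOLOGPOINT would replace TORBA:

//RECOVER  EXEC DSNUPROC,SYSTEM=DB2P,UID='RCVPIT'
//SYSIN    DD *
  RECOVER TABLESPACE DBFED03.TSPAYMT
    TORBA X'0000D2E3F40A0000'
/*

Unlike recover-to-current, a point-in-time recovery makes the CHECK DATA step essential rather than precautionary, because the recovered child data no longer necessarily matches the parents' current state.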

Scenario D: Loss of Active Logs

If the failure had also damaged the active log datasets, recovery could go no further than the end of the last archived log. In the worst case, everything since Tuesday's incremental copy (approximately 18 hours of batch and online activity) would be lost. This is the catastrophic scenario, and it is why active logs should always be on mirrored volumes separate from data volumes.

Changes Implemented After the Incident

Change 1: Increased Copy Frequency for Critical Tablespaces

Sandra upgraded the copy strategy for the six most critical tablespaces:

Before                            After
Full weekly, incremental daily    Full twice-weekly (Sun + Wed), incremental daily
Incremental to DASD only          Incremental to DASD + tape

The additional Wednesday full copy reduces the worst-case log apply time by half.

Change 2: Application Developer Participation in Recovery Procedures

Sandra formalized what the incident proved: application developers must be part of recovery planning. Each critical application now has a Recovery Dependency Document that specifies:

  • All tablespaces accessed by the application
  • RI relationships and recovery order
  • Restart/checkpoint logic and how it interacts with recovery
  • Estimated batch reprocessing time after recovery
  • Contact information for the responsible developer

Marcus created the template and wrote the document for the payment processing suite.

Change 3: Automated Recovery Runbooks

The assessment phase (37 minutes) was mostly human analysis — looking up RI relationships, calculating recovery times, deciding on recovery strategy. Sandra worked with the DBA team to create automated runbooks:

-- Recovery dependency query: given a tablespace, find the
-- transitively dependent (child) tablespaces and their recovery order
WITH RI_CHAIN (CREATOR, NAME, DBNAME, TSNAME, LEVEL) AS (
  -- Anchor: the tables in the damaged tablespace
  SELECT T.CREATOR, T.NAME, T.DBNAME, T.TSNAME, 1
  FROM SYSIBM.SYSTABLES T
  WHERE T.DBNAME = 'DBFED03'
    AND T.TSNAME = 'TSPAYMT'
    AND T.TYPE = 'T'
  UNION ALL
  -- Recursion: tables whose foreign keys reference a table in the chain
  SELECT C.CREATOR, C.NAME, C.DBNAME, C.TSNAME, RC.LEVEL + 1
  FROM RI_CHAIN RC
  JOIN SYSIBM.SYSRELS R
    ON R.REFTBCREATOR = RC.CREATOR
   AND R.REFTBNAME    = RC.NAME
  JOIN SYSIBM.SYSTABLES C
    ON C.CREATOR = R.CREATOR
   AND C.NAME    = R.TBNAME
  WHERE RC.LEVEL < 5               -- guard against deep or cyclic RI chains
)
SELECT DISTINCT DBNAME, TSNAME, LEVEL
FROM RI_CHAIN
ORDER BY LEVEL;

This query (and more sophisticated versions of it) now runs automatically when a recovery event is detected, producing the recovery plan in seconds instead of minutes.

Change 4: Recovery Drill Schedule

Sandra instituted quarterly recovery drills in the test environment. Each drill exercises a different scenario:

  • Q1: Single tablespace media failure (the scenario they actually experienced)
  • Q2: Point-in-time recovery with RI cascade
  • Q3: Multiple tablespace failure across databases
  • Q4: Full disaster recovery (DR site activation)

Marcus and the development team participate in every drill.

Change 5: Monitoring Enhancement

The team added proactive monitoring for storage health indicators:

  • DASD error rates (from HMC/SE data)
  • RAID rebuild events (any rebuild means reduced redundancy)
  • Image copy age (alert if any critical tablespace copy is older than 36 hours)
  • Active log utilization (alert at 70% capacity)
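The image copy age check, for example, can be driven directly from the catalog. A minimal sketch (the 36-hour threshold comes from the alert rule above; restricting the query to the critical-tablespace list is omitted):

SELECT DBNAME, TSNAME, MAX(TIMESTAMP) AS LAST_COPY
FROM SYSIBM.SYSCOPY
WHERE ICTYPE IN ('F', 'I')
GROUP BY DBNAME, TSNAME
HAVING MAX(TIMESTAMP) < CURRENT TIMESTAMP - 36 HOURS;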

Lessons Learned

Lesson 1: The recovery was successful because the fundamentals were solid. Dual copies, daily incrementals, adequate log retention — these boring, routine practices are what made a 21-minute recovery possible. The incident wasn't a story about heroic debugging; it was a story about preparation.

Lesson 2: Application developers are essential to recovery. Marcus's knowledge of the restart table and commit logic saved at least 30 minutes that would have been spent analyzing whether the batch job could restart or needed to rerun from scratch. The DBA team knew how to run RECOVER; they didn't know how to restart FEDBEN42.

Lesson 3: Assessment time dominates recovery time. The actual RECOVER utility took 21 minutes. The assessment — understanding what was affected, what the dependencies were, what the recovery strategy should be — took 37 minutes. Automating the assessment is the biggest time-saving opportunity.

Lesson 4: Point-in-time recovery is the hard case. Their actual incident was the easy scenario: hardware failure, recover to current. Point-in-time recovery (from application bugs or bad batch runs) is far more complex because of RI cascade requirements. Practice it before you need it.

Lesson 5: Every copy you skip is a gamble. The incremental copy scheduled for 01:00 hadn't run when the failure occurred at 00:23. If the failure had happened at 01:30, that incremental would have been available and recovery would have been even faster. More copies = more recovery options = faster recovery.

Discussion Questions

  1. The total outage was roughly two hours, but the actual RECOVER utility ran for only 21 minutes. How could the assessment phase be shortened? What information should be pre-computed and readily available?

  2. Sandra's team took a full image copy immediately after recovery. Why do they treat this step as mandatory rather than optional? What would happen if another failure occurred before the copy completed?

  3. The incident was a double-disk RAID failure. What storage architecture changes would provide better protection? How do these interact with DB2's image copy strategy?

  4. Marcus's batch program uses COMMIT every 2,000 rows with a restart table. If the program committed every 50,000 rows instead, how would that change the recovery scenario? Consider both the recovery time and the batch restart time.

  5. Sandra instituted quarterly recovery drills. Some organizations consider drills too expensive because they consume test environment resources. How would you justify the cost to management? What metrics would you use?

  6. The automated RI dependency query helps identify which tablespaces need recovery. But some data dependencies aren't enforced by RI constraints (e.g., application-level cross-references). How do you capture and document these non-RI dependencies in the recovery plan?