Case Study 2: Point-in-Time Recovery — Undoing a Bad DELETE

Background

It is a Friday afternoon at Meridian National Bank. The online banking system runs on DB2 11.5 for LUW on a Linux server. The database, named MERIDIAN, stores account information, transaction records, and customer data for the bank's digital channels. The database is 500 GB, uses archive logging, and has a full online backup taken every Sunday at 02:00 AM and incremental backups taken nightly at 02:00 AM.

At 14:47:32, a junior developer named Priya is working on a data cleanup ticket. The ticket calls for deleting cancelled transactions older than 5 years from the TRANSACTION_ARCHIVE table. Priya connects to the production database (she has been granted temporary write access for this task) and runs:

DELETE FROM MERIDIAN.TRANSACTION WHERE TXN_DATE < '2019-01-01';

She immediately realizes two mistakes:

  1. She targeted the TRANSACTION table (active transactions) instead of TRANSACTION_ARCHIVE.

  2. She omitted the AND STATUS = 'CANCELLED' filter.

The DELETE statement has removed 4.2 million rows of active transaction history from the production TRANSACTION table. Customer-facing applications begin returning errors within seconds — customers logging into online banking see empty transaction histories.

The First Five Minutes (14:47 - 14:52)

Priya freezes for approximately 30 seconds, then contacts the senior DBA, Marcus. Marcus has been a DB2 DBA for 14 years. He does not panic.

14:48 — Marcus verifies the scope of the damage:

SELECT COUNT(*) FROM MERIDIAN.TRANSACTION;
-- Returns: 1,847,293  (was approximately 6,047,293 before the DELETE)
-- Approximately 4.2 million rows deleted
SELECT MIN(TXN_DATE), MAX(TXN_DATE) FROM MERIDIAN.TRANSACTION;
-- MIN: 2019-01-01, MAX: 2024-03-15
-- All transactions before 2019-01-01 are gone

14:49 — Marcus checks whether the DELETE has been committed:

-- Check application connection status
-- (MON_GET_CONNECTION exposes APPLICATION_HANDLE / SESSION_AUTH_ID,
--  not the legacy APPL_* snapshot column names)
SELECT APPLICATION_HANDLE, APPLICATION_NAME, SESSION_AUTH_ID
FROM TABLE(MON_GET_CONNECTION(NULL, -1)) AS C
WHERE SESSION_AUTH_ID = 'PRIYA';

Bad news — the connection is in auto-commit mode. The DELETE was committed immediately. A simple ROLLBACK will not help.
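Why auto-commit matters here can be illustrated outside DB2. The following is a minimal Python sketch using sqlite3 as a stand-in for the production database (the principle is identical in DB2's CLP, where connecting with db2 +c disables auto-commit):

```python
import sqlite3

# In-memory table standing in for the production TRANSACTION table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (txn_id INTEGER PRIMARY KEY, txn_date TEXT)")
conn.executemany("INSERT INTO txn VALUES (?, ?)",
                 [(1, "2015-04-01"), (2, "2018-07-19"), (3, "2023-02-28")])
conn.commit()

# With auto-commit OFF, a bad DELETE stays inside an open transaction...
conn.execute("DELETE FROM txn WHERE txn_date < '2019-01-01'")
assert conn.execute("SELECT COUNT(*) FROM txn").fetchone()[0] == 1

# ...so ROLLBACK undoes it completely
conn.rollback()
assert conn.execute("SELECT COUNT(*) FROM txn").fetchone()[0] == 3
print("rollback restored all rows")
```

With auto-commit on, the DELETE is committed the instant it returns, and the rollback path above simply does not exist.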

14:50 — Marcus declares a P1 incident and contacts the operations center. The online banking application is configured to show a maintenance message when the backend is unavailable. The ops team enables the maintenance page.

14:51 — Marcus begins the recovery procedure. He pulls up the recovery runbook (Section 3: Point-in-Time Recovery) and identifies the steps.

The Recovery Decision (14:52 - 14:55)

Marcus faces a critical decision: how to recover. His options are:

Option A: Full database PITR — Restore the entire database to 14:47:00 (before the DELETE). This recovers the deleted rows but also loses all legitimate transactions between 14:47:00 and now.

Option B: Tablespace-level PITR — Restore only the TRANSACTION table's tablespace to 14:47:00. This preserves recent changes in other tables but creates referential integrity concerns.

Option C: Export/import from a restored copy — Restore the database to a separate instance, extract the deleted rows, and re-insert them into production. This is surgical but slower.

Marcus chooses Option C for several reasons:

  - The bank continues to process transactions in other tables (ACCOUNT, CUSTOMER). A full database PITR would lose those changes.
  - Tablespace-level PITR would leave the TRANSACTION tablespace at a different point in time than ACCOUNT, potentially violating referential integrity.
  - Option C preserves all current data and only adds back the missing rows.

However, Marcus also prepares Option A as a fallback in case Option C takes too long.

The Recovery Execution (14:55 - 15:45)

Step 1: Stop the Bleeding (14:55)

Marcus confirms no further damage is occurring and instructs Priya to disconnect:

db2 "force application ($(db2 list applications | grep PRIYA | awk '{print $3}'))"

He also revokes Priya's temporary write access:

REVOKE DELETE ON MERIDIAN.TRANSACTION FROM USER PRIYA;

Step 2: Identify the Recovery Point (14:57)

Marcus needs to know what backup to restore and what time to roll forward to. He checks the backup history:

db2 list history backup all for MERIDIAN

# Output (abbreviated):
#  Op Obj Timestamp+Sequence  Type Dev Earliest Log Current Log Backup ID
#  B  D   20240310020000001   F    D   S0000847.LOG S0000847.LOG  ...
#  B  D   20240311020000001   I    D   S0000860.LOG S0000860.LOG  ...
#  B  D   20240312020000001   I    D   S0000873.LOG S0000873.LOG  ...
#  ...
#  B  D   20240315020000001   I    D   S0000911.LOG S0000911.LOG  ...

The most recent backup is an incremental taken early this morning (March 15, 02:00 AM). The full backup base is from Sunday (March 10, 02:00 AM). He needs to roll forward to 14:47:00 today — approximately 12.75 hours past the last incremental backup.
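The gap that log replay has to cover falls straight out of the timestamps. A quick sketch of the arithmetic (taking the text's "12.75 hours" figure at face value, which implies the last incremental ran at 02:00 today):

```python
from datetime import datetime

# Backup-history timestamps use DB2's yyyymmddhhmmss format
last_incremental = datetime.strptime("20240315020000", "%Y%m%d%H%M%S")
target_time = datetime(2024, 3, 15, 14, 47, 0)  # just before the DELETE

hours_of_log = (target_time - last_incremental).total_seconds() / 3600
print(f"roll forward through {hours_of_log:.2f} hours of archive logs")
# ~12.78 hours, matching the "approximately 12.75 hours" estimate
```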

Step 3: Restore to a Temporary Database (15:00)

Marcus restores the database to a separate instance on a recovery server. This server is maintained specifically for recovery operations and has sufficient disk space.

# On the recovery server
db2 "restore database MERIDIAN incremental automatic
     from /db2backups/meridian/
     taken at 20240315020000
     into MERIDIAN_RECOVERY
     redirect"

# Redirect containers to the recovery server's storage
db2 "SET TABLESPACE CONTAINERS FOR 0 USING (PATH '/db2recovery/meridian/ts0')"
db2 "SET TABLESPACE CONTAINERS FOR 1 USING (PATH '/db2recovery/meridian/ts1')"
db2 "SET TABLESPACE CONTAINERS FOR 2 USING (PATH '/db2recovery/meridian/ts2')"
# ... (additional tablespaces)

db2 "RESTORE DATABASE MERIDIAN CONTINUE"

The incremental automatic restore takes 25 minutes (DB2 resolves the required image chain from the recovery history file, restores the Sunday full backup first, then applies the needed incremental images). During this time, Marcus is already preparing the next steps.
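The "automatic" part of the restore is DB2 working out the image chain on its own. For cumulative incrementals (as opposed to deltas), that chain is just the base full backup plus the single image named in TAKEN AT. A hypothetical sketch of the resolution logic, using the case study's backup history:

```python
def restore_chain(history, taken_at):
    """Resolve the image chain for an automatic incremental restore.

    history: ordered list of (timestamp, type) where type is
    'F' (full), 'I' (cumulative incremental), or 'D' (delta).
    """
    idx = next(i for i, (ts, _) in enumerate(history) if ts == taken_at)
    base = max(i for i in range(idx + 1) if history[i][1] == "F")
    ts, kind = history[idx]
    if kind == "F":
        return [ts]                          # a full stands alone
    if kind == "I":
        return [history[base][0], ts]        # cumulative: full + this image
    # delta: full + every delta since that full
    return [history[base][0]] + [t for t, k in history[base + 1:idx + 1] if k == "D"]

history = [("20240310020000", "F"),
           ("20240311020000", "I"),
           ("20240312020000", "I"),
           ("20240313020000", "I"),
           ("20240314020000", "I"),
           ("20240315020000", "I")]
print(restore_chain(history, "20240315020000"))
# full base + latest cumulative incremental only
```

This is why nightly cumulative incrementals keep restore chains short: the four intermediate images are skipped entirely.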

Step 4: Roll Forward to Pre-DELETE Time (15:25)

db2 "rollforward database MERIDIAN_RECOVERY
     to 2024-03-15-14.47.00.000000 using local time
     and complete
     overflow log path (/db2archlog/meridian/node0000)"

The rollforward reads approximately 12.75 hours of archive logs and applies them to the restored database. At the bank's average log rate, this is about 10 GB of log data. At 100 MB/s log apply speed, this takes approximately 100 seconds. But the rollforward also needs to process the log records selectively and handle the undo phase, so the actual elapsed time is about 4 minutes.
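The back-of-the-envelope numbers in that estimate are worth making explicit (both figures come from the text: ~10 GB of log, ~100 MB/s apply rate):

```python
log_volume_mb = 10_000    # ~10 GB of archive logs for the 12.75-hour gap
apply_rate_mbps = 100     # assumed sequential log-apply rate, MB/s

raw_seconds = log_volume_mb / apply_rate_mbps
print(f"raw apply time: ~{raw_seconds:.0f} s ({raw_seconds / 60:.1f} min)")
# ~100 s; selective log processing and the undo phase stretch this to ~4 min
```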

15:29 — Rollforward complete. The MERIDIAN_RECOVERY database is at its state as of 14:47:00, with all 6,047,293 rows in the TRANSACTION table intact.

Step 5: Extract the Deleted Rows (15:30)

Marcus exports the rows that were deleted from the production database:

# Connect to the recovered copy first — DB2 queries cannot use
# three-part (database.schema.table) names
db2 connect to MERIDIAN_RECOVERY

db2 "EXPORT TO /db2recovery/deleted_txns.del OF DEL
     SELECT * FROM MERIDIAN.TRANSACTION
     WHERE TXN_DATE < '2019-01-01'"

Result: 4,200,000 rows exported to a delimited file, approximately 2.1 GB.

Marcus verifies the count matches expectations:

# Quick sanity check (line count equals row count because no column
# in this table contains embedded newlines)
wc -l /db2recovery/deleted_txns.del
# 4200000 lines — matches the gap in production (6,047,293 - 1,847,293)

Step 6: Re-insert the Deleted Rows (15:35)

Marcus loads the exported rows back into the production database:

db2 "IMPORT FROM /db2recovery/deleted_txns.del OF DEL
     INSERT INTO MERIDIAN.TRANSACTION"

The import of 4.2 million rows takes approximately 8 minutes with the default configuration. Marcus monitors the progress:

db2 "SELECT COUNT(*) FROM MERIDIAN.TRANSACTION"
# Monitor count increasing during import

15:43 — Import complete. Production TRANSACTION table row count: 6,047,293 — exactly the pre-DELETE count.

Step 7: Verify Data Integrity (15:43)

Marcus runs verification queries:

-- Verify total count
SELECT COUNT(*) FROM MERIDIAN.TRANSACTION;
-- 6,047,293 ✓

-- Verify date range restored
SELECT MIN(TXN_DATE), MAX(TXN_DATE) FROM MERIDIAN.TRANSACTION;
-- MIN: 2014-06-15, MAX: 2024-03-15 ✓

-- Verify no duplicates were introduced
SELECT TXN_ID, COUNT(*) FROM MERIDIAN.TRANSACTION
GROUP BY TXN_ID HAVING COUNT(*) > 1;
-- 0 rows ✓ (no duplicates — TXN_ID is the primary key, so IMPORT would have rejected duplicate rows)

-- Verify referential integrity with ACCOUNT table
SELECT COUNT(*) FROM MERIDIAN.TRANSACTION T
WHERE NOT EXISTS (
    SELECT 1 FROM MERIDIAN.ACCOUNT A
    WHERE A.ACCOUNT_ID = T.ACCOUNT_ID
);
-- 0 rows ✓ (all transactions reference valid accounts)

-- Spot-check: verify specific known transaction
SELECT * FROM MERIDIAN.TRANSACTION
WHERE TXN_ID = 1000001;
-- Returns expected row ✓
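Checks like these are worth codifying so the next incident does not depend on ad-hoc typing under pressure. A hypothetical sketch of a post-recovery verification harness, written against the Python DB-API and demonstrated here with sqlite3 standing in for a DB2 connection (table and column names mirror the case study; "transaction_t" avoids the reserved word TRANSACTION):

```python
import sqlite3

def verify_recovery(conn, expected_count, expected_min_date):
    """Run post-recovery checks; return a dict of check name -> pass/fail."""
    q = lambda sql: conn.execute(sql).fetchone()[0]
    return {
        "row_count": q("SELECT COUNT(*) FROM transaction_t") == expected_count,
        "date_range": q("SELECT MIN(txn_date) FROM transaction_t") == expected_min_date,
        "no_duplicates": q(
            "SELECT COUNT(*) FROM (SELECT txn_id FROM transaction_t "
            "GROUP BY txn_id HAVING COUNT(*) > 1)") == 0,
        "referential_integrity": q(
            "SELECT COUNT(*) FROM transaction_t t WHERE NOT EXISTS "
            "(SELECT 1 FROM account a WHERE a.account_id = t.account_id)") == 0,
    }

# Tiny fixture standing in for the recovered production tables
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE transaction_t "
             "(txn_id INTEGER PRIMARY KEY, account_id INTEGER, txn_date TEXT)")
conn.execute("INSERT INTO account VALUES (1)")
conn.executemany("INSERT INTO transaction_t VALUES (?, 1, ?)",
                 [(1, "2014-06-15"), (2, "2024-03-15")])

results = verify_recovery(conn, expected_count=2, expected_min_date="2014-06-15")
assert all(results.values()), results
print("all post-recovery checks passed")
```

Against a real DB2 database the same function would run over an ibm_db or JDBC connection; the point is that the checks, expected values, and pass/fail logic live in the runbook as code rather than in someone's memory.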

Step 8: Restore Service (15:45)

Marcus notifies the operations center that the database is recovered. The ops team disables the maintenance page and re-enables normal online banking access.

Total outage time: 58 minutes. The RTO was 30 minutes, so this was a miss — the incident report will note this. However, no committed transaction data was lost (RPO = 0 achieved) and no customer account data was affected.

Post-Incident Analysis

The next morning, Marcus leads the post-incident review. Key findings:

Root Cause

A developer with temporary production DELETE access executed a statement against the wrong table without a required filter predicate. Auto-commit was enabled, making the DELETE irreversible through simple ROLLBACK.

Contributing Factors

  1. No confirmation step for destructive operations. The bank's data access tool does not require confirmation before executing DELETE statements that affect more than N rows.

  2. Auto-commit was enabled. If the session had been in manual commit mode, Priya could have issued ROLLBACK immediately upon realizing the mistake.

  3. Temporary access grants are too broad. Priya was granted DELETE on all tables in the MERIDIAN schema, not just the TRANSACTION_ARCHIVE table she needed.

Corrective Actions

  1. Implement a "large DELETE guard." A database trigger or application-level check will require explicit confirmation for DELETE statements affecting more than 1,000 rows:

-- Guard trigger (simplified example). Transition tables (OLD TABLE)
-- require an AFTER trigger in DB2; the SIGNAL rolls back the DELETE.
CREATE TRIGGER MERIDIAN.TXN_DELETE_GUARD
    AFTER DELETE ON MERIDIAN.TRANSACTION
    REFERENCING OLD TABLE AS DELETED_ROWS
    FOR EACH STATEMENT
    WHEN ((SELECT COUNT(*) FROM DELETED_ROWS) > 1000)
    SIGNAL SQLSTATE '75001'
        SET MESSAGE_TEXT = 'DELETE affects >1000 rows. Use ALLOW_LARGE_DELETE=TRUE session variable to proceed.';
  2. Mandate manual commit for production data changes. All production connections for ad-hoc data changes must use db2 +c (connect with auto-commit off) or equivalent.

  3. Restrict temporary access grants. Instead of granting DELETE on an entire schema, grant DELETE only on the specific table needed, with an expiration time:

GRANT DELETE ON MERIDIAN.TRANSACTION_ARCHIVE TO USER PRIYA;
-- Automated job revokes after 24 hours

  4. Pre-stage the recovery environment. The 25-minute restore could have been eliminated if the recovery server maintained a continuously refreshed copy of the database (using HADR to a read-only standby, for example). Marcus adds this to the architecture roadmap.

  5. Update the recovery runbook. The Option C procedure (export/import from restored copy) was not documented in the runbook. Marcus adds it as a new section with estimated times for various database sizes.

Recovery Time Breakdown

Step                         Duration   Notes
Assess damage                3 min      Verify scope and commit status
Decision making              3 min      Choose recovery strategy
Stop bleeding / prepare      5 min      Revoke access, set up recovery server
Incremental restore          25 min     Full + incremental chain restoration
Rollforward to target time   4 min      10 GB of log data applied
Export deleted rows          5 min      4.2M rows, 2.1 GB
Import to production         8 min      4.2M rows reinserted
Verify integrity             2 min      Count, range, RI, spot checks
Restore service              3 min      Ops team re-enables application
Total                        58 min     Target was 30 min

The bottleneck was the restore step (25 minutes). Eliminating it with a pre-staged recovery environment would cut the total to roughly 33 minutes; shaving the rollforward and setup steps as well would bring recovery within the 30-minute RTO.
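The breakdown tallies as follows (durations taken directly from the table above):

```python
# Recovery time breakdown, minutes per step
steps = {
    "assess damage": 3, "decision making": 3, "stop bleeding / prepare": 5,
    "incremental restore": 25, "rollforward": 4, "export deleted rows": 5,
    "import to production": 8, "verify integrity": 2, "restore service": 3,
}

total = sum(steps.values())
bottleneck = max(steps, key=steps.get)
print(f"total: {total} min, bottleneck: {bottleneck} ({steps[bottleneck]} min)")
```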

Lessons for the Reader

  1. Auto-commit is the enemy of ad-hoc production changes. Always use manual commit mode when making changes to production data. The ability to ROLLBACK is your safety net.

  2. The surgical approach (Option C) is often better than the nuclear approach (full PITR). Exporting missing data from a restored copy and re-inserting it preserves all current activity. Full database PITR throws away everything since the target time.

  3. Recovery speed depends on preparation. The 25-minute restore dominated the recovery time. Pre-staging a recovery environment — through HADR standby, storage snapshots, or periodic refreshes — dramatically reduces RTO.

  4. Verify, verify, verify. After any recovery, run comprehensive verification queries. Count rows, check ranges, validate referential integrity, and spot-check specific records. Trust but verify.

  5. Post-incident analysis drives improvement. This incident led to five corrective actions that will prevent recurrence and speed up recovery if a similar event occurs. Every incident is an opportunity to strengthen the system.

  6. Access control is part of recovery strategy. If Priya had been granted access only to TRANSACTION_ARCHIVE (the table she actually needed), the incident would not have occurred. Least-privilege access is not just a security principle — it is a data protection principle.


This case study is based on a composite of real incidents at financial institutions. The names and specific details are fictional, but the recovery procedure, timeline, and lessons are representative of actual production recovery events.