Chapter 18 Key Takeaways: Backup, Recovery, and Logging

The One Sentence Summary

DB2's entire recovery architecture rests on write-ahead logging — the guarantee that every change is recorded in the log before it reaches the data pages — and your job as a DBA is to ensure that log chain is never broken, that backups are always available and tested, and that your team can execute recovery under pressure.


Core Principles

1. Write-Ahead Logging Is the Foundation

Every recovery mechanism in DB2 depends on the WAL protocol: the log records describing a change are written to stable storage before the changed data page itself is written to disk. This single invariant makes crash recovery, point-in-time recovery, and disaster recovery possible. Without it, no form of recovery is reliable.
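The invariant can be illustrated with a toy model. This is a minimal sketch, not DB2's implementation: the class name TinyWAL, its methods, and the page names are all hypothetical, and it only models redo (it ignores undo of uncommitted changes that reached disk).

```python
class TinyWAL:
    """Toy model of write-ahead logging (illustrative only, not DB2 internals)."""

    def __init__(self):
        self.stable_log = []    # log records that reached "disk"
        self.log_buffer = []    # log records still in memory
        self.disk_pages = {}    # data pages on "disk"
        self.buffer_pool = {}   # data pages in memory (possibly dirty)

    def update(self, page, value):
        # The change is logged first, then applied in memory only.
        self.log_buffer.append((page, value))
        self.buffer_pool[page] = value

    def flush_log(self):
        self.stable_log.extend(self.log_buffer)
        self.log_buffer.clear()

    def commit(self):
        # Commit forces the log; dirty pages may be written later.
        self.flush_log()

    def write_page(self, page):
        # WAL rule: force the log before the data page goes to disk.
        self.flush_log()
        self.disk_pages[page] = self.buffer_pool[page]

    def crash(self):
        # Everything held in memory is lost.
        self.log_buffer.clear()
        self.buffer_pool.clear()

    def recover(self):
        # Redo: replay the stable log over whatever reached disk.
        for page, value in self.stable_log:
            self.disk_pages[page] = value
        return dict(self.disk_pages)


db = TinyWAL()
db.update("P1", "balance=100")
db.commit()                       # log forced, data page still only in memory
db.update("P2", "balance=50")     # never committed, never flushed
db.crash()
state = db.recover()              # P1 is rebuilt from the log alone
```

In this sketch the committed change survives even though its data page was never written, because the log record reached stable storage first.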

2. RPO and RTO Drive Every Decision

Recovery Point Objective (how much data you can lose) and Recovery Time Objective (how long you can be down) are business requirements that determine your technical architecture. An RPO of zero requires synchronous replication. An RTO of 15 minutes requires automation. Know your numbers before you design your strategy.

3. The Log Is a Timeline

The recovery log is a sequential record of every change to every byte of data. Given this timeline, DB2 can replay changes forward (redo), reverse changes backward (undo), and reconstruct the database state at any logged point in time.
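The timeline idea can be sketched in a few lines. This is an illustration under assumptions, not DB2's log format: LogRecord, redo, and undo are hypothetical names, the log is assumed to be sorted by LSN, and None is used to mean "row absent".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    lsn: int        # log sequence number: position on the timeline
    key: str
    before: object  # value before the change (used for undo)
    after: object   # value after the change (used for redo)


def _apply(state, key, value):
    if value is None:
        state.pop(key, None)
    else:
        state[key] = value


def redo(state, log, up_to_lsn):
    """Replay changes forward onto a copy of state, up to and including up_to_lsn."""
    result = dict(state)
    for rec in log:
        if rec.lsn > up_to_lsn:
            break
        _apply(result, rec.key, rec.after)
    return result


def undo(state, log, back_to_lsn):
    """Reverse every change with lsn > back_to_lsn, newest first."""
    result = dict(state)
    for rec in reversed(log):
        if rec.lsn <= back_to_lsn:
            break
        _apply(result, rec.key, rec.before)
    return result


log = [
    LogRecord(1, "acct_1", None, 100),   # insert
    LogRecord(2, "acct_1", 100, 250),    # update
    LogRecord(3, "acct_1", 250, None),   # an unwanted delete
]

backup = {}                                       # image taken before LSN 1
at_lsn_2 = redo(backup, log, up_to_lsn=2)         # roll forward, stop before the delete
current = redo(backup, log, up_to_lsn=3)          # state after every change
rolled_back = undo(current, log, back_to_lsn=2)   # reverse only the delete
```

Both routes arrive at the same state as of LSN 2: one by replaying forward from an older image, the other by reversing changes from the current state.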


Platform-Specific Essentials

z/OS

  • Active logs are pre-allocated VSAM linear data sets used in circular fashion
  • Archive logs are offloaded copies of active logs — the permanent record
  • BSDS is the master directory of the log, maintained in dual copies
  • Dual logging writes two copies simultaneously for redundancy
  • COPY utility creates image copies; RECOVER utility restores them
  • LOGLOAD controls checkpoint frequency and crash recovery window
  • DSNJU004 prints the log map; DSNJU003 modifies it
  • GDPS with Metro Mirror provides RPO=0 disaster recovery

LUW

  • Circular logging (the default) supports only crash recovery and restore from offline backups; never use it in production
  • Archive logging enables point-in-time recovery, online backup, and rollforward
  • LOGFILSIZ, LOGPRIMARY, LOGSECOND control active log capacity
  • LOGARCHMETH1/2 control where archive logs are stored (disk, TSM, vendor)
  • BACKUP/RESTORE/ROLLFORWARD are the primary recovery commands
  • Full, incremental, and delta backups form a hierarchy of recovery speed vs. backup speed
  • Redirected restore allows recovery to different storage paths — essential for cloning
  • HADR ships log records to a standby database for disaster recovery; takeover is a manual command unless a cluster manager (such as TSA or Pacemaker) automates it
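The backup hierarchy can be made concrete by working out which images a restore needs. The sketch below assumes the usual LUW semantics (an incremental backup is cumulative since the last full backup; a delta covers changes since the last backup of any type). The function name restore_chain and the sample schedule are hypothetical, and the result lists the images required, not the order in which RESTORE commands are issued.

```python
def restore_chain(backups):
    """Return the images needed to restore to the newest backup.

    backups: list of (timestamp, kind) tuples sorted oldest first,
    where kind is "full", "incremental", or "delta".
    Raises ValueError if the list contains no full backup.
    """
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full]]
    tail = backups[last_full + 1:]

    # The newest cumulative incremental supersedes everything before it.
    inc_positions = [i for i, (_, kind) in enumerate(tail) if kind == "incremental"]
    if inc_positions:
        last_inc = inc_positions[-1]
        chain.append(tail[last_inc])
        tail = tail[last_inc + 1:]

    # Every delta taken after that point is still required.
    chain.extend(b for b in tail if b[1] == "delta")
    return chain


schedule = [
    ("Sun 00:00", "full"),
    ("Mon 00:00", "incremental"),
    ("Mon 06:00", "delta"),
    ("Tue 00:00", "incremental"),
    ("Tue 06:00", "delta"),
    ("Tue 12:00", "delta"),
]
needed = restore_chain(schedule)
```

Here four of the six images are needed. More frequent incrementals shorten the chain (faster recovery) at the cost of larger, slower backups, which is the trade-off the hierarchy expresses.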

Recovery Scenarios Quick Reference

| Scenario | z/OS Action | LUW Action |
|---|---|---|
| DB2 crashes | Automatic crash recovery on restart | Automatic crash recovery on restart |
| Tablespace corruption | RECOVER TABLESPACE | RESTORE tablespace + ROLLFORWARD |
| Bad DELETE/UPDATE | RECOVER TOLOGPOINT (PITR) | RESTORE + ROLLFORWARD to time |
| Disk failure | RECOVER from image copy + logs | RESTORE from backup + ROLLFORWARD |
| Data center loss | GDPS failover to DR site | HADR takeover to standby |

Critical Mistakes to Avoid

  1. Never run production on circular logging (LUW). You lose all point-in-time recovery capability.
  2. Never assume backups work without testing. Schedule monthly restore tests.
  3. Never store backups on the same storage as the database. A single storage failure destroys both.
  4. Never let archive logs accumulate without monitoring. Disk full = database stops.
  5. Never skip the recovery runbook. Document exact commands, not concepts. Test regularly.
  6. Never ignore long-running transactions. They prevent log reuse and cause log-full conditions.
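One way to keep a runbook at the level of exact commands is to generate them from parameters. The following is a sketch for a LUW point-in-time recovery only. The function name, database name, and path are hypothetical; the command text follows the documented RESTORE DATABASE ... TAKEN AT and ROLLFORWARD DATABASE ... TO isotime USING LOCAL TIME AND STOP forms, and should be verified against your Db2 version before it goes into a real runbook.

```python
from datetime import datetime


def pitr_commands(db_name, backup_dir, backup_taken_at, recover_to):
    """Render the two LUW commands for a point-in-time recovery (sketch only)."""
    taken_at = backup_taken_at.strftime("%Y%m%d%H%M%S")      # yyyymmddhhmmss
    iso_time = recover_to.strftime("%Y-%m-%d-%H.%M.%S")      # yyyy-mm-dd-hh.mm.ss
    return [
        f'db2 "RESTORE DATABASE {db_name} FROM {backup_dir} TAKEN AT {taken_at}"',
        f'db2 "ROLLFORWARD DATABASE {db_name} TO {iso_time} USING LOCAL TIME AND STOP"',
    ]


# Hypothetical values: recover to one minute before a bad DELETE at 14:32.
commands = pitr_commands(
    db_name="MERIDIAN",
    backup_dir="/backup/meridian",
    backup_taken_at=datetime(2024, 3, 10, 1, 0, 0),
    recover_to=datetime(2024, 3, 12, 14, 31, 0),
)
for line in commands:
    print(line)
```

Generating the text does not replace testing it: the rendered commands belong in the runbook only after they have been executed in a restore drill.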

Sizing Rules of Thumb

  • Active log capacity: At least 2x the log volume generated during the time it takes to offload or archive one log file, plus headroom for the longest-running transaction
  • Archive log disk: Daily log volume × retention days × 1.25 (25% headroom)
  • Backup window: Database size / available throughput, with margin for incremental approaches
  • Crash recovery time: Proportional to LOGLOAD (z/OS) or SOFTMAX (LUW; deprecated since Db2 10.5 in favor of PAGE_AGE_TRGT_MCR). Smaller values mean faster recovery but more checkpoint overhead
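The first three rules reduce to simple arithmetic. The sketch below uses hypothetical function names and sample figures; the one platform fact it relies on is that LOGFILSIZ on LUW is expressed in 4 KB pages, so total active log space is LOGFILSIZ × 4 KB × (LOGPRIMARY + LOGSECOND).

```python
PAGE_BYTES = 4096  # LOGFILSIZ is expressed in 4 KB pages


def active_log_bytes(logfilsiz_pages, logprimary, logsecond):
    """Maximum active log space on LUW, primary plus secondary files."""
    return logfilsiz_pages * PAGE_BYTES * (logprimary + logsecond)


def archive_disk_gb(daily_log_gb, retention_days, headroom=1.25):
    """Archive log disk: daily volume x retention days x headroom."""
    return daily_log_gb * retention_days * headroom


def backup_window_hours(db_size_gb, throughput_mb_per_s):
    """Backup window: database size divided by sustained throughput."""
    seconds = db_size_gb * 1024 / throughput_mb_per_s
    return seconds / 3600


# Sample figures (assumptions for illustration, not recommendations).
log_space = active_log_bytes(logfilsiz_pages=10000, logprimary=20, logsecond=30)
archive_gb = archive_disk_gb(daily_log_gb=40, retention_days=7)
window_h = backup_window_hours(db_size_gb=2048, throughput_mb_per_s=200)

print(f"active log space: {log_space / 1024**3:.2f} GiB")
print(f"archive log disk: {archive_gb:.0f} GB")
print(f"full backup window: {window_h:.2f} hours")
```

With these inputs the archive disk comes to 350 GB and the full backup window to roughly 2.9 hours, which is the kind of figure that decides whether a nightly full backup fits or an incremental approach is needed.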

The Meridian Bank Strategy in Brief

| System | Backup | Logs | DR | RPO | RTO |
|---|---|---|---|---|---|
| Core Banking (z/OS) | Weekly full + daily incremental image copies | Dual active, dual archive | GDPS Metro Mirror | 0 | 15 min |
| Online Banking (LUW) | Weekly full + daily incremental + 6-hour delta | Archive to disk + TSM | HADR NEARSYNC | < 5 min | 30 min |
| Data Warehouse (LUW) | Weekly full offline | Archive to disk (7-day retention) | Backup to DR site | 24 hours | 4 hours |
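Targets like these are only useful if recovery drills are measured against them. The sketch below encodes the RPO and RTO values from the table and flags which objective a drill missed. The function name and the drill measurements are hypothetical, and "< 5 min" is treated as a 5-minute upper bound.

```python
from datetime import timedelta

# Targets copied from the Meridian Bank table; "< 5 min" is treated as an upper bound.
TARGETS = {
    "Core Banking": {"rpo": timedelta(0), "rto": timedelta(minutes=15)},
    "Online Banking": {"rpo": timedelta(minutes=5), "rto": timedelta(minutes=30)},
    "Data Warehouse": {"rpo": timedelta(hours=24), "rto": timedelta(hours=4)},
}


def drill_misses(system, data_loss, downtime):
    """Return the objectives ("RPO", "RTO") that a recovery drill failed to meet."""
    target = TARGETS[system]
    misses = []
    if data_loss > target["rpo"]:
        misses.append("RPO")
    if downtime > target["rto"]:
        misses.append("RTO")
    return misses


# Hypothetical drill: 2 minutes of data lost, 41 minutes to bring the service back.
result = drill_misses(
    "Online Banking",
    data_loss=timedelta(minutes=2),
    downtime=timedelta(minutes=41),
)
print(result)
```

In this made-up drill the data loss is within the objective but the recovery time is not, which points at the procedure or automation, not the backup strategy.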

Connections to Other Chapters

  • Chapter 3 (Architecture): The log manager is a core DB2 subsystem — now you know what it does
  • Chapter 14 (Locking): Locks interact with recovery during online tablespace-level restore
  • Chapter 17 (Utilities): COPY, RECOVER, CHECK DATA, and REBUILD INDEX are the recovery workhorses on z/OS
  • Chapter 19 (Security): Access control reduces the risk of the "bad DELETE" scenario from Case Study 2
  • Chapter 26 (High Availability): HADR, data sharing, and pureScale build on the recovery foundations in this chapter

The Bottom Line

Backup and recovery is not about technology. It is about discipline. The technology is well-understood — WAL, image copies, archive logs, HADR, GDPS. What separates a prepared organization from a vulnerable one is the discipline to take backups on schedule, test restores regularly, maintain runbooks, monitor infrastructure, and practice recovery drills. The day you need to recover is the wrong day to learn how.