Chapter 18 Key Takeaways: Backup, Recovery, and Logging

The One-Sentence Summary

DB2's entire recovery architecture rests on write-ahead logging — the guarantee that every change is recorded in the log before it reaches the data pages — and your job as a DBA is to ensure that log chain is never broken, that backups are always available and tested, and that your team can execute recovery under pressure.

Core Principles

1. Write-Ahead Logging Is the Foundation

Every recovery mechanism in DB2 depends on the WAL protocol: log records are written to stable storage before data page changes. This single invariant makes crash recovery, point-in-time recovery, and disaster recovery possible. Without it, no form of recovery is reliable.
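The ordering rule can be illustrated with a toy key-value store. This is a minimal sketch of the write-ahead idea only, not DB2's log format or buffer manager; the class name `TinyWalStore` and the JSON record layout are assumptions made for the example.

```python
import json
import os
import tempfile


class TinyWalStore:
    """Toy store: each change is logged durably before the 'data page' changes."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.pages = {}  # stands in for data pages held in memory

    def put(self, key, value):
        # 1. Append the log record and force it to stable storage first.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply the change to the data page.
        self.pages[key] = value

    @classmethod
    def recover(cls, log_path):
        """Rebuild state after a crash by replaying the log from the start."""
        store = cls(log_path)
        if os.path.exists(log_path):
            with open(log_path, encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    store.pages[record["key"]] = record["value"]
        return store


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wal.log")
        store = TinyWalStore(path)
        store.put("acct-1", 100)
        store.put("acct-1", 75)
        del store  # simulate a crash: the in-memory pages are lost
        return TinyWalStore.recover(path).pages
```

Because the log record reaches disk before the page changes, `demo()` can rebuild the last written value from the log alone after the simulated crash.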

2. RPO and RTO Drive Every Decision

Recovery Point Objective (how much data you can lose) and Recovery Time Objective (how long you can be down) are business requirements that determine your technical architecture. An RPO of zero requires synchronous replication. An RTO of 15 minutes requires automation. Know your numbers before you design your strategy.

3. The Log Is a Timeline

The recovery log is a sequential record of every logged change to the database. Given this timeline, DB2 can replay changes forward (redo), reverse changes backward (undo), and reconstruct the database state at any point in time that the retained logs cover.
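Redo and undo over a timeline can be sketched in a few lines. The record layout below (`lsn`, `txn`, `before`, `after`) is a hypothetical simplification for illustration, not DB2's actual log record structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    lsn: int            # position in the log (hypothetical field name)
    txn: str            # transaction identifier
    kind: str           # "update" or "commit"
    key: str = ""
    before: object = None   # before-image, used for undo
    after: object = None    # after-image, used for redo


def replay(log, stop_lsn):
    """Redo updates up to stop_lsn, then undo updates of uncommitted transactions."""
    state = {}
    window = [r for r in log if r.lsn <= stop_lsn]
    committed = {r.txn for r in window if r.kind == "commit"}

    # Redo pass: apply after-images in log order.
    for r in window:
        if r.kind == "update":
            state[r.key] = r.after

    # Undo pass: walk backward, restoring before-images of in-flight work.
    for r in reversed(window):
        if r.kind == "update" and r.txn not in committed:
            if r.before is None:
                state.pop(r.key, None)
            else:
                state[r.key] = r.before
    return state


LOG = [
    LogRecord(1, "T1", "update", "balance", None, 100),
    LogRecord(2, "T1", "commit"),
    LogRecord(3, "T2", "update", "balance", 100, 0),   # the "bad UPDATE"
    LogRecord(4, "T2", "commit"),
    LogRecord(5, "T3", "update", "balance", 0, 50),    # never commits
]
```

Calling `replay(LOG, 2)` stops just before the bad update and yields a balance of 100, which is the essence of point-in-time recovery. Calling `replay(LOG, 5)` redoes everything and then backs out the uncommitted T3 change.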

Platform-Specific Essentials

z/OS

Active logs are pre-allocated VSAM linear data sets used in circular fashion
Archive logs are offloaded copies of active logs — the permanent record
BSDS is the master directory of the log, maintained in dual copies
Dual logging writes two copies simultaneously for redundancy
COPY utility creates image copies; RECOVER utility restores them
LOGLOAD controls checkpoint frequency and crash recovery window
DSNJU004 prints the log map; DSNJU003 modifies it
GDPS with Metro Mirror provides RPO=0 disaster recovery

LUW

Circular logging (default) supports only crash recovery and restore of an offline backup, with no rollforward; never use it in production
Archive logging enables point-in-time recovery, online backup, and rollforward
LOGFILSIZ, LOGPRIMARY, LOGSECOND control active log capacity
LOGARCHMETH1/2 control where archive logs are stored (disk, TSM, vendor)
BACKUP/RESTORE/ROLLFORWARD are the primary recovery commands
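Full, incremental (changes since the last full), and delta (changes since the last backup of any type) backups trade backup speed against restore speed: smaller backups run faster but lengthen the restore chain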
Redirected restore allows recovery to different storage paths — essential for cloning
HADR keeps a standby database current by shipping log data; failover is the TAKEOVER HADR command, automated only when a cluster manager drives it
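The full, incremental, and delta hierarchy determines which images a restore needs. The sketch below assumes that an incremental image is cumulative since the last full backup and that a delta covers changes since the most recent backup of any type. `BackupImage` and `restore_chain` are hypothetical names for illustration; in practice DB2 can work out the chain itself from its recovery history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupImage:
    taken_at: int   # any sortable timestamp; integers keep the example short
    kind: str       # "full", "incremental", or "delta"


def restore_chain(images, target_time):
    """Return the images needed to reach the newest backup at or before target_time."""
    usable = sorted((i for i in images if i.taken_at <= target_time),
                    key=lambda i: i.taken_at)
    fulls = [i for i in usable if i.kind == "full"]
    if not fulls:
        raise ValueError("no full backup available before the target time")
    base = fulls[-1]
    chain = [base]

    after_base = [i for i in usable if i.taken_at > base.taken_at]
    incrementals = [i for i in after_base if i.kind == "incremental"]
    start = base.taken_at
    if incrementals:
        chain.append(incrementals[-1])   # cumulative: only the newest is needed
        start = incrementals[-1].taken_at
    # Deltas are not cumulative: every one after that point is needed.
    chain.extend(i for i in after_base
                 if i.kind == "delta" and i.taken_at > start)
    return chain


IMAGES = [
    BackupImage(0, "full"),
    BackupImage(1, "delta"),
    BackupImage(2, "incremental"),
    BackupImage(3, "delta"),
    BackupImage(4, "delta"),
    BackupImage(10, "full"),
]
```

For a target time of 5, the chain is the full image at 0, the incremental at 2, and the deltas at 3 and 4. The delta at 1 is skipped because the later incremental already contains its changes. Rollforward through the logs then covers the gap between the last image and the target time.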

Recovery Scenarios Quick Reference

Scenario	z/OS Action	LUW Action
DB2 crashes	Automatic crash recovery on restart	Automatic crash recovery on restart
Tablespace corruption	RECOVER TABLESPACE	RESTORE tablespace + ROLLFORWARD
Bad DELETE/UPDATE	RECOVER TOLOGPOINT (PITR)	RESTORE + ROLLFORWARD to time
Disk failure	RECOVER from image copy + logs	RESTORE from backup + ROLLFORWARD
Data center loss	GDPS failover to DR site	HADR takeover to standby

Critical Mistakes to Avoid

Never run production on circular logging (LUW). You lose all point-in-time recovery capability.
Never assume backups work without testing. Schedule monthly restore tests.
Never store backups on the same storage as the database. A single storage failure destroys both.
Never let archive logs accumulate without monitoring. Disk full = database stops.
Never skip the recovery runbook. Document exact commands, not concepts. Test regularly.
Never ignore long-running transactions. They prevent log reuse and cause log-full conditions.
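The archive-log monitoring point is easy to automate. Here is a minimal sketch using only the Python standard library; the 80% and 90% thresholds and the function names are assumptions for the example, not DB2 defaults.

```python
import shutil


def classify_usage(used_pct, warn_pct=80.0, critical_pct=90.0):
    """Map a used-space percentage to an alert level."""
    if used_pct >= critical_pct:
        return "critical"
    if used_pct >= warn_pct:
        return "warn"
    return "ok"


def check_archive_path(path, warn_pct=80.0, critical_pct=90.0):
    """Return (level, used_pct) for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    return classify_usage(used_pct, warn_pct, critical_pct), round(used_pct, 1)
```

A scheduler could call `check_archive_path` on the archive log directory every few minutes and page the on-call DBA on "critical", well before the filesystem fills and the database stops.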

Sizing Rules of Thumb

Active log capacity: At least 2x the log volume generated during the time it takes to offload/archive one log file, plus headroom for the log written by the longest-running transaction
Archive log disk: Daily log volume x retention days x 1.25 (headroom)
Backup window: Database size / available throughput, with margin for incremental approaches
Crash recovery time: Roughly proportional to LOGLOAD (z/OS) or SOFTMAX (LUW; deprecated in recent releases in favor of PAGE_AGE_TRGT_MCR). Smaller values mean faster recovery but more checkpoint overhead
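The rules of thumb above reduce to simple arithmetic. The sketch below works through them with made-up numbers; the function and parameter names are hypothetical. The only DB2-specific fact it relies on is that LOGFILSIZ is expressed in 4 KB pages.

```python
PAGE_BYTES = 4096  # LOGFILSIZ is expressed in 4 KB pages


def luw_active_log_gib(logfilsiz_pages, logprimary, logsecond):
    """Maximum active log space implied by the three LUW parameters."""
    total_bytes = logfilsiz_pages * PAGE_BYTES * (logprimary + logsecond)
    return total_bytes / 1024 ** 3


def min_active_log_gb(log_rate_gb_per_min, archive_minutes_per_file,
                      longest_txn_log_gb):
    """2x the log written while one file is archived, plus the longest transaction."""
    return 2 * log_rate_gb_per_min * archive_minutes_per_file + longest_txn_log_gb


def archive_log_disk_gb(daily_log_gb, retention_days, headroom=1.25):
    """Daily log volume x retention days x headroom factor."""
    return daily_log_gb * retention_days * headroom


def backup_window_hours(db_size_gb, throughput_gb_per_hour):
    """Database size divided by sustained backup throughput."""
    if throughput_gb_per_hour <= 0:
        raise ValueError("throughput must be positive")
    return db_size_gb / throughput_gb_per_hour
```

As a worked example, 40 GB of log per day retained for 7 days needs 40 x 7 x 1.25 = 350 GB of archive disk, and a 2,000 GB database backed up at 500 GB per hour needs a 4-hour window. A LOGFILSIZ of 65,536 pages with 20 primary and 12 secondary files gives 8 GiB of active log space.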

The Meridian Bank Strategy in Brief

System	Backup	Logs	DR	RPO	RTO
Core Banking (z/OS)	Weekly full + daily incremental image copies	Dual active, dual archive	GDPS Metro Mirror	0	15 min
Online Banking (LUW)	Weekly full + daily incremental + 6-hour delta	Archive to disk + TSM	HADR NEARSYNC	< 5 min	30 min
Data Warehouse (LUW)	Weekly full offline	Archive to disk (7-day retention)	Backup to DR site	24 hours	4 hours

Connections to Other Chapters

Chapter 3 (Architecture): The log manager is a core DB2 subsystem — now you know what it does
Chapter 14 (Locking): Locks interact with recovery during online tablespace-level restore
Chapter 17 (Utilities): COPY, RECOVER, CHECK DATA, and REBUILD INDEX are the recovery workhorses on z/OS
Chapter 19 (Security): Access control prevents the "bad DELETE" scenario from Case Study 2
Chapter 26 (High Availability): HADR, data sharing, and pureScale build on the recovery foundations in this chapter

The Bottom Line

Backup and recovery is not about technology. It is about discipline. The technology is well-understood — WAL, image copies, archive logs, HADR, GDPS. What separates a prepared organization from a vulnerable one is the discipline to take backups on schedule, test restores regularly, maintain runbooks, monitor infrastructure, and practice recovery drills. The day you need to recover is the wrong day to learn how.