Chapter 18 Exercises: Backup, Recovery, and Logging
These exercises range from conceptual understanding to hands-on recovery scenarios and sizing calculations. In a production environment, backup and recovery skills are not theoretical — they are tested under pressure. Work through every exercise carefully.
Section A: Conceptual Understanding (Exercises 1-8)
Exercise 1: Write-Ahead Logging Fundamentals
Explain in your own words why the log record must be written to stable storage before the corresponding data page change is written. Describe a specific failure scenario where violating this rule would lead to an unrecoverable database.
Hint: Consider what happens if the system crashes after writing the data page but before writing the log record.
Exercise 2: Redo vs. Undo
For each of the following crash scenarios, state whether DB2 needs to redo, undo, both, or neither for the described transaction:
a) Transaction T1 committed. Its data pages were flushed to disk. DB2 crashes.
b) Transaction T2 committed. Its data pages were NOT flushed to disk. DB2 crashes.
c) Transaction T3 did NOT commit. Its data pages WERE flushed to disk (by buffer pool page cleaning). DB2 crashes.
d) Transaction T4 did NOT commit. Its data pages were NOT flushed to disk. DB2 crashes.
For each case, describe the state of the data on disk before and after crash recovery.
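As a study aid, the net effect of each scenario can be sketched as a tiny shell function. The yes/no encoding is ours, and note that DB2's ARIES-style recovery actually redoes all logged work first ("repeat history") and only then undoes uncommitted transactions; this table shows just the net outcome on disk.

```shell
# recovery_action <committed:yes|no> <flushed:yes|no> -> net recovery effect
recovery_action() {
  case "$1,$2" in
    yes,yes) echo "none (change already durable)" ;;
    yes,no)  echo "redo (reapply from log)" ;;
    no,yes)  echo "undo (restore before-image from log)" ;;
    no,no)   echo "none (no change ever reached disk)" ;;
  esac
}

recovery_action yes no    # T2: committed but not flushed
recovery_action no  yes   # T3: uncommitted but flushed
```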
Exercise 3: RPO and RTO Analysis
Meridian National Bank is evaluating backup strategies for a new loan origination system. The business provides these requirements:
- The system processes 200 loan applications per day during business hours (8 AM-6 PM)
- Each loan application involves approximately 50 database operations
- A lost loan application requires 4 hours of manual re-entry work
- The cost of a lost application is estimated at $500 (staff time + customer frustration)
- System downtime costs $2,000 per hour in lost productivity
Calculate the appropriate RPO and RTO for this system. Justify your choices with cost analysis. What backup strategy (from the options in this chapter) would you recommend?
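A back-of-envelope costing from the figures above can anchor the analysis. The one-hour RPO used here is an illustrative assumption for scale, not the answer:

```shell
apps_per_hour=$(( 200 / 10 ))            # 200 applications over a 10-hour business day
rpo_1h_loss=$(( apps_per_hour * 500 ))   # value of one hour of lost applications, at $500 each
downtime_per_hour=2000
echo "data loss exposure: \$${rpo_1h_loss}/hour vs downtime cost: \$${downtime_per_hour}/hour"
```

Note that an hour of lost data costs five times more than an hour of downtime here, which should pull the recommendation toward a tight RPO.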
Exercise 4: Checkpoint Impact
A z/OS DB2 system has LOGLOAD set to 200,000. The average log record rate is 10,000 records per second.
a) How often does a checkpoint occur (in seconds)?
b) What is the approximate maximum crash recovery time, assuming 500 bytes per log record and 200 MB/s log read speed?
c) If the DBA reduces LOGLOAD to 50,000, how does this affect checkpoint frequency and crash recovery time?
d) What is the trade-off of a more frequent checkpoint?
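The arithmetic for parts a) and b) can be checked in a few lines of shell. Note that part b) covers only re-reading the log; applying the records adds time on top of this:

```shell
interval_s=$(( 200000 / 10000 ))         # a) LOGLOAD / records-per-second
redo_mb=$(( 200000 * 500 / 1000000 ))    # up to one LOGLOAD of records to re-read, in MB
read_s=$(awk -v mb="$redo_mb" 'BEGIN { printf "%.1f", mb/200 }')  # b) read time at 200 MB/s
echo "checkpoint interval ${interval_s}s, redo window ${redo_mb} MB, log read ~${read_s}s"
```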
Exercise 5: Circular vs. Archive Logging Decision
For each of the following databases, recommend circular or archive logging and explain your reasoning:
a) A production OLTP database serving an e-commerce website
b) A developer's local database used for unit testing
c) A staging database that mirrors production for QA testing
d) A read-only reporting database refreshed nightly from production
e) A temporary database created for a one-time data migration project
Exercise 6: Recovery Dependencies
The Meridian Bank core banking system on z/OS has the following tablespace structure:
CUSTDATA — Customer master data (PK: CUSTOMER_ID)
ACCTDATA — Account data (FK: CUSTOMER_ID → CUSTDATA)
TXNDATA — Transaction data (FK: ACCOUNT_ID → ACCTDATA)
LOANDATA — Loan data (FK: CUSTOMER_ID → CUSTDATA, FK: ACCOUNT_ID → ACCTDATA)
A DBA needs to perform point-in-time recovery on ACCTDATA to reverse a bad batch UPDATE. Explain:
a) Why recovering only ACCTDATA creates a referential integrity problem
b) Which other tablespaces might need to be recovered to the same point
c) What DB2 status flags will be set on the recovered tablespace and related objects
d) The steps to resolve these status flags after recovery
Exercise 7: Log Record Analysis
A log record for an UPDATE operation on z/OS contains these fields:
- RBA: X'00000003A5F20000'
- Transaction ID: X'0000001234'
- Record Type: UPDATE
- DBID: 312, OBID: 7
- Page: 4782, Offset: 200
- Before Image: X'4D657269....' (16 bytes, ASCII "Meridian Savings")
- After Image: X'4D657269....' (24 bytes, ASCII "Meridian Premium Savings")
a) If this transaction committed and then a crash occurred before the page was flushed, what does DB2 do during recovery?
b) If this transaction did NOT commit and the page WAS flushed before the crash, what does DB2 do during recovery?
c) How does DB2 locate the previous log record for this same transaction?
Exercise 8: BSDS Contents
You run DSNJU004 and see the following output (abbreviated):
ACTIVE LOG COPY 1:
DSNAME=DSNCAT.LOGCOPY1.DS001, STARTRBA=000000000000, ENDRBA=0000007FFFFF, STATUS=REUSABLE
DSNAME=DSNCAT.LOGCOPY1.DS002, STARTRBA=000000800000, ENDRBA=000000FFFFFF, STATUS=REUSABLE
DSNAME=DSNCAT.LOGCOPY1.DS003, STARTRBA=000001000000, ENDRBA=0000017FFFFF, STATUS=CURRENT
DSNAME=DSNCAT.LOGCOPY1.DS004, STARTRBA=000000000000, ENDRBA=000000000000, STATUS=AVAILABLE
ARCHIVE LOG:
DSNAME=ARCHLOG.D20240315.T010000, STARTRBA=000000000000, ENDRBA=0000007FFFFF
DSNAME=ARCHLOG.D20240315.T020000, STARTRBA=000000800000, ENDRBA=000000FFFFFF
a) How many active log data sets are defined?
b) Which one is currently being written to?
c) How many have been offloaded (archived)?
d) What does STATUS=AVAILABLE mean?
Section B: Log Sizing Calculations (Exercises 9-14)
Exercise 9: z/OS Active Log Sizing
An OLTP system on z/OS has the following workload characteristics:
- Peak transaction rate: 3,000 transactions/second
- Average log records per transaction: 4
- Average log record size: 600 bytes
- Archive log offload time: 45 seconds per data set
Calculate:
a) The peak log write rate in MB/second
b) The minimum size of each active log data set
c) A recommended configuration (number and size of data sets) with a 2x safety factor
d) The total active log capacity
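One way to run the numbers for parts a) through c), assuming the sizing rationale that each active log data set must hold at least the log generated during one 45-second offload (part d is then the chosen size times the number of data sets):

```shell
peak_mb_s=$(awk 'BEGIN { printf "%.1f", 3000*4*600/1e6 }')       # a) tps x records x bytes
min_ds_mb=$(awk 'BEGIN { printf "%.0f", (3000*4*600/1e6)*45 }')  # b) log written during one offload
safe_ds_mb=$(awk -v m="$min_ds_mb" 'BEGIN { printf "%.0f", m*2 }')  # c) 2x safety factor
echo "peak ${peak_mb_s} MB/s; minimum data set ${min_ds_mb} MB; sized data set ${safe_ds_mb} MB"
```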
Exercise 10: LUW Log File Sizing
A DB2 LUW database has the following characteristics:
- Average daily log generation: 20 GB
- Peak hour log generation: 5 GB
- Longest expected transaction duration: 30 minutes
- Log generation during a 30-minute peak period: 2.5 GB
Calculate:
a) Appropriate LOGFILSIZ (in 4 KB pages, targeting 100 MB per file)
b) Appropriate LOGPRIMARY to hold at least 2 hours of peak log data
c) Appropriate LOGSECOND to provide 1 additional hour of capacity
d) Total active log capacity in GB
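The arithmetic can be sketched as follows, assuming binary units (1 GB = 1024 MB) and rounding file counts up; a real configuration must also respect DB2's limits on LOGPRIMARY and LOGSECOND:

```shell
logfilsiz=$(( 100 * 1024 / 4 ))    # a) 100 MB file expressed in 4 KB pages
logprimary=$(awk 'BEGIN { printf "%d", (2*5*1024)/100 + 0.999 }')  # b) 2 h x 5 GB/h in 100 MB files
logsecond=$(awk 'BEGIN { printf "%d", (1*5*1024)/100 + 0.999 }')   # c) 1 more hour of capacity
total_gb=$(awk -v p="$logprimary" -v s="$logsecond" \
           'BEGIN { printf "%.1f", (p+s)*100/1024 }')              # d) total active log
echo "LOGFILSIZ=${logfilsiz} pages, LOGPRIMARY=${logprimary}, LOGSECOND=${logsecond}, total ${total_gb} GB"
```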
Exercise 11: Archive Log Disk Sizing
Using the same LUW database from Exercise 10:
- Archive log retention requirement: 14 days on disk
- Average daily log generation: 20 GB (weekdays), 5 GB (weekends)
Calculate:
a) Total archive log disk space needed for 14-day retention
b) Recommended filesystem size (with 25% headroom)
c) At what percentage of filesystem utilization should an alert fire?
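A sketch of the calculation. The 10-weekday/4-weekend split of the 14-day window is an assumption, and the 80% alert threshold is a common operational choice rather than anything DB2-specific:

```shell
archive_gb=$(( 10*20 + 4*5 ))   # a) 10 weekdays at 20 GB + 4 weekend days at 5 GB
fs_gb=$(awk -v a="$archive_gb" 'BEGIN { printf "%.0f", a*1.25 }')  # b) 25% headroom
alert_pct=80                    # c) assumed threshold; alert before headroom is consumed
echo "retention ${archive_gb} GB, filesystem ${fs_gb} GB, alert at ${alert_pct}%"
```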
Exercise 12: Backup Window Calculation
Meridian Bank's online banking database is 500 GB. The available backup window is 4 hours (02:00 AM to 06:00 AM). The backup target is a SAN-attached disk array.
a) What minimum sustained write throughput is needed for a full backup?
b) If the disk array delivers 150 MB/s sustained write throughput, can a full backup complete in the window?
c) With 3-way parallelism, what throughput is needed per stream?
d) If the daily change rate is 5% of the database, how large is an incremental backup? How long does it take at 150 MB/s?
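The numbers can be checked as follows, using decimal units (1 GB = 1000 MB) to match the answer-key hints:

```shell
min_mb_s=$(awk 'BEGIN { printf "%.1f", 500000/(4*3600) }')       # a) 500 GB over a 4-hour window
full_min=$(awk 'BEGIN { printf "%.0f", 500000/150/60 }')         # b) full backup at 150 MB/s
per_stream=$(awk 'BEGIN { printf "%.1f", 500000/(4*3600)/3 }')   # c) per stream, 3-way parallel
incr_min=$(awk 'BEGIN { printf "%.1f", (500000*0.05)/150/60 }')  # d) 5% change = 25 GB incremental
echo "need ${min_mb_s} MB/s; full ~${full_min} min (fits); ${per_stream} MB/s per stream; incremental ~${incr_min} min"
```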
Exercise 13: Recovery Time Estimation
A z/OS tablespace has the following characteristics:
- Size: 100 GB
- Most recent full image copy: 6 hours old
- Log generation rate for this tablespace: 500 MB/hour
- Image copy restore speed: 200 MB/s
- Log apply speed: 50 MB/s (log records must be read and applied selectively)
Estimate the total recovery time:
a) Time to restore the image copy
b) Volume of log data to apply
c) Time to apply the log data
d) Total estimated recovery time
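A sketch of the estimate, again using 1 GB = 1000 MB as in the hints; the total excludes utility start-up and data set allocation overhead:

```shell
restore_s=$(( 100000 / 200 ))        # a) 100 GB image copy at 200 MB/s
log_mb=$(( 6 * 500 ))                # b) 6 hours of log at 500 MB/hour
apply_s=$(( log_mb / 50 ))           # c) log apply at 50 MB/s
total_s=$(( restore_s + apply_s ))   # d) restore + apply
echo "restore ${restore_s}s + apply ${apply_s}s = ${total_s}s (~9.3 minutes)"
```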
Exercise 14: HADR Replication Bandwidth
Meridian Bank's HADR configuration sends log data from the primary to the standby:
- Average log generation: 20 GB/day = ~230 KB/s average
- Peak log generation: 5 GB/hour = ~1.4 MB/s
- Network latency to DR site: 5 ms round-trip
- HADR mode: NEARSYNC
a) What minimum network bandwidth is required for sustained replication during peak?
b) How much additional bandwidth should be provisioned for bursts (recommend 3x peak)?
c) In NEARSYNC mode, what is the approximate impact on COMMIT latency?
d) If the network goes down for 10 minutes during peak, how much log data accumulates in the HADR send buffer?
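The bandwidth arithmetic can be sketched as follows. For part c), NEARSYNC holds each COMMIT until the standby acknowledges receipt of the log into memory, so roughly one network round-trip is added:

```shell
peak_mbps=$(awk 'BEGIN { printf "%.1f", 1.4*8 }')      # a) 1.4 MB/s peak, in megabits/s
burst_mbps=$(awk 'BEGIN { printf "%.1f", 1.4*8*3 }')   # b) 3x provisioning for bursts
commit_add_ms=5                                        # c) ~1 RTT added per COMMIT in NEARSYNC
backlog_mb=$(awk 'BEGIN { printf "%.0f", 1.4*600 }')   # d) 10 minutes at peak rate
echo "peak ${peak_mbps} Mbps; provision ${burst_mbps} Mbps; ~${commit_add_ms} ms per COMMIT; backlog ${backlog_mb} MB"
```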
Section C: Hands-On Recovery Scenarios (Exercises 15-25)
Exercise 15: LUW Full Database Recovery
Write the complete sequence of commands to:
a) Take a full online backup of database MERIDIAN to /db2backups/
b) Simulate a failure by dropping the database
c) Restore the database from the backup
d) Roll forward to end of logs and complete
e) Verify the recovery by querying a known table
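A dry-run skeleton of one possible answer, with `run` echoing each command instead of executing it (replace the echo with real execution on a system that has DB2; the CUSTOMER table in step e is an assumed example):

```shell
run() { echo "+ $*"; }   # dry-run stub
run db2 "BACKUP DATABASE MERIDIAN ONLINE TO /db2backups INCLUDE LOGS"   # a)
run db2 "DROP DATABASE MERIDIAN"                                        # b) simulate the failure
run db2 "RESTORE DATABASE MERIDIAN FROM /db2backups"                    # c)
run db2 "ROLLFORWARD DATABASE MERIDIAN TO END OF LOGS AND COMPLETE"     # d)
run db2 "CONNECT TO MERIDIAN"                                           # e) verify with a known table
run db2 "SELECT COUNT(*) FROM MERIDIAN.CUSTOMER"
```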
Exercise 16: LUW Point-in-Time Recovery
A developer accidentally ran DELETE FROM MERIDIAN.TRANSACTION WHERE TXN_DATE < '2024-01-01' at 14:47:00 today. The correct command should have included AND STATUS = 'CANCELLED'.
Write the complete sequence of commands to recover the TRANSACTION table data to its state at 14:46:00.
Include:
a) How to determine the exact time of the bad DELETE
b) The RESTORE command
c) The ROLLFORWARD command with the correct timestamp
d) How to verify the recovery
Exercise 17: z/OS Tablespace Recovery
Write the JCL to:
a) Take a full image copy of tablespace MERIDIAN.TXNDATA with SHRLEVEL CHANGE
b) Take an incremental image copy of the same tablespace
c) Recover the tablespace to current (using the most recent image copy + logs)
d) Recover the tablespace to a specific RBA
Exercise 18: LUW Redirected Restore
You need to clone the production MERIDIAN database to a test environment with different storage paths:
- Production data path: /db2data/meridian/
- Test data path: /db2test/meridian_clone/
- Production log path: /db2logs/meridian/
- Test log path: /db2test/meridian_clone_logs/
Write the complete redirected restore procedure including:
a) Taking the backup from production
b) Generating the redirect script
c) Editing the script (describe what changes to make)
d) Executing the redirected restore
e) Rolling forward and completing
Exercise 19: Enable Archive Logging on LUW
A database named DEVDB currently uses circular logging. Write the complete procedure to convert it to archive logging, including:
a) Checking the current logging mode
b) Changing the database configuration
c) Taking the required full backup
d) Verifying archive logging is active
e) Testing that archive logs are being created
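A dry-run sketch of the procedure (commands are echoed, not executed; /db2archive and /db2backups are assumed paths, and the archive files actually land in instance/database subdirectories under the archive path):

```shell
run() { echo "+ $*"; }   # dry-run stub
run db2 "GET DB CFG FOR DEVDB"                                          # a) LOGARCHMETH1=OFF means circular
run db2 "UPDATE DB CFG FOR DEVDB USING LOGARCHMETH1 DISK:/db2archive"   # b) switch to archive logging
run db2 "BACKUP DATABASE DEVDB TO /db2backups"                          # c) clears BACKUP PENDING
run db2 "GET DB CFG FOR DEVDB"                                          # d) confirm LOGARCHMETH1=DISK:/db2archive
run ls -R /db2archive                                                   # e) S*.LOG files appear after a log switch
```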
Exercise 20: z/OS Dual Active Log Setup
Given the following requirements:
- 6 active log data sets per copy
- 2 GB per data set
- Dual logging (COPY1 and COPY2)
- COPY1 on volumes VOL001-VOL006
- COPY2 on volumes VOL011-VOL016
Write the DSNJU003 control statements to define this configuration.
Exercise 21: Crash Recovery Analysis
After a power failure, DB2 for LUW restarts and performs crash recovery. The db2diag.log shows:
2024-03-15-14.30.00.000000 DB2 RESTART: Database MERIDIAN
Crash recovery starting
Log analysis: reading from LSN 0x000000003A500000 to 0x000000003A5F2000
Redo phase: 2,847 log records applied
Undo phase: 12 transactions rolled back
Crash recovery complete, elapsed time: 4.2 seconds
a) How many log records were in the recovery window?
b) How many transactions were in-flight at the time of the crash?
c) What determines the starting LSN for the log analysis?
d) Is 4.2 seconds a reasonable crash recovery time for an OLTP database? Why or why not?
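As a starting point, the byte span of the recovery window follows directly from the two LSNs in the log excerpt:

```shell
# Difference between the ending and starting LSNs of the log analysis phase
span=$(( 0x000000003A5F2000 - 0x000000003A500000 ))
echo "recovery window: ${span} bytes (~$(( span / 1024 )) KB of log)"
```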
Exercise 22: In-Doubt Transaction Resolution (LUW)
After a crash recovery, you discover that 2 transactions are in-doubt:
db2 list indoubt transactions
Transaction 1: XID=..., Status=PREPARED, Application=LOAN_PROCESSOR
Transaction 2: XID=..., Status=PREPARED, Application=PAYMENT_GATEWAY
a) What does PREPARED status mean in the context of two-phase commit?
b) What are the two options for resolving each transaction?
c) How do you determine the correct resolution?
d) What happens to locks held by in-doubt transactions?
e) Write the commands to commit transaction 1 and roll back transaction 2.
Exercise 23: Archive Log Gap Recovery
You discover that archive log file S0000125.LOG is missing from the archive directory. Your most recent backup was taken when the log position was in file S0000120.LOG.
a) Can you still recover the database to the current point? Why or why not?
b) Can you recover to any point within the range of S0000120 to S0000124?
c) Can you recover to any point after S0000125?
d) What should you do immediately to protect against further data loss?
e) How could this situation have been prevented?
Exercise 24: Backup Verification Script
Write a complete bash script that:
a) Takes an online backup of database MERIDIAN
b) Restores it to a temporary database called MERIDIAN_VERIFY
c) Rolls forward to end of logs
d) Runs a series of verification queries (row counts on key tables)
e) Compares the row counts with the production database
f) Logs the results to a file with timestamps
g) Drops the temporary database
h) Sends an alert if any verification fails
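A dry-run skeleton to build from: `run` echoes instead of executing, `log` prints timestamped lines (redirect its output to a file for part f), and the placeholder row counts stand in for real SELECT COUNT(*) results:

```shell
run() { echo "+ $*"; }   # dry-run stub
log() { printf '%s %s\n' "$(date '+%F %T')" "$*"; }
log "verification of MERIDIAN starting"
run db2 "BACKUP DATABASE MERIDIAN ONLINE TO /db2backups INCLUDE LOGS"       # a)
run db2 "RESTORE DATABASE MERIDIAN FROM /db2backups INTO MERIDIAN_VERIFY"   # b)
run db2 "ROLLFORWARD DATABASE MERIDIAN_VERIFY TO END OF LOGS AND COMPLETE"  # c)
prod_rows=1000; verify_rows=1000   # d/e) placeholders for counts from each database
if [ "$prod_rows" -ne "$verify_rows" ]; then
  log "ALERT: row counts differ (${prod_rows} vs ${verify_rows})"; exit 1   # h)
fi
run db2 "DROP DATABASE MERIDIAN_VERIFY"                                     # g)
log "verification OK"
```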
Exercise 25: Complete Recovery Runbook
Create a recovery runbook for Meridian Bank's online banking database (LUW) covering these scenarios:
a) Scenario 1: Database crashes and restarts automatically — document the verification steps
b) Scenario 2: A tablespace is corrupted and needs online recovery — document the restore and rollforward procedure
c) Scenario 3: Someone drops an important table — document the point-in-time recovery procedure
d) Scenario 4: The entire server is lost and you must recover to the DR site — document the HADR takeover procedure
For each scenario, include:
- Symptoms and detection
- Decision criteria
- Step-by-step commands
- Verification steps
- Estimated recovery time
- Escalation contacts
Section D: Advanced Exercises (Exercises 26-30)
Exercise 26: Multi-Partition Backup Strategy
A DB2 LUW database uses DPF (Database Partitioning Feature) with 8 partitions across 4 servers. Design a backup strategy that:
a) Minimizes the backup window
b) Ensures all partitions can be recovered consistently
c) Handles the failure of a single partition server
d) Includes archive log management across all partitions
Exercise 27: Recovery from Corrupt BSDS (z/OS)
Both copies of the BSDS are damaged. DB2 will not start.
a) What utility is used to create a new BSDS?
b) What information must you have to reconstruct the BSDS?
c) Write the DSNJU003 control statements to create a new BSDS from scratch
d) How would you determine the correct RBA ranges for the active log data sets?
Exercise 28: Automated Recovery Decision Tree
Design an automated monitoring and recovery system for Meridian Bank. Create a decision tree that maps each type of failure (crash, media failure, data corruption, log space exhaustion) to the appropriate automated or manual response. Include escalation thresholds and notification rules.
Exercise 29: Cost-Benefit Analysis of DR Options
Meridian Bank is comparing three DR options for the online banking system:
- Option A: HADR SYNC (RPO=0, RTO=2 min) — Cost: $500,000/year
- Option B: HADR ASYNC (RPO=30 sec, RTO=2 min) — Cost: $300,000/year
- Option C: Nightly backup to DR site (RPO=24 hours, RTO=6 hours) — Cost: $50,000/year
The system processes $10 million in transactions per day. A data loss event affects customer trust (estimated $2M impact). Downtime costs $5,000 per minute.
Calculate the annual risk exposure for each option and recommend the best choice.
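A parameterized sketch of the exposure calculation. Two loud assumptions, neither given in the exercise: one DR-invoking event per year, and any nonzero data loss incurs the full $2M trust impact.

```shell
# exposure <yearly_cost_usd> <rto_minutes> <data_loss_usd>
# annual exposure = option cost + downtime at $5,000/min + data loss (1 event/yr assumed)
exposure() {
  awk -v c="$1" -v rto="$2" -v loss="$3" 'BEGIN { printf "%.0f\n", c + rto*5000 + loss }'
}
a=$(exposure 500000 2 0)                                                      # A: SYNC, RPO=0
b=$(exposure 300000 2 $(awk 'BEGIN { printf "%.0f", 10e6/86400*30 + 2e6 }'))  # B: 30 s of txns + trust
c=$(exposure  50000 360 $(awk 'BEGIN { printf "%.0f", 10e6 + 2e6 }'))         # C: up to a full day + trust
echo "A: \$$a  B: \$$b  C: \$$c"
```

Under these assumptions Option A carries the lowest total exposure; the comparison is sensitive to the assumed event frequency, so vary that parameter in your answer.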
Exercise 30: Cross-Platform Recovery Design
Meridian Bank's core banking (z/OS) feeds data to the online banking (LUW) system through a nightly ETL process. If both systems need to be recovered to the same point in time:
a) What challenges arise from cross-platform recovery?
b) How do you synchronize recovery points between z/OS and LUW?
c) Design a procedure that ensures data consistency across both platforms after recovery
d) What role does the ETL process play in defining valid recovery points?
Answer Key Hints
- Exercise 4a: 200,000 / 10,000 = 20 seconds between checkpoints
- Exercise 9a: 3,000 × 4 × 600 = 7.2 MB/s
- Exercise 10a: 100 MB / 4 KB = 25,600 pages
- Exercise 12a: 500 GB / (4 × 3,600 seconds) = ~35 MB/s
- Exercise 13a: 100 GB / (200 MB/s) = 500 seconds ≈ 8.3 minutes
- Exercise 13b: 6 hours × 500 MB/hour = 3 GB of log data
- Exercise 14a: At least 1.4 MB/s = ~12 Mbps (plus TCP overhead, recommend 20 Mbps minimum)