Chapter 18 Exercises: Backup, Recovery, and Logging
These exercises range from conceptual understanding to hands-on recovery scenarios and sizing calculations. In a production environment, backup and recovery skills are not theoretical — they are tested under pressure. Work through every exercise carefully.
Section A: Conceptual Understanding (Exercises 1-8)
Exercise 1: Write-Ahead Logging Fundamentals
Explain in your own words why the log record must be written to stable storage before the corresponding data page change is written. Describe a specific failure scenario where violating this rule would lead to an unrecoverable database.
Hint: Consider what happens if the system crashes after writing the data page but before writing the log record.
Exercise 2: Redo vs. Undo
For each of the following crash scenarios, state whether DB2 needs to redo, undo, both, or neither for the described transaction:
a) Transaction T1 committed. Its data pages were flushed to disk. DB2 crashes.
b) Transaction T2 committed. Its data pages were NOT flushed to disk. DB2 crashes.
c) Transaction T3 did NOT commit. Its data pages WERE flushed to disk (by buffer pool page cleaning). DB2 crashes.
d) Transaction T4 did NOT commit. Its data pages were NOT flushed to disk. DB2 crashes.
For each case, describe the state of the data on disk before and after crash recovery.
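As a study aid, the net effect of each scenario can be sketched as a tiny shell function. The yes/no encoding is ours, and note that DB2's ARIES-style recovery actually redoes all logged work first ("repeat history") and only then undoes uncommitted transactions; this table shows just the net outcome on disk.

```shell
# recovery_action <committed:yes|no> <flushed:yes|no> -> net recovery effect
recovery_action() {
  case "$1,$2" in
    yes,yes) echo "none (change already durable)" ;;
    yes,no)  echo "redo (reapply from log)" ;;
    no,yes)  echo "undo (restore before-image from log)" ;;
    no,no)   echo "none (no change ever reached disk)" ;;
  esac
}

recovery_action yes no    # T2: committed but not flushed
recovery_action no  yes   # T3: uncommitted but flushed
```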
Exercise 3: RPO and RTO Analysis
Meridian National Bank is evaluating backup strategies for a new loan origination system. The business provides these requirements:
- The system processes 200 loan applications per day during business hours (8 AM-6 PM)
- Each loan application involves approximately 50 database operations
- A lost loan application requires 4 hours of manual re-entry work
- The cost of a lost application is estimated at $500 (staff time + customer frustration)
- System downtime costs $2,000 per hour in lost productivity
Calculate the appropriate RPO and RTO for this system. Justify your choices with cost analysis. What backup strategy (from the options in this chapter) would you recommend?
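A back-of-envelope costing from the figures above can anchor the analysis. The one-hour RPO used here is an illustrative assumption for scale, not the answer:

```shell
apps_per_hour=$(( 200 / 10 ))            # 200 applications over a 10-hour business day
rpo_1h_loss=$(( apps_per_hour * 500 ))   # value of one hour of lost applications, at $500 each
downtime_per_hour=2000
echo "data loss exposure: \$${rpo_1h_loss}/hour vs downtime cost: \$${downtime_per_hour}/hour"
```

Note that an hour of lost data costs five times more than an hour of downtime here, which should pull the recommendation toward a tight RPO.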
Exercise 4: Checkpoint Impact
A z/OS DB2 system has LOGLOAD set to 200,000. The average log record rate is 10,000 records per second.
a) How often does a checkpoint occur (in seconds)?
b) What is the approximate maximum crash recovery time, assuming 500 bytes per log record and 200 MB/s log read speed?
c) If the DBA reduces LOGLOAD to 50,000, how does this affect checkpoint frequency and crash recovery time?
d) What is the trade-off of a more frequent checkpoint?
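The arithmetic for parts a) and b) can be checked in a few lines of shell. Note that part b) covers only re-reading the log; applying the records adds time on top of this:

```shell
interval_s=$(( 200000 / 10000 ))         # a) LOGLOAD / records-per-second
redo_mb=$(( 200000 * 500 / 1000000 ))    # up to one LOGLOAD of records to re-read, in MB
read_s=$(awk -v mb="$redo_mb" 'BEGIN { printf "%.1f", mb/200 }')  # b) read time at 200 MB/s
echo "checkpoint interval ${interval_s}s, redo window ${redo_mb} MB, log read ~${read_s}s"
```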
Exercise 5: Circular vs. Archive Logging Decision
For each of the following databases, recommend circular or archive logging and explain your reasoning:
a) A production OLTP database serving an e-commerce website
b) A developer's local database used for unit testing
c) A staging database that mirrors production for QA testing
d) A read-only reporting database refreshed nightly from production
e) A temporary database created for a one-time data migration project
Exercise 6: Recovery Dependencies
The Meridian Bank core banking system on z/OS has the following tablespace structure:
CUSTDATA — Customer master data (PK: CUSTOMER_ID)
ACCTDATA — Account data (FK: CUSTOMER_ID → CUSTDATA)
TXNDATA — Transaction data (FK: ACCOUNT_ID → ACCTDATA)
LOANDATA — Loan data (FK: CUSTOMER_ID → CUSTDATA, FK: ACCOUNT_ID → ACCTDATA)
A DBA needs to perform point-in-time recovery on ACCTDATA to reverse a bad batch UPDATE. Explain:
a) Why recovering only ACCTDATA creates a referential integrity problem
b) Which other tablespaces might need to be recovered to the same point
c) What DB2 status flags will be set on the recovered tablespace and related objects
d) The steps to resolve these status flags after recovery
Exercise 7: Log Record Analysis
A log record for an UPDATE operation on z/OS contains these fields:
- RBA: X'00000003A5F20000'
- Transaction ID: X'0000001234'
- Record Type: UPDATE
- DBID: 312, OBID: 7
- Page: 4782, Offset: 200
- Before Image: X'4D657269....' (16 bytes, ASCII "Meridian Savings")
- After Image: X'4D657269....' (24 bytes, ASCII "Meridian Premium Savings")
a) If this transaction committed and then a crash occurred before the page was flushed, what does DB2 do during recovery?
b) If this transaction did NOT commit and the page WAS flushed before the crash, what does DB2 do during recovery?
c) How does DB2 locate the previous log record for this same transaction?
Exercise 8: BSDS Contents
You run DSNJU004 and see the following output (abbreviated):
ACTIVE LOG COPY 1:
DSNAME=DSNCAT.LOGCOPY1.DS001, STARTRBA=000000000000, ENDRBA=0000007FFFFF, STATUS=REUSABLE
DSNAME=DSNCAT.LOGCOPY1.DS002, STARTRBA=000000800000, ENDRBA=000000FFFFFF, STATUS=REUSABLE
DSNAME=DSNCAT.LOGCOPY1.DS003, STARTRBA=000001000000, ENDRBA=0000017FFFFF, STATUS=CURRENT
DSNAME=DSNCAT.LOGCOPY1.DS004, STARTRBA=000000000000, ENDRBA=000000000000, STATUS=AVAILABLE
ARCHIVE LOG:
DSNAME=ARCHLOG.D20240315.T010000, STARTRBA=000000000000, ENDRBA=0000007FFFFF
DSNAME=ARCHLOG.D20240315.T020000, STARTRBA=000000800000, ENDRBA=000000FFFFFF
a) How many active log data sets are defined?
b) Which one is currently being written to?
c) How many have been offloaded (archived)?
d) What does STATUS=AVAILABLE mean?
Section B: Log Sizing Calculations (Exercises 9-14)
Exercise 9: z/OS Active Log Sizing
An OLTP system on z/OS has the following workload characteristics:
- Peak transaction rate: 3,000 transactions/second
- Average log records per transaction: 4
- Average log record size: 600 bytes
- Archive log offload time: 45 seconds per data set
Calculate:
a) The peak log write rate in MB/second
b) The minimum size of each active log data set
c) A recommended configuration (number and size of data sets) with a 2x safety factor
d) The total active log capacity
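One way to run the numbers for parts a) through c), assuming the sizing rationale that each active log data set must hold at least the log generated during one 45-second offload (part d is then the chosen size times the number of data sets):

```shell
peak_mb_s=$(awk 'BEGIN { printf "%.1f", 3000*4*600/1e6 }')       # a) tps x records x bytes
min_ds_mb=$(awk 'BEGIN { printf "%.0f", (3000*4*600/1e6)*45 }')  # b) log written during one offload
safe_ds_mb=$(awk -v m="$min_ds_mb" 'BEGIN { printf "%.0f", m*2 }')  # c) 2x safety factor
echo "peak ${peak_mb_s} MB/s; minimum data set ${min_ds_mb} MB; sized data set ${safe_ds_mb} MB"
```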
Exercise 10: LUW Log File Sizing
A DB2 LUW database has the following characteristics:
- Average daily log generation: 20 GB
- Peak hour log generation: 5 GB
- Longest expected transaction duration: 30 minutes
- Log generation during a 30-minute peak period: 2.5 GB
Calculate:
a) Appropriate LOGFILSIZ (in 4 KB pages, targeting 100 MB per file)
b) Appropriate LOGPRIMARY to hold at least 2 hours of peak log data
c) Appropriate LOGSECOND to provide 1 additional hour of capacity
d) Total active log capacity in GB
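The arithmetic can be sketched as follows, assuming binary units (1 GB = 1024 MB) and rounding file counts up; a real configuration must also respect DB2's limits on LOGPRIMARY and LOGSECOND:

```shell
logfilsiz=$(( 100 * 1024 / 4 ))    # a) 100 MB file expressed in 4 KB pages
logprimary=$(awk 'BEGIN { printf "%d", (2*5*1024)/100 + 0.999 }')  # b) 2 h x 5 GB/h in 100 MB files
logsecond=$(awk 'BEGIN { printf "%d", (1*5*1024)/100 + 0.999 }')   # c) 1 more hour of capacity
total_gb=$(awk -v p="$logprimary" -v s="$logsecond" \
           'BEGIN { printf "%.1f", (p+s)*100/1024 }')              # d) total active log
echo "LOGFILSIZ=${logfilsiz} pages, LOGPRIMARY=${logprimary}, LOGSECOND=${logsecond}, total ${total_gb} GB"
```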
Exercise 11: Archive Log Disk Sizing
Using the same LUW database from Exercise 10:
- Archive log retention requirement: 14 days on disk
- Average daily log generation: 20 GB (weekdays), 5 GB (weekends)
Calculate:
a) Total archive log disk space needed for 14-day retention
b) Recommended filesystem size (with 25% headroom)
c) At what percentage of filesystem utilization should an alert fire?
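A sketch of the calculation. The 10-weekday/4-weekend split of the 14-day window is an assumption, and the 80% alert threshold is a common operational choice rather than anything DB2-specific:

```shell
archive_gb=$(( 10*20 + 4*5 ))   # a) 10 weekdays at 20 GB + 4 weekend days at 5 GB
fs_gb=$(awk -v a="$archive_gb" 'BEGIN { printf "%.0f", a*1.25 }')  # b) 25% headroom
alert_pct=80                    # c) assumed threshold; alert before headroom is consumed
echo "retention ${archive_gb} GB, filesystem ${fs_gb} GB, alert at ${alert_pct}%"
```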
Exercise 12: Backup Window Calculation
Meridian Bank's online banking database is 500 GB. The available backup window is 4 hours (02:00 AM to 06:00 AM). The backup target is a SAN-attached disk array.
a) What minimum sustained write throughput is needed for a full backup?
b) If the disk array delivers 150 MB/s sustained write throughput, can a full backup complete in the window?
c) With 3-way parallelism, what throughput is needed per stream?
d) If the daily change rate is 5% of the database, how large is an incremental backup? How long does it take at 150 MB/s?
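The numbers can be checked as follows, using decimal units (1 GB = 1000 MB) to match the answer-key hints:

```shell
min_mb_s=$(awk 'BEGIN { printf "%.1f", 500000/(4*3600) }')       # a) 500 GB over a 4-hour window
full_min=$(awk 'BEGIN { printf "%.0f", 500000/150/60 }')         # b) full backup at 150 MB/s
per_stream=$(awk 'BEGIN { printf "%.1f", 500000/(4*3600)/3 }')   # c) per stream, 3-way parallel
incr_min=$(awk 'BEGIN { printf "%.1f", (500000*0.05)/150/60 }')  # d) 5% change = 25 GB incremental
echo "need ${min_mb_s} MB/s; full ~${full_min} min (fits); ${per_stream} MB/s per stream; incremental ~${incr_min} min"
```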
Exercise 13: Recovery Time Estimation
A z/OS tablespace has the following characteristics:
- Size: 100 GB
- Most recent full image copy: 6 hours old
- Log generation rate for this tablespace: 500 MB/hour
- Image copy restore speed: 200 MB/s
- Log apply speed: 50 MB/s (log records must be read and applied selectively)
Estimate the total recovery time:
a) Time to restore the image copy
b) Volume of log data to apply
c) Time to apply the log data
d) Total estimated recovery time
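A sketch of the estimate, again using 1 GB = 1000 MB as in the hints; the total excludes utility start-up and data set allocation overhead:

```shell
restore_s=$(( 100000 / 200 ))        # a) 100 GB image copy at 200 MB/s
log_mb=$(( 6 * 500 ))                # b) 6 hours of log at 500 MB/hour
apply_s=$(( log_mb / 50 ))           # c) log apply at 50 MB/s
total_s=$(( restore_s + apply_s ))   # d) restore + apply
echo "restore ${restore_s}s + apply ${apply_s}s = ${total_s}s (~9.3 minutes)"
```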
Exercise 14: HADR Replication Bandwidth
Meridian Bank's HADR configuration sends log data from the primary to the standby:
- Average log generation: 20 GB/day = ~230 KB/s average
- Peak log generation: 5 GB/hour = ~1.4 MB/s
- Network latency to DR site: 5 ms round-trip
- HADR mode: NEARSYNC
a) What minimum network bandwidth is required for sustained replication during peak?
b) How much additional bandwidth should be provisioned for bursts (recommend 3x peak)?
c) In NEARSYNC mode, what is the approximate impact on COMMIT latency?
d) If the network goes down for 10 minutes during peak, how much log data accumulates in the HADR send buffer?
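The bandwidth arithmetic can be sketched as follows. For part c), NEARSYNC holds each COMMIT until the standby acknowledges receipt of the log into memory, so roughly one network round-trip is added:

```shell
peak_mbps=$(awk 'BEGIN { printf "%.1f", 1.4*8 }')      # a) 1.4 MB/s peak, in megabits/s
burst_mbps=$(awk 'BEGIN { printf "%.1f", 1.4*8*3 }')   # b) 3x provisioning for bursts
commit_add_ms=5                                        # c) ~1 RTT added per COMMIT in NEARSYNC
backlog_mb=$(awk 'BEGIN { printf "%.0f", 1.4*600 }')   # d) 10 minutes at peak rate
echo "peak ${peak_mbps} Mbps; provision ${burst_mbps} Mbps; ~${commit_add_ms} ms per COMMIT; backlog ${backlog_mb} MB"
```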
Section C: Hands-On Recovery Scenarios (Exercises 15-25)
Exercise 15: LUW Full Database Recovery
Write the complete sequence of commands to:
a) Take a full online backup of database MERIDIAN to /db2backups/
b) Simulate a failure by dropping the database
c) Restore the database from the backup
d) Roll forward to end of logs and complete
e) Verify the recovery by querying a known table
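A dry-run skeleton of one possible answer, with `run` echoing each command instead of executing it (replace the echo with real execution on a system that has DB2; the CUSTOMER table in step e is an assumed example):

```shell
run() { echo "+ $*"; }   # dry-run stub
run db2 "BACKUP DATABASE MERIDIAN ONLINE TO /db2backups INCLUDE LOGS"   # a)
run db2 "DROP DATABASE MERIDIAN"                                        # b) simulate the failure
run db2 "RESTORE DATABASE MERIDIAN FROM /db2backups"                    # c)
run db2 "ROLLFORWARD DATABASE MERIDIAN TO END OF LOGS AND COMPLETE"     # d)
run db2 "CONNECT TO MERIDIAN"                                           # e) verify with a known table
run db2 "SELECT COUNT(*) FROM MERIDIAN.CUSTOMER"
```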
Exercise 16: LUW Point-in-Time Recovery
A developer accidentally ran DELETE FROM MERIDIAN.TRANSACTION WHERE TXN_DATE < '2024-01-01' at 14:47:00 today. The correct command should have included AND STATUS = 'CANCELLED'.
Write the complete sequence of commands to recover the TRANSACTION table data to its state at 14:46:00.
Include:
a) How to determine the exact time of the bad DELETE
b) The RESTORE command
c) The ROLLFORWARD command with the correct timestamp
d) How to verify the recovery
Exercise 17: z/OS Tablespace Recovery
Write the JCL to:
a) Take a full image copy of tablespace MERIDIAN.TXNDATA with SHRLEVEL CHANGE
b) Take an incremental image copy of the same tablespace
c) Recover the tablespace to current (using the most recent image copy + logs)
d) Recover the tablespace to a specific RBA
Exercise 18: LUW Redirected Restore
You need to clone the production MERIDIAN database to a test environment with different storage paths:
- Production data path: /db2data/meridian/
- Test data path: /db2test/meridian_clone/
- Production log path: /db2logs/meridian/
- Test log path: /db2test/meridian_clone_logs/
Write the complete redirected restore procedure including:
a) Taking the backup from production
b) Generating the redirect script
c) Editing the script (describe what changes to make)
d) Executing the redirected restore
e) Rolling forward and completing
Exercise 19: Enable Archive Logging on LUW
A database named DEVDB currently uses circular logging. Write the complete procedure to convert it to archive logging, including:
a) Checking the current logging mode
b) Changing the database configuration
c) Taking the required full backup
d) Verifying archive logging is active
e) Testing that archive logs are being created
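A dry-run sketch of the procedure (commands are echoed, not executed; /db2archive and /db2backups are assumed paths, and the archive files actually land in instance/database subdirectories under the archive path):

```shell
run() { echo "+ $*"; }   # dry-run stub
run db2 "GET DB CFG FOR DEVDB"                                          # a) LOGARCHMETH1=OFF means circular
run db2 "UPDATE DB CFG FOR DEVDB USING LOGARCHMETH1 DISK:/db2archive"   # b) switch to archive logging
run db2 "BACKUP DATABASE DEVDB TO /db2backups"                          # c) clears BACKUP PENDING
run db2 "GET DB CFG FOR DEVDB"                                          # d) confirm LOGARCHMETH1=DISK:/db2archive
run ls -R /db2archive                                                   # e) S*.LOG files appear after a log switch
```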
Exercise 20: z/OS Dual Active Log Setup
Given the following requirements:
- 6 active log data sets per copy
- 2 GB per data set
- Dual logging (COPY1 and COPY2)
- COPY1 on volumes VOL001-VOL006
- COPY2 on volumes VOL011-VOL016
Write the DSNJU003 control statements to define this configuration.
Exercise 21: Crash Recovery Analysis
After a power failure, DB2 for LUW restarts and performs crash recovery. The db2diag.log shows:
2024-03-15-14.30.00.000000 DB2 RESTART: Database MERIDIAN
Crash recovery starting
Log analysis: reading from LSN 0x000000003A500000 to 0x000000003A5F2000
Redo phase: 2,847 log records applied
Undo phase: 12 transactions rolled back
Crash recovery complete, elapsed time: 4.2 seconds
a) How many log records were in the recovery window?
b) How many transactions were in-flight at the time of the crash?
c) What determines the starting LSN for the log analysis?
d) Is 4.2 seconds a reasonable crash recovery time for an OLTP database? Why or why not?
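As a starting point, the byte span of the recovery window follows directly from the two LSNs in the log excerpt:

```shell
# Difference between the ending and starting LSNs of the log analysis phase
span=$(( 0x000000003A5F2000 - 0x000000003A500000 ))
echo "recovery window: ${span} bytes (~$(( span / 1024 )) KB of log)"
```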
Exercise 22: In-Doubt Transaction Resolution (LUW)
After a crash recovery, you discover that 2 transactions are in-doubt:
db2 list indoubt transactions
Transaction 1: XID=..., Status=PREPARED, Application=LOAN_PROCESSOR
Transaction 2: XID=..., Status=PREPARED, Application=PAYMENT_GATEWAY
a) What does PREPARED status mean in the context of two-phase commit?
b) What are the two options for resolving each transaction?
c) How do you determine the correct resolution?
d) What happens to locks held by in-doubt transactions?
e) Write the commands to commit transaction 1 and roll back transaction 2.
Exercise 23: Archive Log Gap Recovery
You discover that archive log file S0000125.LOG is missing from the archive directory. Your most recent backup was taken when the log position was in file S0000120.LOG.
a) Can you still recover the database to the current point? Why or why not?
b) Can you recover to any point within the range of S0000120 to S0000124?
c) Can you recover to any point after S0000125?
d) What should you do immediately to protect against further data loss?
e) How could this situation have been prevented?
Exercise 24: Backup Verification Script
Write a complete bash script that:
a) Takes an online backup of database MERIDIAN
b) Restores it to a temporary database called MERIDIAN_VERIFY
c) Rolls forward to end of logs
d) Runs a series of verification queries (row counts on key tables)
e) Compares the row counts with the production database
f) Logs the results to a file with timestamps
g) Drops the temporary database
h) Sends an alert if any verification fails
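A dry-run skeleton to build from: `run` echoes instead of executing, `log` prints timestamped lines (redirect its output to a file for part f), and the placeholder row counts stand in for real SELECT COUNT(*) results:

```shell
run() { echo "+ $*"; }   # dry-run stub
log() { printf '%s %s\n' "$(date '+%F %T')" "$*"; }
log "verification of MERIDIAN starting"
run db2 "BACKUP DATABASE MERIDIAN ONLINE TO /db2backups INCLUDE LOGS"       # a)
run db2 "RESTORE DATABASE MERIDIAN FROM /db2backups INTO MERIDIAN_VERIFY"   # b)
run db2 "ROLLFORWARD DATABASE MERIDIAN_VERIFY TO END OF LOGS AND COMPLETE"  # c)
prod_rows=1000; verify_rows=1000   # d/e) placeholders for counts from each database
if [ "$prod_rows" -ne "$verify_rows" ]; then
  log "ALERT: row counts differ (${prod_rows} vs ${verify_rows})"; exit 1   # h)
fi
run db2 "DROP DATABASE MERIDIAN_VERIFY"                                     # g)
log "verification OK"
```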
Exercise 25: Complete Recovery Runbook
Create a recovery runbook for Meridian Bank's online banking database (LUW) covering these scenarios:
a) Scenario 1: Database crashes and restarts automatically — document the verification steps
b) Scenario 2: A tablespace is corrupted and needs online recovery — document the restore and rollforward procedure
c) Scenario 3: Someone drops an important table — document the point-in-time recovery procedure
d) Scenario 4: The entire server is lost and you must recover to the DR site — document the HADR takeover procedure
For each scenario, include:
- Symptoms and detection
- Decision criteria
- Step-by-step commands
- Verification steps
- Estimated recovery time
- Escalation contacts
Section D: Advanced Exercises (Exercises 26-30)
Exercise 26: Multi-Partition Backup Strategy
A DB2 LUW database uses DPF (Database Partitioning Feature) with 8 partitions across 4 servers. Design a backup strategy that:
a) Minimizes the backup window
b) Ensures all partitions can be recovered consistently
c) Handles the failure of a single partition server
d) Includes archive log management across all partitions
Exercise 27: Recovery from Corrupt BSDS (z/OS)
Both copies of the BSDS are damaged. DB2 will not start.
a) What utility is used to create a new BSDS?
b) What information must you have to reconstruct the BSDS?
c) Write the DSNJU003 control statements to create a new BSDS from scratch
d) How would you determine the correct RBA ranges for the active log data sets?
Exercise 28: Automated Recovery Decision Tree
Design an automated monitoring and recovery system for Meridian Bank. Create a decision tree that maps each type of failure (crash, media failure, data corruption, log space exhaustion) to the appropriate automated or manual response. Include escalation thresholds and notification rules.
Exercise 29: Cost-Benefit Analysis of DR Options
Meridian Bank is comparing three DR options for the online banking system:
- Option A: HADR SYNC (RPO=0, RTO=2 min) — Cost: $500,000/year
- Option B: HADR ASYNC (RPO=30 sec, RTO=2 min) — Cost: $300,000/year
- Option C: Nightly backup to DR site (RPO=24 hours, RTO=6 hours) — Cost: $50,000/year
The system processes $10 million in transactions per day. A data loss event affects customer trust (estimated $2M impact). Downtime costs $5,000 per minute.
Calculate the annual risk exposure for each option and recommend the best choice.
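A parameterized sketch of the exposure calculation. Two loud assumptions, neither given in the exercise: one DR-invoking event per year, and any nonzero data loss incurs the full $2M trust impact.

```shell
# exposure <yearly_cost_usd> <rto_minutes> <data_loss_usd>
# annual exposure = option cost + downtime at $5,000/min + data loss (1 event/yr assumed)
exposure() {
  awk -v c="$1" -v rto="$2" -v loss="$3" 'BEGIN { printf "%.0f\n", c + rto*5000 + loss }'
}
a=$(exposure 500000 2 0)                                                      # A: SYNC, RPO=0
b=$(exposure 300000 2 $(awk 'BEGIN { printf "%.0f", 10e6/86400*30 + 2e6 }'))  # B: 30 s of txns + trust
c=$(exposure  50000 360 $(awk 'BEGIN { printf "%.0f", 10e6 + 2e6 }'))         # C: up to a full day + trust
echo "A: \$$a  B: \$$b  C: \$$c"
```

Under these assumptions Option A carries the lowest total exposure; the comparison is sensitive to the assumed event frequency, so vary that parameter in your answer.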
Exercise 30: Cross-Platform Recovery Design
Meridian Bank's core banking (z/OS) feeds data to the online banking (LUW) system through a nightly ETL process. If both systems need to be recovered to the same point in time:
a) What challenges arise from cross-platform recovery?
b) How do you synchronize recovery points between z/OS and LUW?
c) Design a procedure that ensures data consistency across both platforms after recovery
d) What role does the ETL process play in defining valid recovery points?
Answer Key Hints
- Exercise 4a: 200,000 / 10,000 = 20 seconds between checkpoints
- Exercise 9a: 3,000 × 4 × 600 = 7.2 MB/s
- Exercise 10a: 100 MB / 4 KB = 25,600 pages
- Exercise 12a: 500 GB / (4 × 3,600 seconds) = ~35 MB/s
- Exercise 13a: 100 GB / (200 MB/s) = 500 seconds ≈ 8.3 minutes
- Exercise 13b: 6 hours × 500 MB/hour = 3 GB of log data
- Exercise 14a: At least 1.4 MB/s = ~12 Mbps (plus TCP overhead, recommend 20 Mbps minimum)