Chapter 20 Exercises: Monitoring and Problem Determination
These exercises progress from foundational monitoring concepts through hands-on diagnostic scenarios. For exercises requiring a running DB2 instance, adapt table and database names to your environment. Where noted, exercises reference the Meridian National Bank project.
Section A: Monitoring Concepts and Baselines (Exercises 1-7)
Exercise 1: Monitoring Pillar Classification
Classify each of the following monitoring concerns into one of the five monitoring pillars (Availability, Performance, Capacity, Concurrency, Integrity):
a) The buffer pool hit ratio dropped from 98% to 91% during the afternoon batch run. b) Three applications are waiting for locks held by a long-running transaction. c) The transaction log is 82% full. d) The last successful backup was 36 hours ago. e) The database is not accepting new connections. f) Average query response time increased from 50ms to 340ms. g) A table space is 94% full. h) A deadlock occurred between the account transfer and statement generation programs. i) The COPY PENDING flag is set on the ACCOUNTS table space. j) The REORG utility has been running for 6 hours against an expected 2-hour window.
Expected deliverable: A table mapping each item (a-j) to exactly one pillar, with a one-sentence justification for ambiguous cases.
Exercise 2: Baseline Establishment Plan
You are the new DBA at Meridian National Bank. Design a baseline establishment plan that includes:
a) Which metrics you will collect (list at least 12 specific metrics) b) The collection interval for each metric (justify your choices) c) The time periods you will segment (minimum 4 segments) d) How long you will collect data before declaring the baseline valid e) Where you will store the baseline data f) How often you will refresh the baseline
Expected deliverable: A written plan document of at least 500 words.
Exercise 3: Reactive vs. Proactive Scenario Analysis
For each scenario, describe (a) how it would play out under reactive monitoring and (b) how it would play out under proactive monitoring:
- Buffer pool BP0 hit ratio degrades from 99% to 94% over three hours during a normal business day.
- A developer accidentally deploys code that opens database connections but never closes them.
- The nightly REORG utility fails silently and the table space enters CHECK PENDING status.
- A marketing campaign causes a 300% increase in online banking queries.
Expected deliverable: For each scenario, two paragraphs (reactive and proactive), totaling approximately 200 words per scenario.
Exercise 4: Monitoring Overhead Analysis
For each monitoring tool or technique, rate the overhead on a scale of 1 (negligible) to 5 (significant), and explain your reasoning:
a) db2pd -db MERIDIAN -bufferpools
b) Statistics trace on z/OS with DEST(SMF)
c) Performance trace on z/OS with CLASS(1,2,3) filtering by PLAN
d) Performance trace on z/OS with CLASS(1,2,3,4,5,6,7,8) with no filtering
e) Statement event monitor writing to table on LUW
f) MON_GET_PKG_CACHE_STMT executed every 60 seconds
g) GET SNAPSHOT FOR DATABASE every 10 seconds
h) DISPLAY BUFFERPOOL on z/OS every 5 minutes
Expected deliverable: A table with overhead rating and a 1-2 sentence explanation for each.
Exercise 5: Monitoring Architecture Diagram
Draw (or describe in text) a monitoring architecture for Meridian National Bank that includes:
- Both z/OS and LUW DB2 instances
- Data collection agents/mechanisms for each platform
- A central monitoring repository
- A dashboard display layer
- An alerting subsystem with notification channels
- Data flow arrows showing what data moves where
Expected deliverable: A diagram or detailed textual description identifying at least 8 distinct components and their interconnections.
Exercise 6: Alert Threshold Design
Design a complete alert threshold configuration for a new Meridian Bank LUW database. For each metric below, specify: - Warning threshold value - Critical threshold value - Number of consecutive intervals before alerting - Alert cooldown period - Notification channel (email, SMS, dashboard, ticket) - Suggested first response action
Metrics: a) Buffer pool hit ratio b) Log utilization percentage c) Lock wait time (average per transaction) d) Sort overflow percentage e) Table space utilization f) Active connections as percentage of maximum g) Deadlock count per hour h) Hours since last successful backup
Expected deliverable: A table with all six columns filled for each of the 8 metrics.
Exercise 7: Monitoring Tool Selection
Your manager asks you to recommend a monitoring strategy for Meridian Bank's z/OS DB2 environment. Compare the following approaches and recommend one, with justification:
a) DISPLAY commands only, run manually by the on-call DBA b) DISPLAY commands automated via a job scheduler, with output parsed by custom scripts c) IBM OMEGAMON for DB2 d) BMC MainView for DB2
Your recommendation should consider: cost, staffing requirements, coverage (24x7 vs. business hours), alerting capabilities, historical data, and learning curve.
Expected deliverable: A 400-word recommendation memo.
Section B: z/OS Monitoring (Exercises 8-14)
Exercise 8: DISPLAY DATABASE Interpretation
Given the following DISPLAY DATABASE output, identify all objects that require attention and specify the corrective action for each:
NAME TYPE PART STATUS
-------- ---- ---- ---------------
ACCTTS TS RW
ACCTIX1 IX RW
CUSTTS TS RECP
CUSTIX1 IX RBDP
TRANSTS TS RW,COPY
TRANSIX1 IX RW
LOANSTS TS UT
LOANIX1 IX STOP
AUDITTS TS RO
AUDITIX1 IX RW,LPL
Expected deliverable: A table listing each problem object, its status, severity (high/medium/low), and the specific corrective command or utility.
Exercise 9: DISPLAY THREAD Analysis
Given the following thread display, answer the questions below:
NAME ST A REQ-CT AUTH-ID PLAN ASID TOKEN
-------- -- - ---------- -------- -------- ---- -----
DISTSERV TR * 145 WEBUSER BNKPLAN 003A 22451
DISTSERV TR * 23892 WEBUSER BNKPLAN 003A 22452
BATCH1 TR N 8945231 BATCHID BATPLAN 004B 22390
TSO TR * 12 DBADM01 DSNTEP71 0022 22467
DISTSERV QW N 0 WEBUSER BNKPLAN 003A 22470
DISTSERV QW N 0 WEBUSER BNKPLAN 003A 22471
a) Which thread is likely a batch job, and why? b) Which threads are queued and waiting? What might cause this? c) Thread TOKEN 22452 has made 23,892 requests. Is this concerning? Under what circumstances? d) What command would you issue to get more detail about thread TOKEN 22390?
Exercise 10: Buffer Pool Hit Ratio Calculation
Given the following DISPLAY BUFFERPOOL output, calculate the hit ratio and determine if action is needed:
BUFFERPOOL NAME BP0, BUFFERPOOL ID 0
VIRTUAL BUFFER POOL SIZE = 25000
GETPAGES = 5000000
SYNC READ I/O (RANDOM) = 375000
SYNC READ I/O (SEQ) = 50000
SEQUENTIAL PREFETCH REQUESTS = 100000
SEQUENTIAL PREFETCH PAGES = 800000
a) Calculate the buffer pool hit ratio. b) Is this ratio acceptable for an OLTP workload? For a batch workload? c) What percentage of I/O is random vs. sequential? d) Is prefetch working effectively? How can you tell? e) If you wanted to improve this hit ratio to 97%, approximately how many additional buffer pages would you need? (Show your reasoning.)
Exercise 11: Trace Configuration
Write the z/OS commands to accomplish each task:
a) Start a statistics trace sending data to SMF. b) Start an accounting trace for CLASS 1, 2, and 3 going to SMF. c) Start a performance trace for PLAN BNKPLAN only, limited to 10 minutes, going to GTF. d) Display all active traces. e) Stop trace number 7. f) Start an audit trace to monitor all authorization failures.
Exercise 12: SMF 101 Accounting Analysis
Given the following simplified accounting report for two plans, identify performance issues and recommend corrective actions:
Plan BNKPLAN (Online Banking):
Threads: 45,231 Avg Class 1 Elapsed: 0.235 sec
Avg Class 2 Elapsed: 0.189 sec Avg Class 2 CPU: 0.012 sec
Avg Lock Wait: 0.098 sec Avg Sync I/O: 0.034 sec
Avg Getpages: 234 Avg Sync Reads: 45
Avg Commits: 1.2 Sort Overflows: 0
Plan BATPLAN (Nightly Batch):
Threads: 12 Avg Class 1 Elapsed: 3456 sec
Avg Class 2 Elapsed: 2890 sec Avg Class 2 CPU: 145 sec
Avg Lock Wait: 890 sec Avg Sync I/O: 1234 sec
Avg Getpages: 45,678,901 Avg Sync Reads: 12,345,678
Avg Commits: 1 Sort Overflows: 234
a) For BNKPLAN: What is the biggest contributor to elapsed time? Is this normal? b) For BNKPLAN: Calculate the buffer pool hit ratio for this plan. c) For BATPLAN: What is the most concerning metric? Why? d) For BATPLAN: Why is only 1 commit recorded per thread? e) For BATPLAN: What actions would you take to improve performance?
Exercise 13: DISPLAY LOG Crisis Response
You receive this DISPLAY LOG output at 2:47 AM:
NUMBER OF ACTIVE LOG DATA SETS = 6
FULL - ACTIVE LOG DATA SETS = 5
a) How severe is this situation? b) What is your immediate first action? c) Write the command to identify which thread is preventing log offloading. d) What are your options if the blocking thread is a critical batch job that cannot be terminated? e) How would you prevent this situation from recurring?
Exercise 14: IFCID Analysis Plan
You need to investigate why the online banking plan (BNKPLAN) is experiencing intermittent response time spikes of 2-3 seconds (normal is < 250ms). Design a trace-based investigation plan:
a) Which trace type(s) will you start? b) Which IFCIDs are most relevant? c) How will you filter the trace to minimize overhead? d) How long will you run the trace? e) What will you look for in the trace output? f) What is your backup plan if the trace itself causes performance degradation?
Section C: LUW Monitoring (Exercises 15-22)
Exercise 15: db2pd Interpretation
Given the following db2pd -bufferpools output, analyze the buffer pool health:
Bufferpool Name: MERIDIAN_BP_DATA Id: 4 PageSize: 8192
nPages nPagesUsed nPagesClean nPagesDirty HitRatio
50000 49987 12345 37642 96.21
Data Logical Data Physical Index Logical Index Physical
2345678 89012 1234567 234
Async Data Reads Async Index Reads Async Data Writes Async Index Writes
456789 123456 234567 123456
Direct Reads Direct Writes Prefetch Wait Time
12345 67890 2.456
Bufferpool Name: MERIDIAN_BP_IDX Id: 5 PageSize: 8192
nPages nPagesUsed nPagesClean nPagesDirty HitRatio
30000 29998 28765 1233 99.98
Data Logical Data Physical Index Logical Index Physical
0 0 8901234 1234
Async Data Reads Async Index Reads Async Data Writes Async Index Writes
0 567890 0 12345
Direct Reads Direct Writes Prefetch Wait Time
0 0 0.000
a) Which buffer pool has a potential problem? What specifically concerns you? b) What does the nPagesDirty value for MERIDIAN_BP_DATA tell you? c) What does the Prefetch Wait Time indicate? d) The Direct Reads/Writes for MERIDIAN_BP_DATA are significant. What could cause this? e) What specific actions would you recommend for each buffer pool?
Exercise 16: MON_GET Query Writing
Write MON_GET queries to answer each of the following questions about the MERIDIAN database:
a) What is the current buffer pool hit ratio for each buffer pool, including the number of logical and physical reads? b) Which 5 tables have the highest number of rows read? c) Which applications are currently waiting for locks, and how long have they been waiting? d) What is the current sort overflow percentage? e) What is the package cache hit ratio? f) Which 10 SQL statements have the worst ratio of rows read to rows returned?
Exercise 17: Event Monitor Creation
Write the complete SQL to create the following event monitors for the MERIDIAN database:
a) A locking event monitor that captures deadlocks and lock timeouts to tables in the MONITORING schema. b) An activity event monitor that captures SQL statements taking longer than 10 seconds, with full statement text and input values. c) A unit of work event monitor to capture transactions exceeding 60 seconds duration.
Include the CREATE EVENT MONITOR statements, any required thresholds, SET EVENT MONITOR STATE commands, and a query to retrieve data from each monitor.
Exercise 18: db2diag.log Analysis
Given the following db2diag.log entries, determine the root cause and corrective action:
Entry 1:
2025-03-15-09.15.23.456789-300 LEVEL: Error
PID: 12345 TID: 140234567890 PROC: db2sysc 0
INSTANCE: db2inst1 NODE: 000 DB: MERIDIAN
FUNCTION: DB2 UDB, lock manager, sqlplnfd, probe:3
MESSAGE: ADM0504C The transaction log for the database is full.
Entry 2 (45 seconds later):
2025-03-15-09.16.08.123456-300 LEVEL: Error
PID: 12345 TID: 140234567891 PROC: db2sysc 0
INSTANCE: db2inst1 NODE: 000 DB: MERIDIAN
FUNCTION: DB2 UDB, data protection, sqlpgResSpace, probe:2
MESSAGE: ADM1823E The active log is full and is not being drained because
the oldest active unit of work is held by application handle "1234".
a) What is the root cause of this problem? b) What is the relationship between Entry 1 and Entry 2? c) What immediate action should you take? d) What information do you need to gather before taking that action? e) What preventive measures would you implement?
Exercise 19: db2diag Command Filtering
Write the db2diag filter command for each scenario:
a) Show all errors for database MERIDIAN in the last 4 hours. b) Show all entries at Warning level or above from yesterday between 2 PM and 4 PM. c) Find all entries containing message ID ADM1823E. d) Show all entries for application handle 1234. e) Show all Severe and Critical entries from the last 7 days.
Exercise 20: Snapshot vs. MON_GET Comparison
For each of the following information needs, write both the GET SNAPSHOT command and the equivalent MON_GET query. Then explain which approach you would recommend and why.
a) Current buffer pool hit ratio for all buffer pools b) Lock information for all active applications c) Total sorts and sort overflows for the database d) Number of active connections
Exercise 21: db2pd Continuous Monitoring Setup
Design a db2pd-based continuous monitoring setup for the MERIDIAN database that:
a) Captures buffer pool statistics every 60 seconds to a daily file b) Captures lock information every 30 seconds during business hours (6 AM - 10 PM) c) Captures transaction information every 120 seconds d) Automatically rotates output files daily e) Does not run during the backup window (2 AM - 4 AM)
Write the shell script (Linux/Unix) or scheduling commands needed. Explain how you would parse the output files for alerting purposes.
Exercise 22: FFDC and db2support
You need to open a support case with IBM for a recurring crash in the MERIDIAN database. Describe:
a) What is FFDC and how does it help in this situation? b) What files would you look for in the db2dump directory? c) Write the db2support command to collect diagnostic data. d) What additional information should you include with your support case? e) What DIAGLEVEL should you set and why?
Section D: Cross-Platform and Meridian Bank Scenarios (Exercises 23-30)
Exercise 23: Key Metric Quick Reference
Create a quick reference card (table format) for the 10 most important DB2 monitoring metrics. For each metric, include: - Metric name - z/OS command/source to obtain it - LUW command/source to obtain it - Healthy range - First action when threshold is breached
Exercise 24: Meridian Daily Health Check Enhancement
The daily health check script in Section 20.12 covers 8 areas. Enhance it by adding three additional checks:
a) HADR status check (if applicable) b) Automatic storage utilization trend (compare today vs. 7 days ago) c) Application connection duration — flag any connections active for more than 4 hours
Write the SQL for each additional check and explain how you would integrate it into the existing health check.
Exercise 25: Production Incident Response Playbook
Create a monitoring-driven incident response playbook for Meridian National Bank for the following scenario:
Alert received: Buffer pool hit ratio for MERIDIAN_BP_DATA dropped to 88% (critical threshold: 93%).
Write the step-by-step investigation procedure, including: a) The first 5 commands you would run (in order, with rationale) b) Decision tree: If the cause is X, do Y; if the cause is Z, do W c) Escalation criteria: when to involve application team, management, IBM support d) Documentation requirements: what to record in the incident log
Exercise 26: Monitoring Data Retention Policy
Design a monitoring data retention policy for Meridian Bank that balances storage cost against analytical needs:
a) Real-time data (db2pd output, current snapshots): retention period and storage format b) Short-term historical (event monitors, hourly summaries): retention period and storage c) Medium-term historical (daily summaries, trend data): retention period and storage d) Long-term historical (monthly summaries, capacity planning): retention period and storage e) Estimate the storage requirements for each tier (state your assumptions) f) Design the purge/archival process
Exercise 27: Lock Contention Deep Dive
A teller at Meridian Bank reports that account transfers are timing out during the afternoon. You suspect lock contention. Write the complete investigation procedure for:
a) z/OS: Which DISPLAY commands and traces would you use? Write each command. b) LUW: Which db2pd options and MON_GET queries would you use? Write each command/query. c) For both platforms: How would you identify the blocking application? d) For both platforms: How would you determine what SQL the blocker is running? e) What are your resolution options that do not involve terminating the blocking application?
Exercise 28: Monitoring During Planned Maintenance
Meridian Bank is performing a planned database migration over a weekend. Design the monitoring plan for:
a) Pre-migration: What baselines do you capture? What health checks do you run? b) During migration: What do you monitor in real-time? How frequently? c) Post-migration: What metrics do you compare? What is your "all clear" criteria? d) Rollback triggers: What monitoring thresholds would trigger a rollback decision?
Exercise 29: Performance Trend Analysis
You have 90 days of historical monitoring data for Meridian Bank. Write the SQL queries (LUW) to:
a) Calculate the average buffer pool hit ratio by day of week and hour of day b) Identify the top 5 days with the worst overall performance (define your composite metric) c) Detect queries whose average execution time has increased by more than 50% in the last 7 days compared to the previous 30-day average d) Project when the primary data table space will reach 90% utilization based on the growth trend
Assume the historical data is stored in tables: MERIDIAN.PERF_DAILY_SUMMARY, MERIDIAN.QUERY_DAILY_SUMMARY, and MERIDIAN.TABLESPACE_DAILY_SUMMARY with reasonable column names.
Exercise 30: Comprehensive Monitoring Implementation
This is a capstone exercise. Design and document a complete monitoring implementation for a new Meridian Bank DB2 LUW production database. Your deliverable should include:
a) A list of all monitoring objects to create (event monitors, thresholds, history tables) b) The SQL to create each object c) A collection schedule (what runs when) d) Alert definitions with thresholds, notification channels, and response procedures e) A daily health check script f) A weekly trending report query g) A monthly capacity planning query h) Housekeeping procedures (purge old data, maintain monitoring tables) i) Documentation for the on-call DBA team
Expected deliverable: A complete monitoring implementation plan of approximately 2,000 words plus all SQL code.
Answer Key Guidance
Exercises 1-7 have definitive answers and should be graded against the chapter content. Exercises 8-14 (z/OS) and 15-22 (LUW) have specific technical answers that can be verified against DB2 documentation. Exercises 23-30 are design exercises with multiple valid approaches; grade on completeness, technical accuracy, and practicality.
For self-study: Complete Exercises 1-6, 8, 10, 15, 16, and 23 before moving to Chapter 21. Return to the remaining exercises after completing Part 4.