Chapter 20 Exercises: Monitoring and Problem Determination

Chapter 20 Exercises: Monitoring and Problem Determination

These exercises progress from foundational monitoring concepts through hands-on diagnostic scenarios. For exercises requiring a running DB2 instance, adapt table and database names to your environment. Where noted, exercises reference the Meridian National Bank project.

Section A: Monitoring Concepts and Baselines (Exercises 1-7)

Exercise 1: Monitoring Pillar Classification

Classify each of the following monitoring concerns into one of the five monitoring pillars (Availability, Performance, Capacity, Concurrency, Integrity):

a) The buffer pool hit ratio dropped from 98% to 91% during the afternoon batch run. b) Three applications are waiting for locks held by a long-running transaction. c) The transaction log is 82% full. d) The last successful backup was 36 hours ago. e) The database is not accepting new connections. f) Average query response time increased from 50ms to 340ms. g) A table space is 94% full. h) A deadlock occurred between the account transfer and statement generation programs. i) The COPY PENDING flag is set on the ACCOUNTS table space. j) The REORG utility has been running for 6 hours against an expected 2-hour window.

Expected deliverable: A table mapping each item (a-j) to exactly one pillar, with a one-sentence justification for ambiguous cases.

Exercise 2: Baseline Establishment Plan

You are the new DBA at Meridian National Bank. Design a baseline establishment plan that includes:

a) Which metrics you will collect (list at least 12 specific metrics) b) The collection interval for each metric (justify your choices) c) The time periods you will segment (minimum 4 segments) d) How long you will collect data before declaring the baseline valid e) Where you will store the baseline data f) How often you will refresh the baseline

Expected deliverable: A written plan document of at least 500 words.

Exercise 3: Reactive vs. Proactive Scenario Analysis

For each scenario, describe (a) how it would play out under reactive monitoring and (b) how it would play out under proactive monitoring:

Buffer pool BP0 hit ratio degrades from 99% to 94% over three hours during a normal business day.
A developer accidentally deploys code that opens database connections but never closes them.
The nightly REORG utility fails silently and the table space enters CHECK PENDING status.
A marketing campaign causes a 300% increase in online banking queries.

Expected deliverable: For each scenario, two paragraphs (reactive and proactive), totaling approximately 200 words per scenario.

Exercise 4: Monitoring Overhead Analysis

For each monitoring tool or technique, rate the overhead on a scale of 1 (negligible) to 5 (significant), and explain your reasoning:

a) db2pd -db MERIDIAN -bufferpools b) Statistics trace on z/OS with DEST(SMF) c) Performance trace on z/OS with CLASS(1,2,3) filtering by PLAN d) Performance trace on z/OS with CLASS(1,2,3,4,5,6,7,8) with no filtering e) Statement event monitor writing to table on LUW f) MON_GET_PKG_CACHE_STMT executed every 60 seconds g) GET SNAPSHOT FOR DATABASE every 10 seconds h) DISPLAY BUFFERPOOL on z/OS every 5 minutes

Expected deliverable: A table with overhead rating and a 1-2 sentence explanation for each.

Exercise 5: Monitoring Architecture Diagram

Draw (or describe in text) a monitoring architecture for Meridian National Bank that includes:

Both z/OS and LUW DB2 instances
Data collection agents/mechanisms for each platform
A central monitoring repository
A dashboard display layer
An alerting subsystem with notification channels
Data flow arrows showing what data moves where

Expected deliverable: A diagram or detailed textual description identifying at least 8 distinct components and their interconnections.

Exercise 6: Alert Threshold Design

Design a complete alert threshold configuration for a new Meridian Bank LUW database. For each metric below, specify: - Warning threshold value - Critical threshold value - Number of consecutive intervals before alerting - Alert cooldown period - Notification channel (email, SMS, dashboard, ticket) - Suggested first response action

Metrics: a) Buffer pool hit ratio b) Log utilization percentage c) Lock wait time (average per transaction) d) Sort overflow percentage e) Table space utilization f) Active connections as percentage of maximum g) Deadlock count per hour h) Hours since last successful backup

Expected deliverable: A table with all six columns filled for each of the 8 metrics.

Exercise 7: Monitoring Tool Selection

Your manager asks you to recommend a monitoring strategy for Meridian Bank's z/OS DB2 environment. Compare the following approaches and recommend one, with justification:

a) DISPLAY commands only, run manually by the on-call DBA b) DISPLAY commands automated via a job scheduler, with output parsed by custom scripts c) IBM OMEGAMON for DB2 d) BMC MainView for DB2

Your recommendation should consider: cost, staffing requirements, coverage (24x7 vs. business hours), alerting capabilities, historical data, and learning curve.

Expected deliverable: A 400-word recommendation memo.

Section B: z/OS Monitoring (Exercises 8-14)

Exercise 8: DISPLAY DATABASE Interpretation

Given the following DISPLAY DATABASE output, identify all objects that require attention and specify the corrective action for each:

NAME       TYPE  PART  STATUS
--------   ----  ----  ---------------
ACCTTS     TS          RW
ACCTIX1    IX          RW
CUSTTS     TS          RECP
CUSTIX1    IX          RBDP
TRANSTS    TS          RW,COPY
TRANSIX1   IX          RW
LOANSTS    TS          UT
LOANIX1    IX          STOP
AUDITTS    TS          RO
AUDITIX1   IX          RW,LPL

Expected deliverable: A table listing each problem object, its status, severity (high/medium/low), and the specific corrective command or utility.

Exercise 9: DISPLAY THREAD Analysis

Given the following thread display, answer the questions below:

NAME      ST  A  REQ-CT      AUTH-ID   PLAN      ASID  TOKEN
--------  --  -  ----------  --------  --------  ----  -----
DISTSERV  TR  *  145         WEBUSER   BNKPLAN   003A  22451
DISTSERV  TR  *  23892       WEBUSER   BNKPLAN   003A  22452
BATCH1    TR  N  8945231     BATCHID   BATPLAN   004B  22390
TSO       TR  *  12          DBADM01   DSNTEP71  0022  22467
DISTSERV  QW  N  0           WEBUSER   BNKPLAN   003A  22470
DISTSERV  QW  N  0           WEBUSER   BNKPLAN   003A  22471

a) Which thread is likely a batch job, and why? b) Which threads are queued and waiting? What might cause this? c) Thread TOKEN 22452 has made 23,892 requests. Is this concerning? Under what circumstances? d) What command would you issue to get more detail about thread TOKEN 22390?

Exercise 10: Buffer Pool Hit Ratio Calculation

Given the following DISPLAY BUFFERPOOL output, calculate the hit ratio and determine if action is needed:

BUFFERPOOL NAME BP0, BUFFERPOOL ID 0
VIRTUAL BUFFER POOL SIZE =    25000
GETPAGES                 =  5000000
SYNC READ I/O (RANDOM)   =   375000
SYNC READ I/O (SEQ)      =    50000
SEQUENTIAL PREFETCH REQUESTS =  100000
SEQUENTIAL PREFETCH PAGES    =  800000

a) Calculate the buffer pool hit ratio. b) Is this ratio acceptable for an OLTP workload? For a batch workload? c) What percentage of I/O is random vs. sequential? d) Is prefetch working effectively? How can you tell? e) If you wanted to improve this hit ratio to 97%, approximately how many additional buffer pages would you need? (Show your reasoning.)

Exercise 11: Trace Configuration

Write the z/OS commands to accomplish each task:

a) Start a statistics trace sending data to SMF. b) Start an accounting trace for CLASS 1, 2, and 3 going to SMF. c) Start a performance trace for PLAN BNKPLAN only, limited to 10 minutes, going to GTF. d) Display all active traces. e) Stop trace number 7. f) Start an audit trace to monitor all authorization failures.

Exercise 12: SMF 101 Accounting Analysis

Given the following simplified accounting report for two plans, identify performance issues and recommend corrective actions:

Plan BNKPLAN (Online Banking):

Threads:     45,231       Avg Class 1 Elapsed:   0.235 sec
Avg Class 2 Elapsed:      0.189 sec    Avg Class 2 CPU:  0.012 sec
Avg Lock Wait:            0.098 sec    Avg Sync I/O:     0.034 sec
Avg Getpages:             234          Avg Sync Reads:    45
Avg Commits:              1.2          Sort Overflows:    0

Plan BATPLAN (Nightly Batch):

Threads:     12           Avg Class 1 Elapsed:   3456 sec
Avg Class 2 Elapsed:      2890 sec     Avg Class 2 CPU:  145 sec
Avg Lock Wait:            890 sec      Avg Sync I/O:     1234 sec
Avg Getpages:             45,678,901   Avg Sync Reads:    12,345,678
Avg Commits:              1            Sort Overflows:    234

a) For BNKPLAN: What is the biggest contributor to elapsed time? Is this normal? b) For BNKPLAN: Calculate the buffer pool hit ratio for this plan. c) For BATPLAN: What is the most concerning metric? Why? d) For BATPLAN: Why is only 1 commit recorded per thread? e) For BATPLAN: What actions would you take to improve performance?

Exercise 13: DISPLAY LOG Crisis Response

You receive this DISPLAY LOG output at 2:47 AM:

NUMBER OF ACTIVE LOG DATA SETS  =  6
FULL  - ACTIVE LOG DATA SETS    =  5

a) How severe is this situation? b) What is your immediate first action? c) Write the command to identify which thread is preventing log offloading. d) What are your options if the blocking thread is a critical batch job that cannot be terminated? e) How would you prevent this situation from recurring?

Exercise 14: IFCID Analysis Plan

You need to investigate why the online banking plan (BNKPLAN) is experiencing intermittent response time spikes of 2-3 seconds (normal is < 250ms). Design a trace-based investigation plan:

a) Which trace type(s) will you start? b) Which IFCIDs are most relevant? c) How will you filter the trace to minimize overhead? d) How long will you run the trace? e) What will you look for in the trace output? f) What is your backup plan if the trace itself causes performance degradation?

Section C: LUW Monitoring (Exercises 15-22)

Exercise 15: db2pd Interpretation

Given the following db2pd -bufferpools output, analyze the buffer pool health:

Bufferpool Name: MERIDIAN_BP_DATA   Id: 4   PageSize: 8192
nPages    nPagesUsed  nPagesClean nPagesDirty HitRatio
50000     49987       12345       37642       96.21

Data Logical  Data Physical  Index Logical  Index Physical
2345678       89012          1234567        234

Async Data Reads  Async Index Reads  Async Data Writes  Async Index Writes
456789            123456             234567             123456

Direct Reads  Direct Writes  Prefetch Wait Time
12345         67890          2.456

Bufferpool Name: MERIDIAN_BP_IDX   Id: 5   PageSize: 8192
nPages    nPagesUsed  nPagesClean nPagesDirty HitRatio
30000     29998       28765       1233        99.98

Data Logical  Data Physical  Index Logical  Index Physical
0             0              8901234        1234

Async Data Reads  Async Index Reads  Async Data Writes  Async Index Writes
0                 567890             0                  12345

Direct Reads  Direct Writes  Prefetch Wait Time
0             0              0.000

a) Which buffer pool has a potential problem? What specifically concerns you? b) What does the nPagesDirty value for MERIDIAN_BP_DATA tell you? c) What does the Prefetch Wait Time indicate? d) The Direct Reads/Writes for MERIDIAN_BP_DATA are significant. What could cause this? e) What specific actions would you recommend for each buffer pool?

Exercise 16: MON_GET Query Writing

Write MON_GET queries to answer each of the following questions about the MERIDIAN database:

a) What is the current buffer pool hit ratio for each buffer pool, including the number of logical and physical reads? b) Which 5 tables have the highest number of rows read? c) Which applications are currently waiting for locks, and how long have they been waiting? d) What is the current sort overflow percentage? e) What is the package cache hit ratio? f) Which 10 SQL statements have the worst ratio of rows read to rows returned?

Exercise 17: Event Monitor Creation

Write the complete SQL to create the following event monitors for the MERIDIAN database:

a) A locking event monitor that captures deadlocks and lock timeouts to tables in the MONITORING schema. b) An activity event monitor that captures SQL statements taking longer than 10 seconds, with full statement text and input values. c) A unit of work event monitor to capture transactions exceeding 60 seconds duration.

Include the CREATE EVENT MONITOR statements, any required thresholds, SET EVENT MONITOR STATE commands, and a query to retrieve data from each monitor.

Exercise 18: db2diag.log Analysis

Given the following db2diag.log entries, determine the root cause and corrective action:

Entry 1:

2025-03-15-09.15.23.456789-300  LEVEL: Error
PID: 12345  TID: 140234567890  PROC: db2sysc 0
INSTANCE: db2inst1  NODE: 000  DB: MERIDIAN
FUNCTION: DB2 UDB, lock manager, sqlplnfd, probe:3
MESSAGE: ADM0504C  The transaction log for the database is full.

Entry 2 (45 seconds later):

2025-03-15-09.16.08.123456-300  LEVEL: Error
PID: 12345  TID: 140234567891  PROC: db2sysc 0
INSTANCE: db2inst1  NODE: 000  DB: MERIDIAN
FUNCTION: DB2 UDB, data protection, sqlpgResSpace, probe:2
MESSAGE: ADM1823E  The active log is full and is not being drained because
         the oldest active unit of work is held by application handle "1234".

a) What is the root cause of this problem? b) What is the relationship between Entry 1 and Entry 2? c) What immediate action should you take? d) What information do you need to gather before taking that action? e) What preventive measures would you implement?

Exercise 19: db2diag Command Filtering

Write the db2diag filter command for each scenario:

a) Show all errors for database MERIDIAN in the last 4 hours. b) Show all entries at Warning level or above from yesterday between 2 PM and 4 PM. c) Find all entries containing message ID ADM1823E. d) Show all entries for application handle 1234. e) Show all Severe and Critical entries from the last 7 days.

Exercise 20: Snapshot vs. MON_GET Comparison

For each of the following information needs, write both the GET SNAPSHOT command and the equivalent MON_GET query. Then explain which approach you would recommend and why.

a) Current buffer pool hit ratio for all buffer pools b) Lock information for all active applications c) Total sorts and sort overflows for the database d) Number of active connections

Exercise 21: db2pd Continuous Monitoring Setup

Design a db2pd-based continuous monitoring setup for the MERIDIAN database that:

a) Captures buffer pool statistics every 60 seconds to a daily file b) Captures lock information every 30 seconds during business hours (6 AM - 10 PM) c) Captures transaction information every 120 seconds d) Automatically rotates output files daily e) Does not run during the backup window (2 AM - 4 AM)

Write the shell script (Linux/Unix) or scheduling commands needed. Explain how you would parse the output files for alerting purposes.

Exercise 22: FFDC and db2support

You need to open a support case with IBM for a recurring crash in the MERIDIAN database. Describe:

a) What is FFDC and how does it help in this situation? b) What files would you look for in the db2dump directory? c) Write the db2support command to collect diagnostic data. d) What additional information should you include with your support case? e) What DIAGLEVEL should you set and why?

Section D: Cross-Platform and Meridian Bank Scenarios (Exercises 23-30)

Exercise 23: Key Metric Quick Reference

Create a quick reference card (table format) for the 10 most important DB2 monitoring metrics. For each metric, include: - Metric name - z/OS command/source to obtain it - LUW command/source to obtain it - Healthy range - First action when threshold is breached

Exercise 24: Meridian Daily Health Check Enhancement

The daily health check script in Section 20.12 covers 8 areas. Enhance it by adding three additional checks:

a) HADR status check (if applicable) b) Automatic storage utilization trend (compare today vs. 7 days ago) c) Application connection duration — flag any connections active for more than 4 hours

Write the SQL for each additional check and explain how you would integrate it into the existing health check.

Exercise 25: Production Incident Response Playbook

Create a monitoring-driven incident response playbook for Meridian National Bank for the following scenario:

Alert received: Buffer pool hit ratio for MERIDIAN_BP_DATA dropped to 88% (critical threshold: 93%).

Write the step-by-step investigation procedure, including: a) The first 5 commands you would run (in order, with rationale) b) Decision tree: If the cause is X, do Y; if the cause is Z, do W c) Escalation criteria: when to involve application team, management, IBM support d) Documentation requirements: what to record in the incident log

Exercise 26: Monitoring Data Retention Policy

Design a monitoring data retention policy for Meridian Bank that balances storage cost against analytical needs:

a) Real-time data (db2pd output, current snapshots): retention period and storage format b) Short-term historical (event monitors, hourly summaries): retention period and storage c) Medium-term historical (daily summaries, trend data): retention period and storage d) Long-term historical (monthly summaries, capacity planning): retention period and storage e) Estimate the storage requirements for each tier (state your assumptions) f) Design the purge/archival process

Exercise 27: Lock Contention Deep Dive

A teller at Meridian Bank reports that account transfers are timing out during the afternoon. You suspect lock contention. Write the complete investigation procedure for:

a) z/OS: Which DISPLAY commands and traces would you use? Write each command. b) LUW: Which db2pd options and MON_GET queries would you use? Write each command/query. c) For both platforms: How would you identify the blocking application? d) For both platforms: How would you determine what SQL the blocker is running? e) What are your resolution options that do not involve terminating the blocking application?

Exercise 28: Monitoring During Planned Maintenance

Meridian Bank is performing a planned database migration over a weekend. Design the monitoring plan for:

a) Pre-migration: What baselines do you capture? What health checks do you run? b) During migration: What do you monitor in real-time? How frequently? c) Post-migration: What metrics do you compare? What is your "all clear" criteria? d) Rollback triggers: What monitoring thresholds would trigger a rollback decision?

Exercise 29: Performance Trend Analysis

You have 90 days of historical monitoring data for Meridian Bank. Write the SQL queries (LUW) to:

a) Calculate the average buffer pool hit ratio by day of week and hour of day b) Identify the top 5 days with the worst overall performance (define your composite metric) c) Detect queries whose average execution time has increased by more than 50% in the last 7 days compared to the previous 30-day average d) Project when the primary data table space will reach 90% utilization based on the growth trend

Assume the historical data is stored in tables: MERIDIAN.PERF_DAILY_SUMMARY, MERIDIAN.QUERY_DAILY_SUMMARY, and MERIDIAN.TABLESPACE_DAILY_SUMMARY with reasonable column names.

Exercise 30: Comprehensive Monitoring Implementation

This is a capstone exercise. Design and document a complete monitoring implementation for a new Meridian Bank DB2 LUW production database. Your deliverable should include:

a) A list of all monitoring objects to create (event monitors, thresholds, history tables) b) The SQL to create each object c) A collection schedule (what runs when) d) Alert definitions with thresholds, notification channels, and response procedures e) A daily health check script f) A weekly trending report query g) A monthly capacity planning query h) Housekeeping procedures (purge old data, maintain monitoring tables) i) Documentation for the on-call DBA team

Expected deliverable: A complete monitoring implementation plan of approximately 2,000 words plus all SQL code.

Answer Key Guidance

Exercises 1-7 have definitive answers and should be graded against the chapter content. Exercises 8-14 (z/OS) and 15-22 (LUW) have specific technical answers that can be verified against DB2 documentation. Exercises 23-30 are design exercises with multiple valid approaches; grade on completeness, technical accuracy, and practicality.

For self-study: Complete Exercises 1-6, 8, 10, 15, 16, and 23 before moving to Chapter 21. Return to the remaining exercises after completing Part 4.