> "The best DBAs fix problems before users notice them. The secret? Monitoring that tells you what's happening, not just what happened."
In This Chapter
- 20.1 The Monitoring Mindset
- 20.2 [z/OS] DISPLAY Commands
- 20.3 [z/OS] Traces and IFCID Records
- 20.4 [z/OS] SMF Records for DB2
- 20.5 [z/OS] OMEGAMON and Third-Party Monitors
- 20.6 [LUW] db2pd — The DBA's Swiss Army Knife
- 20.7 [LUW] Snapshot Monitor and MON_GET Functions
- 20.8 [LUW] Event Monitors
- 20.9 [LUW] db2diag.log and db2dump
- 20.10 Key Metrics and What They Mean
- 20.11 Building a Monitoring Dashboard
- 20.12 The Meridian Bank Monitoring Setup
- Spaced Review: Connecting to Earlier Chapters
- Chapter Summary
Chapter 20: Monitoring and Problem Determination — Knowing What's Wrong Before the Phone Rings
"The best DBAs fix problems before users notice them. The secret? Monitoring that tells you what's happening, not just what happened."
Every DBA remembers the first time their phone rang at 2 AM with a production crisis. The application team is panicking, end users are locked out, and management wants answers in minutes. You log in, stare at a system you barely understand at that hour, and start guessing. Maybe it is locks. Maybe the log is full. Maybe a utility is running that should have finished hours ago.
Now imagine a different scenario. Your phone never rings because at 11:47 PM, your monitoring system detected that lock wait times were climbing. It sent you an alert. You logged in, found a long-running transaction holding exclusive locks on the ACCOUNTS table, and resolved it — all before any user noticed. That is the difference between reactive and proactive monitoring. That is what this chapter teaches you to build.
Monitoring is not a task you perform occasionally. It is the heartbeat of your daily DBA practice. On z/OS, you will live inside DISPLAY commands, trace records, and SMF data. On LUW, db2pd, snapshot monitors, and MON_GET table functions become your constant companions. On both platforms, you need to know what "normal" looks like so you can recognize "abnormal" the instant it appears.
At Meridian National Bank, we are going to build a monitoring infrastructure from scratch — one that watches buffer pools, tracks transaction volumes, alerts on lock contention, monitors log space, and generates a daily health check report. By the end of this chapter, you will have the tools, the queries, and the mindset to be the DBA who fixes things before anyone else even knows there is a problem.
20.1 The Monitoring Mindset
Reactive vs. Proactive Monitoring
Most organizations begin with reactive monitoring: something breaks, somebody calls, and the DBA investigates. This approach has predictable consequences — extended outages, angry users, and DBAs who dread their on-call rotations.
Proactive monitoring inverts this model. Instead of waiting for failure, you continuously collect metrics, compare them against known baselines, and alert when deviations exceed thresholds. The goal is simple: detect the problem before it becomes a crisis.
| Characteristic | Reactive | Proactive |
|---|---|---|
| Trigger | User complaint or application error | Threshold breach or trend deviation |
| Response time | Minutes to hours after impact | Seconds to minutes before impact |
| Data available | Whatever you can gather post-mortem | Historical baselines and real-time feeds |
| DBA stress level | Extremely high | Managed and controlled |
| Business impact | Downtime, data loss risk | Minimal or zero user impact |
What to Monitor
Not everything deserves equal attention. Focus your monitoring on these five pillars:
- Availability — Is the database up? Can applications connect? Are subsystem components active?
- Performance — Are response times within acceptable bounds? Are buffer pools effective? Are sorts spilling to disk?
- Capacity — How much disk space remains? How full are the logs? Are table spaces approaching limits?
- Concurrency — Are locks being held too long? Are deadlocks occurring? Are threads waiting?
- Integrity — Are backups completing? Are utilities finishing on time? Are there any data inconsistencies?
Establishing Baselines
You cannot detect anomalies without knowing what normal looks like. A baseline is a collection of metric values captured during known-good operating conditions. Here is the process:
- Collect metrics during normal operations over a representative period (at least two weeks, ideally including month-end processing).
- Segment by time period — weekday vs. weekend, business hours vs. batch window, month-end vs. mid-month.
- Calculate statistical ranges — mean, standard deviation, 95th percentile for each metric.
- Document the baseline — store it where your monitoring tools can reference it.
- Review and refresh — baselines become stale as workloads evolve. Refresh quarterly at minimum.
For Meridian National Bank, we would establish separate baselines for:
- Online banking hours (6 AM to 10 PM): high transaction volume, many short queries
- Batch processing window (11 PM to 5 AM): large table scans, heavy I/O, utility operations
- Month-end processing: elevated volumes across all metrics for 2-3 days
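The baseline steps above reduce to simple arithmetic. Here is a minimal Python sketch; the `baseline` helper, segment names, and sample hit-ratio values are invented for illustration and are not part of any DB2 tooling:

```python
# Illustrative baseline calculator (hypothetical helper, not a DB2 tool).
# Given metric samples tagged by workload segment, compute the statistical
# ranges the chapter recommends: mean, standard deviation, 95th percentile.
from statistics import mean, pstdev

def baseline(samples):
    """Return (mean, stddev, p95) for a list of metric values."""
    ordered = sorted(samples)
    # Nearest-rank 95th percentile
    p95 = ordered[max(0, int(round(0.95 * len(ordered))) - 1)]
    return round(mean(ordered), 2), round(pstdev(ordered), 2), p95

# Buffer pool hit ratios sampled during two Meridian segments (invented data)
segments = {
    "online (06-22)": [98.7, 99.1, 98.9, 98.5, 99.0, 98.8],
    "batch  (23-05)": [84.2, 86.1, 83.7, 85.4, 85.0, 84.8],
}
for name, values in segments.items():
    print(name, baseline(values))
```

Store the resulting ranges per segment; an alert then fires when a live sample falls outside its own segment's range, not a global one.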
20.2 [z/OS] DISPLAY Commands
The DISPLAY command family is the z/OS DBA's primary real-time monitoring tool. These commands are issued through the DB2 command prefix (typically a hyphen followed by the subsystem name) or through DB2I panels. Every z/OS DBA should be able to type these commands from memory.
-DISPLAY DATABASE
This command shows the status of databases, table spaces, and index spaces.
-DIS DATABASE(MERIDIANDB) SPACENAM(*) LOCKS
Output interpretation:
DSNT360I ***************************************************
DSNT361I * DISPLAY DATABASE SUMMARY
* GLOBAL
DSNT362I DATABASE = MERIDIANDB STATUS = RW
DBD LENGTH = 32768
DSNT397I -
NAME TYPE PART STATUS PHYERRLO PHYERRHI CATALOG PIECE
-------- ---- ---- -------------- -------- -------- ------- -----
ACCTTS TS RW
ACCTIX1 IX RW
CUSTTS TS RW,COPY
CUSTIX1 IX RW
TRANSTS TS RW
TRANSIX1 IX RW
LOANSTS TS RW,CHKP
******* DISPLAY OF DATABASE MERIDIANDB ENDED
Key status codes to watch:
| Status | Meaning | Action Required |
|---|---|---|
| RW | Read-Write, normal | None |
| RO | Read-Only | Investigate why; may be intentional |
| UT | Utility running | Check which utility |
| STOP | Stopped | Restart if needed |
| COPY | Copy pending | Run COPY utility |
| CHKP | Check pending | Run CHECK utility |
| RECP | Recover pending | Immediate attention required |
| RBDP | Rebuild pending | Rebuild indexes |
| LPL | Logical page list entries | Pages may be inconsistent |
| GRECP | Group recover pending | Data sharing recovery needed |
Critical alert: Any status containing RECP (Recover Pending) demands immediate investigation. This means DB2 has detected a potential inconsistency that requires a RECOVER utility execution before the object can be safely used.
-DISPLAY THREAD
Thread display shows active connections and what they are doing.
-DIS THREAD(*) DETAIL
Sample output:
DSNT360I ***************************************************
DSNV401I SSID = DB2P ACTIVE THREADS -
DSNV402I NAME ST A REQ-CT AUTH-ID PLAN ASID TOKEN
-------- -- - ---------- -------- -------- ---- -----
DISTSERV TR * 125432 WEBUSER BNKPLAN 003A 22451
BATCH1 TR N 4523891 BATCHID BATPLAN 004B 22390
TSO TR * 342 DBADM01 DSNTEP71 0022 22467
Column interpretation:
- ST (Status): TR = active thread, QW = queued waiting, QD = queued disabled
- A (Active): * = currently executing in DB2, N = not currently executing
- REQ-CT: Number of DB2 requests made — a very high count on an active thread suggests a long-running operation
- PLAN: Bound plan name — identifies the application
When to use this command:
- When investigating lock contention — identify who holds locks
- When connection limits are approached — count active threads
- During performance investigations — find long-running threads
- Before maintenance — verify no critical threads are active
-DISPLAY BUFFERPOOL
Buffer pool efficiency is arguably the single most important performance metric to watch.
-DIS BUFFERPOOL(BP0) DETAIL
Sample output (abbreviated):
DSNT360I ***************************************************
DSNB401I BUFFERPOOL NAME BP0, BUFFERPOOL ID 0, USE COUNT 15
DSNB402I VIRTUAL BUFFER POOL SIZE = 50000
DSNB404I BUFFER POOL HIT RATIO = 98.5
DSNB406I GETPAGES = 1523456789 RANDOM GETPAGES = 423567890
DSNB407I SYNC READ I/O (RANDOM) = 6123456 (SEQ) = 245678
DSNB409I PAGES WRITTEN = 3456789
DSNB410I SEQUENTIAL PREFETCH -
REQUESTS = 567890 PAGES READ = 4523456
DSNB411I LIST PREFETCH -
REQUESTS = 23456 PAGES READ = 189012
DSNB412I DYNAMIC PREFETCH -
REQUESTS = 123456 PAGES READ = 987654
DSNB415I PREFETCH I/O - SEQUENTIAL = 34567 LIST = 4512 DYNAMIC = 12345
Critical metrics:
| Metric | Healthy Threshold | Action if Breached |
|---|---|---|
| Hit Ratio | > 95% for OLTP, > 80% for batch | Increase pool size, review access paths |
| Sync Read I/O (Random) | Low relative to GETPAGES | Check for inefficient queries |
| Prefetch Disabled | Should be 0 or near 0 | Increase prefetch thresholds |
| Sequential Steal | Should be 0 | Buffer pool is too small |
Formula for hit ratio:
Hit Ratio = (1 - (Sync Read I/O / GETPAGES)) * 100
If your hit ratio drops below 95% during OLTP hours, investigate immediately. A drop from 99% to 95% represents a fivefold increase in physical I/O.
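A quick sanity check of that claim, using the hit ratio formula above. This is plain Python with invented counter values:

```python
# A minimal sketch of the hit-ratio formula from the text.
def hit_ratio(sync_read_io, getpages):
    """Percentage of getpage requests satisfied without physical I/O."""
    return round((1 - sync_read_io / getpages) * 100, 2)

getpages = 1_000_000
print(hit_ratio(10_000, getpages))   # 99.0  (10,000 physical reads)
print(hit_ratio(50_000, getpages))   # 95.0  (50,000 physical reads)
print(50_000 / 10_000)               # 5.0 -> the fivefold increase in I/O
```

Same logical workload, same GETPAGES: the four-point drop in hit ratio multiplies physical reads by five.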
-DISPLAY LOG
Log monitoring prevents one of the most catastrophic DB2 events: running out of active log space.
-DIS LOG
Sample output:
DSNT360I ***************************************************
DSNJ371I CURRENT COPY 1 ACTIVE LOG DATA SET IS DSNAME=DB2P.LOGCOPY1.DS005
DSNJ372I CURRENT COPY 2 ACTIVE LOG DATA SET IS DSNAME=DB2P.LOGCOPY2.DS005
DSNJ375I THE CURRENT ACTIVE LOG RBA RANGE IS
FROM 0000034A2B6C0000 TO 0000034A3F8E0000
DSNJ376I THE ARCHIVE LOG RBA RANGE IS
FROM 000002F100000000 TO 0000034A2B6BFFFF
DSNJ379I NUMBER OF ACTIVE LOG DATA SETS = 6
DSNJ380I FULL - ACTIVE LOG DATA SETS = 3
When 5 of 6 active log data sets are full, you are in danger. DB2 will stall all update activity if it runs out of active log space. Monitor the ratio of full to total active logs continuously.
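That full-to-total ratio is easy to turn into an alert. A hedged sketch — the function name and thresholds below are illustrative, not part of any IBM tooling:

```python
# Illustrative alerting on the full-to-total active log ratio reported
# by -DIS LOG (DSNJ379I / DSNJ380I). Thresholds are assumptions.
def log_alert(full, total):
    """Map the count of full active log data sets to an alert level."""
    if total - full <= 1:      # e.g. 5 of 6 full: one data set from a stall
        return "CRITICAL"
    if full / total >= 0.5:    # e.g. 3 of 6 full: archiving may be lagging
        return "WARNING"
    return "OK"

print(log_alert(3, 6))  # WARNING
print(log_alert(5, 6))  # CRITICAL
```

The usual root causes of a climbing ratio are a stalled archive process or a long-running unit of recovery pinning the oldest active log.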
-DISPLAY UTILITY
-DIS UTIL(*)
This shows all running utilities, their status, and progress. Watch for:
- Utilities in STOPPED status that need restart
- Utilities running longer than expected (compare against baseline)
- Multiple utilities competing for the same resources
20.3 [z/OS] Traces and IFCID Records
While DISPLAY commands give you real-time snapshots, traces provide detailed continuous data collection. DB2 on z/OS uses Instrumentation Facility Component Identifiers (IFCIDs) to categorize trace records.
Trace Types
| Trace Type | Purpose | IFCID Range | Overhead |
|---|---|---|---|
| Statistics | System-wide resource usage | 001, 002, 202, 225 | Very low |
| Accounting | Per-thread resource consumption | 003, 148, 239 | Low to moderate |
| Performance | Detailed SQL and I/O activity | 003, 006-014, 044-068 | Moderate to high |
| Audit | Security-relevant events | 140-146, 316-319 | Low |
| Monitor | Real-time monitoring data | 147, 148, 199, 230 | Low to moderate |
| Global | Recovery and logging events | 106, 107 | Very low |
Starting and Stopping Traces
-- Start a statistics trace (always leave this running)
-START TRACE(STAT) CLASS(1,3,4,5,6) DEST(SMF)
-- Start an accounting trace for detailed thread analysis
-START TRACE(ACCTG) CLASS(1,2,3) DEST(SMF)
-- Start a performance trace for SQL analysis (use cautiously)
-START TRACE(PERFM) CLASS(1,2,3) PLAN(BNKPLAN) DEST(GTF) TNO(5)
-- Stop a specific trace by trace number
-STOP TRACE(PERFM) TNO(5)
-- Display active traces
-DIS TRACE(*)
Key IFCIDs for Performance Analysis
IFCID 003 — Accounting Record: Generated when a thread deallocates. Contains elapsed time, CPU time, lock wait time, I/O counts, and SQL statement counts. This is the single most important IFCID for performance analysis.
IFCID 006 — SQL Statement Start: Marks the beginning of each SQL statement execution. When combined with IFCID 007 (SQL Statement End), provides per-statement timing.
IFCID 044 — Lock Suspension: Generated when a thread must wait for a lock. Contains the resource name, lock type, and duration. Essential for lock contention analysis.
IFCID 058 — Page Read/Write: Detailed I/O tracking at the page level. High overhead; use only for targeted investigation.
IFCID 172 — Dynamic SQL Statement Text: Captures the full SQL text of dynamic statements. Critical for identifying problematic queries, but generates substantial data volume.
Performance Trace Considerations
Performance traces generate enormous volumes of data and impose measurable overhead. Follow these guidelines:
- Never leave performance traces running indefinitely in production
- Filter by plan, authorization ID, or connection type to reduce volume
- Use a GTF destination for short investigations to avoid filling SMF data sets
- Set a time limit — START TRACE(PERFM) ... TIMED(5) stops after 5 minutes
- Coordinate with systems programmers to ensure adequate DASD for trace data
20.4 [z/OS] SMF Records for DB2
System Management Facility (SMF) records are the permanent record of DB2 activity. While DISPLAY commands show you the present and traces capture the transient, SMF records provide the historical data you need for trend analysis and capacity planning.
DB2 SMF Record Types
| SMF Type | DB2 Trace Source | Contents |
|---|---|---|
| SMF 100 | Statistics trace | System-wide counters: buffer pool stats, EDM pool stats, log activity, locking summary |
| SMF 101 | Accounting trace | Per-thread detail: elapsed time, CPU, I/O, SQL counts, lock waits |
| SMF 102 | Performance trace | Detailed event records: individual SQL timings, I/O events, lock events |
SMF 101: The Accounting Record
SMF 101 records (generated from IFCID 003) are the foundation of DB2 performance management. Each record represents a complete thread lifecycle — from allocation to deallocation — and contains:
Elapsed and CPU time breakdown:
- Class 1: Total elapsed time (in-DB2 + out-of-DB2)
- Class 2: In-DB2 elapsed time
- Class 3: Wait time detail (lock, I/O, log, drain, claim, global contention)
SQL activity counts:
- SELECT, INSERT, UPDATE, DELETE counts
- PREPARE, BIND, OPEN/CLOSE cursor counts
- Commit and rollback counts
Resource consumption:
- Buffer pool getpages and physical I/O
- Log records written
- Sorts performed and sort overflows
Formatting the Accounting Report
Performance reporting tools format SMF 101 records into readable reports; most shops use the batch reporter of a monitoring product such as OMEGAMON for Db2 Performance Expert. The JCL below is an illustrative sketch; program names and control statements vary by product and release:
//ACCTRPT EXEC PGM=DSNACCOR
//STEPLIB DD DSN=DB2P.SDSNLOAD,DISP=SHR
//SMFIN DD DSN=SYS1.SMF.DAILY,DISP=SHR
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
INCLUDE PLAN=(BNKPLAN,BATPLAN)
REPORT ORDER=ELAPSED
/*
Key fields in the accounting report:
PLAN: BNKPLAN AUTH-ID: WEBUSER CORR-ID: CONN0001
---------------------------------------------------------------
CLASS 1 ELAPSED: 0.452 SEC CPU: 0.023 SEC
CLASS 2 ELAPSED: 0.310 SEC CPU: 0.021 SEC
CLASS 3 LOCK WAIT: 0.089 SEC
CLASS 3 SYNC I/O: 0.045 SEC
CLASS 3 LOG WRITE: 0.002 SEC
---------------------------------------------------------------
GETPAGES: 4,523 SYNC READ: 12
SELECT: 15 INSERT: 3
UPDATE: 2 DELETE: 0
COMMIT: 1 ROLLBACK: 0
SORTS: 2 SORT OVERFLOW: 0
Interpreting the accounting report:
- Class 1 minus Class 2 = time spent outside DB2 (application processing, network)
- Class 2 elapsed minus Class 2 CPU = in-DB2 wait time
- High lock wait relative to total elapsed time = lock contention problem
- Sort overflows > 0 = sort heap too small or query generating excessive sorts
- Sync reads high relative to getpages = poor buffer pool hit ratio for this plan
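The derived numbers for the sample report work out as follows. The variable names are hypothetical; the values are copied from the report above:

```python
# Illustrative arithmetic on the sample accounting report
# (field names are invented; values copied from the report).
cl1_elapsed, cl2_elapsed, cl2_cpu = 0.452, 0.310, 0.021
lock_wait = 0.089

outside_db2 = round(cl1_elapsed - cl2_elapsed, 3)   # application + network time
in_db2_wait = round(cl2_elapsed - cl2_cpu, 3)       # Class 3 waits + not-accounted
lock_pct = round(lock_wait / cl1_elapsed * 100, 1)  # lock wait share of elapsed

print(outside_db2)  # 0.142
print(in_db2_wait)  # 0.289
print(lock_pct)     # 19.7
```

Nearly 20% of this thread's elapsed time was spent waiting on locks — by the interpretation rules above, that alone justifies a contention investigation.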
Building Historical Trends
Store summarized SMF data in DB2 tables for trend analysis:
CREATE TABLE MERIDIAN.PERF_HISTORY (
CAPTURE_DATE DATE NOT NULL,
CAPTURE_HOUR SMALLINT NOT NULL,
PLAN_NAME CHAR(8) NOT NULL,
THREAD_COUNT INTEGER,
AVG_ELAPSED_SEC DECIMAL(10,4),
AVG_CPU_SEC DECIMAL(10,4),
AVG_LOCK_WAIT_SEC DECIMAL(10,4),
TOTAL_GETPAGES BIGINT,
TOTAL_SYNC_READ BIGINT,
HIT_RATIO_PCT DECIMAL(5,2),
TOTAL_SORTS INTEGER,
SORT_OVERFLOWS INTEGER,
PRIMARY KEY (CAPTURE_DATE, CAPTURE_HOUR, PLAN_NAME)
);
Populate this table daily from your SMF processing and you will have the historical baseline data essential for detecting performance degradation before it becomes critical.
20.5 [z/OS] OMEGAMON and Third-Party Monitors
While DISPLAY commands and SMF records provide raw capabilities, enterprise environments typically deploy specialized monitoring products that add real-time dashboards, threshold alerting, historical trending, and automated responses.
IBM OMEGAMON for DB2
OMEGAMON for DB2 Performance Expert is IBM's flagship monitoring product. It provides:
Real-time monitoring:
- Thread activity with drill-down to individual SQL statements
- Buffer pool hit ratios with graphical display
- Lock contention visualization — who is waiting for whom
- Dynamic SQL text capture and explain
- EDM pool and RID pool utilization
Exception processing:
- Define thresholds for any metric (e.g., "alert when hit ratio drops below 95%")
- Automatic actions when thresholds are breached (send message, cancel thread, trigger trace)
- Escalation chains — alert DBA first, then manager if not resolved
Historical analysis:
- Trend charts for any metric over days, weeks, or months
- Compare current performance against baseline periods
- Identify gradual degradation that day-to-day observation misses
Near-term history:
- Short-interval (1-5 minute) data kept in memory or VSAM
- Bridge the gap between real-time and SMF historical data
- Essential for investigating problems that occurred minutes or hours ago
Threshold Configuration Example
In OMEGAMON, you might configure alerts for Meridian Bank like this:
| Metric | Warning Threshold | Critical Threshold | Action |
|---|---|---|---|
| Buffer pool hit ratio (BP0) | < 97% | < 93% | Page DBA, investigate access paths |
| Active log data sets full | 4 of 6 | 5 of 6 | Page DBA, check for long-running URs |
| Thread elapsed time | > 30 seconds | > 120 seconds | Alert DBA, display thread detail |
| Lock wait time | > 5 seconds | > 30 seconds | Alert DBA, display lock detail |
| Deadlocks per interval | > 2 | > 10 | Alert DBA, start performance trace |
| CPU per SQL statement | > 0.5 seconds | > 2.0 seconds | Log for review |
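Exception processing of this kind is straightforward to prototype. Below is a sketch of a two-level threshold evaluator; the `classify` function is hypothetical, and the thresholds come from the table above:

```python
# Sketch of two-level (warning/critical) threshold evaluation in plain
# Python. The function is illustrative, not an OMEGAMON API.
def classify(value, warn, crit, lower_is_bad=False):
    """Return CRITICAL/WARNING/OK for a metric against two thresholds."""
    if lower_is_bad:                 # metrics where small values are bad
        if value < crit:
            return "CRITICAL"
        if value < warn:
            return "WARNING"
    else:                            # metrics where large values are bad
        if value > crit:
            return "CRITICAL"
        if value > warn:
            return "WARNING"
    return "OK"

print(classify(95.5, warn=97, crit=93, lower_is_bad=True))  # hit ratio: WARNING
print(classify(45, warn=30, crit=120))                      # elapsed sec: WARNING
print(classify(12, warn=2, crit=10))                        # deadlocks: CRITICAL
```

Note the `lower_is_bad` flag: hit ratios alert when they fall, while wait times and deadlock counts alert when they rise.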
BMC MainView for DB2
BMC MainView is the other major third-party monitoring product for DB2 on z/OS. It provides similar capabilities to OMEGAMON with a different user interface and some unique features:
- Thread advisor: Recommends actions for problem threads
- SQL workspace: Interactive SQL analysis and tuning
- Batch tracking: Monitors batch job progress against SLA targets
- Automated recovery: Can automatically restart utilities or resolve certain error conditions
Choosing a Monitoring Product
For Meridian Bank's z/OS environment, the selection criteria include:
- Integration with existing enterprise monitoring (Tivoli, BMC Patrol, etc.)
- Data sharing support if running in a Parallel Sysplex
- Ability to monitor multiple DB2 subsystems from a single console
- Historical data retention and reporting capabilities
- Cost — these products represent significant investment
20.6 [LUW] db2pd — The DBA's Swiss Army Knife
On DB2 LUW, db2pd is the most versatile real-time monitoring tool. It reads directly from DB2's shared memory structures, which means it imposes virtually zero overhead on the database engine. Unlike snapshot monitors, db2pd does not acquire latches, making it safe to run even during performance crises.
Core db2pd Options
Database overview:
db2pd -db MERIDIAN
This is your starting point. It shows database configuration, memory usage, and general status.
Buffer pool monitoring:
db2pd -db MERIDIAN -bufferpools
Sample output:
Bufferpool Name: IBMDEFAULTBP Id: 1 PageSize: 8192
nPages nPagesUsed nPagesClean nPagesDirty HitRatio
50000 47823 38456 9367 98.73
Data Logical Data Physical Index Logical Index Physical
1245678 15432 567890 7654
Async Data Reads Async Index Reads Async Data Writes Async Index Writes
234567 123456 45678 23456
Direct Reads Direct Writes Prefetch Wait Time
1234 5678 0.023
Interpreting db2pd buffer pool output:
- HitRatio: Should be above 95% for OLTP workloads. Below 80% requires immediate investigation.
- nPagesDirty: If approaching nPages, the page cleaners cannot keep up — increase NUM_IOCLEANERS or buffer pool size.
- Prefetch Wait Time: If non-zero and growing, I/O subsystem is saturated.
- Direct Reads/Writes: Large numbers indicate operations bypassing the buffer pool (LOBs, temp tables for sorts). Some are expected; a sudden increase warrants investigation.
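Those checks can be automated once the counters are extracted from db2pd output. An illustrative helper — the 0.5 dirty-page threshold is an assumption, not an IBM recommendation, and the counter values are taken from the sample output above:

```python
# Hedged helper: derive the health signals discussed above from db2pd
# bufferpool counters. Thresholds are illustrative assumptions.
def bp_health(n_pages, n_dirty, hit_ratio):
    """Return a list of issues for one buffer pool, or ['OK']."""
    issues = []
    if hit_ratio < 95.0:                  # OLTP floor from the text
        issues.append("low hit ratio")
    if n_dirty / n_pages > 0.5:           # assumed page-cleaner pressure mark
        issues.append("page cleaners lagging")
    return issues or ["OK"]

print(bp_health(50000, 9367, 98.73))   # ['OK']  (sample output values)
print(bp_health(50000, 31000, 91.2))   # ['low hit ratio', 'page cleaners lagging']
```

Run against the `-repeat` output described below, a helper like this turns raw counters into a timeline of emerging problems.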
Lock monitoring:
db2pd -db MERIDIAN -locks showlocks
Sample output:
Locks:
Address TranHdl Lockname Type Mode Sts Owner
0x07800100 3 02000600040000000000000052 Row X G 3
0x07800200 5 02000600040000000000000052 Row S W 5
0x07800300 3 02000600000000000000000054 Tab IX G 3
0x07800400 5 02000600000000000000000054 Tab IS G 5
Column interpretation:
- TranHdl: Transaction handle — links to application information
- Type: Lock granularity (Row, Tab=Table, TBS=Tablespace)
- Mode: Lock mode (S=Share, X=Exclusive, IX=Intent Exclusive, IS=Intent Share, U=Update)
- Sts: Status (G=Granted, W=Waiting, C=Converting)
A lock with Sts=W tells you contention exists. Cross-reference the TranHdl with the -transactions output to find the blocking application.
Transaction monitoring:
db2pd -db MERIDIAN -transactions
Shows all active transactions with their start times, log space consumed, and current state. Look for:
- Transactions that have been active for abnormally long periods
- Transactions consuming large amounts of log space (potential log-full risk)
- Transactions in ROLLBACK state that may be taking a long time to complete
Application monitoring:
db2pd -db MERIDIAN -applications
Maps application handles to connection details (IP address, client application name, authorization ID). Essential for identifying who is running a problematic query.
Combining db2pd Options
You can combine multiple options in a single command:
db2pd -db MERIDIAN -bufferpools -locks -transactions -applications
And you can set up continuous monitoring with the -repeat option:
db2pd -db MERIDIAN -bufferpools -repeat 30 -file /home/db2inst1/bp_monitor.out
This captures buffer pool statistics every 30 seconds and writes them to a file — invaluable for capturing intermittent problems.
Advanced db2pd Options
Sorting and hashing:
db2pd -db MERIDIAN -sort
Shows active sort operations and memory consumption. Sort overflows to disk devastate performance.
Dynamic SQL cache:
db2pd -db MERIDIAN -dynamic
Shows statements in the package cache, their execution counts, and costs. A quick way to find the most frequently executed or most expensive queries.
Table space status:
db2pd -db MERIDIAN -tablespaces
Shows table space state, size, utilization, and any pending operations.
Wait analysis:
db2pd -db MERIDIAN -wlocks
Specifically shows lock waits — who is waiting and who is blocking. This is the fastest way to diagnose lock contention.
20.7 [LUW] Snapshot Monitor and MON_GET Functions
While db2pd provides zero-overhead snapshots, the Snapshot Monitor and MON_GET table functions offer richer, more structured data suitable for automated collection and analysis.
The Snapshot Monitor (Legacy but Still Useful)
The snapshot monitor uses monitor switches that control what data is collected:
-- Check current monitor switch settings
db2 GET MONITOR SWITCHES
-- Enable all switches
db2 UPDATE MONITOR SWITCHES USING
BUFFERPOOL ON
LOCK ON
SORT ON
STATEMENT ON
TABLE ON
TIMESTAMP ON
UOW ON
Taking a snapshot:
-- Database-level snapshot
db2 GET SNAPSHOT FOR DATABASE ON MERIDIAN
-- Application-level snapshot
db2 GET SNAPSHOT FOR APPLICATIONS ON MERIDIAN
-- Buffer pool snapshot
db2 GET SNAPSHOT FOR BUFFERPOOLS ON MERIDIAN
-- Lock snapshot
db2 GET SNAPSHOT FOR LOCKS ON MERIDIAN
MON_GET Table Functions (Modern Approach)
Starting with DB2 9.7, MON_GET table functions provide the same data as snapshots but in a SQL-queryable form. They are more flexible, more efficient, and the recommended approach for new monitoring implementations.
Buffer pool monitoring:
SELECT
BP_NAME,
POOL_DATA_L_READS + POOL_INDEX_L_READS AS TOTAL_LOGICAL_READS,
POOL_DATA_P_READS + POOL_INDEX_P_READS AS TOTAL_PHYSICAL_READS,
CASE WHEN (POOL_DATA_L_READS + POOL_INDEX_L_READS) > 0
THEN DEC(1.0 - (FLOAT(POOL_DATA_P_READS + POOL_INDEX_P_READS) /
FLOAT(POOL_DATA_L_READS + POOL_INDEX_L_READS)), 5, 4) * 100
ELSE 100.00
END AS HIT_RATIO_PCT
FROM TABLE(MON_GET_BUFFERPOOL('', -2)) AS BP
ORDER BY TOTAL_LOGICAL_READS DESC;
Lock wait monitoring:
SELECT
APPLICATION_HANDLE,
LOCK_WAITS,
LOCK_WAIT_TIME,
LOCK_TIMEOUTS,
DEADLOCKS,
LOCK_ESCALS
FROM TABLE(MON_GET_CONNECTION(NULL, -2)) AS CONN
WHERE LOCK_WAITS > 0
ORDER BY LOCK_WAIT_TIME DESC;
Top SQL by execution time:
SELECT
SUBSTR(STMT_TEXT, 1, 200) AS SQL_TEXT,
NUM_EXECUTIONS,
TOTAL_ACT_TIME / 1000 AS TOTAL_TIME_SEC,
CASE WHEN NUM_EXECUTIONS > 0
THEN (TOTAL_ACT_TIME / NUM_EXECUTIONS) / 1000
ELSE 0
END AS AVG_TIME_SEC,
ROWS_READ,
ROWS_RETURNED
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS STMT
WHERE NUM_EXECUTIONS > 0
ORDER BY TOTAL_ACT_TIME DESC
FETCH FIRST 20 ROWS ONLY;
Table I/O analysis:
SELECT
SUBSTR(TABSCHEMA, 1, 20) AS SCHEMA,
SUBSTR(TABNAME, 1, 30) AS TABLE_NAME,
ROWS_READ,
ROWS_INSERTED,
ROWS_UPDATED,
ROWS_DELETED,
TABLE_SCANS,
DATA_OBJECT_L_PAGES AS DATA_PAGES
FROM TABLE(MON_GET_TABLE('', '', -2)) AS T
WHERE TABSCHEMA = 'MERIDIAN'
ORDER BY ROWS_READ DESC
FETCH FIRST 20 ROWS ONLY;
MON_GET vs. Snapshot: When to Use Which
| Feature | GET SNAPSHOT | MON_GET Functions |
|---|---|---|
| Output format | Formatted text | SQL result set |
| Filtering | Limited | Full SQL WHERE clause |
| Aggregation | None (manual) | SQL GROUP BY, SUM, etc. |
| Automation | Requires parsing | Direct INSERT INTO |
| Overhead | Moderate | Low |
| Availability | All versions | DB2 9.7+ |
| Recommended for new work | No | Yes |
Best practice: Use MON_GET functions for all new monitoring development. Use snapshot monitors only when supporting legacy scripts or when a specific piece of information is not yet available through MON_GET.
20.8 [LUW] Event Monitors
While snapshots and MON_GET functions show you point-in-time data, event monitors capture data when specific events occur. They are essential for capturing transient problems that might not be visible in a snapshot.
Types of Event Monitors
| Event Monitor Type | Captures | Use Case |
|---|---|---|
| Statement | Every SQL statement execution | SQL performance analysis |
| Activity | SQL and non-SQL activities | Workload analysis |
| Unit of Work | Transaction boundaries | Transaction pattern analysis |
| Deadlock | Deadlock events | Deadlock investigation |
| Connection | Connect and disconnect events | Connection pattern analysis |
| Locking | Lock escalations, timeouts, deadlocks | Lock contention deep dive |
| Package Cache | Statement cache entries | Package cache analysis |
| Threshold Violations | WLM threshold breaches | Workload management |
Creating an Activity Event Monitor
For Meridian Bank, we want to capture all SQL statements that take longer than 5 seconds:
-- First, create a workload management threshold
CREATE THRESHOLD MERIDIAN_SLOW_SQL
FOR DATABASE ACTIVITIES
ENFORCEMENT DATABASE
WHEN ACTIVITYTOTALTIME > 5 SECONDS
COLLECT ACTIVITY DATA WITH DETAILS AND VALUES
CONTINUE;
-- Create the activity event monitor
CREATE EVENT MONITOR MERIDIAN_ACTMON
FOR ACTIVITIES
WRITE TO TABLE
ACTIVITY (TABLE MERIDIAN.ACTIVITY_EVENTS
IN USERSPACE1
PCTDEACTIVATE 100)
ACTIVITYSTMT (TABLE MERIDIAN.ACTIVITY_STMT_EVENTS
IN USERSPACE1
PCTDEACTIVATE 100)
ACTIVITYVALS (TABLE MERIDIAN.ACTIVITY_VAL_EVENTS
IN USERSPACE1
PCTDEACTIVATE 100)
AUTOSTART;
-- Activate the event monitor
SET EVENT MONITOR MERIDIAN_ACTMON STATE 1;
Creating a Deadlock Event Monitor
Deadlocks should always be monitored:
CREATE EVENT MONITOR MERIDIAN_DEADLOCK
FOR LOCKING
WRITE TO TABLE
LOCK (TABLE MERIDIAN.LOCK_EVENTS
IN USERSPACE1
PCTDEACTIVATE 100)
LOCK_PARTICIPANTS (TABLE MERIDIAN.LOCK_PARTICIPANTS
IN USERSPACE1
PCTDEACTIVATE 100)
LOCK_PARTICIPANT_ACTIVITIES (TABLE MERIDIAN.LOCK_PART_ACTIVITIES
IN USERSPACE1
PCTDEACTIVATE 100)
AUTOSTART;
SET EVENT MONITOR MERIDIAN_DEADLOCK STATE 1;
Creating a Unit of Work Event Monitor
CREATE EVENT MONITOR MERIDIAN_UOW
FOR UNIT OF WORK
WRITE TO TABLE
AUTOSTART;
SET EVENT MONITOR MERIDIAN_UOW STATE 1;
Querying Event Monitor Data
Once event monitors are collecting data, query the target tables:
-- Find the slowest SQL statements from the activity event monitor
SELECT
APPL_ID,
UOW_ID,
ACTIVITY_ID,
SUBSTR(STMT_TEXT, 1, 200) AS SQL_TEXT,
TIME_STARTED,
TIME_COMPLETED,
TIMESTAMPDIFF(2, CHAR(TIME_COMPLETED - TIME_STARTED)) AS DURATION_SEC
FROM MERIDIAN.ACTIVITY_STMT_EVENTS
WHERE TIME_COMPLETED > CURRENT TIMESTAMP - 24 HOURS
ORDER BY DURATION_SEC DESC
FETCH FIRST 20 ROWS ONLY;
-- Find all deadlocks in the past 7 days
SELECT
XMLPARSE(DOCUMENT LOCK_EVENT_XML) AS DEADLOCK_INFO,
EVENT_TIMESTAMP
FROM MERIDIAN.LOCK_EVENTS
WHERE EVENT_TYPE = 'DEADLOCK'
AND EVENT_TIMESTAMP > CURRENT TIMESTAMP - 7 DAYS
ORDER BY EVENT_TIMESTAMP DESC;
Event Monitor Housekeeping
Event monitor tables grow continuously. Implement regular purging:
-- Purge activity events older than 30 days
DELETE FROM MERIDIAN.ACTIVITY_EVENTS
WHERE TIME_STARTED < CURRENT TIMESTAMP - 30 DAYS;
DELETE FROM MERIDIAN.ACTIVITY_STMT_EVENTS
WHERE TIME_STARTED < CURRENT TIMESTAMP - 30 DAYS;
DELETE FROM MERIDIAN.ACTIVITY_VAL_EVENTS
WHERE TIME_STARTED < CURRENT TIMESTAMP - 30 DAYS;
-- Purge lock events older than 90 days
DELETE FROM MERIDIAN.LOCK_EVENTS
WHERE EVENT_TIMESTAMP < CURRENT TIMESTAMP - 90 DAYS;
20.9 [LUW] db2diag.log and db2dump
When something goes wrong with DB2 LUW, the diagnostic log is the first place to look. Understanding how to navigate db2diag.log quickly and effectively separates experienced DBAs from novices.
db2diag.log Location
The diagnostic log is located in the directory specified by the DIAGPATH database manager configuration parameter:
db2 GET DBM CFG | grep DIAGPATH
Default location: $HOME/sqllib/db2dump/ (or the instance's DIAGPATH).
db2diag.log Format
Each entry in db2diag.log follows a standard format:
2025-01-15-14.23.45.123456-300 I567890H432 LEVEL: Warning
PID : 12345 TID : 140234567890 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
DB : MERIDIAN APPHDL: 0-1234 APPID: *LOCAL.db2inst1.250115192345
AUTHID : WEBUSER HOSTNAME: dbserver01
EDUID : 15 EDUNAME: db2agent (MERIDIAN) 0
FUNCTION: DB2 UDB, buffer pool services, sqlbAlterBufferPoolSize, probe:20
MESSAGE : ADM5502W The buffer pool "MERIDIAN_BP" is currently not started.
DATA #1 : String, 45 bytes
Buffer pool MERIDIAN_BP activation deferred.
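When you post-process entries yourself rather than rely on the db2diag tool, the header line parses cleanly with a regular expression. The parser below is an assumption based on the format shown above, not an IBM utility:

```python
# Illustrative parser for the first line of a db2diag.log entry.
# The regex is an assumption derived from the sample format shown
# above, not part of any IBM tooling.
import re

HEADER = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d+[+-]\d+)\s+"
    r"\S+\s+LEVEL:\s*(?P<level>\w+)"
)

line = "2025-01-15-14.23.45.123456-300 I567890H432 LEVEL: Warning"
m = HEADER.match(line)
print(m.group("ts"))     # 2025-01-15-14.23.45.123456-300
print(m.group("level"))  # Warning
```

With timestamps and severity extracted, filtering a day's log down to Error-and-above entries for a given window becomes a few lines of scripting.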
Severity Levels
| Level | Meaning | Action |
|---|---|---|
| Informational | Normal operations | Review periodically |
| Warning | Potential issue, not yet critical | Investigate within business hours |
| Error | Something failed | Investigate promptly |
| Severe | Major component failure | Investigate immediately |
| Critical | Database or instance may be compromised | Emergency response |
Filtering db2diag.log
The raw log can be enormous. Use db2diag (the diagnostic tool, not the log itself) to filter:
# Show only errors and above from the last 24 hours
db2diag -time 24h -level Error
# Show entries for a specific database
db2diag -db MERIDIAN -level Warning
# Show entries for a specific time range
db2diag -time "2025-01-15-14.00.00":"2025-01-15-15.00.00"
# Show entries matching a specific message ID
db2diag -msg ADM5502W
First Failure Data Capture (FFDC)
DB2 LUW implements First Failure Data Capture — when certain errors occur, DB2 automatically dumps diagnostic information to the db2dump directory. These files include:
- Trap files (*.trap): Generated on exceptions/crashes. Contains stack trace, register contents, and memory state.
- Lock list dump files: Generated on deadlocks when DIAGLEVEL is 3 or higher.
- Memory dump files (*.dump): Memory pool contents at time of error.
- Core files: Full process core dumps for severe failures.
DIAGLEVEL Configuration
# Check current diagnostic level
db2 GET DBM CFG | grep DIAGLEVEL
# Set diagnostic level (0=none, 1=severe, 2=errors, 3=warnings, 4=informational)
db2 UPDATE DBM CFG USING DIAGLEVEL 3
Recommended settings:
- Production: DIAGLEVEL 3 (captures warnings and above; a good balance)
- Performance investigation: DIAGLEVEL 4 temporarily (captures everything and increases log volume substantially)
- Never use DIAGLEVEL 0 in production: you lose all diagnostic data
db2support: Collecting Diagnostic Data
When you need to open a PMR (Problem Management Record) with IBM Support, use db2support to collect all relevant diagnostic data:
db2support /tmp/support_collection -d MERIDIAN -c
This collects:
- Database and instance configuration
- db2diag.log entries
- Operating system information
- Database layout and table space details
- Recent trap files and dump files
- Optimizer statistics
The output is a compressed archive ready to upload to IBM Support.
Practical db2diag.log Investigation
When investigating a problem, follow this sequence:
1. Determine the time window: When did the problem start? When did it end?
2. Filter by time and severity: db2diag -time "start":"end" -level Error
3. Look for the first error: Problems often cascade, and the first error is usually the root cause.
4. Check for correlated entries: Multiple entries with the same APPHDL or EDUID often tell a story.
5. Search for known message IDs: IBM documentation explains every message ID (e.g., SQL0911N, ADM1823E).
6. Check trap files: If a process crashed, the trap file contains the detailed failure information.
20.10 Key Metrics and What They Mean
This section provides a reference for the most important DB2 metrics, their healthy thresholds, and the corrective actions when thresholds are breached. These apply to both platforms unless noted.
Buffer Pool Hit Ratio
What it measures: The percentage of page requests satisfied from memory without requiring physical I/O.
Formula:
Hit Ratio = (1 - (Physical Reads / Logical Reads)) * 100
Thresholds:

| Workload | Warning | Critical |
|---|---|---|
| OLTP | < 97% | < 93% |
| Mixed | < 90% | < 80% |
| DSS/Batch | < 80% | < 60% |
Corrective actions:
1. Increase buffer pool size if memory is available
2. Review access paths; table scans on large tables destroy hit ratios
3. Check for index usage; missing indexes force physical I/O
4. Verify prefetch is working effectively
5. Consider separating workloads into different buffer pools
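The formula and thresholds above translate directly into a small automated check. A hedged Python sketch — the threshold values mirror the table above, and the workload labels are my own shorthand:

```python
# Warning and critical hit-ratio floors per workload, from the thresholds table
THRESHOLDS = {
    "OLTP":  (97.0, 93.0),
    "MIXED": (90.0, 80.0),
    "DSS":   (80.0, 60.0),
}

def hit_ratio(logical_reads: int, physical_reads: int) -> float:
    """Hit Ratio = (1 - Physical Reads / Logical Reads) * 100."""
    if logical_reads == 0:
        return 100.0          # no page requests yet: treat as perfect
    return (1.0 - physical_reads / logical_reads) * 100.0

def classify(ratio: float, workload: str) -> str:
    warning, critical = THRESHOLDS[workload]
    if ratio < critical:
        return "CRITICAL"
    if ratio < warning:
        return "WARNING"
    return "OK"

ratio = hit_ratio(logical_reads=1_000_000, physical_reads=50_000)  # 95.0
print(classify(ratio, "OLTP"))   # WARNING
print(classify(ratio, "MIXED"))  # OK
```

The same 95% hit ratio is a warning for OLTP but perfectly healthy for a mixed workload, which is why the thresholds must be workload-aware.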
Lock Wait Time
What it measures: Time threads spend waiting to acquire locks held by other threads.
Thresholds:

| Metric | Warning | Critical |
|---|---|---|
| Average lock wait (OLTP) | > 100ms | > 1 second |
| Lock timeout rate | > 0.1% of transactions | > 1% of transactions |
| Deadlock rate | > 1 per hour | > 1 per minute |
Corrective actions:
1. Identify the blocking transaction and determine if it can be tuned
2. Reduce transaction scope; commit more frequently
3. Ensure applications access tables in consistent order (prevents deadlocks)
4. Consider lock avoidance techniques (optimistic locking, SKIP LOCKED DATA)
5. Review isolation levels; can any be lowered safely?
Log Utilization
What it measures: How much of the transaction log is in use.
z/OS thresholds:

| Active logs full | Severity |
|---|---|
| 50% full | Normal |
| 67% full | Warning — investigate long-running URs |
| 83% full | Critical — resolve immediately |
| 100% full | Emergency — DB2 stops all update activity |
LUW thresholds:
Monitor LOG_UTILIZATION_PERCENT from MON_GET_TRANSACTION_LOG:
SELECT
LOG_UTILIZATION_PERCENT,
TOTAL_LOG_USED_KB,
TOTAL_LOG_AVAILABLE_KB
FROM TABLE(MON_GET_TRANSACTION_LOG(-2)) AS LOG;
| Utilization | Severity |
|---|---|
| < 50% | Normal |
| 50-75% | Warning |
| > 75% | Critical |
| > 90% | Emergency |
Corrective actions:
1. Identify long-running transactions consuming log space
2. Force commit or terminate the offending transaction if necessary
3. Increase log file size or number of log files for ongoing capacity
4. Review application logic for unnecessarily large transactions
5. Ensure log archiving (on LUW with LOGARCHMETH1) is functioning
Sort Overflows
What it measures: How often sort operations spill from memory to temporary disk space.
Formula:
Sort Overflow Percentage = (Sort Overflows / Total Sorts) * 100
Thresholds:

| Overflow Rate | Severity |
|---|---|
| < 2% | Acceptable |
| 2-5% | Warning |
| > 5% | Critical |
Corrective actions:
1. Increase SORTHEAP (LUW) or sort pool size (z/OS)
2. Review queries generating large sorts — can ORDER BY be eliminated or satisfied by an index?
3. Add indexes that provide the needed ordering
4. Check for hash joins spilling to disk
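As with the hit ratio, the overflow percentage is a two-counter computation that can be wired into an automated check. An illustrative Python sketch using the thresholds from the table above:

```python
def sort_overflow_pct(total_sorts: int, sort_overflows: int) -> float:
    """Sort Overflow Percentage = (Sort Overflows / Total Sorts) * 100."""
    if total_sorts == 0:
        return 0.0            # no sorts yet: nothing has overflowed
    return sort_overflows / total_sorts * 100.0

def sort_status(pct: float) -> str:
    if pct > 5.0:
        return "CRITICAL"
    if pct >= 2.0:
        return "WARNING"
    return "ACCEPTABLE"

print(sort_status(sort_overflow_pct(10_000, 150)))  # 1.5% -> ACCEPTABLE
print(sort_status(sort_overflow_pct(10_000, 700)))  # 7.0% -> CRITICAL
```

In practice the two counters come from TOTAL_SORTS and SORT_OVERFLOWS in MON_GET_DATABASE, as the daily health check later in this chapter shows.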
Package Cache Hit Ratio
What it measures: How often SQL statement preparation results are found in cache versus requiring fresh preparation.
Formula:
Pkg Cache Hit Ratio = (1 - (Pkg Cache Inserts / Pkg Cache Lookups)) * 100
Thresholds:

| Hit Ratio | Severity |
|---|---|
| > 80% | Healthy |
| 60-80% | Warning — possible literal SQL issue |
| < 60% | Critical — application likely not using parameter markers |
Corrective actions:
1. Review application SQL for literal values instead of parameter markers
2. Increase PCKCACHESZ (LUW) or EDM pool size (z/OS)
3. Enable statement concentrator (STMT_CONC = LITERALS) for short-term relief
4. Work with application teams to parameterize queries
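The formula above makes the literal-SQL problem easy to see numerically: every literal statement is a cache miss that forces an insert, so inserts approach lookups and the ratio collapses. A small illustrative computation (the counter values are invented for the example):

```python
def pkg_cache_hit_ratio(inserts: int, lookups: int) -> float:
    """Pkg Cache Hit Ratio = (1 - Inserts / Lookups) * 100."""
    if lookups == 0:
        return 100.0
    return (1.0 - inserts / lookups) * 100.0

# An application sending literal SQL re-prepares almost every statement:
print(pkg_cache_hit_ratio(inserts=9_000, lookups=10_000))  # 10.0 -> critical
# The same workload with parameter markers reuses cached plans:
print(pkg_cache_hit_ratio(inserts=500, lookups=10_000))    # 95.0 -> healthy
```

Nothing about the workload volume changed between the two cases; only parameterization did, which is why corrective action 1 comes before buying more cache memory.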
Agent Utilization [LUW]
What it measures: Percentage of available agents (connections) in use.
SELECT
   (AGENTS_REGISTERED * 100.0) / MAX_CONNECTIONS_TOP AS AGENT_UTILIZATION_PCT
FROM
   SYSIBMADM.SNAPDBM,
   (SELECT CAST(VALUE AS INTEGER) AS MAX_CONNECTIONS_TOP
    FROM SYSIBMADM.DBMCFG
    WHERE NAME = 'max_connections') AS CFG;
Thresholds:

| Utilization | Severity |
|---|---|
| < 70% | Normal |
| 70-85% | Warning — plan for growth |
| > 85% | Critical — connection pool or limit adjustment needed |
Corrective actions:
1. Ensure applications use connection pooling
2. Increase MAX_CONNECTIONS if system resources allow
3. Investigate connection leaks (connections opened but never closed)
4. Implement workload management to prioritize critical connections
20.11 Building a Monitoring Dashboard
A monitoring dashboard consolidates the key metrics from the previous sections into a single view that a DBA can assess in seconds. Whether you build it with a commercial tool, a custom web application, or even a simple terminal-based script, the principles are the same.
What to Display
Organize your dashboard into three tiers:
Tier 1 — Glance indicators (red/yellow/green):
- Database status (up/down)
- Buffer pool hit ratio
- Log utilization
- Active connections vs. limit
- Lock escalation count
- Last successful backup age

Tier 2 — Current performance (numeric values with trend arrows):
- Transactions per second
- Average response time
- Current lock wait time
- Sort overflow rate
- Package cache hit ratio
- I/O throughput (reads/writes per second)

Tier 3 — Detail panels (expandable):
- Top 10 active SQL statements by elapsed time
- Lock waits — who is waiting for whom
- Buffer pool detail by pool name
- Table space utilization
- Application connection list
Refresh Frequency
| Dashboard Tier | Refresh Interval | Rationale |
|---|---|---|
| Tier 1 (status) | 15-30 seconds | Must reflect current state |
| Tier 2 (performance) | 30-60 seconds | Smooths out transient spikes |
| Tier 3 (detail) | On-demand | Too expensive to refresh continuously |
Trend Visualization
Show at least 4 hours of trend data for each Tier 2 metric. This allows the DBA to see:
- Sudden spikes: Something just changed — investigate now
- Gradual degradation: Slowly worsening over hours — investigate before it becomes critical
- Periodic patterns: Normal workload cycles — no action needed if within baseline
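One simple way to separate these three patterns automatically is to compare the latest sample against a baseline window: a large jump over the recent mean flags a spike, while a steady rise between the first and second halves of the window flags a drift. A hedged Python sketch — the 3-sigma and 25% cutoffs are arbitrary illustrative choices, not DB2 defaults, and a production detector would also model the periodic baseline:

```python
from statistics import mean, stdev

def classify_trend(samples: list[float]) -> str:
    """Classify the latest sample against the preceding baseline window."""
    baseline, latest = samples[:-1], samples[-1]
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma > 0 and latest > mu + 3 * sigma:
        return "SPIKE"           # something just changed; investigate now
    # Crude slope check: compare the two halves of the window
    half = len(samples) // 2
    if mean(samples[half:]) > mean(samples[:half]) * 1.25:
        return "DEGRADING"       # slowly worsening; act before it is critical
    return "STABLE"

flat = [100, 102, 99, 101, 100, 103, 100, 98, 101, 100]
print(classify_trend(flat + [250]))                              # SPIKE
print(classify_trend([100, 105, 110, 120, 130, 140, 150, 160]))  # DEGRADING
print(classify_trend(flat))                                      # STABLE
```

Even this crude classifier captures the key operational distinction: a spike demands attention now, while degradation buys you time to investigate before the metric crosses a critical threshold.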
Alerting Rules
Configure alerts with hysteresis to avoid alert storms:
IF metric > critical_threshold FOR 3 consecutive intervals:
SEND critical alert
DO NOT re-alert for 15 minutes
IF metric > warning_threshold FOR 5 consecutive intervals:
SEND warning alert
DO NOT re-alert for 30 minutes
IF metric returns below warning_threshold FOR 3 consecutive intervals:
SEND recovery notification
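The hysteresis rules above map naturally onto a tiny state machine that counts consecutive breaches and suppresses repeat alerts while a condition is already active. A Python sketch of the idea — the Alerter class and its method names are my own, and the re-alert suppression window is simplified to "no repeats while still alerting":

```python
class Alerter:
    """Fire an alert after N consecutive breaches; recover after M clears."""

    def __init__(self, threshold: float, breach_intervals: int = 3,
                 clear_intervals: int = 3):
        self.threshold = threshold
        self.breach_intervals = breach_intervals
        self.clear_intervals = clear_intervals
        self.breaches = 0
        self.clears = 0
        self.alerting = False

    def observe(self, value: float):
        """Feed one sample per interval; returns 'ALERT', 'RECOVERED', or None."""
        if value > self.threshold:
            self.breaches += 1
            self.clears = 0
            if not self.alerting and self.breaches >= self.breach_intervals:
                self.alerting = True      # suppresses re-alerts while active
                return "ALERT"
        else:
            self.clears += 1
            self.breaches = 0
            if self.alerting and self.clears >= self.clear_intervals:
                self.alerting = False
                return "RECOVERED"
        return None

mon = Alerter(threshold=75.0)            # e.g. log utilization percent
samples = [60, 80, 82, 85, 90, 88, 70, 65, 60]
events = [mon.observe(v) for v in samples]
print([e for e in events if e])          # exactly one ALERT, one RECOVERED
```

Note what the counters buy you: the single 80% reading at interval 2 does not page anyone, and the sustained breach from intervals 2-6 produces exactly one alert instead of five. That is the alert storm the hysteresis rules are designed to prevent.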
Alert delivery channels:
- Email: For warnings and non-urgent items
- SMS/Pager: For critical alerts requiring immediate response
- Dashboard color change: For real-time visual monitoring
- Ticketing system integration: For automated incident creation
Sample Dashboard Data Collection Query [LUW]
-- Comprehensive dashboard data collection
SELECT
'BUFFER_POOL' AS METRIC_CATEGORY,
BP_NAME AS METRIC_NAME,
CASE WHEN (POOL_DATA_L_READS + POOL_INDEX_L_READS) > 0
THEN DEC((1.0 - FLOAT(POOL_DATA_P_READS + POOL_INDEX_P_READS) /
FLOAT(POOL_DATA_L_READS + POOL_INDEX_L_READS)) * 100, 5, 2)
ELSE 100.00
END AS METRIC_VALUE,
CURRENT TIMESTAMP AS CAPTURE_TIME
FROM TABLE(MON_GET_BUFFERPOOL('', -2)) AS BP
UNION ALL
SELECT
'LOG_UTILIZATION',
'LOG_USED_PCT',
DEC(LOG_UTILIZATION_PERCENT, 5, 2),
CURRENT TIMESTAMP
FROM TABLE(MON_GET_TRANSACTION_LOG(-2)) AS LOG
UNION ALL
SELECT
'CONNECTIONS',
'ACTIVE_CONNECTIONS',
DEC(COUNT(*), 5, 2),
CURRENT TIMESTAMP
FROM TABLE(MON_GET_CONNECTION(NULL, -2)) AS CONN
UNION ALL
SELECT
'LOCK_WAITS',
'TOTAL_LOCK_WAITS',
DEC(SUM(LOCK_WAITS), 5, 2),
CURRENT TIMESTAMP
FROM TABLE(MON_GET_CONNECTION(NULL, -2)) AS CONN;
20.12 The Meridian Bank Monitoring Setup
Now we bring everything together for Meridian National Bank. This section defines the complete monitoring configuration across both platforms, including specific thresholds, collection intervals, alerting rules, and a daily health check.
z/OS Monitoring Configuration
Always-on traces:
-- Statistics trace — captures system-wide metrics every 15 minutes
-START TRACE(STAT) CLASS(1,3,4,5,6) DEST(SMF)
-- Accounting trace — captures per-thread detail
-START TRACE(ACCTG) CLASS(1,2,3) DEST(SMF)
Scheduled DISPLAY commands (run via automation):
-- Every 5 minutes during online hours:
-DIS BUFFERPOOL(*) DETAIL
-DIS LOG
-DIS DATABASE(MERIDIANDB) SPACENAM(*) RESTRICT
-- Every 15 minutes:
-DIS THREAD(*) DETAIL
-- Every hour:
-DIS UTIL(*)
OMEGAMON thresholds for Meridian:
| Metric | Warning | Critical | Action |
|---|---|---|---|
| BP0 Hit Ratio | < 97% | < 93% | Page DBA |
| Active Logs Full | 4 of 6 | 5 of 6 | Page DBA + check URs |
| Thread Elapsed > 60s | Any | > 120s | Alert DBA |
| Lock Timeout | > 5/hour | > 20/hour | Alert DBA |
| Deadlock | Any | > 5/hour | Alert DBA + start perf trace |
| RECP/GRECP Status | Any | N/A | Emergency alert |
LUW Monitoring Configuration
Database monitor configuration:
-- Enable monitoring switches permanently
UPDATE DBM CFG USING
HEALTH_MON ON
DFT_MON_BUFPOOL ON
DFT_MON_LOCK ON
DFT_MON_SORT ON
DFT_MON_STMT ON
DFT_MON_TABLE ON
DFT_MON_TIMESTAMP ON
DFT_MON_UOW ON;
Event monitors:
-- Deadlock/locking event monitor (always on)
CREATE EVENT MONITOR MERIDIAN_LOCK_MON
FOR LOCKING
WRITE TO TABLE
AUTOSTART;
-- Activity event monitor threshold for slow queries (> 5 seconds)
CREATE THRESHOLD MERIDIAN_SLOW_SQL
FOR DATABASE ACTIVITIES
ENFORCEMENT DATABASE
WHEN ACTIVITYTOTALTIME > 5 SECONDS
COLLECT ACTIVITY DATA WITH DETAILS AND VALUES
CONTINUE;
CREATE EVENT MONITOR MERIDIAN_ACT_MON
FOR ACTIVITIES
WRITE TO TABLE
AUTOSTART;
Continuous db2pd collection:
# Buffer pool monitoring — every 60 seconds, output to daily file
db2pd -db MERIDIAN -bufferpools -repeat 60 \
-file /db2data/monitoring/bp_$(date +%Y%m%d).out &
# Lock monitoring — every 30 seconds during online hours
db2pd -db MERIDIAN -locks showlocks -repeat 30 \
-file /db2data/monitoring/locks_$(date +%Y%m%d).out &
Buffer Pool Monitoring Detail
For Meridian Bank, we define four buffer pools with specific monitoring thresholds:
| Buffer Pool | Page Size | Size | Purpose | Hit Ratio Threshold |
|---|---|---|---|---|
| MERIDIAN_BP_DATA | 8K | 50,000 pages | General data | > 97% |
| MERIDIAN_BP_IDX | 8K | 30,000 pages | Indexes | > 99% |
| MERIDIAN_BP_LOB | 32K | 10,000 pages | LOB data | > 80% |
| MERIDIAN_BP_TEMP | 8K | 20,000 pages | Temp tables | N/A (no caching) |
-- Buffer pool monitoring query for Meridian dashboard
SELECT
BP_NAME,
POOL_DATA_L_READS AS DATA_LOGICAL,
POOL_DATA_P_READS AS DATA_PHYSICAL,
POOL_INDEX_L_READS AS INDEX_LOGICAL,
POOL_INDEX_P_READS AS INDEX_PHYSICAL,
CASE WHEN (POOL_DATA_L_READS + POOL_INDEX_L_READS) > 0
THEN DEC((1.0 - FLOAT(POOL_DATA_P_READS + POOL_INDEX_P_READS) /
FLOAT(POOL_DATA_L_READS + POOL_INDEX_L_READS)) * 100, 5, 2)
ELSE 100.00
END AS HIT_RATIO,
POOL_ASYNC_DATA_READS AS PREFETCH_DATA,
POOL_ASYNC_INDEX_READS AS PREFETCH_INDEX,
POOL_DATA_WRITES + POOL_INDEX_WRITES AS TOTAL_WRITES,
POOL_ASYNC_DATA_WRITES + POOL_ASYNC_INDEX_WRITES AS ASYNC_WRITES
FROM TABLE(MON_GET_BUFFERPOOL('', -2)) AS BP
WHERE BP_NAME LIKE 'MERIDIAN%'
ORDER BY BP_NAME;
Transaction Volume Tracking
-- Hourly transaction volume tracking
INSERT INTO MERIDIAN.TRANSACTION_METRICS
SELECT
CURRENT DATE AS METRIC_DATE,
HOUR(CURRENT TIMESTAMP) AS METRIC_HOUR,
SUM(COMMIT_SQL_STMTS) AS TOTAL_COMMITS,
SUM(ROLLBACK_SQL_STMTS) AS TOTAL_ROLLBACKS,
SUM(SELECT_SQL_STMTS) AS TOTAL_SELECTS,
SUM(UID_SQL_STMTS) AS TOTAL_UPDATES,
SUM(ROWS_READ) AS TOTAL_ROWS_READ,
SUM(ROWS_RETURNED) AS TOTAL_ROWS_RETURNED
FROM TABLE(MON_GET_CONNECTION(NULL, -2)) AS CONN;
Lock Contention Alerts
-- Check for active lock waits exceeding threshold
SELECT
HLD.APPLICATION_HANDLE AS BLOCKER_HANDLE,
HLD.AGENT_ID AS BLOCKER_AGENT,
REQ.APPLICATION_HANDLE AS WAITER_HANDLE,
REQ.AGENT_ID AS WAITER_AGENT,
REQ.LOCK_WAIT_START_TIME,
TIMESTAMPDIFF(2, CHAR(CURRENT TIMESTAMP - REQ.LOCK_WAIT_START_TIME))
AS WAIT_SECONDS,
REQ.LOCK_NAME,
REQ.LOCK_MODE AS REQUESTED_MODE,
HLD.LOCK_MODE AS HELD_MODE,
SUBSTR(REQ.TABSCHEMA, 1, 20) AS SCHEMA,
SUBSTR(REQ.TABNAME, 1, 30) AS TABLE_NAME
FROM
TABLE(MON_GET_APPL_LOCKWAIT(NULL, -2)) AS REQ,
TABLE(MON_GET_LOCKS(NULL, -2)) AS HLD
WHERE REQ.LOCK_NAME = HLD.LOCK_NAME
AND REQ.APPLICATION_HANDLE <> HLD.APPLICATION_HANDLE
AND HLD.LOCK_STATUS = 'G'
ORDER BY WAIT_SECONDS DESC;
Log Space Monitoring
-- Log space monitoring with alerting threshold
SELECT
CASE
WHEN LOG_UTILIZATION_PERCENT > 75 THEN 'CRITICAL'
WHEN LOG_UTILIZATION_PERCENT > 50 THEN 'WARNING'
ELSE 'NORMAL'
END AS ALERT_LEVEL,
LOG_UTILIZATION_PERCENT,
TOTAL_LOG_USED_KB,
TOTAL_LOG_AVAILABLE_KB,
APPLID_HOLDING_OLDEST_XACT AS OLDEST_TRANSACTION_HANDLE,
CURRENT TIMESTAMP AS CHECK_TIME
FROM TABLE(MON_GET_TRANSACTION_LOG(-2)) AS LOG;
Daily Health Check Script
The crown jewel of Meridian's monitoring infrastructure is the daily health check — a comprehensive script that runs every morning at 6 AM and produces a report the DBA team reviews before the business day begins.
-- ============================================================
-- MERIDIAN NATIONAL BANK — DAILY HEALTH CHECK
-- Run daily at 06:00 via scheduler
-- ============================================================
-- Section 1: Database Status
SELECT 'DATABASE STATUS' AS CHECK_CATEGORY,
DB_NAME, DB_STATUS, DB_ACTIVATION_STATE,
LAST_BACKUP,
TIMESTAMPDIFF(8, CHAR(CURRENT TIMESTAMP - LAST_BACKUP))
AS HOURS_SINCE_BACKUP
FROM TABLE(MON_GET_DATABASE(-2)) AS DB;
-- Section 2: Buffer Pool Health
SELECT 'BUFFER POOL HEALTH' AS CHECK_CATEGORY,
BP_NAME,
CASE WHEN (POOL_DATA_L_READS + POOL_INDEX_L_READS) > 0
THEN DEC((1.0 - FLOAT(POOL_DATA_P_READS + POOL_INDEX_P_READS) /
FLOAT(POOL_DATA_L_READS + POOL_INDEX_L_READS)) * 100, 5, 2)
ELSE 100.00
END AS HIT_RATIO_PCT,
POOL_DATA_L_READS + POOL_INDEX_L_READS AS TOTAL_LOGICAL_READS,
POOL_DATA_P_READS + POOL_INDEX_P_READS AS TOTAL_PHYSICAL_READS
FROM TABLE(MON_GET_BUFFERPOOL('', -2)) AS BP
ORDER BY BP_NAME;
-- Section 3: Table Space Utilization
SELECT 'TABLESPACE UTILIZATION' AS CHECK_CATEGORY,
TBSP_NAME,
TBSP_TYPE,
TBSP_UTILIZATION_PERCENT AS UTILIZATION_PCT,
TBSP_TOTAL_SIZE_KB / 1024 AS TOTAL_SIZE_MB,
TBSP_USED_SIZE_KB / 1024 AS USED_SIZE_MB,
TBSP_FREE_SIZE_KB / 1024 AS FREE_SIZE_MB,
CASE WHEN TBSP_UTILIZATION_PERCENT > 90 THEN 'CRITICAL'
WHEN TBSP_UTILIZATION_PERCENT > 80 THEN 'WARNING'
ELSE 'OK'
END AS STATUS
FROM TABLE(MON_GET_TABLESPACE('', -2)) AS TS
ORDER BY TBSP_UTILIZATION_PERCENT DESC;
-- Section 4: Log Utilization
SELECT 'LOG UTILIZATION' AS CHECK_CATEGORY,
LOG_UTILIZATION_PERCENT,
TOTAL_LOG_USED_KB / 1024 AS LOG_USED_MB,
TOTAL_LOG_AVAILABLE_KB / 1024 AS LOG_AVAILABLE_MB
FROM TABLE(MON_GET_TRANSACTION_LOG(-2)) AS LOG;
-- Section 5: Lock Summary (past 24 hours)
SELECT 'LOCK SUMMARY' AS CHECK_CATEGORY,
SUM(LOCK_WAITS) AS TOTAL_LOCK_WAITS,
SUM(LOCK_TIMEOUTS) AS TOTAL_LOCK_TIMEOUTS,
SUM(DEADLOCKS) AS TOTAL_DEADLOCKS,
SUM(LOCK_ESCALS) AS TOTAL_LOCK_ESCALATIONS
FROM TABLE(MON_GET_DATABASE(-2)) AS DB;
-- Section 6: Top 10 Most Expensive Queries
SELECT 'TOP QUERIES' AS CHECK_CATEGORY,
SUBSTR(STMT_TEXT, 1, 100) AS SQL_PREVIEW,
NUM_EXECUTIONS,
DEC(TOTAL_ACT_TIME / 1000.0, 12, 2) AS TOTAL_TIME_SEC,
DEC(CASE WHEN NUM_EXECUTIONS > 0
THEN (TOTAL_ACT_TIME / NUM_EXECUTIONS) / 1000.0
ELSE 0 END, 12, 4) AS AVG_TIME_SEC,
ROWS_READ,
ROWS_RETURNED
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS STMT
WHERE NUM_EXECUTIONS > 0
ORDER BY TOTAL_ACT_TIME DESC
FETCH FIRST 10 ROWS ONLY;
-- Section 7: Connection Summary
SELECT 'CONNECTION SUMMARY' AS CHECK_CATEGORY,
COUNT(*) AS TOTAL_CONNECTIONS,
SUM(CASE WHEN APPLICATION_STATUS = 'UOWWAIT' THEN 1 ELSE 0 END) AS IDLE,
SUM(CASE WHEN APPLICATION_STATUS = 'UOWEXEC' THEN 1 ELSE 0 END) AS ACTIVE,
SUM(CASE WHEN APPLICATION_STATUS = 'LOCKWAIT' THEN 1 ELSE 0 END) AS LOCK_WAITING
FROM TABLE(MON_GET_CONNECTION(NULL, -2)) AS CONN;
-- Section 8: Sort Performance
SELECT 'SORT PERFORMANCE' AS CHECK_CATEGORY,
TOTAL_SORTS,
SORT_OVERFLOWS,
CASE WHEN TOTAL_SORTS > 0
THEN DEC((FLOAT(SORT_OVERFLOWS) / FLOAT(TOTAL_SORTS)) * 100, 5, 2)
ELSE 0.00
END AS OVERFLOW_PCT,
TOTAL_SORT_TIME
FROM TABLE(MON_GET_DATABASE(-2)) AS DB;
This health check report should be reviewed daily by the on-call DBA. Any metric flagged as WARNING or CRITICAL should be investigated before business hours begin. Over time, the team will develop an intuition for what is normal and what requires attention — but only if they review the report consistently.
Spaced Review: Connecting to Earlier Chapters
From Chapter 3: Your First Queries
In Chapter 3, you wrote your first SELECT statement against the ACCOUNTS table. Now you can monitor how that query performs:
-- Find your query in the package cache
SELECT
SUBSTR(STMT_TEXT, 1, 200) AS SQL_TEXT,
NUM_EXECUTIONS,
TOTAL_ACT_TIME,
ROWS_READ,
ROWS_RETURNED
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS STMT
WHERE STMT_TEXT LIKE '%ACCOUNTS%'
ORDER BY TOTAL_ACT_TIME DESC;
Every query you learned to write in Chapter 3 has a performance profile. Monitoring teaches you to see that profile and optimize it.
Review question: If a simple SELECT on the ACCOUNTS table shows ROWS_READ = 1,000,000 but ROWS_RETURNED = 5, what does this tell you? (Answer: The query is performing a full table scan when it should be using an index. The monitoring data just revealed a missing or unused index.)
From Chapter 15: Query Optimization
Chapter 15 taught you about access paths, index design, and the optimizer. Monitoring is how you verify that the optimizer's choices are working in production:
- Buffer pool hit ratios tell you whether the optimizer's choice to use an index is actually reducing I/O
- Sort overflow metrics tell you whether the optimizer's sort estimates are accurate
- Lock wait metrics tell you whether the isolation level chosen by the optimizer is causing contention
Review question: You used EXPLAIN in Chapter 15 to analyze a query plan. How would you find, through monitoring, whether a query that looked good in EXPLAIN is actually performing poorly in production? (Answer: Use MON_GET_PKG_CACHE_STMT to find the query and compare actual ROWS_READ to estimated rows. Large discrepancies indicate stale statistics or optimizer estimation errors.)
From Chapter 17: Utilities
Chapter 17 covered RUNSTATS, REORG, COPY, and RECOVER. Monitoring tells you when these utilities are needed:
- Declining buffer pool hit ratios after many updates may indicate the need for REORG
- Increasing sort overflows may indicate stale statistics requiring RUNSTATS
- COPY pending status in DISPLAY DATABASE output means a backup is required
- Growing table space utilization trends tell you when to schedule a REORG with RECLAIM
Review question: After running RUNSTATS in Chapter 17, how would you verify through monitoring that the statistics update actually improved query performance? (Answer: Compare MON_GET_PKG_CACHE_STMT metrics before and after — look for reduced TOTAL_ACT_TIME, fewer ROWS_READ, and a higher ratio of ROWS_RETURNED to ROWS_READ for affected queries.)
Chapter Summary
Monitoring is not a feature — it is a discipline. The tools described in this chapter are only as effective as the DBA who uses them consistently, interprets their output correctly, and acts on the findings promptly.
On z/OS, your monitoring toolkit includes:
- DISPLAY commands for real-time status (-DIS DATABASE, -DIS THREAD, -DIS BUFFERPOOL, -DIS LOG, -DIS UTILITY)
- Traces and IFCIDs for detailed data collection (statistics, accounting, performance)
- SMF records (100, 101, 102) for historical analysis and trend reporting
- OMEGAMON or third-party monitors for enterprise-grade alerting and dashboards
On LUW, your monitoring toolkit includes:
- db2pd for zero-overhead real-time snapshots (bufferpools, locks, transactions, applications)
- MON_GET table functions for SQL-queryable monitoring data
- Event monitors for capturing transient events (deadlocks, slow queries, locking)
- db2diag.log for diagnostic investigation and problem determination
- db2support for collecting diagnostic data for IBM Support
On both platforms, the principles are the same:
1. Establish baselines during known-good operations
2. Monitor continuously against those baselines
3. Alert when thresholds are breached, before users notice
4. Investigate with the right tool for the situation
5. Document findings and update baselines as workloads evolve
At Meridian National Bank, we built a complete monitoring infrastructure: always-on traces, event monitors, continuous db2pd collection, buffer pool monitoring per pool, lock contention alerting, log space monitoring, and a daily health check that the DBA team reviews every morning. This is not gold-plating — it is the minimum standard for a production banking system.
The best DBAs do not wait for the phone to ring. They already know what is happening in their databases, every hour of every day. This chapter gave you the tools and the knowledge. Now it is up to you to build the discipline.
Next chapter: Chapter 21 — The DB2 Catalog and Directory: Understanding the System's Metadata