> "The best DBAs fix problems before users notice them. The secret? Monitoring that tells you what's happening, not just what happened."

Chapter 20: Monitoring and Problem Determination — Knowing What's Wrong Before the Phone Rings

Every DBA remembers the first time their phone rang at 2 AM with a production crisis. The application team is panicking, end users are locked out, and management wants answers in minutes. You log in, stare at a system you barely understand at that hour, and start guessing. Maybe it is locks. Maybe the log is full. Maybe a utility is running that should have finished hours ago.

Now imagine a different scenario. Your phone never rings because at 11:47 PM, your monitoring system detected that lock wait times were climbing. It sent you an alert. You logged in, found a long-running transaction holding exclusive locks on the ACCOUNTS table, and resolved it — all before any user noticed. That is the difference between reactive and proactive monitoring. That is what this chapter teaches you to build.

Monitoring is not a task you perform occasionally. It is the heartbeat of your daily DBA practice. On z/OS, you will live inside DISPLAY commands, trace records, and SMF data. On LUW, db2pd, snapshot monitors, and MON_GET table functions become your constant companions. On both platforms, you need to know what "normal" looks like so you can recognize "abnormal" the instant it appears.

At Meridian National Bank, we are going to build a monitoring infrastructure from scratch — one that watches buffer pools, tracks transaction volumes, alerts on lock contention, monitors log space, and generates a daily health check report. By the end of this chapter, you will have the tools, the queries, and the mindset to be the DBA who fixes things before anyone else even knows there is a problem.


20.1 The Monitoring Mindset

Reactive vs. Proactive Monitoring

Most organizations begin with reactive monitoring: something breaks, somebody calls, and the DBA investigates. This approach has predictable consequences — extended outages, angry users, and DBAs who dread their on-call rotations.

Proactive monitoring inverts this model. Instead of waiting for failure, you continuously collect metrics, compare them against known baselines, and alert when deviations exceed thresholds. The goal is simple: detect the problem before it becomes a crisis.

| Characteristic | Reactive | Proactive |
|----------------|----------|-----------|
| Trigger | User complaint or application error | Threshold breach or trend deviation |
| Response time | Minutes to hours after impact | Seconds to minutes before impact |
| Data available | Whatever you can gather post-mortem | Historical baselines and real-time feeds |
| DBA stress level | Extremely high | Managed and controlled |
| Business impact | Downtime, data loss risk | Minimal or zero user impact |

What to Monitor

Not everything deserves equal attention. Focus your monitoring on these five pillars:

  1. Availability — Is the database up? Can applications connect? Are subsystem components active?
  2. Performance — Are response times within acceptable bounds? Are buffer pools effective? Are sorts spilling to disk?
  3. Capacity — How much disk space remains? How full are the logs? Are table spaces approaching limits?
  4. Concurrency — Are locks being held too long? Are deadlocks occurring? Are threads waiting?
  5. Integrity — Are backups completing? Are utilities finishing on time? Are there any data inconsistencies?

Establishing Baselines

You cannot detect anomalies without knowing what normal looks like. A baseline is a collection of metric values captured during known-good operating conditions. Here is the process:

  1. Collect metrics during normal operations over a representative period (at least two weeks, ideally including month-end processing).
  2. Segment by time period — weekday vs. weekend, business hours vs. batch window, month-end vs. mid-month.
  3. Calculate statistical ranges — mean, standard deviation, 95th percentile for each metric.
  4. Document the baseline — store it where your monitoring tools can reference it.
  5. Review and refresh — baselines become stale as workloads evolve. Refresh quarterly at minimum.

For Meridian National Bank, we would establish separate baselines for:

  • Online banking hours (6 AM to 10 PM): high transaction volume, many short queries
  • Batch processing window (11 PM to 5 AM): large table scans, heavy I/O, utility operations
  • Month-end processing: elevated volumes across all metrics for 2-3 days
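The statistics in step 3 are easy to script. Here is a minimal Python sketch that turns raw samples into the mean, standard deviation, and 95th percentile a baseline needs; the segment names and hourly transaction counts are hypothetical illustrations, not Meridian data:

```python
import statistics

def baseline(samples):
    """Summarize one metric's known-good samples: mean, stddev, 95th percentile."""
    ordered = sorted(samples)
    # Nearest-rank 95th percentile keeps the sketch dependency-free.
    p95 = ordered[min(len(ordered) - 1, int(round(0.95 * len(ordered))))]
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.pstdev(samples),
        "p95": p95,
    }

# Hypothetical hourly transaction counts, segmented as the text suggests.
segments = {
    ("weekday", "online"): [4200, 4400, 4100, 4350, 4500, 4250],
    ("weekday", "batch"):  [900, 950, 870, 910],
}
baselines = {seg: baseline(vals) for seg, vals in segments.items()}
```

Store the resulting numbers per segment; the alerting examples later in this chapter compare live readings against exactly these values.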


20.2 [z/OS] DISPLAY Commands

The DISPLAY command family is the z/OS DBA's primary real-time monitoring tool. These commands are issued through the DB2 command prefix (typically a hyphen followed by the subsystem name) or through DB2I panels. Every z/OS DBA should be able to type these commands from memory.

-DISPLAY DATABASE

This command shows the status of databases, table spaces, and index spaces.

-DIS DATABASE(MERIDIANDB) SPACENAM(*) LOCKS

Output interpretation:

DSNT360I  ***************************************************
DSNT361I  *  DISPLAY DATABASE SUMMARY
           *    GLOBAL
DSNT362I  DATABASE = MERIDIANDB  STATUS = RW
           DBD LENGTH = 32768
DSNT397I -
NAME       TYPE  PART  STATUS          PHYERRLO  PHYERRHI  CATALOG  PIECE
--------   ----  ----  --------------  --------  --------  -------  -----
ACCTTS     TS          RW
ACCTIX1    IX          RW
CUSTTS     TS          RW,COPY
CUSTIX1    IX          RW
TRANSTS    TS          RW
TRANSIX1   IX          RW
LOANSTS    TS          RW,CHKP
******* DISPLAY OF DATABASE MERIDIANDB ENDED

Key status codes to watch:

| Status | Meaning | Action Required |
|--------|---------|-----------------|
| RW | Read-Write, normal | None |
| RO | Read-Only | Investigate why; may be intentional |
| UT | Utility running | Check which utility |
| STOP | Stopped | Restart if needed |
| COPY | Copy pending | Run COPY utility |
| CHKP | Check pending | Run CHECK utility |
| RECP | Recover pending | Immediate attention required |
| RBDP | Rebuild pending | Rebuild indexes |
| LPL | Logical page list entries | Pages may be inconsistent |
| GRECP | Group recover pending | Data sharing recovery needed |

Critical alert: Any status containing RECP (Recover Pending) demands immediate investigation. This means DB2 has detected a potential inconsistency that requires a RECOVER utility execution before the object can be safely used.
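Pending states are easy to scan for programmatically. The Python sketch below flags any table space or index space whose status contains a state that needs action; the alert set comes from the table above, and the parsing assumes the DSNT397I column layout shown in the sample output:

```python
# Statuses from the table above that require DBA action.
ALERT_STATUSES = {"RECP", "COPY", "CHKP", "RBDP", "GRECP", "LPL", "STOP"}

def flag_objects(display_output):
    """Scan -DIS DATABASE output lines shaped like: NAME TYPE [PART] STATUS."""
    flagged = []
    for line in display_output.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[1] in ("TS", "IX"):
            statuses = set(parts[-1].split(","))   # e.g. "RW,COPY" -> {RW, COPY}
            hits = statuses & ALERT_STATUSES
            if hits:
                flagged.append((parts[0], sorted(hits)))
    return flagged

sample = """ACCTTS     TS          RW
CUSTTS     TS          RW,COPY
LOANSTS    TS          RW,CHKP
ACCTIX1    IX          RW"""
flagged = flag_objects(sample)   # CUSTTS needs COPY, LOANSTS needs CHECK
```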

-DISPLAY THREAD

Thread display shows active connections and what they are doing.

-DIS THREAD(*) DETAIL

Sample output:

DSNT360I  ***************************************************
DSNV401I  SSID = DB2P  ACTIVE THREADS -
DSNV402I  NAME    ST  A  REQ-CT      AUTH-ID  PLAN      ASID  TOKEN
--------  --  -  ----------  --------  --------  ----  -----
DISTSERV  TR  *  125432      WEBUSER   BNKPLAN   003A  22451
BATCH1    TR  N  4523891     BATCHID   BATPLAN   004B  22390
TSO       TR  *  342         DBADM01   DSNTEP71  0022  22467

Column interpretation:

  • ST (Status): TR = active thread, QW = queued waiting, QD = queued disabled
  • A (Active): * = currently executing in DB2, N = not currently executing
  • REQ-CT: Number of DB2 requests made — a very high count on an active thread suggests a long-running operation
  • PLAN: Bound plan name — identifies the application

When to use this command:

  • When investigating lock contention — identify who holds locks
  • When connection limits are approached — count active threads
  • During performance investigations — find long-running threads
  • Before maintenance — verify no critical threads are active

-DISPLAY BUFFERPOOL

Buffer pool monitoring is arguably the single most important performance metric.

-DIS BUFFERPOOL(BP0) DETAIL

Sample output (abbreviated):

DSNT360I  ***************************************************
DSNB401I  BUFFERPOOL NAME BP0, BUFFERPOOL ID 0, USE COUNT 15
DSNB402I  VIRTUAL BUFFER POOL SIZE =    50000
DSNB404I  BUFFER POOL HIT RATIO  =  98.5
DSNB406I  GETPAGES =  1523456789  RANDOM GETPAGES  =  423567890
DSNB407I  SYNC READ I/O (RANDOM) =    6123456  (SEQ) =    245678
DSNB409I  PAGES WRITTEN =    3456789
DSNB410I  SEQUENTIAL PREFETCH -
           REQUESTS =    567890  PAGES READ  =   4523456
DSNB411I  LIST PREFETCH -
           REQUESTS =     23456  PAGES READ  =    189012
DSNB412I  DYNAMIC PREFETCH -
           REQUESTS =    123456  PAGES READ  =    987654
DSNB415I  PREFETCH I/O - SEQUENTIAL =    34567  LIST =    4512  DYNAMIC =    12345

Critical metrics:

| Metric | Healthy Threshold | Action if Breached |
|--------|-------------------|--------------------|
| Hit Ratio | > 95% for OLTP, > 80% for batch | Increase pool size, review access paths |
| Sync Read I/O (Random) | Low relative to GETPAGES | Check for inefficient queries |
| Prefetch Disabled | Should be 0 or near 0 | Increase prefetch thresholds |
| Sequential Steal | Should be 0 | Buffer pool is too small |

Formula for hit ratio:

Hit Ratio = (1 - (Sync Read I/O / GETPAGES)) * 100

If your hit ratio drops below 95% during OLTP hours, investigate immediately. A drop from 99% to 95% represents a fivefold increase in physical I/O.
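To see why, plug round numbers into the formula. A quick Python check with illustrative values:

```python
def hit_ratio(sync_read_io, getpages):
    # The chapter's formula: (1 - Sync Read I/O / GETPAGES) * 100
    return (1.0 - sync_read_io / getpages) * 100.0

# At a fixed getpage rate, a drop from 99% to 95% is a fivefold jump in
# physical reads: 10,000 misses vs 50,000 misses per million getpages.
healthy  = hit_ratio(10_000, 1_000_000)
degraded = hit_ratio(50_000, 1_000_000)
```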

-DISPLAY LOG

Log monitoring prevents one of the most catastrophic DB2 events: running out of active log space.

-DIS LOG

Sample output:

DSNT360I  ***************************************************
DSNJ371I  CURRENT COPY 1 ACTIVE LOG DATA SET IS DSNAME=DB2P.LOGCOPY1.DS005
DSNJ372I  CURRENT COPY 2 ACTIVE LOG DATA SET IS DSNAME=DB2P.LOGCOPY2.DS005
DSNJ375I  THE CURRENT ACTIVE LOG RBA RANGE IS
           FROM  0000034A2B6C0000  TO  0000034A3F8E0000
DSNJ376I  THE ARCHIVE LOG RBA RANGE IS
           FROM  000002F100000000  TO  0000034A2B6BFFFF
DSNJ379I  NUMBER OF ACTIVE LOG DATA SETS  =  6
DSNJ380I  FULL  - ACTIVE LOG DATA SETS    =  3

When 5 of 6 active log data sets are full, you are in danger. DB2 will stall all update activity if it runs out of active log space. Monitor the ratio of full to total active logs continuously.
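That ratio is trivial to automate once you extract the DSNJ379I/DSNJ380I counts from -DIS LOG output. A sketch of the check; the 50% and 80% alert points are illustrative choices for Meridian, not DB2 defaults:

```python
def log_space_alert(full, total, warn_ratio=0.5, crit_ratio=0.8):
    """Turn -DIS LOG counts (DSNJ379I total, DSNJ380I full) into an alert level.

    warn_ratio/crit_ratio are site-chosen thresholds, not DB2 defaults.
    """
    ratio = full / total
    if ratio >= crit_ratio:
        return "CRITICAL"
    if ratio >= warn_ratio:
        return "WARNING"
    return "OK"

# The sample output above shows 3 of 6 active log data sets full.
state = log_space_alert(3, 6)
```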

-DISPLAY UTILITY

-DIS UTIL(*)

This shows all running utilities, their status, and progress. Watch for:

  • Utilities in STOPPED status that need restart
  • Utilities running longer than expected (compare against baseline)
  • Multiple utilities competing for the same resources


20.3 [z/OS] Traces and IFCID Records

While DISPLAY commands give you real-time snapshots, traces provide detailed continuous data collection. DB2 on z/OS uses Instrumentation Facility Component Identifiers (IFCIDs) to categorize trace records.

Trace Types

| Trace Type | Purpose | IFCID Range | Overhead |
|------------|---------|-------------|----------|
| Statistics | System-wide resource usage | 001, 002, 202, 225 | Very low |
| Accounting | Per-thread resource consumption | 003, 148, 239 | Low to moderate |
| Performance | Detailed SQL and I/O activity | 003, 006-014, 044-068 | Moderate to high |
| Audit | Security-relevant events | 140-146, 316-319 | Low |
| Monitor | Real-time monitoring data | 147, 148, 199, 230 | Low to moderate |
| Global | Recovery and logging events | 106, 107 | Very low |

Starting and Stopping Traces

-- Start a statistics trace (always leave this running)
-START TRACE(STAT) CLASS(1,3,4,5,6) DEST(SMF)

-- Start an accounting trace for detailed thread analysis
-START TRACE(ACCTG) CLASS(1,2,3) DEST(SMF)

-- Start a performance trace for SQL analysis (use cautiously)
-START TRACE(PERFM) CLASS(1,2,3) PLAN(BNKPLAN) DEST(GTF) TNO(5)

-- Stop a specific trace by trace number
-STOP TRACE(PERFM) TNO(5)

-- Display active traces
-DIS TRACE(*)

Key IFCIDs for Performance Analysis

IFCID 003 — Accounting Record: Generated when a thread deallocates. Contains elapsed time, CPU time, lock wait time, I/O counts, and SQL statement counts. This is the single most important IFCID for performance analysis.

IFCID 006 — Read I/O Start: Marks the beginning of each database read I/O. When combined with IFCID 007 (Read I/O End), provides per-I/O timing.

IFCID 044 — Lock Suspension: Generated when a thread must wait for a lock. Contains the resource name, lock type, and duration. Essential for lock contention analysis.

IFCID 058 — SQL Statement End: Marks the end of an SQL statement's execution; paired with the statement-start IFCIDs (059-066), it provides per-statement timing. Noticeable overhead; use only for targeted investigation.

IFCID 063 — SQL Statement Text: Captures the full SQL text of dynamic statements (IFCID 350 extends this for very long statements). Critical for identifying problematic queries, but generates substantial data volume.

Performance Trace Considerations

Performance traces generate enormous volumes of data and impose measurable overhead. Follow these guidelines:

  1. Never leave performance traces running indefinitely in production
  2. Filter by plan, authorization ID, or connection type to reduce volume
  3. Use a GTF destination for short investigations to avoid filling SMF data sets
  4. Set a time limit: START TRACE(PERFM) ... TIMED(5) stops the trace after 5 minutes
  5. Coordinate with systems programmers to ensure adequate DASD for trace data

20.4 [z/OS] SMF Records for DB2

System Management Facility (SMF) records are the permanent record of DB2 activity. While DISPLAY commands show you the present and traces capture the transient, SMF records provide the historical data you need for trend analysis and capacity planning.

DB2 SMF Record Types

| SMF Type | DB2 Trace Source | Contents |
|----------|------------------|----------|
| SMF 100 | Statistics trace | System-wide counters: buffer pool stats, EDM pool stats, log activity, locking summary |
| SMF 101 | Accounting trace | Per-thread detail: elapsed time, CPU, I/O, SQL counts, lock waits |
| SMF 102 | Performance trace | Detailed event records: individual SQL timings, I/O events, lock events |

SMF 101: The Accounting Record

SMF 101 records (generated from IFCID 003) are the foundation of DB2 performance management. Each record represents a complete thread lifecycle — from allocation to deallocation — and contains:

Elapsed and CPU time breakdown:

  • Class 1: Total elapsed time (in-DB2 + out-of-DB2)
  • Class 2: In-DB2 elapsed time
  • Class 3: Wait time detail (lock, I/O, log, drain, claim, global contention)

SQL activity counts:

  • SELECT, INSERT, UPDATE, DELETE counts
  • PREPARE, BIND, OPEN/CLOSE cursor counts
  • Commit and rollback counts

Resource consumption:

  • Buffer pool getpages and physical I/O
  • Log records written
  • Sorts performed and sort overflows

Formatting the Accounting Report

SMF 101 records are raw binary data; a performance reporting product formats them into readable accounting reports. The classic tool is the batch reporter in IBM OMEGAMON for DB2 Performance Expert (the successor to DB2 Performance Monitor, DB2PM). A representative JCL skeleton follows; the program name (FPECMAIN here), DD names, and command syntax vary by product and release, and the STEPLIB data set name is a placeholder for your product's load library:

//ACCTRPT  EXEC PGM=FPECMAIN
//STEPLIB  DD  DSN=OMPE.LOADLIB,DISP=SHR
//INPUTDD  DD  DSN=SYS1.SMF.DAILY,DISP=SHR
//SYSPRINT DD  SYSOUT=*
//SYSIN    DD  *
  ACCOUNTING REPORT
    INCLUDE(PLANNAME(BNKPLAN BATPLAN))
    ORDER(PLANNAME)
  EXEC
/*

Key fields in the accounting report:

PLAN: BNKPLAN    AUTH-ID: WEBUSER    CORR-ID: CONN0001
---------------------------------------------------------------
CLASS 1 ELAPSED:      0.452 SEC   CPU:      0.023 SEC
CLASS 2 ELAPSED:      0.310 SEC   CPU:      0.021 SEC
CLASS 3 LOCK WAIT:    0.089 SEC
CLASS 3 SYNC I/O:     0.045 SEC
CLASS 3 LOG WRITE:    0.002 SEC
---------------------------------------------------------------
GETPAGES:            4,523     SYNC READ:        12
SELECT:                 15     INSERT:            3
UPDATE:                  2     DELETE:            0
COMMIT:                  1     ROLLBACK:          0
SORTS:                   2     SORT OVERFLOW:     0

Interpreting the accounting report:

  • Class 1 minus Class 2 = time spent outside DB2 (application processing, network)
  • Class 2 elapsed minus Class 2 CPU = in-DB2 wait time (approximately; a small "not accounted" remainder is normal)
  • High lock wait relative to total elapsed time = lock contention problem
  • Sort overflows > 0 = sort heap too small or query generating excessive sorts
  • Sync reads high relative to getpages = poor buffer pool hit ratio for this plan
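The arithmetic behind that interpretation, applied to the sample report's numbers:

```python
# Figures copied from the sample accounting report above.
class1_elapsed, class2_elapsed = 0.452, 0.310
class2_cpu = 0.021
getpages, sync_reads = 4523, 12

outside_db2 = class1_elapsed - class2_elapsed        # application + network time
in_db2_wait = class2_elapsed - class2_cpu            # approximate in-DB2 wait
plan_hit_ratio = (1 - sync_reads / getpages) * 100   # per-plan buffer efficiency
```

For this thread, roughly 0.142 s was spent outside DB2 and 0.289 s waiting inside DB2, which squares with the 0.089 s lock wait and 0.045 s sync I/O reported in Class 3.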

Store summarized SMF data in DB2 tables for trend analysis:

CREATE TABLE MERIDIAN.PERF_HISTORY (
    CAPTURE_DATE      DATE NOT NULL,
    CAPTURE_HOUR      SMALLINT NOT NULL,
    PLAN_NAME         CHAR(8) NOT NULL,
    THREAD_COUNT      INTEGER,
    AVG_ELAPSED_SEC   DECIMAL(10,4),
    AVG_CPU_SEC       DECIMAL(10,4),
    AVG_LOCK_WAIT_SEC DECIMAL(10,4),
    TOTAL_GETPAGES    BIGINT,
    TOTAL_SYNC_READ   BIGINT,
    HIT_RATIO_PCT     DECIMAL(5,2),
    TOTAL_SORTS       INTEGER,
    SORT_OVERFLOWS    INTEGER,
    PRIMARY KEY (CAPTURE_DATE, CAPTURE_HOUR, PLAN_NAME)
);

Populate this table daily from your SMF processing and you will have the historical baseline data essential for detecting performance degradation before it becomes critical.
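With history in place, degradation detection is a small query plus a comparison. A Python sketch of the comparison step, using the mean-plus-two-standard-deviations idea from section 20.1; the AVG_ELAPSED_SEC history values are hypothetical:

```python
import statistics

def deviates(history, today, sigmas=2.0):
    """Flag today's value if it exceeds mean + sigmas * stddev of the baseline.

    A simple one-sided check; a real monitor also watches for sudden drops.
    """
    mean = statistics.mean(history)
    sd = statistics.pstdev(history)
    return today > mean + sigmas * sd

# Hypothetical AVG_ELAPSED_SEC history for BNKPLAN at the same hour over two weeks.
history = [0.41, 0.44, 0.43, 0.45, 0.42, 0.44, 0.43, 0.45, 0.42, 0.43]
```

Run this per (plan, hour) segment against the PERF_HISTORY rows and you get the "trend deviation" trigger from the reactive-vs-proactive table without any vendor tooling.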


20.5 [z/OS] OMEGAMON and Third-Party Monitors

While DISPLAY commands and SMF records provide raw capabilities, enterprise environments typically deploy specialized monitoring products that add real-time dashboards, threshold alerting, historical trending, and automated responses.

IBM OMEGAMON for DB2

OMEGAMON for DB2 Performance Expert is IBM's flagship monitoring product. It provides:

Real-time monitoring:

  • Thread activity with drill-down to individual SQL statements
  • Buffer pool hit ratios with graphical display
  • Lock contention visualization — who is waiting for whom
  • Dynamic SQL text capture and explain
  • EDM pool and RID pool utilization

Exception processing:

  • Define thresholds for any metric (e.g., "alert when hit ratio drops below 95%")
  • Automatic actions when thresholds are breached (send message, cancel thread, trigger trace)
  • Escalation chains — alert DBA first, then manager if not resolved

Historical analysis:

  • Trend charts for any metric over days, weeks, or months
  • Compare current performance against baseline periods
  • Identify gradual degradation that day-to-day observation misses

Near-term history:

  • Short-interval (1-5 minute) data kept in memory or VSAM
  • Bridge the gap between real-time and SMF historical data
  • Essential for investigating problems that occurred minutes or hours ago

Threshold Configuration Example

In OMEGAMON, you might configure alerts for Meridian Bank like this:

| Metric | Warning Threshold | Critical Threshold | Action |
|--------|-------------------|--------------------|--------|
| Buffer pool hit ratio (BP0) | < 97% | < 93% | Page DBA, investigate access paths |
| Active log data sets full | 4 of 6 | 5 of 6 | Page DBA, check for long-running URs |
| Thread elapsed time | > 30 seconds | > 120 seconds | Alert DBA, display thread detail |
| Lock wait time | > 5 seconds | > 30 seconds | Alert DBA, display lock detail |
| Deadlocks per interval | > 2 | > 10 | Alert DBA, start performance trace |
| CPU per SQL statement | > 0.5 seconds | > 2.0 seconds | Log for review |
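Whatever product you deploy, the evaluation logic underneath is the same: each metric gets a warning predicate and a critical predicate, and critical wins. A product-neutral Python sketch covering three of the rows above; the metric names and readings are illustrative, and this is not OMEGAMON configuration syntax:

```python
# metric name -> (warning test, critical test); thresholds mirror the table above.
THRESHOLDS = {
    "bp0_hit_ratio": (lambda v: v < 97.0, lambda v: v < 93.0),
    "logs_full":     (lambda v: v >= 4,   lambda v: v >= 5),
    "lock_wait_sec": (lambda v: v > 5.0,  lambda v: v > 30.0),
}

def evaluate(readings):
    """Return the alert level per metric; metrics within bounds are omitted."""
    alerts = {}
    for metric, value in readings.items():
        warn, crit = THRESHOLDS[metric]
        if crit(value):
            alerts[metric] = "CRITICAL"
        elif warn(value):
            alerts[metric] = "WARNING"
    return alerts

alerts = evaluate({"bp0_hit_ratio": 94.2, "logs_full": 5, "lock_wait_sec": 1.3})
```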

BMC MainView for DB2

BMC MainView is the other major third-party monitoring product for DB2 on z/OS. It provides similar capabilities to OMEGAMON with a different user interface and some unique features:

  • Thread advisor: Recommends actions for problem threads
  • SQL workspace: Interactive SQL analysis and tuning
  • Batch tracking: Monitors batch job progress against SLA targets
  • Automated recovery: Can automatically restart utilities or resolve certain error conditions

Choosing a Monitoring Product

For Meridian Bank's z/OS environment, the selection criteria include:

  • Integration with existing enterprise monitoring (Tivoli, BMC Patrol, etc.)
  • Data sharing support if running in a Parallel Sysplex
  • Ability to monitor multiple DB2 subsystems from a single console
  • Historical data retention and reporting capabilities
  • Cost — these products represent significant investment


20.6 [LUW] db2pd — The DBA's Swiss Army Knife

On DB2 LUW, db2pd is the most versatile real-time monitoring tool. It reads directly from DB2's shared memory structures, which means it imposes virtually zero overhead on the database engine. Unlike snapshot monitors, db2pd does not acquire latches, making it safe to run even during performance crises.

Core db2pd Options

Database overview:

db2pd -db MERIDIAN -dbcfg -memsets

This is your starting point. It shows database configuration, memory usage, and general status.

Buffer pool monitoring:

db2pd -db MERIDIAN -bufferpools

Sample output:

Bufferpool Name: IBMDEFAULTBP   Id: 1   PageSize: 8192

nPages    nPagesUsed  nPagesClean nPagesDirty HitRatio
50000     47823       38456       9367        98.73

Data Logical  Data Physical  Index Logical  Index Physical
1245678       15432          567890         7654

Async Data Reads  Async Index Reads  Async Data Writes  Async Index Writes
234567            123456             45678              23456

Direct Reads  Direct Writes  Prefetch Wait Time
1234          5678           0.023

Interpreting db2pd buffer pool output:

  • HitRatio: Should be above 95% for OLTP workloads. Below 80% requires immediate investigation.
  • nPagesDirty: If approaching nPages, the page cleaners cannot keep up — increase NUM_IOCLEANERS or buffer pool size.
  • Prefetch Wait Time: If non-zero and growing, the I/O subsystem is saturated.
  • Direct Reads/Writes: Large numbers indicate operations bypassing the buffer pool (LOBs, temp tables for sorts). Some are expected; a sudden increase warrants investigation.

Lock monitoring:

db2pd -db MERIDIAN -locks showlocks

Sample output:

Locks:
Address    TranHdl  Lockname                 Type  Mode  Sts  Owner
0x07800100 3        02000600040000000000000052 Row   X     G    3
0x07800200 5        02000600040000000000000052 Row   S     W    5
0x07800300 3        02000600000000000000000054 Tab   IX    G    3
0x07800400 5        02000600000000000000000054 Tab   IS    G    5

Column interpretation:

  • TranHdl: Transaction handle — links to application information
  • Type: Lock granularity (Row, Tab=Table, TBS=Tablespace)
  • Mode: Lock mode (S=Share, X=Exclusive, IX=Intent Exclusive, IS=Intent Share, U=Update)
  • Sts: Status (G=Granted, W=Waiting, C=Converting)

A lock with Sts=W tells you contention exists. Cross-reference the TranHdl with the -transactions output to find the blocking application.
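That cross-reference can be scripted. A Python sketch that pairs each waiter with the holders of the same lockname, using rows shaped like the -locks output above (the lock rows are copied from that sample):

```python
# Each tuple is (tran_handle, lockname, mode, status), as in the sample output.
locks = [
    (3, "02000600040000000000000052", "X",  "G"),
    (5, "02000600040000000000000052", "S",  "W"),
    (3, "02000600000000000000000054", "IX", "G"),
    (5, "02000600000000000000000054", "IS", "G"),
]

def blockers(lock_rows):
    """Map each waiting transaction to the holders of the lock it wants."""
    granted = {}
    for tran, name, mode, sts in lock_rows:
        if sts == "G":
            granted.setdefault(name, []).append(tran)
    waits = {}
    for tran, name, mode, sts in lock_rows:
        if sts == "W":
            waits[tran] = [g for g in granted.get(name, []) if g != tran]
    return waits

chains = blockers(locks)   # waiter TranHdl -> list of blocking TranHdls
```

Here transaction 5 waits on the row lock held exclusively by transaction 3 — the same conclusion you would reach by eye, but scriptable across hundreds of rows.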

Transaction monitoring:

db2pd -db MERIDIAN -transactions

Shows all active transactions with their start times, log space consumed, and current state. Look for:

  • Transactions that have been active for abnormally long periods
  • Transactions consuming large amounts of log space (potential log-full risk)
  • Transactions in ROLLBACK state that may be taking a long time to complete

Application monitoring:

db2pd -db MERIDIAN -applications

Maps application handles to connection details (IP address, client application name, authorization ID). Essential for identifying who is running a problematic query.

Combining db2pd Options

You can combine multiple options in a single command:

db2pd -db MERIDIAN -bufferpools -locks -transactions -applications

And you can set up continuous monitoring with the -repeat option:

db2pd -db MERIDIAN -bufferpools -repeat 30 -file /home/db2inst1/bp_monitor.out

This captures buffer pool statistics every 30 seconds and writes them to a file — invaluable for capturing intermittent problems.
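Once you have the capture file, a few lines of Python turn it into a time series. This parser assumes the header/value layout shown in the sample output earlier in this section; adjust it if your db2pd release formats the bufferpool section differently:

```python
def hit_ratios(capture_text):
    """Pull each HitRatio sample from a db2pd -bufferpools -repeat capture."""
    lines = capture_text.splitlines()
    ratios = []
    for i, line in enumerate(lines):
        # Header rows start with "nPages" and name the HitRatio column.
        if line.split()[:1] == ["nPages"] and "HitRatio" in line:
            cols = line.split()
            values = lines[i + 1].split()
            ratios.append(float(values[cols.index("HitRatio")]))
    return ratios

# Two repeat intervals, in the layout of the sample output above.
sample = """nPages    nPagesUsed  nPagesClean nPagesDirty HitRatio
50000     47823       38456       9367        98.73
nPages    nPagesUsed  nPagesClean nPagesDirty HitRatio
50000     47850       38100       9750        97.91"""
```

Feed the resulting list into your plotting or alerting tool of choice to see exactly when an intermittent hit-ratio dip occurs.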

Advanced db2pd Options

Sorting and hashing:

db2pd -db MERIDIAN -sort

Shows active sort operations and memory consumption. Sort overflows to disk devastate performance.

Dynamic SQL cache:

db2pd -db MERIDIAN -dynamic

Shows statements in the package cache, their execution counts, and costs. A quick way to find the most frequently executed or most expensive queries.

Table space status:

db2pd -db MERIDIAN -tablespaces

Shows table space state, size, utilization, and any pending operations.

Wait analysis:

db2pd -db MERIDIAN -wlocks

Specifically shows lock waits — who is waiting and who is blocking. This is the fastest way to diagnose lock contention.


20.7 [LUW] Snapshot Monitor and MON_GET Functions

While db2pd provides zero-overhead snapshots, the Snapshot Monitor and MON_GET table functions offer richer, more structured data suitable for automated collection and analysis.

The Snapshot Monitor (Legacy but Still Useful)

The snapshot monitor uses monitor switches that control what data is collected:

-- Check current monitor switch settings
db2 GET MONITOR SWITCHES

-- Enable all switches (a single CLP command; shown on one line so it runs as-is)
db2 UPDATE MONITOR SWITCHES USING BUFFERPOOL ON LOCK ON SORT ON STATEMENT ON TABLE ON TIMESTAMP ON UOW ON

Taking a snapshot:

-- Database-level snapshot
db2 GET SNAPSHOT FOR DATABASE ON MERIDIAN

-- Application-level snapshot
db2 GET SNAPSHOT FOR APPLICATIONS ON MERIDIAN

-- Buffer pool snapshot
db2 GET SNAPSHOT FOR BUFFERPOOLS ON MERIDIAN

-- Lock snapshot
db2 GET SNAPSHOT FOR LOCKS ON MERIDIAN

MON_GET Table Functions (Modern Approach)

Starting with DB2 9.7, MON_GET table functions provide the same data as snapshots but in a SQL-queryable form. They are more flexible, more efficient, and the recommended approach for new monitoring implementations.

Buffer pool monitoring:

SELECT
    BP_NAME,
    POOL_DATA_L_READS + POOL_INDEX_L_READS AS TOTAL_LOGICAL_READS,
    POOL_DATA_P_READS + POOL_INDEX_P_READS AS TOTAL_PHYSICAL_READS,
    CASE WHEN (POOL_DATA_L_READS + POOL_INDEX_L_READS) > 0
         THEN DEC(1.0 - (FLOAT(POOL_DATA_P_READS + POOL_INDEX_P_READS) /
              FLOAT(POOL_DATA_L_READS + POOL_INDEX_L_READS)), 5, 4) * 100
         ELSE 100.00
    END AS HIT_RATIO_PCT
FROM TABLE(MON_GET_BUFFERPOOL('', -2)) AS BP
ORDER BY TOTAL_LOGICAL_READS DESC;

Lock wait monitoring:

SELECT
    APPLICATION_HANDLE,
    LOCK_WAITS,
    LOCK_WAIT_TIME,
    LOCK_TIMEOUTS,
    DEADLOCKS,
    LOCK_ESCALS
FROM TABLE(MON_GET_CONNECTION(NULL, -2)) AS CONN
WHERE LOCK_WAITS > 0
ORDER BY LOCK_WAIT_TIME DESC;

Top SQL by execution time:

SELECT
    SUBSTR(STMT_TEXT, 1, 200) AS SQL_TEXT,
    NUM_EXECUTIONS,
    TOTAL_ACT_TIME / 1000 AS TOTAL_TIME_SEC,
    CASE WHEN NUM_EXECUTIONS > 0
         THEN (TOTAL_ACT_TIME / NUM_EXECUTIONS) / 1000
         ELSE 0
    END AS AVG_TIME_SEC,
    ROWS_READ,
    ROWS_RETURNED
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS STMT
WHERE NUM_EXECUTIONS > 0
ORDER BY TOTAL_ACT_TIME DESC
FETCH FIRST 20 ROWS ONLY;

Table I/O analysis:

SELECT
    SUBSTR(TABSCHEMA, 1, 20) AS SCHEMA,
    SUBSTR(TABNAME, 1, 30) AS TABLE_NAME,
    ROWS_READ,
    ROWS_INSERTED,
    ROWS_UPDATED,
    ROWS_DELETED,
    TABLE_SCANS,
    DATA_OBJECT_L_PAGES AS DATA_PAGES
FROM TABLE(MON_GET_TABLE('', '', -2)) AS T
WHERE TABSCHEMA = 'MERIDIAN'
ORDER BY ROWS_READ DESC
FETCH FIRST 20 ROWS ONLY;

MON_GET vs. Snapshot: When to Use Which

| Feature | GET SNAPSHOT | MON_GET Functions |
|---------|--------------|-------------------|
| Output format | Formatted text | SQL result set |
| Filtering | Limited | Full SQL WHERE clause |
| Aggregation | None (manual) | SQL GROUP BY, SUM, etc. |
| Automation | Requires parsing | Direct INSERT INTO |
| Overhead | Moderate | Low |
| Availability | All versions | DB2 9.7+ |
| Recommended for new work | No | Yes |

Best practice: Use MON_GET functions for all new monitoring development. Use snapshot monitors only when supporting legacy scripts or when a specific piece of information is not yet available through MON_GET.


20.8 [LUW] Event Monitors

While snapshots and MON_GET functions show you point-in-time data, event monitors capture data when specific events occur. They are essential for capturing transient problems that might not be visible in a snapshot.

Types of Event Monitors

| Event Monitor Type | Captures | Use Case |
|--------------------|----------|----------|
| Statement | Every SQL statement execution | SQL performance analysis |
| Activity | SQL and non-SQL activities | Workload analysis |
| Unit of Work | Transaction boundaries | Transaction pattern analysis |
| Deadlock | Deadlock events | Deadlock investigation |
| Connection | Connect and disconnect events | Connection pattern analysis |
| Locking | Lock escalations, timeouts, deadlocks | Lock contention deep dive |
| Package Cache | Statement cache entries | Package cache analysis |
| Threshold Violations | WLM threshold breaches | Workload management |

Creating an Activity Event Monitor

For Meridian Bank, we want to capture all SQL statements that take longer than 5 seconds:

-- First, create a workload management threshold
CREATE THRESHOLD MERIDIAN_SLOW_SQL
    FOR DATABASE ACTIVITIES
    ENFORCEMENT DATABASE
    WHEN ACTIVITYTOTALTIME > 5 SECONDS
    COLLECT ACTIVITY DATA WITH DETAILS AND VALUES
    CONTINUE;

-- Create the activity event monitor
CREATE EVENT MONITOR MERIDIAN_ACTMON
    FOR ACTIVITIES
    WRITE TO TABLE
    ACTIVITY (TABLE MERIDIAN.ACTIVITY_EVENTS
              IN USERSPACE1
              PCTDEACTIVATE 100)
    ACTIVITYSTMT (TABLE MERIDIAN.ACTIVITY_STMT_EVENTS
                  IN USERSPACE1
                  PCTDEACTIVATE 100)
    ACTIVITYVALS (TABLE MERIDIAN.ACTIVITY_VAL_EVENTS
                  IN USERSPACE1
                  PCTDEACTIVATE 100)
    AUTOSTART;

-- Activate the event monitor
SET EVENT MONITOR MERIDIAN_ACTMON STATE 1;

Creating a Deadlock Event Monitor

Deadlocks should always be monitored:

CREATE EVENT MONITOR MERIDIAN_DEADLOCK
    FOR LOCKING
    WRITE TO TABLE
    LOCK (TABLE MERIDIAN.LOCK_EVENTS
          IN USERSPACE1
          PCTDEACTIVATE 100)
    LOCK_PARTICIPANTS (TABLE MERIDIAN.LOCK_PARTICIPANTS
                       IN USERSPACE1
                       PCTDEACTIVATE 100)
    LOCK_PARTICIPANT_ACTIVITIES (TABLE MERIDIAN.LOCK_PART_ACTIVITIES
                                 IN USERSPACE1
                                 PCTDEACTIVATE 100)
    AUTOSTART;

SET EVENT MONITOR MERIDIAN_DEADLOCK STATE 1;

Creating a Unit of Work Event Monitor

CREATE EVENT MONITOR MERIDIAN_UOW
    FOR UNIT OF WORK
    WRITE TO TABLE
    AUTOSTART;

SET EVENT MONITOR MERIDIAN_UOW STATE 1;

Querying Event Monitor Data

Once event monitors are collecting data, query the target tables:

-- Find the slowest SQL statements from the activity event monitor
SELECT
    APPL_ID,
    UOW_ID,
    ACTIVITY_ID,
    SUBSTR(STMT_TEXT, 1, 200) AS SQL_TEXT,
    TIME_STARTED,
    TIME_COMPLETED,
    TIMESTAMPDIFF(2, CHAR(TIME_COMPLETED - TIME_STARTED)) AS DURATION_SEC
FROM MERIDIAN.ACTIVITY_STMT_EVENTS
WHERE TIME_COMPLETED > CURRENT TIMESTAMP - 24 HOURS
ORDER BY DURATION_SEC DESC
FETCH FIRST 20 ROWS ONLY;

-- Find all deadlocks in the past 7 days
SELECT
    XMLPARSE(DOCUMENT LOCK_EVENT_XML) AS DEADLOCK_INFO,
    EVENT_TIMESTAMP
FROM MERIDIAN.LOCK_EVENTS
WHERE EVENT_TYPE = 'DEADLOCK'
  AND EVENT_TIMESTAMP > CURRENT TIMESTAMP - 7 DAYS
ORDER BY EVENT_TIMESTAMP DESC;

Event Monitor Housekeeping

Event monitor tables grow continuously. Implement regular purging:

-- Purge activity events older than 30 days
DELETE FROM MERIDIAN.ACTIVITY_EVENTS
WHERE TIME_STARTED < CURRENT TIMESTAMP - 30 DAYS;

DELETE FROM MERIDIAN.ACTIVITY_STMT_EVENTS
WHERE TIME_STARTED < CURRENT TIMESTAMP - 30 DAYS;

DELETE FROM MERIDIAN.ACTIVITY_VAL_EVENTS
WHERE TIME_STARTED < CURRENT TIMESTAMP - 30 DAYS;

-- Purge lock events older than 90 days
DELETE FROM MERIDIAN.LOCK_EVENTS
WHERE EVENT_TIMESTAMP < CURRENT TIMESTAMP - 90 DAYS;

20.9 [LUW] db2diag.log and db2dump

When something goes wrong with DB2 LUW, the diagnostic log is the first place to look. Understanding how to navigate db2diag.log quickly and effectively separates experienced DBAs from novices.

db2diag.log Location

The diagnostic log is located in the directory specified by the DIAGPATH database manager configuration parameter:

db2 GET DBM CFG | grep DIAGPATH

Default location: $HOME/sqllib/db2dump/ (or the instance's DIAGPATH).

db2diag.log Format

Each entry in db2diag.log follows a standard format:

2025-01-15-14.23.45.123456-300 I567890H432           LEVEL: Warning
PID     : 12345                TID  : 140234567890    PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000
DB      : MERIDIAN             APPHDL: 0-1234          APPID: *LOCAL.db2inst1.250115192345
AUTHID  : WEBUSER              HOSTNAME: dbserver01
EDUID   : 15                   EDUNAME: db2agent (MERIDIAN) 0
FUNCTION: DB2 UDB, buffer pool services, sqlbAlterBufferPoolSize, probe:20
MESSAGE : ADM5502W  The buffer pool "MERIDIAN_BP" is currently not started.
DATA #1 : String, 45 bytes
Buffer pool MERIDIAN_BP activation deferred.
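For scripted triage, the fixed field labels make entries easy to pick apart. A minimal Python sketch that extracts the timestamp, LEVEL, and MESSAGE fields from one entry in the format above (real tooling should prefer the db2diag command's own filtering):

```python
import re

def triage(entry):
    """Pull the fields a first-pass triage needs from one db2diag.log entry."""
    ts = entry.split()[0]                             # leading timestamp token
    level = re.search(r"LEVEL:\s*(\w+)", entry)
    msg = re.search(r"MESSAGE\s*:\s*(.+)", entry)     # first MESSAGE line
    return {
        "timestamp": ts,
        "level": level.group(1) if level else None,
        "message": msg.group(1).strip() if msg else None,
    }

entry = """2025-01-15-14.23.45.123456-300 I567890H432           LEVEL: Warning
PID     : 12345                TID  : 140234567890    PROC : db2sysc 0
MESSAGE : ADM5502W  The buffer pool "MERIDIAN_BP" is currently not started."""
info = triage(entry)
```

Split the raw log on the timestamp pattern, map triage over the pieces, and you can count entries per severity or per ADM message code in a few more lines.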

Severity Levels

| Level | Meaning | Action |
|-------|---------|--------|
| Informational | Normal operations | Review periodically |
| Warning | Potential issue, not yet critical | Investigate within business hours |
| Error | Something failed | Investigate promptly |
| Severe | Major component failure | Investigate immediately |
| Critical | Database or instance may be compromised | Emergency response |

Filtering db2diag.log

The raw log can be enormous. Use db2diag (the diagnostic tool, not the log itself) to filter:

# Show only errors and above from the last 24 hours
db2diag -H 24h -level Error

# Show entries for a specific database
db2diag -db MERIDIAN -level Warning

# Show entries for a specific time range
db2diag -time "2025-01-15-14.00.00":"2025-01-15-15.00.00"

# Show entries matching a specific message ID
db2diag -msg ADM5502W

First Failure Data Capture (FFDC)

DB2 LUW implements First Failure Data Capture — when certain errors occur, DB2 automatically dumps diagnostic information to the db2dump directory. These files include:

  • Trap files (*.trap): Generated on exceptions/crashes. Contains stack trace, register contents, and memory state.
  • Lock list dump files: Generated on deadlocks when DIAGLEVEL is 3 or higher.
  • Memory dump files (*.dump): Memory pool contents at time of error.
  • Core files: Full process core dumps for severe failures.

DIAGLEVEL Configuration

# Check current diagnostic level
db2 GET DBM CFG | grep DIAGLEVEL

# Set diagnostic level (0=none, 1=severe, 2=errors, 3=warnings, 4=informational)
db2 UPDATE DBM CFG USING DIAGLEVEL 3

Recommended settings:

  • Production: DIAGLEVEL 3 (captures warnings and above — good balance)
  • Performance investigation: DIAGLEVEL 4 temporarily (captures everything — increases log volume substantially)
  • Never use: DIAGLEVEL 0 in production (you lose all diagnostic data)

db2support: Collecting Diagnostic Data

When you need to open a PMR (Problem Management Record) with IBM Support, use db2support to collect all relevant diagnostic data:

db2support /tmp/support_collection -d MERIDIAN -c

This collects:

  • Database and instance configuration
  • db2diag.log entries
  • Operating system information
  • Database layout and table space details
  • Recent trap files and dump files
  • Optimizer statistics

The output is a compressed archive ready to upload to IBM Support.

Practical db2diag.log Investigation

When investigating a problem, follow this sequence:

  1. Determine the time window: When did the problem start? When did it end?
  2. Filter by time and severity: db2diag -time "start":"end" -level Error
  3. Look for the first error: Problems often cascade — the first error is usually the root cause.
  4. Check for correlated entries: Multiple entries with the same APPHDL or EDUID often tell a story.
  5. Search for known message IDs: IBM documentation explains every message ID (e.g., SQL0911N, ADM1823E).
  6. Check trap files: If a process crashed, the trap file contains the detailed failure information.

20.10 Key Metrics and What They Mean

This section provides a reference for the most important DB2 metrics, their healthy thresholds, and the corrective actions when thresholds are breached. These apply to both platforms unless noted.

Buffer Pool Hit Ratio

What it measures: The percentage of page requests satisfied from memory without requiring physical I/O.

Formula:

Hit Ratio = (1 - (Physical Reads / Logical Reads)) * 100

Thresholds:

| Workload  | Warning | Critical |
|-----------|---------|----------|
| OLTP      | < 97%   | < 93%    |
| Mixed     | < 90%   | < 80%    |
| DSS/Batch | < 80%   | < 60%    |

Corrective actions:

  1. Increase buffer pool size if memory is available
  2. Review access paths — table scans on large tables destroy hit ratios
  3. Check for index usage — missing indexes force physical I/O
  4. Verify prefetch is working effectively
  5. Consider separating workloads into different buffer pools
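As a sanity check on the arithmetic, the formula and the OLTP thresholds above can be sketched in a few lines of Python (the function names and classifier structure are illustrative, not from any DB2 tooling):

```python
def hit_ratio(logical_reads, physical_reads):
    """Hit Ratio = (1 - (Physical Reads / Logical Reads)) * 100."""
    if logical_reads == 0:
        return 100.0  # no page requests: nothing was missed
    return (1.0 - physical_reads / logical_reads) * 100.0

def classify_oltp(ratio, warning=97.0, critical=93.0):
    """Apply the OLTP thresholds from the table above."""
    if ratio < critical:
        return "CRITICAL"
    if ratio < warning:
        return "WARNING"
    return "OK"

# 1,000,000 logical reads with 25,000 physical reads
ratio = hit_ratio(logical_reads=1_000_000, physical_reads=25_000)
print(round(ratio, 2), classify_oltp(ratio))   # 97.5 OK
```

The same two functions, pointed at the Mixed or DSS thresholds, cover the other workload rows in the table.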

Lock Wait Time

What it measures: Time threads spend waiting to acquire locks held by other threads.

Thresholds:

| Metric                   | Warning                | Critical             |
|--------------------------|------------------------|----------------------|
| Average lock wait (OLTP) | > 100ms                | > 1 second           |
| Lock timeout rate        | > 0.1% of transactions | > 1% of transactions |
| Deadlock rate            | > 1 per hour           | > 1 per minute       |

Corrective actions:

  1. Identify the blocking transaction and determine if it can be tuned
  2. Reduce transaction scope — commit more frequently
  3. Ensure applications access tables in consistent order (prevents deadlocks)
  4. Consider lock avoidance techniques (optimistic locking, SKIP LOCKED DATA)
  5. Review isolation levels — can any be lowered safely?

Log Utilization

What it measures: How much of the transaction log is in use.

z/OS thresholds:

| Active logs full | Severity                                  |
|------------------|-------------------------------------------|
| 50% full         | Normal                                    |
| 67% full         | Warning — investigate long-running URs    |
| 83% full         | Critical — resolve immediately            |
| 100% full        | Emergency — DB2 stops all update activity |

LUW thresholds: Monitor LOG_UTILIZATION_PERCENT from the SYSIBMADM.LOG_UTILIZATION administrative view:

SELECT
    LOG_UTILIZATION_PERCENT,
    TOTAL_LOG_USED_KB,
    TOTAL_LOG_AVAILABLE_KB
FROM SYSIBMADM.LOG_UTILIZATION;

| Utilization | Severity  |
|-------------|-----------|
| < 50%       | Normal    |
| 50-75%      | Warning   |
| > 75%       | Critical  |
| > 90%       | Emergency |

Corrective actions:

  1. Identify long-running transactions consuming log space
  2. Force commit or terminate the offending transaction if necessary
  3. Increase log file size or number of log files for ongoing capacity
  4. Review application logic for unnecessarily large transactions
  5. Ensure log archiving (on LUW with LOGARCHMETH1) is functioning

Sort Overflows

What it measures: How often sort operations spill from memory to temporary disk space.

Formula:

Sort Overflow Percentage = (Sort Overflows / Total Sorts) * 100

Thresholds:

| Overflow Rate | Severity   |
|---------------|------------|
| < 2%          | Acceptable |
| 2-5%          | Warning    |
| > 5%          | Critical   |

Corrective actions:

  1. Increase SORTHEAP (LUW) or sort pool size (z/OS)
  2. Review queries generating large sorts — can ORDER BY be eliminated or satisfied by an index?
  3. Add indexes that provide the needed ordering
  4. Check for hash joins spilling to disk
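The overflow formula and thresholds above translate directly into script form — useful if you post-process monitoring snapshots outside the database (a sketch; the names are my own):

```python
def sort_overflow_pct(total_sorts, sort_overflows):
    """Sort Overflow Percentage = (Sort Overflows / Total Sorts) * 100."""
    if total_sorts == 0:
        return 0.0
    return sort_overflows / total_sorts * 100.0

def classify_overflow(pct):
    # Thresholds from the table above: <2% acceptable, 2-5% warning, >5% critical
    if pct > 5.0:
        return "CRITICAL"
    if pct >= 2.0:
        return "WARNING"
    return "ACCEPTABLE"

# 48,000 sorts with 1,500 overflows in the sample interval
pct = sort_overflow_pct(total_sorts=48_000, sort_overflows=1_500)
print(round(pct, 3), classify_overflow(pct))   # 3.125 WARNING
```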

Package Cache Hit Ratio

What it measures: How often SQL statement preparation results are found in cache versus requiring fresh preparation.

Formula:

Pkg Cache Hit Ratio = (1 - (Pkg Cache Inserts / Pkg Cache Lookups)) * 100

Thresholds:

| Hit Ratio | Severity                                              |
|-----------|-------------------------------------------------------|
| > 80%     | Healthy                                               |
| 60-80%    | Warning — possible literal SQL issue                  |
| < 60%     | Critical — application likely not using parameter markers |

Corrective actions:

  1. Review application SQL for literal values instead of parameter markers
  2. Increase PCKCACHESZ (LUW) or EDM pool size (z/OS)
  3. Enable statement concentrator (STMT_CONC = LITERALS) for short-term relief
  4. Work with application teams to parameterize queries
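A common cause of a low package cache hit ratio is literal SQL. A rough way to spot it offline is to normalize captured statement text by replacing literals with parameter markers and counting how many distinct statements collapse into one shape. This is a heuristic sketch — the regexes are deliberately simplistic assumptions, not DB2's actual statement concentrator logic:

```python
import re

def normalize(sql):
    """Replace string and numeric literals with '?' to get a statement shape."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # numeric literals
    return re.sub(r"\s+", " ", sql).strip().upper()

# Statement text as you might capture it from the package cache
captured = [
    "SELECT BALANCE FROM ACCOUNTS WHERE ACCT_ID = 1001",
    "SELECT BALANCE FROM ACCOUNTS WHERE ACCT_ID = 1002",
    "SELECT BALANCE FROM ACCOUNTS WHERE ACCT_ID = 1003",
]

shapes = {normalize(s) for s in captured}
print(len(captured), "statements,", len(shapes), "shape(s)")
# 3 statements, 1 shape(s) → strong candidates for parameter markers
```

Many distinct cache entries collapsing into a single shape is exactly the pattern that parameter markers (or STMT_CONC = LITERALS) would fix.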

Agent Utilization [LUW]

What it measures: Percentage of available agents (connections) in use.

SELECT
    (D.AGENTS_REGISTERED * 100.0) / INT(C.VALUE) AS AGENT_UTILIZATION_PCT
FROM
    SYSIBMADM.SNAPDBM AS D,
    SYSIBMADM.DBMCFG AS C
WHERE C.NAME = 'max_connections';

Thresholds:

| Utilization | Severity                                            |
|-------------|-----------------------------------------------------|
| < 70%       | Normal                                              |
| 70-85%      | Warning — plan for growth                           |
| > 85%       | Critical — connection pool or limit adjustment needed |

Corrective actions:

  1. Ensure applications use connection pooling
  2. Increase MAX_CONNECTIONS if system resources allow
  3. Investigate connection leaks (connections opened but never closed)
  4. Implement workload management to prioritize critical connections


20.11 Building a Monitoring Dashboard

A monitoring dashboard consolidates the key metrics from the previous sections into a single view that a DBA can assess in seconds. Whether you build it with a commercial tool, a custom web application, or even a simple terminal-based script, the principles are the same.

What to Display

Organize your dashboard into three tiers:

Tier 1 — Glance indicators (red/yellow/green):

  • Database status (up/down)
  • Buffer pool hit ratio
  • Log utilization
  • Active connections vs. limit
  • Lock escalation count
  • Last successful backup age

Tier 2 — Current performance (numeric values with trend arrows):

  • Transactions per second
  • Average response time
  • Current lock wait time
  • Sort overflow rate
  • Package cache hit ratio
  • I/O throughput (reads/writes per second)

Tier 3 — Detail panels (expandable):

  • Top 10 active SQL statements by elapsed time
  • Lock waits — who is waiting for whom
  • Buffer pool detail by pool name
  • Table space utilization
  • Application connection list

Refresh Frequency

| Dashboard Tier       | Refresh Interval | Rationale                             |
|----------------------|------------------|---------------------------------------|
| Tier 1 (status)      | 15-30 seconds    | Must reflect current state            |
| Tier 2 (performance) | 30-60 seconds    | Smooths out transient spikes          |
| Tier 3 (detail)      | On-demand        | Too expensive to refresh continuously |

Trend Visualization

Show at least 4 hours of trend data for each Tier 2 metric. This allows the DBA to see:

  • Sudden spikes: Something just changed — investigate now
  • Gradual degradation: Slowly worsening over hours — investigate before it becomes critical
  • Periodic patterns: Normal workload cycles — no action needed if within baseline
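The first two trend patterns can be distinguished mechanically from a sampled series. A rough sketch — the spike factor and drift percentage are illustrative assumptions, not fixed rules, and real dashboards would also compare against a stored baseline:

```python
from statistics import mean

def classify_trend(samples, spike_factor=1.5, drift_pct=10.0):
    """Classify a metric series: SPIKE if the latest sample jumps well above
    the recent mean, DRIFT if the second half is materially worse than the
    first half, otherwise STABLE."""
    baseline = mean(samples[:-1])
    if baseline > 0 and samples[-1] > baseline * spike_factor:
        return "SPIKE"
    half = len(samples) // 2
    first, second = mean(samples[:half]), mean(samples[half:])
    if first > 0 and (second - first) / first * 100.0 > drift_pct:
        return "DRIFT"
    return "STABLE"

# e.g., average lock wait time (ms) sampled once per interval
print(classify_trend([10, 11, 10, 12, 11, 40]))   # SPIKE
print(classify_trend([10, 10, 11, 12, 13, 14]))   # DRIFT
```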

Alerting Rules

Configure alerts with hysteresis to avoid alert storms:

IF metric > critical_threshold FOR 3 consecutive intervals:
    SEND critical alert
    DO NOT re-alert for 15 minutes

IF metric > warning_threshold FOR 5 consecutive intervals:
    SEND warning alert
    DO NOT re-alert for 30 minutes

IF metric returns below warning_threshold FOR 3 consecutive intervals:
    SEND recovery notification
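The hysteresis rules above amount to a small state machine. A sketch, assuming one call per collection interval (the class and parameter names are my own, and interval counts stand in for wall-clock hold-downs):

```python
class HysteresisAlert:
    """Fire after N consecutive breaches, suppress re-alerts for a
    hold-down period, and notify on recovery."""

    def __init__(self, threshold, breach_intervals=3,
                 holddown_intervals=15, recover_intervals=3):
        self.threshold = threshold
        self.breach_intervals = breach_intervals
        self.holddown_intervals = holddown_intervals
        self.recover_intervals = recover_intervals
        self.over = 0          # consecutive intervals above threshold
        self.under = 0         # consecutive intervals at/below threshold
        self.holddown = 0      # intervals left before re-alerting is allowed
        self.alerting = False

    def sample(self, value):
        """Feed one interval's value; return 'ALERT', 'RECOVERED', or None."""
        self.holddown = max(0, self.holddown - 1)
        if value > self.threshold:
            self.over += 1
            self.under = 0
            if self.over >= self.breach_intervals and self.holddown == 0:
                self.holddown = self.holddown_intervals
                self.alerting = True
                return "ALERT"
        else:
            self.under += 1
            self.over = 0
            if self.alerting and self.under >= self.recover_intervals:
                self.alerting = False
                return "RECOVERED"
        return None

mon = HysteresisAlert(threshold=75)   # e.g., log utilization percent
events = [mon.sample(v) for v in [70, 80, 80, 80, 80, 70, 70, 70]]
print(events)
# [None, None, None, 'ALERT', None, None, None, 'RECOVERED']
```

The hold-down counter is what prevents an alert storm: once a critical alert fires, continued breaches stay silent until either the hold-down expires or the metric recovers.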

Alert delivery channels:

  • Email: For warnings and non-urgent items
  • SMS/Pager: For critical alerts requiring immediate response
  • Dashboard color change: For real-time visual monitoring
  • Ticketing system integration: For automated incident creation

Sample Dashboard Data Collection Query [LUW]

-- Comprehensive dashboard data collection
SELECT
    'BUFFER_POOL' AS METRIC_CATEGORY,
    BP_NAME AS METRIC_NAME,
    CASE WHEN (POOL_DATA_L_READS + POOL_INDEX_L_READS) > 0
         THEN DEC((1.0 - FLOAT(POOL_DATA_P_READS + POOL_INDEX_P_READS) /
              FLOAT(POOL_DATA_L_READS + POOL_INDEX_L_READS)) * 100, 5, 2)
         ELSE 100.00
    END AS METRIC_VALUE,
    CURRENT TIMESTAMP AS CAPTURE_TIME
FROM TABLE(MON_GET_BUFFERPOOL('', -2)) AS BP

UNION ALL

SELECT
    'LOG_UTILIZATION',
    'LOG_USED_PCT',
    DEC(LOG_UTILIZATION_PERCENT, 5, 2),
    CURRENT TIMESTAMP
FROM SYSIBMADM.LOG_UTILIZATION

UNION ALL

SELECT
    'CONNECTIONS',
    'ACTIVE_CONNECTIONS',
    DEC(COUNT(*), 5, 2),
    CURRENT TIMESTAMP
FROM TABLE(MON_GET_CONNECTION(NULL, -2)) AS CONN

UNION ALL

SELECT
    'LOCK_WAITS',
    'TOTAL_LOCK_WAITS',
    DEC(SUM(LOCK_WAITS), 5, 2),
    CURRENT TIMESTAMP
FROM TABLE(MON_GET_CONNECTION(NULL, -2)) AS CONN;

20.12 The Meridian Bank Monitoring Setup

Now we bring everything together for Meridian National Bank. This section defines the complete monitoring configuration across both platforms, including specific thresholds, collection intervals, alerting rules, and a daily health check.

z/OS Monitoring Configuration

Always-on traces:

-- Statistics trace — captures system-wide metrics every 15 minutes
-START TRACE(STAT) CLASS(1,3,4,5,6) DEST(SMF)

-- Accounting trace — captures per-thread detail
-START TRACE(ACCTG) CLASS(1,2,3) DEST(SMF)

Scheduled DISPLAY commands (run via automation):

-- Every 5 minutes during online hours:
-DIS BUFFERPOOL(*) DETAIL
-DIS LOG
-DIS DATABASE(MERIDIANDB) SPACENAM(*) RESTRICT

-- Every 15 minutes:
-DIS THREAD(*) DETAIL

-- Every hour:
-DIS UTIL(*)

OMEGAMON thresholds for Meridian:

| Metric               | Warning  | Critical  | Action                       |
|----------------------|----------|-----------|------------------------------|
| BP0 Hit Ratio        | < 97%    | < 93%     | Page DBA                     |
| Active Logs Full     | 4 of 6   | 5 of 6    | Page DBA + check URs         |
| Thread Elapsed > 60s | Any      | > 120s    | Alert DBA                    |
| Lock Timeout         | > 5/hour | > 20/hour | Alert DBA                    |
| Deadlock             | Any      | > 5/hour  | Alert DBA + start perf trace |
| RECP/GRECP Status    | Any      | N/A       | Emergency alert              |

LUW Monitoring Configuration

Database monitor configuration:

-- Enable monitoring switches permanently
UPDATE DBM CFG USING
    HEALTH_MON ON
    DFT_MON_BUFPOOL ON
    DFT_MON_LOCK ON
    DFT_MON_SORT ON
    DFT_MON_STMT ON
    DFT_MON_TABLE ON
    DFT_MON_TIMESTAMP ON
    DFT_MON_UOW ON;

Event monitors:

-- Deadlock/locking event monitor (always on)
CREATE EVENT MONITOR MERIDIAN_LOCK_MON
    FOR LOCKING
    WRITE TO TABLE
    AUTOSTART;

-- Activity event monitor for slow queries (> 5 seconds)
CREATE THRESHOLD MERIDIAN_SLOW_SQL
    FOR DATABASE ACTIVITIES
    ENFORCEMENT DATABASE
    WHEN ACTIVITYTOTALTIME > 5 SECONDS
    COLLECT ACTIVITY DATA WITH DETAILS AND VALUES
    CONTINUE;

CREATE EVENT MONITOR MERIDIAN_ACT_MON
    FOR ACTIVITIES
    WRITE TO TABLE
    AUTOSTART;

Continuous db2pd collection:

# Buffer pool monitoring — every 60 seconds, output to daily file
db2pd -db MERIDIAN -bufferpools -repeat 60 \
    -file /db2data/monitoring/bp_$(date +%Y%m%d).out &

# Lock monitoring — every 30 seconds during online hours
db2pd -db MERIDIAN -locks showlocks -repeat 30 \
    -file /db2data/monitoring/locks_$(date +%Y%m%d).out &

Buffer Pool Monitoring Detail

For Meridian Bank, we define four buffer pools with specific monitoring thresholds:

| Buffer Pool      | Page Size | Size         | Purpose      | Hit Ratio Threshold |
|------------------|-----------|--------------|--------------|---------------------|
| MERIDIAN_BP_DATA | 8K        | 50,000 pages | General data | > 97%               |
| MERIDIAN_BP_IDX  | 8K        | 30,000 pages | Indexes      | > 99%               |
| MERIDIAN_BP_LOB  | 32K       | 10,000 pages | LOB data     | > 80%               |
| MERIDIAN_BP_TEMP | 8K        | 20,000 pages | Temp tables  | N/A (no caching)    |

-- Buffer pool monitoring query for Meridian dashboard
SELECT
    BP_NAME,
    POOL_DATA_L_READS AS DATA_LOGICAL,
    POOL_DATA_P_READS AS DATA_PHYSICAL,
    POOL_INDEX_L_READS AS INDEX_LOGICAL,
    POOL_INDEX_P_READS AS INDEX_PHYSICAL,
    CASE WHEN (POOL_DATA_L_READS + POOL_INDEX_L_READS) > 0
         THEN DEC((1.0 - FLOAT(POOL_DATA_P_READS + POOL_INDEX_P_READS) /
              FLOAT(POOL_DATA_L_READS + POOL_INDEX_L_READS)) * 100, 5, 2)
         ELSE 100.00
    END AS HIT_RATIO,
    POOL_ASYNC_DATA_READS AS PREFETCH_DATA,
    POOL_ASYNC_INDEX_READS AS PREFETCH_INDEX,
    POOL_DATA_WRITES + POOL_INDEX_WRITES AS TOTAL_WRITES,
    POOL_ASYNC_DATA_WRITES + POOL_ASYNC_INDEX_WRITES AS ASYNC_WRITES
FROM TABLE(MON_GET_BUFFERPOOL('', -2)) AS BP
WHERE BP_NAME LIKE 'MERIDIAN%'
ORDER BY BP_NAME;

Transaction Volume Tracking

-- Hourly transaction volume tracking
INSERT INTO MERIDIAN.TRANSACTION_METRICS
SELECT
    CURRENT DATE AS METRIC_DATE,
    HOUR(CURRENT TIMESTAMP) AS METRIC_HOUR,
    SUM(TOTAL_APP_COMMITS) AS TOTAL_COMMITS,
    SUM(TOTAL_APP_ROLLBACKS) AS TOTAL_ROLLBACKS,
    SUM(ROWS_MODIFIED) AS TOTAL_ROWS_MODIFIED,
    SUM(ROWS_READ) AS TOTAL_ROWS_READ,
    SUM(ROWS_RETURNED) AS TOTAL_ROWS_RETURNED
FROM TABLE(MON_GET_CONNECTION(NULL, -2)) AS CONN;

Lock Contention Alerts

-- Check for active lock waits exceeding threshold
SELECT
    HLD.APPLICATION_HANDLE AS BLOCKER_HANDLE,
    HLD.AGENT_ID AS BLOCKER_AGENT,
    REQ.APPLICATION_HANDLE AS WAITER_HANDLE,
    REQ.AGENT_ID AS WAITER_AGENT,
    REQ.LOCK_WAIT_START_TIME,
    TIMESTAMPDIFF(2, CHAR(CURRENT TIMESTAMP - REQ.LOCK_WAIT_START_TIME))
        AS WAIT_SECONDS,
    REQ.LOCK_NAME,
    REQ.LOCK_MODE AS REQUESTED_MODE,
    HLD.LOCK_MODE AS HELD_MODE,
    SUBSTR(REQ.TABSCHEMA, 1, 20) AS SCHEMA,
    SUBSTR(REQ.TABNAME, 1, 30) AS TABLE_NAME
FROM
    TABLE(MON_GET_APPL_LOCKWAIT(NULL, -2)) AS REQ,
    TABLE(MON_GET_LOCKS(NULL, -2)) AS HLD
WHERE REQ.LOCK_NAME = HLD.LOCK_NAME
  AND REQ.APPLICATION_HANDLE <> HLD.APPLICATION_HANDLE
  AND HLD.LOCK_STATUS = 'G'
ORDER BY WAIT_SECONDS DESC;

Log Space Monitoring

-- Log space monitoring with alerting threshold
SELECT
    CASE
        WHEN LOG_USED_PCT > 75 THEN 'CRITICAL'
        WHEN LOG_USED_PCT > 50 THEN 'WARNING'
        ELSE 'NORMAL'
    END AS ALERT_LEVEL,
    LOG_USED_PCT,
    TOTAL_LOG_USED / 1024 AS TOTAL_LOG_USED_KB,
    TOTAL_LOG_AVAILABLE / 1024 AS TOTAL_LOG_AVAILABLE_KB,
    APPLID_HOLDING_OLDEST_XACT AS OLDEST_TRANSACTION_HANDLE,
    CURRENT TIMESTAMP AS CHECK_TIME
FROM (
    SELECT
        DEC(TOTAL_LOG_USED * 100.0 /
            NULLIF(TOTAL_LOG_USED + TOTAL_LOG_AVAILABLE, 0), 5, 2) AS LOG_USED_PCT,
        TOTAL_LOG_USED,
        TOTAL_LOG_AVAILABLE,
        APPLID_HOLDING_OLDEST_XACT
    FROM TABLE(MON_GET_TRANSACTION_LOG(-2)) AS T
) AS LOGSTATS;

Daily Health Check Script

The crown jewel of Meridian's monitoring infrastructure is the daily health check — a comprehensive script that runs every morning at 6 AM and produces a report the DBA team reviews before the business day begins.

-- ============================================================
-- MERIDIAN NATIONAL BANK — DAILY HEALTH CHECK
-- Run daily at 06:00 via scheduler
-- ============================================================

-- Section 1: Database Status
SELECT 'DATABASE STATUS' AS CHECK_CATEGORY,
       DB_NAME, DB_STATUS, DB_ACTIVATION_STATE,
       LAST_BACKUP,
       TIMESTAMPDIFF(8, CHAR(CURRENT TIMESTAMP - LAST_BACKUP))
           AS HOURS_SINCE_BACKUP
FROM TABLE(MON_GET_DATABASE(-2)) AS DB;

-- Section 2: Buffer Pool Health
SELECT 'BUFFER POOL HEALTH' AS CHECK_CATEGORY,
       BP_NAME,
       CASE WHEN (POOL_DATA_L_READS + POOL_INDEX_L_READS) > 0
            THEN DEC((1.0 - FLOAT(POOL_DATA_P_READS + POOL_INDEX_P_READS) /
                 FLOAT(POOL_DATA_L_READS + POOL_INDEX_L_READS)) * 100, 5, 2)
            ELSE 100.00
       END AS HIT_RATIO_PCT,
       POOL_DATA_L_READS + POOL_INDEX_L_READS AS TOTAL_LOGICAL_READS,
       POOL_DATA_P_READS + POOL_INDEX_P_READS AS TOTAL_PHYSICAL_READS
FROM TABLE(MON_GET_BUFFERPOOL('', -2)) AS BP
ORDER BY BP_NAME;

-- Section 3: Table Space Utilization
SELECT 'TABLESPACE UTILIZATION' AS CHECK_CATEGORY,
       TBSP_NAME,
       TBSP_TYPE,
       TBSP_UTILIZATION_PERCENT AS UTILIZATION_PCT,
       TBSP_TOTAL_SIZE_KB / 1024 AS TOTAL_SIZE_MB,
       TBSP_USED_SIZE_KB / 1024 AS USED_SIZE_MB,
       TBSP_FREE_SIZE_KB / 1024 AS FREE_SIZE_MB,
       CASE WHEN TBSP_UTILIZATION_PERCENT > 90 THEN 'CRITICAL'
            WHEN TBSP_UTILIZATION_PERCENT > 80 THEN 'WARNING'
            ELSE 'OK'
       END AS STATUS
FROM SYSIBMADM.TBSP_UTILIZATION AS TS
ORDER BY TBSP_UTILIZATION_PERCENT DESC;

-- Section 4: Log Utilization
SELECT 'LOG UTILIZATION' AS CHECK_CATEGORY,
       LOG_UTILIZATION_PERCENT,
       TOTAL_LOG_USED_KB / 1024 AS LOG_USED_MB,
       TOTAL_LOG_AVAILABLE_KB / 1024 AS LOG_AVAILABLE_MB
FROM SYSIBMADM.LOG_UTILIZATION;

-- Section 5: Lock Summary (past 24 hours)
SELECT 'LOCK SUMMARY' AS CHECK_CATEGORY,
       SUM(LOCK_WAITS) AS TOTAL_LOCK_WAITS,
       SUM(LOCK_TIMEOUTS) AS TOTAL_LOCK_TIMEOUTS,
       SUM(DEADLOCKS) AS TOTAL_DEADLOCKS,
       SUM(LOCK_ESCALS) AS TOTAL_LOCK_ESCALATIONS
FROM TABLE(MON_GET_DATABASE(-2)) AS DB;

-- Section 6: Top 10 Most Expensive Queries
SELECT 'TOP QUERIES' AS CHECK_CATEGORY,
       SUBSTR(STMT_TEXT, 1, 100) AS SQL_PREVIEW,
       NUM_EXECUTIONS,
       DEC(TOTAL_ACT_TIME / 1000.0, 12, 2) AS TOTAL_TIME_SEC,
       DEC(CASE WHEN NUM_EXECUTIONS > 0
            THEN (TOTAL_ACT_TIME / NUM_EXECUTIONS) / 1000.0
            ELSE 0 END, 12, 4) AS AVG_TIME_SEC,
       ROWS_READ,
       ROWS_RETURNED
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS STMT
WHERE NUM_EXECUTIONS > 0
ORDER BY TOTAL_ACT_TIME DESC
FETCH FIRST 10 ROWS ONLY;

-- Section 7: Connection Summary
SELECT 'CONNECTION SUMMARY' AS CHECK_CATEGORY,
       COUNT(*) AS TOTAL_CONNECTIONS,
       SUM(CASE WHEN APPLICATION_STATUS = 'UOWWAIT' THEN 1 ELSE 0 END) AS IDLE,
       SUM(CASE WHEN APPLICATION_STATUS = 'UOWEXEC' THEN 1 ELSE 0 END) AS ACTIVE,
       SUM(CASE WHEN APPLICATION_STATUS = 'LOCKWAIT' THEN 1 ELSE 0 END) AS LOCK_WAITING
FROM TABLE(MON_GET_CONNECTION(NULL, -2)) AS CONN;

-- Section 8: Sort Performance
SELECT 'SORT PERFORMANCE' AS CHECK_CATEGORY,
       TOTAL_SORTS,
       SORT_OVERFLOWS,
       CASE WHEN TOTAL_SORTS > 0
            THEN DEC((FLOAT(SORT_OVERFLOWS) / FLOAT(TOTAL_SORTS)) * 100, 5, 2)
            ELSE 0.00
       END AS OVERFLOW_PCT,
       TOTAL_SORT_TIME
FROM TABLE(MON_GET_DATABASE(-2)) AS DB;

This health check report should be reviewed daily by the on-call DBA. Any metric flagged as WARNING or CRITICAL should be investigated before business hours begin. Over time, the team will develop an intuition for what is normal and what requires attention — but only if they review the report consistently.


Spaced Review: Connecting to Earlier Chapters

From Chapter 3: Your First Queries

In Chapter 3, you wrote your first SELECT statement against the ACCOUNTS table. Now you can monitor how that query performs:

-- Find your query in the package cache
SELECT
    SUBSTR(STMT_TEXT, 1, 200) AS SQL_TEXT,
    NUM_EXECUTIONS,
    TOTAL_ACT_TIME,
    ROWS_READ,
    ROWS_RETURNED
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS STMT
WHERE STMT_TEXT LIKE '%ACCOUNTS%'
ORDER BY TOTAL_ACT_TIME DESC;

Every query you learned to write in Chapter 3 has a performance profile. Monitoring teaches you to see that profile and optimize it.

Review question: If a simple SELECT on the ACCOUNTS table shows ROWS_READ = 1,000,000 but ROWS_RETURNED = 5, what does this tell you? (Answer: The query is performing a full table scan when it should be using an index. The monitoring data just revealed a missing or unused index.)

From Chapter 15: Query Optimization

Chapter 15 taught you about access paths, index design, and the optimizer. Monitoring is how you verify that the optimizer's choices are working in production:

  • Buffer pool hit ratios tell you whether the optimizer's choice to use an index is actually reducing I/O
  • Sort overflow metrics tell you whether the optimizer's sort estimates are accurate
  • Lock wait metrics tell you whether the isolation level chosen by the optimizer is causing contention

Review question: You used EXPLAIN in Chapter 15 to analyze a query plan. How would you find, through monitoring, whether a query that looked good in EXPLAIN is actually performing poorly in production? (Answer: Use MON_GET_PKG_CACHE_STMT to find the query and compare actual ROWS_READ to estimated rows. Large discrepancies indicate stale statistics or optimizer estimation errors.)

From Chapter 17: Utilities

Chapter 17 covered RUNSTATS, REORG, COPY, and RECOVER. Monitoring tells you when these utilities are needed:

  • Declining buffer pool hit ratios after many updates may indicate the need for REORG
  • Increasing sort overflows may indicate stale statistics requiring RUNSTATS
  • COPY pending status in DISPLAY DATABASE output means a backup is required
  • Growing table space utilization trends tell you when to schedule a REORG with RECLAIM

Review question: After running RUNSTATS in Chapter 17, how would you verify through monitoring that the statistics update actually improved query performance? (Answer: Compare MON_GET_PKG_CACHE_STMT metrics before and after — look for reduced TOTAL_ACT_TIME, fewer ROWS_READ, and a higher ratio of ROWS_RETURNED to ROWS_READ for affected queries.)


Chapter Summary

Monitoring is not a feature — it is a discipline. The tools described in this chapter are only as effective as the DBA who uses them consistently, interprets their output correctly, and acts on the findings promptly.

On z/OS, your monitoring toolkit includes:

  • DISPLAY commands for real-time status (-DIS DATABASE, -DIS THREAD, -DIS BUFFERPOOL, -DIS LOG, -DIS UTILITY)
  • Traces and IFCIDs for detailed data collection (statistics, accounting, performance)
  • SMF records (100, 101, 102) for historical analysis and trend reporting
  • OMEGAMON or third-party monitors for enterprise-grade alerting and dashboards

On LUW, your monitoring toolkit includes:

  • db2pd for zero-overhead real-time snapshots (bufferpools, locks, transactions, applications)
  • MON_GET table functions for SQL-queryable monitoring data
  • Event monitors for capturing transient events (deadlocks, slow queries, locking)
  • db2diag.log for diagnostic investigation and problem determination
  • db2support for collecting diagnostic data for IBM Support

On both platforms, the principles are the same:

  1. Establish baselines during known-good operations
  2. Monitor continuously against those baselines
  3. Alert when thresholds are breached — before users notice
  4. Investigate with the right tool for the situation
  5. Document findings and update baselines as workloads evolve

At Meridian National Bank, we built a complete monitoring infrastructure: always-on traces, event monitors, continuous db2pd collection, buffer pool monitoring per pool, lock contention alerting, log space monitoring, and a daily health check that the DBA team reviews every morning. This is not gold-plating — it is the minimum standard for a production banking system.

The best DBAs do not wait for the phone to ring. They already know what is happening in their databases, every hour of every day. This chapter gave you the tools and the knowledge. Now it is up to you to build the discipline.


Next chapter: Chapter 21 — The DB2 Catalog and Directory: Understanding the System's Metadata