

Chapter 27: Performance Diagnosis Methodology — A Systematic Approach to Finding and Fixing the Bottleneck

"The difference between a junior DBA and a senior DBA is not the ability to fix problems. It is the ability to find problems. Fixing is usually straightforward once you know what is wrong." — Production wisdom


Learning Objectives

After completing this chapter you will be able to:

  1. Apply a systematic performance diagnosis methodology.
  2. Determine if a problem is CPU, I/O, lock, or memory bound.
  3. Read and interpret accounting traces (z/OS) and monitor output (LUW).
  4. Identify the worst-performing SQL statements.
  5. Prove that a fix actually worked with before/after metrics.
  6. Run a complete performance investigation on Meridian National Bank.

27.1 The Performance Diagnosis Playbook

Performance diagnosis is not an art — it is a method. When a system is slow, the temptation is to start guessing: "Maybe we need a new index." "Maybe the buffer pool is too small." "Maybe the network is slow." Guessing wastes time, introduces risk (incorrect changes can make things worse), and rarely addresses the actual bottleneck.

Instead, follow this six-step methodology every time:

Step 1: Define the Symptom

Before collecting any data, write down exactly what is wrong. "The system is slow" is not a symptom. These are symptoms:

  • "The FUNDS_TRANSFER transaction averaged 50ms last week; today it averages 1200ms."
  • "The nightly interest calculation batch finished in 3 hours last month; last night it took 7 hours."
  • "Users report intermittent timeouts on the balance inquiry screen between 2 PM and 3 PM."

A well-defined symptom tells you what is slow (which transaction, which program, which SQL), when it became slow (always, recently, only during certain hours), and how slow (quantified in milliseconds, elapsed time, or throughput).

Step 2: Collect Metrics

Once you know the symptom, collect data specific to that symptom. Do not collect everything — collect what is relevant.

For a slow SQL statement: access path (EXPLAIN), runtime statistics (accounting trace on z/OS, MON_GET functions on LUW), buffer pool hit ratios, lock wait counts.

For a slow batch job: elapsed time breakdown (CPU time, I/O wait, lock wait), number of rows processed, COMMIT frequency, lock escalation counts.

For an intermittent problem: time-series data — metrics collected at regular intervals before, during, and after the problem window.

Step 3: Identify the Bottleneck Category

Every DB2 performance problem falls into one of four categories:

  • CPU-bound: the transaction is spending time executing instructions on the processor. Key metrics: CPU time, getpages, instructions executed.
  • I/O-bound: the transaction is spending time waiting for data to be read from disk. Key metrics: synchronous I/O, buffer pool hit ratio, prefetch activity.
  • Lock-bound: the transaction is spending time waiting for another transaction to release a lock. Key metrics: lock wait time, lock timeout/deadlock counts.
  • Memory-bound: the transaction is spending time coping with insufficient memory (sorts spilling, pools thrashing). Key metrics: sort overflows, buffer pool hit ratio, EDM pool full (z/OS), catalog cache hit ratio (LUW).

The diagnostic strategy differs radically by category. An I/O-bound problem is not solved by adding CPU. A lock-bound problem is not solved by increasing the buffer pool. Step 3 is about pointing the investigation in the right direction.

Step 4: Identify the Root Cause

Within the bottleneck category, drill down to the specific cause:

  • CPU-bound → Which SQL statement? Which access path? How many rows processed?
  • I/O-bound → Which tablespace? Which buffer pool? Is prefetch triggering? Is the data hot or cold?
  • Lock-bound → Which table? Which transaction holds the lock? What isolation level?
  • Memory-bound → Which pool is undersized? What is the overflow rate?

Step 5: Implement the Fix

Based on the root cause, implement a targeted fix:

  • Bad access path → Create or modify an index, update statistics, add optimizer hints
  • Buffer pool miss → Increase pool size, enable prefetch, reorganize data
  • Lock contention → Change isolation level, add index to reduce scan locks, increase COMMIT frequency
  • Sort overflow → Increase sort heap, optimize query to reduce sort size

Step 6: Verify the Improvement

This step is non-negotiable. Every fix must be proven with before/after metrics. "It feels faster" is not verification. Collect the same metrics you collected in Step 2 and compare quantitatively.

If the metrics do not show improvement, the fix was wrong — roll it back and return to Step 3. If the metrics show improvement in the problem area but regression in another area, you have moved the bottleneck but not eliminated it.

Common Mistakes in Performance Diagnosis

Before we proceed, let us acknowledge the traps that even experienced DBAs fall into:

Mistake 1: Jumping to the fix without diagnosing. "The system is slow, let us add more indexes." Adding an index without knowing the root cause may help, may do nothing, or may make things worse (more index maintenance overhead, more lock contention on index pages). Always diagnose before fixing.

Mistake 2: Collecting too much data. Enabling every trace and monitor generates gigabytes of data but no clarity. Collect only what is relevant to the defined symptom. If the problem is a slow SQL statement, you do not need system-level CPU utilization trends from the past month — you need the access path and runtime statistics for that specific statement.

Mistake 3: Fixing multiple things at once. You change the buffer pool size, add an index, and update the isolation level. Performance improves. Which change helped? You do not know. Worse, if performance degrades later, you do not know which change to roll back. Change one thing at a time, measure, and proceed.

Mistake 4: Ignoring the "Other Wait" category. When CPU + I/O + Lock Wait do not account for all elapsed time, the remainder is "other" — network latency, application processing between SQL calls, OS scheduling delays, or middleware overhead. DBAs sometimes spend weeks optimizing DB2 when the bottleneck is in the application server's garbage collection or a misconfigured connection pool.

Mistake 5: Optimizing the wrong metric. Reducing average response time from 50ms to 30ms is less impactful than reducing 99th percentile response time from 5,000ms to 500ms. The average masks outliers. Always examine the distribution, not just the mean.
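To see how the mean masks the tail, consider a synthetic latency sample (numbers invented for illustration):

```python
import math

def percentile(data, p):
    """Nearest-rank percentile: smallest value at or above p% of observations."""
    ranked = sorted(data)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# 98 healthy requests plus 2 pathological ones.
latencies_ms = [50] * 98 + [5000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

print(mean_ms)  # 149.0, looks tolerable
print(p99_ms)   # 5000, what the unluckiest users actually experience
```

A dashboard showing only the 149 ms mean would rate this workload healthy while 2% of users wait five seconds.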

The Playbook as a Flowchart

[Symptom Reported]
       |
       v
[Define Symptom Precisely]
       |
       v
[Collect Relevant Metrics]
       |
       v
[Is it CPU-bound?] --Yes--> [Section 27.6]
       |No
       v
[Is it I/O-bound?] --Yes--> [Section 27.7]
       |No
       v
[Is it Lock-bound?] --Yes--> [Section 27.8]
       |No
       v
[Is it Memory-bound?] --Yes--> [Section 27.9]
       |No
       v
[External factor: network, application logic, middleware]

27.2 Is It CPU, I/O, Lock, or Memory?

The first diagnostic question is always: where is the time going? DB2 tracks elapsed time and its components — CPU time, I/O wait time, and lock wait time — for every transaction. The gap between elapsed time and these three components is typically "other wait" (network, application processing, scheduling delays).

The Elapsed Time Equation

Elapsed Time = CPU Time + I/O Wait Time + Lock Wait Time + Other Wait Time

If CPU time dominates elapsed time, the problem is CPU-bound. If I/O wait dominates, it is I/O-bound. If lock wait dominates, it is lock-bound. If none of these dominate but elapsed time is high, look for external factors.

Quick Classification on z/OS

On z/OS, the accounting trace (IFCID 3) provides a complete time breakdown:

Class 2 Elapsed Time:     5.230 seconds
Class 2 CPU Time:         0.120 seconds
Class 2 Suspension Time:  5.110 seconds
  - I/O Suspension:       0.340 seconds
  - Lock Suspension:      4.680 seconds
  - Other Suspension:     0.090 seconds

This transaction is lock-bound. Of 5.23 seconds elapsed, 4.68 seconds (89%) was spent waiting for locks. No amount of CPU or I/O tuning will help — you must address the locking.

Quick Classification on LUW

On LUW, use MON_GET_CONNECTION or MON_GET_PKG_CACHE_STMT:

SELECT
    TOTAL_APP_COMMITS,
    TOTAL_CPU_TIME / 1000 AS CPU_MS,  -- CPU time is reported in microseconds
    TOTAL_WAIT_TIME AS WAIT_MS,       -- wait times are already in milliseconds
    LOCK_WAIT_TIME AS LOCK_WAIT_MS,
    (POOL_DATA_P_READS + POOL_INDEX_P_READS) AS PHYSICAL_READS,
    (POOL_DATA_L_READS + POOL_INDEX_L_READS) AS LOGICAL_READS
FROM TABLE(MON_GET_CONNECTION(NULL, -2)) AS T
WHERE APPLICATION_NAME = 'FUNDS_TRANSFER';

Compare CPU_MS, LOCK_WAIT_MS, and compute I/O wait from physical reads and known disk latency. The dominant category is your starting point.

The Traffic Light Dashboard

For Meridian Bank's operations team, we build a simple classification dashboard:

Transaction: FUNDS_TRANSFER
Avg Elapsed: 1200 ms
  CPU:       120 ms  (10%)  [GREEN]
  I/O Wait:  150 ms  (12%)  [GREEN]
  Lock Wait: 890 ms  (74%)  [RED]
  Other:      40 ms  ( 4%)  [GREEN]
Diagnosis: LOCK-BOUND

Green = within normal range. Yellow = elevated but not dominant. Red = dominant time component. The red light tells you where to focus.
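The dashboard's classification rule is mechanical enough to script. A minimal sketch, assuming the "red light" is simply the largest share of elapsed time:

```python
def classify(elapsed_ms, cpu_ms, io_wait_ms, lock_wait_ms):
    """Classify a transaction per the elapsed-time equation:
    elapsed = CPU + I/O wait + lock wait + other wait."""
    other_ms = elapsed_ms - (cpu_ms + io_wait_ms + lock_wait_ms)
    shares = {
        "CPU-bound": cpu_ms / elapsed_ms,
        "I/O-bound": io_wait_ms / elapsed_ms,
        "Lock-bound": lock_wait_ms / elapsed_ms,
        "External/other": other_ms / elapsed_ms,
    }
    # Red light: the dominant time component.
    category, share = max(shares.items(), key=lambda kv: kv[1])
    return category, round(share, 2)

# The FUNDS_TRANSFER numbers from the dashboard above:
print(classify(1200, 120, 150, 890))  # ('Lock-bound', 0.74)
```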


27.3 Reading an Accounting Trace (z/OS)

The accounting trace is the single most important diagnostic tool on z/OS. It captures a complete record of every transaction's resource consumption. Learning to read it is a career-defining skill for any z/OS DBA.

Accounting Classes

DB2 z/OS accounting traces are organized into three classes:

Class 1 — Elapsed Time in DB2. Measures the total time the thread spent in DB2, from allocation to deallocation (for TSO/batch) or from sign-on to sign-off (for CICS/IMS). This includes time when DB2 is not actively processing (e.g., the application is doing its own processing between SQL calls).

Class 2 — In-DB2 Time. Measures only the time DB2 is actively executing on behalf of the thread — the time between receiving a SQL request and returning the result. This is the metric you want for SQL performance analysis.

Class 3 — Suspension Time. Breaks down the time that DB2 was suspended (waiting) into categories: I/O wait, lock wait, latch wait, drain wait, page latch wait, and others.

Key Accounting Fields

Here are the fields every DBA must know:

QWACBSC / QWACESC   - Class 1 begin/end store clock
                      (Class 1 elapsed = QWACESC - QWACBSC)
QWACBJST / QWACEJST - Class 1 begin/end CPU time
                      (Class 1 CPU = QWACEJST - QWACBJST)

QWACASC   - Class 2 elapsed time (in-DB2 elapsed)
QWACAJST  - Class 2 CPU time (in-DB2 CPU)

-- Suspension breakdown (Class 3):
QWACAWTI  - Synchronous I/O wait time (buffer manager I/O)
QWACAWTL  - Lock/latch wait time
QWACAWTE  - Drain wait time
QWACAWTG  - Global lock wait time (data sharing)

-- Resource consumption:
QWACAGET  - Number of getpages (buffer pool page accesses)
QWACARIO  - Number of synchronous reads (physical I/O)
QWACARIP  - Number of synchronous read I/O pages
QWACAPRE  - Number of prefetch requests
QWACALOK  - Number of lock requests
QWACADEA  - Number of deadlocks
QWACATIM  - Number of timeouts

Reading an Accounting Report (DSNACCOR or SAS/DB2PM)

Consider this sample output from a formatted accounting report:

Plan Name: XFER001    Connection Type: CICS
Occurrences: 10,245

                          Total         Average
                          -----         -------
Class 1 Elapsed:       2,340.5 sec      0.228 sec
Class 2 Elapsed:       1,890.2 sec      0.184 sec
Class 2 CPU:              98.4 sec      0.010 sec
Class 2 Suspensions:   1,791.8 sec      0.175 sec
  I/O Suspensions:       234.5 sec      0.023 sec
  Lock Suspensions:    1,498.3 sec      0.146 sec
  Other Suspensions:      59.0 sec      0.006 sec

Getpages:           45,230,100         4,415
Sync I/O:              890,230            87
Lock Requests:      30,450,800         2,972
Lock Escals:                12
Deadlocks:                   3
Timeouts:                   45

Analysis:

  1. Class 2 Elapsed (0.184 sec average): This is the time DB2 spends processing each transaction. 184 ms seems high for a funds transfer.

  2. Class 2 CPU (0.010 sec average): Only 10 ms of CPU per transaction. CPU is not the problem.

  3. Lock Suspensions (0.146 sec average = 79% of Class 2 elapsed): This is the bottleneck. Each transaction spends 146 ms waiting for locks — far more than CPU or I/O.

  4. Getpages (4,415 average): This is reasonable for a multi-table transaction. Not alarming.

  5. Lock Requests (2,972 average): Nearly 3,000 lock requests per transaction is high. This suggests either a scan-heavy access path or an unnecessarily restrictive isolation level.

  6. Lock Escalation (12 total): Only 12 escalations across 10,245 transactions. Not a major issue, but worth investigating.

  7. Deadlocks (3) and Timeouts (45): The 45 timeouts confirm significant lock contention. Deadlocks are low, suggesting the problem is contention rather than circular waits.

Diagnosis: This transaction is lock-bound. The root cause investigation should focus on why 2,972 lock requests are needed per transaction and which table/rows have the most contention.
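The averages in such a report are simply totals divided by occurrences, which makes the analysis easy to script as a sanity check. A sketch using the XFER001 numbers above (the dictionary keys are ad hoc labels, not SMF field names):

```python
# Totals from the formatted accounting report, in seconds.
occurrences = 10_245
totals_sec = {
    "class2_elapsed": 1890.2,
    "class2_cpu": 98.4,
    "io_susp": 234.5,
    "lock_susp": 1498.3,
}

# Per-occurrence averages, converted to milliseconds.
avg_ms = {k: v / occurrences * 1000 for k, v in totals_sec.items()}

# Share of in-DB2 elapsed time spent suspended on locks.
lock_share = totals_sec["lock_susp"] / totals_sec["class2_elapsed"]

print(round(avg_ms["lock_susp"]))  # about 146 ms per transaction
print(round(lock_share * 100))     # about 79% of Class 2 elapsed
```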

Enabling Accounting Traces

-- z/OS: Start accounting trace
-START TRACE(ACCTG) CLASS(1,2,3) DEST(SMF)

-- z/OS: Start for a specific plan
-START TRACE(ACCTG) CLASS(1,2,3) PLAN(XFER001) DEST(SMF)

IFCID Reference for Diagnosis

IFCID   Description               Use Case
3       Accounting record         Primary performance data
53      SQL statement text        Identify problem SQL
107     Lock avoidance            Monitor lock avoidance effectiveness
172     Deadlock                  Deadlock analysis
196     Lock timeout              Timeout analysis
337     Lock escalation           Escalation analysis
318     SQL statement details     Detailed statement metrics

27.4 Using MON_GET Functions for Diagnosis (LUW)

On LUW, the MON_GET_* table functions are the primary diagnostic tools. Introduced in DB2 9.7 as the successors to the snapshot monitor, they provide far more granular and flexible data collection.

The Key MON_GET Functions

MON_GET_PKG_CACHE_STMT — Statement-Level Metrics

This is the single most useful diagnostic function. It returns performance metrics for every SQL statement in the package cache:

SELECT
    SUBSTR(STMT_TEXT, 1, 200) AS SQL_TEXT,
    NUM_EXECUTIONS,
    TOTAL_ACT_TIME AS TOTAL_ELAPSED_MS,    -- activity/wait times are already milliseconds
    TOTAL_CPU_TIME / 1000 AS TOTAL_CPU_MS, -- CPU time is in microseconds
    TOTAL_ACT_WAIT_TIME AS TOTAL_WAIT_MS,
    LOCK_WAIT_TIME AS LOCK_WAIT_MS,
    ROWS_READ,
    ROWS_RETURNED,
    POOL_DATA_L_READS + POOL_INDEX_L_READS AS LOGICAL_READS,
    POOL_DATA_P_READS + POOL_INDEX_P_READS AS PHYSICAL_READS,
    CASE WHEN NUM_EXECUTIONS > 0
         THEN TOTAL_ACT_TIME / NUM_EXECUTIONS
         ELSE 0 END AS AVG_ELAPSED_MS
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
WHERE NUM_EXECUTIONS > 0
ORDER BY TOTAL_ACT_TIME DESC
FETCH FIRST 20 ROWS ONLY;

This query returns the top 20 SQL statements by total elapsed time — the statements that contribute most to the system's overall workload.

MON_GET_ACTIVITY — Currently Executing Statements

When you need to see what is running right now:

SELECT
    APPLICATION_HANDLE,
    UOW_ID,
    ACTIVITY_ID,
    ACTIVITY_STATE,
    ACTIVITY_TYPE,
    SUBSTR(STMT_TEXT, 1, 200) AS SQL_TEXT,
    TOTAL_CPU_TIME / 1000 AS CPU_MS,
    LOCK_WAIT_TIME AS LOCK_WAIT_MS,  -- already in milliseconds
    ROWS_READ,
    ROWS_RETURNED
FROM TABLE(MON_GET_ACTIVITY(NULL, -2)) AS T
WHERE ACTIVITY_STATE IN ('EXECUTING', 'IDLE')
ORDER BY TOTAL_CPU_TIME DESC;

MON_GET_CONNECTION — Connection-Level Metrics

For per-application analysis:

SELECT
    APPLICATION_NAME,
    APPLICATION_HANDLE,
    TOTAL_APP_COMMITS,
    TOTAL_CPU_TIME / 1000 AS CPU_MS,
    TOTAL_WAIT_TIME AS WAIT_MS,      -- wait times are already in milliseconds
    LOCK_WAIT_TIME AS LOCK_WAIT_MS,
    LOCK_WAITS,
    LOCK_TIMEOUTS,
    DEADLOCKS,
    LOCK_ESCALS,
    ROWS_READ,
    ROWS_MODIFIED
FROM TABLE(MON_GET_CONNECTION(NULL, -2)) AS T
ORDER BY TOTAL_CPU_TIME DESC;

MON_GET_BUFFERPOOL — Buffer Pool Metrics

For I/O diagnosis:

SELECT
    BP_NAME,
    POOL_DATA_L_READS,
    POOL_DATA_P_READS,
    POOL_INDEX_L_READS,
    POOL_INDEX_P_READS,
    CASE WHEN (POOL_DATA_L_READS + POOL_INDEX_L_READS) > 0
         THEN DECIMAL(1.0 - (FLOAT(POOL_DATA_P_READS + POOL_INDEX_P_READS) /
              FLOAT(POOL_DATA_L_READS + POOL_INDEX_L_READS)), 5, 4)
         ELSE 0 END AS HIT_RATIO,
    POOL_ASYNC_DATA_READS AS PREFETCH_DATA_READS,
    POOL_ASYNC_INDEX_READS AS PREFETCH_INDEX_READS
FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS T;

Time-Series Collection

For intermittent problems, collect metrics at regular intervals:

-- Create a collection table
CREATE TABLE perf_snapshots (
    snapshot_ts   TIMESTAMP DEFAULT CURRENT TIMESTAMP,
    metric_name   VARCHAR(100),
    metric_value  BIGINT
);

-- Collect every 5 minutes (scheduled via cron or db2 task scheduler)
INSERT INTO perf_snapshots (metric_name, metric_value)
SELECT 'TOTAL_CPU_TIME', SUM(TOTAL_CPU_TIME)
FROM TABLE(MON_GET_DATABASE(-2)) AS T;

INSERT INTO perf_snapshots (metric_name, metric_value)
SELECT 'LOCK_WAIT_TIME', SUM(LOCK_WAIT_TIME)
FROM TABLE(MON_GET_DATABASE(-2)) AS T;

-- etc.

Comparing snapshots at different times reveals when the problem started and what metric changed.
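Because these monitor elements are cumulative counters, the raw snapshot values matter less than the interval-to-interval deltas. A minimal post-processing sketch (the sample rows are invented for illustration):

```python
# Each row mirrors the perf_snapshots table: (timestamp, metric, cumulative value).
snapshots = [
    ("14:00", "LOCK_WAIT_TIME", 1_000),
    ("14:05", "LOCK_WAIT_TIME", 1_050),
    ("14:10", "LOCK_WAIT_TIME", 9_800),   # big jump: problem window starts here
    ("14:15", "LOCK_WAIT_TIME", 18_900),
]

def deltas(rows):
    """Per-interval increase of a cumulative counter."""
    return [(rows[i][0], rows[i][2] - rows[i - 1][2])
            for i in range(1, len(rows))]

for ts, d in deltas(snapshots):
    print(ts, d)
# The 14:10 interval shows the counter jumping, pinpointing the onset.
```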


27.5 Finding the Worst SQL

The Pareto principle applies ruthlessly to database performance: a small number of SQL statements account for the vast majority of resource consumption. Finding and fixing the "top N" statements yields the greatest return on effort.

Top N by Total Elapsed Time

This identifies the statements that consume the most cumulative wall-clock time across all executions:

-- LUW: Top 10 by total elapsed time
SELECT
    SUBSTR(STMT_TEXT, 1, 200) AS SQL_TEXT,
    NUM_EXECUTIONS AS EXECS,
    TOTAL_ACT_TIME AS TOTAL_MS,      -- already in milliseconds
    CASE WHEN NUM_EXECUTIONS > 0
         THEN TOTAL_ACT_TIME / NUM_EXECUTIONS
         ELSE 0 END AS AVG_MS,
    TOTAL_CPU_TIME / 1000 AS CPU_MS,
    ROWS_READ,
    ROWS_RETURNED,
    CASE WHEN ROWS_RETURNED > 0
         THEN ROWS_READ / ROWS_RETURNED
         ELSE ROWS_READ END AS READ_EFFICIENCY
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
WHERE NUM_EXECUTIONS > 100
ORDER BY TOTAL_ACT_TIME DESC
FETCH FIRST 10 ROWS ONLY;

A statement that executes 100,000 times at 50ms each consumes 5,000 seconds of elapsed time — even though each individual execution seems fast. Optimizing this statement's average from 50ms to 10ms saves 4,000 seconds of capacity.
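The capacity arithmetic generalizes to any statement: capacity saved equals executions times the latency reduction. A quick sketch (the helper is hypothetical, not part of any DB2 API):

```python
def capacity_saved_sec(executions, old_avg_ms, new_avg_ms):
    """Wall-clock capacity reclaimed by reducing a statement's average latency."""
    return executions * (old_avg_ms - new_avg_ms) / 1000

# The example above: 100,000 executions improved from 50 ms to 10 ms.
print(capacity_saved_sec(100_000, 50, 10))  # 4000.0 seconds
```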

Top N by CPU Consumption

-- LUW: Top 10 by CPU
SELECT
    SUBSTR(STMT_TEXT, 1, 200) AS SQL_TEXT,
    NUM_EXECUTIONS,
    TOTAL_CPU_TIME / 1000 AS TOTAL_CPU_MS,
    CASE WHEN NUM_EXECUTIONS > 0
         THEN (TOTAL_CPU_TIME / NUM_EXECUTIONS) / 1000
         ELSE 0 END AS AVG_CPU_MS,
    ROWS_READ,
    ROWS_RETURNED
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
WHERE NUM_EXECUTIONS > 100
ORDER BY TOTAL_CPU_TIME DESC
FETCH FIRST 10 ROWS ONLY;

Top N by Getpages (z/OS)

On z/OS, getpages is the primary measure of data access cost. Each getpage represents a buffer pool page access (logical read).

-- z/OS: From accounting trace, look for statements with
-- highest QWACAGET (getpages) values.
-- Use IFCID 318 for per-statement getpage counts.

Top N by Lock Wait

-- LUW: Top 10 by lock wait time
SELECT
    SUBSTR(STMT_TEXT, 1, 200) AS SQL_TEXT,
    NUM_EXECUTIONS,
    LOCK_WAIT_TIME AS LOCK_WAIT_MS,  -- already in milliseconds
    LOCK_WAITS,
    LOCK_ESCALS,
    CASE WHEN NUM_EXECUTIONS > 0
         THEN LOCK_WAIT_TIME / NUM_EXECUTIONS
         ELSE 0 END AS AVG_LOCK_WAIT_MS
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
WHERE LOCK_WAITS > 0
ORDER BY LOCK_WAIT_TIME DESC
FETCH FIRST 10 ROWS ONLY;

The Read Efficiency Ratio

One of the most revealing diagnostic metrics is the ratio of rows read to rows returned:

Read Efficiency = ROWS_READ / ROWS_RETURNED

A value of 1.0 means every row DB2 read was returned to the application — perfect efficiency. A value of 100 means DB2 read 100 rows for every 1 it returned — it is doing 99x more work than necessary, typically because of missing indexes or poor predicate filtering.

-- LUW: Find inefficient statements
SELECT
    SUBSTR(STMT_TEXT, 1, 200) AS SQL_TEXT,
    NUM_EXECUTIONS,
    ROWS_READ,
    ROWS_RETURNED,
    CASE WHEN ROWS_RETURNED > 0
         THEN ROWS_READ / ROWS_RETURNED
         ELSE ROWS_READ END AS READ_EFFICIENCY
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
WHERE NUM_EXECUTIONS > 100
  AND ROWS_RETURNED > 0
ORDER BY (ROWS_READ / ROWS_RETURNED) DESC
FETCH FIRST 10 ROWS ONLY;

On z/OS, the equivalent is the ratio of getpages to rows returned. A high ratio signals that DB2 is scanning large amounts of data to satisfy a selective query — a classic indexing problem.


27.6 CPU-Bound Problems

Symptoms

  • Class 2 CPU time is a large fraction (> 50%) of Class 2 elapsed time
  • Transaction uses few I/O waits and few lock waits
  • Getpage count is high relative to rows returned
  • System CPU utilization is at or near 100%

Common Causes

1. Excessive Getpages (Logical Reads)

Every getpage requires CPU to hash the page address, search the buffer pool chain, and access the page. A query that performs 100,000 getpages when it should perform 100 is wasting 99,900 getpages worth of CPU.

Root causes of excessive getpages:

  • Missing index → tablespace scan instead of index access
  • Index on wrong columns → index scan reads too many entries
  • Stage 2 predicates (z/OS) → predicates not evaluated during index access, requiring data page access
  • No predicate pushdown → filtering happens in the application instead of the database

2. Stage 2 Predicates (z/OS)

On z/OS, predicates are classified as Stage 1 (evaluated during data access, possibly using an index) or Stage 2 (evaluated after data pages are retrieved). Stage 2 predicates are more CPU-expensive because they require fetching the data page even if the row will ultimately be rejected.

Common Stage 2 predicates:

  • WHERE SUBSTR(col, 1, 3) = 'ABC' — function on column prevents index use
  • WHERE col1 + col2 > 100 — arithmetic expression prevents index use
  • WHERE col LIKE '%pattern' — leading wildcard prevents index use
  • WHERE col IN (SELECT ...) — some subquery forms

Rewrite to Stage 1:

-- Stage 2 (function on column)
SELECT * FROM accounts WHERE YEAR(open_date) = 2024;

-- Stage 1 equivalent (range predicate)
SELECT * FROM accounts
WHERE open_date >= '2024-01-01'
  AND open_date < '2025-01-01';

3. Inefficient Join Methods

A nested loop join on a large result set without a supporting index results in repeated tablespace scans of the inner table — quadratic CPU consumption. The fix is typically an index on the join column of the inner table or a different join method (hash join or merge scan join).

4. Unnecessary Sorting

ORDER BY, GROUP BY, DISTINCT, and UNION (not UNION ALL) may require sort operations. Each sort consumes CPU proportional to N * log(N) where N is the number of rows. If the sort can be avoided (e.g., by creating an index that delivers rows in the required order), CPU consumption drops.

Diagnosis Steps for CPU-Bound Problems

  1. Find the top CPU-consuming SQL statement (Section 27.5)
  2. Run EXPLAIN to see the access path
  3. Check for tablespace scans on large tables
  4. Check for Stage 2 predicates (z/OS: the EXPLAIN predicate tables show whether each predicate is Stage 1 or Stage 2)
  5. Check the read efficiency ratio (ROWS_READ vs. ROWS_RETURNED)
  6. Check for unnecessary sorts (EXPLAIN shows SORT operations)
  7. Design an index or rewrite the query to address the root cause
  8. Verify with before/after CPU metrics

27.7 I/O-Bound Problems

Symptoms

  • I/O wait time dominates elapsed time
  • Buffer pool hit ratio is low (< 95% for OLTP, < 80% for batch)
  • High number of synchronous (non-prefetch) reads
  • Disk latency is elevated (> 5ms average)

Common Causes

1. Buffer Pool Misses

The buffer pool is DB2's in-memory cache of data and index pages. When a needed page is not in the buffer pool, DB2 must read it from disk — a synchronous I/O that is orders of magnitude slower than a memory access.

-- LUW: Check buffer pool hit ratios
SELECT
    BP_NAME,
    POOL_DATA_L_READS AS DATA_LOGICAL,
    POOL_DATA_P_READS AS DATA_PHYSICAL,
    POOL_INDEX_L_READS AS INDEX_LOGICAL,
    POOL_INDEX_P_READS AS INDEX_PHYSICAL,
    CASE WHEN (POOL_DATA_L_READS + POOL_INDEX_L_READS) > 0
         THEN DECIMAL(
              (1.0 - FLOAT(POOL_DATA_P_READS + POOL_INDEX_P_READS)
                   / FLOAT(POOL_DATA_L_READS + POOL_INDEX_L_READS))
              * 100, 5, 2)
         ELSE 100.00 END AS HIT_RATIO_PCT
FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS T;

Targets:

  • OLTP buffer pool: > 98% hit ratio
  • Index buffer pool: > 99% hit ratio
  • Batch/DSS buffer pool: > 80% hit ratio (lower is acceptable due to sequential scanning)
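Those targets can be folded into a scripted health check on MON_GET_BUFFERPOOL output. A sketch; the pool-type labels and helper names are illustrative assumptions, not DB2 names:

```python
# Target hit ratios by pool type, in percent, as given in the text.
TARGETS = {"oltp": 98.0, "index": 99.0, "batch": 80.0}

def hit_ratio_pct(logical_reads, physical_reads):
    """Buffer pool hit ratio: fraction of reads satisfied from memory."""
    if logical_reads == 0:
        return 100.0
    return round((1.0 - physical_reads / logical_reads) * 100, 2)

def check_pool(pool_type, logical_reads, physical_reads):
    """Return (hit ratio, whether it meets the target for this pool type)."""
    ratio = hit_ratio_pct(logical_reads, physical_reads)
    return ratio, ratio >= TARGETS[pool_type]

# 1,000,000 logical reads of which 50,000 went to disk: 95%, below the OLTP target.
print(check_pool("oltp", 1_000_000, 50_000))  # (95.0, False)
```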

2. Sequential Prefetch Not Triggering

DB2 uses sequential prefetch to read ahead when it detects a sequential scan pattern. Prefetch reads are asynchronous — they happen in the background while the application processes the current page. If prefetch is not triggering, every page read becomes a synchronous I/O.

Prefetch may not trigger when:

  • The data is fragmented on disk (REORG needed)
  • The buffer pool is too small to hold the prefetch pages
  • The prefetch threshold is set too low
  • The access pattern is not recognized as sequential (random index access)

-- LUW: Check prefetch activity
SELECT
    BP_NAME,
    POOL_ASYNC_DATA_READS AS PREFETCH_DATA,
    POOL_DATA_P_READS AS SYNC_DATA,
    CASE WHEN POOL_DATA_P_READS > 0
         THEN DECIMAL(
              FLOAT(POOL_ASYNC_DATA_READS) /
              FLOAT(POOL_DATA_P_READS) * 100, 5, 2)
         ELSE 0 END AS PREFETCH_PCT
FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS T
WHERE POOL_DATA_P_READS > 0;

For sequential batch processing, prefetch should account for > 80% of physical reads. If it is low, investigate data organization and buffer pool sizing.

3. Random I/O on Cold Data

When an index access results in reading rows that are spread randomly across many data pages, each page access may result in a physical I/O if the page is not in the buffer pool. This is common when:

  • A non-clustering index is used for a range scan
  • The table has never been reorganized by the index key
  • The query accesses a "cold" table (rarely accessed, not in buffer pool)

Solutions:

  • Reorganize the table clustered by the index used for the most critical query
  • Increase the buffer pool size to keep the data pages in memory
  • Use a covering index (all needed columns in the index) to avoid data page access entirely

4. Disk Latency

Modern enterprise storage should deliver < 1ms average read latency for SSD/flash and < 5ms for spinning disk. If latencies are higher:

  • Check the storage subsystem for hot spots
  • Verify that DB2 datasets are on appropriate storage tiers
  • Check for storage-level cache misses

Diagnosis Steps for I/O-Bound Problems

  1. Check buffer pool hit ratios per pool
  2. Identify which tables/tablespaces have the most physical reads
  3. Check prefetch ratios for sequential access patterns
  4. Review access paths for non-clustering index scans on large tables
  5. Check disk latency at the storage level
  6. Consider REORG, buffer pool resizing, or index changes
  7. Verify with before/after I/O metrics

-- LUW: Per-table I/O metrics
SELECT
    TABSCHEMA, TABNAME,
    DATA_OBJECT_P_READS AS DATA_PHYSICAL_READS,
    INDEX_OBJECT_P_READS AS INDEX_PHYSICAL_READS,
    ROWS_READ,
    ROWS_INSERTED + ROWS_UPDATED + ROWS_DELETED AS ROWS_MODIFIED
FROM TABLE(MON_GET_TABLE('', '', -2)) AS T
WHERE DATA_OBJECT_P_READS + INDEX_OBJECT_P_READS > 0
ORDER BY DATA_OBJECT_P_READS + INDEX_OBJECT_P_READS DESC
FETCH FIRST 20 ROWS ONLY;

27.8 Lock-Bound Problems

Symptoms

  • Lock wait time dominates elapsed time
  • High lock timeout or deadlock counts
  • Lock escalation events occurring
  • Throughput drops during specific time windows (batch-OLTP conflict)

Common Causes

1. Lock Contention on Hot Rows

Certain rows are accessed far more frequently than others — the "hot row" problem. Examples: a counter row, a sequence table, a status row that every transaction reads and updates.

Diagnosis:

-- LUW: Identify lock contention by table
SELECT
    TABSCHEMA, TABNAME,
    LOCK_WAITS,
    LOCK_WAIT_TIME AS LOCK_WAIT_MS,  -- already in milliseconds
    LOCK_ESCALS
FROM TABLE(MON_GET_TABLE('', '', -2)) AS T
WHERE LOCK_WAITS > 0
ORDER BY LOCK_WAIT_TIME DESC
FETCH FIRST 10 ROWS ONLY;

Solutions:

  • Use DB2 sequences instead of counter rows
  • Partition the hot row (e.g., multiple counter rows, round-robin access)
  • Reduce the transaction scope that touches the hot row

2. Lock Escalation

See Chapter 26, Section 26.3 for detailed coverage. When escalation converts row locks to a table lock, every concurrent transaction that needs any row in that table must wait.

3. Long-Running Transactions Holding Locks

A transaction that starts at 2:00 PM and does not commit until 2:15 PM holds its locks for 15 minutes. Under RS or RR isolation, every row it read is locked for the entire duration.

Diagnosis:

-- LUW: Find long-running transactions
SELECT
    APPLICATION_HANDLE,
    APPLICATION_NAME,
    UOW_START_TIME,
    TIMESTAMPDIFF(2, CHAR(CURRENT TIMESTAMP - UOW_START_TIME)) AS UOW_SECONDS,
    LOCK_WAITS,
    ROWS_READ,
    ROWS_MODIFIED
FROM TABLE(MON_GET_CONNECTION(NULL, -2)) AS T
WHERE UOW_START_TIME IS NOT NULL
ORDER BY UOW_START_TIME
FETCH FIRST 10 ROWS ONLY;

4. Inappropriate Isolation Level

A transaction running under RR that only needs CS wastes concurrency. Under RR, locks on non-qualifying rows are held, dramatically increasing the lock footprint and contention surface.

Diagnosis Steps for Lock-Bound Problems

  1. Confirm lock wait dominates elapsed time
  2. Identify which tables have the most lock contention
  3. Find the lock holders (what transaction holds the blocking lock?)
  4. Check for lock escalation events
  5. Review isolation levels used by the contending transactions
  6. Check for long-running transactions
  7. Review Chapter 26 strategies: COMMIT frequency, lock ordering, isolation level selection
  8. Verify with before/after lock wait metrics

27.9 Memory-Bound Problems

Symptoms

  • Sort overflows (sorts spilling to disk)
  • Buffer pool hit ratio degrading over time (working set does not fit)
  • EDM pool full condition (z/OS)
  • Catalog cache misses (LUW)
  • Package cache overflow (LUW)

Sort Overflows

When a sort operation's data exceeds the available sort memory, DB2 spills the sort to disk (a "sort overflow"). This converts a memory operation into a disk operation with dramatically higher latency.

z/OS:

-- Check for sort overflows in accounting trace
-- QWACASNO = Number of sorts that overflowed to work file

LUW:

-- Monitor sort overflows
SELECT
    SORT_OVERFLOWS,
    TOTAL_SORTS,
    CASE WHEN TOTAL_SORTS > 0
         THEN DECIMAL(FLOAT(SORT_OVERFLOWS) /
              FLOAT(TOTAL_SORTS) * 100, 5, 2)
         ELSE 0 END AS OVERFLOW_PCT,
    POST_THRESHOLD_SORTS,
    POST_SHRTHRESHOLD_SORTS
FROM TABLE(MON_GET_DATABASE(-2)) AS T;

Target: Sort overflow rate < 2%. If higher:

  • Increase SORTHEAP (LUW) or the sort pool / work file space (z/OS)
  • Optimize queries to reduce the amount of data being sorted
  • Add indexes to avoid sorts entirely (deliver data in the required order)
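The overflow-rate formula is easy to script against the monitor output above; this helper is illustrative, not a DB2 API:

```python
def sort_overflow_pct(sort_overflows, total_sorts):
    """Percentage of sorts that spilled to disk."""
    if total_sorts == 0:
        return 0.0
    return sort_overflows * 100 / total_sorts

# 350 overflows out of 10,000 sorts: 3.5%, above the 2% target.
pct = sort_overflow_pct(350, 10_000)
print(pct, pct < 2.0)  # 3.5 False
```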

EDM Pool (z/OS)

The Environmental Descriptor Manager (EDM) pool caches DBDs (database descriptors), CTs (cursor tables), PTs (package tables), and the dynamic SQL cache. If the EDM pool is too small:

  • DBDs are continuously loaded and unloaded, causing I/O
  • The dynamic SQL cache is too small, causing repeated full prepares
  • Package tables are paged out, causing delays at package allocation

Monitor:

-- z/OS: EDM pool usage is reported in the statistics trace (IFCID 2)
-START TRACE(STAT) CLASS(1) DEST(SMF)
-- Review the statistics report for EDM pool full conditions

Catalog Cache (LUW)

The catalog cache stores frequently accessed catalog information. A high miss rate means DB2 repeatedly reads the system catalog tables:

SELECT
    CAT_CACHE_LOOKUPS,
    CAT_CACHE_INSERTS,
    CAT_CACHE_OVERFLOWS,
    CASE WHEN CAT_CACHE_LOOKUPS > 0
         THEN DECIMAL(1.0 - FLOAT(CAT_CACHE_INSERTS) /
              FLOAT(CAT_CACHE_LOOKUPS), 5, 4)
         ELSE 0 END AS CACHE_HIT_RATIO
FROM TABLE(MON_GET_DATABASE(-2)) AS T;

Target: > 99% hit ratio. If lower, increase CATALOGCACHE_SZ.

Package Cache (LUW)

The package cache stores compiled SQL statements (sections). A high miss rate means DB2 repeatedly compiles the same SQL:

SELECT
    PKG_CACHE_LOOKUPS,
    PKG_CACHE_INSERTS,
    CASE WHEN PKG_CACHE_LOOKUPS > 0
         THEN DECIMAL(1.0 - FLOAT(PKG_CACHE_INSERTS) /
              FLOAT(PKG_CACHE_LOOKUPS), 5, 4)
         ELSE 0 END AS CACHE_HIT_RATIO
FROM TABLE(MON_GET_DATABASE(-2)) AS T;

Target: > 98% hit ratio. If lower, increase PCKCACHESZ or investigate whether the application is using non-parameterized dynamic SQL (each unique SQL text gets a separate cache entry, wasting cache space).
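The cache-space cost of non-parameterized dynamic SQL can be illustrated without a database. This Python sketch (the account IDs and statement text are hypothetical) shows why literal values produce one package cache entry per distinct statement text, while a parameter marker collapses them into one reusable entry:

```python
# Illustrative only: each distinct statement text is a separate cache entry
# (and a separate compilation); parameter markers collapse them into one.

account_ids = [1001, 1002, 1003, 1004, 1005]

literal_stmts = {f"SELECT BALANCE FROM ACCOUNTS WHERE ACCT_ID = {a}"
                 for a in account_ids}
param_stmts   = {"SELECT BALANCE FROM ACCOUNTS WHERE ACCT_ID = ?"
                 for _ in account_ids}

print(len(literal_stmts))  # 5 distinct cache entries, 5 compilations
print(len(param_stmts))    # 1 cache entry, compiled once and reused
```

Five executions as literals consume five cache slots; as parameterized SQL they consume one, which is why the hit ratio recovers once the application switches to parameter markers.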

Diagnosis Steps for Memory-Bound Problems

  1. Check sort overflow rates
  2. Check buffer pool hit ratios (are they declining?)
  3. Check catalog cache and package cache hit ratios (LUW)
  4. Check EDM pool usage (z/OS)
  5. Identify which memory component is under-sized
  6. Increase the relevant memory parameter
  7. Verify with before/after metrics

27.10 The Before/After Proof

The Dangerous Temptation to Skip Verification

In a production crisis, the pressure to declare victory is intense. You make a change, the immediate symptoms disappear, and everyone wants to move on. But without verification:

  • You may have masked the symptom without fixing the root cause. The problem will return.
  • The change may have introduced a subtle regression that will not manifest until peak load on the next business day.
  • You have no documentation to help the next DBA who encounters a similar problem.

The before/after proof is what separates engineering from guessing. It takes 15 minutes to collect post-change metrics. It saves hours the next time something goes wrong.

Why Proof Matters

Without before/after measurement, you cannot know:

  • Whether the change actually helped
  • How much it helped
  • Whether it introduced a regression elsewhere

"It feels faster" is not evidence. Databases are stochastic systems — performance varies naturally. A change that appears to help may simply coincide with lower load. A change that appears to hurt may coincide with higher load.

The Measurement Protocol

  1. Baseline Period: Collect metrics for a representative period before the change. This should cover normal load patterns — not just peak or trough.

  2. Change Implementation: Implement exactly one change. If you change two things simultaneously, you cannot attribute improvement to either one.

  3. Post-Change Period: Collect the same metrics for a comparable period after the change. "Comparable" means similar day of week, similar time of day, similar transaction volume.

  4. Comparison: Calculate the delta for key metrics:
     • Average elapsed time per transaction
     • 95th percentile elapsed time
     • CPU time per transaction
     • I/O per transaction
     • Lock wait time per transaction
     • Throughput (transactions per second)
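The comparison step can be automated over metric snapshots. The following Python sketch is illustrative (the function name and metric keys are ours, not a monitoring API); the sample values echo the change-log example later in this chapter:

```python
# Hypothetical sketch: percentage change between baseline and post-change
# metric snapshots. Negative = improvement for time metrics.

def compare_metrics(before: dict, after: dict) -> dict:
    """Return the percentage change for each metric present in the baseline."""
    deltas = {}
    for name, base in before.items():
        if base == 0:
            continue  # avoid division by zero; report absolute delta instead
        deltas[name] = round((after[name] - base) / base * 100, 1)
    return deltas

baseline = {"avg_elapsed_ms": 184, "cpu_ms": 10, "lock_wait_ms": 146, "tps": 120}
post_fix = {"avg_elapsed_ms": 28,  "cpu_ms": 8,  "lock_wait_ms": 4,   "tps": 145}

print(compare_metrics(baseline, post_fix))
# avg_elapsed_ms: -84.8 — roughly the "85% reduction" reported in the change log
```

A table of deltas like this, attached to the change record, is the core of the before/after proof.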

Statistical Significance

For high-volume transactions, use confidence intervals:

-- LUW: Collect individual statement execution times
-- (requires event monitor or activity tracking)

-- Compare averages with standard deviation
-- Before: AVG = 50ms, STDDEV = 15ms, N = 10000
-- After:  AVG = 35ms, STDDEV = 12ms, N = 10000

-- The improvement of 15ms is real if:
-- (50 - 35) > 2 * SQRT((15^2/10000) + (12^2/10000))
-- 15 > 2 * SQRT(0.0225 + 0.0144)
-- 15 > 2 * 0.192
-- 15 > 0.384  → Yes, statistically significant

For lower-volume transactions, you need longer observation periods to achieve statistical significance.
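The rule of thumb in the comments above can be wrapped in a small helper. This Python sketch (the function name is ours) applies the two-standard-error test to the same numbers:

```python
import math

def significant_improvement(avg_before, sd_before, n_before,
                            avg_after, sd_after, n_after):
    """Return (delta, threshold, significant): the improvement is treated as
    real if it exceeds 2 standard errors of the difference of the means."""
    delta = avg_before - avg_after
    se = math.sqrt(sd_before**2 / n_before + sd_after**2 / n_after)
    return delta, 2 * se, delta > 2 * se

# Worked example from the text: 50ms -> 35ms, N = 10,000 each
delta, threshold, ok = significant_improvement(50, 15, 10000, 35, 12, 10000)
print(delta, round(threshold, 3), ok)  # 15 0.384 True
```

With N = 10,000 the threshold is tiny (0.384 ms), which is why high-volume transactions reach significance quickly and low-volume ones need longer observation windows.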

Regression Testing

After any change, check that other transactions were not negatively affected:

-- LUW: Compare overall system metrics before and after
SELECT
    TOTAL_APP_COMMITS,
    TOTAL_CPU_TIME,
    TOTAL_WAIT_TIME,
    LOCK_WAIT_TIME,
    DEADLOCKS,
    LOCK_TIMEOUTS,
    LOCK_ESCALS,
    POOL_DATA_P_READS + POOL_INDEX_P_READS AS TOTAL_PHYSICAL_READS
FROM TABLE(MON_GET_DATABASE(-2)) AS T;

If overall lock wait time decreased but CPU time increased, you may have shifted the bottleneck rather than eliminated it. This is still progress (lock waits are typically more damaging than CPU consumption), but it should be documented and understood.

The Change Log

Maintain a performance change log:

Date: 2025-11-15
Change: Added index IX_ACCOUNTS_BRANCH on ACCOUNTS(BRANCH_ID)
Reason: FUNDS_TRANSFER query doing tablespace scan under RS isolation,
        causing 45,000 getpages and lock escalation
Before: Avg elapsed 184ms, CPU 10ms, Lock wait 146ms, Getpages 4415
After:  Avg elapsed  28ms, CPU  8ms, Lock wait   4ms, Getpages   85
Improvement: 85% reduction in elapsed time
Side effects: None observed. Insert time on ACCOUNTS increased by
              <1ms due to index maintenance.
Approved by: J. Chen, Senior DBA

27.11 Real Production Scenarios

Scenario 1: The Monday Morning Slowdown

Symptom: Every Monday from 9:00-10:00 AM, the BALANCE_INQUIRY transaction averages 800ms instead of the normal 30ms.

Metrics Collected:

Buffer Pool BP_ACCT:
  Monday 9 AM:  Hit ratio 72%  Physical reads: 45,000/min
  Tuesday 9 AM: Hit ratio 99%  Physical reads:    200/min

Diagnosis: The buffer pool is cold on Monday morning. Over the weekend, batch jobs flush the OLTP working set from the pool. At 9 AM Monday, every account balance lookup causes a physical read.

Root Cause: Weekend batch processes use the same buffer pool as OLTP and evict the OLTP working set.

Fix: Create a dedicated buffer pool for batch tablespaces. Assign OLTP tables to a separate pool that is not polluted by batch sequential scans.

Verification:

Monday after fix:
  Buffer Pool BP_ACCT: Hit ratio 98%  Physical reads: 1,200/min
  BALANCE_INQUIRY: Avg 35ms

Scenario 2: The Mysterious 3 PM Lock Storm

Symptom: Between 3:00-3:30 PM daily, lock timeouts spike from near-zero to 200+ per minute.

Metrics Collected:

-- Found a long-running transaction starting at 2:55 PM daily
SELECT APPLICATION_NAME, UOW_START_TIME,
       ROWS_READ, LOCK_ESCALS
FROM TABLE(MON_GET_CONNECTION(NULL, -2))
WHERE APPLICATION_NAME = 'DAILY_REPORT_GEN';

-- Result: DAILY_REPORT_GEN, started 14:55, rows_read 2M, lock_escals 12

Diagnosis: The daily report generation job starts at 2:55 PM, reads 2 million rows under RS isolation, triggers 12 lock escalations on the TRANSACTIONS table, and blocks all OLTP transactions that need to insert into that table.

Root Cause: Report uses RS isolation (unnecessary — it is a read-only report) on a table shared with OLTP writes.

Fix: Change report to use UR isolation (acceptable for approximate report) or CS with Currently Committed. Schedule the report during low-traffic hours if precision is required.

Verification:

After fix (UR isolation):
  Lock timeouts 3-3:30 PM: 0
  Report runtime: Decreased from 35 min to 22 min (no lock waiting)

Scenario 3: The Gradual Degradation

Symptom: The FUNDS_TRANSFER transaction has been getting 5% slower each month for the past 6 months. Current average: 120ms (was 60ms six months ago).

Metrics Collected:

Month 1:  Avg elapsed 60ms,  Getpages 85,   Rows read 45
Month 3:  Avg elapsed 80ms,  Getpages 340,  Rows read 180
Month 6:  Avg elapsed 120ms, Getpages 890,  Rows read 450

Diagnosis: Getpages and rows read are growing proportionally. The access path is stable (same index), but the data distribution has changed — the index is becoming less selective as data grows.

Root Cause: The TRANSACTIONS table has grown from 50 million to 300 million rows. The index on (ACCT_ID, TRANS_DATE) is being used for a range scan that now returns more rows per account. RUNSTATS has not been updated since month 2.

Fix:

  1. Run RUNSTATS to update catalog statistics
  2. Add TRANS_TYPE to the index to make it more selective: (ACCT_ID, TRANS_DATE, TRANS_TYPE)
  3. Add a WHERE clause to the query to filter earlier in the access path

Verification:

After fix: Avg elapsed 25ms, Getpages 40, Rows read 20
Better than original — the new index is more selective.

Scenario 4: The CPU Spike

Symptom: System CPU utilization jumped from 65% to 95% after a weekend deployment. No new transactions were added.

Metrics Collected:

-- Top CPU consumers before deployment
-- #1: ACCT_LOOKUP — Avg CPU 2ms, 50K execs/hour = 100 CPU-sec/hour
-- #2: FUNDS_XFER  — Avg CPU 8ms, 10K execs/hour = 80 CPU-sec/hour

-- Top CPU consumers after deployment
-- #1: ACCT_LOOKUP — Avg CPU 45ms, 50K execs/hour = 2,250 CPU-sec/hour
-- #2: FUNDS_XFER  — Avg CPU 8ms, 10K execs/hour = 80 CPU-sec/hour

Diagnosis: ACCT_LOOKUP CPU consumption increased 22x after deployment. The access path changed.

Root Cause: The deployment included a REORG of the ACCOUNTS table, which invalidated the cached access path. When DB2 re-optimized, it chose a tablespace scan (because RUNSTATS was not run after the REORG, so the statistics reflected the old data layout).

Fix: Run RUNSTATS on the ACCOUNTS table and its indexes. Rebind the ACCT_LOOKUP package.

Verification:

After RUNSTATS + REBIND:
  ACCT_LOOKUP: Avg CPU 1.5ms (better than before REORG)
  System CPU: 58%

Scenario 5: The Deadlock Epidemic

Symptom: After going live with a new online loan application feature, deadlock count increased from ~2/hour to ~50/hour.

Metrics Collected:

-- z/OS IFCID 147 deadlock records show:
-- Transaction LOAN_APP holds X on CUSTOMERS row, waits for X on ACCOUNTS row
-- Transaction FUNDS_XFER holds X on ACCOUNTS row, waits for X on CUSTOMERS row

Diagnosis: Classic lock ordering violation. LOAN_APP locks CUSTOMERS then ACCOUNTS. FUNDS_XFER locks ACCOUNTS then CUSTOMERS.

Root Cause: The new LOAN_APP code was developed without awareness of the established lock ordering convention (CUSTOMERS → ACCOUNTS → LOANS → TRANSACTIONS).

Fix: Modify LOAN_APP to lock ACCOUNTS before CUSTOMERS (following the established order). Alternatively, since the LOAN_APP update to CUSTOMERS is a simple status change, perform it after the ACCOUNTS update in the same transaction.

Verification:

After fix: Deadlocks returned to ~2/hour baseline

27.12 Meridian Bank Performance Investigation

The Scenario

It is the first business day after month-end. The operations team reports that the FUNDS_TRANSFER transaction, which normally completes in 40ms, is averaging 1,500ms. Customer complaints are flooding the call center. The overnight batch window overran by 2 hours, and the interest calculation job is still running.

This is a full-scale performance emergency. Let us work through the playbook.

Step 1: Define the Symptom

  • What: FUNDS_TRANSFER transaction, normally 40ms, now 1,500ms
  • When: Started at approximately 6:00 AM on the first business day after month-end
  • Context: Overnight batch overran; interest calculation still running
  • Impact: Customer-facing, P1 severity

Step 2: Collect Metrics

-- LUW: Current FUNDS_TRANSFER performance
SELECT
    SUBSTR(STMT_TEXT, 1, 100) AS SQL,
    NUM_EXECUTIONS,
    TOTAL_ACT_TIME / NUM_EXECUTIONS AS AVG_ELAPSED_MS,     -- reported in ms
    TOTAL_CPU_TIME / NUM_EXECUTIONS / 1000 AS AVG_CPU_MS,  -- microseconds to ms
    LOCK_WAIT_TIME / NUM_EXECUTIONS AS AVG_LOCK_WAIT_MS,   -- reported in ms
    ROWS_READ / NUM_EXECUTIONS AS AVG_ROWS_READ
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
WHERE STMT_TEXT LIKE '%UPDATE%ACCOUNTS%'
ORDER BY TOTAL_ACT_TIME DESC
FETCH FIRST 5 ROWS ONLY;

Results:

SQL: UPDATE ACCOUNTS SET BALANCE = ...
Avg Elapsed: 1,480 ms
Avg CPU:        12 ms
Avg Lock Wait: 1,420 ms
Avg Rows Read:    3

Step 3: Identify the Bottleneck Category

Elapsed: 1,480 ms
  CPU:      12 ms  ( 1%)  [GREEN]
  Lock:  1,420 ms  (96%)  [RED]
  I/O:      48 ms  ( 3%)  [GREEN]

Diagnosis: LOCK-BOUND
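The Step 3 arithmetic generalizes into a tiny classifier. A Python sketch (the function and its "OTHER" bucket are ours, not a DB2 utility), applied to the figures above:

```python
def classify_bottleneck(elapsed_ms, cpu_ms, lock_ms, io_ms):
    """Attribute elapsed time to categories and name the dominant one.
    The OTHER bucket catches unaccounted time (network, scheduling, etc.)."""
    parts = {"CPU": cpu_ms, "LOCK": lock_ms, "IO": io_ms}
    parts["OTHER"] = max(elapsed_ms - sum(parts.values()), 0)
    pct = {k: round(v / elapsed_ms * 100) for k, v in parts.items()}
    dominant = max(pct, key=pct.get)
    return dominant, pct

# Figures from the Meridian investigation above
dominant, pct = classify_bottleneck(1480, 12, 1420, 48)
print(dominant, pct)  # LOCK {'CPU': 1, 'LOCK': 96, 'IO': 3, 'OTHER': 0}
```

The same function, run every few minutes against collected metrics, is the seed of the automated bottleneck classification described later in this chapter.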

Step 4: Identify the Root Cause

Lock-bound. Who holds the lock?

-- Find what is blocking FUNDS_TRANSFER transactions
SELECT
    WTR.APPLICATION_NAME AS WAITER,
    REQ.LOCK_OBJECT_TYPE,
    REQ.TABNAME,
    REQ.LOCK_MODE_REQUESTED,
    HLD.APPLICATION_NAME AS HOLDER,
    HLD.APPLICATION_HANDLE AS HOLDER_HANDLE
FROM TABLE(MON_GET_APPL_LOCKWAIT(NULL, -2)) AS REQ
JOIN TABLE(MON_GET_CONNECTION(NULL, -2)) AS WTR
  ON REQ.REQ_APPLICATION_HANDLE = WTR.APPLICATION_HANDLE
JOIN TABLE(MON_GET_CONNECTION(NULL, -2)) AS HLD
  ON REQ.HLD_APPLICATION_HANDLE = HLD.APPLICATION_HANDLE
WHERE REQ.LOCK_OBJECT_TYPE = 'TABLE'
  AND WTR.APPLICATION_NAME = 'FUNDS_TRANSFER';

Result:

WAITER: FUNDS_TRANSFER
TABLE: ACCOUNTS
LOCK_MODE_REQUESTED: IX
HOLDER: INTEREST_CALC
HOLDER_HANDLE: 4587

The interest calculation batch job holds a table-level lock on ACCOUNTS. Every FUNDS_TRANSFER transaction is waiting for that lock.

-- Confirm: Is the interest calc job escalated?
SELECT
    APPLICATION_NAME,
    LOCK_ESCALS,
    ROWS_READ,
    ROWS_MODIFIED,
    UOW_START_TIME,
    TIMESTAMPDIFF(2, CHAR(CURRENT TIMESTAMP - UOW_START_TIME)) AS UOW_SECONDS
FROM TABLE(MON_GET_CONNECTION(4587, -2)) AS T;

Result:

APPLICATION: INTEREST_CALC
LOCK_ESCALS: 3
ROWS_READ: 12,400,000
ROWS_MODIFIED: 8,200,000
UOW_START_TIME: 2025-11-01 23:00:05
UOW_SECONDS: 28,795  (nearly 8 hours, still running!)

Root Cause identified: The interest calculation job has been running for 8 hours (it overran its batch window). It uses RS isolation and has not committed since it started (or commits very infrequently). It accumulated so many row locks that DB2 escalated to a table lock on ACCOUNTS three times. The table lock blocks all OLTP transactions that need to modify any account.

Step 5: Implement the Fix

Immediate fix (triage):

  1. Contact the batch operations team. The interest calculation must be modified to COMMIT every 5,000 rows to prevent lock escalation.
  2. If the job can be safely killed and restarted: terminate the connection holding the table lock, allowing OLTP to resume immediately. The interest calculation can be restarted with the corrected COMMIT frequency.

-- LUW: Force the application (use with caution!)
-- This rolls back the uncommitted work
db2 "FORCE APPLICATION (4587)"
  3. Restart the interest calculation with the fix:
-- Modified interest calculation with frequent COMMIT
-- (simplified pseudocode)
DECLARE GLOBAL TEMPORARY TABLE processed_ids (
    acct_id INTEGER
) ON COMMIT PRESERVE ROWS;

-- Process in batches of 5000
REPEAT
    INSERT INTO SESSION.processed_ids
    SELECT acct_id FROM accounts
    WHERE last_interest_date < '2025-11-01'
      AND acct_id NOT IN (SELECT acct_id FROM SESSION.processed_ids)
    FETCH FIRST 5000 ROWS ONLY;

    UPDATE accounts a
    SET balance = balance * 1.004167,
        last_interest_date = '2025-11-01'
    WHERE acct_id IN (SELECT acct_id FROM SESSION.processed_ids)
      AND last_interest_date < '2025-11-01';

    COMMIT;
UNTIL (SELECT COUNT(*) FROM accounts
       WHERE last_interest_date < '2025-11-01') = 0
END REPEAT;

Long-term fix:

  1. Redesign the interest calculation to use partition-level processing with COMMIT every 5,000 rows
  2. Change the batch job's isolation level from RS to CS (it does not need read stability — each row is processed independently)
  3. Add a monitoring alert that triggers when any transaction's lock count exceeds 50% of the escalation threshold
  4. Establish a hard batch window cutoff: if the job has not completed by 5:30 AM, alert the on-call DBA before OLTP traffic begins

Step 6: Verify the Improvement

After killing the interest calc job and resuming OLTP:

-- Immediate verification
SELECT
    SUBSTR(STMT_TEXT, 1, 100),
    TOTAL_ACT_TIME / NUM_EXECUTIONS AS AVG_ELAPSED_MS,   -- reported in ms
    LOCK_WAIT_TIME / NUM_EXECUTIONS AS AVG_LOCK_WAIT_MS  -- reported in ms
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
WHERE STMT_TEXT LIKE '%UPDATE%ACCOUNTS%'
ORDER BY TOTAL_ACT_TIME DESC
FETCH FIRST 5 ROWS ONLY;

Results after fix:

Avg Elapsed: 38 ms
Avg Lock Wait: 2 ms

Normal performance restored. The restarted interest calculation with frequent COMMITs completed in 3 hours without triggering any lock escalation.

Post-incident metrics:

Before (during incident):
  FUNDS_TRANSFER: 1,480 ms avg, 1,420 ms lock wait
  Lock escalations: 3 (table level on ACCOUNTS)
  Lock timeouts: 847 in 2 hours

After (incident resolved):
  FUNDS_TRANSFER: 38 ms avg, 2 ms lock wait
  Lock escalations: 0
  Lock timeouts: 0

After (interest calc with new design):
  Interest calc runtime: 3 hours (vs 8+ hours before)
  Lock escalations during batch: 0
  OLTP impact during batch: <5ms additional lock wait

Lessons Learned

  1. Batch jobs must COMMIT frequently. The root cause was a batch job that held millions of locks because it never committed.
  2. Lock escalation is the bridge between batch and OLTP problems. A single escalation event converts a local problem into a system-wide outage.
  3. Monitoring must detect rising lock counts before escalation occurs. The 8-hour window between batch start and OLTP impact was a missed opportunity.
  4. The fix was straightforward once the problem was diagnosed. The entire investigation took 15 minutes. The fix (killing the batch job) took 30 seconds. The value of the methodology is in the speed of diagnosis.

Spaced Review: Connecting to Earlier Chapters

From Chapter 15 (Buffer Pool Tuning)

Question: In Chapter 15, you learned about buffer pool sizing and hit ratios. In Scenario 1 of Section 27.11 (Monday Morning Slowdown), the hit ratio dropped to 72% on Monday mornings. How does the buffer pool concept from Chapter 15 connect to the diagnosis methodology in this chapter?

Answer: Chapter 15 taught that buffer pool hit ratio is a measure of how effectively DB2 avoids physical I/O. In the diagnostic methodology, a low hit ratio is the signature metric of an I/O-bound problem (Step 3). The fix — dedicating separate buffer pools for batch and OLTP — comes directly from Chapter 15's guidance on workload-specific buffer pool configuration. The methodology here tells you when to look at buffer pools (when I/O wait dominates); Chapter 15 tells you what to do once you determine the pool is the problem.

From Chapter 22 (EXPLAIN and Access Paths)

Question: In Scenario 4 (CPU Spike), the root cause was a changed access path after REORG without RUNSTATS. How does the EXPLAIN analysis from Chapter 22 fit into the diagnosis workflow?

Answer: EXPLAIN is the primary tool for Step 4 (Identify Root Cause) when the bottleneck is CPU-bound. After Step 3 identifies CPU as the bottleneck and Step 5 from the "Finding Worst SQL" analysis identifies the specific statement, you run EXPLAIN to see the access path. In Scenario 4, EXPLAIN would show a tablespace scan where an index scan was expected. The fix — RUNSTATS followed by REBIND — restores the optimizer's ability to choose the correct access path, which is the central theme of Chapter 22.

From Chapter 23 (RUNSTATS, REORG, and Maintenance)

Question: Scenario 3 (Gradual Degradation) showed a 6-month slowdown caused by data growth without updated statistics. How does the maintenance strategy from Chapter 23 prevent this class of problem entirely?

Answer: Chapter 23 established that RUNSTATS should be run on a regular schedule — at minimum after significant data changes. If Meridian Bank had followed the maintenance protocol from Chapter 23 (monthly RUNSTATS on tables with > 10% data change), the optimizer would have had current statistics throughout the 6-month period. The access path would have been re-evaluated with each REBIND, and the gradual degradation would have been caught and addressed incrementally rather than accumulating into a noticeable performance problem. The diagnosis methodology helps you find problems; the maintenance methodology from Chapter 23 prevents them.


Building a Performance Monitoring Infrastructure

The diagnostic methodology described in this chapter assumes that metrics are available when you need them. In practice, this requires a monitoring infrastructure that collects, stores, and makes accessible the data you need for diagnosis.

What to Collect Continuously

Baseline Metrics (collected every 5 minutes):

  • Database-level: commits, CPU time, wait time, lock wait time, deadlocks, timeouts, escalations
  • Buffer pool: hit ratios per pool, physical reads, prefetch activity
  • Top 20 SQL by elapsed time (snapshot of package cache)

Alert Metrics (checked every 1 minute):

  • Lock escalation count (delta > 0 triggers investigation)
  • Deadlock count (delta > 5/hour triggers alert)
  • Lock timeout count (delta > 10/hour triggers alert)
  • Any transaction running longer than 30 minutes with locks held
  • Buffer pool hit ratio below 90% on OLTP pools

Trend Metrics (collected daily, retained for 12 months):

  • Per-table row counts and modification counts
  • RUNSTATS timestamps per table
  • Batch job elapsed times
  • Peak hour transaction throughput

The Monitoring Table Pattern

CREATE TABLE perf_baseline (
    collection_ts   TIMESTAMP NOT NULL,
    metric_type     VARCHAR(30) NOT NULL,
    metric_name     VARCHAR(100) NOT NULL,
    metric_value    DECIMAL(20,4),
    PRIMARY KEY (collection_ts, metric_type, metric_name)
);

-- Scheduled collection (every 5 minutes)
INSERT INTO perf_baseline
SELECT CURRENT TIMESTAMP, 'DATABASE',
    'TOTAL_CPU_TIME', TOTAL_CPU_TIME
FROM TABLE(MON_GET_DATABASE(-2)) AS T;

When a problem occurs, you query the baseline table to determine when the metric changed and what the normal range is. This transforms "the system is slow" into "lock wait time increased from 500ms/sec to 5,000ms/sec starting at 2:47 PM."
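The "when did it change" question can also be answered programmatically. A minimal Python sketch, assuming the metric samples have already been fetched from perf_baseline into a time-ordered list (the values and window size are illustrative):

```python
from statistics import mean, stdev

def first_anomaly(samples, window=12, sigmas=2.0):
    """Scan a time-ordered series; return the index of the first sample that
    exceeds the trailing window's mean by more than `sigmas` std deviations."""
    for i in range(window, len(samples)):
        base = samples[i - window:i]
        mu, sd = mean(base), stdev(base)
        if sd > 0 and samples[i] > mu + sigmas * sd:
            return i
    return None

# Lock wait ms/sec sampled every 5 minutes (illustrative values)
lock_wait = [500, 520, 480, 510, 495, 505, 490, 515, 500, 498, 512, 503, 5000]
print(first_anomaly(lock_wait))  # index of the 5,000 ms/sec spike -> 12
```

Mapping the returned index back to collection_ts gives the "starting at 2:47 PM" half of the diagnostic statement.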

Automating the Diagnosis

For organizations with mature DB2 operations, the diagnostic playbook can be partially automated:

  1. Automated bottleneck classification. A script runs every 5 minutes, computes the CPU/IO/Lock/Memory breakdown, and flags any metric that exceeds its historical norm by more than 2 standard deviations.

  2. Automated "worst SQL" reporting. A daily report lists the top 10 SQL statements by each metric (elapsed, CPU, lock wait, I/O) with week-over-week comparison. Statements that have degraded more than 20% are flagged for review.

  3. Automated access path verification. After any RUNSTATS or REBIND, EXPLAIN is run on critical SQL statements and the access path is compared against a stored baseline. Changes trigger a review notification.

  4. Automated batch runtime trending. Batch job elapsed times are tracked daily. If any job's runtime exceeds 120% of its 30-day moving average, an alert is generated before the batch window is at risk of overrunning.
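The batch runtime trending rule can be sketched in a few lines. The function name, threshold default, and runtime figures below are illustrative, not a product feature:

```python
def batch_runtime_alert(history_minutes, today_minutes, threshold=1.2):
    """Flag a batch job whose runtime exceeds `threshold` times its moving
    average (the 120% rule described above). Returns (alert, average)."""
    avg = sum(history_minutes) / len(history_minutes)
    return today_minutes > threshold * avg, round(avg, 1)

# 30 days of interest-calc runtimes hovering near 180 minutes (illustrative)
history = [178, 182, 179, 185, 176, 181, 180, 183, 177, 184] * 3
alert, avg = batch_runtime_alert(history, today_minutes=230)
print(alert, avg)  # True 180.5 — 230 min exceeds 120% of the 180.5 min average
```

An alert raised at 230 minutes leaves time to intervene before the batch window overruns into OLTP hours, which is exactly the failure mode in the Meridian investigation.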

The goal is not to eliminate the DBA from the diagnosis process — human judgment is still required for root cause analysis and fix selection — but to ensure that problems are detected early and the relevant data is already collected when the DBA begins the investigation.

Summary

Performance diagnosis is a systematic discipline, not a guessing game. The six-step methodology — define symptom, collect metrics, identify bottleneck category, find root cause, implement fix, verify improvement — applies to every DB2 performance problem regardless of platform or severity.

The four bottleneck categories — CPU, I/O, lock, and memory — provide a framework for directing your investigation. On z/OS, the accounting trace is your primary diagnostic tool, providing a complete breakdown of where time was spent. On LUW, the MON_GET table functions provide equivalent visibility.

Finding the worst SQL is the highest-leverage diagnostic activity. The Pareto principle ensures that a small number of statements account for the majority of resource consumption. Fixing the top 3-5 statements often resolves the performance problem entirely.

The before/after proof is non-negotiable. Every change must be measured quantitatively. Without measurement, you cannot know whether a fix worked, how much it helped, or whether it introduced regressions.

The five production scenarios and the Meridian Bank investigation in this chapter demonstrate that real-world performance problems are rarely exotic. They are caused by missing indexes, stale statistics, inappropriate isolation levels, insufficient COMMIT frequency, and undersized buffer pools. The methodology makes these ordinary problems findable in minutes rather than hours.

In the next chapter, we will explore query optimization techniques — the specific SQL rewrites, index designs, and access path improvements that address the root causes identified by the diagnostic methodology presented here.