Chapter 27: Key Takeaways — Performance Diagnosis Methodology


The Big Ideas

1. Diagnosis Is a Method, Not a Guess

Follow the six-step playbook every time: Define symptom, Collect metrics, Identify bottleneck category, Find root cause, Implement fix, Verify improvement. Skipping steps leads to wasted effort and incorrect fixes.

2. Every Problem Falls Into One of Four Categories

  • CPU-bound: High getpages, Stage 2 predicates, inefficient access paths
  • I/O-bound: Low buffer pool hit ratio, no prefetch, fragmented data
  • Lock-bound: High lock wait time, escalation, long-running holders
  • Memory-bound: Sort overflows, cache misses, pool thrashing

Classify first, then drill down. An I/O problem is not solved by adding indexes (that is a CPU problem fix). A lock problem is not solved by increasing buffer pools.

3. Find the Worst SQL First

The Pareto principle is ruthless: a small number of SQL statements account for most resource consumption. Finding and fixing the top 3-5 statements yields the greatest return on investment.

4. The Before/After Proof Is Non-Negotiable

Every change must be verified with quantitative metrics. "It feels faster" is not evidence. Collect the same metrics before and after, under comparable conditions, and compare.

5. The Elapsed Time Equation Tells You Where to Look

Elapsed = CPU + I/O Wait + Lock Wait + Other Wait

Whichever component dominates is your starting point. Do not investigate CPU when lock wait is 90% of elapsed time.


Essential Diagnostic Queries

z/OS: Read the Accounting Trace

-START TRACE(ACCTG) CLASS(1,2,3) DEST(SMF)

Key fields:
  QWACESC  = Class 2 elapsed time
  QWACESCC = Class 2 CPU time
  QWACAWTI = I/O suspension time
  QWACAWTL = Lock suspension time
  QWACAGET = Getpages

LUW: Top SQL by Elapsed Time

SELECT SUBSTR(STMT_TEXT, 1, 200), NUM_EXECUTIONS,
       TOTAL_ACT_TIME / NUM_EXECUTIONS / 1000 AS AVG_MS,
       TOTAL_CPU_TIME / NUM_EXECUTIONS / 1000 AS AVG_CPU_MS,
       ROWS_READ, ROWS_RETURNED
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
WHERE NUM_EXECUTIONS > 0
ORDER BY TOTAL_ACT_TIME DESC
FETCH FIRST 10 ROWS ONLY;

LUW: Buffer Pool Hit Ratios

SELECT BP_NAME,
       DECIMAL((1.0 - FLOAT(POOL_DATA_P_READS + POOL_INDEX_P_READS)
            / FLOAT(POOL_DATA_L_READS + POOL_INDEX_L_READS)) * 100, 5, 2)
            AS HIT_RATIO_PCT
FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS T
WHERE POOL_DATA_L_READS + POOL_INDEX_L_READS > 0;

LUW: Lock Contention

SELECT TABSCHEMA, TABNAME, LOCK_WAITS,
       LOCK_WAIT_TIME / 1000 AS LOCK_WAIT_SEC, LOCK_ESCALS
FROM TABLE(MON_GET_TABLE('', '', -2)) AS T
WHERE LOCK_WAITS > 0
ORDER BY LOCK_WAIT_TIME DESC;

The Read Efficiency Ratio

Ratio = ROWS_READ / ROWS_RETURNED
Ratio Interpretation Action
1-5 Efficient No action needed
5-100 Moderate inefficiency Review index usage
100-10,000 Severe inefficiency Missing or wrong index
> 10,000 Table scan on selective query Create targeted index

Common Root Causes by Category

CPU-Bound

  • Missing index (table scan)
  • Stage 2 predicates (functions on columns)
  • Stale statistics (wrong access path)
  • Inefficient join method (nested loop without index)

I/O-Bound

  • Buffer pool too small
  • Prefetch not triggering (data fragmented, needs REORG)
  • Non-clustering index scan (random I/O)
  • Working set exceeds buffer pool capacity

Lock-Bound

  • Long-running transaction holding locks
  • Lock escalation (batch job without frequent COMMIT)
  • Inappropriate isolation level (RR when CS would suffice)
  • Deadlock from inconsistent lock ordering

Memory-Bound

  • Sort heap too small (sort overflows)
  • Buffer pool too small (hit ratio declining)
  • EDM pool full (z/OS — too many packages/DBDs)
  • Catalog/package cache too small (LUW)

The Verification Checklist

Before declaring a fix successful, confirm:

  • [ ] The specific metric that was the bottleneck has improved
  • [ ] Overall elapsed time has improved
  • [ ] No other metrics have regressed
  • [ ] The improvement is consistent across multiple measurement periods
  • [ ] The improvement holds under peak load (not just off-peak)
  • [ ] The change is documented in the performance change log

Platform Differences

Diagnostic Tool z/OS LUW
Primary diagnostic Accounting trace (IFCID 3) MON_GET_PKG_CACHE_STMT
Time breakdown Class 1/2/3 accounting MON_GET_CONNECTION totals
Per-statement metrics IFCID 318 MON_GET_PKG_CACHE_STMT
Buffer pool stats DISPLAY BUFFERPOOL MON_GET_BUFFERPOOL
Lock analysis DISPLAY DATABASE LOCKS MON_GET_APPL_LOCKWAIT
Real-time monitoring OMEGAMON db2top, MON_GET_ACTIVITY
Access path EXPLAIN + plan_table EXPLAIN + db2exfmt

Connection to Meridian Bank Project

The month-end performance investigation (Section 27.12) demonstrated: - Symptom definition: FUNDS_TRANSFER at 1,500ms vs. normal 40ms - Bottleneck identification: Lock-bound (96% lock wait) - Root cause: Batch interest calculation with no COMMIT, causing table-level lock escalation - Fix: Kill the batch job, restart with COMMIT every 5,000 rows - Verification: 38ms after fix, 0 lock escalations - Prevention: Monitoring alerts for rising lock counts, hard batch window cutoff