Chapter 27: Key Takeaways — Performance Diagnosis Methodology
The Big Ideas
1. Diagnosis Is a Method, Not a Guess
Follow the six-step playbook every time: Define symptom, Collect metrics, Identify bottleneck category, Find root cause, Implement fix, Verify improvement. Skipping steps leads to wasted effort and incorrect fixes.
2. Every Problem Falls Into One of Four Categories
- CPU-bound: High getpages, Stage 2 predicates, inefficient access paths
- I/O-bound: Low buffer pool hit ratio, no prefetch, fragmented data
- Lock-bound: High lock wait time, escalation, long-running holders
- Memory-bound: Sort overflows, cache misses, pool thrashing
Classify first, then drill down. An I/O problem is not solved by adding indexes (that is a CPU problem fix). A lock problem is not solved by increasing buffer pools.
3. Find the Worst SQL First
The Pareto principle is ruthless: a small number of SQL statements account for most resource consumption. Finding and fixing the top 3-5 statements yields the greatest return on investment.
4. The Before/After Proof Is Non-Negotiable
Every change must be verified with quantitative metrics. "It feels faster" is not evidence. Collect the same metrics before and after, under comparable conditions, and compare.
5. The Elapsed Time Equation Tells You Where to Look
Elapsed = CPU + I/O Wait + Lock Wait + Other Wait
Whichever component dominates is your starting point. Do not investigate CPU when lock wait is 90% of elapsed time.
Essential Diagnostic Queries
z/OS: Read the Accounting Trace
-START TRACE(ACCTG) CLASS(1,2,3) DEST(SMF)
Key fields:
QWACESC = Class 2 elapsed time
QWACESCC = Class 2 CPU time
QWACAWTI = I/O suspension time
QWACAWTL = Lock suspension time
QWACAGET = Getpages
LUW: Top SQL by Elapsed Time
SELECT SUBSTR(STMT_TEXT, 1, 200), NUM_EXECUTIONS,
TOTAL_ACT_TIME / NUM_EXECUTIONS / 1000 AS AVG_MS,
TOTAL_CPU_TIME / NUM_EXECUTIONS / 1000 AS AVG_CPU_MS,
ROWS_READ, ROWS_RETURNED
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
WHERE NUM_EXECUTIONS > 0
ORDER BY TOTAL_ACT_TIME DESC
FETCH FIRST 10 ROWS ONLY;
LUW: Buffer Pool Hit Ratios
SELECT BP_NAME,
DECIMAL((1.0 - FLOAT(POOL_DATA_P_READS + POOL_INDEX_P_READS)
/ FLOAT(POOL_DATA_L_READS + POOL_INDEX_L_READS)) * 100, 5, 2)
AS HIT_RATIO_PCT
FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS T
WHERE POOL_DATA_L_READS + POOL_INDEX_L_READS > 0;
LUW: Lock Contention
SELECT TABSCHEMA, TABNAME, LOCK_WAITS,
LOCK_WAIT_TIME / 1000 AS LOCK_WAIT_SEC, LOCK_ESCALS
FROM TABLE(MON_GET_TABLE('', '', -2)) AS T
WHERE LOCK_WAITS > 0
ORDER BY LOCK_WAIT_TIME DESC;
The Read Efficiency Ratio
Ratio = ROWS_READ / ROWS_RETURNED
| Ratio | Interpretation | Action |
|---|---|---|
| 1-5 | Efficient | No action needed |
| 5-100 | Moderate inefficiency | Review index usage |
| 100-10,000 | Severe inefficiency | Missing or wrong index |
| > 10,000 | Table scan on selective query | Create targeted index |
Common Root Causes by Category
CPU-Bound
- Missing index (table scan)
- Stage 2 predicates (functions on columns)
- Stale statistics (wrong access path)
- Inefficient join method (nested loop without index)
I/O-Bound
- Buffer pool too small
- Prefetch not triggering (data fragmented, needs REORG)
- Non-clustering index scan (random I/O)
- Working set exceeds buffer pool capacity
Lock-Bound
- Long-running transaction holding locks
- Lock escalation (batch job without frequent COMMIT)
- Inappropriate isolation level (RR when CS would suffice)
- Deadlock from inconsistent lock ordering
Memory-Bound
- Sort heap too small (sort overflows)
- Buffer pool too small (hit ratio declining)
- EDM pool full (z/OS — too many packages/DBDs)
- Catalog/package cache too small (LUW)
The Verification Checklist
Before declaring a fix successful, confirm:
- [ ] The specific metric that was the bottleneck has improved
- [ ] Overall elapsed time has improved
- [ ] No other metrics have regressed
- [ ] The improvement is consistent across multiple measurement periods
- [ ] The improvement holds under peak load (not just off-peak)
- [ ] The change is documented in the performance change log
Platform Differences
| Diagnostic Tool | z/OS | LUW |
|---|---|---|
| Primary diagnostic | Accounting trace (IFCID 3) | MON_GET_PKG_CACHE_STMT |
| Time breakdown | Class 1/2/3 accounting | MON_GET_CONNECTION totals |
| Per-statement metrics | IFCID 318 | MON_GET_PKG_CACHE_STMT |
| Buffer pool stats | DISPLAY BUFFERPOOL | MON_GET_BUFFERPOOL |
| Lock analysis | DISPLAY DATABASE LOCKS | MON_GET_APPL_LOCKWAIT |
| Real-time monitoring | OMEGAMON | db2top, MON_GET_ACTIVITY |
| Access path | EXPLAIN + plan_table | EXPLAIN + db2exfmt |
Connection to Meridian Bank Project
The month-end performance investigation (Section 27.12) demonstrated: - Symptom definition: FUNDS_TRANSFER at 1,500ms vs. normal 40ms - Bottleneck identification: Lock-bound (96% lock wait) - Root cause: Batch interest calculation with no COMMIT, causing table-level lock escalation - Fix: Kill the batch job, restart with COMMIT every 5,000 rows - Verification: 38ms after fix, 0 lock escalations - Prevention: Monitoring alerts for rising lock counts, hard batch window cutoff