Chapter 27: Key Takeaways — Performance Diagnosis Methodology

DataField.Dev

Chapter 27: Key Takeaways — Performance Diagnosis Methodology

The Big Ideas

1. Diagnosis Is a Method, Not a Guess

Follow the six-step playbook every time: Define symptom, Collect metrics, Identify bottleneck category, Find root cause, Implement fix, Verify improvement. Skipping steps leads to wasted effort and incorrect fixes.

2. Every Problem Falls Into One of Four Categories

CPU-bound: High getpages, Stage 2 predicates, inefficient access paths
I/O-bound: Low buffer pool hit ratio, no prefetch, fragmented data
Lock-bound: High lock wait time, escalation, long-running holders
Memory-bound: Sort overflows, cache misses, pool thrashing

Classify first, then drill down. An I/O problem is not solved by adding indexes (that is a CPU problem fix). A lock problem is not solved by increasing buffer pools.

3. Find the Worst SQL First

The Pareto principle is ruthless: a small number of SQL statements account for most resource consumption. Finding and fixing the top 3-5 statements yields the greatest return on investment.

4. The Before/After Proof Is Non-Negotiable

Every change must be verified with quantitative metrics. "It feels faster" is not evidence. Collect the same metrics before and after, under comparable conditions, and compare.

5. The Elapsed Time Equation Tells You Where to Look

Elapsed = CPU + I/O Wait + Lock Wait + Other Wait

Whichever component dominates is your starting point. Do not investigate CPU when lock wait is 90% of elapsed time.

Essential Diagnostic Queries

z/OS: Read the Accounting Trace

-START TRACE(ACCTG) CLASS(1,2,3) DEST(SMF)

Key fields:
  QWACESC  = Class 2 elapsed time
  QWACESCC = Class 2 CPU time
  QWACAWTI = I/O suspension time
  QWACAWTL = Lock suspension time
  QWACAGET = Getpages

LUW: Top SQL by Elapsed Time

SELECT SUBSTR(STMT_TEXT, 1, 200), NUM_EXECUTIONS,
       TOTAL_ACT_TIME / NUM_EXECUTIONS / 1000 AS AVG_MS,
       TOTAL_CPU_TIME / NUM_EXECUTIONS / 1000 AS AVG_CPU_MS,
       ROWS_READ, ROWS_RETURNED
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
WHERE NUM_EXECUTIONS > 0
ORDER BY TOTAL_ACT_TIME DESC
FETCH FIRST 10 ROWS ONLY;

LUW: Buffer Pool Hit Ratios

SELECT BP_NAME,
       DECIMAL((1.0 - FLOAT(POOL_DATA_P_READS + POOL_INDEX_P_READS)
            / FLOAT(POOL_DATA_L_READS + POOL_INDEX_L_READS)) * 100, 5, 2)
            AS HIT_RATIO_PCT
FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS T
WHERE POOL_DATA_L_READS + POOL_INDEX_L_READS > 0;

LUW: Lock Contention

SELECT TABSCHEMA, TABNAME, LOCK_WAITS,
       LOCK_WAIT_TIME / 1000 AS LOCK_WAIT_SEC, LOCK_ESCALS
FROM TABLE(MON_GET_TABLE('', '', -2)) AS T
WHERE LOCK_WAITS > 0
ORDER BY LOCK_WAIT_TIME DESC;

The Read Efficiency Ratio

Ratio = ROWS_READ / ROWS_RETURNED

Ratio	Interpretation	Action
1-5	Efficient	No action needed
5-100	Moderate inefficiency	Review index usage
100-10,000	Severe inefficiency	Missing or wrong index
> 10,000	Table scan on selective query	Create targeted index

Common Root Causes by Category

CPU-Bound

Missing index (table scan)
Stage 2 predicates (functions on columns)
Stale statistics (wrong access path)
Inefficient join method (nested loop without index)

I/O-Bound

Buffer pool too small
Prefetch not triggering (data fragmented, needs REORG)
Non-clustering index scan (random I/O)
Working set exceeds buffer pool capacity

Lock-Bound

Long-running transaction holding locks
Lock escalation (batch job without frequent COMMIT)
Inappropriate isolation level (RR when CS would suffice)
Deadlock from inconsistent lock ordering

Memory-Bound

Sort heap too small (sort overflows)
Buffer pool too small (hit ratio declining)
EDM pool full (z/OS — too many packages/DBDs)
Catalog/package cache too small (LUW)

The Verification Checklist

Before declaring a fix successful, confirm:

[ ] The specific metric that was the bottleneck has improved
[ ] Overall elapsed time has improved
[ ] No other metrics have regressed
[ ] The improvement is consistent across multiple measurement periods
[ ] The improvement holds under peak load (not just off-peak)
[ ] The change is documented in the performance change log

Platform Differences

Diagnostic Tool	z/OS	LUW
Primary diagnostic	Accounting trace (IFCID 3)	MON_GET_PKG_CACHE_STMT
Time breakdown	Class 1/2/3 accounting	MON_GET_CONNECTION totals
Per-statement metrics	IFCID 318	MON_GET_PKG_CACHE_STMT
Buffer pool stats	DISPLAY BUFFERPOOL	MON_GET_BUFFERPOOL
Lock analysis	DISPLAY DATABASE LOCKS	MON_GET_APPL_LOCKWAIT
Real-time monitoring	OMEGAMON	db2top, MON_GET_ACTIVITY
Access path	EXPLAIN + plan_table	EXPLAIN + db2exfmt

Connection to Meridian Bank Project

The month-end performance investigation (Section 27.12) demonstrated: - Symptom definition: FUNDS_TRANSFER at 1,500ms vs. normal 40ms - Bottleneck identification: Lock-bound (96% lock wait) - Root cause: Batch interest calculation with no COMMIT, causing table-level lock escalation - Fix: Kill the batch job, restart with COMMIT every 5,000 rows - Verification: 38ms after fix, 0 lock escalations - Prevention: Monitoring alerts for rising lock counts, hard batch window cutoff