Chapter 27: Exercises — Performance Diagnosis Methodology

DataField.Dev

Chapter 27: Exercises — Performance Diagnosis Methodology

These exercises are structured as "DB2 Clinic" problems. Each presents a set of symptoms and metrics. Your job is to diagnose the bottleneck, identify the root cause, and propose a fix.

Diagnosis Fundamentals (Exercises 1-8)

Exercise 1: Classify the Bottleneck

For each of the following metric profiles, classify the bottleneck as CPU-bound, I/O-bound, lock-bound, or memory-bound:

a)

Elapsed: 2,500 ms
CPU: 2,200 ms
I/O Wait: 150 ms
Lock Wait: 50 ms
Getpages: 450,000
Rows Read: 500,000
Rows Returned: 50

b)

Elapsed: 3,000 ms
CPU: 80 ms
I/O Wait: 2,800 ms
Lock Wait: 20 ms
Buffer Pool Hit Ratio: 62%
Physical Reads: 12,000

c)

Elapsed: 5,000 ms
CPU: 100 ms
I/O Wait: 200 ms
Lock Wait: 4,500 ms
Lock Escalations: 3
Deadlocks: 0

d)

Elapsed: 1,800 ms
CPU: 400 ms
I/O Wait: 300 ms
Lock Wait: 50 ms
Sort Overflows: 45 per minute
Sort Heap Used: 100%

Exercise 2: The Elapsed Time Equation

A transaction has the following accounting trace output:

Class 2 Elapsed: 1.250 seconds
Class 2 CPU:     0.080 seconds
I/O Suspension:  0.150 seconds
Lock Suspension: 0.020 seconds

a) What is the "Other Wait" time? b) What category of bottleneck does this suggest? c) What additional information would you collect to identify the root cause of the "Other Wait" time?

Exercise 3: Reading Accounting Traces (z/OS)

Interpret this z/OS accounting trace summary:

Plan Name: STMT_GEN   Connection Type: BATCH
Occurrences: 1

Class 1 Elapsed:    14,400.0 sec  (4 hours)
Class 2 Elapsed:    13,800.0 sec
Class 2 CPU:           890.0 sec
I/O Suspensions:    11,200.0 sec
Lock Suspensions:    1,500.0 sec
Other Suspensions:     210.0 sec

Getpages:        890,000,000
Sync I/O:         45,000,000
Prefetch Requests:   120,000
Lock Requests:    25,000,000
Lock Escals:              5
COMMITs:                  1

a) Is this job CPU-bound, I/O-bound, or lock-bound? b) What is suspicious about the COMMIT count? c) What is the prefetch-to-sync-I/O ratio, and what does it tell you? d) List three recommendations to improve this job's performance.

Exercise 4: MON_GET_PKG_CACHE_STMT Analysis (LUW)

You run the "top SQL by elapsed time" query and get these results:

SQL_TEXT                          EXECS    TOTAL_MS  AVG_MS  CPU_MS  ROWS_READ  ROWS_RET
SELECT * FROM TRANS WHERE...      80000   4800000    60      55      800000000  80000
UPDATE ACCOUNTS SET BAL...        20000    600000    30       5         20000   20000
INSERT INTO AUDIT_LOG...         100000    300000     3       2        100000  100000
SELECT COUNT(*) FROM ACCOUNTS...    500    200000   400      380    50000000     500

a) Which statement should you optimize first and why? b) What is the read efficiency ratio for the first statement? What does it indicate? c) The fourth statement has high CPU relative to its average elapsed time. What does this suggest about its bottleneck category? d) Design an optimization strategy for each of the top two statements.

Exercise 5: Buffer Pool Hit Ratio Analysis

You collect these buffer pool metrics:

BP_NAME      LOGICAL_READS  PHYSICAL_READS  HIT_RATIO  PREFETCH_READS
BP_OLTP        50,000,000        500,000     99.00%        100,000
BP_BATCH        2,000,000      1,200,000     40.00%      1,100,000
BP_INDEX       30,000,000         90,000     99.70%         10,000
BP_TEMP         5,000,000      4,500,000     10.00%      4,000,000

a) Which buffer pool has a problem? b) For BP_BATCH, is the low hit ratio necessarily a problem? Explain. c) What is happening with BP_TEMP? What causes a 10% hit ratio on a temp buffer pool? d) What action would you take for each pool?

Exercise 6: The Read Efficiency Ratio

Explain what each of these read efficiency ratios tells you about the underlying query:

a) ROWS_READ = 100, ROWS_RETURNED = 100, Ratio = 1.0 b) ROWS_READ = 50,000, ROWS_RETURNED = 50, Ratio = 1,000 c) ROWS_READ = 1,000,000, ROWS_RETURNED = 1,000,000, Ratio = 1.0 d) ROWS_READ = 500,000, ROWS_RETURNED = 0, Ratio = undefined (DELETE with no qualifying rows)

Exercise 7: Sort Overflow Diagnosis

Your monitoring shows:

TOTAL_SORTS: 50,000
SORT_OVERFLOWS: 2,500
OVERFLOW_PCT: 5.0%
SORTHEAP: 256 (4KB pages = 1MB)

a) Is the overflow rate acceptable? b) What is the current sort heap size in bytes? c) If you double the sort heap, will the overflow rate necessarily decrease? Why or why not? d) What alternative approaches (besides increasing sort heap) can reduce sort overflows?

Exercise 8: Symptom Definition Practice

Rewrite each of the following vague complaints as a properly defined symptom (specifying what, when, and how much):

a) "The database is slow." b) "Our reports take too long." c) "Users are complaining about the system." d) "The batch window is overrunning."

DB2 Clinic: Diagnostic Scenarios (Exercises 9-20)

Exercise 9: The Overnight Regression

Symptom: The TRANSFER_FUNDS transaction averaged 35ms yesterday. Today it averages 280ms. No application changes were deployed.

Metrics:

Yesterday: Avg elapsed 35ms, CPU 8ms, I/O 12ms, Lock wait 5ms, Getpages 120
Today:     Avg elapsed 280ms, CPU 8ms, I/O 250ms, Lock wait 5ms, Getpages 120

Diagnose the problem. What happened overnight? What additional data would you collect?

Exercise 10: The 2 PM Spike

Symptom: Every day at 2 PM, OLTP response times spike for approximately 30 minutes.

Metrics during the spike:

Buffer Pool BP_OLTP: Hit ratio drops from 99% to 85%
Physical reads/sec: increases 5x
CPU utilization: unchanged
Lock waits: unchanged

What is causing the spike? What would you look for in the job schedule?

Exercise 11: The Growing Query

Symptom: A dashboard query that used to run in 200ms now takes 3 seconds. It has been gradually degrading over 4 months.

Metrics:

Month 1: Elapsed 200ms, Rows read 5,000
Month 2: Elapsed 600ms, Rows read 15,000
Month 3: Elapsed 1,500ms, Rows read 45,000
Month 4: Elapsed 3,000ms, Rows read 120,000

The rows returned is constant at 100. Diagnose the problem.

Exercise 12: The Deadlock Storm

Symptom: After deploying a new "batch import" feature, deadlocks increased from 2/hour to 150/hour.

Deadlock graph from event monitor:

Transaction A (IMPORT_BATCH): Holds X on PRODUCTS row 500, waits for X on INVENTORY row 500
Transaction B (ORDER_PROCESS): Holds X on INVENTORY row 500, waits for X on PRODUCTS row 500

Diagnose and propose a fix.

Exercise 13: The Memory Crisis

Symptom: Overall database throughput has dropped 40% over the past week with no schema or application changes.

Metrics:

This week:
  Buffer pool hit ratio: 92% (was 99%)
  Sort overflows/min: 25 (was 2)
  Package cache hit ratio: 85% (was 99%)
  Active connections: 800 (was 200)

Last week:
  Buffer pool hit ratio: 99%
  Sort overflows/min: 2
  Package cache hit ratio: 99%
  Active connections: 200

Diagnose the problem. What changed?

Exercise 14: The CPU Ceiling

Symptom: The system is at 98% CPU utilization. Response times are increasing across all transactions.

Top 3 SQL by CPU:

#1: SELECT * FROM AUDIT_LOG WHERE event_date > '2025-01-01'
    Avg CPU: 500ms, Execs/hour: 200, Total CPU/hour: 100 sec
    Access path: Table scan on AUDIT_LOG (50M rows)

#2: SELECT acct_id, balance FROM ACCOUNTS WHERE branch_id = :brnch
    Avg CPU: 2ms, Execs/hour: 100,000, Total CPU/hour: 200 sec
    Access path: Index scan on IX_ACCT_BRANCH

#3: SELECT COUNT(*) FROM TRANSACTIONS WHERE trans_date = CURRENT DATE
    Avg CPU: 15ms, Execs/hour: 10,000, Total CPU/hour: 150 sec
    Access path: Index scan on IX_TRANS_DATE

a) Which statement should you optimize first and why? b) What fix would you apply to statement #1? c) Statement #2 already uses an index. Why is it the #2 CPU consumer?

Exercise 15: The Lock Escalation Cascade

Symptom: At 10:30 AM every Monday, OLTP throughput drops from 5,000 TPS to 200 TPS for approximately 15 minutes.

Metrics at 10:30 AM Monday:

Lock escalations: 8 (on TRANSACTIONS table)
Table lock mode: S (held by WEEKLY_REPORT application)
OLTP lock waits: 4,500 (all on TRANSACTIONS table, requesting IX)
WEEKLY_REPORT isolation: RS
WEEKLY_REPORT rows read: 12,000,000

Diagnose and propose a fix.

Exercise 16: Post-REORG Regression

Symptom: After running REORG on the ACCOUNTS table over the weekend, the ACCOUNT_LOOKUP query is 20x slower on Monday.

Before REORG: Avg elapsed 5ms, access path: index scan on IX_ACCT_ID After REORG: Avg elapsed 100ms, access path: table scan

What happened? What step was missed? What is the fix?

Exercise 17: The Intermittent Timeout

Symptom: Approximately 10 times per day, random OLTP transactions receive SQLCODE -911 with reason code 68 (timeout). There is no predictable pattern.

Metrics:

LOCKTIMEOUT setting: 30 seconds
Lock timeouts/day: ~10
Deadlocks/day: 0
Lock escalations/day: 0

What is the most likely cause? How would you investigate?

Exercise 18: The Batch Window Crisis

Symptom: The nightly batch window (11 PM - 5 AM) used to complete at 4 AM. It now finishes at 6:30 AM, overlapping with morning OLTP traffic.

Batch job breakdown:

Job 1: REORG ACCOUNTS          45 min (unchanged)
Job 2: RUNSTATS ACCOUNTS       15 min (unchanged)
Job 3: Interest Calculation   240 min (was 180 min)
Job 4: Statement Generation   120 min (was 90 min)
Job 5: Archive old data        30 min (unchanged)
Total:                        450 min (was 360 min)

a) Which jobs grew the most? b) What is the likely cause of the growth? c) Propose a redesigned batch schedule that fits within the 6-hour window. d) Which jobs can run concurrently and which cannot?

Exercise 19: The False Positive

Symptom: A monitoring alert fires because the buffer pool hit ratio dropped to 88%.

Context:

Normal hit ratio: 99%
Current hit ratio: 88%
Time: 2:00 AM (batch window)
Active batch: Full table scan for month-end reporting
OLTP traffic: Near zero at this hour

Is this a real problem? Explain your reasoning. What would you do?

Exercise 20: The Before/After Challenge

You added an index to improve a query. Here are the before and after metrics:

Before: Avg elapsed 500ms, CPU 200ms, I/O wait 250ms, Lock wait 50ms After: Avg elapsed 300ms, CPU 350ms, I/O wait 20ms, Lock wait 30ms

a) Did overall performance improve? b) Why did CPU time increase when you added an index? c) Is this a net positive or net negative change? d) What additional metric would help you make a definitive judgment?

Advanced Scenarios (Exercises 21-28)

Exercise 21: Multi-Layer Diagnosis

Symptom: End-to-end response time for the TRANSFER_FUNDS API call is 2,000ms. The API team says "the database is slow."

Metrics:

API total response time: 2,000 ms
Network round-trip to DB2: 5 ms x 4 calls = 20 ms
DB2 Class 2 elapsed (sum of 4 SQL calls): 80 ms
Application processing between SQL calls: 1,900 ms

a) Is the database the bottleneck? b) Where should the investigation focus? c) What specific application-level problems could cause 1,900ms of processing?

Exercise 22: The EXPLAIN Disconnect

You have a slow query. EXPLAIN shows an efficient access path (index scan, 50 estimated rows). But the actual runtime reads 500,000 rows and takes 3 seconds.

a) Why might the EXPLAIN estimate be so different from reality? b) What maintenance operation would fix the disconnect? c) What metrics would confirm that the statistics are stale?

Exercise 23: Correlation Analysis

You have time-series data showing that every time BATCH_JOB_A starts, OLTP transaction QUERY_B slows down. But they access different tables.

a) What shared resources could explain this correlation? b) How would you prove or disprove each hypothesis? c) Design a test to isolate the root cause.

Exercise 24: The Capacity Wall

Symptom: As concurrent users increased from 500 to 800, throughput stopped increasing and response times started climbing.

Metrics at 500 users: TPS = 5,000, Avg response = 50ms, CPU = 70% Metrics at 800 users: TPS = 5,200, Avg response = 150ms, CPU = 95%

a) What is the system's effective maximum throughput? b) What is the bottleneck? c) What are your options for increasing capacity?

Exercise 25: z/OS Accounting Deep Dive

Interpret this accounting trace and provide three actionable recommendations:

Plan: ACCTINQ   Conn: CICS   Occurrences: 500,000

                    Total          Average
Class 2 Elapsed:    95,000 sec     0.190 sec
Class 2 CPU:         5,000 sec     0.010 sec
Suspensions:        90,000 sec     0.180 sec
  I/O Susp:         15,000 sec     0.030 sec
  Lock Susp:        70,000 sec     0.140 sec
  Other Susp:        5,000 sec     0.010 sec

Getpages:       500,000,000        1,000
Sync I/O:        25,000,000           50
Prefetch:         1,000,000            2
Lock Requests:  750,000,000        1,500
Lock Escals:            250
Deadlocks:                5
Timeouts:             1,200

Exercise 26: The Regression Test

You made a change (new index) and need to verify it worked. Design a regression test plan:

a) What metrics will you collect before and after? b) How long should each measurement period be? c) How will you ensure the before and after periods are comparable? d) What constitutes "success"? e) What constitutes "regression" that should trigger a rollback?

Exercise 27: Production Triage

It is 9:00 AM Monday. Multiple OLTP transactions are timing out. The CEO's dashboard is not loading. You have 15 minutes to diagnose and resolve the issue.

Write the exact sequence of diagnostic commands you would run (for your platform) in the first 5 minutes. For each command, explain what you are looking for and what the result would tell you.

Exercise 28: End-to-End Investigation

Meridian Bank's LOAN_PAYMENT transaction has degraded from 50ms to 800ms over the past month. Conduct a complete investigation:

a) Define the symptom precisely. b) List the metrics you would collect. c) For each bottleneck category (CPU, I/O, lock, memory), describe the specific metrics that would confirm or rule out that category. d) Propose the most likely root cause (given that the degradation is gradual over one month). e) Describe the fix and the verification procedure.

Challenge Problems (Exercises 29-32)

Exercise 29: The Phantom Bottleneck

A transaction shows the following metrics:

Elapsed: 5,000 ms
CPU: 50 ms
I/O Wait: 100 ms
Lock Wait: 200 ms

The sum of CPU + I/O + Lock = 350ms, but elapsed is 5,000ms. Where are the other 4,650ms?

List five possible causes of this "phantom" time and how you would investigate each one.

Exercise 30: Cross-Subsystem Diagnosis

Meridian Bank's CICS online transaction calls DB2, MQ, and a REST API. The end-to-end response time is 3 seconds. How do you determine which component is the bottleneck? Design a diagnostic approach that measures the time spent in each component.

Exercise 31: Predictive Performance Analysis

Using 6 months of historical performance data, design a methodology to predict when the current hardware will be unable to support the growing workload. What metrics would you trend? What thresholds would trigger a capacity planning exercise?

Exercise 32: The Complete Investigation Report

Using the Meridian Bank month-end scenario from Section 27.12, write a complete post-incident report including: a) Executive summary (3 sentences) b) Timeline of events c) Root cause analysis d) Impact assessment e) Immediate actions taken f) Long-term remediation plan g) Lessons learned h) Monitoring improvements to prevent recurrence