Case Study 01: End-to-End Performance Investigation

Background

Regional Health Alliance (RHA) operates a DB2-based patient management system serving 12 hospitals and 200 clinics. The system handles patient registration, appointment scheduling, lab results, and prescription processing. The database runs on DB2 11.5 for LUW on a 4-node pureScale cluster.

The Incident

On a Tuesday morning at 8:15 AM, the help desk received 47 calls within 30 minutes reporting that the patient lookup screen was "spinning" and eventually timing out. By 8:45 AM, three hospital emergency departments reported that they could not access patient records. The incident was escalated to the DBA team as a Severity 1 production outage.

The Investigation

Step 1: Define the Symptom

What: PATIENT_LOOKUP transaction (searches for a patient by name, date of birth, or medical record number) is timing out. Normal response time is 200ms. Current response time exceeds the 30-second application timeout.

When: Started at approximately 8:00 AM Tuesday. No deployments or maintenance performed since the previous Friday.

Impact: All 12 hospitals affected. Emergency departments unable to access patient records. Clinical safety risk.

Step 2: Collect Metrics

The DBA team ran the diagnostic triage queries immediately:

Overall system health:

SELECT
    TOTAL_CPU_TIME / 1000000 AS CPU_SEC,     -- CPU times are in microseconds
    TOTAL_WAIT_TIME / 1000 AS WAIT_SEC,      -- wait times are in milliseconds
    LOCK_WAIT_TIME / 1000 AS LOCK_WAIT_SEC,  -- milliseconds
    LOCK_ESCALS,
    DEADLOCKS,
    LOCK_TIMEOUTS
FROM TABLE(MON_GET_DATABASE(-2)) AS T;
CPU_SEC:       4,500    (normal range: 3,000-5,000)
WAIT_SEC:    890,000    (normal range: 10,000-20,000)
LOCK_WAIT_SEC: 2,300    (normal range: 500-1,000)
LOCK_ESCALS:       0
DEADLOCKS:         0
LOCK_TIMEOUTS:   312    (normal range: < 5)

Observation: WAIT_SEC is 44x higher than normal, but LOCK_WAIT_SEC is only 2x normal. The wait is not primarily lock-related.
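To confirm where the wait time goes, MON_GET_DATABASE also exposes per-component wait times. A sketch of the drill-down (exact column availability varies by DB2 fix pack; these elements are reported in milliseconds):

```sql
-- Break TOTAL_WAIT_TIME into its major components.
SELECT
    TOTAL_WAIT_TIME / 1000    AS WAIT_SEC,
    POOL_READ_TIME / 1000     AS BP_PHYS_READ_SEC,  -- buffer pool physical read wait
    DIRECT_READ_TIME / 1000   AS DIRECT_READ_SEC,   -- direct (LOB/long data) reads
    LOCK_WAIT_TIME / 1000     AS LOCK_WAIT_SEC,
    LOG_DISK_WAIT_TIME / 1000 AS LOG_DISK_WAIT_SEC  -- commit log writes
FROM TABLE(MON_GET_DATABASE(-2)) AS T;
```

If POOL_READ_TIME dominates WAIT_SEC, the investigation points at buffer pool I/O rather than locking or logging.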

Patient lookup statement metrics:

SELECT
    NUM_EXECUTIONS,
    TOTAL_ACT_TIME AS TOTAL_MS,                    -- already in milliseconds
    TOTAL_ACT_TIME / NULLIF(NUM_EXECUTIONS, 0) AS AVG_MS,
    TOTAL_CPU_TIME / NULLIF(NUM_EXECUTIONS, 0) / 1000 AS AVG_CPU_MS,  -- microseconds to ms
    POOL_DATA_P_READS AS PHYSICAL_READS,
    ROWS_READ,
    ROWS_RETURNED
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
WHERE STMT_TEXT LIKE '%PATIENT_MASTER%LOOKUP%'
ORDER BY TOTAL_ACT_TIME DESC;
NUM_EXECUTIONS: 3,400  (since midnight)
TOTAL_MS:       8,500,000
AVG_MS:         2,500
AVG_CPU_MS:     45
PHYSICAL_READS: 15,200,000
ROWS_READ:      3,400,000 (avg 1,000 per execution)
ROWS_RETURNED:  3,400     (avg 1 per execution)

Analysis of the time breakdown:

Average Elapsed:   2,500 ms
Average CPU:          45 ms  ( 2%)   [GREEN]
I/O Wait:          2,400 ms  (96%)   [RED]
Lock Wait:            55 ms  ( 2%)   [GREEN]

Diagnosis: I/O-BOUND

Each execution reads 1,000 rows but returns only 1, a read efficiency ratio of 1,000:1.
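The same ratio can be computed across the whole package cache to surface the worst offenders. A sketch (the threshold and row limit are illustrative):

```sql
-- Rank cached statements by rows read per row returned.
SELECT
    SUBSTR(STMT_TEXT, 1, 60)             AS STMT_PREFIX,
    ROWS_READ,
    ROWS_RETURNED,
    ROWS_READ / NULLIF(ROWS_RETURNED, 0) AS ROWS_READ_PER_ROW
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
WHERE ROWS_READ > 100000
ORDER BY ROWS_READ_PER_ROW DESC NULLS LAST
FETCH FIRST 10 ROWS ONLY;
```

An OLTP lookup should read close to one row per row returned; ratios in the hundreds or thousands usually mean a missing or unusable index.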

Step 3: Identify the Bottleneck — Buffer Pool Analysis

SELECT
    BP_NAME,
    POOL_DATA_L_READS AS LOGICAL_READS,
    POOL_DATA_P_READS AS PHYSICAL_READS,
    CASE WHEN POOL_DATA_L_READS > 0
         THEN DECIMAL((1.0 - FLOAT(POOL_DATA_P_READS) /
              FLOAT(POOL_DATA_L_READS)) * 100, 5, 2)
         ELSE 100 END AS HIT_RATIO_PCT
FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS T;
BP_NAME     LOGICAL_READS    PHYSICAL_READS   HIT_RATIO
BP_DATA     250,000,000      18,000,000        92.80%
BP_INDEX     50,000,000         200,000        99.60%
BP_TEMP       5,000,000       4,800,000         4.00%

BP_DATA hit ratio is 92.8% — low for OLTP (should be > 98%). But more critically, BP_TEMP has a 4% hit ratio with 4.8 million physical reads.

Why is BP_TEMP so active? Sort overflows.

SELECT
    SORT_OVERFLOWS,
    TOTAL_SORTS,
    DECIMAL(SORT_OVERFLOWS * 100.0 / NULLIF(TOTAL_SORTS, 0), 5, 1) AS OVERFLOW_PCT
FROM TABLE(MON_GET_DATABASE(-2)) AS T;
SORT_OVERFLOWS: 3,200
TOTAL_SORTS:    45,000
OVERFLOW_PCT:   7.1%
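The database-level counters do not say which statements overflowed. MON_GET_PKG_CACHE_STMT carries the same counters per statement, so a sketch like the following attributes the overflows:

```sql
-- Find the statements responsible for sort overflows.
SELECT
    SUBSTR(STMT_TEXT, 1, 60) AS STMT_PREFIX,
    TOTAL_SORTS,
    SORT_OVERFLOWS
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
WHERE SORT_OVERFLOWS > 0
ORDER BY SORT_OVERFLOWS DESC;
```

In this incident the overflows traced back to the same PATIENT_LOOKUP statement, whose table scan produced a large intermediate result to sort.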

Step 4: Identify the Root Cause

The PATIENT_LOOKUP query was examined with EXPLAIN:

EXPLAIN PLAN FOR
SELECT patient_id, patient_name, dob, mrn
FROM patient_master
WHERE UPPER(last_name) = UPPER(:search_name)
  AND dob = :search_dob;

EXPLAIN output revealed:

Access Plan:
  1. TBSCAN on PATIENT_MASTER (25 million rows)
  2. FILTER: UPPER(LAST_NAME) = UPPER('SMITH') AND DOB = '1985-03-15'
  3. SORT for ORDER BY

Root cause identified: Table scan on a 25-million-row table.

The query uses UPPER(last_name), which prevents use of the existing index on last_name. DB2 must scan all 25 million rows, apply the UPPER function to each row, and filter. This explains:

  - the extreme read efficiency (1,000 rows read for every row returned)
  - the massive physical I/O (the buffer pool cannot hold the entire table)
  - the sort overflows (sorting a large intermediate result)

But why did this start today? The query has always used UPPER(). Investigation of the EXPLAIN history revealed that until last Friday, DB2 was using a different access path:

Previous access path (before Friday):
  1. Index scan on IX_PATIENT_UPPER_NAME (expression-based index)
  2. Data page access for matching rows

The expression-based index IX_PATIENT_UPPER_NAME on UPPER(last_name) was dropped on Friday by a schema migration script that was supposed to drop a different index. The drop was not noticed because the QA environment does not have expression-based indexes.
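A direct catalog check would have confirmed the drop in seconds. A sketch (the RHA schema name is taken from the RUNSTATS command used in the fix):

```sql
-- List all indexes currently defined on the patient table.
SELECT INDNAME, COLNAMES, LASTUSED
FROM SYSCAT.INDEXES
WHERE TABSCHEMA = 'RHA'
  AND TABNAME   = 'PATIENT_MASTER';
```

IX_PATIENT_UPPER_NAME would simply be absent from the result.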

Step 5: Implement the Fix

Immediate fix (9:02 AM — 47 minutes after escalation):

CREATE INDEX IX_PATIENT_UPPER_NAME
ON patient_master (UPPER(last_name), dob);

RUNSTATS ON TABLE rha.patient_master
  AND INDEXES ALL;

Index creation took 8 minutes on the 25-million-row table. The system remained slow but functional during the build: on DB2 for LUW, CREATE INDEX builds the index online and does not take an exclusive table lock, so concurrent reads and writes continue.

Post-fix verification query:

EXPLAIN PLAN FOR
SELECT patient_id, patient_name, dob, mrn
FROM patient_master
WHERE UPPER(last_name) = UPPER(:search_name)
  AND dob = :search_dob;
New access path:
  1. Index scan on IX_PATIENT_UPPER_NAME
  2. Data page access for matching rows (estimated: 3 rows)
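Index usage can also be confirmed from runtime counters rather than the plan alone. A sketch joining MON_GET_INDEX to the catalog (verify the exact column set on your DB2 level):

```sql
-- Confirm the new index is actually being scanned.
SELECT S.INDNAME, T.INDEX_SCANS
FROM TABLE(MON_GET_INDEX('RHA', 'PATIENT_MASTER', -2)) AS T
JOIN SYSCAT.INDEXES S
  ON  S.TABSCHEMA = T.TABSCHEMA
  AND S.TABNAME   = T.TABNAME
  AND S.IID       = T.IID;
```

A climbing INDEX_SCANS count for IX_PATIENT_UPPER_NAME confirms the optimizer picked up the recreated index.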

Step 6: Verify the Improvement

Before fix (8:00 AM - 9:00 AM):

PATIENT_LOOKUP: Avg elapsed 2,500ms
Physical reads/min: 250,000
Buffer pool hit ratio: 92.8%
Sort overflows/min: 53
Lock timeouts: 312 in 1 hour
User complaints: 47 in 30 minutes

After fix (9:30 AM - 10:30 AM):

PATIENT_LOOKUP: Avg elapsed 15ms
Physical reads/min: 1,200
Buffer pool hit ratio: 99.4%
Sort overflows/min: 1
Lock timeouts: 0 in 1 hour
User complaints: 0

Performance was restored to better than the pre-incident baseline (15ms vs. historical 200ms) because the new index includes DOB as a second column, making the lookup more precise than the original single-column expression index.

Post-Incident Analysis

Timeline

Time               Event
Friday 6:00 PM     Schema migration script runs, accidentally drops IX_PATIENT_UPPER_NAME
Friday 6:01 PM     No monitoring alert fires (index drop is not monitored)
Saturday - Monday  Low weekend/holiday traffic; degradation not noticed
Tuesday 8:00 AM    Full hospital shift begins; high PATIENT_LOOKUP volume triggers I/O storm
Tuesday 8:15 AM    Help desk receives first complaint
Tuesday 8:45 AM    Severity 1 escalation to DBA team
Tuesday 8:50 AM    DBA team begins diagnosis
Tuesday 8:58 AM    Root cause identified (missing index)
Tuesday 9:02 AM    Index recreation started
Tuesday 9:10 AM    Index creation complete
Tuesday 9:12 AM    RUNSTATS complete, access path restored
Tuesday 9:15 AM    Normal performance confirmed

Root Cause Summary

A schema migration script accidentally dropped an expression-based index (IX_PATIENT_UPPER_NAME) required for the primary patient lookup query. Without the index, DB2 used a table scan on a 25-million-row table, causing massive I/O and sort overflow.

The incident was exacerbated by:

  1. No monitoring for unexpected index drops

  2. Weekend timing: low traffic masked the problem for 3 days

  3. The QA environment did not have the expression-based index, so the migration was tested without it

Preventive Measures Implemented

  1. Index drop monitoring. A nightly job compares the current index inventory against a baseline and alerts on any unexpected drops.

  2. Migration script review. All schema migration scripts must be reviewed by a DBA before execution in production. The review includes verifying the exact objects affected.

  3. QA environment parity. Expression-based indexes and other performance-critical objects are now included in the QA schema definition.

  4. Access path monitoring. A weekly job runs EXPLAIN on all critical SQL statements and compares access paths against a baseline. Any access path change triggers a review.

  5. Performance regression testing. The deployment pipeline now includes a performance test that measures critical transaction response times before and after schema changes.
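Measures 1 and 4 lend themselves to plain SQL sketches. For index drop monitoring, a baseline table compared with EXCEPT flags any disappearance (the dba schema and table name are illustrative):

```sql
-- One-time: snapshot the index inventory as the baseline.
CREATE TABLE dba.index_baseline AS
  (SELECT INDSCHEMA, INDNAME, TABSCHEMA, TABNAME
   FROM SYSCAT.INDEXES) WITH DATA;

-- Nightly: any row returned is an index that existed in the
-- baseline but is now missing from the catalog. Alert on it.
SELECT INDSCHEMA, INDNAME, TABSCHEMA, TABNAME
FROM dba.index_baseline
EXCEPT
SELECT INDSCHEMA, INDNAME, TABSCHEMA, TABNAME
FROM SYSCAT.INDEXES;
```

For access path monitoring, the standard explain facility can capture plan operators without executing the statement (the explain tables must already exist; DB2 ships a DDL script that creates them):

```sql
-- Explain, without executing, a critical statement.
SET CURRENT EXPLAIN MODE EXPLAIN;
SELECT patient_id, patient_name, dob, mrn
FROM patient_master
WHERE UPPER(last_name) = 'SMITH'
  AND dob = '1985-03-15';
SET CURRENT EXPLAIN MODE NO;

-- Inspect the operators of the most recent plan.
-- A TBSCAN where the baseline recorded IXSCAN should trigger a review.
SELECT OPERATOR_TYPE
FROM EXPLAIN_OPERATOR
WHERE EXPLAIN_TIME = (SELECT MAX(EXPLAIN_TIME) FROM EXPLAIN_OPERATOR);
```

The baseline inventory and recorded plans are refreshed whenever an approved migration changes the schema.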

Discussion Questions

  1. The DBA team took 8 minutes to diagnose the problem (8:50 AM to 8:58 AM). Walk through the exact queries they would have run and the reasoning at each step.

  2. Could this incident have been prevented entirely? Design a monitoring system that would have detected the problem on Friday evening.

  3. If index creation had taken 2 hours instead of 8 minutes (due to a larger table), what interim measures could the DBA team have taken to restore service?

  4. The lock timeout count was 312 — much higher than normal — even though the problem was I/O-bound, not lock-bound. Explain this secondary effect.