Case Study 2: Midnight Crisis — Diagnosing a Sudden Slowdown
The Alert
It was 11:47 PM on a Thursday when David Park's phone buzzed with a PagerDuty alert:
CRITICAL: MERIDIAN LUW Production
Buffer pool hit ratio MERIDIAN_BP_DATA: 71.3%
Threshold: 93% (3 consecutive readings)
First breach: 23:32:14 EST
David, the on-call DBA for Meridian National Bank, was still awake — reviewing the next day's change management tickets. He immediately opened his laptop and connected to the monitoring dashboard via VPN. What he saw made him sit up straight.
The buffer pool hit ratio chart showed a steep cliff: at 11:28 PM, the ratio had been a normal 98.4%. By 11:32 PM, it had plummeted to 74%. It was still falling, now at 69.2%.
Transaction response times were climbing in parallel. The average had gone from 45 milliseconds to 1,200 milliseconds. Online banking was still operational, but customers would start noticing within minutes.
The Investigation
Step 1: Confirm the Problem (11:49 PM)
David's first action was to confirm the monitoring data was accurate and to get a current snapshot:
db2pd -db MERIDIAN -bufferpools
Output confirmed the crisis:
Bufferpool Name: MERIDIAN_BP_DATA Id: 4 PageSize: 8192
nPages nPagesUsed nPagesClean nPagesDirty HitRatio
50000 50000 234 49766 69.14
The buffer pool was completely full (nPagesUsed = nPages), and 49,766 of 50,000 pages were dirty. The page cleaners were overwhelmed. But the immediate question was: why had a stable buffer pool suddenly degraded?
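The hit ratio db2pd reports is derived from logical versus physical page reads. A minimal sketch of the standard formula (the read counts below are illustrative, not taken from this incident):

```python
def hit_ratio(logical_reads: int, physical_reads: int) -> float:
    """Buffer pool hit ratio: percentage of page requests served from memory."""
    if logical_reads == 0:
        return 100.0
    return (1 - physical_reads / logical_reads) * 100

# Healthy: almost every page request is found in the pool.
print(round(hit_ratio(1_000_000, 16_000), 1))   # 98.4

# During the incident: roughly 3 of every 10 requests go to disk.
print(round(hit_ratio(1_000_000, 308_600), 2))  # 69.14
```

The same ratio can be computed per pool from the `POOL_DATA_L_READS` and `POOL_DATA_P_READS` columns of MON_GET_BUFFERPOOL.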
Step 2: Identify What Changed (11:51 PM)
David checked what was running:
db2pd -db MERIDIAN -applications
He spotted an unfamiliar application:
AppHandl AuthID Application Name Status StatusChgTime
-------- -------- ---------------- ---------- -------------------------
1847 RPTUSER python3.9 UOW Execute 2025-01-16-23.27.45.123456
1201 WEBUSER MeridianOnline UOW Wait 2025-01-16-23.49.02.345678
1202 WEBUSER MeridianOnline UOW Wait 2025-01-16-23.49.05.678901
... (47 more WEBUSER connections in UOW Wait)
Application handle 1847 — python3.9 running under RPTUSER — had started at 11:27 PM, right before the degradation began. The timing was too precise to be coincidental.
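The "what changed right before the cliff" check lends itself to automation. A sketch (the `suspects` helper is hypothetical) that flags applications whose last status change, in the db2pd timestamp format shown above, falls in a window just before the incident:

```python
from datetime import datetime, timedelta

def suspects(apps, incident, window_minutes=10):
    """Return apps whose status changed shortly before the incident began."""
    fmt = "%Y-%m-%d-%H.%M.%S.%f"  # db2pd StatusChgTime format
    start = incident - timedelta(minutes=window_minutes)
    return [name for name, ts in apps
            if start <= datetime.strptime(ts, fmt) <= incident]

apps = [
    ("python3.9",      "2025-01-16-23.27.45.123456"),
    ("MeridianOnline", "2025-01-16-23.49.02.345678"),
]
incident = datetime(2025, 1, 16, 23, 28)  # degradation began ~11:28 PM
print(suspects(apps, incident))  # ['python3.9']
```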
Step 3: Find the SQL (11:53 PM)
David needed to know what SQL this application was executing:
SELECT SUBSTR(STMT_TEXT, 1, 500) AS SQL_TEXT,
ROWS_READ,
ROWS_RETURNED,
TOTAL_ACT_TIME
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS STMT
WHERE EXECUTABLE_ID IN (
SELECT EXECUTABLE_ID
FROM TABLE(MON_GET_ACTIVITY(NULL, -2)) AS ACT
WHERE APPLICATION_HANDLE = 1847
)
ORDER BY TOTAL_ACT_TIME DESC;
The result was revealing:
SQL_TEXT: SELECT T.*, A.ACCOUNT_NAME, A.ACCOUNT_TYPE, C.CUSTOMER_NAME,
C.CUSTOMER_SINCE FROM MERIDIAN.TRANSACTIONS T
JOIN MERIDIAN.ACCOUNTS A ON T.ACCOUNT_ID = A.ACCOUNT_ID
JOIN MERIDIAN.CUSTOMERS C ON A.CUSTOMER_ID = C.CUSTOMER_ID
WHERE T.TRANSACTION_DATE BETWEEN '2024-01-01' AND '2024-12-31'
ORDER BY T.TRANSACTION_DATE, T.TRANSACTION_AMOUNT DESC
ROWS_READ: 145,234,567
ROWS_RETURNED: 0 (still executing)
TOTAL_ACT_TIME: 1,523,456 ms (25+ minutes)
Someone was running a full-year transaction report that was reading 145 million rows, performing a three-table join, and sorting the entire result set. This query was thrashing the buffer pool — every page it read displaced cached pages that the online banking application needed, and those pages then had to be re-read from disk.
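The eviction dynamic shows up even in a toy LRU model (a simplification of Db2's actual page replacement): a small hot working set stays resident indefinitely until a large scan passes through, after which every hot page must be re-read from disk.

```python
from collections import OrderedDict

class LRUPool:
    """Toy LRU buffer pool; counts disk reads (misses)."""
    def __init__(self, npages):
        self.npages, self.pages, self.misses = npages, OrderedDict(), 0

    def read(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)        # hit: refresh recency
        else:
            self.misses += 1                    # miss: fetch from disk
            self.pages[page] = True
            if len(self.pages) > self.npages:
                self.pages.popitem(last=False)  # evict least recently used

pool = LRUPool(npages=1000)
hot = list(range(100))              # OLTP working set, fits easily
for p in hot * 5:
    pool.read(p)
before = pool.misses                # 100: one miss per hot page, then all hits

for p in range(10_000, 13_000):     # big scan, 3x the pool size
    pool.read(p)
for p in hot:                       # OLTP comes back...
    pool.read(p)
print(pool.misses - before - 3000)  # 100: the entire hot set was evicted
```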
Step 4: Assess the Impact (11:55 PM)
Before taking action, David quantified the impact on the online banking workload:
SELECT APPLICATION_NAME,
COUNT(*) AS CONNECTION_COUNT,
AVG(LOCK_WAIT_TIME) AS AVG_LOCK_WAIT_MS,
SUM(ROWS_READ) AS TOTAL_ROWS_READ
FROM TABLE(MON_GET_CONNECTION(NULL, -2)) AS CONN
WHERE APPLICATION_NAME = 'MeridianOnline'
GROUP BY APPLICATION_NAME;
Result: 49 active online banking connections, average lock wait time elevated but not critical. The primary problem was I/O contention, not lock contention. The online transactions were competing with the report query for buffer pool space and disk bandwidth.
David also checked the transaction log:
SELECT DEC(100.0 * TOTAL_LOG_USED /
           (TOTAL_LOG_USED + TOTAL_LOG_AVAILABLE), 5, 2) AS LOG_UTILIZATION_PERCENT,
       TOTAL_LOG_USED / 1024 AS TOTAL_LOG_USED_KB,
       APPLID_HOLDING_OLDEST_XACT
FROM TABLE(MON_GET_TRANSACTION_LOG(-2)) AS LOG;
Log utilization was at 12% — not a concern. The report query was read-only and not consuming log space.
Step 5: Take Action (11:58 PM)
David had three options:
Option A: Terminate the query. This would immediately free buffer pool resources and restore online performance. But RPTUSER might be running an authorized report.
Option B: Reduce the query's priority. Using DB2 workload management, he could throttle the query to reduce its buffer pool impact. But this would not recover the pages already displaced.
Option C: Force the application off. More aggressive than Option A — disconnects the entire application, not just the current SQL statement.
David chose a combination. First, he checked the change management system — no approved change for RPTUSER to run queries during online hours. He then checked with the operations team, who confirmed no one was expecting this query. He decided to force the application:
db2 "FORCE APPLICATION (1847)"
He then watched the buffer pool recovery:
db2pd -db MERIDIAN -bufferpools -repeat 10
Within 30 seconds, the hit ratio began climbing. Within 3 minutes, it was back above 95%. Within 8 minutes, it had returned to 98.1%. The online banking response times dropped back below 50 milliseconds.
Total user-visible impact: approximately 25 minutes of degraded (but not unavailable) online banking performance.
Step 6: Root Cause Analysis (Post-Incident, Friday Morning)
The next morning, David conducted a full root cause analysis.
Who ran the query? The RPTUSER account was a shared service account used by the data analytics team. Badge access logs and the Python script's source IP address traced it to Jennifer Walsh, a data analyst who had recently joined the team.
Why was it running at 11:27 PM? Jennifer had set up a cron job on her workstation to run the query overnight, thinking that late-night execution would avoid impacting daytime users. She did not realize that the online banking system operated 24x7 and that 11 PM to 5 AM was the batch processing window with its own performance requirements.
Why did it cause such severe impact? The query read the entire TRANSACTIONS table for the past year — approximately 145 million rows. With an 8K page size and roughly 10 rows per page, the query needed to read approximately 14.5 million pages. The MERIDIAN_BP_DATA buffer pool had only 50,000 pages. The query effectively cycled through the entire buffer pool content 290 times, evicting every cached page multiple times.
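The back-of-the-envelope math above can be checked directly:

```python
rows = 145_000_000        # rows scanned by the report query
rows_per_page = 10        # ~10 rows per 8K page
bp_pages = 50_000         # MERIDIAN_BP_DATA size in pages

pages_read = rows // rows_per_page
print(pages_read)              # 14,500,000 pages to read
print(pages_read // bp_pages)  # 290 full passes through the buffer pool
```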
Why was there no index to reduce the read volume? There was an index on TRANSACTION_DATE, but the date range '2024-01-01' to '2024-12-31' covered approximately 60% of all rows. The optimizer correctly determined that a table scan was more efficient than an index scan for this selectivity.
Step 7: Preventive Measures
David implemented the following changes:
1. Workload Management (immediate):
-- Create a service class and workload for reporting users
CREATE SERVICE CLASS SC_REPORTING;

CREATE WORKLOAD WL_REPORTING
APPLNAME ('python*')
SERVICE CLASS SC_REPORTING;
-- Create a threshold to prevent buffer-pool-killing queries
CREATE THRESHOLD TH_RPT_ROWS_READ
FOR WORKLOAD WL_REPORTING ACTIVITIES
ENFORCEMENT DATABASE
WHEN SQLROWSREAD > 10000000
STOP EXECUTION;
-- During online hours, queue reporting work beyond two concurrent activities
CREATE THRESHOLD TH_RPT_CONCURRENCY
FOR SERVICE CLASS SC_REPORTING ACTIVITIES
ENFORCEMENT DATABASE
WHEN CONCURRENTDBCOORDACTIVITIES > 2
AND QUEUEDACTIVITIES UNBOUNDED
CONTINUE;
2. Separate buffer pool for reporting (implemented that weekend):
CREATE BUFFERPOOL MERIDIAN_BP_RPT SIZE 20000 PAGESIZE 8192;
-- Assign reporting table spaces to this buffer pool
ALTER TABLESPACE MERIDIAN_RPT_TS BUFFERPOOL MERIDIAN_BP_RPT;
This way, even if a report query thrashed its buffer pool, it would not impact the pool used by online banking.
3. Account governance:
- The shared RPTUSER account was replaced with individual named accounts for each analyst
- Each account was mapped to the WL_REPORTING workload
- Connections from non-approved IP addresses were blocked via CONNECT restrictions
4. Query governance:
- The analytics team was given a read replica for heavy queries instead of the production database
- Queries reading more than 10 million rows were automatically stopped on production
- The data analyst team received training on query impact awareness
5. Enhanced monitoring:
- A new alert was added for any single application reading more than 5 million rows in a 5-minute window
- Buffer pool hit ratio monitoring gained a rate-of-change alert: a drop of more than 5 percentage points in 5 minutes triggers an immediate alert regardless of absolute value
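The rate-of-change rule is simple to implement. A sketch, assuming hit-ratio samples arrive once a minute, with the thresholds from the incident review (a drop of more than 5 points within a 5-minute window):

```python
from collections import deque

def make_roc_alert(window=5, max_drop=5.0):
    """Alert when the hit ratio falls more than max_drop points within the window."""
    samples = deque(maxlen=window)
    def check(hit_ratio):
        samples.append(hit_ratio)
        return max(samples) - hit_ratio > max_drop
    return check

check = make_roc_alert()
readings = [98.4, 98.3, 90.1, 74.0, 69.2]  # one per minute, 11:28 PM onward
alerts = [check(r) for r in readings]
print(alerts)  # [False, False, True, True, True]
```

Unlike the absolute 93% threshold with its three-consecutive-readings requirement, this rule fires on the first sharp drop.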
Key Takeaways
- A single query can bring down an entire workload. Buffer pool thrashing is one of the fastest ways for a read-only query to impact an entire system. The query does not need locks or log space — just I/O volume.
- Time of day is not a sufficient control. "Run it at night" is not a valid strategy when the system operates 24x7. Workload management and resource separation are the proper controls.
- Monitoring must include rate-of-change detection. An absolute threshold of 93% would not have triggered until 4 minutes after the degradation began. A rate-of-change alert (5-point drop in 5 minutes) would have triggered at 11:30 PM — 17 minutes earlier.
- Shared service accounts are dangerous. When everyone uses the same account, accountability is impossible. Named accounts with workload mappings provide both accountability and resource control.
- Buffer pool separation is a defense-in-depth measure. Even with workload management thresholds, having separate buffer pools for different workload types prevents one workload from evicting another's cached data.
Discussion Questions
- Was David's decision to force the application appropriate? What additional information might have changed his decision?
- If the query had been an authorized month-end report critical to regulatory compliance, how would the resolution differ?
- The buffer pool took 8 minutes to recover after the query was terminated. What determines the recovery time, and how could it be shortened?
- Design a workload management configuration that would allow reporting queries to run during online hours but with controlled resource consumption.
- If this incident had occurred on z/OS instead of LUW, what different tools and commands would David have used? Walk through the same investigation steps using z/OS equivalents.