Case Study 2: Join Performance — When 3 Seconds Became 3 Hours
Background
Meridian National Bank's monthly compliance report had run reliably for two years. Every first business day of the month, a batch job generated a Suspicious Activity Report (SAR) that cross-referenced transactions, customer profiles, branch data, and employee records. The report typically completed in 3 to 4 seconds.
In October 2024, the report ran for 3 hours and 12 minutes before the batch scheduler killed it. The compliance team could not file their regulatory submission on time. Marcus Webb, a senior DBA, was paged at 5:47 AM to investigate.
The Query
The offending SQL was a five-table join that had been written by a contractor two years earlier and never revisited:
```sql
SELECT c.CUSTOMER_ID,
       c.FIRST_NAME || ' ' || c.LAST_NAME AS CUSTOMER_NAME,
       c.CITY,
       c.STATE,
       a.ACCOUNT_ID,
       a.ACCOUNT_TYPE,
       b.BRANCH_NAME,
       e.FIRST_NAME || ' ' || e.LAST_NAME AS BRANCH_MANAGER,
       t.TXN_ID,
       t.TXN_TYPE,
       t.AMOUNT,
       t.TXN_DATE,
       t.DESCRIPTION
FROM   CUSTOMER c,
       ACCOUNT a,
       BRANCH b,
       EMPLOYEE e,
       TRANSACTION t
WHERE  c.CUSTOMER_ID = a.CUSTOMER_ID
  AND  a.BRANCH_ID = b.BRANCH_ID
  AND  b.MANAGER_ID = e.EMPLOYEE_ID
  AND  a.ACCOUNT_ID = t.ACCOUNT_ID
  AND  t.TXN_DATE >= CURRENT DATE - 30 DAYS
  AND  t.AMOUNT > 10000.00
  AND  (t.TXN_TYPE = 'WITHDRAWAL'
   OR   t.TXN_TYPE = 'TRANSFER')
ORDER BY t.AMOUNT DESC;
```
At first glance, the query looked reasonable. Five tables, clear join conditions, sensible filters. So why did it suddenly take 3 hours?
The Investigation
Step 1: Check the Data Volume
Marcus checked the TRANSACTION table:
```sql
SELECT COUNT(*) AS TOTAL_ROWS,
       COUNT(CASE WHEN TXN_DATE >= CURRENT DATE - 30 DAYS THEN 1 END) AS RECENT_ROWS
FROM TRANSACTION;
```
| TOTAL_ROWS | RECENT_ROWS |
|---|---|
| 28,400,000 | 1,240,000 |
The bank had grown significantly. Two years ago, TRANSACTION had 3 million rows. Now it had 28.4 million. But 1.24 million recent rows still was not massive — something else was wrong.
Step 2: Check the EXPLAIN Plan
Marcus ran EXPLAIN PLAN on the query and examined the access plan:
```sql
EXPLAIN PLAN FOR
SELECT c.CUSTOMER_ID, ...
[the full query]
;

SELECT * FROM EXPLAIN_STATEMENT
ORDER BY EXPLAIN_LEVEL;
```
The EXPLAIN output revealed the problem. The optimizer had chosen a nested loop join for the TRANSACTION-to-ACCOUNT join, scanning the ACCOUNT table for every qualifying transaction row. With 1.24 million recent transactions and no index on TRANSACTION(TXN_DATE), the access path was:
- Full table scan of TRANSACTION (28.4M rows) to find recent high-value withdrawals/transfers
- For each qualifying transaction, nested loop lookup into ACCOUNT
- For each account, nested loop lookup into CUSTOMER
- For each account, nested loop lookup into BRANCH
- For each branch, nested loop lookup into EMPLOYEE
The root cause: the index on TRANSACTION(TXN_DATE) had been dropped during a schema reorganization three weeks earlier, and nobody had recreated it. Without the index, DB2 fell back to a full table scan as the starting point. More critically, the table statistics were stale — RUNSTATS had not been run since the index was dropped, so the optimizer was making decisions based on outdated cardinality estimates.
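Staleness like this can be spotted directly in the catalog before it causes an incident. A quick check against the Db2 SYSCAT views (sketched here for the MERIDIAN schema used in this case study):

```sql
-- When were statistics last gathered, and what cardinality does the
-- optimizer currently believe? CARD = -1 means statistics were never
-- collected for the table.
SELECT TABNAME, CARD, STATS_TIME
FROM SYSCAT.TABLES
WHERE TABSCHEMA = 'MERIDIAN'
  AND TABNAME IN ('TRANSACTION', 'ACCOUNT');
```

A STATS_TIME weeks old on a fast-growing table like TRANSACTION is exactly the red flag Marcus found too late.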
Step 3: The Missing Index
Two years ago, the query had worked because:
- An index on `TRANSACTION(TXN_DATE, AMOUNT, TXN_TYPE)` allowed DB2 to jump directly to the relevant date range.
- The resulting row count (around 200 rows matching all three filters) was small enough for nested loop joins to be efficient.
- An index on `TRANSACTION(ACCOUNT_ID)` supported the join to ACCOUNT.
With the composite index gone, DB2 scanned 28.4 million rows just to apply the date filter, then processed over a million rows through nested loop joins.
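The missing index could have been confirmed straight from the system catalog. A sketch against the Db2 SYSCAT views (schema name assumed from the case study):

```sql
-- List every index currently defined on TRANSACTION. COLNAMES shows
-- the key columns in +COL1+COL2 form; UNIQUERULE distinguishes unique
-- (U), primary-key (P), and duplicate-allowing (D) indexes.
SELECT INDNAME, COLNAMES, UNIQUERULE
FROM SYSCAT.INDEXES
WHERE TABSCHEMA = 'MERIDIAN'
  AND TABNAME = 'TRANSACTION';
```

After the schema reorganization, the composite date index would simply be absent from this result set.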
Step 4: The Comma-Join Hidden Risk
Marcus also noted a more subtle issue. The query used the old-style comma-separated FROM clause:
```sql
FROM CUSTOMER c,
     ACCOUNT a,
     BRANCH b,
     EMPLOYEE e,
     TRANSACTION t
WHERE c.CUSTOMER_ID = a.CUSTOMER_ID
  AND a.BRANCH_ID = b.BRANCH_ID
  ...
```
While functionally correct, this syntax made it harder to verify that all join conditions were present. Marcus counted: 5 tables require 4 join conditions, and all 4 were there. But in a larger query, this counting exercise becomes error-prone. Explicit JOIN...ON syntax makes missing conditions immediately visible.
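To see why the counting exercise matters, consider a hypothetical three-table version of the query with one predicate accidentally dropped. The statement still parses and runs:

```sql
-- Only one join condition for three tables: the BRANCH predicate
-- (a.BRANCH_ID = b.BRANCH_ID) is missing, so every qualifying
-- (customer, account) pair is matched with EVERY branch row --
-- a silent Cartesian product, not an error.
SELECT COUNT(*)
FROM CUSTOMER c,
     ACCOUNT a,
     BRANCH b
WHERE c.CUSTOMER_ID = a.CUSTOMER_ID;
```

With explicit syntax, writing `INNER JOIN BRANCH b` without an `ON` clause is a syntax error, so the same mistake cannot even compile.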
The Fix
Marcus applied three changes:
Fix 1: Recreate the Index
```sql
CREATE INDEX IDX_TXN_DATE_AMT_TYPE
    ON TRANSACTION (TXN_DATE, AMOUNT, TXN_TYPE, ACCOUNT_ID, DESCRIPTION);
```
Appending ACCOUNT_ID and DESCRIPTION as trailing key columns puts every column the query reads from TRANSACTION into the index, allowing that part of the query to be satisfied entirely from the index without accessing the base table (an index-only access path). DB2's INCLUDE clause achieves the same effect with a narrower key, but it is only permitted on unique indexes, which this one is not.
Fix 2: Update Statistics
```sql
RUNSTATS ON TABLE MERIDIAN.TRANSACTION
    WITH DISTRIBUTION AND DETAILED INDEXES ALL;

RUNSTATS ON TABLE MERIDIAN.ACCOUNT
    WITH DISTRIBUTION AND DETAILED INDEXES ALL;
```
Fresh statistics gave the optimizer accurate row counts and data distribution information.
Fix 3: Rewrite with Explicit Join Syntax
Marcus rewrote the query to use explicit JOIN...ON syntax:
```sql
SELECT c.CUSTOMER_ID,
       c.FIRST_NAME || ' ' || c.LAST_NAME AS CUSTOMER_NAME,
       c.CITY,
       c.STATE,
       a.ACCOUNT_ID,
       a.ACCOUNT_TYPE,
       b.BRANCH_NAME,
       e.FIRST_NAME || ' ' || e.LAST_NAME AS BRANCH_MANAGER,
       t.TXN_ID,
       t.TXN_TYPE,
       t.AMOUNT,
       t.TXN_DATE,
       t.DESCRIPTION
FROM TRANSACTION t
INNER JOIN ACCOUNT a
        ON t.ACCOUNT_ID = a.ACCOUNT_ID
INNER JOIN CUSTOMER c
        ON a.CUSTOMER_ID = c.CUSTOMER_ID
INNER JOIN BRANCH b
        ON a.BRANCH_ID = b.BRANCH_ID
INNER JOIN EMPLOYEE e
        ON b.MANAGER_ID = e.EMPLOYEE_ID
WHERE t.TXN_DATE >= CURRENT DATE - 30 DAYS
  AND t.AMOUNT > 10000.00
  AND t.TXN_TYPE IN ('WITHDRAWAL', 'TRANSFER')
ORDER BY t.AMOUNT DESC;
```
Key changes beyond syntax:
- Replaced OR with IN for cleaner readability (optimizer treats them identically, but IN is clearer for humans)
- Listed TRANSACTION first to signal that it is the driving table (the optimizer does not care, but future maintainers will understand the intent)
- Each JOIN...ON pair is visually linked, making it trivial to verify that all relationships are covered
The Results
After the fixes:
| Metric | Before Fix | After Fix |
|---|---|---|
| Execution time | 3 hours 12 minutes | 1.8 seconds |
| Rows scanned (TRANSACTION) | 28,400,000 | 847 |
| Join method (TXN→ACCT) | Nested loop (full scan) | Index lookup |
| Temporary space used | 4.2 GB | 0 |
The new EXPLAIN plan showed:
- Index scan on IDX_TXN_DATE_AMT_TYPE to find the ~847 qualifying transactions
- Index lookup on ACCOUNT for each transaction (nested loop, but only 847 iterations)
- Index lookup on CUSTOMER for each account
- Index lookup on BRANCH for each account
- Index lookup on EMPLOYEE for each branch
Total estimated cost dropped from 18 million timeron units to 42.
The Deeper Problem
Marcus wrote a post-incident report identifying systemic issues:
- No index change management. The schema reorganization had dropped and recreated tables but missed recreating some indexes. There was no automated check that all expected indexes existed after the reorganization.
- Stale statistics. RUNSTATS was scheduled weekly on Sundays but had been failing silently for three weeks due to a disk space issue on the SYSTOOLSPACE tablespace.
- No query performance baseline. The compliance report had no SLA defined and no historical execution time tracking. The 3-hour runtime could have been caught earlier if there had been alerting on query duration.
- Legacy code review gap. The comma-separated join syntax was a code smell that should have been flagged in review. While not the direct cause of the performance issue, it made the query harder to maintain and audit.
Preventive Measures Implemented
- Index inventory script: A weekly job comparing expected indexes (from a master list) against actual indexes in the catalog.
```sql
-- Expected indexes (from the INDEX_INVENTORY master list) that are
-- absent from the system catalog. The WHERE clause keeps only the
-- unmatched rows, so every row returned is a missing index.
SELECT expected.INDEX_NAME, expected.TABLE_NAME, 'MISSING' AS STATUS
FROM INDEX_INVENTORY expected
LEFT OUTER JOIN SYSCAT.INDEXES actual
       ON expected.INDEX_NAME = actual.INDNAME
      AND expected.TABLE_NAME = actual.TABNAME
WHERE actual.INDNAME IS NULL;
```
- RUNSTATS monitoring: Alerts when RUNSTATS has not completed successfully for any table with more than 1 million rows in the past 7 days.
- Query performance baselining: Key batch queries now log their execution time. An alert fires if any query exceeds 3x its 90th-percentile historical runtime.
- Code standards update: All new SQL must use explicit JOIN...ON syntax. Existing comma-join queries are refactored during maintenance windows.
Key Takeaways for Students
- The query itself was not wrong. It produced correct results both before and after the performance regression. The problem was environmental: a missing index and stale statistics.
- Join performance depends on the execution plan, not the SQL text. The same SQL produced two radically different plans depending on available indexes and statistics. Understanding EXPLAIN output is essential (covered in Chapter 22).
- RUNSTATS is not optional. The optimizer makes decisions based on statistics. Stale statistics lead to bad plans. Run RUNSTATS after any significant data change or schema modification.
- Explicit JOIN syntax is a maintenance advantage. It does not change performance, but it makes the query's intent clearer and missing join conditions easier to spot.
- Monitoring and alerting prevent 5 AM pages. If the team had tracked query execution times, they would have noticed the degradation days earlier when the batch schedule still had slack time.
Discussion Questions
- Why did DB2 choose nested loop joins even when the driving (outer) table, TRANSACTION, required a full scan of 28.4 million rows? What information was the optimizer missing?
- How do the extra columns (ACCOUNT_ID, DESCRIPTION) carried in the index help performance? What is an "index-only access path"?
- If the compliance report needed to run faster than 2 seconds even without the composite index, what other approaches could Marcus consider (hints, materialized query tables, etc.)?
- The query joins 5 tables but the business only needs data from 4 (EMPLOYEE is only used for the branch manager name). Could a LEFT OUTER JOIN on EMPLOYEE improve robustness? What would happen if a branch had no manager assigned?
- Design a monitoring query that compares each query's current runtime against its historical average, using joins between a runtime_log table and a baseline table.