Case Study 2: Cost Estimation Deep Dive — When the Optimizer Gets It Wrong

Background

Meridian National Bank's data analytics team has deployed a new customer segmentation query that runs nightly as part of the marketing data pipeline. The query identifies high-value customers who have had significant recent transaction activity and segments them by account type and geographic region.

The query ran successfully in development and QA testing with a subset of data (10% sample). When deployed to the production environment with the full dataset, the query took 4 hours — well beyond the 30-minute batch window allocated for it.

The Problem Query

SELECT c.STATE, a.ACCOUNT_TYPE,
       COUNT(DISTINCT c.CUSTOMER_ID) AS CUSTOMER_COUNT,
       AVG(a.BALANCE) AS AVG_BALANCE,
       SUM(recent.TXN_TOTAL) AS SEGMENT_VOLUME
FROM CUSTOMER c
JOIN ACCOUNT a ON c.CUSTOMER_ID = a.CUSTOMER_ID
JOIN (
    SELECT ACCOUNT_ID, SUM(AMOUNT) AS TXN_TOTAL, COUNT(*) AS TXN_COUNT
    FROM TRANSACTION
    WHERE TXN_DATE >= CURRENT DATE - 90 DAYS
    GROUP BY ACCOUNT_ID
    HAVING COUNT(*) >= 10
) recent ON a.ACCOUNT_ID = recent.ACCOUNT_ID
WHERE c.STATE IN ('IL', 'WI', 'IN', 'MI', 'OH')
  AND a.STATUS = 'ACTIVE'
  AND a.BALANCE > 10000
GROUP BY c.STATE, a.ACCOUNT_TYPE
ORDER BY SEGMENT_VOLUME DESC;

The query involves three base tables and an inline view with aggregation. The HAVING clause in the subquery filters for accounts with at least 10 transactions in the past 90 days — these are the "active" accounts the marketing team cares about.
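The window-plus-HAVING logic of the inline view can be sketched in Python. This is a toy simulation with a few hypothetical rows, and the transaction threshold is lowered from 10 to 2 so the sample stays small:

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical sample data: (account_id, amount, txn_date) rows.
today = date(2024, 6, 1)
txns = [
    (1, 100.0, today - timedelta(days=5)),
    (1, 250.0, today - timedelta(days=40)),
    (2, 75.0,  today - timedelta(days=10)),
    (2, 75.0,  today - timedelta(days=200)),  # outside the 90-day window
]

# Equivalent of: WHERE TXN_DATE >= CURRENT DATE - 90 DAYS
#                GROUP BY ACCOUNT_ID HAVING COUNT(*) >= N
cutoff = today - timedelta(days=90)
groups = defaultdict(lambda: [0.0, 0])   # account_id -> [sum, count]
for acct, amount, txn_date in txns:
    if txn_date >= cutoff:               # date predicate applied first
        groups[acct][0] += amount
        groups[acct][1] += 1

# HAVING is applied after grouping: only groups with >= N rows survive.
MIN_TXNS = 2                             # the real query uses 10
recent = {a: (s, c) for a, (s, c) in groups.items() if c >= MIN_TXNS}
```

Account 2 has only one transaction inside the window, so its group is eliminated by the HAVING step, exactly as an inactive account would be.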

The Investigation

Marcus Williams, a mid-level DBA, is assigned to investigate. He follows the systematic approach he learned in Chapter 22.

Step 1: Capture the EXPLAIN Output

Marcus runs EXPLAIN and examines the access plan. The optimizer has chosen:

  1. Table scan on TRANSACTION (80 million rows), filter on TXN_DATE, group by ACCOUNT_ID, apply HAVING — estimated to produce 200,000 rows.
  2. Hash join between the TRANSACTION subquery result and ACCOUNT — estimated 50,000 rows after filtering on STATUS = 'ACTIVE' and BALANCE > 10000.
  3. Hash join to CUSTOMER — estimated 40,000 rows after filtering on STATE IN (...).
  4. Hash aggregate for the outer GROUP BY.

The estimated total cost is moderate — the optimizer believes this plan will run quickly. The actual execution reveals the problem.

Step 2: Compare Estimates to Actuals

Marcus uses the DB2 monitoring tools to capture actual row counts at each stage. The comparison is revealing:

Plan Step                              Optimizer Estimate   Actual Rows   Error Factor
TRANSACTION scan (after date filter)           16,000,000    19,500,000           1.2x
After GROUP BY/HAVING                             200,000     1,450,000           7.3x
After join to ACCOUNT + filters                    50,000       890,000          17.8x
After join to CUSTOMER + filter                    40,000       420,000          10.5x

The initial date filter estimate is reasonably accurate (1.2x off). But the GROUP BY/HAVING estimate is off by a factor of 7.3, and the errors cascade through subsequent joins.
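The comparison Marcus performs reduces to dividing actuals by estimates at each plan step. A minimal sketch using the case-study numbers:

```python
# Estimate-vs-actual comparison from the plan: name -> (estimate, actual).
steps = {
    "TRANSACTION scan (after date filter)": (16_000_000, 19_500_000),
    "After GROUP BY/HAVING":                (200_000,    1_450_000),
    "After join to ACCOUNT + filters":      (50_000,       890_000),
    "After join to CUSTOMER + filter":      (40_000,       420_000),
}

# Error factor = actual / estimate (how far off, as a multiplier).
errors = {name: actual / est for name, (est, actual) in steps.items()}
```

The pattern to look for is not any single number but the trend: a modest 1.2x error at the scan becomes 7.3x after the HAVING step and stays an order of magnitude wrong from then on.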

Step 3: Trace the Root Cause

Marcus investigates each estimation error.

Error 1: The HAVING clause estimate. The optimizer estimated that the HAVING COUNT(*) >= 10 clause would eliminate 98.75% of accounts (reducing 16 million transaction-bearing accounts to 200,000). In reality, only about 92.6% were eliminated (1.45 million accounts had 10+ transactions in 90 days).

The optimizer has no direct statistics for predicting HAVING clause selectivity. It uses a default filter factor for the HAVING predicate, which is often a fixed percentage. On this system, the optimizer assumed approximately 1.25% of groups would survive the HAVING clause. The actual survival rate was 7.4% — almost 6 times higher.

Why was the HAVING estimate so wrong? The bank's customer base has become more digitally active over the past year. Mobile banking adoption increased from 40% to 78%, and average transaction frequency per account rose from 6 per quarter to 18 per quarter. The optimizer's default HAVING filter factor was calibrated for a less active customer base.

Error 2: The join to ACCOUNT. The optimizer estimated that after joining the 200,000 subquery rows to ACCOUNT and filtering on STATUS = 'ACTIVE' and BALANCE > 10000, only 50,000 rows would remain. The actual row count was 890,000.

This error has two components:

  - The input cardinality was already wrong (1,450,000 rows instead of 200,000).
  - The filter factor for BALANCE > 10000 was overestimated.

Marcus checks the BALANCE column statistics:

SELECT COLCARD, HIGH2KEY, LOW2KEY
FROM SYSIBM.SYSCOLUMNS
WHERE TBNAME = 'ACCOUNT' AND NAME = 'BALANCE';

Result: LOW2KEY = '100.00', HIGH2KEY = '500000.00'. Under uniform distribution:

FF(BALANCE > 10000) = (500000 - 10000) / (500000 - 100) = 490000 / 499900 ≈ 0.98

The optimizer estimates that 98% of accounts have a balance above 10,000. But this is the uniform distribution estimate — it assumes balances are evenly spread between 100 and 500,000. In reality, account balances follow a heavily right-skewed distribution. Examining the actual data:

Balance Range        Accounts    Percentage
0 - 1,000             400,000           20%
1,000 - 5,000         500,000           25%
5,000 - 10,000        300,000           15%
10,000 - 50,000       500,000           25%
50,000 - 100,000      200,000           10%
100,000 - 500,000     100,000            5%

The actual percentage of accounts with BALANCE > 10000 is 40%, not 98%. The optimizer overestimates by a factor of 2.45 for this predicate alone.
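Both filter factors can be computed directly from the statistics above, which makes the gap between the uniform assumption and the real distribution concrete:

```python
# Uniform-distribution filter factor vs. the histogram-based actual,
# using the case study's BALANCE statistics and bucket table.
low2key, high2key = 100.0, 500_000.0
threshold = 10_000.0

# Optimizer's uniform assumption: fraction of the value range above the threshold.
ff_uniform = (high2key - threshold) / (high2key - low2key)   # ~0.98

# Actual distribution: (bucket upper bound, account count).
buckets = [
    (1_000,   400_000),
    (5_000,   500_000),
    (10_000,  300_000),
    (50_000,  500_000),
    (100_000, 200_000),
    (500_000, 100_000),
]
total = sum(n for _, n in buckets)
above = sum(n for hi, n in buckets if hi > threshold)   # buckets entirely above 10,000
ff_actual = above / total                                # 0.40

overestimate = ff_uniform / ff_actual                    # ~2.45
```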

But more importantly, there is a correlation between transaction activity and account balance. Accounts with 10+ transactions per quarter tend to be active checking accounts, which tend to have higher balances. So the filter BALANCE > 10000 is less selective when applied to the population of accounts that already passed the HAVING COUNT(*) >= 10 filter than when applied to the general account population. The optimizer has no way to know this without column group statistics that span the subquery boundary.

Error 3: The join to CUSTOMER. The STATE IN (...) predicate was expected to filter out 90% of customers (5 of 50 states under the uniform assumption). In reality it filtered out only 79%: 21% of customers are in these 5 Midwestern states. Marcus checks frequency values:

SELECT COLVALUE, FREQUENCYF
FROM SYSIBM.SYSCOLDIST
WHERE TBNAME = 'CUSTOMER' AND NAME = 'STATE' AND TYPE = 'F';

No frequency values exist for STATE. The optimizer used 1/50 = 0.02 per state, or 0.10 for 5 states. But Illinois alone has 12% of the bank's customers (the bank is headquartered in Chicago), and the 5 Midwestern states together account for 21%. Without frequency values, the optimizer underestimates by more than 2x.
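The same gap in numbers: the uniform per-state guess versus a frequency-based estimate. The individual state shares below (other than the 12% for Illinois stated above) are hypothetical, chosen only to sum to the case study's 21% total:

```python
# Uniform per-state assumption vs. frequency-value-based selectivity.
ff_uniform = 5 * (1 / 50)          # 0.10: five states, 1/50 each

# With FREQVAL statistics the optimizer can use observed frequencies.
# IL's 12% is from the case study; the other shares are hypothetical.
freq = {"IL": 0.12, "WI": 0.03, "IN": 0.02, "MI": 0.02, "OH": 0.02}
ff_actual = sum(freq.values())     # 0.21

underestimate = ff_actual / ff_uniform   # ~2.1x too low
```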

Step 4: The Cascading Effect

The individual errors compound:

Step 1: HAVING estimate off by 7.3x
Step 2: Join estimate off by 17.8x (7.3x from input × ~2.4x from BALANCE filter)
Step 3: Final estimate off by 10.5x

Result: Hash join allocations were sized for 40,000 rows.
        Actual volume was 420,000 rows.
        Hash tables overflowed to disk repeatedly.
        The hash aggregate for GROUP BY also spilled.

The 4-hour runtime is almost entirely caused by hash overflow — the hash tables were sized for the optimizer's estimates, not the actual row volumes. Each overflow to temporary tablespace adds enormous I/O overhead.
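The compounding can be decomposed step by step. A sketch with the case-study numbers, separating the inherited input error from the new filter error at the ACCOUNT join:

```python
# How per-step errors compound multiplicatively through the plan.
having_err = 1_450_000 / 200_000               # step 1: ~7.3x
join_err   = 890_000 / 50_000                  # step 2: 17.8x overall
filter_err = join_err / having_err             # new error from the BALANCE filter: ~2.4x

# Hash structures sized from the final estimate overflow by the same ratio.
hash_budget_rows = 40_000                      # sized for the optimizer's estimate
actual_rows = 420_000
overflow_factor = actual_rows / hash_budget_rows   # 10.5x more data than planned
```

Each join inherits the error of its inputs and multiplies in any new misestimate of its own, which is why a single bad HAVING estimate can poison every downstream memory allocation.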

The Fix

Marcus implements a three-part fix:

Part 1: Collect Missing Statistics

-- Frequency values for STATE
RUNSTATS TABLESPACE MERIDIAN.CUSTOMER_TS
  TABLE(CUSTOMER)
    COLUMN(STATE) FREQVAL NUMCOLS 1 COUNT 50
  UPDATE ALL;

-- Histogram for BALANCE
RUNSTATS TABLESPACE MERIDIAN.ACCOUNT_TS
  TABLE(ACCOUNT)
    COLUMN(BALANCE) HISTOGRAM NUMCOLS 1 NUMQUANTILES 50
    COLUMN(STATUS) FREQVAL NUMCOLS 1 COUNT 10
  UPDATE ALL;

-- Column group statistics for correlated columns
RUNSTATS TABLESPACE MERIDIAN.ACCOUNT_TS
  TABLE(ACCOUNT)
    COLUMN(STATUS, BALANCE) FREQVAL NUMCOLS 2 COUNT 20
  UPDATE ALL;

Part 2: Add an Index for the Subquery

The TRANSACTION table scan processes 80 million rows to find the 19.5 million in the last 90 days. An index on TXN_DATE could reduce this to an index-based range scan:

CREATE INDEX IX_TXN_DATE ON TRANSACTION(TXN_DATE);
RUNSTATS TABLESPACE MERIDIAN.TXN_TS
  INDEX(IX_TXN_DATE) UPDATE ALL;

Part 3: Rewrite the Query

Marcus rewrites the subquery to give the optimizer more information:

SELECT c.STATE, a.ACCOUNT_TYPE,
       COUNT(DISTINCT c.CUSTOMER_ID) AS CUSTOMER_COUNT,
       AVG(a.BALANCE) AS AVG_BALANCE,
       SUM(recent_total) AS SEGMENT_VOLUME
FROM CUSTOMER c
JOIN ACCOUNT a ON c.CUSTOMER_ID = a.CUSTOMER_ID
JOIN LATERAL (
    SELECT SUM(t.AMOUNT) AS recent_total
    FROM TRANSACTION t
    WHERE t.ACCOUNT_ID = a.ACCOUNT_ID
      AND t.TXN_DATE >= CURRENT DATE - 90 DAYS
    HAVING COUNT(*) >= 10
) recent ON 1=1
WHERE c.STATE IN ('IL', 'WI', 'IN', 'MI', 'OH')
  AND a.STATUS = 'ACTIVE'
  AND a.BALANCE > 10000
GROUP BY c.STATE, a.ACCOUNT_TYPE
ORDER BY SEGMENT_VOLUME DESC;

The LATERAL join allows the optimizer to filter CUSTOMER and ACCOUNT first (using indexes and the now-accurate statistics) and then probe TRANSACTION only for qualifying accounts. This inverts the execution order — instead of scanning all transactions and then filtering, the query finds the qualifying accounts first and then looks up their transactions.
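A rough way to see why the inversion pays off is to count the rows each shape touches. This model uses the case-study sizes and the stated average of 18 transactions per quarter per account; it ignores index traversal overhead, so it is an approximation, not a cost estimate:

```python
# Rows touched by the two query shapes (rough model, case-study sizes).
txn_rows            = 80_000_000   # full TRANSACTION table
qualifying_accounts = 420_000      # accounts surviving the CUSTOMER/ACCOUNT filters
avg_txns_per_acct   = 18           # recent quarterly average from the case study

# Original shape: scan and aggregate all transactions, then join.
rows_original = txn_rows

# LATERAL shape: filter accounts first, then index-probe TRANSACTION
# once per qualifying account.
rows_lateral = qualifying_accounts * avg_txns_per_acct   # ~7.6 million

reduction = rows_original / rows_lateral                 # ~10x fewer rows touched
```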

Results

After collecting statistics, adding the index, and rewriting the query:

Metric              Before              After
Elapsed time        4 hours 3 minutes   8 minutes
CPU time            47 minutes          3 minutes
I/O (page reads)    12,400,000          340,000
Temp space used     28 GB               0.5 GB

The query now fits within the 30-minute batch window with ample margin.

Key Takeaways

  1. Default filter factors for HAVING clauses are unreliable. The optimizer has limited ability to estimate how many groups will survive a HAVING predicate, especially for complex expressions like COUNT(*) >= 10. When you use HAVING, be aware that cardinality estimates downstream may be significantly wrong.

  2. The uniform distribution assumption fails for financial data. Account balances, transaction amounts, and similar financial measures are almost always heavily skewed. Always collect histogram statistics on financial amount columns.

  3. Correlated columns break the independence assumption. Active accounts (high transaction count) tend to have higher balances. The optimizer cannot detect this correlation without explicit multi-column statistics.

  4. Cardinality errors cascade multiplicatively. A 7x error at step 1 became a 17.8x error at step 2 and a 10.5x error at step 3. Each join step amplifies the error.

  5. Hash join performance degrades dramatically when estimates are wrong. Hash joins sized for 50,000 rows but receiving 890,000 rows overflow to disk, converting an efficient in-memory algorithm into a slow disk-based one.

  6. Query structure affects optimizer freedom. The original subquery forced the optimizer to process all transactions before joining. The LATERAL rewrite allowed the optimizer to filter first and probe second.

  7. The three-part fix — statistics, indexes, and query rewrite — is a standard pattern. Rarely does a single change solve a severe performance problem. The combination of accurate statistics (so the optimizer can estimate correctly), appropriate indexes (so efficient access paths are available), and clear query structure (so the optimizer has freedom to choose the best plan) is the formula for consistent query performance.

Discussion Questions

  1. If Marcus had only collected the missing statistics (Part 1) without rewriting the query, would the performance have improved? By how much?
  2. The LATERAL join rewrite changes the query's execution model fundamentally. Under what data conditions would the original subquery approach actually be faster than the LATERAL approach?
  3. How would you design an automated monitoring system to detect cardinality estimation errors before they cause production problems?
  4. The marketing team plans to run this query with different state lists and balance thresholds. How would you advise them to parameterize the query, and what REOPT strategy would you recommend?