Case Study 1: Optimizer Behavior Change After DB2 Migration

Background

Meridian National Bank is migrating its core banking system from DB2 11 for z/OS to DB2 12 for z/OS. The migration has been in progress for six months. The infrastructure team has completed the technical migration in the test environment, and the DBA team is now in the critical phase: verifying that application performance has not regressed.

The test environment mirrors production: 850,000 customer records, 2 million accounts, 80 million transactions, and 50 branches. All RUNSTATS jobs have been executed with the same parameters used in production, and the catalog statistics match production within normal variance.

The Discovery

During migration acceptance testing, the performance monitoring team flags an alert. The nightly batch cycle, which processes end-of-day account reconciliation, has increased from 45 minutes to 3 hours and 20 minutes.

Sarah Chen, the senior DBA leading the migration, isolates the problem to a single SQL statement within the reconciliation package — a four-table join that compares transaction totals against account balances:

SELECT a.ACCOUNT_ID, a.BALANCE,
       SUM(t.AMOUNT) AS COMPUTED_BALANCE,
       a.BALANCE - SUM(t.AMOUNT) AS DISCREPANCY
FROM ACCOUNT a
JOIN TRANSACTION t ON a.ACCOUNT_ID = t.ACCOUNT_ID
JOIN CUSTOMER c ON a.CUSTOMER_ID = c.CUSTOMER_ID
JOIN BRANCH b ON a.BRANCH_ID = b.BRANCH_ID
WHERE a.STATUS = 'ACTIVE'
  AND t.TXN_DATE >= a.OPENED_DATE
  AND b.STATE = 'IL'
GROUP BY a.ACCOUNT_ID, a.BALANCE
HAVING a.BALANCE - SUM(t.AMOUNT) <> 0;

This query scans all active accounts for Illinois branches, sums their complete transaction history, and flags any accounts where the current balance does not match the computed sum of transactions.

Investigation

Sarah begins her standard investigation process.

Step 1: Capture and Compare EXPLAIN Output

She runs EXPLAIN on the statement in both DB2 11 and DB2 12 environments. The results reveal dramatically different plans:

DB2 11 Plan:

  1. Matching index scan on IX_BRANCH_STATE (BRANCH table, STATE = 'IL') — 5 rows.
  2. Nested loop join to ACCOUNT on BRANCH_ID using IX_ACCT_BRANCH — ~80,000 rows for Illinois branches.
  3. Filter on STATUS = 'ACTIVE' — reduces to ~76,000 rows.
  4. Merge scan join to TRANSACTION on ACCOUNT_ID — uses IX_TXN_ACCT index, sorted merge.
  5. Nested loop join to CUSTOMER on CUSTOMER_ID (for output columns).
  6. Sort and aggregate for GROUP BY.

DB2 12 Plan:

  1. Table scan on TRANSACTION — all 80 million rows.
  2. Hash join to ACCOUNT on ACCOUNT_ID.
  3. Filter on STATUS = 'ACTIVE' and BRANCH_ID match.
  4. Hash join to BRANCH, filter on STATE = 'IL'.
  5. Hash join to CUSTOMER.
  6. Hash aggregate for GROUP BY.

The DB2 12 plan starts with a full table scan of the 80-million-row TRANSACTION table. This is the source of the regression.
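A capture like this is ordinary EXPLAIN work under each version. A minimal sketch against the standard PLAN_TABLE, where QUERYNO 901 is an arbitrary tag chosen for illustration and the column list is abbreviated:

-- Populate PLAN_TABLE for the statement (run once per environment):
EXPLAIN PLAN SET QUERYNO = 901 FOR
SELECT a.ACCOUNT_ID, a.BALANCE, ...   -- the reconciliation query above

-- Read back the access path, one row per plan step:
SELECT QBLOCKNO, PLANNO, METHOD, TNAME,
       ACCESSTYPE, ACCESSNAME, MATCHCOLS, PREFETCH
FROM PLAN_TABLE
WHERE QUERYNO = 901
ORDER BY QBLOCKNO, PLANNO;

Comparing the rows from the two environments side by side on METHOD and ACCESSTYPE is usually the fastest way to spot a changed join method or access path.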

Step 2: Understand Why the Optimizer Changed Its Mind

Sarah examines the cost estimates in both plans. In DB2 12, the optimizer estimated that the merge scan join approach from DB2 11 would cost more than the hash join approach because:

  1. New cost formulas for merge scan join: DB2 12 updated the cost model for merge scan join to more accurately account for the sort cost when the inner input (TRANSACTION) is very large. In DB2 11, the sort cost was slightly underestimated, making merge scan look cheaper than it actually was.

  2. Hash join improvements: DB2 12 introduced improvements to hash join that reduce its estimated cost. The optimizer now correctly identifies that hash join can exploit available memory more effectively.

  3. Join order change: The new cost model caused the optimizer to reconsider the join order. In DB2 11, starting with BRANCH (5 rows) and working outward was clearly cheapest. In DB2 12, the optimizer estimated that starting with a hash-based approach on the large tables would be more efficient overall.

The problem is that while the DB2 12 optimizer's cost model is theoretically more accurate, the specific data characteristics of this query make the BRANCH-first approach far superior. Starting with 5 Illinois branches and using indexes to navigate to their accounts (80,000 out of 2 million) and then to those accounts' transactions is vastly more efficient than scanning all 80 million transactions.
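The scale difference survives even a back-of-envelope check. The figure of 40 transactions per account below is an assumption (80 million transactions spread evenly over 2 million accounts); the other numbers come from the plans above:

-- Rows touched by the index-driven plan versus a full scan,
-- computable with plain SQL arithmetic:
SELECT (5 + 80000 + 76000 * 40)              AS INDEX_PLAN_ROWS,  -- ~3.1 million
       80000000                              AS SCAN_PLAN_ROWS,
       80000000 / (5 + 80000 + 76000 * 40)   AS RATIO             -- ~25x
FROM SYSIBM.SYSDUMMY1;

Even allowing generous per-probe overhead for the index plan, a roughly 25-to-1 difference in rows read is consistent with a batch window growing from 45 minutes to over three hours.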

Step 3: Examine the Predicate Interaction

Sarah notices something subtle. The predicate t.TXN_DATE >= a.OPENED_DATE is a join-dependent predicate — it references columns from two different tables. This predicate cannot be evaluated until the join between ACCOUNT and TRANSACTION has occurred. In the DB2 11 plan, this predicate was applied during the merge scan join (a Stage 1 predicate in the merge context). In the DB2 12 plan, the full TRANSACTION table scan happens first, and this predicate can only be applied after the hash join.

This means the DB2 12 plan reads all 80 million transaction rows, loads them into a hash table, joins them to ACCOUNT, and only then filters out transactions that occurred before the account was opened. The DB2 11 plan, by reaching TRANSACTION through an ordered index scan per account, naturally applied the date filtering during the merge.

Step 4: Verify Statistics

Sarah confirms that statistics are current and complete. The issue is not stale statistics — it is a genuine optimizer modeling difference between versions.
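A currency check of this kind can be made against the catalog directly. A sketch, assuming MERIDIAN is also the table creator (it appears as the database name in the RUNSTATS example below) and abbreviating the column list:

-- Cardinalities and last-RUNSTATS timestamps for the joined tables:
SELECT NAME, CARDF, NPAGESF, STATSTIME
FROM SYSIBM.SYSTABLES
WHERE CREATOR = 'MERIDIAN'
  AND NAME IN ('ACCOUNT', 'TRANSACTION', 'CUSTOMER', 'BRANCH');

Matching CARDF values and recent STATSTIME entries in both environments rule out stale or missing statistics as the cause.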

Resolution

Sarah considers three options:

Option A: Optimization hint (OPTHINT). She could force the DB2 11 plan using a hint. This would work immediately but would prevent the optimizer from adapting to future data changes.
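For reference, a hint of this kind works by tagging the desired access path rows in PLAN_TABLE and naming that tag at rebind. A sketch, assuming hints are enabled in the subsystem (the OPTHINTS parameter) and using illustrative QUERYNO, hint, and package names:

-- Tag the saved DB2 11 access path rows as a named hint:
UPDATE PLAN_TABLE
   SET OPTHINT = 'RECON_V11'
 WHERE QUERYNO = 901;

-- Then name the hint when rebinding the package:
-- REBIND PACKAGE(RECON_COLL.RECONPKG) OPTHINT('RECON_V11')

The bind output should be checked to confirm the hint was actually applied; a hint that cannot be honored is not guaranteed to take effect.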

Option B: Statistical adjustments. She could try collecting more detailed statistics (frequency values for BRANCH.STATE, multi-column statistics for ACCOUNT.BRANCH_ID and ACCOUNT.STATUS) to give the optimizer better information. This is worth trying but may not change the join order decision.

Option C: Query rewrite. She could rewrite the query to guide the optimizer toward the better plan without using hints.

Sarah chooses a combination of B and C. First, she collects column group statistics:

RUNSTATS TABLESPACE MERIDIAN.ACCOUNT_TS
  TABLE(ACCOUNT)
    COLGROUP(BRANCH_ID, STATUS) FREQVAL COUNT 100
    COLGROUP(STATUS) FREQVAL COUNT 10
  UPDATE ALL

This gives the optimizer the actual joint distribution of (BRANCH_ID, STATUS), telling it how many active accounts each branch really has instead of multiplying the two columns' filter factors as if they were independent.
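Whether the frequency statistics actually landed can be verified in the catalog. A sketch, assuming MERIDIAN is the table owner and abbreviating the column list:

-- Frequency rows for column groups on ACCOUNT; NUMCOLUMNS > 1
-- identifies multi-column groups, TYPE = 'F' frequency statistics:
SELECT NAME, NUMCOLUMNS, TYPE, COLVALUE, FREQUENCYF
FROM SYSIBM.SYSCOLDIST
WHERE TBOWNER = 'MERIDIAN'
  AND TBNAME = 'ACCOUNT'
ORDER BY NUMCOLUMNS DESC, FREQUENCYF DESC;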

Second, she rewrites the query to make the filtering more explicit:

SELECT a.ACCOUNT_ID, a.BALANCE,
       SUM(t.AMOUNT) AS COMPUTED_BALANCE,
       a.BALANCE - SUM(t.AMOUNT) AS DISCREPANCY
FROM ACCOUNT a
JOIN BRANCH b ON a.BRANCH_ID = b.BRANCH_ID
JOIN TRANSACTION t ON a.ACCOUNT_ID = t.ACCOUNT_ID
    AND t.TXN_DATE >= a.OPENED_DATE
JOIN CUSTOMER c ON a.CUSTOMER_ID = c.CUSTOMER_ID
WHERE a.STATUS = 'ACTIVE'
  AND b.STATE = 'IL'
GROUP BY a.ACCOUNT_ID, a.BALANCE
HAVING a.BALANCE - SUM(t.AMOUNT) <> 0;

The rewrite moves the TXN_DATE >= OPENED_DATE predicate into the join ON clause and reorders the FROM clause to suggest the desired join sequence. The optimizer is not required to follow FROM-clause order, but it can act as a tiebreaker between plans with similar estimated costs.

After REBIND with the new statistics and rewritten query, the optimizer produces a plan that starts with BRANCH, joins to ACCOUNT via index, and then joins to TRANSACTION via index — similar to the DB2 11 plan but using DB2 12's improved join algorithms. The batch completes in 42 minutes.

Lessons Learned

  1. Version migrations change optimizer behavior. New cost formulas and new algorithms cause plan changes, even with identical statistics. Always test critical SQL during migration.

  2. Test with production-scale data. Small test datasets would not have revealed this regression because the plan difference only matters at scale.

  3. The optimizer's cost model is an approximation. DB2 12's model may be better on average, but for specific queries with unusual predicate patterns (join-dependent predicates, extreme selectivity differences between tables), the older model may have chosen better by accident.

  4. Column group statistics are powerful. They help the optimizer understand the actual joint distribution of correlated columns, which is essential for accurate join order decisions.

  5. PLANMGMT is your safety net. Sarah had used REBIND PLANMGMT(EXTENDED) for all packages before testing, which would have allowed an immediate SWITCH(PREVIOUS) if needed in production. Always use PLANMGMT during migration.

  6. The goal is not to prevent all plan changes. Many of the 200 critical statements actually ran faster under DB2 12. The goal is to identify and fix the regressions while keeping the improvements.
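The safety net in lesson 5 comes down to two rebind commands. A sketch, with illustrative collection and package names:

-- Before migration testing: retain current and previous access paths.
REBIND PACKAGE(RECON_COLL.RECONPKG) PLANMGMT(EXTENDED)

-- If a regression surfaces in production: revert to the prior access
-- path without any application change.
REBIND PACKAGE(RECON_COLL.RECONPKG) SWITCH(PREVIOUS)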

Discussion Questions

  1. Why did moving TXN_DATE >= a.OPENED_DATE into the ON clause help the optimizer choose a better join order?
  2. If the column group statistics alone had not fixed the problem, what would Sarah's next step be?
  3. How would you build an automated regression test for optimizer plan changes during a migration?
  4. In what situations might the DB2 12 hash-join-first plan actually be better than the DB2 11 index-first plan?