Case Study 2: Access Path Regression After REBIND

Case Study 2: Access Path Regression After REBIND

The Situation

Meridian National Bank had a critical online query that powered the customer account summary screen in their internet banking application. The query retrieved a customer's accounts with their most recent transaction date and balance. It was executed approximately 200,000 times per day during peak hours and had a strict SLA of under 50 milliseconds per execution.

For over a year, the query had performed flawlessly, averaging 8 milliseconds. Then, on a Monday morning following a weekend maintenance window, the application team reported that the account summary screen was loading in 3-4 seconds instead of the usual instant response. The call center was flooded with complaints.

The weekend maintenance had included a REBIND of all application packages as part of a routine DB2 maintenance procedure.

The Query

SELECT A.ACCT_ID,
       A.ACCT_TYPE,
       A.BALANCE,
       A.ACCT_STATUS,
       (SELECT MAX(T.TRANS_DATE)
        FROM MERIDIAN.TRANSACTIONS T
        WHERE T.ACCT_ID = A.ACCT_ID) AS LAST_TRANS_DATE
FROM MERIDIAN.ACCOUNTS A
WHERE A.CUST_ID = :hv-cust-id
ORDER BY A.ACCT_TYPE;

This query was bound in a static SQL package with EXPLAIN(YES), so PLAN_TABLE data was available from both the old bind and the new bind.

Step 1: Compare the Plans

DBA Marcus pulled both the old and new PLAN_TABLE entries using the bind timestamps.

The Old Plan (Working — Before REBIND)

PLANNO	QBLOCKNO	METHOD	TNAME	ACCESSTYPE	MATCHCOLS	ACCESSNAME	INDEXONLY
1	1	0	ACCOUNTS	I	1	IX_ACCT_CUST	N
2	2	0	TRANSACTIONS	I	1	IX_TRANS_ACCT	Y

Interpretation of the old plan: - Step 1: Access ACCOUNTS via IX_ACCT_CUST with MATCHCOLS = 1 on CUST_ID. For a typical customer with 3-5 accounts, this returns a handful of rows. - Step 2: The correlated subquery accesses TRANSACTIONS via IX_TRANS_ACCT(ACCT_ID, TRANS_DATE) with MATCHCOLS = 1. Because the subquery needs only MAX(TRANS_DATE), and TRANS_DATE is the second column of the index, DB2 can read the last index entry for each ACCT_ID to get the maximum date. INDEXONLY = 'Y' confirms no data page access.

This plan was extremely efficient: a few index probes into ACCOUNTS data pages, plus a few index-only lookups into TRANSACTIONS.

The New Plan (Broken — After REBIND)

PLANNO	QBLOCKNO	METHOD	TNAME	ACCESSTYPE	MATCHCOLS	ACCESSNAME	INDEXONLY	SORTN_ORDERBY
1	1	0	ACCOUNTS	I	1	IX_ACCT_TYPE	N	N
2	1	1	TRANSACTIONS	I	1	IX_TRANS_ACCT	N	N

Interpretation of the new plan: - Step 1: Access ACCOUNTS via IX_ACCT_TYPE (an index on ACCT_TYPE). MATCHCOLS = 1 — but wait, the predicate is on CUST_ID, not ACCT_TYPE. Why is the optimizer using this index? - Step 2: The subquery has been transformed into a join (METHOD = 1, not a separate QBLOCKNO). TRANSACTIONS is accessed via IX_TRANS_ACCT with MATCHCOLS = 1, but INDEXONLY = 'N' — it is now reading data pages.

The new plan was wrong in two ways. First, it was using IX_ACCT_TYPE instead of IX_ACCT_CUST to access ACCOUNTS. Second, the optimizer had transformed the scalar subquery into a join, which changed the semantics of the MAX() operation and forced INDEXONLY = 'N' because additional columns were now needed.

Step 2: Investigate the Root Cause

Marcus investigated why the optimizer changed its decision. He checked several factors:

Statistics Check

He queried the catalog to compare the statistics that were available at the old bind time versus the new bind time:

SELECT CARD, NPAGES, FPAGES, OVERFLOW, STATS_TIME
FROM SYSIBM.SYSTABLES
WHERE CREATOR = 'MERIDIAN' AND NAME = 'ACCOUNTS';

The result showed that RUNSTATS had been executed on ACCOUNTS during the maintenance window, approximately 2 hours before the REBIND. The previous RUNSTATS was 6 months old.

The old statistics showed: - CARD (cardinality) = 1,800,000 - NPAGES = 45,000

The new statistics showed: - CARD = 2,200,000 - NPAGES = 62,000

The table had grown by 22%, and the new RUNSTATS captured distribution statistics that had not been collected before — including frequent value statistics on ACCT_TYPE.

Index Statistics Check

SELECT NAME, FIRSTKEYCARD, FULLKEYCARD, CLUSTERRATIO, NLEAF, NLEVELS
FROM SYSIBM.SYSINDEXES
WHERE TBCREATOR = 'MERIDIAN' AND TBNAME = 'ACCOUNTS';

IX_ACCT_CUST now showed a CLUSTERRATIO of 12% (down from the 85% recorded 6 months ago), because massive batch inserts had added accounts in CUST_ID order that did not match the physical page order.

IX_ACCT_TYPE showed a CLUSTERRATIO of 78% (high), because accounts tended to be physically ordered roughly by type due to the way they were originally loaded.

The Optimizer's Reasoning

With the updated statistics, the optimizer saw: - IX_ACCT_CUST: MATCHCOLS would be 1 for CUST_ID, but CLUSTERRATIO = 12% meant that fetching data pages after the index scan would involve mostly random I/O. - IX_ACCT_TYPE: Could be used with the ORDER BY ACCT_TYPE to avoid a sort, and CLUSTERRATIO = 78% meant sequential data page access.

The optimizer calculated that avoiding the sort and using the better-clustered index would be cheaper overall. For the average customer, this seemed reasonable — but the optimizer was not accounting for the fact that IX_ACCT_TYPE with no CUST_ID predicate would require scanning a much larger portion of the index, applying CUST_ID as a non-matching predicate.

The cost difference was slim in the optimizer's estimate, but the real-world impact was enormous because the ACCOUNTS table had 2.2 million rows, and the IX_ACCT_TYPE scan without a matching CUST_ID predicate was reading thousands of index pages to find 3-5 qualifying rows.

Step 3: Implement the Fix

Marcus had several options:

Option A: REBIND with PLANMGMT(EXTENDED) and Switch Back

On DB2 for z/OS 10 and later, plan management allows you to keep old access plans:

REBIND PACKAGE(MERIDIAN.ACCTPKG) -
       PLANMGMT(EXTENDED) -
       SWITCH(PREVIOUS)

This would revert to the previous plan immediately. Marcus used this as an emergency fix to restore service within minutes.

Option B: Fix the Root Cause

The long-term fix required addressing the underlying issue — the poor cluster ratio on IX_ACCT_CUST.

-- Reorganize the tablespace to restore clustering on IX_ACCT_CUST
-- (z/OS REORG utility, clustering on IX_ACCT_CUST)

-- After REORG, run RUNSTATS to capture the improved cluster ratio
-- RUNSTATS TABLESPACE MERIDIAN.ACCTTS INDEX(ALL) TABLE(ALL)

-- REBIND to pick up the corrected statistics
-- REBIND PACKAGE(MERIDIAN.ACCTPKG) EXPLAIN(YES)

Option C: Create a Better Index

Alternatively, create a composite index that serves both the CUST_ID predicate and the ORDER BY:

CREATE INDEX MERIDIAN.IX_ACCT_CUST_TYPE
  ON MERIDIAN.ACCOUNTS(CUST_ID, ACCT_TYPE);

With this index, MATCHCOLS = 1 on CUST_ID would be guaranteed, and the data would come out in ACCT_TYPE order (within each CUST_ID), potentially eliminating the sort.

Step 4: Resolution and Prevention

Marcus implemented Option A immediately (Monday morning, 9:15 AM) to restore service. Response times dropped back to 8 milliseconds.

Over the following weekend, he implemented both Options B and C: 1. REORG of the ACCOUNTS tablespace clustered on IX_ACCT_CUST 2. Creation of IX_ACCT_CUST_TYPE as a more targeted index 3. RUNSTATS on all ACCOUNTS indexes 4. REBIND with EXPLAIN(YES)

The new plan after all fixes:

PLANNO	QBLOCKNO	METHOD	TNAME	ACCESSTYPE	MATCHCOLS	ACCESSNAME	INDEXONLY
1	1	0	ACCOUNTS	I	1	IX_ACCT_CUST_TYPE	N
2	2	0	TRANSACTIONS	I	1	IX_TRANS_ACCT	Y

MATCHCOLS = 1 on CUST_ID, data returned in ACCT_TYPE order (no sort needed), and the scalar subquery preserved as a separate query block with index-only access on TRANSACTIONS. Average response time: 6 milliseconds.

Prevention Measures Implemented

Marcus put three safeguards in place:

1. Access Plan Monitoring

He created a process that compares PLAN_TABLE entries before and after every REBIND. If any critical query changes its access path (different ACCESSTYPE, different ACCESSNAME, or significantly different estimated cost), an alert is sent to the DBA team before the package is activated in production.

-- Simplified comparison query
SELECT OLD.QUERYNO,
       OLD.PLANNO,
       CASE WHEN OLD.ACCESSTYPE <> NEW.ACCESSTYPE THEN 'ACCESS CHANGED'
            WHEN OLD.ACCESSNAME <> NEW.ACCESSNAME THEN 'INDEX CHANGED'
            WHEN OLD.MATCHCOLS <> NEW.MATCHCOLS THEN 'MATCHCOLS CHANGED'
            WHEN OLD.INDEXONLY <> NEW.INDEXONLY THEN 'INDEXONLY CHANGED'
            ELSE 'OK'
       END AS REGRESSION_CHECK
FROM your_schema.PLAN_TABLE OLD
JOIN your_schema.PLAN_TABLE NEW
  ON OLD.QUERYNO = NEW.QUERYNO
 AND OLD.PLANNO = NEW.PLANNO
WHERE OLD.TIMESTAMP = :old_bind_time
  AND NEW.TIMESTAMP = :new_bind_time
  AND (OLD.ACCESSTYPE <> NEW.ACCESSTYPE
    OR OLD.ACCESSNAME <> NEW.ACCESSNAME
    OR OLD.MATCHCOLS <> NEW.MATCHCOLS
    OR OLD.INDEXONLY <> NEW.INDEXONLY);

2. PLANMGMT Policy

All critical packages are now bound with PLANMGMT(EXTENDED), which retains the two most recent access plans. If a regression is detected, the DBA can switch back to the previous plan instantly without rebinding.

3. Regular REORG Schedule

The ACCOUNTS tablespace is now reorganized monthly to maintain cluster ratios above 80% on the primary access indexes.

Lessons Learned

REBIND is not risk-free. A REBIND with new statistics can change access plans in unexpected ways. Always EXPLAIN after REBIND and compare against the previous plan.
Cluster ratio matters enormously. A low cluster ratio on an index makes the optimizer estimate high random I/O costs, which can cause it to prefer a different (worse) index or a table scan.
PLANMGMT is essential for production. The ability to instantly revert to a previous access plan is the difference between a 5-minute fix and a multi-hour investigation during an outage.
RUNSTATS before REBIND can be dangerous. Running RUNSTATS and REBIND in the same maintenance window means the optimizer sees new statistics it has never used before. Consider running RUNSTATS a few days before REBIND, monitoring for any plan changes, and only then proceeding with the REBIND.
Composite indexes that serve multiple purposes are more stable. The IX_ACCT_CUST_TYPE index serves both the CUST_ID lookup and the ACCT_TYPE ordering, making the optimizer's choice unambiguous.

Discussion Questions

Why did the optimizer choose IX_ACCT_TYPE over IX_ACCT_CUST when IX_ACCT_CUST had a matching predicate?
If Marcus had not had PLANMGMT(EXTENDED) available, what would his emergency fix options have been?
Could this regression have been prevented by running EXPLAIN between the RUNSTATS and the REBIND? How?
What other queries on the ACCOUNTS table might be affected by the low cluster ratio on IX_ACCT_CUST?
Should the old IX_ACCT_CUST index be dropped now that IX_ACCT_CUST_TYPE exists? What factors would you consider?