Case Study 1: CNB's Overnight Plan Change Investigation

Lisa Tran's Deep Dive into a Production Access Path Regression


The Setting

Continental National Bank processes 500 million transactions daily across four LPARs. The nightly batch cycle — a sequence of 47 jobs that must complete within a 4-hour window between 11:00 PM and 3:00 AM — is the heartbeat of the bank's operations. Every account balance, every reconciliation, every regulatory report depends on this cycle completing on time.

Job ACCTRCN7, the account reconciliation step, is the single most critical job in the cycle. It joins the day's transactions against the account master to validate that running balances are correct, flag discrepancies, and produce the reconciliation report that audit requires every morning by 6:00 AM.

For eighteen months, ACCTRCN7 has been completing in 20–24 minutes. It's been a non-event — exactly what production operations people want.

Until Tuesday.


The Incident

03:47 AM — Rob Calloway, batch operations lead, notices ACCTRCN7 has been running for 88 minutes. Normally it would have completed 66 minutes ago. The downstream jobs — regulatory reporting, GL posting, statement generation — are all queued and waiting.

03:52 AM — Rob calls Lisa Tran, senior DBA. She's been at CNB for eleven years and has seen hundreds of incidents. Her first question is always the same.

"What changed?"

"Nothing," Rob says. "No code deploy this week. No DDL. Same JCL. I checked the change log."

Lisa knows that "nothing changed" never means nothing changed. Something always changed. The question is what.

03:55 AM — Lisa connects to LPAR DB2A (the production DB2 subsystem where ACCTRCN7 runs) from her home workstation via encrypted VPN. She pulls up her diagnostic toolkit — a set of SQL queries she's refined over the years.

Step 1: Confirm the Problem

-- Check current thread activity for ACCTRCN7
SELECT SUBSTR(PKGNAME,1,8) AS PACKAGE,
       SUBSTR(CONTOKEN,1,8) AS CONTOKEN,
       STATUS,
       ELAPSED_TIME_SEC AS ELAPSED_SEC,
       CPU_TIME_SEC AS CPU_SEC,
       GETPAGES,
       SYNC_IO
FROM   TABLE(MON_GET_ACTIVITY(NULL,-2))
WHERE  PKGNAME LIKE 'ACCTRCN7%';

Results:

PACKAGE   CONTOKEN  STATUS   ELAPSED_SEC  CPU_SEC  GETPAGES      SYNC_IO
--------  --------  ------   -----------  -------  -----------   --------
ACCTRCN7  D5F2A1B3  ACTIVE   5340         1892     487,291,034   3,847,221

487 million getpages. 3.8 million synchronous I/Os. Those numbers are catastrophic — normal is around 80 million getpages and 200,000 sync I/Os. Something is fundamentally different about how DB2 is accessing the data.
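The comparison Lisa makes by eye, live counters against the job's historical profile, is easy to script. A minimal Python sketch using the incident's numbers (the baseline values and the 5x threshold are illustrative assumptions, not CNB's actual monitoring configuration):

```python
# Compare live thread counters to the job's historical baseline and flag
# metrics that are far outside normal. Baseline numbers and the 5x threshold
# are illustrative assumptions.

BASELINE = {"getpages": 80_000_000, "sync_io": 200_000}     # normal ACCTRCN7 run
OBSERVED = {"getpages": 487_291_034, "sync_io": 3_847_221}  # tonight's run

def anomalies(observed: dict, baseline: dict, threshold: float = 5.0) -> dict:
    """Return metric -> ratio for every metric more than threshold x normal."""
    return {m: observed[m] / baseline[m]
            for m in baseline
            if observed[m] / baseline[m] > threshold}

for metric, ratio in anomalies(OBSERVED, BASELINE).items():
    print(f"{metric}: {ratio:.1f}x normal, investigate the access path")
```

Both counters blow past the threshold here (getpages at roughly 6x, sync I/O at roughly 19x), which is what tells Lisa the access path, not the data volume, is the variable.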

Step 2: Check the Access Path

-- Get the current EXPLAIN for ACCTRCN7, query 15 (the big reconciliation join)
SELECT QUERYNO, PLANNO, METHOD,
       SUBSTR(TNAME,1,25) AS TABLE_NAME,
       ACCESSTYPE, MATCHCOLS,
       SUBSTR(ACCESSNAME,1,25) AS INDEX_NAME,
       INDEXONLY, PREFETCH,
       SORTN_JOIN, SORTC_JOIN
FROM   PLAN_TABLE
WHERE  PROGNAME = 'ACCTRCN7'
  AND  QUERYNO = 15
  AND  BIND_TIME = (SELECT MAX(BIND_TIME) FROM PLAN_TABLE
                    WHERE PROGNAME = 'ACCTRCN7')
ORDER BY PLANNO;

Results:

QUERYNO  PLANNO  METHOD  TABLE_NAME                 ACCESSTYPE  MATCHCOLS  INDEX_NAME                 INDEXONLY  PREFETCH  SORTN_JOIN  SORTC_JOIN
-------  ------  ------  -------------------------  ----------  ---------  -------------------------  --------  --------  ----------  ----------
15       1       0       DAILY_TRANSACTIONS         R           0                                     N         S         Y           N
15       2       2       ACCOUNT_MASTER             I           1          IX_AMAST_ACCTID            N         S         N           Y

Lisa's stomach drops. ACCESSTYPE = 'R' on DAILY_TRANSACTIONS — a tablespace scan on a 180-GB tablespace. METHOD = 2 — merge scan join. SORTN_JOIN = 'Y' — the system is sorting the entire qualifying set from DAILY_TRANSACTIONS for the merge.

She pulls up her baseline comparison — she keeps PLAN_TABLE snapshots from every production BIND in a history table:

-- Previous EXPLAIN (from the last BIND on Dec 1)
SELECT QUERYNO, PLANNO, METHOD,
       SUBSTR(TNAME,1,25), ACCESSTYPE, MATCHCOLS,
       SUBSTR(ACCESSNAME,1,25), INDEXONLY, PREFETCH,
       SORTN_JOIN, SORTC_JOIN
FROM   PLAN_TABLE_HISTORY
WHERE  PROGNAME = 'ACCTRCN7'
  AND  QUERYNO = 15
  AND  CAPTURE_DATE = '2025-12-01'
ORDER BY PLANNO;

Results:

QUERYNO  PLANNO  METHOD  TABLE_NAME                 ACCESSTYPE  MATCHCOLS  INDEX_NAME                 INDEXONLY  PREFETCH  SORTN_JOIN  SORTC_JOIN
-------  ------  ------  -------------------------  ----------  ---------  -------------------------  --------  --------  ----------  ----------
15       1       0       DAILY_TRANSACTIONS         I           4          IX_DTRAN_ACCTDT            N         S         N           N
15       2       1       ACCOUNT_MASTER             I           1          IX_AMAST_ACCTID            N         S         N           N

The previous plan: matching index scan with 4 matching columns on IX_DTRAN_ACCTDT, nested loop join. The current plan: tablespace scan, merge scan join. A complete change.

Step 3: What Changed?

Lisa checks the BIND log:

SELECT BIND_TIME, SUBSTR(BOUNDBY,1,8) AS BOUND_BY
FROM   PLAN_TABLE
WHERE  PROGNAME = 'ACCTRCN7'
ORDER BY BIND_TIME DESC
FETCH FIRST 5 ROWS ONLY;

Results:

BIND_TIME                   BOUND_BY
--------------------------  --------
2025-12-15-14.30.22.000000  AUTORBND
2025-12-01-09.15.44.000000  LISATRAN

AUTORBND — automatic rebind. DB2 rebinds a package automatically at its next execution when the package has been marked INVALID or INOPERATIVE (typically because DDL changed an object the package depends on). Whatever invalidated the package, the automatic rebind optimized against the statistics sitting in the catalog at that moment. So what did the catalog look like on Monday afternoon?

Lisa checks the RUNSTATS history:

SELECT SUBSTR(NAME,1,16) AS TABLESPACE,
       STATSLASTTIME
FROM   SYSIBM.SYSTABLESPACESTATS
WHERE  DBNAME = 'DBPROD01'
  AND  NAME = 'TSDTRAN'
ORDER BY STATSLASTTIME DESC;

RUNSTATS ran Sunday night, December 14, as part of the weekly maintenance window. Two days ago. And the package was automatically rebound on December 15 at 14:30 — Monday afternoon, the first time it was executed after the statistics changed.

Step 4: What Did RUNSTATS Find?

This is where Lisa goes deep.

-- Current statistics for TRAN_DATE
SELECT COLCARDF, HIGH2KEY, LOW2KEY
FROM   SYSIBM.SYSCOLUMNS
WHERE  TBNAME = 'DAILY_TRANSACTIONS'
  AND  TBCREATOR = 'PROD'
  AND  NAME = 'TRAN_DATE';

Results:

COLCARDF    HIGH2KEY      LOW2KEY
----------  ------------  ------------
1095        2025-12-14    2022-12-15

COLCARDF = 1095. Three years of dates. But DAILY_TRANSACTIONS is supposed to hold only one rolling year — 365 days.

Lisa remembers. Last Thursday, the data migration team loaded three years of historical transactions for a regulatory analysis project. They loaded the data Thursday night. The analysis ran Friday and Saturday. The data was supposed to be deleted before Sunday's RUNSTATS window. But the deletion job failed — she can see it in the job log — and was re-run successfully Monday morning.

So the timeline is:

- Thursday night: 3 years of historical data loaded (COLCARDF blooms to ~1095)
- Friday–Saturday: historical data present in the table
- Sunday night: RUNSTATS runs and captures COLCARDF = 1095
- Monday morning: historical data deleted, returning the table to 1 year of data
- Monday afternoon: ACCTRCN7 executes, triggering the automatic rebind against the inflated statistics
- Tuesday 03:47 AM: the bad plan is still in effect for the nightly batch

The statistics say 1,095 distinct TRAN_DATE values. The data only has 366. RUNSTATS captured a transient state.
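For a rolling-window table, a cheap sanity check could have caught this before the rebind: the distinct count of the date column should track the retention period. A minimal Python sketch of such a check (the retention constant, the 10% tolerance, and the function name are illustrative assumptions):

```python
# Flag date-column statistics that disagree with the table's retention policy.
# RETENTION_DAYS and the 10% tolerance are assumptions for illustration.

RETENTION_DAYS = 366   # DAILY_TRANSACTIONS holds one rolling year

def stats_look_transient(colcardf: float, tolerance: float = 0.10) -> bool:
    """True if the captured distinct-date count strays far from retention."""
    return abs(colcardf - RETENTION_DAYS) / RETENTION_DAYS > tolerance

print(stats_look_transient(1095))  # True  -- the captured transient state
print(stats_look_transient(366))   # False -- steady state
```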

Step 5: Understanding the Cost Model Impact

Lisa now works through the cost model math to understand exactly why the optimizer switched plans.

With COLCARDF = 365 (correct):

- FF for TRAN_DATE = :WS-PROCESS-DATE: 1/365 = 0.00274
- FF for ACCT_STATUS IN ('A', 'P', 'H'): 3/6 = 0.50
- Combined FF: 0.00274 * 0.50 = 0.00137
- Estimated rows from DAILY_TRANSACTIONS: 500,000,000 * 0.00137 = 685,000
- Index access cost (MATCHCOLS=4, CLUSTERRATIO=95%): mostly sequential I/O
- Estimated cost: ~45,000 timerons

With COLCARDF = 1095 (inflated):

- FF for TRAN_DATE = :WS-PROCESS-DATE: 1/1095 = 0.000913
- FF for ACCT_STATUS IN ('A', 'P', 'H'): 3/6 = 0.50
- Combined FF: 0.000913 * 0.50 = 0.000457
- Estimated rows from DAILY_TRANSACTIONS: 500,000,000 * 0.000457 = 228,500

But the optimizer compared this lower estimate against the tablespace scan cost...

The critical insight: with the lower row estimate, the optimizer recalculated the merge scan join cost as cheaper than nested loop with 228,500 index probes. The tablespace scan feeds the merge directly with sequential I/O. The optimizer thought: "228,500 rows through a sort-merge is cheaper than 228,500 individual index probes into ACCOUNT_MASTER."

The problem: the actual number of qualifying rows was still ~685,000 (the real data hadn't changed, only the statistics). And a tablespace scan on 180 GB to find 685,000 rows is wildly more expensive than an index scan.
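The row-estimate arithmetic above can be reproduced directly. A small Python sketch of the filter-factor math (the function name is an assumption; this ignores everything else in DB2's cost model — clustering, prefetch, join-method costing — and reproduces only the row-estimate step):

```python
# Reproduce the optimizer's row-estimate step from the case study.
# Only the filter-factor arithmetic is modeled; the rest of DB2's cost
# model is deliberately out of scope.

TABLE_ROWS = 500_000_000   # CARDF of DAILY_TRANSACTIONS
STATUS_FF = 3 / 6          # ACCT_STATUS IN ('A','P','H'): 3 of 6 known values

def estimated_rows(colcardf: int) -> float:
    """Equal-predicate FF (1/COLCARDF) times the IN-list FF, times CARDF."""
    return TABLE_ROWS * (1 / colcardf) * STATUS_FF

print(round(estimated_rows(365)))   # ~685,000 with correct statistics
print(round(estimated_rows(1095)))  # ~228,300 with the inflated COLCARDF
```

With the estimate cut to a third, the nested-loop probe cost fell enough for the merge-scan plan to win, even though the real qualifying row count never changed.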


The Fix

Immediate Resolution (04:08 AM)

Lisa submits a RUNSTATS job from the emergency JCL library:

//RNSTFIX  JOB (EMERGENCY),'LISA TRAN',CLASS=A,MSGCLASS=X,
//         NOTIFY=LISATRAN
//RUNSTATS EXEC DSNUPROC,SYSTEM=DB2A,UID='EMRG',UTPROC=''
//SYSIN    DD *
  RUNSTATS TABLESPACE DBPROD01.TSDTRAN
    TABLE(PROD.DAILY_TRANSACTIONS)
    USE PROFILE
    SHRLEVEL CHANGE
    REPORT YES
    UPDATE ALL
/*

RUNSTATS completes in 12 minutes (it's a big tablespace). The statistics now show:

COLCARDF = 366
HIGH2KEY = 2025-12-15
LOW2KEY  = 2024-12-16

Correct.

Lisa then rebinds the package:

REBIND PACKAGE(ACCTRCNC.ACCTRCN7) EXPLAIN(YES)

She verifies the PLAN_TABLE:

PLANNO  METHOD  TABLE_NAME              ACCESSTYPE  MATCHCOLS  INDEX_NAME           PREFETCH  SORTN_JOIN
------  ------  ----------------------  ----------  ---------  -------------------  --------  ----------
1       0       DAILY_TRANSACTIONS      I           4          IX_DTRAN_ACCTDT      S         N
2       1       ACCOUNT_MASTER          I           1          IX_AMAST_ACCTID      S         N

The good plan is back. MATCHCOLS = 4, nested loop join, no sorts.

04:22 AM — Lisa restarts ACCTRCN7. The job has built-in checkpoint/restart (a design decision from Chapter 10 of this book), so it resumes from its last commit point rather than rerunning the work already done.

04:48 AM — ACCTRCN7 completes. 26 minutes. Within SLA.

05:15 AM — All downstream jobs complete. The batch window holds. Audit gets their reconciliation report on time.

Long-Term Prevention (Implemented the Following Week)

Lisa presents to Kwame Mensah, the chief architect, and together they implement four changes:

1. RUNSTATS Scheduling Gate

A new pre-RUNSTATS validation step checks for transient data:

-- Pre-RUNSTATS validation query
-- Compare current row count to last known count
-- If delta > 20%, flag for manual review
SELECT T.NAME, T.CARDF AS CATALOG_CARDF,
       (SELECT COUNT(*) FROM PROD.DAILY_TRANSACTIONS) AS ACTUAL_COUNT,
       ABS(T.CARDF - (SELECT COUNT(*) FROM PROD.DAILY_TRANSACTIONS))
         / NULLIF(T.CARDF, 0) * 100 AS PCT_DELTA
FROM   SYSIBM.SYSTABLES T
WHERE  T.NAME = 'DAILY_TRANSACTIONS'
  AND  T.CREATOR = 'PROD';

If PCT_DELTA exceeds 20%, the gate step ends with a nonzero return code, the RUNSTATS step is skipped, and the on-call DBA is paged for manual review.
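The gate's decision rule is simple enough to express directly. A Python sketch of the logic (the 20% threshold comes from the protocol; the function name and sample row counts are illustrative):

```python
# Decision logic of the pre-RUNSTATS validation gate. The 20% threshold is
# from the protocol; function name and sample numbers are illustrative.

def runstats_gate(catalog_cardf: float, actual_count: int,
                  threshold_pct: float = 20.0) -> bool:
    """True if RUNSTATS may proceed, False if the delta needs manual review."""
    if catalog_cardf <= 0:        # no prior statistics recorded: let it run
        return True
    pct_delta = abs(catalog_cardf - actual_count) / catalog_cardf * 100
    return pct_delta <= threshold_pct

# Steady state: row count drifts a fraction of a percent -> proceed
print(runstats_gate(500_000_000, 502_000_000))    # True

# Thursday's load: the table briefly holds ~3x its normal volume -> blocked
print(runstats_gate(500_000_000, 1_500_000_000))  # False
```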

2. PLANMGMT(EXTENDED) on All Critical Packages

All 47 batch cycle packages are rebound with PLANMGMT(EXTENDED). This preserves the PREVIOUS and ORIGINAL plans, enabling instant fallback via REBIND ... SWITCH(PREVIOUS).

3. Automated EXPLAIN Baseline Comparison

A new post-BIND process captures PLAN_TABLE output and compares against a baseline:

-- Detect critical changes: TS scan replacing index on large tables
SELECT B.PROGNAME, B.QUERYNO,
       B.ACCESSTYPE AS BASELINE_ACCESS,
       N.ACCESSTYPE AS NEW_ACCESS,
       B.MATCHCOLS AS BASELINE_MC,
       N.MATCHCOLS AS NEW_MC,
       B.METHOD AS BASELINE_METHOD,
       N.METHOD AS NEW_METHOD
FROM   PLAN_TABLE_BASELINE B
       INNER JOIN PLAN_TABLE N
         ON B.PROGNAME = N.PROGNAME
         AND B.QUERYNO = N.QUERYNO
         AND B.PLANNO = N.PLANNO
WHERE  N.BIND_TIME = (SELECT MAX(BIND_TIME) FROM PLAN_TABLE
                      WHERE PROGNAME = B.PROGNAME)
  AND  (B.ACCESSTYPE <> N.ACCESSTYPE
        OR B.MATCHCOLS <> N.MATCHCOLS
        OR B.METHOD <> N.METHOD);

Changes are classified:

- CRITICAL (page immediately): ACCESSTYPE changed from 'I' to 'R' on a tablespace > 1 GB
- WARNING (email DBA team): METHOD changed, or MATCHCOLS decreased by 2+
- INFO (daily report): minor changes, cost estimate shifts > 20%
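The classification rules can be sketched as straight-line logic. A Python illustration (the dataclass and function names are mine; the INFO tier's cost-shift check is omitted because cost estimates aren't in this query's output):

```python
# Sketch of the plan-change classification rules. Fields mirror the PLAN_TABLE
# columns in the comparison query; thresholds come from the documented rules.
from dataclasses import dataclass

@dataclass
class PlanRow:
    accesstype: str   # 'I' = index access, 'R' = tablespace scan
    matchcols: int
    method: int

def classify(baseline: PlanRow, new: PlanRow, ts_size_gb: float) -> str:
    if baseline.accesstype == 'I' and new.accesstype == 'R' and ts_size_gb > 1:
        return 'CRITICAL'   # page immediately
    if new.method != baseline.method or baseline.matchcols - new.matchcols >= 2:
        return 'WARNING'    # email the DBA team
    return 'INFO'           # daily report

# The ACCTRCN7 regression: index (4 matching cols) -> scan, 180 GB tablespace
old = PlanRow('I', 4, 1)
new = PlanRow('R', 0, 2)
print(classify(old, new, 180))  # prints CRITICAL
```

Run against the ACCTRCN7 regression, the rule fires on the first check: an index-to-scan flip on a 180 GB tablespace pages someone before the batch window opens, not after it blows.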

4. Data Migration Protocol

All temporary data loads for analysis must:

  1. Use a separate tablespace (not the production tablespace)
  2. If the production tablespace must be used, re-run RUNSTATS after the data is removed
  3. Shift the RUNSTATS window if it falls between the load and the cleanup


Key Lessons

For DBAs

  1. RUNSTATS captures a point-in-time snapshot. If the data is in a transient state, the statistics will be wrong. Schedule RUNSTATS when data is in its steady state.
  2. Automatic rebind (AUTOBIND) amplifies the risk. The bad plan was activated by AUTOBIND, not by a deliberate action. Consider disabling AUTOBIND for critical packages and controlling rebind timing.
  3. PLANMGMT is cheap insurance. The overhead of storing two extra plan copies is negligible. The benefit of instant fallback during a 3 AM crisis is immense.

For Architects

  1. Operational processes must account for optimizer sensitivity. Data migration, archiving, purging — any process that changes data distribution should trigger a RUNSTATS reassessment.
  2. Invest in automated plan monitoring. The EXPLAIN Baseline System caught 23 regressions in its first two years. The cost of building it was three weeks of development. The cost of one undetected regression averages 4 hours of batch delay and associated business impact.

For COBOL Developers

  1. Understand that your SQL is a request, not an instruction. The access path your query gets depends on statistics, not on how you wrote the SQL. Two identical queries can get different plans if the statistics change.
  2. Design your SQL to help the optimizer. Sargable predicates, correct data types on host variables, appropriate use of EXPLAIN during development — these aren't DBA concerns. They're your concerns.

Discussion Questions

  1. If the historical data load had been done to a separate tablespace (as the revised protocol requires), would the incident have occurred? Under what circumstances might it still occur?

  2. The pre-RUNSTATS validation gate checks for row count delta > 20%. Is this threshold too aggressive (too many false positives) or too lenient (might miss problems)? What would you set it to?

  3. Lisa used REBIND ... EXPLAIN(YES) to restore the plan. Could she instead have used REBIND ... SWITCH(PREVIOUS) if PLANMGMT(EXTENDED) had been in place? What are the trade-offs?

  4. The incident happened because AUTOBIND triggered a rebind with bad statistics. Would it be better to disable AUTOBIND entirely for critical packages and handle all rebinds manually? What are the operational implications?

  5. The EXPLAIN Baseline System is a custom-built tool. IBM provides some similar capabilities through DB2 Administration Tool and Optim Query Workload Tuner. Discuss the trade-offs between building custom tooling and purchasing vendor solutions.