Case Study 02: Batch Window Crisis — 8 Hours to 12 Hours

Background

Continental Insurance Group (CIG) runs a DB2 for z/OS system that processes insurance claims, policy renewals, and premium calculations. The system handles 2 million claims per year and manages 8 million active policies.

The nightly batch window runs from 10:00 PM to 6:00 AM (8 hours). This window was originally designed to accommodate 5 hours of batch processing with a 3-hour buffer. Over the past 18 months, batch runtime has grown steadily, and last month the batch window overran for the first time, finishing at 8:00 AM and causing OLTP degradation during the morning rush.

This month, the batch is projected to require 12 hours — far exceeding the 8-hour window.

The Batch Jobs

Job       Description                               Runtime 18 Mo Ago   Runtime Now   Tables
CLMPURGE  Archive claims older than 7 years         30 min              45 min        CLAIMS, CLAIM_ARCHIVE
PRMSYNC   Synchronize premium calculations          45 min              90 min        POLICIES, PREMIUMS
POLRENEW  Process policy renewals due in 30 days    60 min              180 min       POLICIES, RENEWALS, BILLING
CLMSTATS  Generate claims statistics report         30 min              120 min       CLAIMS, CLAIM_DETAILS
RUNSTATS  Update catalog statistics                 20 min              25 min        All major tables
REORG     Reorganize high-activity tablespaces      60 min              75 min        CLAIMS, POLICIES
IMGCOPY   Image copy of critical tablespaces        45 min              50 min        All critical tablespaces

Total 18 months ago: 290 min (4 hr 50 min) Total now: 585 min (9 hr 45 min) — and growing

The Investigation

Step 1: Define the Symptoms

Three batch jobs have grown dramatically:

- POLRENEW: 60 min to 180 min (3x)
- CLMSTATS: 30 min to 120 min (4x)
- PRMSYNC: 45 min to 90 min (2x)

The remaining jobs grew modestly (proportional to data growth), which is expected and acceptable.

Step 2: Collect Metrics — Per-Job Analysis

POLRENEW Accounting Trace:

Plan: POLRENEW   Occurrences: 1

                      18 Months Ago        Now
Class 2 Elapsed:      3,600 sec           10,800 sec
Class 2 CPU:            450 sec            2,100 sec
I/O Suspensions:      2,800 sec            7,500 sec
Lock Suspensions:       200 sec              800 sec

Getpages:          120,000,000          580,000,000
Sync I/O:           25,000,000          130,000,000
Prefetch Requests:     500,000              600,000
Rows Processed:      2,000,000            8,500,000

COMMITs:                     1                    1

Analysis:

- CPU increased 4.7x (450 to 2,100 sec)
- Getpages increased 4.8x (120M to 580M)
- Sync I/O increased 5.2x (25M to 130M)
- Rows processed increased 4.25x (2M to 8.5M)
- Prefetch barely changed (500K to 600K)
- COMMIT count: still 1 — the job commits only at the end

The getpage and row increases are roughly proportional to each other, suggesting that the data volume has grown ~4x and the access paths are proportionally scanning more data.

CLMSTATS Accounting Trace:

Plan: CLMSTATS   Occurrences: 1

                      18 Months Ago        Now
Class 2 Elapsed:      1,800 sec            7,200 sec
Class 2 CPU:            300 sec            3,800 sec
I/O Suspensions:      1,200 sec            2,800 sec
Lock Suspensions:       100 sec              200 sec

Getpages:           80,000,000          950,000,000
Sync I/O:           15,000,000           55,000,000
Rows Processed:      5,000,000          120,000,000

COMMITs:                     5                    5

Analysis:

- Getpages increased 11.9x — far more than the 4x data growth
- Rows processed increased 24x — something fundamentally changed in the access path
- CPU increased 12.7x — consistent with the getpage explosion
- This is CPU-bound with I/O as a secondary factor

The disproportionate increase in getpages and rows processed (relative to data growth) signals an access path problem.
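This triage rule can be expressed compactly. A minimal sketch (the function name and tolerance factor are illustrative, not a DB2 facility): compare getpage growth against the underlying data growth, and flag the job when getpages are growing far faster than the data.

```python
def diagnose(getpages_then, getpages_now, data_growth, tolerance=2.0):
    """Classify batch runtime growth: proportional data growth vs.
    an access path problem. `data_growth` is the growth factor of the
    underlying tables (e.g. from row counts); `tolerance` is an
    illustrative cutoff, not a DB2-defined threshold."""
    getpage_growth = getpages_now / getpages_then
    if getpage_growth > tolerance * data_growth:
        return "access path regression"
    return "proportional data growth"

# POLRENEW: getpages 120M -> 580M (4.8x) against ~4.25x row growth
print(diagnose(120e6, 580e6, 4.25))   # proportional data growth

# CLMSTATS: getpages 80M -> 950M (11.9x) against ~4x data growth
print(diagnose(80e6, 950e6, 4.0))     # access path regression
```

Under this rule POLRENEW's growth is unremarkable, while CLMSTATS is flagged for exactly the reason identified above.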

PRMSYNC Accounting Trace:

Plan: PRMSYNC   Occurrences: 1

                      18 Months Ago        Now
Class 2 Elapsed:      2,700 sec            5,400 sec
Class 2 CPU:            200 sec              400 sec
I/O Suspensions:      2,200 sec            4,500 sec
Lock Suspensions:       150 sec              300 sec

Getpages:           50,000,000          100,000,000
Sync I/O:           20,000,000           42,000,000
Prefetch Requests:   2,000,000            2,200,000

COMMITs:                   100                  100

Analysis:

- Everything roughly doubled, which is proportional to data growth
- I/O-bound: I/O suspension is 83% of elapsed time
- Prefetch ratio is low: 2.2M prefetch vs 42M sync I/O = only 5% prefetch
- This job should benefit from prefetch but is doing mostly synchronous I/O
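The 5% figure is just the prefetch share of total read requests. A quick sketch of that computation (the cutoff used in the comment is an illustrative rule of thumb, not a DB2 metric):

```python
def prefetch_ratio(prefetch_requests, sync_io):
    """Fraction of read requests satisfied by prefetch rather than
    synchronous I/O."""
    return prefetch_requests / (prefetch_requests + sync_io)

ratio = prefetch_ratio(2_200_000, 42_000_000)
print(round(ratio * 100))   # ~5% -- far too low for a sequential batch job
```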

Step 3: Root Cause Analysis

POLRENEW — Data Growth Without COMMIT Optimization:

The POLICIES table grew from 5 million to 8 million rows. The RENEWALS processing scans policies expiring in the next 30 days, which grew from 2M to 8.5M rows (a new product line added 6M policies with monthly renewal cycles).

The single-COMMIT pattern means the job holds all locks from start to finish. With 8.5M rows processed, this accumulates massive lock counts and prevents any concurrent access to the POLICIES tablespace.

CLMSTATS — Access Path Degradation:

Running EXPLAIN on the CLMSTATS main query revealed the problem:

Current access path:
  1. TBSCAN on CLAIM_DETAILS (tablespace scan)
  2. NLJOIN to CLAIMS on CLAIM_ID
  3. SORT for GROUP BY

Eighteen months ago, the same query used:

Previous access path:
  1. IXSCAN on IX_CLMDET_STATUS for CLAIM_DETAILS
  2. NLJOIN to CLAIMS on CLAIM_ID
  3. SORT for GROUP BY

The switch from index scan to table scan occurred because RUNSTATS was run against the CLAIM_DETAILS table after a bulk load that temporarily skewed the data distribution. The optimizer chose a table scan because the statistics indicated low cardinality on the STATUS column. A subsequent RUNSTATS was never run after the data normalized.

With a table scan on 120 million rows instead of an index scan on a filtered subset, getpages exploded.

PRMSYNC — Sequential Prefetch Not Triggering:

The PREMIUMS table was last reorganized 14 months ago. Since then, significant insert and update activity has fragmented the data. DB2's sequential prefetch algorithm does not trigger because the data pages are no longer physically sequential.
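A rough way to see why fragmentation defeats prefetch: measure how often a scan's next page physically follows the previous one. This is a simplified proxy (DB2's actual sequential-detection logic uses page-distance thresholds and is more forgiving than strict adjacency), but it captures the effect of REORG:

```python
def sequential_fraction(page_numbers):
    """Fraction of page accesses that hit the page immediately after
    the previous one -- a crude proxy for whether sequential prefetch,
    which needs physically contiguous pages, can engage."""
    hits = sum(1 for prev, cur in zip(page_numbers, page_numbers[1:])
               if cur == prev + 1)
    return hits / (len(page_numbers) - 1)

# Freshly REORG'd: data pages in clustering order
print(sequential_fraction([1, 2, 3, 4, 5]))    # 1.0
# Fragmented after months of inserts and updates
print(sequential_fraction([1, 9, 2, 14, 3]))   # 0.0
```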

Step 4: Implement Fixes

Fix 1: POLRENEW — COMMIT Frequency and Partition Processing

//* Modified POLRENEW to commit every 5,000 rows
//* and process by partition (policy type)
//*
//* Process partition 1: Individual policies
//STEP01  EXEC PGM=POLRENEW,PARM='PTYPE=IND,COMMITFREQ=5000'
//* Process partition 2: Commercial policies
//STEP02  EXEC PGM=POLRENEW,PARM='PTYPE=COM,COMMITFREQ=5000'
//* Process partition 3: Monthly cycle policies (new)
//STEP03  EXEC PGM=POLRENEW,PARM='PTYPE=MON,COMMITFREQ=5000'

The job was also redesigned to process only the rows that actually need renewal (adding a WHERE clause on renewal_date instead of scanning all policies):

-- Before (scans all policies)
SELECT * FROM POLICIES WHERE policy_type = :ptype;

-- After (scans only policies due for renewal)
SELECT * FROM POLICIES
WHERE policy_type = :ptype
  AND renewal_date BETWEEN CURRENT DATE AND CURRENT DATE + 30 DAYS
  AND status = 'ACTIVE';

Fix 2: CLMSTATS — RUNSTATS and REBIND

//* Run RUNSTATS to update statistics
//STEP01  EXEC PGM=DSNUTILB
//SYSIN DD *
  RUNSTATS TABLESPACE CIGDB.TS_CLMDET
    TABLE(CIG.CLAIM_DETAILS)
    INDEX(ALL)
    UPDATE ALL
    REPORT YES
/*

//* Rebind the CLMSTATS package
//STEP02  EXEC PGM=IKJEFT01
//SYSTSIN DD *
  DSN SYSTEM(DB2P)
  BIND PACKAGE(COLL1) MEMBER(CLMSTATS) -
    ACTION(REPLACE) ISOLATION(CS) -
    EXPLAIN(YES)
  END
/*

After RUNSTATS and REBIND, EXPLAIN confirmed the index scan access path was restored.

Fix 3: PRMSYNC — REORG the PREMIUMS Tablespace

//* Online REORG to restore physical sequencing
//STEP01  EXEC PGM=DSNUTILB
//SYSIN DD *
  REORG TABLESPACE CIGDB.TS_PREMIUMS
    SHRLEVEL CHANGE
    LOG YES
    KEEPDICTIONARY
/*

//* Follow with RUNSTATS
//STEP02  EXEC PGM=DSNUTILB
//SYSIN DD *
  RUNSTATS TABLESPACE CIGDB.TS_PREMIUMS
    TABLE(CIG.PREMIUMS)
    INDEX(ALL)
    UPDATE ALL
/*

Fix 4: Batch Schedule Redesign

The original schedule ran all jobs sequentially. The redesigned schedule runs independent jobs in parallel:

22:00 - 22:30  CLMPURGE (archive old claims)
22:00 - 22:25  RUNSTATS (all tables) [parallel with CLMPURGE — different tables]
22:30 - 23:15  REORG (CLAIMS, POLICIES)
23:15 - 23:25  RUNSTATS on REORG'd tables
23:15 - 23:55  IMGCOPY [parallel with RUNSTATS — uses SHRLEVEL REFERENCE]
23:25 - 00:10  POLRENEW (optimized — now 45 min)
23:55 - 00:25  CLMSTATS (optimized — now 30 min) [parallel with POLRENEW — different tables]
00:10 - 00:40  PRMSYNC (after REORG — now 45 min)
00:40          Batch window complete. Total: 2 hr 40 min.
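As a sanity check on the schedule above, the window length is simply the span from the earliest start to the latest end across the (overlapping) intervals. A quick sketch, with times after midnight offset by a day:

```python
def minutes(hhmm, day_offset=0):
    """Convert 'HH:MM' to minutes, offsetting past-midnight times."""
    h, m = map(int, hhmm.split(":"))
    return (h + 24 * day_offset) * 60 + m

intervals = [
    (minutes("22:00"), minutes("22:30")),        # CLMPURGE
    (minutes("22:00"), minutes("22:25")),        # RUNSTATS (parallel)
    (minutes("22:30"), minutes("23:15")),        # REORG
    (minutes("23:15"), minutes("23:25")),        # RUNSTATS on REORG'd tables
    (minutes("23:15"), minutes("23:55")),        # IMGCOPY (parallel)
    (minutes("23:25"), minutes("00:10", 1)),     # POLRENEW
    (minutes("23:55"), minutes("00:25", 1)),     # CLMSTATS (parallel)
    (minutes("00:10", 1), minutes("00:40", 1)),  # PRMSYNC
]
window = max(end for _, end in intervals) - min(start for start, _ in intervals)
print(window)   # 160 minutes = 2 hr 40 min
```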

Step 5: Verification

POLRENEW:

Before optimization:  10,800 sec elapsed, 580M getpages, 1 COMMIT
After optimization:    2,700 sec elapsed, 45M getpages, 1,700 COMMITs
Improvement: 75% reduction in elapsed time

CLMSTATS:

Before optimization:  7,200 sec elapsed, 950M getpages, table scan
After optimization:   1,800 sec elapsed, 85M getpages, index scan
Improvement: 75% reduction in elapsed time

PRMSYNC:

Before optimization:  5,400 sec elapsed, 42M sync I/O, 5% prefetch
After optimization:   2,700 sec elapsed, 8M sync I/O, 85% prefetch
Improvement: 50% reduction in elapsed time

Total Batch Window:

Before: 585 min (9 hr 45 min) — sequential execution
After:  160 min (2 hr 40 min) — parallel execution + optimized jobs
Buffer: 5 hr 20 min of spare capacity in the 8-hour window

Long-Term Remediation

1. Automated RUNSTATS Schedule

Implemented a weekly RUNSTATS on all tables with > 5% data change since last RUNSTATS. This prevents the statistical drift that caused the CLMSTATS access path regression.

2. REORG Monitoring

Created a monthly REORG evaluation that checks CLUSTERRATIO and FARINDREF metrics. When CLUSTERRATIO drops below 80%, the table is automatically added to the next REORG cycle.
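The evaluation logic amounts to a threshold check. A minimal sketch: the 80% CLUSTERRATIO rule comes from the text above, while the FARINDREF cutoff used here is an illustrative assumption (in DB2, FARINDREF is a count from SYSIBM.SYSINDEXPART, and a real evaluation would compare it to table size):

```python
def needs_reorg(clusterratio, farindref=None, cr_threshold=80,
                far_threshold=10):
    """Flag a tablespace for the next REORG cycle.
    cr_threshold follows the 80% rule in the case study;
    far_threshold is a hypothetical cutoff for illustration."""
    if clusterratio < cr_threshold:
        return True
    if farindref is not None and farindref > far_threshold:
        return True
    return False

print(needs_reorg(72))       # True: clustering has degraded below 80%
print(needs_reorg(95, 3))    # False: still well organized
```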

3. Batch Runtime Trending

Implemented a batch runtime tracking table:

CREATE TABLE batch_runtime_history (
    job_name     VARCHAR(20),
    run_date     DATE,
    elapsed_sec  INTEGER,
    cpu_sec      INTEGER,
    getpages     BIGINT,
    commits      INTEGER
);

A weekly report compares each job's runtime trend against its 90-day moving average. If a job exceeds 120% of its moving average, it is flagged for investigation.
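The flagging rule can be sketched in a few lines (plain Python standing in for the report job; a real implementation would read elapsed_sec from batch_runtime_history):

```python
def flag_runtime(history_sec, latest_sec, threshold=1.20):
    """Flag a job whose latest elapsed time exceeds 120% of its
    trailing moving average. history_sec: elapsed times (seconds)
    from the 90-day window."""
    if not history_sec:
        return False
    moving_avg = sum(history_sec) / len(history_sec)
    return latest_sec > threshold * moving_avg

history = [3600, 3650, 3700, 3690]    # ~1-hour runs; avg 3,660 sec
print(flag_runtime(history, 4800))    # True: 4,800 > 1.2 * 3,660
print(flag_runtime(history, 4000))    # False: within the 120% band
```

Catching a job at 120% of trend is what makes the difference the Lessons Learned section describes: intervening at a 20% increase instead of a 300% one.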

4. Access Path Baseline

EXPLAIN is run on all critical batch SQL after every REBIND, and the access paths are stored in a baseline table. A weekly comparison detects access path regressions before they cause runtime problems.
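A sketch of the comparison step, with access paths stored as ordered (method, object) pairs, one per step. The structures are illustrative; a real implementation would read PLAN_TABLE columns such as ACCESSTYPE and ACCESSNAME after each EXPLAIN:

```python
def accesspath_regressions(baseline, current):
    """Return the queries whose current access path no longer matches
    the stored baseline (including queries missing from `current`)."""
    return [query for query, path in baseline.items()
            if current.get(query) != path]

baseline = {"CLMSTATS_Q1": [("IXSCAN", "IX_CLMDET_STATUS"),
                            ("NLJOIN", "CLAIMS")]}
current  = {"CLMSTATS_Q1": [("TBSCAN", "CLAIM_DETAILS"),
                            ("NLJOIN", "CLAIMS")]}
print(accesspath_regressions(baseline, current))   # ['CLMSTATS_Q1']
```

This check, run weekly, would have surfaced the CLMSTATS index-to-tablescan flip long before it cost 90 minutes of batch window.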

Lessons Learned

  1. Batch runtime growth is rarely sudden. POLRENEW and CLMSTATS grew gradually over 18 months. Monthly trending would have caught the degradation when it was a 20% increase rather than a 300% increase.

  2. A single COMMIT in a batch job is a design flaw. POLRENEW processed 8.5 million rows with one COMMIT. This accumulates locks, prevents concurrent access, and makes restart-after-failure impossible.

  3. RUNSTATS after bulk operations is non-negotiable. The CLMSTATS access path regression was caused by a single stale RUNSTATS after a bulk load. Automated RUNSTATS scheduling prevents this.

  4. REORG frequency must match modification frequency. The PREMIUMS table was reorganized 14 months ago but is modified daily. A REORG evaluation should run monthly.

  5. Parallel batch execution is the highest-leverage optimization. The combination of individual job fixes and the schedule redesign (from sequential to parallel) reduced the window from 9:45 to 2:40, a 73% improvement. Even with the job fixes alone, a sequential schedule would still have taken roughly 4.5 hours; running independent jobs in parallel cut that nearly in half.

Discussion Questions

  1. CIG's batch window grew from 4:50 to 9:45 over 18 months — a 100% increase. If data grows at the same rate, when will the optimized batch window (2:40) exceed the 8-hour window again? What proactive measures should CIG take?

  2. The POLRENEW fix added a WHERE clause on renewal_date. This changes the job's semantics — previously it scanned all policies, now it scans only those due for renewal. Under what circumstances could this narrower scan miss a policy that should be renewed?

  3. REORG with SHRLEVEL CHANGE allows concurrent read/write access during reorganization. What are the costs of online REORG compared to offline REORG? When would you choose SHRLEVEL NONE instead?

  4. The parallel batch schedule requires CLMPURGE and RUNSTATS to access different tables. What would happen if a future schema change caused them to share a table? How would you detect this conflict?