Case Study 02: Batch Window Crisis — 8 Hours to 12 Hours
Background
Continental Insurance Group (CIG) runs a DB2 for z/OS system that processes insurance claims, policy renewals, and premium calculations. The system handles 2 million claims per year and manages 8 million active policies.
The nightly batch window runs from 10:00 PM to 6:00 AM (8 hours). This window was originally designed to accommodate 5 hours of batch processing with a 3-hour buffer. Over the past 18 months, batch runtime has grown steadily, and last month the batch window overran for the first time, finishing at 8:00 AM and causing OLTP degradation during the morning rush.
This month, the batch is projected to require 12 hours — far exceeding the 8-hour window.
The Batch Jobs
| Job | Description | Runtime 18 Months Ago | Runtime Now | Tables |
|---|---|---|---|---|
| CLMPURGE | Archive claims older than 7 years | 30 min | 45 min | CLAIMS, CLAIM_ARCHIVE |
| PRMSYNC | Synchronize premium calculations | 45 min | 90 min | POLICIES, PREMIUMS |
| POLRENEW | Process policy renewals due in 30 days | 60 min | 180 min | POLICIES, RENEWALS, BILLING |
| CLMSTATS | Generate claims statistics report | 30 min | 120 min | CLAIMS, CLAIM_DETAILS |
| RUNSTATS | Update catalog statistics | 20 min | 25 min | All major tables |
| REORG | Reorganize high-activity tablespaces | 60 min | 75 min | CLAIMS, POLICIES |
| IMGCOPY | Image copy of critical tablespaces | 45 min | 50 min | All critical tablespaces |
Total 18 months ago: 290 min (4 hr 50 min).
Total now: 585 min (9 hr 45 min), and growing.
The Investigation
Step 1: Define the Symptoms
Three batch jobs have grown dramatically:
- POLRENEW: 60 min to 180 min (3x)
- CLMSTATS: 30 min to 120 min (4x)
- PRMSYNC: 45 min to 90 min (2x)
The remaining jobs grew modestly (proportional to data growth), which is expected and acceptable.
Step 2: Collect Metrics — Per-Job Analysis
POLRENEW Accounting Trace:
Plan: POLRENEW, Occurrences: 1

| Metric | 18 Months Ago | Now |
|---|---|---|
| Class 2 Elapsed | 3,600 sec | 10,800 sec |
| Class 2 CPU | 450 sec | 2,100 sec |
| I/O Suspensions | 2,800 sec | 7,500 sec |
| Lock Suspensions | 200 sec | 800 sec |
| Getpages | 120,000,000 | 580,000,000 |
| Sync I/O | 25,000,000 | 130,000,000 |
| Prefetch Requests | 500,000 | 600,000 |
| Rows Processed | 2,000,000 | 8,500,000 |
| COMMITs | 1 | 1 |
Analysis:
- CPU increased 4.7x (450 to 2,100 sec)
- Getpages increased 4.8x (120M to 580M)
- Sync I/O increased 5.2x (25M to 130M)
- Rows processed increased 4.25x (2M to 8.5M)
- Prefetch barely changed (500K to 600K)
- COMMIT count: still 1 (the job commits only at the end)
The getpage and row increases are roughly proportional to each other, suggesting that the data volume has grown ~4x and the access paths are proportionally scanning more data.
CLMSTATS Accounting Trace:
Plan: CLMSTATS, Occurrences: 1

| Metric | 18 Months Ago | Now |
|---|---|---|
| Class 2 Elapsed | 1,800 sec | 7,200 sec |
| Class 2 CPU | 300 sec | 3,800 sec |
| I/O Suspensions | 1,200 sec | 2,800 sec |
| Lock Suspensions | 100 sec | 200 sec |
| Getpages | 80,000,000 | 950,000,000 |
| Sync I/O | 15,000,000 | 55,000,000 |
| Rows Processed | 5,000,000 | 120,000,000 |
| COMMITs | 5 | 5 |
Analysis:
- Getpages increased 11.9x, far more than the 4x data growth
- Rows processed increased 24x: something fundamentally changed in the access path
- CPU increased 12.7x, consistent with the getpage explosion
- The job is CPU-bound, with I/O as a secondary factor
The disproportionate increase in getpages and rows processed (relative to data growth) signals an access path problem.
PRMSYNC Accounting Trace:
Plan: PRMSYNC, Occurrences: 1

| Metric | 18 Months Ago | Now |
|---|---|---|
| Class 2 Elapsed | 2,700 sec | 5,400 sec |
| Class 2 CPU | 200 sec | 400 sec |
| I/O Suspensions | 2,200 sec | 4,500 sec |
| Lock Suspensions | 150 sec | 300 sec |
| Getpages | 50,000,000 | 100,000,000 |
| Sync I/O | 20,000,000 | 42,000,000 |
| Prefetch Requests | 2,000,000 | 2,200,000 |
| COMMITs | 100 | 100 |
Analysis:
- Everything roughly doubled, proportional to data growth
- I/O-bound: I/O suspension accounts for 83% of elapsed time (4,500 of 5,400 sec)
- Prefetch ratio is low: 2.2M prefetch requests vs. 42M sync I/Os, only about 5% prefetch
- This job should benefit from sequential prefetch but is doing mostly synchronous I/O
Step 3: Root Cause Analysis
POLRENEW — Data Growth Without COMMIT Optimization:
The POLICIES table grew from 5 million to 8 million rows. The job's purpose is to process policies expiring in the next 30 days, but it locates them by scanning all policies of each type; rows processed grew from 2M to 8.5M (a new product line added 6M policies with monthly renewal cycles).
The single-COMMIT pattern means the job holds all locks from start to finish. With 8.5M rows processed, this accumulates massive lock counts and prevents any concurrent access to the POLICIES tablespace.
CLMSTATS — Access Path Degradation:
Running EXPLAIN on the CLMSTATS main query revealed the problem:
Current access path:
1. TBSCAN on CLAIM_DETAILS (tablespace scan)
2. NLJOIN to CLAIMS on CLAIM_ID
3. SORT for GROUP BY
Eighteen months ago, the same query used:
Previous access path:
1. IXSCAN on IX_CLMDET_STATUS for CLAIM_DETAILS
2. NLJOIN to CLAIMS on CLAIM_ID
3. SORT for GROUP BY
The switch from index scan to table scan occurred because RUNSTATS was run against the CLAIM_DETAILS table after a bulk load that temporarily skewed the data distribution. The optimizer chose a table scan because the statistics indicated low cardinality on the STATUS column. A subsequent RUNSTATS was never run after the data normalized.
With a table scan on 120 million rows instead of an index scan on a filtered subset, getpages exploded.
PRMSYNC — Sequential Prefetch Not Triggering:
The PREMIUMS table was last reorganized 14 months ago. Since then, significant insert and update activity has fragmented the data. DB2's sequential prefetch algorithm does not trigger because the data pages are no longer physically sequential.
Step 4: Implement Fixes
Fix 1: POLRENEW — COMMIT Frequency and Partition Processing
//* Modified POLRENEW to commit every 5,000 rows
//* and process by partition (policy type)
//* Process partition 1: Individual policies
//STEP01 EXEC PGM=POLRENEW,PARM='PTYPE=IND,COMMITFREQ=5000'
//* Process partition 2: Commercial policies
//STEP02 EXEC PGM=POLRENEW,PARM='PTYPE=COM,COMMITFREQ=5000'
//* Process partition 3: Monthly cycle policies (new)
//STEP03 EXEC PGM=POLRENEW,PARM='PTYPE=MON,COMMITFREQ=5000'
The job was also redesigned to process only the rows that actually need renewal (adding a WHERE clause on renewal_date instead of scanning all policies):
-- Before (scans all policies)
SELECT * FROM POLICIES WHERE policy_type = :ptype;
-- After (scans only policies due for renewal)
SELECT * FROM POLICIES
WHERE policy_type = :ptype
AND renewal_date BETWEEN CURRENT DATE AND CURRENT DATE + 30 DAYS
AND status = 'ACTIVE';
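Inside the program, the commit-interval pattern relies on a cursor declared WITH HOLD, so the fetch position survives each COMMIT. A minimal sketch (cursor and host-variable names are illustrative, not taken from the actual program):

```sql
-- Sketch of the per-partition renewal cursor. WITH HOLD keeps the
-- cursor open across COMMITs so the fetch loop can continue.
DECLARE RENEW_CSR CURSOR WITH HOLD FOR
  SELECT policy_id, renewal_date
  FROM POLICIES
  WHERE policy_type = :ptype
    AND renewal_date BETWEEN CURRENT DATE AND CURRENT DATE + 30 DAYS
    AND status = 'ACTIVE';

-- Fetch loop (host logic shown as comments):
--   FETCH RENEW_CSR INTO :policy-id, :renewal-date
--   process the renewal
--   every :commitfreq rows: COMMIT, and record the last
--   committed policy_id as a restart point
```

Committing every 5,000 rows also gives the job a natural restart point after a failure, which the single-COMMIT design could not provide.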
Fix 2: CLMSTATS — RUNSTATS and REBIND
//* Run RUNSTATS to update statistics
//STEP01 EXEC PGM=DSNUTILB
//SYSIN DD *
RUNSTATS TABLESPACE CIGDB.TS_CLMDET
TABLE(CIG.CLAIM_DETAILS)
INDEX(ALL)
UPDATE ALL
REPORT YES
/*
//* Rebind the CLMSTATS package
//STEP02 EXEC PGM=IKJEFT01
//SYSTSIN DD *
DSN SYSTEM(DB2P)
BIND PACKAGE(COLL1) MEMBER(CLMSTATS) -
ACTION(REPLACE) ISOLATION(CS) -
EXPLAIN(YES)
END
/*
After RUNSTATS and REBIND, EXPLAIN confirmed the index scan access path was restored.
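With EXPLAIN(YES) on the REBIND, the chosen access path lands in PLAN_TABLE, where it can be checked directly. A sketch of the verification query (the unqualified PLAN_TABLE name assumes the binder's own table):

```sql
-- ACCESSTYPE 'I' = index access, 'R' = tablespace scan.
-- Expect 'I' with ACCESSNAME = 'IX_CLMDET_STATUS' for CLAIM_DETAILS.
SELECT QUERYNO, TNAME, ACCESSTYPE, ACCESSNAME, MATCHCOLS
FROM PLAN_TABLE
WHERE COLLID = 'COLL1'
  AND PROGNAME = 'CLMSTATS'
ORDER BY QUERYNO, QBLOCKNO, PLANNO;
```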
Fix 3: PRMSYNC — REORG the PREMIUMS Tablespace
//* Online REORG to restore physical sequencing
//STEP01 EXEC PGM=DSNUTILB
//SYSIN DD *
REORG TABLESPACE CIGDB.TS_PREMIUMS
SHRLEVEL CHANGE
LOG YES
KEEPDICTIONARY
/*
//* Follow with RUNSTATS
//STEP02 EXEC PGM=DSNUTILB
//SYSIN DD *
RUNSTATS TABLESPACE CIGDB.TS_PREMIUMS
TABLE(CIG.PREMIUMS)
INDEX(ALL)
UPDATE ALL
/*
Fix 4: Batch Schedule Redesign
The original schedule ran all jobs sequentially. The redesigned schedule runs independent jobs in parallel:
22:00 - 22:30 CLMPURGE (archive old claims)
22:00 - 22:25 RUNSTATS (all tables) [parallel with CLMPURGE — different tables]
22:30 - 23:15 REORG (CLAIMS, POLICIES)
23:15 - 23:25 RUNSTATS on REORG'd tables
23:15 - 23:55 IMGCOPY [parallel with RUNSTATS — uses SHRLEVEL REFERENCE]
23:25 - 00:10 POLRENEW (optimized — now 45 min)
23:55 - 00:25 CLMSTATS (optimized — now 30 min) [parallel with POLRENEW — different tables]
00:10 - 00:55 PRMSYNC (after REORG — now 45 min)
00:55 Batch window complete. Total: 2 hr 55 min.
Step 5: Verification
POLRENEW:
Before optimization: 10,800 sec elapsed, 580M getpages, 1 COMMIT
After optimization: 2,700 sec elapsed, 45M getpages, 1,700 COMMITs
Improvement: 75% reduction in elapsed time
CLMSTATS:
Before optimization: 7,200 sec elapsed, 950M getpages, table scan
After optimization: 1,800 sec elapsed, 85M getpages, index scan
Improvement: 75% reduction in elapsed time
PRMSYNC:
Before optimization: 5,400 sec elapsed, 42M sync I/O, 5% prefetch
After optimization: 2,700 sec elapsed, 8M sync I/O, 85% prefetch
Improvement: 50% reduction in elapsed time
Total Batch Window:
Before: 585 min (9 hr 45 min) — sequential execution
After: 175 min (2 hr 55 min) — parallel execution + optimized jobs
Buffer: 5 hr 5 min of spare capacity in the 8-hour window
Long-Term Remediation
1. Automated RUNSTATS Schedule
Implemented a weekly RUNSTATS on all tables with > 5% data change since last RUNSTATS. This prevents the statistical drift that caused the CLMSTATS access path regression.
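The candidate list can be built from DB2's real-time statistics, which track modifications since the last RUNSTATS. A sketch (the 5% threshold mirrors the policy above; object qualifiers are site-specific):

```sql
-- Tablespaces with more than 5% of rows inserted, updated, or
-- deleted since the last RUNSTATS (real-time statistics).
SELECT DBNAME, NAME, TOTALROWS,
       STATSINSERTS + STATSUPDATES + STATSDELETES AS changes
FROM SYSIBM.SYSTABLESPACESTATS
WHERE TOTALROWS > 0
  AND (STATSINSERTS + STATSUPDATES + STATSDELETES) > 0.05 * TOTALROWS;
```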
2. REORG Monitoring
Created a monthly REORG evaluation that checks CLUSTERRATIO and FARINDREF metrics. When CLUSTERRATIO drops below 80%, the table is automatically added to the next REORG cycle.
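The evaluation can read the catalog directly; a sketch of the CLUSTERRATIO check (CLUSTERRATIOF stores the ratio as a fraction, so 0.80 corresponds to 80%):

```sql
-- Clustering indexes whose cluster ratio has fallen below 80%;
-- their tablespaces are candidates for the next REORG cycle.
SELECT CREATOR, NAME, TBNAME, CLUSTERRATIOF
FROM SYSIBM.SYSINDEXES
WHERE CLUSTERING = 'Y'
  AND CLUSTERRATIOF < 0.80;
```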
3. Batch Runtime Trending
Implemented a batch runtime tracking table:
CREATE TABLE batch_runtime_history (
job_name VARCHAR(20),
run_date DATE,
elapsed_sec INTEGER,
cpu_sec INTEGER,
getpages BIGINT,
commits INTEGER
);
A weekly report compares each job's runtime trend against its 90-day moving average. If a job exceeds 120% of its moving average, it is flagged for investigation.
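The weekly check can be expressed against batch_runtime_history directly. A sketch, assuming at most one run per job per day:

```sql
-- Flag jobs whose most recent run exceeds 120% of their own
-- 90-day average elapsed time.
SELECT h.job_name, h.run_date, h.elapsed_sec
FROM batch_runtime_history h
WHERE h.run_date = (SELECT MAX(run_date)
                    FROM batch_runtime_history m
                    WHERE m.job_name = h.job_name)
  AND h.elapsed_sec > 1.2 *
      (SELECT AVG(elapsed_sec)
       FROM batch_runtime_history a
       WHERE a.job_name = h.job_name
         AND a.run_date >= h.run_date - 90 DAYS);
```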
4. Access Path Baseline
EXPLAIN is run on all critical batch SQL after every REBIND, and the access paths are stored in a baseline table. A weekly comparison detects access path regressions before they cause runtime problems.
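The comparison can be a join between the baseline copy and the freshly populated PLAN_TABLE. A sketch; PLAN_BASELINE is a hypothetical table with the same layout as PLAN_TABLE, loaded after each verified REBIND:

```sql
-- Report any statement whose access type differs from its baseline,
-- e.g. an index scan ('I') degrading to a tablespace scan ('R').
SELECT c.PROGNAME, c.QUERYNO, c.TNAME,
       b.ACCESSTYPE AS baseline_access,
       c.ACCESSTYPE AS current_access
FROM PLAN_TABLE c
JOIN PLAN_BASELINE b              -- hypothetical baseline copy
  ON  b.PROGNAME = c.PROGNAME
  AND b.QUERYNO  = c.QUERYNO
  AND b.QBLOCKNO = c.QBLOCKNO
  AND b.PLANNO   = c.PLANNO
WHERE b.ACCESSTYPE <> c.ACCESSTYPE;
```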
Lessons Learned
- Batch runtime growth is rarely sudden. POLRENEW and CLMSTATS grew gradually over 18 months. Monthly trending would have caught the degradation when it was a 20% increase rather than a 300% increase.
- A single COMMIT in a batch job is a design flaw. POLRENEW processed 8.5 million rows with one COMMIT. This accumulates locks, prevents concurrent access, and makes restart-after-failure impossible.
- RUNSTATS after bulk operations is non-negotiable. The CLMSTATS access path regression was caused by a single stale RUNSTATS after a bulk load. Automated RUNSTATS scheduling prevents this.
- REORG frequency must match modification frequency. The PREMIUMS table was reorganized 14 months ago but is modified daily. A REORG evaluation should run monthly.
- Parallel batch execution is the highest-leverage optimization. The schedule redesign, combined with the per-job fixes, reduced the window from 9:45 to 2:55, a 70% improvement; even without the job-level tuning, running independent jobs in parallel would have reclaimed several hours.
Discussion Questions
- CIG's batch window grew from 4:50 to 9:45 over 18 months — a 100% increase. If data grows at the same rate, when will the optimized batch window (2:55) exceed the 8-hour window again? What proactive measures should CIG take?
- The POLRENEW fix added a WHERE clause on renewal_date. This changes the job's semantics: previously it scanned all policies; now it scans only those due for renewal. Under what circumstances could this narrower scan miss a policy that should be renewed?
- REORG with SHRLEVEL CHANGE allows concurrent read/write access during reorganization. What are the costs of online REORG compared to offline REORG? When would you choose SHRLEVEL NONE instead?
- The parallel batch schedule requires CLMPURGE and RUNSTATS to access different tables. What would happen if a future schema change caused them to share a table? How would you detect this conflict?