Case Study 1: CNB's REORG Strategy for High-Volume Tables
Background
City National Bank's core banking platform processes 2.5 million account transactions daily through DB2 on z/OS. The ACCT_TRANSACTION table — 180 GB across 24 monthly partitions — is the heart of the system. Every deposit, withdrawal, transfer, payment, and fee generates a row. Every balance inquiry reads from it. Every regulatory report scans it.
For three years, the REORG strategy was simple: REORG all 24 partitions every Sunday night, offline, SHRLEVEL NONE. It worked when the table was 40 GB. At 180 GB, it's a crisis.
The Problem
Kwame Asante got the call on a Monday morning. The Sunday REORG had overrun the batch window by 2 hours and 17 minutes. The online banking system came up late. Customers couldn't see their weekend transactions. The CTO wanted answers.
Kwame pulled the numbers:
| Metric | Value |
|---|---|
| Total REORG elapsed time (24 partitions) | 9 hours 42 minutes |
| Batch window available | 7 hours 30 minutes |
| Sort work DASD consumed | 420 GB |
| Log data generated | 540 GB |
| Active log near-full events | 3 |
| CPU time consumed | 847 CPU-minutes |
The problem wasn't just elapsed time. The log volume had triggered three near-full events, each requiring archive log offload before processing could continue. The sort work had consumed nearly all available temporary DASD, causing other batch jobs to wait for space.
Lisa Park analyzed the REORG metrics for each partition:
| Partition (Month) | CLUSTERRATIOF | PERCDROP | Daily Inserts | REORG Benefit |
|---|---|---|---|---|
| Current month (Mar 2026) | 0.58 | 0.12 | 1,800,000 | High |
| Prior month (Feb 2026) | 0.71 | 0.08 | 450,000 | Medium |
| 2 months ago (Jan 2026) | 0.82 | 0.05 | 50,000 | Low |
| 3 months ago (Dec 2025) | 0.89 | 0.03 | 12,000 | Minimal |
| 4-12 months ago | 0.93-0.98 | 0.01-0.02 | < 5,000 | None |
| 13-24 months ago | 0.97-0.99 | 0.00-0.01 | < 100 | None |
The data was clear: 18 of the 24 partitions didn't need REORG at all. Roughly 70% of the REORG window was being spent on partitions with near-perfect clustering.
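The benefit column follows mechanically from the two metrics. As a rough illustration, the rating can be sketched in Python; the threshold boundaries are inferred from the table above, not taken from CNB's actual tooling:

```python
def reorg_benefit(clusterratiof: float, percdrop: float) -> str:
    """Rate the payoff of a REORG from partition statistics.

    Threshold boundaries are illustrative, chosen to reproduce the
    benefit column in the partition table above.
    """
    if clusterratiof < 0.70 or percdrop > 0.10:
        return "High"       # badly disorganized; REORG pays off now
    if clusterratiof < 0.80 or percdrop > 0.06:
        return "Medium"
    if clusterratiof < 0.85:
        return "Low"
    if clusterratiof < 0.90:
        return "Minimal"
    return "None"           # near-perfect clustering; pure overhead

# March 2026 snapshot from the table:
print(reorg_benefit(0.58, 0.12))   # current month -> High
print(reorg_benefit(0.97, 0.00))   # 13-24 months  -> None
```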
Analysis
Lisa presented the analysis to Kwame and Rob Tanner:
"We're REORGing partitions that don't need it." Partitions older than 3 months have CLUSTERRATIOF above 0.89. The clustering index on TRAN_DATE naturally keeps these partitions well-organized because they receive almost no new inserts. REORGing them is pure overhead.
"We're running offline when we don't need to." SHRLEVEL NONE blocks all access. Even if they reduced the REORG to 6 partitions, the offline window still impacts the online system. SHRLEVEL CHANGE keeps applications reading and writing through nearly the entire REORG — the shadow copy is built and the log applied while work continues — with only seconds of unavailability during the SWITCH phase.
"We're not taking advantage of inline operations." The current process runs REORG, then a separate COPY job, then a separate RUNSTATS job. Three separate utility passes over the same data. Inline COPY and RUNSTATS within the REORG would eliminate two full passes.
"We have no threshold monitoring." The decision to REORG is calendar-based, not metric-based. Some weeks the current partition's CLUSTERRATIOF drops to 0.55; other weeks it's still at 0.75. A fixed weekly schedule is either too frequent or not frequent enough.
Rob raised a concern: "What about the mapping tables for SHRLEVEL CHANGE? We don't have them, and the online REORG will generate even more log data than offline."
Lisa had the numbers: "Online REORG generates roughly 2.5x the log volume of offline for the same data volume. But since we're REORGing 6 partitions instead of 24, total log volume drops from 540 GB to about 135 GB. That's a 75% reduction."
The New Strategy
Kwame approved a three-tier strategy:
Tier 1: Current Month Partition — Weekly, Metric-Gated
The current month's partition gets REORG treatment weekly, but only if metrics warrant it:
```sql
-- Threshold check: run before the REORG decision.
-- CLUSTERRATIOF is a per-partition index statistic, so it comes from
-- SYSINDEXPART; PERCDROP comes from SYSTABLEPART.
-- XTRANDT1 stands in for the clustering index on TRAN_DATE.
SELECT TP.PARTITION, IP.CLUSTERRATIOF, TP.PERCDROP,
       TP.CARDF, TP.NEARINDREF, TP.FARINDREF
FROM   SYSIBM.SYSTABLEPART TP,
       SYSIBM.SYSINDEXPART IP
WHERE  TP.DBNAME = 'DBCNB01'
  AND  TP.TSNAME = 'TSTRANAC'
  AND  TP.PARTITION = :CURRENT_PART
  AND  IP.IXCREATOR = 'CNBADM'
  AND  IP.IXNAME = 'XTRANDT1'
  AND  IP.PARTITION = TP.PARTITION
  AND  (IP.CLUSTERRATIOF < 0.75 OR TP.PERCDROP > 0.10);
```
If the query returns a row, REORG runs. If not, it skips.
REORG specification:
```sql
REORG TABLESPACE DBCNB01.TSTRANAC
  PART :CURRENT_PART
  SHRLEVEL CHANGE
  MAPPINGTABLE CNBADM.MAP_TSTRANAC
  SORTDATA
  SORTKEYS
  STATISTICS
    TABLE (CNBADM.ACCT_TRANSACTION)
      COLGROUP (ACCT_ID, TRAN_DATE) FREQVAL COUNT 20
      COLGROUP (TRAN_TYPE, POST_STATUS) FREQVAL COUNT 10
      COLGROUP (TRAN_TYPE) FREQVAL COUNT 10
    INDEX (ALL)
  COPYDDN (SYSCOPY1, SYSCOPY2)
  DRAIN_WAIT 120
  RETRY 5
  RETRY_DELAY 30
  TIMEOUT TERM
```
Key decisions:
- DRAIN_WAIT 120 — Two minutes. The nightly batch processes long-running units of work, and 120 seconds accommodates most batch commit intervals.
- RETRY 5 with RETRY_DELAY 30 — Five retries, 30 seconds apart. Counting the initial drain, DB2 spends at least 4.5 minutes trying to get claimers off the partition, and each retry can then wait up to another full DRAIN_WAIT. If a batch job can't commit in that window, something is wrong with the batch job.
- TIMEOUT TERM — If the drain fails, terminate the REORG. The batch job is more important than the REORG; reschedule for the next night.
- Inline RUNSTATS — COLGROUP statistics on the two most common predicate combinations.
- Inline COPY — dual copies to DASD and tape.
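The interaction between DRAIN_WAIT, RETRY, and RETRY_DELAY determines how long a REORG can spend trying to drain. A back-of-envelope Python sketch, under the assumption that each retry waits RETRY_DELAY and may then block for the full DRAIN_WAIT again (worth confirming against your DB2 level's utility documentation):

```python
def drain_window_seconds(drain_wait: int, retry: int, retry_delay: int):
    """Return (minimum, worst_case) seconds a REORG may spend draining.

    minimum:    initial drain plus the delays between retries.
    worst_case: assumes every one of the (1 + retry) attempts blocks
                for the full drain_wait -- an assumption about utility
                behavior, not a documented formula.
    """
    minimum = drain_wait + retry * retry_delay
    worst_case = (1 + retry) * drain_wait + retry * retry_delay
    return minimum, worst_case

# CNB's settings: DRAIN_WAIT 120, RETRY 5, RETRY_DELAY 30
lo, hi = drain_window_seconds(120, 5, 30)
print(lo / 60, hi / 60)   # 4.5 minutes minimum, 14.5 worst case
```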
Tier 2: Prior 1-3 Month Partitions — Biweekly, Metric-Gated
These partitions receive moderate update activity (corrections, adjustments). REORG runs every two weeks, and only if CLUSTERRATIOF drops below 0.80 or PERCDROP exceeds 0.08.
Same REORG parameters as Tier 1, but with the specific partition numbers for months M-1 through M-3.
Tier 3: Historical Partitions (4+ Months) — Monthly Assessment, Rarely Executed
A monthly monitoring job checks all historical partitions. REORG runs only if CLUSTERRATIOF drops below 0.85 — which happens perhaps once per quarter for a single partition, usually after a regulatory correction batch.
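The three tiers reduce to one small decision function. A sketch of the policy in Python (cadences and thresholds as stated above; partition age in months, with 0 meaning the current month):

```python
def reorg_decision(age_months: int, clusterratiof: float,
                   percdrop: float):
    """Return (check_cadence, reorg_now) under CNB's three-tier policy.

    Tier 1 (current month):  weekly check, CR < 0.75 or PERCDROP > 0.10
    Tier 2 (1-3 months old): biweekly,     CR < 0.80 or PERCDROP > 0.08
    Tier 3 (4+ months old):  monthly,      CR < 0.85
    """
    if age_months == 0:
        return "weekly", clusterratiof < 0.75 or percdrop > 0.10
    if age_months <= 3:
        return "biweekly", clusterratiof < 0.80 or percdrop > 0.08
    return "monthly", clusterratiof < 0.85

print(reorg_decision(0, 0.58, 0.12))    # ('weekly', True)
print(reorg_decision(2, 0.82, 0.05))    # ('biweekly', False)
print(reorg_decision(12, 0.95, 0.01))   # ('monthly', False)
```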
Infrastructure Changes
Mapping table creation. Lisa created the mapping table and its tablespace:
```sql
CREATE TABLESPACE TSMAP01
  IN DBCNB01
  USING STOGROUP SGCNB01
  SEGSIZE 64
  BUFFERPOOL BP2
  CLOSE NO;

-- Column layout required for a REORG mapping table (10-byte LRSN
-- format); REORG rejects a mapping table with the wrong shape.
CREATE TABLE CNBADM.MAP_TSTRANAC
      (TYPE        CHAR(1)  NOT NULL,
       SOURCE_RID  CHAR(5)  NOT NULL,
       TARGET_XRID CHAR(9)  NOT NULL,
       LRSN        CHAR(10) NOT NULL)
  IN DBCNB01.TSMAP01
  CCSID EBCDIC;

-- The mapping table must also have a unique index over its columns.
CREATE UNIQUE INDEX CNBADM.XMAP_TSTRANAC
  ON CNBADM.MAP_TSTRANAC
     (SOURCE_RID ASC, TYPE, TARGET_XRID, LRSN)
  USING STOGROUP SGCNB01;
```
Log capacity. The operations team increased the active log allocation from 40 GB to 60 GB and added a second offload device. Even with reduced REORG volume, the margin of safety was important.
Monitoring automation. Rob built a COBOL program (UTLMONPR) that runs daily at 05:30, queries the catalog statistics for all 24 partitions, and writes REORG/RUNSTATS/COPY recommendations to an output dataset. The scheduling system (CA-7) reads this dataset and conditionally submits the appropriate utility jobs for the next batch window.
```cobol
IDENTIFICATION DIVISION.
PROGRAM-ID. UTLMONPR.
*
* DB2 Utility Monitoring - Threshold Check
* Joins SYSTABLEPART and SYSINDEXPART for REORG/RUNSTATS indicators
* Writes recommendations to UTILREC output file
*
...
    EXEC SQL
        SELECT TP.PARTITION,
               IP.CLUSTERRATIOF,
               TP.PERCDROP,
               TP.STATSTIME,
               TP.CARDF
          INTO :WS-PARTITION,
               :WS-CLUSTER-RATIO,
               :WS-PERC-DROP,
               :WS-STATS-TIME,
               :WS-CARDINALITY
          FROM SYSIBM.SYSTABLEPART TP,
               SYSIBM.SYSINDEXPART IP
         WHERE TP.DBNAME = :WS-DBNAME
           AND TP.TSNAME = :WS-TSNAME
           AND TP.PARTITION = :WS-PART-NUM
           AND IP.IXCREATOR = :WS-IXCREATOR
           AND IP.IXNAME = :WS-IXNAME
           AND IP.PARTITION = TP.PARTITION
    END-EXEC.
```
Post-REORG REBIND
Every REORG job has a successor job that rebinds the five packages accessing ACCT_TRANSACTION:
```jcl
//REBIND   EXEC PGM=IKJEFT01,DYNAMNBR=20
//SYSTSPRT DD SYSOUT=*
//SYSTSIN  DD *
  DSN SYSTEM(DB2P)
  REBIND PACKAGE (CNBCOLL.PGMTRAN1.(*)) EXPLAIN(YES)
  REBIND PACKAGE (CNBCOLL.PGMTRAN2.(*)) EXPLAIN(YES)
  REBIND PACKAGE (CNBCOLL.PGMBAL01.(*)) EXPLAIN(YES)
  REBIND PACKAGE (CNBCOLL.PGMINQ01.(*)) EXPLAIN(YES)
  REBIND PACKAGE (CNBCOLL.PGMRPT01.(*)) EXPLAIN(YES)
  END
/*
```
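Because the successor job is purely mechanical, the SYSTSIN deck can be generated from the package list rather than maintained by hand. A small sketch (subsystem, collection, and package names taken from the JCL above):

```python
PACKAGES = ["PGMTRAN1", "PGMTRAN2", "PGMBAL01", "PGMINQ01", "PGMRPT01"]

def rebind_deck(subsystem: str, collection: str, packages: list) -> str:
    """Build the TSO input that rebinds every listed package,
    mirroring the SYSTSIN deck in the JCL above."""
    lines = [f"DSN SYSTEM({subsystem})"]
    lines += [f"REBIND PACKAGE ({collection}.{pkg}.(*)) EXPLAIN(YES)"
              for pkg in packages]
    lines.append("END")
    return "\n".join(lines)

print(rebind_deck("DB2P", "CNBCOLL", PACKAGES))
```

When a sixth package starts touching ACCT_TRANSACTION, the list is the only thing that changes.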
Results
After four weeks on the new strategy:
| Metric | Before | After | Change |
|---|---|---|---|
| Partitions REORGed weekly | 24 | 2-4 | -83% |
| REORG elapsed time | 9h 42m | 1h 15m | -87% |
| Log volume (REORG) | 540 GB | 95 GB | -82% |
| Sort work consumed | 420 GB | 55 GB | -87% |
| CPU time | 847 min | 128 min | -85% |
| Active log near-full events | 3/week | 0 | -100% |
| Batch window overruns | 1-2/month | 0 | -100% |
| Average CLUSTERRATIOF (current partition) | 0.58 | 0.72 | +24% |
The average CLUSTERRATIOF for the current partition actually improved, despite fewer REORGs. The reason: with the metric-gated approach, REORG now runs when it's most needed (when clustering drops to 0.75) rather than on a fixed schedule that might catch it at 0.90 (wasted REORG) or at 0.55 (waited too long).
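The mechanism is easy to reproduce in a toy model: clustering decays as inserts arrive, a REORG restores it, and the two policies differ only in when the restore fires. Everything below — the decay sequence and the restore level — is invented for illustration; only the 0.75 gate comes from the case study.

```python
import itertools

def simulate(days: int, policy: str, decays,
             restored: float = 0.95, threshold: float = 0.75):
    """Return (average, minimum) clustering ratio over the run.

    'calendar' restores every 7th day regardless of state; 'gated'
    restores only once the ratio falls below threshold. Decay values
    model daily insert churn and are purely illustrative.
    """
    ratio, history = restored, []
    decay = itertools.cycle(decays)
    for day in range(1, days + 1):
        ratio = max(0.0, ratio - next(decay))
        if policy == "calendar" and day % 7 == 0:
            ratio = restored
        elif policy == "gated" and ratio < threshold:
            ratio = restored
        history.append(ratio)
    return sum(history) / days, min(history)

# Bursty workload: three quiet weeks, then a heavy correction week.
decays = [0.007] * 21 + [0.057] * 7
for policy in ("calendar", "gated"):
    avg, low = simulate(91, policy, decays)
    print(f"{policy:8s} avg={avg:.2f} min={low:.2f}")
```

In this model the fixed schedule lets the heavy week drag the ratio well below the gate before the scheduled day arrives, while the gated policy bounds how far clustering can fall — the same effect CNB saw in production.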
Lessons Learned
Lesson 1: Measure before you REORG. The biggest efficiency gain came from not REORGing partitions that didn't need it. Calendar-based scheduling is lazy. Metric-based scheduling is smart.
Lesson 2: Online REORG is not always more expensive. The per-partition cost of online REORG (SHRLEVEL CHANGE) is higher than offline. But the total cost is lower when you REORG fewer partitions and eliminate the separate COPY and RUNSTATS jobs through inline execution.
Lesson 3: The COBOL developer is part of the solution. Rob's monitoring program (UTLMONPR) replaced a manual DBA process that ran "when someone remembered." The developer who understands the data patterns is the right person to design the monitoring logic.
Lesson 4: DRAIN_WAIT and RETRY settings matter. The initial deployment used DRAIN_WAIT 30 with RETRY 2. The first week, two REORGs terminated because a batch program had a 45-second commit interval. Increasing DRAIN_WAIT to 120 and RETRY to 5 eliminated the problem. Know your batch commit patterns.
Lesson 5: Inline operations are free performance. Inline COPY and RUNSTATS add negligible time to the REORG (the data is already being read) and eliminate two separate utility passes. There is no reason to run them separately unless you have a specific scheduling constraint.
Discussion Questions
1. CNB's strategy is metric-gated: REORG runs only when thresholds are breached. What are the risks of this approach compared to a fixed schedule? How would you mitigate them?
2. The DRAIN_WAIT and RETRY settings were adjusted after the first week. How would you determine the optimal values for your own environment without trial and error?
3. Rob's monitoring program runs daily. Could real-time monitoring (checking metrics continuously) be better? What would be the trade-offs?
4. The mapping table for online REORG resides in the same database as the production data. Is this a good practice? What would happen if the mapping table's tablespace had a media failure during REORG?
5. CNB chose TIMEOUT TERM, which terminates the REORG if the drain fails. Under what circumstances might TIMEOUT ABEND (abend the utility, preserving diagnostics) be the better choice?