Case Study 1: CNB's REORG Strategy for High-Volume Tables

Background

City National Bank's core banking platform processes 2.5 million account transactions daily through DB2 on z/OS. The ACCT_TRANSACTION table — 180 GB across 24 monthly partitions — is the heart of the system. Every deposit, withdrawal, transfer, payment, and fee generates a row. Every balance inquiry reads from it. Every regulatory report scans it.

For three years, the REORG strategy was simple: REORG all 24 partitions every Sunday night, offline, SHRLEVEL NONE. It worked when the table was 40 GB. At 180 GB, it's a crisis.

The Problem

Kwame Asante got the call on a Monday morning. The Sunday REORG had overrun the batch window by 2 hours and 17 minutes. The online banking system came up late. Customers couldn't see their weekend transactions. The CTO wanted answers.

Kwame pulled the numbers:

| Metric | Value |
|---|---|
| Total REORG elapsed time (24 partitions) | 9 hours 42 minutes |
| Batch window available | 7 hours 30 minutes |
| Sort work DASD consumed | 420 GB |
| Log data generated | 540 GB |
| Active log near-full events | 3 |
| CPU time consumed | 847 CPU-minutes |

The problem wasn't just elapsed time. The log volume had triggered three near-full events, each requiring archive log offload before processing could continue. The sort work had consumed nearly all available temporary DASD, causing other batch jobs to wait for space.

Lisa Park analyzed the REORG metrics for each partition:

| Partition (Month) | CLUSTERRATIOF | PERCDROP | Daily Inserts | REORG Benefit |
|---|---|---|---|---|
| Current month (Mar 2026) | 0.58 | 0.12 | 1,800,000 | High |
| Prior month (Feb 2026) | 0.71 | 0.08 | 450,000 | Medium |
| 2 months ago (Jan 2026) | 0.82 | 0.05 | 50,000 | Low |
| 3 months ago (Dec 2025) | 0.89 | 0.03 | 12,000 | Minimal |
| 4-12 months ago | 0.93-0.98 | 0.01-0.02 | < 5,000 | None |
| 13-24 months ago | 0.97-0.99 | 0.00-0.01 | < 100 | None |

The data was clear: 18 of the 24 partitions didn't need REORG at all. Roughly 70% of the REORG window was being spent on partitions with near-perfect clustering.

Analysis

Lisa presented the analysis to Kwame and Rob Tanner:

"We're REORGing partitions that don't need it." Partitions older than 3 months have CLUSTERRATIOF above 0.89. The clustering index on TRAN_DATE naturally keeps these partitions well-organized because they receive almost no new inserts. REORGing them is pure overhead.

"We're running offline when we don't need to." SHRLEVEL NONE blocks all access. Even if they reduced the REORG to 6 partitions, the offline window still impacts the online system. SHRLEVEL CHANGE would keep applications running during the BUILD phase, with only seconds of unavailability during the SWITCH phase.

"We're not taking advantage of inline operations." The current process runs REORG, then a separate COPY job, then a separate RUNSTATS job. Three separate utility passes over the same data. Inline COPY and RUNSTATS within the REORG would eliminate two full passes.

"We have no threshold monitoring." The decision to REORG is calendar-based, not metric-based. Some weeks the current partition's CLUSTERRATIOF drops to 0.55; other weeks it's still at 0.75. A fixed weekly schedule is either too frequent or not frequent enough.

Rob raised a concern: "What about the mapping tables for SHRLEVEL CHANGE? We don't have them, and the online REORG will generate even more log data than offline."

Lisa had the numbers: "Online REORG generates roughly 2.5x the log volume of offline for the same data volume. But since we're REORGing 6 partitions instead of 24, total log volume drops from 540 GB to about 135 GB. That's a 75% reduction."

The New Strategy

Kwame approved a three-tier strategy:

Tier 1: Current Month Partition — Weekly, Metric-Gated

The current month's partition gets REORG treatment weekly, but only if metrics warrant it:

-- Threshold check: run before REORG decision.
-- PERCDROP and CARDF come from SYSTABLEPART; the partition-level
-- CLUSTERRATIOF comes from SYSINDEXSTATS for the clustering index
-- (shown here as CNBADM.IXTRANAC1, the index on TRAN_DATE)
SELECT TP.PARTITION, IXS.CLUSTERRATIOF, TP.PERCDROP,
       TP.CARDF
FROM SYSIBM.SYSTABLEPART TP,
     SYSIBM.SYSINDEXSTATS IXS
WHERE TP.DBNAME = 'DBCNB01'
  AND TP.TSNAME = 'TSTRANAC'
  AND TP.PARTITION = :CURRENT_PART
  AND IXS.OWNER = 'CNBADM'
  AND IXS.NAME  = 'IXTRANAC1'
  AND IXS.PARTITION = TP.PARTITION
  AND (IXS.CLUSTERRATIOF < 0.75 OR TP.PERCDROP > 0.10);

If the query returns a row, REORG runs. If not, it skips.

REORG specification:

  REORG TABLESPACE DBCNB01.TSTRANAC
    PART :CURRENT_PART
    SHRLEVEL CHANGE
    MAPPINGTABLE CNBADM.MAP_TSTRANAC
    SORTDATA
    STATISTICS
    TABLE (CNBADM.ACCT_TRANSACTION)
      COLGROUP (ACCT_ID, TRAN_DATE) FREQVAL COUNT 20
      COLGROUP (TRAN_TYPE, POST_STATUS) FREQVAL COUNT 10
      COLGROUP (TRAN_TYPE) FREQVAL COUNT 10
    INDEX (ALL)
    COPYDDN (SYSCOPY1, SYSCOPY2)
    DRAIN_WAIT 120
    RETRY 5
    RETRY_DELAY 30
    TIMEOUT TERM

Key decisions:

- DRAIN_WAIT 120: two minutes. The nightly batch processes long-running units of work, and 120 seconds accommodates most batch commit intervals.
- RETRY 5 with RETRY_DELAY 30: five retries, 30 seconds apart. Counting the two-minute drain wait on each attempt, the full drain sequence can span almost 15 minutes before the REORG gives up. If a batch job can't commit in that window, something is wrong with the batch job.
- TIMEOUT TERM: if the drain fails, terminate the REORG. The batch job is more important than the REORG; reschedule for the next night.
- Inline RUNSTATS with COLGROUP on the two most common predicate combinations.
- Inline COPY: dual copies to DASD and tape.
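The worst case for that drain sequence is easy to compute. A minimal sketch (Python, illustrative only), assuming each attempt can wait up to the full DRAIN_WAIT before the next retry:

```python
def max_drain_window(drain_wait: int, retry: int, retry_delay: int) -> int:
    """Worst-case seconds before REORG abandons the drain.

    Each of the (1 + retry) attempts can wait up to drain_wait
    seconds; retry_delay seconds elapse between attempts.
    (Simplified model; actual utility timing varies.)
    """
    attempts = 1 + retry
    return attempts * drain_wait + retry * retry_delay

# CNB's settings: DRAIN_WAIT 120, RETRY 5, RETRY_DELAY 30
print(max_drain_window(120, 5, 30))  # prints 870 (14.5 minutes worst case)
```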

Tier 2: Prior 1-3 Month Partitions — Biweekly, Metric-Gated

These partitions receive moderate update activity (corrections, adjustments). REORG runs biweekly only if CLUSTERRATIOF drops below 0.80 or PERCDROP exceeds 0.08.

Same REORG parameters as Tier 1, but with the specific partition numbers for months M-1 through M-3.

Tier 3: Historical Partitions (4+ Months) — Monthly Assessment, Rarely Executed

A monthly monitoring job checks all historical partitions. REORG runs only if CLUSTERRATIOF drops below 0.85 — which happens perhaps once per quarter for a single partition, usually after a regulatory correction batch.
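The three tiers collapse into one gating rule per partition. A sketch in Python (hypothetical helper; thresholds taken from the tiers above, while scheduling frequency stays with the job scheduler):

```python
def reorg_recommended(age_months: int, clusterratiof: float,
                      percdrop: float) -> bool:
    """Apply CNB's tiered REORG thresholds to one partition.

    age_months: partition age relative to the current month
    (0 = current month). Illustrative only; the real decision
    runs off catalog statistics in the monitoring job.
    """
    if age_months == 0:                # Tier 1: current month
        return clusterratiof < 0.75 or percdrop > 0.10
    if 1 <= age_months <= 3:           # Tier 2: prior 1-3 months
        return clusterratiof < 0.80 or percdrop > 0.08
    return clusterratiof < 0.85        # Tier 3: historical partitions

# Figures from the partition analysis table:
print(reorg_recommended(0, 0.58, 0.12))   # current month -> True
print(reorg_recommended(2, 0.82, 0.05))   # Jan 2026      -> False
print(reorg_recommended(10, 0.95, 0.01))  # historical    -> False
```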

Infrastructure Changes

Mapping table creation. Lisa created the mapping table, its tablespace, and the unique index that SHRLEVEL CHANGE requires on the mapping table:

CREATE TABLESPACE TSMAP01
  IN DBCNB01
  USING STOGROUP SGCNB01
  SEGSIZE 64
  BUFFERPOOL BP2
  CLOSE NO;

CREATE TABLE CNBADM.MAP_TSTRANAC
  (TYPE        CHAR(1)  NOT NULL,
   SOURCE_RID  CHAR(9)  NOT NULL,
   TARGET_XRID CHAR(11) NOT NULL,
   LRSN        CHAR(10) NOT NULL)
  IN DBCNB01.TSMAP01
  CCSID EBCDIC;

CREATE UNIQUE INDEX CNBADM.XMAP_TSTRANAC
  ON CNBADM.MAP_TSTRANAC
  (SOURCE_RID ASC, TYPE, TARGET_XRID, LRSN);

Log capacity. The operations team increased the active log allocation from 40 GB to 60 GB and added a second offload device. Even with reduced REORG volume, the margin of safety was important.

Monitoring automation. Rob built a COBOL program (UTLMONPR) that runs daily at 05:30, queries the catalog statistics for all 24 partitions, and writes REORG/RUNSTATS/COPY recommendations to an output dataset. The scheduling system (CA-7) reads this dataset and conditionally submits the appropriate utility jobs for the next batch window.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. UTLMONPR.
      *
      * DB2 Utility Monitoring - Threshold Check
      * Queries SYSTABLEPART for REORG/RUNSTATS indicators
      * Writes recommendations to UTILREC output file
      *
       ...
      *    CLUSTERRATIOF (PER PARTITION) COMES FROM SYSINDEXSTATS
      *    FOR THE CLUSTERING INDEX; PERCDROP FROM SYSTABLEPART
           EXEC SQL
             SELECT TP.PARTITION,
                    IXS.CLUSTERRATIOF,
                    TP.PERCDROP,
                    TP.STATSTIME,
                    TP.CARDF
               INTO :WS-PARTITION,
                    :WS-CLUSTER-RATIO,
                    :WS-PERC-DROP,
                    :WS-STATS-TIME,
                    :WS-CARDINALITY
               FROM SYSIBM.SYSTABLEPART TP,
                    SYSIBM.SYSINDEXSTATS IXS
              WHERE TP.DBNAME     = :WS-DBNAME
                AND TP.TSNAME     = :WS-TSNAME
                AND TP.PARTITION  = :WS-PART-NUM
                AND IXS.OWNER     = :WS-IX-OWNER
                AND IXS.NAME      = :WS-IX-NAME
                AND IXS.PARTITION = TP.PARTITION
           END-EXEC

Post-REORG REBIND

Every REORG job has a successor job that rebinds the five packages accessing ACCT_TRANSACTION:

//REBIND   EXEC PGM=IKJEFT01
//SYSTSPRT DD SYSOUT=*
//SYSTSIN  DD *
  DSN SYSTEM(DB2P)
  REBIND PACKAGE (CNBCOLL.PGMTRAN1.(*)) EXPLAIN(YES)
  REBIND PACKAGE (CNBCOLL.PGMTRAN2.(*)) EXPLAIN(YES)
  REBIND PACKAGE (CNBCOLL.PGMBAL01.(*)) EXPLAIN(YES)
  REBIND PACKAGE (CNBCOLL.PGMINQ01.(*)) EXPLAIN(YES)
  REBIND PACKAGE (CNBCOLL.PGMRPT01.(*)) EXPLAIN(YES)
  END
/*

Results

After four weeks on the new strategy:

| Metric | Before | After | Change |
|---|---|---|---|
| Partitions REORGed weekly | 24 | 2-4 | -83% |
| REORG elapsed time | 9h 42m | 1h 15m | -87% |
| Log volume (REORG) | 540 GB | 95 GB | -82% |
| Sort work consumed | 420 GB | 55 GB | -87% |
| CPU time | 847 min | 128 min | -85% |
| Active log near-full events | 3/week | 0 | -100% |
| Batch window overruns | 1-2/month | 0 | -100% |
| Average CLUSTERRATIOF (current partition) | 0.58 | 0.72 | +24% |

The average CLUSTERRATIOF for the current partition actually improved despite fewer REORGs. The reason: with the metric-gated approach, REORG now runs when it is most needed (when CLUSTERRATIOF falls below 0.75) rather than on a fixed schedule that might catch the partition at 0.90 (a wasted REORG) or at 0.55 (too late).
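A toy model illustrates the point, assuming (purely for illustration) that clustering decays by a variable amount each week and that a REORG restores CLUSTERRATIOF to 1.0:

```python
def run_weeks(decay_per_week, threshold=None):
    """Return (reorg_count, weekly CLUSTERRATIOF readings).

    threshold=None models the old fixed schedule (REORG every
    week regardless of metrics); otherwise REORG fires only
    when the ratio has fallen below the threshold.
    Illustrative numbers only, not measured decay rates.
    """
    ratio, reorgs, readings = 1.0, 0, []
    for decay in decay_per_week:
        ratio -= decay
        if threshold is None or ratio < threshold:
            reorgs += 1
            ratio = 1.0
        readings.append(ratio)
    return reorgs, readings

# Insert volume varies week to week, so decay varies too:
decay = [0.10, 0.35, 0.15, 0.40, 0.12, 0.30]
weekly = run_weeks(decay)                # old fixed weekly schedule
gated = run_weeks(decay, threshold=0.75) # metric-gated
print(weekly[0], gated[0])               # prints: 6 3
```

In this run the fixed schedule fires six REORGs, the gated policy only three, with the REORG landing in exactly the weeks where clustering actually degraded.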

Lessons Learned

Lesson 1: Measure before you REORG. The biggest efficiency gain came from not REORGing partitions that didn't need it. Calendar-based scheduling is lazy. Metric-based scheduling is smart.

Lesson 2: Online REORG is not always more expensive. The per-partition cost of online REORG (SHRLEVEL CHANGE) is higher than offline. But the total cost is lower when you REORG fewer partitions and eliminate the separate COPY and RUNSTATS jobs through inline execution.

Lesson 3: The COBOL developer is part of the solution. Rob's monitoring program (UTLMONPR) replaced a manual DBA process that ran "when someone remembered." The developer who understands the data patterns is the right person to design the monitoring logic.

Lesson 4: DRAIN_WAIT and RETRY settings matter. The initial deployment used DRAIN_WAIT 30 with RETRY 2. The first week, two REORGs terminated because a batch program had a 45-second commit interval. Increasing DRAIN_WAIT to 120 and RETRY to 5 eliminated the problem. Know your batch commit patterns.

Lesson 5: Inline operations are free performance. Inline COPY and RUNSTATS add negligible time to the REORG (the data is already being read) and eliminate two separate utility passes. There is no reason to run them separately unless you have a specific scheduling constraint.

Discussion Questions

  1. CNB's strategy is metric-gated: REORG runs only when thresholds are breached. What are the risks of this approach compared to a fixed schedule? How would you mitigate them?

  2. The DRAIN_WAIT and RETRY settings were adjusted after the first week. How would you determine the optimal values for your own environment without trial and error?

  3. Rob's monitoring program runs daily. Could real-time monitoring (checking metrics continuously) be better? What would be the trade-offs?

  4. The mapping table for online REORG resides in the same database as the production data. Is this a good practice? What would happen if the mapping table's tablespace had a media failure during REORG?

  5. CNB chose TIMEOUT TERM, which terminates the REORG if drain fails. Under what circumstances might TIMEOUT ABEND (terminate the blocking application) be the better choice?