Case Study 2: Emergency REORG — When Fragmentation Hits 90%

The Incident

It was 7:42 AM on a Tuesday when Marcus Chen, a senior application developer at Meridian National Bank, sent a priority message to the DBA team: "The end-of-day settlement batch is running at 4x normal duration. The ACCOUNT_RECONCILIATION job usually completes in 45 minutes. It's been running for 3 hours and is only 60% through. If it doesn't finish by 9:00 AM, we miss the ACH window and 12,000 customer payments are delayed."

Priya Narayan, the on-call DBA, immediately began her investigation.

Diagnosis

Priya started with the DB2 performance monitor to examine the batch job's access paths. The ACCOUNT_RECONCILIATION process performs a series of sequential scans on the DAILY_BALANCES table, joining it with TRANSACTIONS on ACCOUNT_ID and TRANS_DATE. The explain output showed that the optimizer had chosen reasonable access paths — table scans where appropriate, index access for the join — but the I/O wait times were extreme.

She queried the real-time statistics (RTS) table for the activity and indirect-reference counters, and the catalog for the clustering index's off-position counts:

SELECT NACTIVE, SPACE, EXTENTS,
       REORGINSERTS, REORGUPDATES, REORGDELETES,
       REORGNEARINDREF, REORGFARINDREF
FROM SYSIBM.SYSTABLESPACESTATS
WHERE DBNAME = 'MERIDIANDB'
  AND NAME = 'DAILYBAL_TS';

SELECT P.NEAROFFPOSF, P.FAROFFPOSF, P.LEAFDIST
FROM SYSIBM.SYSINDEXPART P
  JOIN SYSIBM.SYSINDEXES I
    ON P.IXNAME = I.NAME
   AND P.IXCREATOR = I.CREATOR
WHERE I.TBNAME = 'DAILY_BALANCES'
  AND I.CLUSTERING = 'Y';

The numbers told the story:

Metric                            Value      Normal Range
NEAROFFPOSF (near off-position)   340,000    < 50,000
FAROFFPOSF (far off-position)     892,000    < 10,000
NEARINDREF (near indirect ref)    125,000    < 20,000
FARINDREF (far indirect ref)      456,000    < 5,000
EXTENTS                           847        < 200

The DAILY_BALANCES tablespace was catastrophically fragmented. Nearly 90% of rows were stored out of their clustering sequence. The indirect reference counts — rows that had been moved from their original page and replaced with a pointer — were orders of magnitude above normal.

Priya ran REORG REPORTONLY to confirm:

REORG TABLESPACE MERIDIANDB.DAILYBAL_TS
  REPORTONLY

The report showed disorganization: 89%. The tablespace had effectively lost all physical ordering: every "sequential" scan was performing random I/O, and nearly half a million rows required a double page read (following the indirect-reference pointer from the original page to the relocated row).
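The pattern in those counters reduces to a simple threshold check. A minimal sketch, using the values and the illustrative "normal range" limits from the table above (this is not IBM's DSNACCOX decision logic, just the shape of the comparison):

```python
# Simplified REORG-recommendation check. The counters come from the
# statistics query above; the thresholds mirror the "normal range"
# column and are illustrative, not IBM's stored-procedure formulas.

def reorg_recommended(stats, thresholds):
    """Return the list of metrics that exceed their threshold."""
    return [name for name, limit in thresholds.items()
            if stats.get(name, 0) > limit]

stats = {
    "NEAROFFPOSF": 340_000,
    "FAROFFPOSF": 892_000,
    "NEARINDREF": 125_000,
    "FARINDREF": 456_000,
    "EXTENTS": 847,
}
thresholds = {
    "NEAROFFPOSF": 50_000,
    "FAROFFPOSF": 10_000,
    "NEARINDREF": 20_000,
    "FARINDREF": 5_000,
    "EXTENTS": 200,
}

violations = reorg_recommended(stats, thresholds)
print(violations)  # every metric is far past its limit
```

Every one of the five metrics trips its limit, which is why the REPORTONLY confirmation was a formality rather than a diagnosis.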

Root Cause

How did this happen? Priya traced the history.

The DAILY_BALANCES table stores one row per account per business day — the closing balance for each account. The table was originally designed with a clustering index on (BALANCE_DATE, ACCOUNT_ID). Each night, the end-of-day batch inserts approximately 1.2 million new rows (one per account) for the current date, and a separate archival process deletes rows older than 90 days.

The problem was threefold:

  1. The archival process had been disabled 6 months ago during a system migration and was never re-enabled. The table had grown from roughly 110 million rows (90 days of data at 1.2 million rows per day) to over 220 million rows (six months). The tablespace was nearly full, and free space within pages was exhausted.

  2. Variable-length column expansion. Three months ago, a schema change added a VARCHAR(200) RECONCILIATION_NOTES column. As the nightly batch updated old rows to add reconciliation notes, the expanded rows no longer fit in their original pages and were moved to overflow pages, leaving indirect reference pointers behind.

  3. REORG had been skipped. The DAILY_BALANCES tablespace was supposed to be in the weekly REORG rotation, but a JCL error three months ago had excluded it from the LISTDEF. Nobody noticed because the daily RUNSTATS job still ran (masking the problem in monitoring dashboards that only checked statistics freshness, not fragmentation).

The Emergency Response

Priya faced a time constraint: the ACH settlement window closed at 9:00 AM. It was now 8:05 AM. A full REORG of the 220-million-row tablespace would take approximately 2.5 hours with SHRLEVEL NONE, or 3+ hours with SHRLEVEL CHANGE. Neither would finish in time.

Step 1: Immediate Triage (8:05 AM - 8:10 AM)

Priya consulted with Marcus and the operations manager. Three options were considered:

Option A: Kill the batch and restart after REORG. This would miss the 9:00 AM ACH window. Impact: 12,000 delayed payments, regulatory reporting implications, and customer complaints.

Option B: Let the batch continue and hope it finishes. At the current rate, completion was estimated at 9:45 AM — too late.

Option C: Optimize the current run, then REORG afterward. Priya could increase the buffer pool size for the DAILYBAL_TS tablespace to cache more pages in memory, reducing the I/O penalty from fragmentation. This would not fix the fragmentation but might speed the batch enough to finish in time.

They chose Option C with a fallback to manual ACH submission if the batch did not complete by 9:00 AM.

Step 2: Buffer Pool Expansion (8:10 AM - 8:15 AM)

Priya increased the size of BP32K1, the buffer pool backing the DAILYBAL_TS tablespace, from 50,000 to 200,000 buffers:

-DB2P ALTER BUFFERPOOL(BP32K1) VPSIZE(200000)

The additional buffer pool memory allowed DB2 to cache the heavily accessed index pages and reduce the random I/O impact of the fragmentation. The batch job's I/O wait dropped by approximately 40%.
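The real-storage cost of that change is easy to estimate. A back-of-envelope sketch, assuming BP32K1 uses 32 KB buffers (as the name suggests) and ignoring DB2 control-block overhead:

```python
# Back-of-envelope storage cost of the buffer pool expansion.
# Assumes 32 KB buffers (BP32K1); ignores control-block overhead.

PAGE_SIZE = 32 * 1024  # bytes per 32 KB buffer

def pool_gib(buffers):
    return buffers * PAGE_SIZE / 2**30

before = pool_gib(50_000)    # ~1.5 GiB
after = pool_gib(200_000)    # ~6.1 GiB
print(f"before={before:.2f} GiB  after={after:.2f} GiB  "
      f"delta={after - before:.2f} GiB")
```

The expansion therefore asked for roughly 4.6 GiB of additional real storage, which is why Priya could make the call quickly but later sized the permanent pool more conservatively.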

Step 3: Monitor and Wait (8:15 AM - 8:52 AM)

With the buffer pool expansion, the batch job accelerated. Priya monitored progress through the application's checkpoint logging. At 8:52 AM, the batch completed — 8 minutes before the ACH deadline.

Step 4: Emergency REORG Planning (8:52 AM - 9:30 AM)

With the immediate crisis resolved, Priya planned the emergency REORG. The challenge: the DAILY_BALANCES table was actively used throughout the business day for balance inquiries, account statements, and ongoing reconciliation processes. A SHRLEVEL NONE REORG was not an option during business hours.

She chose a two-phase approach:

Phase 1 (During business hours): REORG with SHRLEVEL CHANGE

This allowed continued read and write access. The REORG would take longer than SHRLEVEL NONE but would not impact application availability.

//EMERGRG  EXEC DSNUPROC,SYSTEM=DB2P,
//            UID='EMERG.REORG.DBAL',UTPROC=''
//DSNUPROC.SORTWK01 DD UNIT=SYSDA,SPACE=(CYL,(300,150))
//DSNUPROC.SORTWK02 DD UNIT=SYSDA,SPACE=(CYL,(300,150))
//DSNUPROC.SORTWK03 DD UNIT=SYSDA,SPACE=(CYL,(300,150))
//DSNUPROC.SORTWK04 DD UNIT=SYSDA,SPACE=(CYL,(300,150))
//DSNUPROC.SYSREC   DD DSN=MERIDIAN.EMERG.DAILYBAL.SYSREC,
//            DISP=(NEW,CATLG,CATLG),
//            UNIT=SYSDA,SPACE=(CYL,(500,200))
//DSNUPROC.SORTOUT  DD DSN=MERIDIAN.EMERG.DAILYBAL.SORTOUT,
//            DISP=(NEW,CATLG,CATLG),
//            UNIT=SYSDA,SPACE=(CYL,(500,200))
//* SHRLEVEL CHANGE reloads into shadow data sets, so LOG NO applies
//* and an inline image copy (COPYDDN) is required.
//DSNUPROC.SYSCOPY  DD DSN=MERIDIAN.EMERG.DAILYBAL.ICOPY,
//            DISP=(NEW,CATLG,CATLG),
//            UNIT=SYSDA,SPACE=(CYL,(400,200))
//DSNUPROC.SYSIN DD *
  REORG TABLESPACE MERIDIANDB.DAILYBAL_TS
    SHRLEVEL CHANGE
    FASTSWITCH YES
    MAPPINGTABLE MERIDIAN.REORG_MAP_DAILYBAL
    LOG NO
    COPYDDN(SYSCOPY)
    SORTDATA
    STATISTICS TABLE(ALL) INDEX(ALL)
      KEYCARD
      FREQVAL NUMCOLS 1 COUNT 20
      HISTOGRAM NUMQUANTILES 100
/*

Phase 2 (After REORG): Full image copy

Immediately after the REORG completed, she would take a full image copy to preserve the clean state:

//EMERGCP  EXEC DSNUPROC,SYSTEM=DB2P,
//            UID='EMERG.COPY.DBAL',UTPROC=''
//DSNUPROC.COPY1 DD DSN=MERIDIAN.EMERG.COPY.DAILYBAL.FULL,
//            DISP=(NEW,CATLG,CATLG),
//            UNIT=SYSDA,SPACE=(CYL,(400,200))
//DSNUPROC.SYSIN DD *
  COPY TABLESPACE MERIDIANDB.DAILYBAL_TS
    COPYDDN(COPY1)
    SHRLEVEL REFERENCE
    FULL YES
/*

Step 5: Execute the REORG (9:30 AM - 1:15 PM)

Priya submitted the emergency REORG job at 9:30 AM. She monitored progress with DISPLAY UTILITY:

-DB2P DISPLAY UTILITY(EMERG.REORG.DBAL)

The REORG progressed through its phases:

Time       Phase                            Status
9:30 AM    UNLOAD                           Started
10:15 AM   UNLOAD                           45% complete
11:00 AM   SORT                             Started
11:30 AM   RELOAD                           Started
12:15 PM   BUILD (indexes)                  Started
12:45 PM   LOG (apply concurrent changes)   Started
1:10 PM    SWITCH                           Started
1:12 PM    SWITCH                           Complete

Total REORG time: 3 hours 42 minutes. During this entire period, applications continued to read and write the DAILY_BALANCES table normally. The SWITCH phase — the only moment of brief write suspension — lasted 2 minutes and was transparent to users because the application's connection pool handled the momentary delay.

Step 6: Post-REORG Verification (1:15 PM - 1:30 PM)

After REORG completion, Priya verified the results:

REORG TABLESPACE MERIDIANDB.DAILYBAL_TS
  REPORTONLY

Disorganization: 2% (down from 89%).

She reran the fragmentation query:

Metric        Before     After
NEAROFFPOSF   340,000    1,200
FAROFFPOSF    892,000    0
NEARINDREF    125,000    800
FARINDREF     456,000    0
EXTENTS       847        124

The tablespace was clean. She then submitted the full image copy, which completed in 12 minutes.

Step 7: Fix the Archival Process (Same Day Afternoon)

Priya worked with the application team to re-enable the archival process that had been disabled 6 months ago. They first ran a catch-up archival to purge the roughly 110 million excess rows:

-- z/OS: Handled via UNLOAD + DELETE batch
-- Unload rows older than 90 days for archive
-- Then delete in batches of 100,000 with commits

This was done in controlled batches over the following two evenings to avoid excessive log volume.
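The commit-per-batch pattern can be illustrated in miniature. The sketch below is a pure-Python stand-in: the in-memory list plays the role of DAILY_BALANCES, and the batch size and day counts are illustrative. The real work was done with UNLOAD plus batched DELETE jobs on z/OS, not a script like this.

```python
# Pure-Python illustration of the commit-per-batch purge pattern:
# work proceeds in bounded units so that lock duration and log volume
# per unit of work stay small. Numbers are illustrative.

def purge_in_batches(rows, is_expired, batch_size):
    """Remove expired rows in batches; return (kept_rows, commit_count)."""
    commits = 0
    while True:
        doomed = [r for r in rows if is_expired(r)][:batch_size]
        if not doomed:
            return rows, commits
        doomed_set = set(doomed)
        rows = [r for r in rows if r not in doomed_set]
        commits += 1  # a COMMIT would go here in the real batch job

# Business days 1..200; only the last 90 days fall inside the
# retention window, so days 1..110 are purged in batches of 25.
rows, commits = purge_in_batches(
    list(range(1, 201)),
    is_expired=lambda day: day <= 200 - 90,
    batch_size=25,
)
print(len(rows), commits)  # 90 rows kept, 5 commits
```

Smaller batches mean more commits but a tighter bound on how much log any single unit of work generates, which is exactly the trade-off the team managed by spreading the purge over two evenings.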

Step 8: Fix the LISTDEF (Same Day)

Priya corrected the JCL error that had excluded DAILYBAL_TS from the weekly REORG LISTDEF. She also added a monitoring check: a daily job that scans the fragmentation indicators (REORGFARINDREF in SYSIBM.SYSTABLESPACESTATS and FAROFFPOSF in SYSIBM.SYSINDEXPART) and alerts the DBA team for any tablespace whose FAROFFPOSF exceeds 50,000.
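The alerting logic itself is a straightforward filter. A minimal sketch, assuming the statistics rows have already been fetched (the query is omitted; the 50,000 threshold is the one the team chose, and the row shape is an assumption):

```python
# Daily fragmentation alert sketch. Assumes rows were already fetched
# from the statistics tables as (dbname, tsname, faroffposf) tuples.

FAROFFPOSF_LIMIT = 50_000  # the team's alert threshold

def fragmentation_alerts(rows, limit=FAROFFPOSF_LIMIT):
    """Return one alert string per tablespace over the limit."""
    return [f"{db}.{ts}: FAROFFPOSF={far:,} exceeds {limit:,}"
            for db, ts, far in rows if far > limit]

sample = [
    ("MERIDIANDB", "DAILYBAL_TS", 892_000),  # the incident tablespace
    ("MERIDIANDB", "TXN_TS", 4_200),         # healthy, no alert
]
for alert in fragmentation_alerts(sample):
    print(alert)
```

Had this check existed three months earlier, the FAROFFPOSF climb would have been flagged long before the batch window was at risk.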

Post-Incident Analysis

The team conducted a formal post-incident review and identified five contributing factors:

  1. Missing monitoring for fragmentation. The existing monitoring checked statistics freshness (RUNSTATS) and backup recency (COPY) but did not check fragmentation metrics. REORG REPORTONLY or the real-time statistics fragmentation indicators should have been part of the daily monitoring suite.

  2. LISTDEF maintenance gap. When JCL is changed, there must be a validation step that confirms all expected tablespaces are still included. A simple count check ("LISTDEF should contain N tablespaces") would have caught the exclusion.

  3. Disabled process not tracked. The archival process was disabled with a change ticket that had no follow-up action or expiration date. Change management procedures were updated to require a "re-enable by" date for any temporarily disabled process.

  4. Schema change impact not assessed. Adding the VARCHAR(200) column created a predictable fragmentation pattern (row expansion causing overflow) that should have triggered an immediate REORG and potentially a free space increase in the tablespace definition.

  5. No escalation threshold. There was no defined threshold at which fragmentation triggers an automatic REORG or a mandatory alert. The team established a policy: any tablespace exceeding 30% disorganization triggers an alert; exceeding 50% triggers an emergency REORG within 24 hours.
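The escalation policy from item 5 maps cleanly onto a small decision function. A sketch with the team's thresholds (the tier names are illustrative):

```python
# Escalation policy sketch: >30% disorganization alerts, >50% demands
# an emergency REORG within 24 hours. Tier names are illustrative.

def escalation(disorg_pct):
    if disorg_pct > 50:
        return "EMERGENCY_REORG_24H"
    if disorg_pct > 30:
        return "ALERT"
    return "OK"

print(escalation(89), escalation(35), escalation(12))
# prints: EMERGENCY_REORG_24H ALERT OK
```

Under this policy, the DAILYBAL_TS tablespace at 89% disorganization would have been in the emergency tier well before the batch slowdown surfaced.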

Preventive Measures Implemented

Following the incident, the Meridian Bank DBA team implemented these changes:

  1. Daily fragmentation monitoring — A scheduled job queries SYSTABLESPACESTATS and REORG REPORTONLY, alerting when disorganization exceeds 30%.

  2. LISTDEF validation — After any JCL change to maintenance jobs, an automated check verifies that all production tablespaces are included in the appropriate LISTDEFs.

  3. Change impact assessment checklist — All schema changes now require a DBA review of potential impacts on fragmentation, space utilization, and maintenance schedules.

  4. Archival process monitoring — The data archival jobs now have health checks that alert if they fail to run for more than 48 hours.

  5. Buffer pool right-sizing — Priya reviewed all buffer pool allocations and increased the DAILYBAL_TS pool from 50,000 to 100,000 pages permanently (not the emergency 200,000, but a sustainable increase based on the table's access patterns).
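The LISTDEF validation in item 2 amounts to a set comparison between the expected tablespace inventory and whatever the LISTDEF actually resolves to. A minimal sketch with illustrative names:

```python
# LISTDEF validation sketch: flag any expected tablespace that the
# maintenance LISTDEF no longer resolves to. Names are illustrative.

def missing_from_listdef(expected, resolved):
    """Return the expected tablespaces absent from the resolved list."""
    return sorted(set(expected) - set(resolved))

expected = {"DAILYBAL_TS", "TXN_TS", "CUSTACCT_TS"}
resolved = {"TXN_TS", "CUSTACCT_TS"}  # DAILYBAL_TS lost to a JCL error
print(missing_from_listdef(expected, resolved))  # ['DAILYBAL_TS']
```

A check this simple, run after every JCL change, would have caught the three-month-old exclusion on the day it happened.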

Performance Impact Summary

Metric                                Before REORG   After REORG   Improvement
Disorganization                       89%            2%            87 percentage points
Settlement batch runtime              3+ hours       38 minutes    79% faster
Average balance inquiry response      2.4 seconds    0.3 seconds   88% faster
Buffer pool hit ratio (DAILYBAL_TS)   62%            94%           32 percentage points
Tablespace extents                    847            124           85% reduction

The performance improvement was dramatic and immediate. The end-of-day settlement batch, which had triggered the crisis, returned to its normal 38-minute runtime — actually 7 minutes faster than its pre-incident average of 45 minutes, because the REORG with inline statistics also resolved some stale-statistics-induced access path inefficiencies.


Discussion questions:

  1. Priya chose to expand the buffer pool as an immediate workaround rather than killing and restarting the batch. What risks did this introduce, and how would you evaluate whether the buffer pool expansion was safe?
  2. The root cause was ultimately a disabled archival process, not a missed REORG. How would you design a monitoring system that catches upstream process failures before they manifest as database performance problems?
  3. If this had been a DB2 LUW environment instead of z/OS, how would the diagnosis and remediation steps differ? Write the equivalent LUW commands for the key steps in Priya's response.