Case Study 1: Buffer Pool Redesign at Meridian Bank

Background

Meridian National Bank has been running DB2 on z/OS for 12 years. When the system was first deployed, the DBA team created a minimal buffer pool configuration: everything in BP0 (the default pool) with a VPSIZE of 200,000 4 KB pages (approximately 800 MB). At the time, this was generous — the database was 50 GB, the workload was primarily batch, and 800 MB provided a comfortable 85%+ hit ratio.

Twelve years later, the picture is different. The database has grown to 2.1 TB. The number of table spaces has grown from 30 to 180. The workload has shifted from predominantly batch to 80% OLTP / 20% batch, driven by the launch of online and mobile banking. But the buffer pool configuration has never been revisited.

The LPAR has 128 GB of real memory, and DB2's DBM1 address space is currently configured with only 4 GB — a fraction of what is available.

The Problem

The DBA team captures a DISPLAY BUFFERPOOL snapshot during a typical Tuesday afternoon:

-DISPLAY BUFFERPOOL(BP0) DETAIL(*)

BUFFERPOOL NAME BP0, BUFFERPOOL ID 0
VP CURRENT SIZE = 200000 BUFFERS

GETPAGE REQUESTS =    450,000,000
PAGES READ       =     67,500,000
HIT RATIO        =          85.00%

SYNC READ I/O    =     52,000,000
SEQ PREFETCH REQ =      2,500,000
PAGES PREFETCHED =     15,500,000

SYNC WRITES      =        125,000
DEFERRED WRITES  =      8,500,000

VPSEQT REACHED   =        450,000
DWQT REACHED     =        234,000

No other buffer pools are in use. Everything — catalog, OLTP tables, indexes, batch tables, temp work — shares BP0.

Observed Symptoms

  1. OLTP response time spikes during batch: The nightly batch job (2 AM - 5 AM) causes OLTP response times to jump from 20 ms to 300+ ms for early-morning international banking users.

  2. Low overall hit ratio: 85% is far below the 99% target for OLTP. Every OLTP query suffers unnecessary physical I/O. Quick math: 67.5 million physical reads at 4 ms each = 75 hours of cumulative I/O wait time during the monitoring interval.

  3. Synchronous writes: 125,000 sync writes indicate the deferred write queue is filling up faster than async writes can clear it. Each sync write adds 5-10 ms directly to an application thread's response time.

  4. VPSEQT reached frequently: 450,000 VPSEQT events mean sequential pages are regularly hitting the default 80% sequential steal threshold; scans repeatedly claim up to 80% of the pool and displace OLTP pages.

  5. Underutilized memory: The LPAR has 128 GB of real memory; DB2 uses 800 MB for buffer pools. That is 0.6% of what is available.
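The arithmetic behind symptoms 2 and 5 can be checked in a few lines; the counters come from the DISPLAY snapshot above, and the 4 ms per physical read is the same assumption the text uses:

```python
# Back-of-envelope math for the symptoms above, using the snapshot figures.
getpages = 450_000_000        # GETPAGE REQUESTS
pages_read = 67_500_000       # physical reads (sync + prefetch)
read_ms = 4                   # assumed average time per physical read

hit_ratio = 1 - pages_read / getpages
io_wait_hours = pages_read * read_ms / 1000 / 3600

lpar_gb = 128
pool_mb = 200_000 * 4 / 1024  # 200,000 4 KB pages

print(f"hit ratio: {hit_ratio:.2%}")                  # 85.00%
print(f"cumulative I/O wait: {io_wait_hours:.0f} h")  # 75 h
print(f"pool share of LPAR memory: {pool_mb / (lpar_gb * 1024):.1%}")  # 0.6%
```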

Workload Analysis

The DBA team queries SYSIBM.SYSTABLESPACESTATS to identify the top table spaces:

Tablespace   Object                 Size (pages)   Getpages (24h)   Access Pattern
TSACCT01     ACCOUNTS                     65,000      120,000,000   Random OLTP
TSIXAC01     ACCOUNTS indexes             22,000       95,000,000   Random probe
TSCUST01     CUSTOMERS                    25,000       80,000,000   Random OLTP
TSIXCS01     CUSTOMERS indexes             8,000       75,000,000   Random probe
TSTXN01      TRANSACTIONS              1,500,000       45,000,000   Mixed (inserts + scans)
TSIXTX01     TRANSACTIONS indexes        180,000       40,000,000   Mixed
TSAUDT01     AUDIT_LOG                   800,000       15,000,000   Batch scan + write
TSHOLD01     HOLDS                         5,000       12,000,000   Random OLTP
TSREF01      Reference tables                500        8,000,000   Random lookup
TSTEMP01     Temp table space             varies        5,000,000   Sort work

Two insights jump out immediately:

  1. The OLTP working set is small. ACCOUNTS (65K pages), CUSTOMERS (25K pages), HOLDS (5K pages), and reference tables (500 pages) total 95,500 pages = 373 MB. Their indexes total 30,000 pages = 117 MB. This entire hot working set is only 490 MB — yet the 800 MB pool cannot cache it because batch scans and transaction history are constantly evicting these pages.

  2. TRANSACTIONS is the battleground. At 1.5 million pages (5.9 GB), it cannot be fully cached. But it is accessed by both OLTP (inserts, recent lookups) and batch (sequential scans for end-of-day processing). This mixed access is the root cause of the VPSEQT thrashing.
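The working-set sizing behind the first insight is simple page math. The sketch below, using the page counts from the table above, reproduces the 373 MB / 117 MB / 490 MB figures:

```python
# Working-set sizing from the 24-hour catalog statistics (4 KB pages).
oltp_data = {"TSACCT01": 65_000, "TSCUST01": 25_000,
             "TSHOLD01": 5_000, "TSREF01": 500}
oltp_index = {"TSIXAC01": 22_000, "TSIXCS01": 8_000}

def mb(pages: int) -> float:
    return pages * 4 / 1024   # 4 KB pages -> MB

data_pages = sum(oltp_data.values())     # 95,500
index_pages = sum(oltp_index.values())   # 30,000

print(f"OLTP data:  {data_pages:,} pages = {mb(data_pages):.0f} MB")
print(f"OLTP index: {index_pages:,} pages = {mb(index_pages):.0f} MB")
print(f"hot set:    {mb(data_pages + index_pages):.0f} MB")
```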

The Redesign

The senior DBA proposes a seven-pool architecture (a repurposed BP0 plus six workload-specific pools) within a 16 GB budget (conservative, given 128 GB is available):

Pool Layout

-- BP0: System objects only (catalog, directory)
-ALTER BUFFERPOOL(BP0)
  VPSIZE(15000) VPSEQT(50) DWQT(40) VDWQT(10) PGFIX(YES)
  -- 15,000 x 4 KB = 59 MB

-- BP1: OLTP Data (ACCOUNTS, CUSTOMERS, HOLDS, reference tables)
-- Working set: 95,500 pages. Size at 130%: ~125,000 pages
-ALTER BUFFERPOOL(BP1)
  VPSIZE(125000) VPSEQT(10) DWQT(30) VDWQT(10) PGFIX(YES)
  -- 125,000 x 4 KB = 488 MB

-- BP2: OLTP Indexes (all OLTP index table spaces)
-- Working set: 30,000 pages. Size at 160%: ~48,000, rounded up to 50,000 pages
-ALTER BUFFERPOOL(BP2)
  VPSIZE(50000) VPSEQT(0) DWQT(30) VDWQT(10) PGFIX(YES)
  -- 50,000 x 4 KB = 195 MB

-- BP3: Transaction Data (TRANSACTIONS table - mixed workload)
-- Cannot cache all 1.5M pages. Cache ~350K recent pages.
-ALTER BUFFERPOOL(BP3)
  VPSIZE(400000) VPSEQT(40) DWQT(35) VDWQT(15) PGFIX(YES)
  -- 400,000 x 4 KB = 1.53 GB

-- BP4: Transaction Indexes
-- Working set: 180,000 pages.
-ALTER BUFFERPOOL(BP4)
  VPSIZE(200000) VPSEQT(10) DWQT(30) VDWQT(10) PGFIX(YES)
  -- 200,000 x 4 KB = 781 MB

-- BP5: Batch/Audit data (AUDIT_LOG, batch-only tables)
-ALTER BUFFERPOOL(BP5)
  VPSIZE(100000) VPSEQT(95) DWQT(45) VDWQT(20) PGSTEAL(FIFO) PGFIX(YES)
  -- 100,000 x 4 KB = 391 MB

-- BP6: Temp/Sort work
-ALTER BUFFERPOOL(BP6)
  VPSIZE(80000) VPSEQT(90) DWQT(50) PGSTEAL(FIFO) PGFIX(YES)
  -- 80,000 x 4 KB = 313 MB

Total: ~3.8 GB. This is still conservative — the team plans to expand pools based on monitoring data after the initial deployment.
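A quick sanity check on the budget: summing the proposed VPSIZEs gives 970,000 4 KB pages. Whether that reads as 3.7 GB (binary) or closer to 3.9 GB (decimal) depends on the unit convention; either way it is a small fraction of the 16 GB budget:

```python
# Sum the proposed VPSIZEs and check the memory budget (4 KB pages).
pools = {"BP0": 15_000, "BP1": 125_000, "BP2": 50_000,
         "BP3": 400_000, "BP4": 200_000, "BP5": 100_000, "BP6": 80_000}

total_pages = sum(pools.values())
total_gb = total_pages * 4 / 1024 / 1024   # binary GB

print(f"total: {total_pages:,} pages = {total_gb:.1f} GB")
```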

Table Space Reassignment

-- Move OLTP data to BP1
ALTER TABLESPACE MERIDIAN.TSACCT01 BUFFERPOOL BP1;
ALTER TABLESPACE MERIDIAN.TSCUST01 BUFFERPOOL BP1;
ALTER TABLESPACE MERIDIAN.TSHOLD01 BUFFERPOOL BP1;
ALTER TABLESPACE MERIDIAN.TSREF01  BUFFERPOOL BP1;

-- Move OLTP indexes to BP2
ALTER TABLESPACE MERIDIAN.TSIXAC01 BUFFERPOOL BP2;
ALTER TABLESPACE MERIDIAN.TSIXCS01 BUFFERPOOL BP2;

-- Move transaction data to BP3
ALTER TABLESPACE MERIDIAN.TSTXN01  BUFFERPOOL BP3;

-- Move transaction indexes to BP4
ALTER TABLESPACE MERIDIAN.TSIXTX01 BUFFERPOOL BP4;

-- Move batch data to BP5
ALTER TABLESPACE MERIDIAN.TSAUDT01 BUFFERPOOL BP5;

-- Move temp to BP6
ALTER TABLESPACE MERIDIAN.TSTEMP01 BUFFERPOOL BP6;
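Hand-typing ten ALTER statements invites typos, so generating them from a single assignment map is a reasonable alternative. This sketch (the MERIDIAN qualifier and the pool map mirror the statements above) emits the same DDL:

```python
# Generate the reassignment DDL from one pool-assignment map,
# mirroring the ALTER TABLESPACE statements above.
ASSIGNMENTS = {
    "BP1": ["TSACCT01", "TSCUST01", "TSHOLD01", "TSREF01"],
    "BP2": ["TSIXAC01", "TSIXCS01"],
    "BP3": ["TSTXN01"],
    "BP4": ["TSIXTX01"],
    "BP5": ["TSAUDT01"],
    "BP6": ["TSTEMP01"],
}

def reassignment_ddl(qualifier: str = "MERIDIAN") -> list[str]:
    return [f"ALTER TABLESPACE {qualifier}.{ts} BUFFERPOOL {bp};"
            for bp, spaces in ASSIGNMENTS.items() for ts in spaces]

for stmt in reassignment_ddl():
    print(stmt)
```

Keeping the map in one place also gives the team a ready-made rollback artifact: pointing every entry back at BP0 regenerates the reverse migration.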

Implementation

The team executes the migration during a Tuesday afternoon (not a maintenance window) because all changes are dynamic:

  1. Phase 1 (14:00): Create and size all new buffer pools.
  2. Phase 2 (14:05): Move reference tables and HOLDS (low risk, small volume).
  3. Phase 3 (14:10): Move CUSTOMERS and its indexes. Monitor for 10 minutes.
  4. Phase 4 (14:20): Move ACCOUNTS and its indexes. Monitor for 15 minutes.
  5. Phase 5 (14:35): Move TRANSACTIONS to BP3, its indexes to BP4.
  6. Phase 6 (14:40): Move AUDIT_LOG to BP5, temp to BP6.
  7. Phase 7 (14:45): Shrink BP0 to 15,000 pages.

Throughout the migration, the team monitors with:

-DISPLAY BUFFERPOOL(*) DETAIL(*)

The rollback plan is simple: move any problematic table space back to BP0 and increase BP0. All commands are dynamic — no restart required.

Results

After 48 hours of the new configuration:

Metric                         Before         After       Improvement
BP1 (OLTP data) hit ratio      85% (shared)   99.4%       +14.4 pts
BP2 (OLTP index) hit ratio     85% (shared)   99.8%       +14.8 pts
BP3 (transaction) hit ratio    85% (shared)   92.1%       +7.1 pts
Average OLTP response          20 ms          4 ms        5x faster
OLTP response during batch     300 ms         5 ms        60x faster
Synchronous writes/hour        ~5,200         < 10        Eliminated
Batch window                   3.0 hours      2.4 hours   20% faster

The most dramatic improvement: OLTP response during the batch window dropped from 300 ms to 5 ms. With the two workloads in completely separate pools, the nightly batch now has almost no effect on OLTP (a 1 ms difference versus the 4 ms daytime average).

Lessons Learned

  1. Default configurations are starting points, not final answers. The original BP0-only setup was reasonable for a 50 GB batch system. It was completely inadequate for a 2.1 TB mixed OLTP/batch system.

  2. Memory is cheap; outages are expensive. The team used less than 4 GB of the 128 GB available. They had been running a 2 TB system on 800 MB of buffer pool for years. The cost of the memory was zero (it was already installed). The cost of the poor performance was measured in customer complaints and missed SLAs.

  3. Separation is the key insight. The total buffer pool memory only increased from 800 MB to 3.8 GB (5x). But the hit ratio for critical OLTP tables went from 85% to 99.4% — not because of more memory overall, but because OLTP pages were no longer competing with batch scans.

  4. Dynamic changes reduce risk. Every buffer pool change and table space reassignment was executed live, during business hours, with instant rollback capability. There was no maintenance window, no outage, and no risk to production beyond what monitoring could detect and revert in minutes.

  5. Start conservative, expand based on data. The team allocated only 3.8 GB initially, with plans to expand. After two weeks of monitoring, they increased BP3 (transactions) to 800,000 pages based on observed working set data. The incremental approach avoided overcommitting memory.