Case Study 2: Memory Constraint Tuning on Limited Hardware
Background
Meridian National Bank is establishing a disaster recovery (DR) site. Due to budget constraints, the DR server has significantly less hardware than production:
| Resource | Production (z/OS) | DR Site (LUW) |
|---|---|---|
| Platform | z/OS, 128 GB LPAR | Linux, 32 GB physical |
| CPUs | 8 GCPs | 8 cores (x86) |
| Storage | Enterprise Flash (zHyperLink) | NVMe SSD |
| Database size | 2.1 TB | 2.1 TB (replicated) |
| Expected workload | Full production (7,000 TPS peak) | 30% of production (essential services only) |
The DR site runs DB2 LUW 11.5 on RHEL 8. Data is replicated from the z/OS production system via a CDC (Change Data Capture) tool. During a failover, the DR site must support:
- Core banking: Account lookups, balance inquiries, basic transactions (200 concurrent users)
- ATM network: Balance checks and withdrawals (continuous, automated, ~500 TPS)
- No batch processing: Batch jobs are deferred until production is restored
The challenge: configure DB2 memory for acceptable performance with one-quarter the memory and the same 2.1 TB of data.
Initial State (Unconfigured)
The DR site was set up with DB2 defaults and minimal tuning. After initial failover testing:
DB Configuration:
INSTANCE_MEMORY = AUTOMATIC (DB2 allocated ~29 GB)
DATABASE_MEMORY = AUTOMATIC (DB2 allocated ~26 GB)
SELF_TUNING_MEM = ON
Buffer Pool Configuration:
IBMDEFAULTBP: SIZE AUTOMATIC (~22 GB, 4 KB pages, everything in one pool)
Observed Performance:
Overall hit ratio: 78%
Average OLTP response: 450 ms (target: < 100 ms)
Sort overflow rate: 18% (target: < 3%)
Package cache hit ratio: 82% (target: > 95%)
Dirty page steals/hour: 12,400
The results are unacceptable. The 78% hit ratio means 22% of page requests result in physical I/O — devastating for OLTP response time. The 18% sort overflow rate means nearly one in five sorts spills to disk. The low package cache hit ratio means SQL statements are being re-prepared repeatedly, wasting CPU.
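These diagnostics are simple ratios over monitor counters. As a sketch of the arithmetic (the counter values below are illustrative, not the actual monitor output):

```python
def hit_ratio(logical_reads, physical_reads):
    """Buffer pool hit ratio: fraction of page requests served from memory."""
    if logical_reads == 0:
        return 100.0
    return (1.0 - physical_reads / logical_reads) * 100.0

def overflow_rate(total_sorts, sort_overflows):
    """Percentage of sorts that spilled to temporary table space."""
    if total_sorts == 0:
        return 0.0
    return sort_overflows / total_sorts * 100.0

# A 78% hit ratio means 22 of every 100 logical reads go to disk.
print(round(hit_ratio(100_000, 22_000), 1))   # 78.0
print(round(overflow_rate(1_000, 180), 1))    # 18.0
```

At OLTP scale the gap compounds: at 500 TPS, a 22% miss rate is on the order of a hundred physical reads per second that a well-sized cache would absorb.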
Analysis
Why the Defaults Failed
The defaults gave 22 GB to a single buffer pool. That sounds generous — but the database is 2.1 TB. Even 22 GB caches only about 1% of the data. The hit ratio depends entirely on whether the hot working set fits in that 22 GB, and with everything in one pool, there is no protection against access pattern interference.
More critically, STMM allocated almost everything to the buffer pool and left the other consumers starved:
- Sort heap: too small (STMM saw few sorts initially and kept it low)
- Package cache: too small (82% hit ratio means 18% of SQL prepares are re-done)
- No workload separation: ATM balance checks and ad-hoc queries compete for the same pages
The DR Workload Profile
Unlike production, the DR site only needs to support core banking and ATM. The hot tables are well-defined:
| Table | Size (4 KB Pages) | Size (MB) | DR Access Pattern | Priority |
|---|---|---|---|---|
| ACCOUNTS | 65,000 | 254 | Every transaction | Critical |
| CUSTOMERS | 25,000 | 98 | Every account lookup | Critical |
| BALANCES | 40,000 | 156 | Every balance check | Critical |
| TRANSACTIONS (recent 30 days) | 150,000 | 586 | Recent history lookups | High |
| HOLDS | 5,000 | 20 | ATM withdrawal checks | High |
| BRANCHES | 200 | 0.8 | Lookup | Low (cached trivially) |
| PRODUCTS | 100 | 0.4 | Lookup | Low (cached trivially) |
| All OLTP indexes | 80,000 | 313 | Every query | Critical |
| Transaction indexes | 50,000 | 195 | Transaction lookups | High |
Key insight: The critical working set (ACCOUNTS + CUSTOMERS + BALANCES + HOLDS + reference + OLTP indexes) totals about 215,300 pages = 841 MB. This is small. If we can guarantee that this 841 MB is always cached, the critical OLTP hit ratio will approach 100%.
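The working-set figure is worth verifying, since the whole design rests on it. Summing the critical rows of the table (4 KB pages converted to MB):

```python
# Page counts from the DR workload profile table above.
critical_pages = {
    "ACCOUNTS": 65_000,
    "CUSTOMERS": 25_000,
    "BALANCES": 40_000,
    "HOLDS": 5_000,
    "BRANCHES": 200,
    "PRODUCTS": 100,
    "OLTP indexes": 80_000,
}

total_pages = sum(critical_pages.values())
total_mb = total_pages * 4 / 1024  # 4 KB pages -> MB

print(total_pages)       # 215300
print(round(total_mb))   # 841
```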
The Redesign
Memory Budget
Total physical memory: 32,768 MB (32 GB)
Reserved for OS + filesystem cache: -6,144 MB (6 GB, ~19%)
Available for DB2: 26,624 MB (26 GB)
INSTANCE_MEMORY: 26,624 MB
DATABASE_MEMORY: 23,552 MB (23 GB)
Buffer pools (total): 18,432 MB (18 GB, 78% of database memory)
Sort heap (shared threshold): 2,048 MB (2 GB)
Package cache: 1,024 MB (1 GB)
Catalog cache: 256 MB
Lock list: 512 MB
Database heap + utility heap: 256 MB
Overhead + misc: 1,024 MB
Agent/FCM/instance overhead: 3,072 MB (3 GB)
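The budget above can be restated as a checkable ledger (all values in MB), confirming that the consumers sum exactly to the allocations:

```python
total_physical = 32_768
os_reserve = 6_144                                 # ~19% for OS + filesystem cache

instance_memory = total_physical - os_reserve      # 26,624 MB for DB2
database_memory = 23_552
agent_overhead = 3_072                             # agents, FCM, instance overhead

db_consumers = {
    "buffer pools": 18_432,
    "shared sort threshold": 2_048,
    "package cache": 1_024,
    "catalog cache": 256,
    "lock list": 512,
    "db heap + utility heap": 256,
    "overhead + misc": 1_024,
}

assert sum(db_consumers.values()) == database_memory
assert database_memory + agent_overhead == instance_memory
print(instance_memory)  # 26624
```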
Buffer Pool Configuration
-- BP_CRITICAL: Core OLTP tables that must always be cached
-- ACCOUNTS (65K) + CUSTOMERS (25K) + BALANCES (40K) + HOLDS (5K) + ref (300)
-- Working set: ~135,300 pages = 529 MB
-- Size at 140%: 190,000 pages = 742 MB
-- FIXED size - STMM must NOT shrink this pool
CREATE BUFFERPOOL BP_CRITICAL
SIZE 190000
PAGESIZE 4096;
-- BP_INDEXES: All OLTP indexes
-- Working set: ~80,000 pages = 313 MB
-- Size at 140%: 112,000 pages = 438 MB
-- FIXED size
CREATE BUFFERPOOL BP_INDEXES
SIZE 112000
PAGESIZE 4096;
-- BP_TRANSACTIONS: Recent transaction data and transaction indexes
-- Cannot cache all; size for 60-70% of recent 30-day partition
-- STMM-managed: can grow if memory is available, shrink if not
CREATE BUFFERPOOL BP_TRANSACTIONS
SIZE 917504 AUTOMATIC -- initial ~3.5 GB (917,504 x 4 KB pages); STMM adjusts from there
PAGESIZE 4096;
-- BP_TEMP: Sort work files
-- 32 KB pages for efficient sort I/O
CREATE BUFFERPOOL BP_TEMP
SIZE AUTOMATIC
PAGESIZE 32768;
-- Minimize default pool (system catalog only)
ALTER BUFFERPOOL IBMDEFAULTBP SIZE 10000; -- 39 MB
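The fixed pool sizes were derived as working set plus roughly 40% headroom, rounded up to a clean page count. A small sketch of that sizing rule (the helper names and rounding granularity are choices for illustration, not a DB2 API):

```python
import math

def pool_size_pages(working_set_pages, headroom=1.4, round_to=1_000):
    """Working set * headroom, rounded up to the nearest round_to pages."""
    raw = working_set_pages * headroom
    return math.ceil(raw / round_to) * round_to

def pages_to_mb(pages, page_kb=4):
    return pages * page_kb / 1024

bp_critical = pool_size_pages(135_300)   # BP_CRITICAL working set
bp_indexes = pool_size_pages(80_000)     # BP_INDEXES working set
print(bp_critical, round(pages_to_mb(bp_critical)))  # 190000 742
print(bp_indexes, round(pages_to_mb(bp_indexes)))    # 112000 438
```

The headroom absorbs free-space pages, growth, and occasional cold reads without letting a single eviction storm touch the critical tables.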
Configuration Commands
-- Instance level
UPDATE DBM CFG USING INSTANCE_MEMORY 6815744; -- 26 GB in 4 KB pages
-- Database level
UPDATE DB CFG FOR MERIDIAN USING SELF_TUNING_MEM ON;
UPDATE DB CFG FOR MERIDIAN USING DATABASE_MEMORY AUTOMATIC;
-- Fix the sort problem (18% overflow is unacceptable)
UPDATE DB CFG FOR MERIDIAN USING SORTHEAP 4096; -- 16 MB per sort
UPDATE DB CFG FOR MERIDIAN USING SHEAPTHRES_SHR 524288; -- 2 GB shared
-- Fix the package cache (82% hit ratio is unacceptable)
UPDATE DB CFG FOR MERIDIAN USING PCKCACHESZ 262144; -- 1 GB
-- Catalog cache
UPDATE DB CFG FOR MERIDIAN USING CATALOGCACHE_SZ 65536; -- 256 MB
-- Lock list (STMM-managed)
UPDATE DB CFG FOR MERIDIAN USING LOCKLIST AUTOMATIC;
-- I/O configuration
UPDATE DB CFG FOR MERIDIAN USING NUM_IOCLEANERS 8;
UPDATE DB CFG FOR MERIDIAN USING NUM_IOSERVERS 8;
UPDATE DB CFG FOR MERIDIAN USING LOGBUFSZ 2048; -- 8 MB log buffer
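All of these DB CFG parameters are specified in 4 KB pages, which makes the raw numbers easy to misread. A quick conversion check of the values used above:

```python
def pages_to_mb(pages):
    """Convert a 4 KB page count to MB."""
    return pages * 4 / 1024

params = {
    "SORTHEAP": 4_096,          # per-sort heap
    "SHEAPTHRES_SHR": 524_288,  # shared sort memory threshold
    "PCKCACHESZ": 262_144,      # package cache
    "CATALOGCACHE_SZ": 65_536,  # catalog cache
    "LOGBUFSZ": 2_048,          # log buffer
}

for name, pages in params.items():
    print(f"{name}: {pages_to_mb(pages):g} MB")
# SORTHEAP: 16 MB, SHEAPTHRES_SHR: 2048 MB, PCKCACHESZ: 1024 MB,
# CATALOGCACHE_SZ: 256 MB, LOGBUFSZ: 8 MB
```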
Table Space Assignment
ALTER TABLESPACE TS_ACCOUNTS BUFFERPOOL BP_CRITICAL;
ALTER TABLESPACE TS_CUSTOMERS BUFFERPOOL BP_CRITICAL;
ALTER TABLESPACE TS_BALANCES BUFFERPOOL BP_CRITICAL;
ALTER TABLESPACE TS_HOLDS BUFFERPOOL BP_CRITICAL;
ALTER TABLESPACE TS_BRANCHES BUFFERPOOL BP_CRITICAL;
ALTER TABLESPACE TS_PRODUCTS BUFFERPOOL BP_CRITICAL;
ALTER TABLESPACE TS_ACCT_IDX BUFFERPOOL BP_INDEXES;
ALTER TABLESPACE TS_CUST_IDX BUFFERPOOL BP_INDEXES;
ALTER TABLESPACE TS_BAL_IDX BUFFERPOOL BP_INDEXES;
ALTER TABLESPACE TS_HOLD_IDX BUFFERPOOL BP_INDEXES;
ALTER TABLESPACE TS_TRANSACTIONS BUFFERPOOL BP_TRANSACTIONS;
ALTER TABLESPACE TS_TXN_IDX BUFFERPOOL BP_TRANSACTIONS;
ALTER TABLESPACE TEMPSPACE1 BUFFERPOOL BP_TEMP;
Monitoring Strategy
On a memory-constrained system, monitoring must be more aggressive than on production. The team implements 5-minute monitoring intervals:
-- Critical alert: OLTP pool hit ratio below 98%
SELECT BP_NAME,
CASE WHEN (POOL_DATA_L_READS + POOL_INDEX_L_READS) > 0
THEN DECIMAL(
(1.0 - FLOAT(POOL_DATA_P_READS + POOL_INDEX_P_READS)
/ FLOAT(POOL_DATA_L_READS + POOL_INDEX_L_READS))
* 100, 5, 2)
ELSE 100 END AS hit_ratio
FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS T
WHERE BP_NAME IN ('BP_CRITICAL', 'BP_INDEXES');
Alert thresholds:
| Metric | Warning | Critical | Action |
|---|---|---|---|
| BP_CRITICAL hit ratio | < 99% | < 97% | Check for unexpected queries; verify pool not shrunk |
| BP_INDEXES hit ratio | < 99.5% | < 98% | Verify index statistics current |
| Sort overflow rate | > 3% | > 8% | Increase SORTHEAP; identify large sort queries |
| Package cache hit ratio | < 95% | < 90% | Increase PCKCACHESZ; check for literal SQL |
| Average OLTP response | > 50 ms | > 150 ms | Full diagnostic: check all pools, locks, I/O |
| Dirty page steals | > 100/hour | > 500/hour | Increase NUM_IOCLEANERS or buffer pool |
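The alerting logic behind the table is simple but direction-sensitive: hit ratios alarm when they fall, while rates and response times alarm when they rise. A minimal sketch of that classifier (names and structure are illustrative, not part of any monitoring product):

```python
def classify(value, warning, critical, lower_is_bad=True):
    """Return 'OK', 'WARNING', or 'CRITICAL' for one metric reading."""
    if lower_is_bad:
        if value < critical:
            return "CRITICAL"
        if value < warning:
            return "WARNING"
    else:
        if value > critical:
            return "CRITICAL"
        if value > warning:
            return "WARNING"
    return "OK"

print(classify(99.6, 99.0, 97.0))                    # BP_CRITICAL hit ratio: OK
print(classify(96.2, 99.0, 97.0))                    # CRITICAL
print(classify(5.0, 3.0, 8.0, lower_is_bad=False))   # sort overflow rate: WARNING
```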
Failover Validation
The team designs a structured test plan executed quarterly:
Test 1: Smoke Test (5 minutes)
- 100 account lookups by customer ID
- 100 balance inquiries by account number
- 50 fund transfers between accounts
- Target: 100% success, average response < 50 ms
Test 2: ATM Simulation (30 minutes)
- 200 concurrent simulated ATM sessions
- Each session: balance check, withdrawal, receipt generation
- 5-second think time between operations
- Target: 95th percentile response < 150 ms
Test 3: Sustained Load (2 hours)
- Continuous Test 2 workload
- Monitor buffer pool hit ratios and response times
- Hit ratios should stabilize within 20 minutes (cache warm-up)
- No degradation trend after warm-up
Test 4: Peak Load (30 minutes)
- Double concurrent sessions to 400
- Verify graceful degradation (linear, not exponential)
- Target: 95th percentile response < 400 ms under 2x load
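The 95th-percentile targets in Tests 2 and 4 can be computed from recorded response times with the nearest-rank method (one common definition; the sample data below is hypothetical, not measured results):

```python
import math

def percentile_nearest_rank(samples, pct):
    """pct-th percentile: value at rank ceil(pct * n / 100) of the sorted samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

# Illustrative per-request response times in ms.
responses = [12, 15, 9, 30, 22, 18, 140, 25, 11, 17,
             14, 19, 21, 16, 13, 28, 33, 10, 20, 24]

p95 = percentile_nearest_rank(responses, 95)
print(p95, p95 < 150)  # 33 True
```

Note how one 140 ms outlier barely moves the 95th percentile; this is why the test targets use percentiles rather than averages.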
Results After Tuning
After implementing the redesigned configuration and running the validation tests:
| Metric | Before (Defaults) | After (Tuned) | Target | Met? |
|---|---|---|---|---|
| BP_CRITICAL hit ratio | 78% (shared) | 99.6% | > 99% | Yes |
| BP_INDEXES hit ratio | 78% (shared) | 99.9% | > 99% | Yes |
| BP_TRANSACTIONS hit ratio | 78% (shared) | 87.2% | > 80% | Yes |
| Sort overflow rate | 18% | 1.8% | < 3% | Yes |
| Package cache hit ratio | 82% | 97.3% | > 95% | Yes |
| Average OLTP response | 450 ms | 8 ms | < 100 ms | Yes |
| 95th pct OLTP response | 1,200 ms | 35 ms | < 200 ms | Yes |
| Dirty page steals/hour | 12,400 | 22 | < 100 | Yes |
The transformation is dramatic. Average response time dropped from 450 ms to 8 ms — a 56x improvement — on the same hardware with the same data. The only change was how memory was organized.
Key Lessons
- Defaults are designed for generality, not your workload. STMM tried to optimize globally but could not know that hot OLTP tables like ACCOUNTS and cold history like the bulk of TRANSACTIONS have fundamentally different access patterns. Manual separation of critical from non-critical was essential.
- On constrained hardware, prioritize ruthlessly. With only 32 GB, every MB matters. The critical working set (841 MB) got guaranteed, fixed-size pools. Everything else got STMM-managed pools that could flex. The batch workload was explicitly deferred, so no memory was wasted on tables that would not be accessed during a DR scenario.
- Fix all the memory problems, not just buffer pools. The original system had three distinct problems: low buffer pool hit ratios, sort overflows, and package cache misses. Fixing only the buffer pools would have left the other two degrading performance. The sort overflow fix alone (from 18% to 1.8%) eliminated thousands of disk I/O operations per hour.
- 32 GB is enough for OLTP. The critical insight: you do not need to cache 2.1 TB of data. You need to cache the 841 MB that OLTP transactions actually touch. The vast majority of the 2.1 TB is historical data that the DR workload never accesses. Understanding your working set is more important than having massive memory.
- Test before you need it. The quarterly validation tests ensure that the DR site will actually perform when a real disaster occurs. Without testing, the team would not have discovered the default configuration's 450 ms response time until an actual failover, when it would be too late to tune.