Case Study 2: Memory Configuration Crisis — When Buffer Pools Are Wrong

The Incident

It was the second Monday of January, the first business day after Meridian National Bank's annual interest-posting batch completed over the weekend. The call center was fully staffed, and customers were logging in to check their updated balances. At 10:47 AM, the monitoring dashboard turned red.

Average response time for the CUSTOMER_INQUIRY transaction — a simple lookup that should complete in 2-3 milliseconds — had jumped to 340 milliseconds. Within minutes, the web application servers began reporting connection pool exhaustion. The mobile app showed spinning wheels. Customer service representatives could not pull up account information. By 10:55 AM, the incident was escalated to severity 1.

DBA Marcus Chen was paged. He logged into the z/OS console and began his investigation.


The Environment (Before the Incident)

Meridian's z/OS DB2 subsystem MBPD had been running stably for months with the following buffer pool configuration:

Buffer Pool Configuration (as of Friday evening):

BP0  (4K,  15,000 pages =   60 MB) - Catalog, reference tables (BRANCH, EMPLOYEE)
BP2  (8K,  80,000 pages =  640 MB) - CUSTOMER and ACCOUNT data
BP3  (8K,  50,000 pages =  400 MB) - TRANSACTION data
BP4  (4K,  40,000 pages =  160 MB) - All indexes
BP5  (32K,  5,000 pages =  160 MB) - Work file (TEMP) tablespace

Total buffer pool memory: ~1,420 MB

Normal buffer pool hit ratios (BPHR):
  BP0: 99.8%
  BP2: 99.3%
  BP3: 94.5%
  BP4: 99.6%
  BP5: N/A (temp space, not meaningful)

The CUSTOMER table held 520,000 rows across approximately 28,000 data pages (8K). The ACCOUNT table held 1,250,000 rows across approximately 48,000 data pages. Combined, these tables occupied roughly 76,000 pages — fitting comfortably within BP2's 80,000-page allocation. The 99.3% hit ratio confirmed that virtually all customer and account data was resident in the buffer pool.
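A quick sanity check of that fit, using the page counts above (a sketch in Python; all figures are taken directly from the text):

```python
bp2_pages = 80_000            # BP2 allocation before the weekend
customer_pages = 28_000       # CUSTOMER table data pages (8K)
account_pages = 48_000        # ACCOUNT table data pages (8K)

working_set = customer_pages + account_pages
headroom = bp2_pages - working_set
print(f"working set: {working_set:,} pages")
print(f"headroom:    {headroom:,} pages ({headroom / bp2_pages:.0%} of the pool)")
```

The 4,000-page headroom, only 5% of the pool, is a number worth remembering for what follows.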


What Changed

Over the weekend, three things happened:

Change 1: The Annual Interest Posting Batch

The annual interest posting job ran on Saturday night. This job updated the BALANCE column in every row of the ACCOUNT table (1.25 million rows) and inserted an interest-posting TRANSACTION row for each account. The updates caused the ACCOUNT table's data pages to expand slightly — updated rows with larger decimal values in BALANCE occasionally caused page splits, growing the table from 48,000 pages to 53,000 pages.

Additionally, the TRANSACTION table grew by 1.25 million rows (the interest postings), adding approximately 8,000 data pages.

Change 2: A "Temporary" Buffer Pool Reduction

On Friday afternoon, a junior DBA named Kevin had been troubleshooting an unrelated problem with a reporting application that used the work file tablespace (BP5). The reporting queries were failing with sort overflow errors. Kevin's solution was to increase BP5 from 5,000 pages to 25,000 pages, an additional 640 MB of 32K memory. To partially offset the increase against the DBM1 address space memory budget, he reduced BP2 from 80,000 pages to 60,000 pages (freeing 160 MB), "temporarily, just for the weekend batch window." He planned to reverse the change on Monday morning but forgot.
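A quick check of the memory arithmetic behind these two resizes (a Python sketch using the page counts and page sizes from the configuration above) shows the swap was far from memory-neutral:

```python
# KB here means 1,024 bytes; sizes are reported in KB to match the page-size math.
bp5_added_kb = (25_000 - 5_000) * 32    # BP5 grew by 20,000 32K pages
bp2_freed_kb = (80_000 - 60_000) * 8    # BP2 shrank by 20,000 8K pages
net_kb = bp5_added_kb - bp2_freed_kb

print(f"BP5 added:  {bp5_added_kb:,} KB")
print(f"BP2 freed:  {bp2_freed_kb:,} KB")
print(f"net growth: {net_kb:,} KB")     # the DBM1 footprint still grew
```

Cutting BP2 recovered only a quarter of what BP5 consumed, and it was the quarter the OLTP workload depended on.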

Change 3: Monday Morning RUNSTATS

An automated job ran RUNSTATS on the CUSTOMER and ACCOUNT tables at 6:00 AM Monday. This updated the catalog statistics to reflect the new table sizes. The optimizer, seeing the updated statistics, began choosing slightly different access paths for some queries — nothing dramatic on its own, but contributing to a marginally different page-access pattern.


The Diagnosis

Marcus started with the buffer pool statistics. What he saw immediately explained the crisis:

Buffer Pool Statistics at 10:50 AM (Monday):

BP0  (4K,  15,000 pages) - BPHR: 99.7%  [Normal]
BP2  (8K,  60,000 pages) - BPHR: 72.4%  [CRITICAL - was 99.3%]
BP3  (8K,  50,000 pages) - BPHR: 88.1%  [Degraded - was 94.5%]
BP4  (4K,  40,000 pages) - BPHR: 96.2%  [Slightly degraded - was 99.6%]
BP5  (32K, 25,000 pages) - BPHR: 15.3%  [Low but normal for temp space]

BP2's hit ratio had dropped from 99.3% to 72.4%. This was the smoking gun.

Marcus checked the buffer pool size and immediately saw the problem: BP2 was at 60,000 pages, not 80,000. The combined CUSTOMER and ACCOUNT table data now occupied approximately 81,000 pages (28,000 + 53,000), but the buffer pool could only hold 60,000.

The arithmetic was devastating:

  • Before: 76,000 data pages in an 80,000-page pool = nearly everything cached.
  • After: 81,000 data pages in a 60,000-page pool = at least 21,000 pages could not be cached. Every access to an uncached page required a physical I/O.

But the impact was worse than simple arithmetic suggested. The 60,000-page pool was not idle memory waiting for the right pages. It was thrashing — the LRU algorithm was constantly evicting pages to make room for newly requested pages, only to need those evicted pages again moments later. The effective hit ratio of 72.4% meant that more than one in four page requests was going to disk.

At 2-5 milliseconds per physical read, this turned a sub-millisecond buffer pool hit into a multi-millisecond disk wait. With multiple index accesses and data page fetches per transaction, the compound effect was a 100x slowdown.
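The thrashing behavior is easy to reproduce with a small LRU simulation (a Python sketch; the page counts are scaled down and the access pattern is modeled as uniformly random — both are illustrative assumptions, not measured Meridian behavior):

```python
import random
from collections import OrderedDict

def lru_hit_ratio(accesses, pool_pages):
    """Replay a page-access trace against an LRU pool of pool_pages buffers."""
    pool = OrderedDict()
    hits = 0
    for page in accesses:
        if page in pool:
            hits += 1
            pool.move_to_end(page)          # mark as most recently used
        else:
            if len(pool) >= pool_pages:
                pool.popitem(last=False)    # evict the least recently used page
            pool[page] = True
    return hits / len(accesses)

random.seed(42)
WORKING_SET = 1_000                          # scaled-down working set
trace = [random.randrange(WORKING_SET) for _ in range(100_000)]

comfortable = lru_hit_ratio(trace, 1_100)    # pool above working set, like BP2 before
thrashing = lru_hit_ratio(trace, 740)        # pool at ~74% of working set, like after
print(f"pool above working set: {comfortable:.1%}")
print(f"pool at 74% of it:      {thrashing:.1%}")
```

With the pool larger than the working set, only cold-start misses occur; once the pool drops below it, the steady-state hit ratio falls to roughly the pool-to-working-set ratio — the same collapse Marcus saw in BP2.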

The Cascade

The buffer pool crisis triggered a cascade of secondary effects:

  1. CPU spike: Physical I/O operations require CPU for channel program management, interrupt handling, and I/O scheduling. The CPU utilization on the LPAR jumped from a normal 45% to 82%.

  2. Lock duration increase: Each transaction held its locks longer (because it was spending more time waiting for I/O). Longer lock hold times increased lock contention, causing other transactions to wait.

  3. Thread consumption: DB2 threads that would normally complete in 3ms were now taking 340ms. As active thread counts approached the subsystem maximum, new connection requests from the web servers began to queue.

  4. Log buffer pressure: More concurrent, slower transactions meant more log records in the log buffer simultaneously. Log buffer forces increased, adding latency to the MSTR address space.

  5. Application connection pool exhaustion: The web application servers had connection pools sized for 3ms response times. At 340ms, connections were checked out for 100x longer, exhausting the pools and causing application-level errors.
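The connection pool arithmetic in item 5 follows directly from Little's law (a sketch; the 200 requests-per-second arrival rate is an assumed figure for illustration):

```python
def busy_connections(arrival_rate_per_sec, response_time_ms):
    # Little's law: average connections in use = arrival rate * residence time.
    return arrival_rate_per_sec * response_time_ms / 1000.0

RATE = 200  # requests per second (assumed load)

print(f"at   3 ms: {busy_connections(RATE, 3):.1f} connections in use")
print(f"at 340 ms: {busy_connections(RATE, 340):.1f} connections in use")
```

A pool sized for the first number is exhausted at the second: the 100x increase in checkout time translates directly into a 100x increase in concurrent connection demand.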


The Fix

Marcus implemented the fix in two stages:

Immediate Fix (10:58 AM)

Marcus issued the DB2 command to resize BP2 back to its proper size:

-MBPD ALTER BUFFERPOOL(BP2) VPSIZE(80000)

He simultaneously reduced BP5 back to its original size:

-MBPD ALTER BUFFERPOOL(BP5) VPSIZE(5000)

The ALTER BUFFERPOOL command takes effect incrementally — DB2 allocates additional pages over time as they are needed. Within about 90 seconds, BP2 had grown to 80,000 pages. But the buffer pool was initially cold — it contained only 60,000 pages of data, and the other 20,000 pages were empty.

Over the next 5-10 minutes, as transactions continued to flow, the buffer pool gradually warmed up. Pages that had been evicted were re-read from disk and cached. By 11:10 AM, the hit ratio was climbing: 82%, then 89%, then 93%. By 11:20 AM, it had stabilized at 98.8%.

Response times followed the recovery curve: 340ms at 10:55, dropping to 120ms at 11:05, 25ms at 11:10, and back to 3ms by 11:20 AM. Total customer-facing impact: approximately 35 minutes.

Permanent Fix (that afternoon)

After the immediate crisis was resolved, Marcus took several longer-term actions:

  1. Increased BP2 to 100,000 pages (800 MB): The post-interest-posting table sizes required a larger pool. The previous 80,000-page allocation had only 4,000 pages of headroom, which was insufficient to absorb the growth.

  2. Established a buffer pool change-control procedure: Any buffer pool resize would require DBA team lead approval and could not be implemented on a Friday before a holiday or batch weekend.

  3. Created monitoring alerts: An alert would fire if any buffer pool hit ratio dropped below 95% during OLTP hours, giving the team early warning before the situation became critical.

  4. Documented the BP5 sort overflow root cause: The reporting application's sort problem was caused by an inefficient query (SELECT DISTINCT on a large table without appropriate indexes), not by an undersized work file buffer pool. Kevin's buffer pool change had treated the symptom, not the disease.

  5. Post-incident review: The team conducted a blameless postmortem. The key finding was not that Kevin made a mistake — everyone makes mistakes — but that the system had no guardrails to prevent a buffer pool reduction during a critical batch window.
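The alert described in item 3 amounts to a simple threshold check (a Python sketch; in practice the getpage and physical-read counters would come from DB2 statistics data, and the function names here are illustrative):

```python
ALERT_THRESHOLD = 0.95  # fire below 95% during OLTP hours, per item 3

def bphr(getpages, pages_read_from_disk):
    # Hit ratio: fraction of getpage requests satisfied without a physical read.
    if getpages == 0:
        return None  # an idle pool has no meaningful ratio
    return (getpages - pages_read_from_disk) / getpages

def check_pool(name, getpages, pages_read_from_disk):
    ratio = bphr(getpages, pages_read_from_disk)
    alert = ratio is not None and ratio < ALERT_THRESHOLD
    if alert:
        print(f"ALERT: {name} hit ratio {ratio:.1%} below {ALERT_THRESHOLD:.0%}")
    return ratio, alert

# BP2 at the moment of the incident: 72.4% would have fired immediately.
ratio, alert = check_pool("BP2", 1_000_000, 276_000)
```

The same check run against Monday's 6:00 AM statistics would have fired hours before the 10:47 AM customer impact.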


The Mathematics of Buffer Pool Impact

This incident illustrates the non-linear relationship between buffer pool sizing and performance: a 25% reduction in buffer pool size caused a 100x degradation in response time. Here is why:

The Working Set Problem

The working set is the set of pages actively being used by the current workload. If the buffer pool is larger than the working set, the hit ratio approaches 100%. If the buffer pool is smaller than the working set, pages are constantly evicted and re-read — a condition called thrashing.

Buffer Pool Performance Curve (conceptual):

Hit Ratio
100% |                          ___________________
     |                    ____/
 95% |                ___/
     |             __/
 90% |           _/
     |         _/
 80% |       _/
     |     _/
 70% |   _/
     | _/
 50% |/
     +--------------------------------------------
     0%       50%      100%     150%     200%
          Buffer Pool Size (as % of working set)

     The "knee" of the curve is around 80-95% of working set size.
     Below this, hit ratios degrade rapidly.
     Above this, additional memory has diminishing returns.

Before the incident, BP2 was at approximately 105% of the working set (80,000 pages for a 76,000-page working set). After the changes, it was at approximately 74% of the working set (60,000 pages for an 81,000-page working set). This pushed the system below the "knee" of the curve, where small changes in buffer pool size cause large changes in hit ratio.

The Compound Effect

A single customer inquiry transaction requires approximately:

  • 3-4 index page reads (navigating the primary key index)
  • 1-2 data page reads (fetching the customer row and related account rows)
  • Total: 5-6 page requests

At a 99.3% hit ratio, roughly 0.04 of those 6 requests miss the pool (about one physical I/O every 24 transactions). Average I/O contribution: 0.04 * 4ms = 0.16ms. Barely noticeable.

At a 72.4% hit ratio, roughly 1.66 of those 6 requests result in physical I/O. Average I/O contribution: 1.66 * 4ms = 6.6ms. Now add I/O queue wait times (because the I/O subsystem is handling 40x more requests): effective I/O latency increases from 4ms to perhaps 15ms due to queuing. Revised I/O contribution: 1.66 * 15ms = 25ms per transaction. And this does not account for the lock wait and thread contention cascade.
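The per-transaction arithmetic above can be checked directly (a sketch using the text's own approximations of 6 page requests per transaction and 4ms per physical read):

```python
def io_per_txn(hit_ratio, page_requests=6, io_ms=4.0):
    # Expected physical reads per transaction, and their latency cost.
    misses = (1 - hit_ratio) * page_requests
    return misses, misses * io_ms

healthy_misses, healthy_ms = io_per_txn(0.993)
degraded_misses, degraded_ms = io_per_txn(0.724)
print(f"99.3% hit ratio: {healthy_misses:.2f} misses, {healthy_ms:.2f} ms I/O per txn")
print(f"72.4% hit ratio: {degraded_misses:.2f} misses, {degraded_ms:.2f} ms I/O per txn")
# With queuing pushing effective read latency to ~15 ms:
print(f"with queuing:    {degraded_misses * 15:.0f} ms I/O per txn")
```

The jump from a fraction of a millisecond to roughly 25ms of I/O per transaction comes entirely from the hit ratio change; the lock and thread cascade sits on top of that.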

The lesson: buffer pool sizing is a cliff, not a slope. When you are above the cliff, everything is fast. When you fall below it, everything breaks at once.


Discussion Questions

  1. Kevin increased BP5 to solve a sort overflow problem. What should he have done instead? How do you distinguish between "not enough buffer pool memory" and "the query needs optimization"?

  2. The monitoring system did not alert on the buffer pool change. Design an alert strategy that would catch this problem before it affects customers. What metrics would you monitor, and what thresholds would you set?

  3. The fix took about 35 minutes from first alert to full recovery. Could this have been automated? What are the risks of automatically resizing buffer pools?

  4. Marcus increased BP2 to 100,000 pages as a permanent fix, providing ~19,000 pages of headroom. How would you determine the right amount of headroom? What factors affect future growth?

  5. This incident occurred because a change was made on Friday and not reversed on Monday. How would you design a change management process that prevents this class of error? Consider both technical controls (automated reversal) and procedural controls (checklists, approvals).

  6. If this system had been running on DB2 LUW with STMM enabled, would the incident have occurred? Discuss both the benefits and the limitations of STMM in preventing this type of problem.


Hands-On Exercise

Using the following simplified buffer pool statistics, calculate the performance impact of a buffer pool reduction.

Scenario: A buffer pool has 50,000 pages and a working set of 45,000 pages. The current BPHR is 99.1%. Management wants to reduce the pool to 35,000 pages to "save memory" for another application.

  1. Estimate the new hit ratio (assume it will be approximately proportional to pool_size / working_set when below the working set, with a floor around 65-70%).
  2. Calculate the additional physical I/Os per hour if the system processes 500,000 page requests per hour.
  3. If each physical I/O takes 4ms and the I/O subsystem can handle 5,000 I/Os per second before queuing delays begin, will the reduction cause queuing?
  4. Write a recommendation to management with your findings.
Worked Solution

  1. Estimated hit ratio: With 35,000 pages for a 45,000-page working set, the pool is at 78% of the working set. Based on the performance curve, the estimated hit ratio is approximately 78-82%. Let us use 80%.

  2. Additional physical I/Os:
       • Current: (1 - 0.991) * 500,000 = 4,500 physical I/Os per hour
       • Projected: (1 - 0.80) * 500,000 = 100,000 physical I/Os per hour
       • Additional: 95,500 I/Os per hour, a 22x increase

  3. Queuing analysis:
       • Current I/O rate: 4,500 / 3,600 = 1.25 I/Os per second (well below the 5,000 limit)
       • Projected I/O rate: 100,000 / 3,600 = 27.8 I/Os per second (still below 5,000, so no queuing from this pool alone)
       • However, this analysis covers only one buffer pool. Combined with I/O from other pools, batch jobs, logging, and other workloads, the total I/O load could approach the threshold.

  4. Recommendation: "Reducing BP from 50,000 to 35,000 pages will decrease the buffer pool hit ratio from 99.1% to approximately 80%, increasing physical disk I/Os by a factor of 22. This will directly increase transaction response times by an estimated 10-15ms per transaction (from the current sub-2ms to 12-17ms). For a customer-facing OLTP system, this degradation is likely unacceptable. The 120 MB of memory freed (15,000 * 8K) does not justify the performance impact. I recommend keeping the current allocation and finding memory for the other application through other means (additional RAM, workload scheduling, or server consolidation)."
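The solution's arithmetic can be verified in a few lines (a sketch using the exercise's figures):

```python
requests_per_hour = 500_000
current_hr, projected_hr = 0.991, 0.80

current_io = (1 - current_hr) * requests_per_hour     # physical I/Os per hour now
projected_io = (1 - projected_hr) * requests_per_hour # after the proposed reduction

print(f"additional I/Os: {projected_io - current_io:,.0f} per hour")
print(f"increase factor: {projected_io / current_io:.0f}x")
print(f"projected rate:  {projected_io / 3600:.1f} I/Os per second")
```

The projected rate stays under the stated 5,000 I/Os-per-second threshold, which is why the recommendation rests on response-time impact rather than on I/O subsystem saturation.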