Case Study 36.1: Full System Stress Test — What We Found
Background
Pacific Coast Savings Bank (PCSB) is a mid-sized regional bank with 680,000 customers and 1.4 million accounts. Their core banking system runs on DB2 11.5.8 for LUW on a two-node HADR pair (RHEL 8.6). The bank had never performed a formal stress test — their production system had simply grown organically over eight years, and performance was "good enough" during normal operations.
The trigger for the stress test was a planned acquisition. PCSB was about to merge with a smaller bank, adding 220,000 customers and 480,000 accounts. The CTO asked: "Can our systems handle the combined workload?"
The Team
- Lead DBA (the narrator): 12 years of DB2 experience, responsible for the stress test design and execution.
- Application DBA: 6 years of experience, responsible for identifying the critical SQL workload.
- System Administrator: Managed the Linux infrastructure and monitoring tools.
- Application Team Lead: Provided transaction volume projections and application behavior documentation.
Phase 1: Baseline Capture (Week 1)
We began by establishing what "normal" looked like. We instrumented the production database with 15-minute monitoring snapshots for five business days (Monday through Friday). Key baseline metrics:
| Metric | Average | Peak (Month-End) |
|---|---|---|
| Transactions per second | 1,100 | 2,400 |
| Average response time | 8 ms | 18 ms |
| Buffer pool hit ratio | 99.4% | 98.1% |
| Lock wait time (avg) | 3 ms | 12 ms |
| CPU utilization | 22% | 48% |
| Active connections | 180 | 340 |
These numbers looked healthy. The question was whether they would remain healthy at 1.5x volume (the post-merger projection).
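The buffer pool hit ratio in the table is the standard derivation from logical versus physical page reads. A minimal sketch of the computation; the counter values below are hypothetical, chosen only to illustrate the baseline 99.4% figure, not PCSB's actual monitor output:

```python
def hit_ratio(logical_reads: int, physical_reads: int) -> float:
    """Fraction of page requests satisfied from the buffer pool
    rather than from disk."""
    return (logical_reads - physical_reads) / logical_reads

# Hypothetical counters: 1,000,000 logical reads, 6,000 went to disk.
print(f"{hit_ratio(1_000_000, 6_000):.1%}")  # 99.4%
```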
Phase 2: Workload Analysis (Week 1-2)
The Application DBA captured the top 50 SQL statements by CPU consumption from the package cache. We categorized them:
- 30 statements: Account inquiry (read-only, high frequency)
- 8 statements: Fund transfer and payment processing (read-write)
- 5 statements: Batch processing (nightly interest calculation, statement generation)
- 4 statements: Regulatory reporting (complex aggregation)
- 3 statements: Administrative (RUNSTATS, monitoring queries)
We profiled the transaction mix by hour of day and discovered something unexpected: the interest calculation batch job, which started at 2:00 AM Eastern, overlapped with the West Coast branches' end-of-day processing, which ran until 11:30 PM Pacific (2:30 AM Eastern). This overlap had never been noticed because it affected only a small number of tellers in three branches: they simply experienced "slow" systems for 30 minutes each night and never reported it.

Phase 3: Stress Test Design (Week 2)
We designed the stress test to simulate post-merger conditions:
- 1.5x the baseline transaction rate across all transaction types.
- 1.5x the concurrent connection count (adding connections for the acquired bank's branch network).
- Month-end conditions (the highest-stress scenario).
- Concurrent batch processing (to reproduce the 2:00 AM overlap discovered during profiling).
We built the workload using Apache JMeter with the DB2 JDBC driver. Each JMeter thread represented a bank teller or online banking session, executing a randomized sequence of transactions with realistic think times (2-5 seconds between actions, matching observed user behavior).
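The driver logic each JMeter thread implemented can be sketched in Python. The transaction names and mix weights below are illustrative assumptions, not PCSB's actual workload profile; only the 2-5 second think-time range comes from the text:

```python
import random

# Hypothetical transaction mix, loosely reflecting the workload analysis
# (read-heavy inquiries dominate); the weights are assumptions.
TRANSACTION_MIX = [
    ("account_inquiry", 0.60),
    ("fund_transfer", 0.25),
    ("payment", 0.10),
    ("regulatory_report", 0.05),
]

def next_transaction(rng: random.Random) -> str:
    """Pick the next transaction type according to the weighted mix."""
    names, weights = zip(*TRANSACTION_MIX)
    return rng.choices(names, weights=weights, k=1)[0]

def think_time(rng: random.Random) -> float:
    """Uniform 2-5 second pause between actions, matching observed users."""
    return rng.uniform(2.0, 5.0)

# One simulated teller session: five actions with realistic pauses.
rng = random.Random(42)
session = [(next_transaction(rng), think_time(rng)) for _ in range(5)]
```

In the real test, each (transaction, pause) pair maps to a JDBC sampler plus a timer in JMeter; the sketch only shows the randomization scheme.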
The test environment was a clone of production, restored from the previous weekend's backup, with the acquired bank's data loaded on top (from a data migration extract).
Phase 4: Execution and Findings (Week 3)
We ran the stress test for 8 hours — simulating a full business day at post-merger volume. The results were sobering.
Finding 1: Buffer Pool Collapse
At 1.2x load (before even reaching the 1.5x target), the buffer pool hit ratio for the main data tablespace dropped from 99.4% to 78.3%. Investigation revealed the cause: the buffer pool was sized at 32 GB, which was sufficient for the current data working set but not for the expanded customer base. The additional 220,000 customers meant more pages in the CUSTOMER and ACCOUNT tables being accessed concurrently, exceeding the buffer pool's capacity to hold the active working set.
Impact: At 78.3% hit ratio, 21.7% of data page reads went to disk. At 1.5x transaction volume, this translated to approximately 3,400 additional physical I/O operations per second — enough to saturate the storage controller's read cache.
Resolution: Increase buffer pool from 32 GB to 64 GB. The server had 128 GB of RAM with only 62 GB allocated to DB2. The cost: zero (memory was already installed).
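The ~3,400 extra physical reads per second can be reproduced with back-of-envelope arithmetic. The logical-reads-per-transaction figure below is an assumption chosen to illustrate the mechanism, not a measured PCSB value:

```python
def extra_physical_reads(tps: float, logical_reads_per_txn: float,
                         baseline_hit: float, degraded_hit: float) -> float:
    """Additional physical page reads per second caused by a drop
    in buffer pool hit ratio at a given transaction rate."""
    logical_per_sec = tps * logical_reads_per_txn
    return logical_per_sec * ((1 - degraded_hit) - (1 - baseline_hit))

# Assuming ~4.5 logical page reads per transaction at 3,600 TPS
# (1.5x the 2,400 TPS month-end peak):
print(round(extra_physical_reads(3_600, 4.5, 0.994, 0.783)))  # 3418
```

This lands near the ~3,400/s figure from the finding; a small bump in miss ratio at high throughput translates into thousands of disk reads per second.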
Finding 2: Lock Escalation Storm
At 1.4x load, the interest calculation batch job triggered lock escalations on the ACCOUNT table. The batch job acquired row-level locks on every account as it calculated and posted interest. At the current volume of 1.4 million accounts, it held approximately 1.4 million locks simultaneously — exceeding the LOCKLIST capacity and triggering escalation to a table-level exclusive lock. This locked out all teller transactions for 45 seconds until the batch committed.
Impact: 340 concurrent teller sessions experienced a 45-second freeze. In a bank, this means 340 tellers staring at frozen screens, 340 customers waiting, and potentially hundreds of ATM transactions timing out.
Resolution: Three changes:
1. Restructure the batch job to commit every 10,000 accounts instead of running as a single transaction.
2. Increase LOCKLIST from 8,192 pages to 32,768 pages.
3. Increase MAXLOCKS from 22% to 40%.
After these changes, the re-test showed zero lock escalations.
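The restructured batch's commit pattern can be sketched in generic DB-API terms. Here sqlite3 stands in for DB2, and the table and column names are hypothetical; in DB2 the loop would run over a cursor declared WITH HOLD so it survives each COMMIT:

```python
import sqlite3

COMMIT_INTERVAL = 10_000  # accounts per unit of work, per the remediation

def post_interest(conn: sqlite3.Connection, rate: float) -> int:
    """Post interest account-by-account, committing every COMMIT_INTERVAL
    rows so at most ~10,000 row locks are held at any one time (sketch)."""
    cur = conn.cursor()
    ids = [row[0] for row in cur.execute("SELECT id FROM account ORDER BY id")]
    done = 0
    for account_id in ids:
        cur.execute(
            "UPDATE account SET balance = balance * (1 + ?) WHERE id = ?",
            (rate, account_id))
        done += 1
        if done % COMMIT_INTERVAL == 0:
            conn.commit()   # releases row locks, bounds LOCKLIST usage
    conn.commit()           # final partial batch
    return done
```

The trade-off (raised in the discussion questions) is that a failure mid-run leaves some accounts posted and some not, so the job needs a restart checkpoint; the single-transaction version rolls back cleanly but holds 1.4 million locks.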
Finding 3: Sort Heap Exhaustion
The regulatory reporting queries, which ran during business hours for ad-hoc compliance inquiries, caused massive sort overflows at 1.3x load. The GROUP BY and ORDER BY clauses on multi-million-row result sets exceeded the sort heap, spilling to temporary tablespace. With multiple concurrent reports, the temporary tablespace I/O became a bottleneck.
Impact: Regulatory report execution time increased from 45 seconds to 12 minutes under load.
Resolution: Create a materialized query table (MQT) for the most common regulatory aggregation. The MQT is refreshed nightly during the batch window. Ad-hoc reports query the MQT instead of the base tables, reducing the sort requirement by 98%.
Finding 4: The Unexpected Connection Pool Problem
At 1.5x concurrent connections (510 simultaneous connections), the database rejected new connections with SQL1040N (maximum number of applications reached). The MAXAPPLS parameter was set to 500 — sufficient for current operations but not for the post-merger connection count.
Impact: Branch applications at the acquired bank's locations would receive connection errors during peak hours.
Resolution: Increase MAXAPPLS from 500 to 800. Also reviewed the application connection pool settings — the acquired bank's application used a different connection pool library with more aggressive connection retention, holding connections open longer than necessary.
Finding 5: Logging Bottleneck
At peak stress (1.5x with concurrent batch), the active log space was 92% full. The log archiving process, which writes to an NFS share, could not keep pace with the log generation rate. If the active logs had filled completely, DB2 would have suspended all write operations until log space was freed.
Impact: Near-miss for a complete database freeze. If the test had run 20 minutes longer, the active logs would have filled.
Resolution: Increase LOGPRIMARY from 20 to 40 and LOGFILSIZ from 10,000 to 20,000 4K pages. Move log archiving from NFS to a local SSD volume, with asynchronous replication to the remote archive.
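The headroom this buys is simple arithmetic over the parameters named above (primary log files times pages per file times the 4 KB log page size); secondary logs are ignored in this sketch:

```python
PAGE_BYTES = 4096  # LOGFILSIZ is expressed in 4K pages

def active_log_bytes(logprimary: int, logfilsiz_pages: int) -> int:
    """Total primary active log space in bytes."""
    return logprimary * logfilsiz_pages * PAGE_BYTES

before = active_log_bytes(20, 10_000)  # ~0.8 GB
after = active_log_bytes(40, 20_000)   # ~3.3 GB, 4x the original space
```

Quadrupling the active log space buys time for the archiver to catch up during bursts, but the durable fix is the archiving-throughput change (local SSD instead of NFS).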
Phase 5: Remediation and Re-Test (Week 4)
After implementing all five remediations, we re-ran the stress test. Results at 1.5x load:
| Metric | Target | Re-Test Result | Status |
|---|---|---|---|
| Throughput (TPS) | >3,600 | 3,940 | PASS |
| Average response time | <25 ms | 16 ms | PASS |
| 95th percentile RT | <200 ms | 89 ms | PASS |
| Buffer pool hit ratio | >95% | 98.8% | PASS |
| Lock escalations / hr | 0 | 0 | PASS |
| Sort overflows / hr | <10 | 2 | PASS |
| Active log utilization | <70% | 44% | PASS |
| CPU utilization | <80% | 61% | PASS |
Every metric passed with comfortable margins.
Lessons Learned
- Never assume "good enough" will scale. PCSB's system performed well at current load but had five issues lurking beneath the surface that only appeared under increased volume.
- Batch-OLTP overlap is a hidden killer. The 2:30 AM overlap had gone unnoticed for years because it affected only three West Coast branches. At higher volume, it would have become a system-wide outage.
- Buffer pool sizing is not "set and forget." As data volume grows, the working set grows. Buffer pools must be re-evaluated periodically, not just when performance degrades.
- Log throughput is a capacity dimension people forget. Storage capacity, CPU, and memory get planned for. Log archiving throughput often does not, until it becomes a crisis.
- The stress test paid for itself immediately. The total cost of the remediation was approximately $15,000 (mostly the SSD for log archiving; all other fixes were configuration changes). Without the stress test, the bank would have discovered these issues in production, during the post-merger transition, with real customers affected. That cost would have been measured in customer trust, regulatory scrutiny, and operational crisis management.
Discussion Questions
- If you were the lead DBA, what additional scenarios would you have included in the stress test beyond the five transaction types tested?
- The interest calculation batch job was restructured to commit every 10,000 accounts. What are the trade-offs of this approach versus a single large transaction? What happens if the batch fails at account 500,000 of 1.4 million?
- The MQT solution for regulatory reporting trades storage space and maintenance overhead for query performance. Under what circumstances would you recommend against an MQT?
- Finding 5 (logging bottleneck) was a near-miss. How would you design monitoring and alerting to detect this issue before it becomes critical in production?