Case Study 33.1: Connection Pool Tuning for 10,000 Concurrent Users
Background
Meridian Bank is launching a new mobile banking application expected to serve 10,000 concurrent users during peak hours (7-9 AM and 5-7 PM weekdays). The backend is a Java Spring Boot REST API running on Kubernetes, connecting to DB2 for z/OS through JDBC Type 4 drivers.
The architecture team initially configured connection pools generously, reasoning that more connections would provide better throughput and lower latency. Within the first week of the pilot launch with 1,000 users, they encountered severe performance degradation that threatened the full production launch.
Initial Configuration
The initial deployment consisted of:
- 4 Kubernetes pods (application replicas behind a load balancer)
- Each pod: HikariCP with maximumPoolSize=200
- Total potential connections to DB2: 4 x 200 = 800
- DB2 z/OS MAXDBAT: 1,000 (maximum distributed threads)
// Initial (problematic) HikariCP configuration
HikariConfig config = new HikariConfig();
config.setMaximumPoolSize(200); // Way too high
config.setMinimumIdle(50); // 50 idle connections per pod at all times
config.setConnectionTimeout(30000); // 30 seconds
config.setIdleTimeout(600000); // 10 minutes
config.setMaxLifetime(3600000); // 1 hour — too long
// No leak detection configured
// No connection test query configured
Symptoms
During the pilot with 1,000 concurrent users, four problems emerged.
Problem 1: DB2 Thread Exhaustion
With 4 pods each maintaining 50+ idle connections, DB2 had over 200 active distributed threads even during low-traffic periods. Each DB2 thread consumes:
- 1-4 MB of working storage (EDM pool, thread-local storage)
- Lock hash chain entries
- Log buffer reservation slots
- CPU cycles for thread management housekeeping
DB2 monitoring revealed alarming numbers:
MAXDBAT (maximum distributed threads) = 1,000
Active distributed threads at peak = 847
Queued threads (waiting for a slot) = 23
With only 1,000 users, the system was already at 85% of its thread capacity. Scaling to 10,000 users with this configuration would require 8,000 potential connections -- eight times the DB2 maximum.
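The capacity arithmetic above can be sketched directly. The pod counts and pool sizes come from the case study; the 40-pod figure assumes a naive 10x scale-out of the original 4-pod deployment.

```java
// Back-of-envelope check: do worst-case pool connections fit under DB2's
// MAXDBAT limit? Numbers are the case study's; the 40-pod scale-out is a
// naive 10x projection of the pilot deployment.
public class ThreadCapacityCheck {
    static int potentialConnections(int pods, int maxPoolSizePerPod) {
        return pods * maxPoolSizePerPod;
    }

    public static void main(String[] args) {
        int maxDbat = 1_000;
        int pilot = potentialConnections(4, 200);   // 800: already near the limit
        int scaled = potentialConnections(40, 200); // naive 10x scale-out: 8,000
        System.out.println(pilot <= maxDbat);       // fits, but barely
        System.out.println(scaled <= maxDbat);      // 8x over MAXDBAT
    }
}
```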
Problem 2: Lock Escalation
More concurrent DB2 threads meant more row-level locks held simultaneously across the ACCOUNTS tablespace. When the lock count exceeded the tablespace's LOCKMAX threshold, DB2 escalated from row-level to table-level locks:
DSNT501I LOCK ESCALATION ON TABLESPACE BANKING.ACCOUNTS_TS
LOCK MODE CHANGED FROM ROW TO TABLE
Table-level locks serialized all account operations, turning a highly concurrent workload into a sequential one. Response times spiked from 45ms to over 2 seconds during escalation events.
Problem 3: Connection Pool Waste
HikariCP metrics told a clear story of waste:
Active connections: 35 (average), 52 (peak)
Idle connections: 148 (average)
Total connections: 200
Pool utilization: 17.5% average
The pool maintained 200 connections per pod, but only 35 were active on average. Each idle connection still consumed a DB2 distributed thread because, with active thread behavior (CMTSTAT=ACTIVE), DB2 z/OS keeps the thread allocated for the lifetime of the DRDA connection.
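HikariCP exposes these counters at runtime through its HikariPoolMXBean (getActiveConnections(), getTotalConnections(), and so on). A minimal sketch of the utilization calculation, with the pilot's observed averages hard-coded since no live pool is available here:

```java
// Sketch: pool utilization from HikariCP's JMX counters. In a live app the
// inputs would come from dataSource.getHikariPoolMXBean(); here the pilot's
// observed averages (35 active of 200 total) are hard-coded.
public class PoolUtilization {
    static double utilizationPercent(int activeConnections, int totalConnections) {
        return 100.0 * activeConnections / totalConnections;
    }

    public static void main(String[] args) {
        System.out.println(utilizationPercent(35, 200)); // 17.5
    }
}
```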
Problem 4: Increased Latency
Counter-intuitively, the oversized pool increased average response time from a baseline of 45ms to 120ms. The mechanisms behind this degradation:
- CPU contention: 847 DB2 threads competing for CPU cycles on the z/OS LPAR caused higher dispatch wait times.
- Internal latch contention: More threads meant more contention for DB2 internal latches (buffer pool, EDM pool, log buffer).
- Log write serialization: More concurrent writers meant more contention for the DB2 log buffer and log write I/O.
- Memory pressure: Thread storage consumed EDM pool space that would otherwise cache database objects and access plans.
Root Cause Analysis
The team conducted a systematic throughput analysis.
Actual Connection Demand
Concurrent users: 1,000
Average requests per user per minute: 2
Total requests per minute: 2,000
Requests per second: 33
Average request duration: 50ms
Average DB queries per request: 3
Average query duration: 10ms
Actual concurrent connections needed:
= requests_per_second * avg_request_duration_seconds
= 33 * 0.05
= 1.65 connections total
Per pod (4 pods):
= 1.65 / 4
= 0.41 connections per pod
With 3x safety margin:
= 1.65 * 3 / 4
= ~1.25 connections per pod
Generous upper bound (10x):
= 16.5 / 4
= ~4 connections per pod
The analysis revealed that 4-5 connections per pod would have handled the 1,000-user pilot load comfortably. The 200-connection configuration was 40-50x oversized.
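The sizing arithmetic is essentially Little's Law: in-flight work equals arrival rate times duration. A sketch using the pilot's numbers:

```java
// Sketch of the root-cause sizing arithmetic (Little's Law:
// concurrent connections = requests/sec x avg request duration x safety factor).
// Inputs are the pilot's measured numbers.
public class PoolSizing {
    static double connectionsNeeded(double requestsPerSecond,
                                    double avgRequestDurationSec,
                                    double safetyFactor) {
        return requestsPerSecond * avgRequestDurationSec * safetyFactor;
    }

    public static void main(String[] args) {
        double total = connectionsNeeded(33, 0.05, 1.0);             // ~1.65 total
        double perPodUpper = connectionsNeeded(33, 0.05, 10.0) / 4;  // ~4 per pod
        System.out.println(total);
        System.out.println(perPodUpper);
    }
}
```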
DB2 Thread Cost Analysis
Each distributed thread on z/OS consumed:
| Resource | Per Thread | 800 Threads Total |
|---|---|---|
| EDM pool (thread storage) | ~2 MB | 1.6 GB |
| Lock hash entries | ~50 | 40,000 |
| Log buffer slots | 1 | 800 |
| CPU (thread management) | 0.1ms/sec | 80ms/sec |
The 1.6 GB of EDM pool consumed by idle thread storage was space that should have been caching DB2 catalog objects and compiled access plans, resulting in additional CPU overhead from cache misses.
Optimized Configuration
After the analysis, the team deployed a right-sized configuration:
// Optimized HikariCP configuration
HikariConfig config = new HikariConfig();
config.setMaximumPoolSize(20); // Right-sized for projected peak
config.setMinimumIdle(5); // Minimal idle footprint
config.setConnectionTimeout(10000); // 10 sec — fail fast
config.setIdleTimeout(300000); // 5 min idle eviction
config.setMaxLifetime(1800000); // 30 min max age
// Connection validation
config.setConnectionTestQuery("SELECT 1 FROM SYSIBM.SYSDUMMY1");
config.setValidationTimeout(5000);
// Leak detection — critical
config.setLeakDetectionThreshold(30000); // Warn at 30 sec
// DB2 performance properties
config.addDataSourceProperty("fetchSize", "100");
config.addDataSourceProperty("currentSchema", "BANKING");
config.addDataSourceProperty("queryDataSize", "32767");
DB2 z/OS parameters were also adjusted:
MAXDBAT = 200 (reduced from 1,000)
CONDBAT = 250 (maximum inbound distributed connections; must be >= MAXDBAT)
IDTHTOIN = 120 (idle thread timeout: 2 minutes)
The IDTHTOIN parameter was key -- it reclaims idle DB2 threads after 2 minutes of inactivity, preventing resource waste even if the application pool maintains idle connections.
Results
| Metric | Before (200 per pod) | After (20 per pod) | Change |
|---|---|---|---|
| Avg response time | 120ms | 38ms | -68% |
| P99 response time | 850ms | 180ms | -79% |
| DB2 active threads (peak) | 847 | 62 | -93% |
| Lock escalations per hour | 45 | 0 | -100% |
| DB2 CPU per transaction | 0.85ms | 0.52ms | -39% |
| Idle connection waste | 592 connections | 12 connections | -98% |
| EDM pool used by threads | 1.6 GB | 160 MB | -90% |
Every metric improved dramatically. The lock escalations dropped to zero because fewer concurrent threads meant fewer simultaneous row locks, staying well below the escalation threshold.
Scaling to 10,000 Users
With the optimized 20-connection-per-pod configuration, the team designed the production architecture:
- 20 Kubernetes pods (scaled from 4 to handle 10x user load)
- 20 connections per pod = 400 maximum connections
- DB2 MAXDBAT = 500 (400 application + headroom for admin and batch)
Under full 10,000-user load (monitored over the first production week):
Total requests per second: ~330
Active connections (total): ~80 average, ~120 peak
Idle connections (total): ~280
DB2 thread utilization: 24% average, 36% peak
Average response time: 42ms
P99 response time: 195ms
Lock escalation events: 0
The system handled 10,000 concurrent users with comfortable headroom for growth. At 36% peak thread utilization, the architecture could support approximately 28,000 users before needing additional capacity.
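The headroom figure is a linear extrapolation from peak thread utilization. A sketch of that estimate; note that linearity is an assumption, since contention grows non-linearly as a system approaches saturation:

```java
// Sketch of the headroom extrapolation: if 10,000 users drive DB2 thread
// utilization to 36% at peak, a linear projection gives the user count that
// would saturate the threads. Linearity is an assumption; real contention
// grows non-linearly near saturation, so treat this as an upper bound.
public class HeadroomEstimate {
    static long maxSupportedUsers(long currentUsers, double peakUtilization) {
        return Math.round(currentUsers / peakUtilization);
    }

    public static void main(String[] args) {
        System.out.println(maxSupportedUsers(10_000, 0.36)); // ~27,778 (~28,000)
    }
}
```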
Key Lessons
1. Fewer connections equals better performance
Each connection is a DB2 thread. More threads mean more contention for CPU, memory, latches, and the log buffer. Right-size your pool based on actual concurrent query demand, not user count. The formula connections_needed = requests_per_second * avg_request_duration * safety_factor is the starting point.
2. Users are not connections
A user with 8-second think time and 50ms request duration occupies a database connection for only about 0.6% of their session time. Multiplying the user count by a connections-per-user factor is the single most common pool-sizing mistake.
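The 0.6% figure follows from the think-time arithmetic, assuming the connection is held only for the duration of the request:

```java
// Sketch: fraction of session time a user actually occupies a connection,
// using the lesson's numbers (8 s think time, 50 ms request duration) and
// assuming the connection is held only while the request executes.
public class ThinkTimeMath {
    static double connectionBusyPercent(double requestMs, double thinkMs) {
        return 100.0 * requestMs / (requestMs + thinkMs);
    }

    public static void main(String[] args) {
        System.out.println(connectionBusyPercent(50, 8_000)); // ~0.62
    }
}
```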
3. Monitor the right metrics
- ThreadsAwaitingConnection > 0 sustained: the pool is too small.
- IdleConnections >> ActiveConnections: the pool is too large.
- The ideal pool has minimal idle connections and zero (or near-zero) waiting threads.
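These rules of thumb can be turned into a simple health check. The counter names mirror HikariCP's real HikariPoolMXBean getters (getThreadsAwaitingConnection, getActiveConnections, getIdleConnections), but the 3x idle-to-active threshold below is illustrative, not a HikariCP default:

```java
// Hypothetical pool health check based on the monitoring rules above.
// The 3x idle-to-active ratio is an illustrative threshold, not a standard.
public class PoolHealth {
    enum Verdict { OK, TOO_SMALL, TOO_LARGE }

    static Verdict assess(int threadsAwaiting, int active, int idle) {
        if (threadsAwaiting > 0) return Verdict.TOO_SMALL;            // sustained waiters
        if (idle > 3 * Math.max(active, 1)) return Verdict.TOO_LARGE; // mostly idle
        return Verdict.OK;
    }

    public static void main(String[] args) {
        System.out.println(assess(0, 35, 148)); // TOO_LARGE: the pilot's pool
        System.out.println(assess(5, 20, 0));   // TOO_SMALL: requests queuing
        System.out.println(assess(0, 15, 5));   // OK
    }
}
```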
4. DB2 z/OS is especially sensitive to thread count
Each distributed thread consumes EDM pool memory that competes with the plan cache and catalog cache. Oversized pools on z/OS have a direct cost: more thread storage means more plan cache misses means more CPU consumed on access plan compilation.
5. Set IDTHTOIN on z/OS
This parameter reclaims idle DB2 threads on the server side, independent of the application pool. Without it, idle threads persist for the lifetime of the TCP connection, wasting resources even when the application is idle.
6. Leak detection is not optional
During the investigation, the team also discovered a code path that failed to close a connection in an error case. Without leakDetectionThreshold, this leak would have gone undetected until pool exhaustion in production. With it, they received a log warning within 30 seconds of the leak occurring.
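The guarantee the leaked code path lacked is the one try-with-resources provides: the resource is closed on every exit path, including exceptions. A minimal demonstration with no real database, where FakeConnection stands in for java.sql.Connection:

```java
// Minimal demonstration (no real database) that try-with-resources closes
// a resource even when the body throws. FakeConnection is a stand-in for
// java.sql.Connection; the "leaky" shape mirrors the bug the team found.
public class LeakDemo {
    static class FakeConnection implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    // Buggy shape: on the error path, close() is never reached.
    static FakeConnection leaky() {
        FakeConnection c = new FakeConnection();
        try {
            throw new RuntimeException("query failed");
        } catch (RuntimeException e) {
            return c; // error swallowed, connection never closed -> leak
        }
    }

    // Fixed shape: try-with-resources closes the connection on every path.
    static FakeConnection safe() {
        FakeConnection c = new FakeConnection();
        try (FakeConnection managed = c) {
            throw new RuntimeException("query failed");
        } catch (RuntimeException e) {
            return c; // close() already ran before the catch block
        }
    }

    public static void main(String[] args) {
        System.out.println(leaky().closed); // false: leaked
        System.out.println(safe().closed);  // true: returned to the pool
    }
}
```

With leakDetectionThreshold set, the leaky shape produces a warning (with the owning thread's stack trace) once the connection is held past the threshold, which is how the team caught it.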
Discussion Questions
1. If the average query duration increases from 10ms to 40ms due to a schema change, how should the pool size be recalculated? Would you need to add more pods?
2. The team chose maxLifetime=1800000 (30 minutes). What specific failure modes does this protect against? What would happen if maxLifetime were set to 24 hours?
3. How would you configure the pool differently for a batch data pipeline (few connections, long-running queries) versus an OLTP API (many connections, short queries)?
4. The IDTHTOIN=120 setting on DB2 z/OS reclaims idle threads after 2 minutes. What is the interaction between this setting and HikariCP's minimumIdle=5? Could they conflict?
5. If Meridian Bank adds a second DB2 subsystem for read replicas, how should the application's pool configuration change?