Case Study 1: Enterprise DB2 Installation -- Planning a Production Deployment
Background
Regional Atlantic Financial Group (RAFG) is a mid-sized financial holding company with $18 billion in assets, operating 120 branches across six states. Their core banking platform runs on DB2 for z/OS version 12, hosted on an IBM z15 mainframe at their primary data center in Charlotte, North Carolina. A disaster recovery site in Richmond, Virginia, maintains a standby copy through GDPS (Geographically Dispersed Parallel Sysplex).
In addition to their mainframe systems, RAFG's digital banking team operates a suite of customer-facing applications backed by Db2 LUW running on Red Hat Enterprise Linux. These systems handle online banking, mobile applications, and an API gateway that serves third-party fintech partners.
RAFG has just completed a merger with Coastal Savings Bank, doubling their customer base from 400,000 to 820,000 accounts. The infrastructure team has been tasked with standing up new DB2 environments to support the merged entity -- both a new DB2 for z/OS data sharing group for the expanded core banking workload and a new Db2 LUW cluster for the digital platform.
The Challenge
Marcus Williams, RAFG's lead DBA, faces a complex installation and configuration project with multiple dimensions.
z/OS Track -- New Data Sharing Group:
The existing DB2 subsystem (DSN1) is a single-member system. To handle the doubled transaction volume -- projected at 15,000 transactions per second during peak hours -- the team must implement a two-member data sharing group. This requires careful DSNZPARM planning for both members, new coupling facility structures, and a buffer pool strategy that accounts for cross-system cache invalidation.
Marcus must determine values for critical system parameters:
- CTHREAD (maximum concurrent threads): currently set to 200, but peak analysis suggests the merged workload will need 400+ per member
- Buffer pool sizes: the current BP0 allocation of 50,000 pages is inadequate; the team needs to size BP0, BP1, BP2, and BP32K based on workload analysis
- EDM pool: with 3,200 packages after the merger (up from 1,800), the EDM pool needs expansion
- Log data sets: active log size must accommodate the increased write volume without forcing archive switches during peak batch windows
LUW Track -- New Digital Platform Cluster:
The digital banking team needs a new Db2 11.5 installation on a three-node Linux cluster with pureScale for high availability. The environment must support:
- 50,000 concurrent mobile banking sessions
- Sub-100 ms response time for balance inquiries
- 99.99% availability (52.6 minutes maximum annual downtime)
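The downtime budget follows directly from the availability target; a quick back-of-the-envelope check (a sketch, assuming a 365-day year):

```shell
# Downtime budget implied by a 99.99% availability target,
# assuming a 365-day year (525,600 minutes).
awk 'BEGIN {
  minutes_per_year = 365 * 24 * 60          # 525600
  allowed = minutes_per_year * (1 - 0.9999)
  printf "%.1f minutes/year\n", allowed     # prints 52.6 minutes/year
}'
```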
Configuration decisions include instance-level parameters (INTRA_PARALLEL), database-level parameters (SHEAPTHRES_SHR, LOGFILSIZ, LOGPRIMARY, NUM_IOCLEANERS, NUM_IOSERVERS), and buffer pool sizing for the specific workload profile.
Analysis and Planning
Marcus and his team adopt a structured planning methodology.
Step 1: Workload Characterization
Before touching any installation media, the team spends two weeks analyzing the combined workload:
- They extract SMF 101 (accounting) and SMF 102 (performance trace) records from both the existing RAFG system and Coastal Savings' DB2 subsystem
- On the LUW side, they capture db2pd snapshots and monitoring table function output during peak and off-peak periods
- They model the projected merged workload at 1x, 1.5x, and 2x current volume
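On the LUW side, the capture can be as simple as scheduled db2pd runs during the measurement windows; a sketch of the kind of commands involved (the database name DIGBANK is an assumption, not from the case):

```shell
# Snapshot key runtime metrics during peak and off-peak windows.
# DIGBANK is an assumed database name.
db2pd -db DIGBANK -applications     # connected applications and their states
db2pd -db DIGBANK -transactions     # in-flight units of work
db2pd -db DIGBANK -bufferpools      # logical/physical read counters
db2pd -db DIGBANK -logs             # log usage and current log position
```

Running these on a fixed schedule (for example via cron) gives comparable samples across the peak and off-peak periods the team is contrasting.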
Step 2: Sizing and Configuration
Using workload data, they develop configuration specifications:
For z/OS, the DSNZPARM parameters are organized into installation panels (DSNTIP1 through DSNTIP8), and Marcus documents specific values for each panel with justification. Key decisions include:
- Setting CTHREAD to 500 per member, with IDTHTOIN (idle thread timeout) at 120 seconds to reclaim unused threads
- Allocating BP0 at 80,000 pages, BP1 at 40,000 pages (indexes), BP2 at 20,000 pages (reference data), and BP32K at 30,000 pages (large-row tables)
- Sizing the EDM pool at 40,000 pages to accommodate 3,200+ packages with headroom
- Configuring active logs at 3,000 cylinders each, with 6 active/6 archive data sets per member
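The buffer pool sizes can be applied and verified online with DB2 operator commands, so the runbook does not need a subsystem restart for this step; a sketch of the command fragment:

```
-ALTER BUFFERPOOL(BP0) VPSIZE(80000)
-ALTER BUFFERPOOL(BP1) VPSIZE(40000)
-ALTER BUFFERPOOL(BP2) VPSIZE(20000)
-ALTER BUFFERPOOL(BP32K) VPSIZE(30000)
-DISPLAY BUFFERPOOL(BP0) DETAIL
```

The -DISPLAY BUFFERPOOL DETAIL output is also the source for the read counters used in the post-deployment hit ratio checks.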
For LUW, the configuration is captured in a parameter spreadsheet:
- LOGFILSIZ: 16384 4 KB pages (64 MB per log file) to handle burst transaction volumes
- LOGPRIMARY: 20 and LOGSECOND: 10, giving roughly 1.25 GB of primary log space, extensible to about 1.9 GB with secondary logs
- NUM_IOCLEANERS: 12 (page cleaners, sized to the number of CPU cores)
- NUM_IOSERVERS: 12 (prefetchers, matching the number of physical disks)
- SORTHEAP: 8192 for complex analytical queries from the reporting tier
- SHEAPTHRES_SHR: 1048576 (shared sort-heap threshold, in 4 KB pages)
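The spreadsheet values translate into db2 CLP commands along these lines (a sketch; DIGBANK is an assumed database name, and note that LOGFILSIZ is specified in 4 KB pages):

```shell
# Apply the planned database configuration (DIGBANK is an assumed name).
# 16384 pages x 4 KB = 64 MB per log file;
# 20 primary files = 1,280 MB; plus 10 secondary files = 1,920 MB.
db2 update db cfg for DIGBANK using LOGFILSIZ 16384 LOGPRIMARY 20 LOGSECOND 10
db2 update db cfg for DIGBANK using NUM_IOCLEANERS 12 NUM_IOSERVERS 12
db2 update db cfg for DIGBANK using SORTHEAP 8192 SHEAPTHRES_SHR 1048576

# Instance-level parameter (database manager configuration):
db2 update dbm cfg using INTRA_PARALLEL YES
```

Most of these take effect for new connections or at the next activation, which is why the runbook applies them in the post-installation configuration phase rather than mid-flight.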
Step 3: Installation Runbook
The team creates a detailed installation runbook with over 200 steps, including:
- Pre-installation verification (OS patches, kernel parameters, filesystem layout)
- Installation execution with checksums verified at each stage
- Post-installation configuration with parameter application
- Verification testing (unit tests for each subsystem component)
- Rollback procedures at each checkpoint
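On Linux, the pre-installation verification step can lean on Db2's own prerequisite checker plus a few kernel-parameter spot checks; a sketch (version string and parameter list illustrative):

```shell
# Db2 ships a prerequisite checker on the install media:
./db2prereqcheck -v 11.5.0.0

# Spot-check kernel parameters the runbook calls out (illustrative list):
sysctl kernel.shmmax kernel.shmall kernel.sem
sysctl vm.swappiness
```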
Step 4: Testing Strategy
Before production deployment, the team validates in a staging environment:
- Functional testing: all 3,200 packages bind successfully; all stored procedures execute correctly
- Performance testing: simulated peak load at 2x projected volume with response time measurement
- Failover testing: deliberate member failure in the data sharing group to verify transparent re-routing
- Recovery testing: full backup/restore cycle plus point-in-time recovery to a specific log point
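On the LUW side, the recovery test can be scripted with the standard backup, restore, and rollforward commands; a sketch (database name, paths, and timestamps are assumptions for illustration):

```shell
# Online backup, restore, and point-in-time rollforward.
# DIGBANK, /backup/digbank, and the timestamps are illustrative.
db2 backup db DIGBANK online to /backup/digbank include logs
db2 restore db DIGBANK from /backup/digbank taken at 20250101120000
db2 "rollforward db DIGBANK to 2025-01-01-13.00.00.000000 and stop"
```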
Outcome
The installation project spans six weeks:
- Weeks 1-2: staging environment build and validation
- Week 3: production z/OS data sharing group implementation (over a holiday weekend)
- Week 4: production LUW pureScale cluster installation
- Weeks 5-6: workload migration, parallel running, and cutover
Post-deployment metrics show:
- z/OS transaction throughput: 18,200 TPS peak (exceeding the 15,000 TPS target)
- LUW response time: 45 ms average for balance inquiries (well under the 100 ms target)
- First-month availability: 100% on z/OS; on LUW, a single 10-minute planned maintenance window was the only outage (roughly 99.98% for the month)
The buffer pool hit ratios stabilize at 98.5% on z/OS and 99.1% on LUW, confirming the sizing analysis was accurate.
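The hit ratio the team tracks is the standard one, computed from the logical and physical read counters exposed by the monitoring tools; a quick illustration with assumed counter values:

```shell
# Buffer pool hit ratio = (1 - physical_reads / logical_reads) * 100.
# The counter values below are illustrative, not from the case.
awk 'BEGIN {
  logical  = 1000000
  physical = 15000
  printf "%.1f%%\n", (1 - physical / logical) * 100   # prints 98.5%
}'
```

A sustained drop in this ratio for a given pool, alongside rising synchronous read times, is the usual trigger for the resizing reviews Marcus schedules.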
Lessons Learned
- Workload analysis before configuration, always. Marcus notes that their two-week analysis investment prevented what could have been months of post-deployment tuning. "You cannot configure what you have not measured," he tells his team.
- Document every parameter decision with its justification. When audit asks why CTHREAD is set to 500, the team can point to the workload analysis showing 487 concurrent threads during the peak of the batch window.
- Installation runbooks must include rollback steps. During the z/OS installation, a coupling facility structure allocation failed on the first attempt. The runbook's rollback procedure allowed the team to recover and retry within 30 minutes rather than scrambling.
- Staging environments save production weekends. Two issues discovered in staging -- a missing PTF for data sharing and an incorrect ZPARM value for MAXDBAT -- would have caused hours of debugging in production.
- Configuration is not a one-time event. Marcus schedules monthly configuration reviews for the first six months post-deployment, knowing that the workload will evolve as the merged customer base settles into new patterns.
Discussion Questions
- Why did the team choose a data sharing group rather than simply increasing resources on the existing single-member DB2 subsystem? What are the trade-offs?
- The team sized buffer pools based on workload analysis. How would you monitor buffer pool effectiveness after deployment, and what metrics would trigger a resizing decision?
- Marcus set CTHREAD to 500 even though the projected peak was 487. What is the risk of setting it too close to the actual peak? What is the risk of setting it too high?
- The LUW installation used pureScale for high availability. What alternative HA approaches exist for Db2 LUW, and in what scenarios might they be preferred over pureScale?
- How would this installation plan differ if RAFG were deploying to Db2 on Cloud (IBM Cloud) instead of on-premises infrastructure?