Case Study 28.1: Data Sharing Group Deployment at a Major Bank
Background
Continental Federal Savings, a top-20 US bank with $180 billion in assets, operates its core banking platform on DB2 for z/OS. The bank runs a single DB2 subsystem (DB1A) on a z15 LPAR processing:
- 12,000 CICS transactions per second at peak
- 800 distributed (DRDA) connections from web and mobile applications
- Nightly batch processing: 15 million account updates for interest calculation
- Monthly regulatory reporting consuming 4-6 hours of CPU-intensive queries
The bank faces three pressing problems:
- Planned maintenance outages: Applying quarterly DB2 PTFs requires a 2-hour outage window on Saturday nights. Customers complain about unavailable online banking.
- Capacity ceiling: The single LPAR is at 82% CPU utilization during peak hours. Growth projections show it will hit 95% within 18 months.
- Risk exposure: A single DB2 failure would halt all banking operations. The board of directors has demanded improved resilience.
The infrastructure team has been directed to implement DB2 data sharing.
Challenge
Design and deploy a data sharing group that:
- Eliminates planned outages for DB2 maintenance
- Provides room for 50% transaction growth over three years
- Survives any single hardware failure without service interruption
- Minimizes disruption to existing applications during migration
Solution Design
Physical Architecture
Continental Federal deploys a three-member data sharing group across two physical CPCs:
```
Group Name:   CFSGDB2
Group Attach: CFSBANK
Members:      CF1A, CF1B, CF1C

CPC-1 (z15 T02):
  LPAR-A: CF1A (online workload)
  LPAR-B: CF1C (batch + utilities)
  LPAR-C: CF-1 (coupling facility)

CPC-2 (z15 T02):
  LPAR-D: CF1B (online workload)
  LPAR-E: (spare — future CF1D)
  LPAR-F: CF-2 (coupling facility)
```
Both CPCs are in the same data center but connected to different power feeds, cooling systems, and network switches.
Coupling Facility Configuration
| Structure | Primary CF | Duplex CF | Size |
|---|---|---|---|
| CFSGDB2_LOCK1 | CF-1 | CF-2 | 1.5 GB |
| CFSGDB2_SCA | CF-1 | CF-2 | 128 MB |
| CFSGDB2_GBP0 | CF-1 | CF-2 | 8 GB |
| CFSGDB2_GBP1 | CF-1 | CF-2 | 4 GB |
| CFSGDB2_GBP8K0 | CF-1 | CF-2 | 2 GB |
| CFSGDB2_GBP32K | CF-1 | CF-2 | 512 MB |
All structures are duplexed. GBP0 is sized generously because the ACCOUNTS tablespace (in BP0) is the most heavily shared pageset.
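The trade-off inside a group buffer pool structure is between directory entries (which track page registrations) and data elements (which cache the pages themselves). A rough back-of-envelope split can be computed from the structure size; the sketch below assumes DB2's default RATIO of 5 directory entries per data element and an assumed directory entry size of roughly 200 bytes (the real size varies by CFLEVEL), so treat the numbers as illustrative only.

```python
def gbp_split(structure_bytes, page_size=4096, ratio=5, dir_entry_bytes=200):
    """Approximate how a group buffer pool structure divides into
    directory entries and data elements.

    ratio: directory entries per data element (DB2 RATIO, default 5).
    dir_entry_bytes: assumed entry size; the actual value depends on CFLEVEL.
    """
    # Each cached page costs one data element plus `ratio` directory entries.
    unit = ratio * dir_entry_bytes + page_size
    data_elements = structure_bytes // unit
    directory_entries = data_elements * ratio
    return data_elements, directory_entries

# The 8 GB GBP0 from the table above:
pages, entries = gbp_split(8 * 2**30)
print(f"~{pages:,} cached 4K pages, ~{entries:,} directory entries")
```

Under these assumptions, an 8 GB GBP0 caches roughly 1.7 million 4K pages, which is why the 12-million-page ACCOUNTS tablespace still depends on affinity routing to keep its hot working set resident.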
Workload Distribution
| Member | Primary Workload | CICS Regions | DRDA Connections |
|---|---|---|---|
| CF1A | Online banking, teller transactions | 8 CICS regions | 400 via sysplex distributor |
| CF1B | Online banking, ATM, wire transfers | 8 CICS regions | 400 via sysplex distributor |
| CF1C | Nightly batch, utilities, reporting | 0 (batch only) | 50 (reporting) |
Migration Approach
The team adopted a phased migration over 12 weeks:
Phase 1 (Weeks 1-3): Infrastructure preparation
- Install and configure coupling facilities on both CPCs
- Define CFRM policies with all required structures
- Configure sysplex distributor with DVIPA
- Test CF connectivity from all LPARs

Phase 2 (Weeks 4-6): Enable data sharing on the existing subsystem
- Stop DB2 (DB1A) during a scheduled maintenance window
- Reinstall DB2 as the first member of the data sharing group (CF1A)
- Start CF1A and verify the data sharing group is functional
- Resume all CICS and DRDA connections

Phase 3 (Weeks 7-9): Add second member
- Install CF1B on CPC-2
- Start CF1B; it joins the existing data sharing group
- Gradually shift 50% of CICS regions and DRDA connections to CF1B
- Monitor GBP activity and CF service times

Phase 4 (Weeks 10-12): Add third member and optimize
- Install CF1C for batch workload
- Redirect batch JCL to use CF1C's SSID
- Perform a rolling maintenance test (apply a test PTF to one member)
- Conduct a simulated member failure and peer recovery test
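The rolling-maintenance pattern tested in Phase 4 can be sketched in a few lines of Python: service one member at a time, restart it, and only then move on, so the group never drops below a minimum number of active members. `apply_ptf` is a placeholder for the real stop/SMP-E apply/restart procedure, not an actual interface.

```python
def rolling_maintenance(members, apply_ptf, min_active=2):
    """Sketch of rolling maintenance across a data sharing group.

    Takes each member out of service in turn; the assertion checks that
    enough members remain active to carry the workload at every step.
    """
    assert len(members) > min_active, "need spare members to roll through"
    for m in members:
        remaining = [x for x in members if x != m]
        assert len(remaining) >= min_active  # service continues on the others
        apply_ptf(m)  # stands in for: stop member, apply PTF, restart member

serviced = []
rolling_maintenance(["CF1A", "CF1B", "CF1C"], serviced.append)
print(serviced)  # each member serviced exactly once, one at a time
```

This is also the quantitative argument for three members rather than two: with two, taking one down for maintenance leaves no redundancy while the other carries the full load.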
Implementation Challenges
Challenge 1: GBP Sizing for the ACCOUNTS Tablespace
During Phase 3, when both CF1A and CF1B began updating the ACCOUNTS tablespace concurrently, the GBP0 hit ratio dropped to 38%. Investigation revealed:
- The ACCOUNTS tablespace had 12 million pages
- Both members were performing high-volume updates on the same hot pages (active checking accounts)
- The initial GBP0 size (4 GB) was insufficient for the working set
Resolution: Doubled GBP0 to 8 GB. The hit ratio recovered to 78%. Further optimization of workload affinity (routing account transactions by account number range) reduced cross-member contention and pushed the hit ratio to 91%.
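The range-based affinity routing used in the resolution can be sketched as follows. The range boundaries and the mapping to member SSIDs are invented for illustration; the point is that related accounts land on the same member, so their hot pages stay warm in one local buffer pool instead of ping-ponging through the GBP.

```python
def route_member(account_number, ranges):
    """Illustrative affinity router: a fixed account-number range always
    goes to the same member, reducing cross-member page contention."""
    for (low, high), member in ranges.items():
        if low <= account_number <= high:
            return member
    raise ValueError("account outside all configured ranges")

# Hypothetical range assignments (not from the deployment):
RANGES = {
    (0, 49_999_999): "CF1A",
    (50_000_000, 99_999_999): "CF1B",
}

print(route_member(12_345_678, RANGES))   # low range routes to CF1A
print(route_member(60_000_000, RANGES))   # high range routes to CF1B
```

In practice this kind of routing decision sits in the sysplex distributor / CICS routing layer rather than application code, but the partitioning logic is the same.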
Challenge 2: False Contention in the Lock Structure
After adding CF1B, the false contention rate spiked to 12%. The lock table had been sized for a single-member workload.
Resolution: Increased the lock structure from 512 MB to 1.5 GB, providing a larger lock table with more entries. False contention dropped to 0.3%.
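Why a larger lock table helps follows from simple hash arithmetic: false contention occurs when two unrelated resources hash to the same lock-table entry, so a new request falsely contends roughly in proportion to concurrently held locks divided by table entries. The entry counts below are illustrative round numbers, not measurements from the deployment (actual counts depend on entry width and structure layout).

```python
def false_contention_estimate(held_locks, lock_table_entries):
    """Back-of-envelope estimate: probability that a new lock request
    hashes onto an entry already occupied by an unrelated resource."""
    return held_locks / lock_table_entries

# Lock-table entry counts are powers of two; these sizes are assumed.
small = false_contention_estimate(200_000, 2**24)  # smaller-structure class
large = false_contention_estimate(200_000, 2**26)  # larger-structure class
print(f"{small:.2%} -> {large:.2%}")  # 4x the entries, one quarter the rate
```

The estimate captures the shape of the observed behavior: growing the structure (and thus the lock table) divides the false-contention rate by the growth factor.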
Challenge 3: Batch Job Performance on CF1C
The nightly interest calculation batch job, which previously ran in 3 hours on the standalone system, initially took 5.5 hours on CF1C. The cause was GBP-dependent writes — every page updated by the batch job was being written to the GBP because online members (CF1A, CF1B) had concurrent interest in the same pagesets.
Resolution:
1. Moved the interest calculation tables to a dedicated tablespace in BP1
2. Scheduled the batch window to start 30 minutes after peak online traffic subsides
3. Temporarily reduced online activity on the interest calculation tables through application logic during the batch window

With these changes, batch completion time returned to 3.5 hours — a modest 17% increase over the standalone system that the business accepted.
Results
After 90 days in production:
| Metric | Before Data Sharing | After Data Sharing | Change |
|---|---|---|---|
| Planned downtime (quarterly) | 8 hours/year | 0 hours/year | -100% |
| Peak CPU utilization | 82% (one LPAR) | 55% (highest member) | -27 percentage points |
| Available capacity | 18% of one LPAR | ~150% of original | +733% |
| Unplanned downtime (annual) | 45 minutes (one incident) | 0 minutes | -100% |
| Transaction throughput capacity | 15,000 TPS | 35,000+ TPS | +133% |
| CF service time (average) | N/A | 12 microseconds | Within target |
| GBP0 hit ratio | N/A | 91% | Exceeds target |
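The percentage columns in the table are simple before/after arithmetic and can be sanity-checked directly (the "~150% of original" capacity figure is the approximation used above):

```python
def pct_change(before, after):
    """Percent change from a before value to an after value."""
    return (after - before) / before * 100

# Throughput capacity: 15,000 TPS -> 35,000 TPS
print(round(pct_change(15_000, 35_000)))  # 133

# Available capacity: 18% of one LPAR -> ~150% of the original LPAR
print(round(pct_change(18, 150)))         # 733
```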
Lessons Learned
- Start with generous CF structure sizes. It is far easier to give back capacity than to react to performance degradation in production. Budget for 2x your initial estimates.
- Workload affinity matters more than even distribution. The initial "spread everything evenly" approach caused excessive GBP activity. Routing related transactions to the same member dramatically reduced CF overhead.
- Test failure scenarios before production. The team conducted three simulated member failures during the migration, each revealing operational gaps (alert routing, runbook completeness, team readiness) that were fixed before go-live.
- Monitor continuously. Data sharing introduces metrics that do not exist in a single-subsystem environment. Invest in monitoring dashboards for CF service time, GBP hit ratios, XI rates, and false contention from day one.
- Batch workload isolation is critical. Batch jobs that update millions of rows create enormous GBP pressure if they share pagesets with online members. Dedicate a member to batch and design tablespace assignments to minimize cross-member sharing during batch windows.
Discussion Questions
- Why did the team choose three members instead of two? What are the trade-offs of adding a fourth member?
- If Continental Federal needed to survive the loss of an entire data center, how would the data sharing design need to change? (Hint: consider GDPS.)
- The batch job runs 17% slower in data sharing. Under what circumstances would this be unacceptable, and what alternatives exist?