Case Study 28.1: Data Sharing Group Deployment at a Major Bank
Background
Continental Federal Savings, a top-20 US bank with $180 billion in assets, operates its core banking platform on DB2 for z/OS. The bank runs a single DB2 subsystem (DB1A) on a z15 LPAR processing:
- 12,000 CICS transactions per second at peak
- 800 distributed (DRDA) connections from web and mobile applications
- Nightly batch processing: 15 million account updates for interest calculation
- Monthly regulatory reporting consuming 4-6 hours of CPU-intensive queries
The bank faces three pressing problems:
- Planned maintenance outages: Applying quarterly DB2 PTFs requires a 2-hour outage window on Saturday nights. Customers complain about unavailable online banking.
- Capacity ceiling: The single LPAR is at 82% CPU utilization during peak hours. Growth projections show it will hit 95% within 18 months.
- Risk exposure: A single DB2 failure would halt all banking operations. The board of directors has demanded improved resilience.
The infrastructure team has been directed to implement DB2 data sharing.
Challenge
Design and deploy a data sharing group that:
- Eliminates planned outages for DB2 maintenance
- Provides room for 50% transaction growth over three years
- Survives any single hardware failure without service interruption
- Minimizes disruption to existing applications during migration
Solution Design
Physical Architecture
Continental Federal deploys a three-member data sharing group across two physical CPCs:
```
Group Name:   CFSGDB2
Group Attach: CFSBANK
Members:      CF1A, CF1B, CF1C

CPC-1 (z15 T02):
  LPAR-A: CF1A (online workload)
  LPAR-B: CF1C (batch + utilities)
  LPAR-C: CF-1 (coupling facility)

CPC-2 (z15 T02):
  LPAR-D: CF1B (online workload)
  LPAR-E: (spare — future CF1D)
  LPAR-F: CF-2 (coupling facility)
```
Both CPCs are in the same data center but connected to different power feeds, cooling systems, and network switches.
Coupling Facility Configuration
| Structure | Primary CF | Duplex CF | Size |
|---|---|---|---|
| CFSGDB2_LOCK1 | CF-1 | CF-2 | 1.5 GB |
| CFSGDB2_SCA | CF-1 | CF-2 | 128 MB |
| CFSGDB2_GBP0 | CF-1 | CF-2 | 8 GB |
| CFSGDB2_GBP1 | CF-1 | CF-2 | 4 GB |
| CFSGDB2_GBP8K0 | CF-1 | CF-2 | 2 GB |
| CFSGDB2_GBP32K | CF-1 | CF-2 | 512 MB |
All structures are duplexed. GBP0 is sized generously because the ACCOUNTS tablespace (in BP0) is the most heavily shared pageset.
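The trade-off inside a group buffer pool structure is between directory entries (which track page registrations) and data elements (which cache the pages themselves). A rough back-of-envelope split can be computed from the structure size; the sketch below assumes DB2's default RATIO of 5 directory entries per data element and an assumed directory entry size of roughly 200 bytes (the real size varies by CFLEVEL), so treat the numbers as illustrative only.

```python
def gbp_split(structure_bytes, page_size=4096, ratio=5, dir_entry_bytes=200):
    """Approximate how a group buffer pool structure divides into
    directory entries and data elements.

    ratio: directory entries per data element (DB2 RATIO, default 5).
    dir_entry_bytes: assumed entry size; the actual value depends on CFLEVEL.
    """
    # Each cached page costs one data element plus `ratio` directory entries.
    unit = ratio * dir_entry_bytes + page_size
    data_elements = structure_bytes // unit
    directory_entries = data_elements * ratio
    return data_elements, directory_entries

# The 8 GB GBP0 from the table above:
pages, entries = gbp_split(8 * 2**30)
print(f"~{pages:,} cached 4K pages, ~{entries:,} directory entries")
```

Under these assumptions, an 8 GB GBP0 caches roughly 1.7 million 4K pages, which is why the 12-million-page ACCOUNTS tablespace still depends on affinity routing to keep its hot working set resident.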
Workload Distribution
| Member | Primary Workload | CICS Regions | DRDA Connections |
|---|---|---|---|
| CF1A | Online banking, teller transactions | 8 CICS regions | 400 via sysplex distributor |
| CF1B | Online banking, ATM, wire transfers | 8 CICS regions | 400 via sysplex distributor |
| CF1C | Nightly batch, utilities, reporting | 0 (batch only) | 50 (reporting) |
Migration Approach
The team adopted a phased migration over 12 weeks:
Phase 1 (Weeks 1-3): Infrastructure preparation
- Install and configure coupling facilities on both CPCs
- Define CFRM policies with all required structures
- Configure sysplex distributor with DVIPA
- Test CF connectivity from all LPARs

Phase 2 (Weeks 4-6): Enable data sharing on the existing subsystem
- Stop DB2 (DB1A) during a scheduled maintenance window
- Reinstall DB2 as the first member of the data sharing group (CF1A)
- Start CF1A and verify the data sharing group is functional
- Resume all CICS and DRDA connections

Phase 3 (Weeks 7-9): Add second member
- Install CF1B on CPC-2
- Start CF1B; it joins the existing data sharing group
- Gradually shift 50% of CICS regions and DRDA connections to CF1B
- Monitor GBP activity and CF service times

Phase 4 (Weeks 10-12): Add third member and optimize
- Install CF1C for batch workload
- Redirect batch JCL to use CF1C's SSID
- Perform a rolling maintenance test (apply a test PTF to one member)
- Conduct a simulated member failure and peer recovery test
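The rolling-maintenance pattern tested in Phase 4 can be sketched in a few lines of Python: service one member at a time, restart it, and only then move on, so the group never drops below a minimum number of active members. `apply_ptf` is a placeholder for the real stop/SMP-E apply/restart procedure, not an actual interface.

```python
def rolling_maintenance(members, apply_ptf, min_active=2):
    """Sketch of rolling maintenance across a data sharing group.

    Takes each member out of service in turn; the assertion checks that
    enough members remain active to carry the workload at every step.
    """
    assert len(members) > min_active, "need spare members to roll through"
    for m in members:
        remaining = [x for x in members if x != m]
        assert len(remaining) >= min_active  # service continues on the others
        apply_ptf(m)  # stands in for: stop member, apply PTF, restart member

serviced = []
rolling_maintenance(["CF1A", "CF1B", "CF1C"], serviced.append)
print(serviced)  # each member serviced exactly once, one at a time
```

This is also the quantitative argument for three members rather than two: with two, taking one down for maintenance leaves no redundancy while the other carries the full load.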
Implementation Challenges
Challenge 1: GBP Sizing for the ACCOUNTS Tablespace
During Phase 3, when both CF1A and CF1B began updating the ACCOUNTS tablespace concurrently, the GBP0 hit ratio dropped to 38%. Investigation revealed:
- The ACCOUNTS tablespace had 12 million pages
- Both members were performing high-volume updates on the same hot pages (active checking accounts)
- The initial GBP0 size (4 GB) was insufficient for the working set
Resolution: Doubled GBP0 to 8 GB. The hit ratio recovered to 78%. Further optimization of workload affinity (routing account transactions by account number range) reduced cross-member contention and pushed the hit ratio to 91%.
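The range-based affinity routing used in the resolution can be sketched as follows. The range boundaries and the mapping to member SSIDs are invented for illustration; the point is that related accounts land on the same member, so their hot pages stay warm in one local buffer pool instead of ping-ponging through the GBP.

```python
def route_member(account_number, ranges):
    """Illustrative affinity router: a fixed account-number range always
    goes to the same member, reducing cross-member page contention."""
    for (low, high), member in ranges.items():
        if low <= account_number <= high:
            return member
    raise ValueError("account outside all configured ranges")

# Hypothetical range assignments (not from the deployment):
RANGES = {
    (0, 49_999_999): "CF1A",
    (50_000_000, 99_999_999): "CF1B",
}

print(route_member(12_345_678, RANGES))   # low range routes to CF1A
print(route_member(60_000_000, RANGES))   # high range routes to CF1B
```

In practice this kind of routing decision sits in the sysplex distributor / CICS routing layer rather than application code, but the partitioning logic is the same.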
Challenge 2: False Contention in the Lock Structure
After adding CF1B, the false contention rate spiked to 12%. The lock table had been sized for a single-member workload.
Resolution: Increased the lock structure from 512 MB to 1.5 GB, providing a larger lock table with more entries. False contention dropped to 0.3%.
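Why a larger lock table helps follows from simple hash arithmetic: false contention occurs when two unrelated resources hash to the same lock-table entry, so a new request falsely contends roughly in proportion to concurrently held locks divided by table entries. The entry counts below are illustrative round numbers, not measurements from the deployment (actual counts depend on entry width and structure layout).

```python
def false_contention_estimate(held_locks, lock_table_entries):
    """Back-of-envelope estimate: probability that a new lock request
    hashes onto an entry already occupied by an unrelated resource."""
    return held_locks / lock_table_entries

# Lock-table entry counts are powers of two; these sizes are assumed.
small = false_contention_estimate(200_000, 2**24)  # smaller-structure class
large = false_contention_estimate(200_000, 2**26)  # larger-structure class
print(f"{small:.2%} -> {large:.2%}")  # 4x the entries, one quarter the rate
```

The estimate captures the shape of the observed behavior: growing the structure (and thus the lock table) divides the false-contention rate by the growth factor.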
Challenge 3: Batch Job Performance on CF1C
The nightly interest calculation batch job, which previously ran in 3 hours on the standalone system, initially took 5.5 hours on CF1C. The cause was GBP-dependent writes — every page updated by the batch job was being written to the GBP because online members (CF1A, CF1B) had concurrent interest in the same pagesets.
Resolution:
1. Moved the interest calculation tables to a dedicated tablespace in BP1
2. Scheduled the batch window to start 30 minutes after peak online traffic subsides
3. Temporarily reduced online activity on the interest calculation tables through application logic during the batch window

With these changes, batch completion time returned to 3.5 hours — a modest 17% increase over the standalone system that the business accepted.
Results
After 90 days in production:
| Metric | Before Data Sharing | After Data Sharing | Change |
|---|---|---|---|
| Planned downtime (quarterly) | 8 hours/year | 0 hours/year | -100% |
| Peak CPU utilization | 82% (one LPAR) | 55% (highest member) | -27 percentage points |
| Available capacity | 18% of one LPAR | ~150% of original | +733% |
| Unplanned downtime (annual) | 45 minutes (one incident) | 0 minutes | -100% |
| Transaction throughput capacity | 15,000 TPS | 35,000+ TPS | +133% |
| CF service time (average) | N/A | 12 microseconds | Within target |
| GBP0 hit ratio | N/A | 91% | Exceeds target |
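The percentage columns in the table are simple before/after arithmetic and can be sanity-checked directly (the "~150% of original" capacity figure is the approximation used above):

```python
def pct_change(before, after):
    """Percent change from a before value to an after value."""
    return (after - before) / before * 100

# Throughput capacity: 15,000 TPS -> 35,000 TPS
print(round(pct_change(15_000, 35_000)))  # 133

# Available capacity: 18% of one LPAR -> ~150% of the original LPAR
print(round(pct_change(18, 150)))         # 733
```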
Lessons Learned
- Start with generous CF structure sizes. It is far easier to give back capacity than to react to performance degradation in production. Budget for 2x your initial estimates.
- Workload affinity matters more than even distribution. The initial "spread everything evenly" approach caused excessive GBP activity. Routing related transactions to the same member dramatically reduced CF overhead.
- Test failure scenarios before production. The team conducted three simulated member failures during the migration, each revealing operational gaps (alert routing, runbook completeness, team readiness) that were fixed before go-live.
- Monitor continuously. Data sharing introduces metrics that do not exist in a single-subsystem environment. Invest in monitoring dashboards for CF service time, GBP hit ratios, XI rates, and false contention from day one.
- Batch workload isolation is critical. Batch jobs that update millions of rows create enormous GBP pressure if they share pagesets with online members. Dedicate a member to batch and design tablespace assignments to minimize cross-member sharing during batch windows.
Discussion Questions
- Why did the team choose three members instead of two? What are the trade-offs of adding a fourth member?
- If Continental Federal needed to survive the loss of an entire data center, how would the data sharing design need to change? (Hint: consider GDPS.)
- The batch job runs 17% slower in data sharing. Under what circumstances would this be unacceptable, and what alternatives exist?