Case Study 1: Continental National Bank's Parallel Sysplex Migration

Background

In 2016, Continental National Bank operated its entire production mainframe environment on a single LPAR — CNBPROD1 — running on an IBM z13 model N30. The system processed roughly 280 million transactions per day: online banking through CICS, batch processing for end-of-day settlement, DB2 for all relational data, and MQ for integration with downstream systems.

The setup worked. It had worked for years. But Kwame Mensah — then a senior systems programmer, not yet the chief architect — could see the cracks forming.

"We were running at 78% peak CPU utilization on our busiest days," Kwame recalls. "Our DB2 buffer pool hit ratio was dropping because we couldn't give DB2 enough real storage — CICS and batch were competing for the same memory. And every time we needed to apply z/OS maintenance, we had a full production outage. Eight hours minimum. We'd schedule it for Saturdays, and the business would lose an entire day of weekend processing."

The breaking point came during a regulatory examination. The OCC (Office of the Comptroller of the Currency) told CNB's CTO that a single-LPAR configuration for a Tier-1 bank was "inconsistent with safety and soundness expectations." The examiner's exact words, according to Kwame: "What happens when that one box goes down?"

CNB's CTO authorized a Parallel Sysplex migration project. Kwame was named technical lead.

The Requirements

Kwame's team defined four non-negotiable requirements:

  1. Continuous availability for online banking. CICS transactions must continue during any single point of failure — LPAR failure, DB2 member failure, coupling facility failure, or planned maintenance.

  2. Batch window integrity. The overnight batch window (11:00 PM to 5:00 AM) must complete even if an LPAR is unavailable. No batch job should require rerun due to infrastructure failure.

  3. Zero-data-loss failover. No committed transaction can be lost during any failure scenario. This is a regulatory requirement for a Tier-1 bank.

  4. Rolling maintenance capability. z/OS, DB2, CICS, and MQ maintenance must be applied without a full production outage. One LPAR at a time should be taken down for maintenance while the others continue serving the full workload.

Additionally, the architecture had to support CNB's projected growth to 500 million transactions per day by 2020 — a near-doubling of volume driven by mobile banking growth and a planned acquisition.

Architecture Decisions

Decision 1: Four LPARs on Two Physical Frames

Kwame's team evaluated configurations from two LPARs on one frame to eight LPARs on four frames. The decision: four LPARs (CNBPROD1 through CNBPROD4) distributed across two IBM z14 frames (the hardware was upgraded as part of the migration).

Rationale:

  • Two physical frames provide hardware-level redundancy. A total frame failure (extremely rare but possible) leaves two LPARs running on the surviving frame.
  • Four LPARs provide capacity headroom. With all four active, each LPAR runs at approximately 45% CPU utilization at peak. If one LPAR fails, the remaining three absorb the workload at approximately 60% utilization — within the safe operating range.
  • Four DB2 members in a data sharing group provide optimal parallelism for the batch workload, with each member handling a portion of the parallel batch streams.

"I fought for four instead of two," Kwame says. "Two LPARs means losing one puts you at 100% of single-LPAR capacity — you're right back where you started, just on the surviving image. Four means losing one still leaves you with 75% of your total capacity. That's the difference between 'we survived' and 'we survived comfortably.'"
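
Kwame's headroom arithmetic is easy to check. A minimal sketch (the function name is illustrative, and it assumes the workload redistributes evenly across the surviving images):

```python
def per_lpar_util_after_failure(n_lpars, util_each, failed=1):
    """Redistribute the total workload of n_lpars LPARs, each running at
    util_each of one LPAR's capacity, across the surviving images."""
    total_work = n_lpars * util_each      # in units of one LPAR's capacity
    return total_work / (n_lpars - failed)

# Four LPARs at ~45% peak: losing one pushes the survivors to 60%.
print(round(per_lpar_util_after_failure(4, 0.45), 2))   # 0.6
# Two LPARs at 50% each: losing one puts the survivor at 100%.
print(round(per_lpar_util_after_failure(2, 0.50), 2))   # 1.0
```

Losing one of four also leaves 3/4 = 75% of total configured capacity, the figure Kwame quotes; losing one of two leaves only 50%, with the survivor running flat out.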

Decision 2: DB2 Data Sharing Group with Four Members

All four LPARs run a DB2 member in a single data sharing group, DB2A. Every member can access every database.

Key design choices:

  • Group buffer pool sizing: Kwame worked with Lisa Tran to size the group buffer pools (GBPs) based on the expected cross-invalidation rate. For heavily updated tables (like the ACCOUNTS table), a larger GBP was allocated to reduce castout frequency. Initial sizing: 8 GB total GBP space in the coupling facility, distributed across three GBP structures (GBP0 for the default, GBP1 for high-update tablespaces, GBP32K for LOB data).
  • Lock structure sizing: Based on analysis of the single-LPAR lock manager's hash table utilization, the team sized the coupling facility lock structure at 512 MB — enough to hold lock entries for peak concurrency without false contention from hash collisions.
  • Data sharing protocol tuning: Lisa tuned the DB2 DSNZPARM parameters for data sharing: IRLMRWT (inter-system lock wait time), GRPBPFR (group buffer pool castout frequency), and MAXDBAT (maximum DB2 threads per member) were all adjusted based on workload modeling.

Decision 3: CICS Topology — TORs, AORs, and CICSPlex SM

The team designed a CICS topology with three types of regions:

  • TOR (Terminal Owning Region): Two TORs — one on CNBPROD1, one on CNBPROD2 — to receive inbound traffic. If one TOR fails, the VTAM generic resource definition routes all traffic to the surviving TOR.
  • AOR (Application Owning Region): Four AORs — one on each LPAR — to run application programs. CICSPlex SM (CPSM) distributes transactions across the AORs based on current workload.
  • FOR (File Owning Region): Two FORs — CNBPROD1 and CNBPROD3 — for VSAM file ownership using CICS VSAM RLS (Record Level Sharing), which uses the coupling facility for lock management.

"The TOR/AOR split was non-negotiable," Kwame says. "You never want terminal handling and application logic in the same region. A runaway application program in an AOR crashes that AOR — but the TOR is fine, and CPSM routes the next transaction to a different AOR. The user sees a brief delay, not an outage."

Decision 4: Coupling Facility Configuration

Two coupling facility LPARs per physical frame — CF01 and CF02 on Frame 1, CF03 and CF04 on Frame 2 — for a total of four CF LPARs.

Structure placement policy (CFRM):

  • Every critical structure (DB2 lock, SCA, GBPs, GRS star, XCF signaling) has a primary allocation on one CF and an alternate allocation on a CF on the other physical frame.
  • If CF01 fails, structures automatically rebuild on CF03 or CF04 (the alternate CFs on the surviving frame).
  • If an entire physical frame fails, two CF LPARs remain on the surviving frame to serve the surviving z/OS LPARs.
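
The cross-frame preference can be expressed as a small selection rule. This is an illustrative sketch only: in practice, placement is governed by the preference lists in the CFRM policy, and the frame map and function below are hypothetical.

```python
# Hypothetical model of the cross-frame rebuild preference the CFRM
# policy encodes: prefer an available CF on the other physical frame,
# so that losing a whole frame still leaves a viable CF for every structure.
FRAME_OF = {"CF01": 1, "CF02": 1, "CF03": 2, "CF04": 2}

def rebuild_target(primary, available):
    """Pick a rebuild target CF, preferring the other physical frame."""
    other_frame = [cf for cf in available
                   if FRAME_OF[cf] != FRAME_OF[primary]]
    candidates = other_frame or [cf for cf in available if cf != primary]
    return candidates[0] if candidates else None

# CF01 fails: structures rebuild on a Frame 2 CF.
print(rebuild_target("CF01", ["CF02", "CF03", "CF04"]))   # CF03
# All of Frame 1 fails: a Frame 2 structure falls back within Frame 2.
print(rebuild_target("CF03", ["CF04"]))                   # CF04
```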

"The coupling facility is the single most critical piece of a Parallel Sysplex," Kwame emphasizes. "If you lose all your CFs, you lose data sharing, you lose GRS star, you lose CICS RLS. You're dead. That's why we have four, on two frames, with cross-frame alternates. Any single failure — even a whole frame — leaves a viable CF configuration."

Decision 5: MQ Queue-Sharing Group

Four MQ queue managers — one on each LPAR — joined in a queue-sharing group using the coupling facility. Shared queues allow any queue manager to get or put messages on any queue. If a queue manager fails, the remaining three continue processing the shared queues.

Decision 6: JES2 Multi-Access Spool

All four LPARs share a single JES2 spool (MAS configuration). A batch job submitted on any LPAR can run on any other LPAR's initiators. This provides maximum flexibility for batch scheduling and makes the batch window resilient to individual LPAR failures.

Implementation Challenges

Challenge 1: The Batch Serialization Problem

CNB's existing batch jobs used DISP=OLD on numerous datasets — a pattern that works fine on a single LPAR but creates serialization bottlenecks in a Sysplex. When Job A on CNBPROD1 holds DISP=OLD on a dataset, Job B on CNBPROD3 waiting for the same dataset is blocked — and GRS ensures the block is enforced across the Sysplex.

Solution: Rob Calloway spent three months analyzing the 4,200 batch jobs to identify unnecessary DISP=OLD specifications. "Half of them were DISP=OLD out of habit, not necessity," he says. "If a job only reads a file, it should be DISP=SHR. I found 600 jobs with DISP=OLD on reference files that were never updated." The team changed DISP=OLD to DISP=SHR wherever possible and restructured the batch schedule to minimize contention for datasets that genuinely required exclusive access.
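
The kind of triage Rob describes lends itself to a first-pass scan. A minimal sketch, not CNB's actual tooling: it flags DD statements requesting exclusive access so each can be reviewed for downgrade to DISP=SHR. The regex and dataset names are illustrative, and a real scan would also have to handle continuation cards and procedure overrides.

```python
import re

# Match a DD statement whose DISP parameter requests OLD (exclusive),
# either as DISP=OLD or DISP=(OLD,...). Illustrative pattern only.
DISP_OLD = re.compile(r"^//(\S+)\s+DD\s+.*DISP=\(?OLD\b", re.IGNORECASE)

def flag_disp_old(jcl_lines):
    """Return (line_number, ddname) pairs for DD statements using DISP=OLD."""
    hits = []
    for lineno, line in enumerate(jcl_lines, start=1):
        m = DISP_OLD.match(line)
        if m:
            hits.append((lineno, m.group(1)))
    return hits

job = [
    "//DAILYRPT JOB (ACCT),'EOD REPORT'",
    "//STEP01   EXEC PGM=RPTGEN",
    "//REFFILE  DD DSN=CNB.PROD.RATES,DISP=OLD",        # read-only reference file
    "//OUTFILE  DD DSN=CNB.PROD.REPORT,DISP=(NEW,CATLG)",
]
print(flag_disp_old(job))   # [(3, 'REFFILE')]
```

Each hit is only a candidate: the dataset still has to be confirmed as read-only in that step before the disposition is downgraded.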

Challenge 2: DB2 Plan and Package Binding

In a data sharing environment, DB2 plans and packages must be bound on one member and are then accessible from all members. But the existing BIND jobs were not designed for data sharing — they issued BIND REPLACE commands that could disrupt active threads on other members.

Solution: Lisa Tran redesigned the BIND strategy to use off-hours BIND windows with the ACQUIRE(USE) and RELEASE(COMMIT) options, minimizing the window during which active threads could be invalidated. She also implemented a rolling BIND process — binding on one member at a time, verifying access paths, then proceeding to the next.

Challenge 3: Application Code Changes

Most COBOL programs required no changes — they were unaware of the Sysplex. But several programs had assumptions that broke in a multi-LPAR environment:

  • Sequence number generators that used a counter in a VSAM file. With multiple LPARs, two programs could read the same counter value simultaneously. Solution: Replace with DB2 sequences or CICS named counter server (which uses the coupling facility).
  • Temporary dataset naming that did not use system symbols. Dataset names that were unique on one LPAR collided when multiple LPARs used the same naming convention. Solution: Include the LPAR name (&SYSNAME) in temporary dataset naming conventions.
  • CICS programs that cached data in shared storage. The cache was local to one CICS region — programs on other regions wouldn't see updates. Solution: Replace shared storage caches with coupling facility data tables or DB2 tables.
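
The first of these failures is easy to demonstrate. A hypothetical simulation of the read-then-update pattern, with plain variables standing in for the VSAM counter record, shows why it breaks once two images can interleave:

```python
# Broken pattern: each "LPAR" reads the counter, then writes back value+1.
# If both read before either writes, both hand out the same "unique" number.
counter = 1000                      # stands in for the VSAM counter record

read_a = counter                    # LPAR A reads 1000
read_b = counter                    # LPAR B reads 1000 (A has not written yet)
counter = read_a + 1                # A writes back 1001
counter = read_b + 1                # B also writes back 1001
assert read_a == read_b == 1000     # duplicate sequence numbers issued

# The fix serializes the increment at a single point (a DB2 sequence or the
# CICS named counter server); a local stand-in is one atomic source:
from itertools import count
next_seq = count(1000)
a, b = next(next_seq), next(next_seq)
assert (a, b) == (1000, 1001)       # distinct values, no duplicates
```

The DB2 sequence and the named counter server solve the problem the same way: the fetch-and-increment happens in one serialized place (DB2 or the coupling facility) rather than in each program's own read/update pair.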

Challenge 4: The Cutover Weekend

The migration from single-LPAR to four-LPAR Parallel Sysplex was executed over a three-day weekend in November 2017. Kwame's team had rehearsed the cutover four times in their pre-production environment.

"Day one: IPL the Sysplex, start DB2 data sharing, verify the coupling facility structures," Kwame describes the plan. "Day two: start CICS regions on all four LPARs, run the full regression test suite, run a simulated batch window. Day three: go live with controlled transaction volume, ramp up to full production by Sunday evening."

The plan survived first contact with reality — mostly. The coupling facility structure sizing for the DB2 lock structure was 20% too small, causing false contention during the simulated batch window on Day 2. Lisa Tran resized the structure online (a coupling facility capability that proved its worth immediately) from 512 MB to 768 MB. The issue was resolved in 15 minutes without restarting any subsystem.

Results

Six months after migration (May 2018):

Metric                                    Before (Single LPAR)   After (4-LPAR Sysplex)
---------------------------------------   --------------------   --------------------------------
Planned downtime for z/OS maintenance     8 hours/month          0 hours (rolling)
Unplanned outages (application impact)    3 per year             0 (first 6 months)
Peak CPU utilization (per LPAR)           78%                    47%
Batch window completion time              5.5 hours (tight)      4.2 hours (comfortable)
Online transaction response time (avg)    180 ms                 155 ms
Maximum concurrent online users           1,800                  4,500
DB2 lock wait time (avg)                  2.3 ms                 1.8 ms (local) / 3.1 ms (global)

"The response time improvement surprised us," Kwame admits. "We expected the coupling facility overhead for DB2 data sharing to add latency. But the reduced CPU contention — four LPARs sharing the load instead of one overloaded image — more than compensated. And the batch window got faster because we could run more parallel streams across four sets of initiators."

The regulatory examiner returned in 2018 and noted: "Significant improvement in infrastructure resilience. The Parallel Sysplex configuration is consistent with industry best practices for a bank of this size and complexity."

Long-Term Impact

By 2026, CNB's Parallel Sysplex processes 500 million transactions per day — the growth target the architecture was designed for. The system has survived two LPAR failures (one hardware, one z/OS crash), a coupling facility structure failure, and numerous planned maintenance windows — all without customer-visible impact.

"The migration cost $4.2 million — hardware, software, and labor," Kwame reflects. "The avoided cost of the outages we would have had on the single-LPAR system is impossible to calculate precisely, but our risk management team estimated $15 million in potential regulatory fines and customer impact for a Tier-1 bank losing service for eight hours. The Sysplex paid for itself the first time we didn't have an outage."

Marcus Whitfield at FBA, watching CNB's success from across the industry, calls the Sysplex "the architecture of sleep." "When Kwame goes home at night," Marcus says, "he sleeps. Before the Sysplex, I doubt he did."


Discussion Questions

  1. Architecture Sizing: Kwame chose four LPARs when two would have met the minimum redundancy requirement. What are the costs (financial, operational complexity, coupling facility overhead) of four LPARs versus two? Under what circumstances would two LPARs have been sufficient?

  2. Coupling Facility Placement: The team placed coupling facilities on both physical frames with cross-frame alternates. What would happen if all four CF LPARs were on the same physical frame? Would the configuration still be considered highly available?

  3. Batch Serialization: Rob Calloway found 600 batch jobs with unnecessary DISP=OLD specifications. What organizational process or governance mechanism could prevent this problem from recurring? How do you ensure developers use DISP=SHR when appropriate?

  4. Application Code Changes: The sequence number generator problem (using VSAM counters) is a classic single-image assumption that breaks in a Sysplex. Identify three other common coding patterns in COBOL programs that might break in a multi-LPAR environment.

  5. Cost-Benefit Analysis: Kwame cites a $4.2 million cost and $15 million avoided risk. How would you structure a cost-benefit analysis for a Sysplex migration at your organization? What costs and benefits would you include? How would you quantify the risk reduction?

  6. Knowledge Dependency: Notice that the cutover required expertise from Kwame (Sysplex architecture), Lisa (DB2 data sharing), and Rob (batch scheduling) simultaneously. What happens when organizations attempt this migration without this depth of expertise? What role do IBM Lab Services and consultants play?

  7. Growth Planning: CNB designed for 500M transactions/day in 2020 and achieved it by 2026. If volume continues growing (mobile banking, real-time payments), what is the next architectural evolution? A fifth LPAR? A second Sysplex? A GDPS active-active configuration?