Case Study 1: CNB's CICS Region Topology
16 Regions, 4 LPARs, 500 Million Transactions Per Day
Background
Continental National Bank (CNB) is a Tier-1 financial institution processing 500 million online transactions per day across four channels: 3270 branch terminals, ATM network, online banking web portal, and mobile banking API. Their CICS environment runs on CICS TS 5.6 across a 4-LPAR Parallel Sysplex with DB2 data sharing.
This case study examines the evolution of CNB's CICS topology from a 10-region, 2-LPAR configuration to the current 16-region, 4-LPAR architecture — the decisions that drove the redesign, the trade-offs Kwame Mensah's team navigated, and the operational lessons from two years of production experience.
The 2018 Incident: The Catalyst
On March 14, 2018, at 14:22 EST, CNB's online banking web portal experienced a surge of traffic following a promotional email to 2 million customers. The web transactions were routed to CNBAORA2, one of two AORs that also served ATM transactions on the same LPAR.
The web surge caused CNBAORA2's task count to hit MAXTASK (150 at the time). New web transactions queued. But so did ATM transactions — because both workloads shared the same AOR. Within 8 minutes, ATM response times exceeded 30 seconds. CNB's ATM network monitoring triggered alerts. Branch managers called the help desk. The CICS operations team emergency-drained web transactions from CNBAORA2, restoring ATM service after 23 minutes of degradation.
No data was lost. No transactions were corrupted. But 23 minutes of ATM degradation for a Tier-1 bank is a material operational event. The post-incident review identified the root cause: lack of channel isolation. Web and ATM traffic shared the same AOR pool, creating a shared failure domain.
Kwame Mensah presented three options to the architecture review board:
- Option 1: Increase MAXTASK and add AOR capacity on existing LPARs. Cost: low. Risk reduction: minimal, because the shared failure domain remains.
- Option 2: Separate web AORs from ATM/3270 AORs on existing LPARs. Cost: moderate. Risk reduction: moderate. Workload isolation is achieved, but a hardware failure still takes down both channels.
- Option 3: Full channel isolation across 4 LPARs with dedicated TOR/AOR groups per channel. Cost: high. Risk reduction: maximum, because each channel is an independent failure domain.
The board chose option 3. The migration took 14 months.
The Target Architecture
Topology Diagram
╔══════════════════════════════════════════════════════════════════════╗
║                    CNB CICS TOPOLOGY - PRODUCTION                    ║
╠══════════════════════════════════════════════════════════════════════╣
║                                                                      ║
║  SYSA (3270/ATM Channel)         SYSB (3270/ATM Channel)             ║
║  ┌────────────────────┐          ┌────────────────────┐              ║
║  │ CNBTORA1 (TOR)     │          │ CNBTORB1 (TOR)     │              ║
║  │ 3270: 2,500 terms  │          │ 3270: 2,500 terms  │              ║
║  │ ATM: 1,000 units   │◄──IPIC──▶│ ATM: 1,000 units   │              ║
║  ├────────────────────┤          ├────────────────────┤              ║
║  │ CNBAORA1 (AOR)     │          │ CNBAORB1 (AOR)     │              ║
║  │ Core banking       │◄──IPIC──▶│ Core banking       │              ║
║  │ MAXTASK=250        │          │ MAXTASK=250        │              ║
║  ├────────────────────┤          ├────────────────────┤              ║
║  │ CNBAORA2 (AOR)     │          │ CNBAORB2 (AOR)     │              ║
║  │ Core banking       │◄──IPIC──▶│ Core banking       │              ║
║  │ MAXTASK=250        │          │ MAXTASK=250        │              ║
║  ├────────────────────┤          ├────────────────────┤              ║
║  │ CNBFORA1 (FOR)     │          │ CNBFORB1 (FOR)     │              ║
║  │ Customer master    │          │ Customer master    │              ║
║  │ Account index      │          │ Account index      │              ║
║  └────────────────────┘          └────────────────────┘              ║
║            │ MRO                           │ MRO                     ║
║            ▼ (intra-LPAR)                  ▼ (intra-LPAR)            ║
║                                                                      ║
║  SYSC (Web Channel)              SYSD (Mobile API Channel)           ║
║  ┌────────────────────┐          ┌────────────────────┐              ║
║  │ CNBTORC1 (TOR)     │          │ CNBTORD1 (TOR)     │              ║
║  │ CICS Web Services  │          │ Liberty JVM        │              ║
║  │ Online banking     │◄──IPIC──▶│ Mobile REST API    │              ║
║  ├────────────────────┤          ├────────────────────┤              ║
║  │ CNBAORC1 (AOR)     │          │ CNBAORD1 (AOR)     │              ║
║  │ Web transactions   │          │ API transactions   │              ║
║  │ MAXTASK=300        │          │ MAXTASK=350        │              ║
║  ├────────────────────┤          ├────────────────────┤              ║
║  │ CNBAORC2 (AOR)     │          │ CNBAORD2 (AOR)     │              ║
║  │ Web transactions   │          │ API transactions   │              ║
║  │ MAXTASK=300        │          │ MAXTASK=350        │              ║
║  ├────────────────────┤          ├────────────────────┤              ║
║  │ CNBFORC1 (FOR)     │          │ CNBFORD1 (FOR)     │              ║
║  │ Audit journals     │          │ Session/token      │              ║
║  ├────────────────────┤          └────────────────────┘              ║
║  │ CNBSMGR (CMAS)     │                                              ║
║  │ Primary CICSPlex   │           CNBSMG2 (CMAS) on SYSD             ║
║  │ SM manager         │◄──IPIC──▶ Secondary/standby                  ║
║  └────────────────────┘                                              ║
║                                                                      ║
║  DB2 Data Sharing Group: CNBDB2G                                     ║
║  ┌──────────────────────────────────────────┐                        ║
║  │ DB2A (SYSA) ◄──CF──▶ DB2B (SYSB)         │                        ║
║  │ DB2C (SYSC) ◄──CF──▶ DB2D (SYSD)         │                        ║
║  │ Coupling Facility: Lock structure + GBP  │                        ║
║  └──────────────────────────────────────────┘                        ║
╚══════════════════════════════════════════════════════════════════════╝
Key Design Decisions
Decision 1: Four LPARs, not two. The 2-LPAR design (SYSA for traditional, SYSB for web/API) was considered and rejected. Kwame's reasoning: "Two LPARs gives you hardware isolation between traditional and digital channels. But web and mobile API have very different traffic patterns, security requirements, and scaling trajectories. Mobile is growing 50% per year; web is growing 10%. In three years, mobile will need its own scaling headroom. Better to separate now than to perform another migration later."
Decision 2: Active/active AOR pairs per channel. Each channel has two AORs. Under normal conditions, CPSM distributes workload roughly 50/50. During failover, the surviving AOR absorbs 100%. This requires each AOR to be sized for full channel capacity — not half. The AORs run at roughly 50% of their MAXTASK capacity under normal conditions.
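The sizing rule in Decision 2 comes down to simple arithmetic. The sketch below is illustrative only: the function name and the peak figure of 250 concurrent tasks are assumptions rather than CNB measurements, and it assumes the workload splits evenly across whichever AORs are active.

```python
def aor_utilization(peak_channel_tasks: int, maxtask: int, active_aors: int) -> float:
    """Fraction of MAXTASK in use on each active AOR, assuming an even split."""
    if active_aors < 1:
        raise ValueError("at least one AOR must be active")
    return peak_channel_tasks / active_aors / maxtask


# Hypothetical peak: 250 concurrent tasks for the whole channel.
PEAK_CHANNEL_TASKS = 250

normal = aor_utilization(PEAK_CHANNEL_TASKS, maxtask=250, active_aors=2)
failover = aor_utilization(PEAK_CHANNEL_TASKS, maxtask=250, active_aors=1)

print(f"normal:   {normal:.0%} of MAXTASK per AOR")
print(f"failover: {failover:.0%} of MAXTASK on the surviving AOR")
```

If each AOR were sized for only half the channel (MAXTASK=125 in this example), the failover figure would be 200%, which is the situation the active/active sizing rule is meant to avoid.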
Decision 3: DB2 data sharing eliminates most function shipping. The highest-volume data — customer accounts, balances, transaction history — is in DB2. Every AOR on every LPAR accesses DB2 through the data sharing group. No function shipping needed for DB2 data. The FORs handle only VSAM files that haven't been migrated to DB2 (customer master index for legacy compatibility, audit journals, session tokens).
Decision 4: Separate MAXTASK per channel. Core banking AORs (SYSA, SYSB): MAXTASK=250. These serve 3270 and ATM transactions with consistent, predictable volume. Web AORs (SYSC): MAXTASK=300. Web transactions are slightly more bursty. API AORs (SYSD): MAXTASK=350. Mobile API transactions are highly bursty with lower per-transaction resource consumption.
Decision 5: CMAS placement on SYSC and SYSD. The CMASs are on the channels most likely to need dynamic management (web and mobile). Placing them on SYSA/SYSB would mean that a traditional-channel LPAR failure could impact CICSPlex SM management for all channels. On SYSC/SYSD, a traditional-channel LPAR failure has zero impact on CPSM management.
Migration Strategy
The migration from 10 regions on 2 LPARs to 16 regions on 4 LPARs was executed in four phases over 14 months:
Phase 1: CICSPlex SM Deployment (Months 1–3)
Before adding regions, CNB needed enterprise management. They deployed CICSPlex SM on the existing 2-LPAR topology:
- Installed CMAS on each LPAR
- Defined existing regions as managed systems
- Migrated from custom routing programs to CPSM workload management
- Validated CPSM monitoring and alerting against existing operational thresholds
This phase carried the lowest risk — no topology changes, no application changes. It established the management foundation for subsequent phases.
Phase 2: Web Channel Isolation (Months 4–7)
SYSC was activated with:
- CNBTORC1 (web TOR with CICS Web Services)
- CNBAORC1, CNBAORC2 (web AOR pair)
- CNBFORC1 (audit FOR)
Web traffic was migrated from the shared AORs to the dedicated web AORs using CPSM workload definitions. The cutover was gradual — CPSM routed 10% of web traffic to the new AORs, then 25%, 50%, 75%, 100% over four weeks. At each stage, response times, error rates, and resource utilization were compared between old and new paths.
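The staged cutover can be pictured as a percentage split that only ever grows. The sketch below is a toy model, not a description of how CPSM implements routing: the `route` function, the CRC32 bucketing, and the transaction IDs are all assumptions made for illustration. It shows one useful property of a stable split: a transaction that lands on the new path at 10% stays on the new path at every later stage, which keeps old-versus-new comparisons consistent.

```python
import zlib

# Cutover stages from the case study: percent of web traffic on the new AORs.
CUTOVER_STAGES = [10, 25, 50, 75, 100]


def route(transaction_id: str, new_path_percent: int) -> str:
    """Return "new" or "old" for one transaction.

    A stable CRC32 bucket (0-99) keeps each ID on the same side of the split
    within a stage, and anything on the new path stays there as the
    percentage grows.
    """
    bucket = zlib.crc32(transaction_id.encode("utf-8")) % 100
    return "new" if bucket < new_path_percent else "old"


def share_on_new_path(transaction_ids, new_path_percent: int) -> float:
    """Observed fraction of the sample that lands on the new AORs."""
    routed = [route(t, new_path_percent) for t in transaction_ids]
    return routed.count("new") / len(routed)


if __name__ == "__main__":
    sample = [f"WEB{n:06d}" for n in range(10_000)]  # hypothetical IDs
    for pct in CUTOVER_STAGES:
        print(f"target {pct:3d}%  observed {share_on_new_path(sample, pct):.1%}")
```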
Phase 3: Mobile API Channel (Months 8–12)
SYSD was activated with:
- CNBTORD1 (API TOR with Liberty JVM server)
- CNBAORD1, CNBAORD2 (API AOR pair)
- CNBFORD1 (session/token FOR)
Mobile banking was a new channel, so there was no migration — just activation. However, the AOR programs were the same COBOL programs used by web and 3270 (invoked through DPL from the Liberty JVM server). Testing focused on ensuring the COBOL programs behaved correctly when invoked through the API path rather than the traditional BMS path.
Phase 4: Traditional Channel Rebalancing (Months 12–14)
With web and mobile traffic removed from SYSA/SYSB, the traditional channel had excess capacity. The team:
- Reduced MAXTASK on core AORs from 300 to 250
- Removed web-specific CSD definitions from SYSA/SYSB AORs
- Consolidated VSAM files; some files previously needed on both LPARs were now only needed on specific channels
- Performed final DR testing with the complete 16-region topology
Operational Lessons — Two Years in Production
Lesson 1: IPIC Session Tuning Is Ongoing
Initial IPIC session counts (SENDCOUNT/RECEIVECOUNT) were set based on pre-migration estimates. Within the first quarter, the team found that cross-LPAR IPIC sessions between core banking AORs were hitting 85% utilization during end-of-month processing (batch-initiated CICS transactions for statement generation). Sessions were increased from 25 to 50 per IPIC connection.
Kwame's rule: "Review IPIC session utilization monthly for the first year, quarterly after that. Workload patterns shift. What was sized correctly in January may be undersized by June."
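The numbers in Lesson 1 can be worked through as a quick check. This is a minimal sketch: the 80% review threshold and the function names are assumptions, and the peak figure is simply derived from the 85% utilization reported against the original limit of 25 sessions.

```python
REVIEW_THRESHOLD = 0.80  # hypothetical alert level, not a CNB figure


def session_utilization(peak_in_use: float, session_limit: int) -> float:
    """Peak concurrent sessions as a fraction of the configured limit."""
    if session_limit <= 0:
        raise ValueError("session_limit must be positive")
    return peak_in_use / session_limit


def needs_review(utilization: float, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Flag a connection whose peak utilization reaches the review threshold."""
    return utilization >= threshold


# 85% of the original 25 sessions were in use at the month-end peak.
peak_in_use = 0.85 * 25

before = session_utilization(peak_in_use, 25)
after = session_utilization(peak_in_use, 50)
print(f"before: {before:.1%} (review: {needs_review(before)})")
print(f"after:  {after:.1%} (review: {needs_review(after)})")
```

For the same peak, doubling the limit halves utilization to roughly 42.5%, which leaves room for growth before the next review would flag the connection again.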
Lesson 2: CPSM Workload Definitions Need Care
When the mobile API launched, the CPSM workload definition routed all API transactions to the CNBAPI AOR group. Six months later, a new API endpoint for real-time balance notifications was added. These were lightweight, high-frequency transactions that didn't need the same AOR resources as funds transfers. The team created a second workload definition — CNBAPI-LITE — for notification transactions, allowing CPSM to route them more aggressively (higher MAXTASK utilization threshold).
Lesson: Don't define one workload per channel. Define workloads based on resource profiles. Lightweight and heavyweight transactions should be managed separately even if they arrive through the same channel.
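One way to picture this lesson is a classifier that assigns transactions to a workload by resource profile instead of by channel. The sketch below is hypothetical throughout: the transaction IDs, CPU and DB2 figures, and thresholds are invented for illustration, and only the two workload names come from the case study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TxnProfile:
    tran_id: str       # hypothetical CICS transaction ID
    avg_cpu_ms: float  # average CPU per transaction
    db2_calls: int     # average DB2 calls per transaction


def workload_for(profile: TxnProfile,
                 cpu_limit_ms: float = 5.0,
                 db2_call_limit: int = 3) -> str:
    """Assign a workload by resource profile rather than by channel."""
    lightweight = (profile.avg_cpu_ms <= cpu_limit_ms
                   and profile.db2_calls <= db2_call_limit)
    return "CNBAPI-LITE" if lightweight else "CNBAPI"


profiles = [
    TxnProfile("NOTF", avg_cpu_ms=1.2, db2_calls=1),    # balance notification
    TxnProfile("XFER", avg_cpu_ms=18.0, db2_calls=12),  # funds transfer
]
assignments = {p.tran_id: workload_for(p) for p in profiles}
print(assignments)
```

Both transactions arrive through the mobile API channel, yet they end up in different workloads because their resource profiles differ.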
Lesson 3: FOR Placement Matters More Than Expected
CNBFORC1 (audit FOR on SYSC) handles audit journal writes from web AORs. It also receives function-shipped audit writes from API AORs on SYSD — via IPIC, since the regions are on different LPARs.
During a compliance audit, the team discovered that audit write latency from SYSD AORs was 3x higher than from SYSC AORs (IPIC vs. MRO). For most purposes this was irrelevant, but audit record sequence ordering was affected — records from SYSD sometimes appeared to arrive out of order relative to SYSC records when viewed chronologically.
The fix: a second audit FOR on SYSD (the team was already considering this for capacity reasons), with a batch reconciliation process that merges and re-sequences audit records from both FORs nightly. The alternative — moving audit files to DB2 — was considered but deferred because the audit archive format was mandated by regulators and change would require regulatory approval.
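The nightly merge-and-resequence step can be sketched as a standard k-way merge. This is an illustration under stated assumptions, not CNB's batch job: it assumes each FOR's records are already in timestamp order, that timestamps are stamped at the originating AOR in one time zone, and the record layout and source labels are invented.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple, Tuple


class AuditRecord(NamedTuple):
    timestamp: str  # ISO 8601 text in one time zone, so text order is time order
    source: str     # which FOR wrote the record (hypothetical labels below)
    payload: str


def merge_audit_streams(
    *streams: Iterable[AuditRecord],
) -> Iterator[Tuple[int, AuditRecord]]:
    """Merge per-FOR streams (each already in timestamp order) and renumber."""
    merged = heapq.merge(*streams, key=lambda rec: rec.timestamp)
    return enumerate(merged, start=1)


sysc = [
    AuditRecord("2020-06-01T10:00:00.100", "SYSC-FOR", "web login"),
    AuditRecord("2020-06-01T10:00:00.300", "SYSC-FOR", "web transfer"),
]
sysd = [
    AuditRecord("2020-06-01T10:00:00.200", "SYSD-FOR", "api balance"),
]

result = list(merge_audit_streams(sysc, sysd))
for seq, rec in result:
    print(seq, rec.timestamp, rec.source, rec.payload)
```

`heapq.merge` reads the inputs lazily, so the same approach works on large files streamed from disk without loading either FOR's journal fully into memory.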
Lesson 4: DR Testing Revealed a CMAS Dependency
During the first DR test with the 16-region topology, the team discovered that CMAS failover took 45 seconds — during which no new workload definitions could be deployed and CPSM health monitoring was suspended. Transaction routing continued (MAS agents used cached data), but the 45-second gap meant that if an AOR failed during CMAS failover, CPSM would not detect it for up to 45 seconds, and the failing AOR's transactions would get timeout errors.
The fix: the team reduced the CMAS heartbeat interval from 30 seconds to 10 seconds, and configured MAS agents to perform local health checks independently of CMAS monitoring. If a MAS agent detects that its own region is unhealthy (DSA utilization above 90%, active tasks above 95% of MAXTASK), it proactively removes itself from the routing pool without waiting for CMAS to decide.
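The local health check amounts to a threshold test that a region can evaluate on its own. The sketch below is a toy model in Python, not MAS agent code: the function name and inputs are assumptions, and it assumes that breaching either threshold alone is enough to leave the routing pool, which the text does not state explicitly.

```python
def should_leave_routing_pool(dsa_used_pct: float,
                              active_tasks: int,
                              maxtask: int,
                              dsa_limit_pct: float = 90.0,
                              task_limit_pct: float = 95.0) -> bool:
    """Return True when the region should remove itself from routing.

    Assumption: either condition on its own counts as unhealthy.
    """
    if maxtask <= 0:
        raise ValueError("maxtask must be positive")
    task_pct = 100.0 * active_tasks / maxtask
    return dsa_used_pct > dsa_limit_pct or task_pct > task_limit_pct


# Hypothetical readings for an API AOR with MAXTASK=350.
print(should_leave_routing_pool(dsa_used_pct=85.0, active_tasks=300, maxtask=350))
print(should_leave_routing_pool(dsa_used_pct=85.0, active_tasks=340, maxtask=350))
print(should_leave_routing_pool(dsa_used_pct=92.0, active_tasks=100, maxtask=350))
```

The first reading is healthy (about 86% of MAXTASK); the second breaches the task threshold (about 97%); the third breaches the DSA threshold.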
Lesson 5: Naming Convention Saves Hours in Incidents
During a Saturday evening incident, a junior operator was asked to restart "the AOR on SYSA." SYSA hosts two AORs, but the naming convention (CNBAORA1, CNBAORA2) made the candidates obvious: the operator confirmed which AOR was failing (CNBAORA2) and restarted the correct region on the first attempt.
Kwame contrasts this with a previous employer where CICS regions were named CICSPRD1 through CICSPRD12 with no encoding of type or location. "At 2 AM, nobody remembers that CICSPRD7 is the FOR on LPAR 3. But CNBFORA1 tells you everything: it's CNB's, it's a FOR, it's on SYSA, it's instance 1. That naming convention has paid for itself a hundred times over."
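Because the convention is positional, a region name can be decoded mechanically. The parser below is a sketch inferred from the TOR, AOR, and FOR names in the topology diagram; the pattern is an assumption, and it deliberately does not cover the CMAS names (CNBSMGR, CNBSMG2), which follow a different shape.

```python
import re

# Assumed layout: CNB + role (TOR/AOR/FOR) + LPAR letter (A-D) + instance digit.
NAME_PATTERN = re.compile(
    r"^(?P<org>CNB)(?P<role>TOR|AOR|FOR)(?P<lpar>[A-D])(?P<instance>\d)$"
)


def parse_region_name(name: str) -> dict:
    """Decode a region name into organization, role, LPAR, and instance."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the region naming convention")
    return {
        "org": match["org"],
        "role": match["role"],
        "lpar": "SYS" + match["lpar"],
        "instance": int(match["instance"]),
    }


print(parse_region_name("CNBFORA1"))
print(parse_region_name("CNBAORD2"))
```

A name such as CICSPRD7 raises `ValueError`, which mirrors the point of the anecdote: there is nothing in it to decode.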
Performance Metrics — Post-Migration
| Metric | Before (10 regions) | After (16 regions) | Change |
|---|---|---|---|
| Core banking 95th percentile response time | 185ms | 142ms | -23% |
| Web portal 95th percentile response time | 340ms | 210ms | -38% |
| ATM degradation incidents per quarter | 3.2 | 0.1 | -97% |
| Cross-channel impact incidents | 2.1/quarter | 0/quarter (2 years) | -100% |
| CICS region restarts (unplanned) | 1.8/month | 1.2/month | -33% |
| Mean time to recover from AOR failure | 4.2 minutes | 0.8 minutes* | -81% |
*CPSM detects AOR failure and reroutes within seconds. The 0.8 minutes includes the AOR auto-restart and re-integration into the routing pool.
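The Change column follows directly from the before and after values. The short check below recomputes it; the dictionary labels are abbreviations chosen for this sketch, and the figures are copied from the table.

```python
def pct_change(before: float, after: float) -> int:
    """Percent change from before to after, rounded to a whole percent."""
    return round(100 * (after - before) / before)


# (before, after) pairs copied from the table above.
metrics = {
    "core banking p95 (ms)": (185, 142),
    "web portal p95 (ms)": (340, 210),
    "ATM degradation incidents/quarter": (3.2, 0.1),
    "cross-channel incidents/quarter": (2.1, 0),
    "unplanned restarts/month": (1.8, 1.2),
    "AOR recovery time (min)": (4.2, 0.8),
}

changes = {name: pct_change(b, a) for name, (b, a) in metrics.items()}
for name, change in changes.items():
    print(f"{name}: {change}%")
```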
Discussion Questions
- CNB chose full channel isolation (4 LPARs) over partial isolation (2 LPARs with separate AOR groups). Under what circumstances would partial isolation be the better choice?
- The audit FOR placement issue (Lesson 3) was resolved with a second FOR and batch reconciliation rather than migrating to DB2. Evaluate this decision: what are the risks of the chosen approach vs. the DB2 migration approach?
- The CMAS failover gap (Lesson 4) was mitigated with shorter heartbeat intervals and local MAS health checks. Are there scenarios where even these mitigations are insufficient? What would be the next level of hardening?
- CNB's mobile API volume is growing 50% per year. In 3 years, SYSD will need 4 AORs. Should they add AORs to SYSD, add a fifth LPAR, or redesign the topology? What factors drive this decision?
- Consider a scenario where CNB acquires a smaller bank with its own CICS environment. How would you integrate the acquired bank's CICS topology into CNB's existing CICSplex? What are the risks?