Case Study 1: CNB's MQ Infrastructure — 47 Connected Systems and Inter-Bank Messaging
Background
Continental National Bank (CNB) is a Tier-1 institution processing 500 million transactions per day across four LPARs. In 2019, CNB completed its acquisition of Heritage Regional Bank, inheriting 1.2 million consumer accounts and 14 additional application systems that needed to integrate with CNB's core banking platform.
Before the acquisition, CNB already had 33 systems connected to its core through a mix of CICS MRO links, Connect:Direct file transfers, direct DB2 access, and CICS START commands. The integration map was, in Kwame Mensah's words, "held together by tribal knowledge and duct tape."
This case study examines how CNB rebuilt its integration architecture around IBM MQ, the technical decisions made, the problems encountered, and the production metrics three years later.
The Pre-MQ Architecture
The Spaghetti Diagram
CNB's 33 systems were connected by approximately 340 point-to-point integration paths:
- 87 CICS MRO links between the core banking TOR/AOR/FOR structure and satellite systems (card management, lending, customer information file)
- 62 Connect:Direct file transfers running on various batch schedules (hourly, daily, weekly)
- 48 direct DB2 access paths where one system's programs directly read/wrote another system's DB2 tables
- 34 CICS START commands for cross-system transaction initiation
- 109 miscellaneous connections including VSAM shared DASD paths, IMS message queues (legacy), and custom socket programs
Documented Failures
In the 18 months prior to the MQ project, CNB experienced:
- 23 outages caused by cascade failures, where one failing system took upstream or downstream systems down with it
- 7 data consistency incidents where a file transfer partially completed, leaving systems out of sync
- 4 production incidents caused by Connect:Direct schedule conflicts during maintenance windows
- 1 Severity 1 incident where a direct DB2 access path created a lock contention cascade that froze the core banking system for 47 minutes during trading hours
The Severity 1 incident was the catalyst. The card management system had a rogue query that held locks on the core ACCOUNTS table for 8 seconds — an eternity in OLTP. Because the card system accessed the core's DB2 directly, there was no throttle, no circuit breaker, and no way to isolate the problem without killing the card system's DB2 threads, which required a coordinated response across two operations teams.
"That's when the CTO said, 'Fix this,'" Kwame recalls. "Not 'plan to fix this.' Fix it."
The MQ Migration
Phase 1: Architecture Design (8 weeks)
Kwame's team designed a hub-and-spoke MQ topology with four queue managers:
| Queue Manager | LPAR | Role |
|---|---|---|
| CNBPQM01 | LPAR A | Core banking hub, primary routing |
| CNBPQM02 | LPAR B | Payments and wire transfers |
| CNBPQM03 | LPAR C | Batch processing, file-based integration replacement |
| CNBPQM04 | LPAR D | External interfaces (FedNow, SWIFT, card networks, mobile) |
Key design decisions:
Decision 1: Shared queues for critical payment flows. Wire transfers, ACH settlements, and FedNow payments use shared queues in the Coupling Facility. These flows have zero message loss tolerance and require instant failover. Everything else uses clustered queues.
Decision 2: Canonical message format. Instead of each pair of systems agreeing on a format (N-squared problem), CNB defined a canonical message format — a single, versioned data representation. Systems translate to/from the canonical format at the edges. This meant writing transformation programs, but it eliminated format coupling.
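The scale of the N-squared problem is easy to quantify. A back-of-the-envelope check (the 47-system count is from this case study; the assumption of one inbound and one outbound translator per system is ours):

```python
from math import comb

# Pairwise format agreements needed if every pair of systems negotiates
# its own format, versus edge translators with one canonical format
# (one inbound and one outbound transformation per system).
n_systems = 47
pairwise_formats = comb(n_systems, 2)  # 47 * 46 / 2 = 1,081
edge_translators = 2 * n_systems       # 94
```

A canonical format trades 1,081 potential pairwise agreements for 94 translators maintained at the edges.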
The canonical format uses a fixed header:
```cobol
01  CNB-MSG-HEADER.
    05  CNB-MSG-FORMAT-ID      PIC X(8).
        88  CNB-FMT-ACCTEVT    VALUE 'CNBACCT1'.
        88  CNB-FMT-WIRETRN    VALUE 'CNBWIRE1'.
        88  CNB-FMT-PAYMENT    VALUE 'CNBPYMT1'.
        88  CNB-FMT-CUSTINF    VALUE 'CNBCUST1'.
    05  CNB-MSG-VERSION        PIC 9(4).
    05  CNB-MSG-TIMESTAMP      PIC X(26).
    05  CNB-MSG-SOURCE         PIC X(8).
    05  CNB-MSG-TYPE           PIC X(4).
        88  CNB-TYPE-CREATE    VALUE 'CREA'.
        88  CNB-TYPE-UPDATE    VALUE 'UPDT'.
        88  CNB-TYPE-DELETE    VALUE 'DELE'.
        88  CNB-TYPE-INQUIRY   VALUE 'INQR'.
    05  CNB-MSG-PRIORITY       PIC 9(1).
    05  CNB-MSG-BODY-LENGTH    PIC 9(8) COMP.
    05  CNB-MSG-CORRELATION    PIC X(24).
    05  FILLER                 PIC X(17).
```
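The header's fields sum to 96 bytes (the PIC 9(8) COMP length is a 4-byte binary field). To make the layout concrete, here is an illustrative Python codec mirroring the copybook; this is a sketch, not CNB's actual code, and it renders the display numerics as ASCII digits rather than z/OS zoned decimal for simplicity:

```python
import struct

# Widths mirror the copybook: X(8), 9(4), X(26), X(8), X(4), 9(1),
# 9(8) COMP (4-byte big-endian binary), X(24), X(17) filler = 96 bytes.
HEADER_FMT = ">8s4s26s8s4s1sI24s17s"
HEADER_LEN = struct.calcsize(HEADER_FMT)  # 96

def build_header(fmt_id, version, timestamp, source, msg_type,
                 priority, body_length, correlation):
    """Pack one canonical-format header into its fixed 96-byte layout."""
    return struct.pack(
        HEADER_FMT,
        fmt_id.ljust(8).encode(),
        f"{version:04d}".encode(),
        timestamp.ljust(26).encode(),
        source.ljust(8).encode(),
        msg_type.encode(),
        str(priority).encode(),
        body_length,
        correlation.ljust(24).encode(),
        b" " * 17,
    )

def parse_header(raw):
    """Unpack a 96-byte header into a dictionary of typed fields."""
    fields = struct.unpack(HEADER_FMT, raw)
    return {
        "format_id": fields[0].decode().rstrip(),
        "version": int(fields[1]),
        "timestamp": fields[2].decode().rstrip(),
        "source": fields[3].decode().rstrip(),
        "type": fields[4].decode(),
        "priority": int(fields[5]),
        "body_length": fields[6],
        "correlation": fields[7].decode().rstrip(),
    }
```

A fixed layout like this is what lets every edge translator parse the header without negotiating per-sender conventions.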
Decision 3: No direct DB2 access across system boundaries. All 48 cross-system DB2 access paths would be replaced with MQ request/reply or event-driven patterns. This was the most controversial decision — some teams argued that MQ adds latency. Kwame's response: "Direct DB2 access adds lock contention, cascade failures, and 3 AM pages. Pick your cost."
Decision 4: Phased migration with parallel running. Each integration path was migrated individually, with a 30-day parallel run where both old and new paths operated simultaneously. A reconciliation program compared results daily.
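The daily reconciliation in Decision 4 can be sketched as a keyed comparison of results from the old and new paths. A minimal sketch (the data shapes are illustrative; the case study does not describe the actual reconciliation program's design):

```python
def reconcile(old_path_results, new_path_results):
    """Compare keyed results from the legacy path and the MQ path.

    Each argument maps a business key (e.g. a transaction ID) to the
    result observed on that path. Returns (key, old, new) tuples for
    every mismatch or one-sided record, for manual investigation.
    """
    discrepancies = []
    all_keys = set(old_path_results) | set(new_path_results)
    for key in sorted(all_keys):
        old = old_path_results.get(key, "<missing>")
        new = new_path_results.get(key, "<missing>")
        if old != new:
            discrepancies.append((key, old, new))
    return discrepancies
```

Running a comparison like this daily during the 30-day window is what surfaced the Wave 4 stale exchange rate issue described under Lessons Learned.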
Phase 2: Infrastructure Build (4 weeks)
The MQ infrastructure was deployed and hardened:
- 4 queue managers configured and tested
- 12 sender/receiver channel pairs defined
- Shared queue structures allocated in the Coupling Facility
- TLS configured on all channels
- RACF security profiles defined for 47 systems (each system gets its own RACF user ID with access only to its queues)
- Monitoring configured: queue depth alerts, channel status checks, DLQ watchers
- Dual logging configured on all queue managers
- Backup and recovery procedures tested (queue manager recovery, channel recovery, shared queue CF structure recovery)
Phase 3: Migration Waves (40 weeks)
The 340 integration paths were migrated in ten waves, prioritized by risk and dependency:
Wave 1 (4 weeks): Audit feed — 6 systems sending audit events to the compliance system. Pure fire-and-forget datagrams. Low risk, high learning value. The team learned MQ operations on a non-critical path.
Wave 2 (4 weeks): Customer notification — 8 systems publishing events to the notification engine. Pub/sub pattern. First use of topics and subscriptions.
Wave 3 (6 weeks): Fraud screening — request/reply between core banking and the fraud detection engine. First use of request/reply pattern with SLA requirements. This wave revealed a critical issue: the fraud engine occasionally took 15 seconds to respond, causing the 5-second timeout to fire and route transactions to manual review. The team had to work with the fraud team to optimize their engine before proceeding.
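The Wave 3 timeout behavior can be modeled as a bounded wait on a reply queue: the caller blocks for the SLA window and routes the transaction to manual review on expiry. An in-process sketch using Python queues (the real flow uses MQGET with a wait interval and correlation ID matching; the function and queue names here are illustrative):

```python
import queue

def screen_transaction(request, reply_queue, manual_review_queue,
                       sla_seconds=5.0):
    """Wait for the fraud engine's reply; on timeout, route to manual review.

    Mirrors the Wave 3 pattern: a reply that does not arrive within the
    SLA window counts as a miss, and the transaction is queued for
    manual review instead of blocking the caller indefinitely.
    """
    try:
        reply = reply_queue.get(timeout=sla_seconds)
        return ("screened", reply)
    except queue.Empty:
        manual_review_queue.put(request)
        return ("manual-review", None)
```

This also shows why a slow responder is so costly: every reply slower than the SLA burns the full timeout on the caller's side and still ends up in manual review.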
Wave 4 (5 weeks): Card management integration. Replaced the 48 direct DB2 access paths with MQ request/reply. This was the highest-risk wave — if the MQ path added too much latency, card authorizations would time out and customers would see declined transactions. Kwame's team ran a load test replicating 150% of peak volume for 72 hours before going live.
Waves 5–10 (21 weeks): Remaining systems — lending, general ledger, ATM network, online banking, batch feeds, regulatory reporting, and the Heritage Regional Bank systems.
Heritage Regional Bank Integration
The acquisition systems were migrated in Wave 9. Heritage ran its own MQ infrastructure (two queue managers) that had to be connected to CNB's network. The approach:
- Defined inter-organization channels between Heritage QMs and CNB QM04
- Used MQ clustering to make Heritage queues visible to CNB systems
- Implemented format transformation at the CNB gateway — Heritage used a different message format
- Ran a 60-day parallel window (double the standard) because cross-organization integration has more failure modes
The Heritage integration completed in 6 weeks — half the time estimated — because MQ's location transparency meant CNB programs didn't need to know they were talking to a Heritage system. They put messages to a queue name; MQ handled the routing.
Inter-Bank Messaging: The FedNow and SWIFT Interfaces
FedNow Real-Time Payments
CNB's FedNow interface runs on LPAR D (CNBPQM04). The architecture:
```text
Core Banking (CNBPQM01)
  ↓ MQPUT to CNB.FEDNOW.OUTBOUND (shared queue)
CNBPQM04 picks up message
  ↓ Transform to ISO 20022 XML
  ↓ Send via FedNow API (MQ-to-HTTP bridge)
FedNow returns response
  ↓ MQ-to-HTTP bridge puts response to CNB.FEDNOW.INBOUND
CNBPQM04 routes back to CNBPQM01
  ↓ Core banking processes confirmation/rejection
```
Key design elements:
- Shared queues for the outbound path — FedNow has a 20-second end-to-end SLA, so the message path can't tolerate queue manager failover delays
- Request/reply pattern — the core banking program needs to know whether the payment was accepted
- Message expiry — requests expire after 15 seconds; if FedNow hasn't responded, the transaction is flagged for investigation
- Idempotency — every FedNow message carries a unique payment ID; the FedNow interface checks for duplicates before sending
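The expiry and idempotency rules above can be sketched together: before dispatch, the interface drops duplicate payment IDs and flags requests whose expiry window has passed. The 15-second expiry and the duplicate check are from the case study; the class and its data structures are illustrative (a production dedup store would be persistent and bounded, not an in-memory set):

```python
import time

class FedNowGate:
    """Illustrative pre-send checks: duplicate suppression and expiry."""

    def __init__(self, expiry_seconds=15.0):
        self.expiry_seconds = expiry_seconds
        self.seen_payment_ids = set()

    def admit(self, payment_id, created_at, now=None):
        """Return 'send', 'duplicate', or 'expired' for a payment request."""
        now = time.time() if now is None else now
        if payment_id in self.seen_payment_ids:
            return "duplicate"
        if now - created_at > self.expiry_seconds:
            return "expired"  # flagged for investigation, never sent
        self.seen_payment_ids.add(payment_id)
        return "send"
```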
SWIFT Wire Transfers
The SWIFT interface is more complex because it involves store-and-forward with acknowledgments:
```text
Wire Transfer Program (CNBPQM01)
  ↓ MQPUT to CNB.SWIFT.OUTBOUND
CNBPQM02 (Payments LPAR) picks up
  ↓ Compliance screening (request/reply, 10-second SLA)
  ↓ OFAC/sanctions check (request/reply, 5-second SLA)
  ↓ Transform to MT103 format
  ↓ Put to SWIFT Alliance queue
SWIFT Alliance sends to SWIFT network
  ↓ Acknowledgment returns
  ↓ Put to CNB.SWIFT.ACK.INBOUND
CNBPQM02 routes ack to CNBPQM01
  ↓ Wire transfer program updates status
  ↓ Publish notification event (pub/sub)
```
The compliance screening step uses a chained request/reply: the message goes to the compliance queue, gets a response, then goes to the OFAC queue, gets a response. Both must pass before the message proceeds to SWIFT. If either screening rejects the transfer, the message is routed to a manual review queue and the wire transfer program receives a "held for review" response.
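The chained screening logic can be sketched as sequential checks where any rejection diverts the wire to manual review. The screening callables here are stand-ins for the compliance and OFAC request/reply calls (SLAs from the text are noted in the docstring); the function name is illustrative:

```python
def route_wire(wire, compliance_check, ofac_check, manual_review_queue):
    """Chained request/reply: both screenings must pass before SWIFT.

    In the case study, compliance_check carries a 10-second SLA and
    ofac_check a 5-second SLA; here they are plain callables returning
    True (pass) or False (reject). On any rejection, the wire goes to
    manual review and the caller gets a "held for review" response.
    """
    for name, check in (("compliance", compliance_check),
                        ("ofac", ofac_check)):
        if not check(wire):
            manual_review_queue.append((name, wire))
            return "held for review"
    return "released to SWIFT"
```

Note that the checks run in sequence, not in parallel: a compliance rejection short-circuits the flow before the OFAC call is ever made.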
Production Metrics — Three Years Later
Volume
| Metric | Daily Value |
|---|---|
| Total messages processed | 180 million |
| Peak hour messages | 22 million |
| Peak-to-average ratio | 7.2x |
| Average message size | 2.4 KB |
| Largest message size | 94 KB (batch statement) |
Reliability
| Metric | Value |
|---|---|
| Message loss events (3 years) | 0 |
| DLQ messages per day (average) | 47 |
| DLQ resolution time (average) | 22 minutes |
| Poison messages per day (average) | 3 |
| Unplanned queue manager outages (3 years) | 2 |
| Planned maintenance windows missed | 0 |
Performance
| Flow | Average Latency | 99th Percentile |
|---|---|---|
| Fire-and-forget (local) | 0.3 ms | 1.2 ms |
| Fire-and-forget (cross-LPAR) | 1.8 ms | 4.5 ms |
| Request/reply (local) | 2.1 ms | 8.3 ms |
| Request/reply (cross-LPAR) | 5.7 ms | 18.2 ms |
| Fraud screening (end-to-end) | 1.4 sec | 3.8 sec |
| FedNow (end-to-end) | 2.3 sec | 6.1 sec |
Operational Impact
| Metric | Before MQ | After MQ |
|---|---|---|
| Integration-related outages/year | 23 | 2 |
| Time to add new system | 4–6 months | 2–4 weeks |
| Maintenance window coordination | 2,300-row spreadsheet | MQ handles it |
| 3 AM pages (integration-related) | 8/month | 0.5/month |
| Cross-system data consistency incidents | 7 in 18 months | 0 in 36 months |
Lessons Learned
Lesson 1: Canonical format is worth the upfront cost
The transformation programs took 30% of the development effort but saved 70% of the ongoing maintenance. Every new system integration now takes weeks instead of months because the format is defined once.
Lesson 2: Size your DLQ monitoring for day one
CNB initially set the DLQ alert threshold at 1,000 messages. In the first week of Wave 1, 6,000 messages hit the DLQ because of a configuration error in the audit system's queue name. The operations team didn't notice for 3 hours. The threshold is now 50.
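The revised rule makes sense at CNB's volumes: even a tiny error rate floods the DLQ quickly, so the alert fires on a low absolute depth. A quick illustration (the 180 million daily messages and the 50-message threshold are from the case study; the uniform arrival rate is a simplifying assumption):

```python
DAILY_MESSAGES = 180_000_000
DLQ_THRESHOLD = 50  # revised threshold after the Wave 1 incident

def minutes_to_alert(error_rate, threshold=DLQ_THRESHOLD):
    """Minutes until a given error rate trips the DLQ depth alert,
    assuming messages arrive uniformly over the day."""
    errors_per_minute = DAILY_MESSAGES / (24 * 60) * error_rate
    return threshold / errors_per_minute

# At a 0.01% error rate (12.5 bad messages/minute), the alert
# fires within 4 minutes instead of hours.
```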
Lesson 3: Parallel running is non-negotiable
Every wave included a 30-day parallel run. In Waves 3, 4, and 7, the parallel run caught discrepancies that would have been production incidents. In Wave 4 (card management), the parallel run revealed that 0.2% of card authorizations were being processed with stale exchange rates because the MQ path had slightly different timing than the direct DB2 path. The fix took 3 days — in production, it would have been a regulatory issue.
Lesson 4: MQ operations is a skill
CNB initially assigned MQ operations to the "middleware team" — generalists who also managed CICS and DB2 utilities. After two incidents caused by operators unfamiliar with MQ channel recovery procedures, Kwame advocated for a dedicated MQ team. Five people now manage the MQ infrastructure full-time. Their expertise has prevented at least a dozen potential incidents, by Kwame's estimate, through proactive monitoring and tuning.
Lesson 5: The fraud screening SLA drove more design decisions than anything else
The 5-second SLA for fraud screening influenced: shared queue vs. clustered queue choice, message priority assignments, CICS thread allocation, and even the physical placement of queue managers on LPARs. "One SLA requirement cascaded through the entire architecture," Kwame says. "Know your SLAs before you start designing."
Discussion Questions
1. CNB chose a hub-and-spoke topology with four queue managers. Under what circumstances would a full-mesh cluster (every queue manager connects to every other) be more appropriate? What are the tradeoffs?

2. The canonical message format requires transformation at the edges. What happens when a transformation program has a bug that corrupts data? How would you design the error handling for the transformation layer?

3. CNB's DLQ averages 47 messages per day. Is this a problem? What would you investigate? What's the acceptable threshold for a system processing 180 million messages daily?

4. The FedNow interface uses shared queues for zero failover delay. What happens if the Coupling Facility structure itself fails? Design a fallback strategy.

5. Wave 3 revealed that the fraud engine sometimes took 15 seconds to respond (against a 5-second SLA). How would you design the request/reply flow to handle this gracefully without routing excessive transactions to manual review?

6. CNB has had zero data consistency incidents since deploying MQ. Why does MQ's transactional messaging eliminate this class of problem? Could a data consistency incident still occur with MQ? Under what circumstances?