Case Study 2: SecureFirst's MQ-to-Cloud Bridge for Mobile Notifications
Background
SecureFirst Retail Bank launched its mobile banking app in 2022. The app was built on AWS — API Gateway, Lambda functions, DynamoDB, SNS for push notifications. The backend of record remained on the mainframe: CICS transactions, DB2 databases, VSAM files, and 30 years of COBOL business logic.
The challenge was immediate: customers expected real-time notifications. When a transaction posted, the mobile app should buzz within seconds. When a wire transfer cleared, the customer should know. When a suspicious transaction triggered a fraud alert, the customer's phone should light up.
The problem: the mainframe's COBOL programs knew about transactions the instant they happened. The mobile app lived in AWS. Between them lay a chasm of protocols, data formats, security boundaries, and organizational silos.
Carlos Vega, SecureFirst's API architect, proposed the solution: IBM MQ as the bridge. "MQ already speaks mainframe," Carlos said. "And IBM MQ has native connectors for cloud platforms. We don't need to build the bridge — we need to configure it."
Yuki Nakamura, the DevOps lead, was skeptical. "You want to put a 30-year-old messaging product in the critical path of a mobile experience? Our users expect sub-second response times."
"MQ is 30 years old the way a steel bridge is 100 years old," Carlos replied. "It's been carrying traffic the entire time."
The Architecture
The End-to-End Flow
   z/OS Mainframe         Network           AWS Cloud
+--------------------+       |       +---------------------+
|  COBOL Program     |       |       |  MQ Client          |
|  (CICS/Batch)      |       |       |  (EC2/ECS)          |
|        ↓           |       |       |         ↓           |
|  MQPUT to          |       |       |  Consumes from      |
|  SF.NOTIFY.OUTBOUND|       |       |  cloud-side queue   |
|        ↓           |       |       |         ↓           |
|  SFPQM01           |  TLS/1414     |  Transform to JSON  |
|  (z/OS QM)  →→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→→  ↓      |
|                    |Sender→Receiver|  Route to SNS       |
+--------------------+       |       |  (push notif.)      |
                             |       |         ↓           |
                             |       |  Customer's phone   |
                             |       +---------------------+
Components
On z/OS:
- Queue manager: SFPQM01 (SecureFirst Production Queue Manager 01)
- Outbound notification queue: SF.NOTIFY.OUTBOUND (local queue, persistent, triggered)
- Sender channel: SFPQM01.TO.SFCLOUD — TLS-encrypted, connected to the cloud-side MQ instance
- Transmission queue: SFPQM01.TO.SFCLOUD.XMIT
In AWS:
- IBM MQ on Amazon ECS (containerized queue manager): SFCLOUD01
- Inbound queue: SF.NOTIFY.INBOUND (receives messages from z/OS)
- MQ Consumer: Java/Spring Boot application running on ECS, consuming from SF.NOTIFY.INBOUND
- Amazon SNS: Distributes push notifications to iOS (APNs) and Android (FCM)
- Amazon CloudWatch: Monitoring and alerting
On the z/OS side, the COBOL notification program:
IDENTIFICATION DIVISION.
PROGRAM-ID. SFNOTIFY.
*-------------------------------------------------------*
* SecureFirst Notification Publisher *
* Publishes transaction events for mobile notifications *
*-------------------------------------------------------*
DATA DIVISION.
WORKING-STORAGE SECTION.
COPY CMQV.
COPY CMQODV.
COPY CMQMDV.
COPY CMQPMOV.
01 WS-NOTIFY-MSG.
05 WS-NTF-HEADER.
10 WS-NTF-FORMAT PIC X(8) VALUE 'SFNTFY01'.
10 WS-NTF-VERSION PIC 9(4) VALUE 0002.
10 WS-NTF-TIMESTAMP PIC X(26).
10 WS-NTF-CUST-ID PIC X(12).
10 WS-NTF-EVENT-TYPE PIC X(4).
88 NTF-TXN-POST VALUE 'TXNP'.
88 NTF-WIRE-CLEAR VALUE 'WIRC'.
88 NTF-FRAUD-ALERT VALUE 'FRDA'.
88 NTF-ACCT-CHANGE VALUE 'ACCH'.
88 NTF-BILL-DUE VALUE 'BILD'.
10 WS-NTF-PRIORITY PIC 9(1).
05 WS-NTF-BODY.
10 WS-NTF-ACCT-NUM PIC X(16).
10 WS-NTF-AMOUNT PIC S9(13)V99 COMP-3.
10 WS-NTF-CURRENCY PIC X(3).
10 WS-NTF-DESC PIC X(80).
10 WS-NTF-MERCHANT PIC X(40).
10 WS-NTF-LOCATION PIC X(40).
10 WS-NTF-REF-NUM PIC X(20).
01 WS-MQ-FIELDS.
05 WS-HCONN PIC S9(9) COMP.
05 WS-HOBJ PIC S9(9) COMP.
05 WS-COMPCODE PIC S9(9) COMP.
05 WS-REASON PIC S9(9) COMP.
05 WS-OPTIONS PIC S9(9) COMP.
05 WS-MSG-LENGTH PIC S9(9) COMP.
The program is called by any CICS transaction that needs to send a notification. It receives the notification data in the COMMAREA, builds the MQ message, and puts it to the outbound queue:
PROCEDURE DIVISION.
0000-MAIN.
PERFORM 1000-OPEN-QUEUE
PERFORM 2000-BUILD-MESSAGE
PERFORM 3000-PUT-MESSAGE
PERFORM 4000-CLOSE-QUEUE
EXEC CICS RETURN END-EXEC.
2000-BUILD-MESSAGE.
MOVE FUNCTION CURRENT-DATE
TO WS-NTF-TIMESTAMP
MOVE CA-CUSTOMER-ID TO WS-NTF-CUST-ID
MOVE CA-EVENT-TYPE TO WS-NTF-EVENT-TYPE
MOVE CA-ACCOUNT-NUM TO WS-NTF-ACCT-NUM
MOVE CA-AMOUNT TO WS-NTF-AMOUNT
MOVE CA-CURRENCY TO WS-NTF-CURRENCY
MOVE CA-DESCRIPTION TO WS-NTF-DESC
MOVE CA-MERCHANT TO WS-NTF-MERCHANT
MOVE CA-LOCATION TO WS-NTF-LOCATION
MOVE CA-REF-NUM TO WS-NTF-REF-NUM
* Set priority based on event type
EVALUATE TRUE
WHEN NTF-FRAUD-ALERT
MOVE 9 TO WS-NTF-PRIORITY
MOVE 9 TO MQMD-PRIORITY
WHEN NTF-WIRE-CLEAR
MOVE 7 TO WS-NTF-PRIORITY
MOVE 7 TO MQMD-PRIORITY
WHEN NTF-TXN-POST
MOVE 5 TO WS-NTF-PRIORITY
MOVE 5 TO MQMD-PRIORITY
WHEN OTHER
MOVE 3 TO WS-NTF-PRIORITY
MOVE 3 TO MQMD-PRIORITY
END-EVALUATE.
3000-PUT-MESSAGE.
MOVE MQMT-DATAGRAM TO MQMD-MSGTYPE
MOVE MQPER-PERSISTENT TO MQMD-PERSISTENCE
MOVE MQFMT-NONE TO MQMD-FORMAT
* MQFMT-NONE: the body mixes character and COMP-3
* fields, so channel data conversion must not
* rewrite it; the cloud consumer decodes it
MOVE 36000 TO MQMD-EXPIRY
* MQMD-EXPIRY is in tenths of a second: 36000 = 1 hour
* Stale notifications are worse than no notification
COMPUTE MQPMO-OPTIONS =
MQPMO-SYNCPOINT +
MQPMO-NEW-MSG-ID +
MQPMO-FAIL-IF-QUIESCING
MOVE LENGTH OF WS-NOTIFY-MSG
TO WS-MSG-LENGTH
CALL 'MQPUT' USING WS-HCONN
WS-HOBJ
MQMD
MQPMO
WS-MSG-LENGTH
WS-NOTIFY-MSG
WS-COMPCODE
WS-REASON
EVALUATE WS-COMPCODE
WHEN MQCC-OK
CONTINUE
WHEN MQCC-WARNING
PERFORM 8100-LOG-WARNING
WHEN MQCC-FAILED
PERFORM 8000-HANDLE-MQ-ERROR
END-EVALUATE.
The Challenges
Challenge 1: Latency Budget
Yuki's concern about latency was legitimate. The customer experience requirement was: notification within 5 seconds of the transaction posting. Here's how the latency budget broke down:
| Segment | Target | Actual (p95) |
|---|---|---|
| COBOL MQPUT (z/OS) | < 5 ms | 1.2 ms |
| Transmission queue to channel | < 10 ms | 3.8 ms |
| Network transit (z/OS → AWS) | < 50 ms | 28 ms |
| Cloud QM receive + queue | < 10 ms | 6.1 ms |
| Consumer MQGET + transform | < 50 ms | 31 ms |
| SNS publish | < 100 ms | 72 ms |
| APNs/FCM delivery | < 2000 ms | 800 ms |
| Total | < 2225 ms | ~940 ms |
The end-to-end latency came in well under the 5-second target. "MQ wasn't the bottleneck," Yuki admitted. "Push notification delivery was. Apple and Google add more latency than our entire mainframe-to-cloud pipeline."
Challenge 2: Message Format Translation
The mainframe sends EBCDIC-encoded fixed-format COBOL records. The cloud consumer expects UTF-8 JSON. The translation happens in the cloud-side consumer:
{
"formatId": "SFNTFY01",
"version": 2,
"timestamp": "2025-03-15T14:23:07.123456Z",
"customerId": "CUST00084721",
"eventType": "TXNP",
"priority": 5,
"account": "4532XXXXXXXX7891",
"amount": -47.52,
"currency": "USD",
"description": "PURCHASE - WHOLE FOODS MARKET",
"merchant": "WHOLE FOODS MARKET #10234",
"location": "AUSTIN TX",
"referenceNumber": "TXN2025031514230"
}
The cloud consumer handles the translation. MQ's built-in data conversion (MQFMT-STRING plus the channel's CONVERT(YES) attribute) works only for all-character message bodies; because this record mixes character fields with COMP-3 packed decimals, the consumer instead converts the character fields from EBCDIC itself, unpacks the binary fields, and maps the fixed-format record to JSON (see Lesson 1).
A critical design decision: the consumer must handle format versioning. When WS-NTF-VERSION is 0001, the body has a shorter layout (no merchant/location fields). When it's 0002, the full layout applies. The consumer checks the version and maps accordingly.
"We broke this once," Carlos says. "Version 3 added a field but the cloud consumer wasn't updated. It parsed the extra bytes as garbage and sent customers notifications with corrupted merchant names. We now have a contract test suite that validates both ends against the same message schema."
Challenge 3: Security Across Boundaries
The MQ channel between z/OS and AWS crosses a significant security boundary. SecureFirst's security requirements:
- Encryption in transit: TLS 1.2 with AES-256 on the MQ channel. Certificates managed by the z/OS PKI and AWS Certificate Manager.
- Mutual authentication: Both sides present certificates. The z/OS queue manager verifies the cloud queue manager's certificate, and vice versa.
- Channel authentication records (CHLAUTH): The z/OS queue manager only accepts connections from the specific IP range of the AWS VPC.
- Message-level security: Sensitive fields (account number, amount) are masked before leaving the mainframe. The notification contains only the last four digits of the account number. The cloud side never sees full account numbers.
*-------------------------------------------------------*
* Mask sensitive data before notification *
*-------------------------------------------------------*
2500-MASK-SENSITIVE-DATA.
MOVE CA-ACCOUNT-NUM TO WS-FULL-ACCT
MOVE ALL 'X' TO WS-NTF-ACCT-NUM(1:12)
MOVE WS-FULL-ACCT(13:4)
TO WS-NTF-ACCT-NUM(13:4).
The COBOL program masks the account number before the MQPUT. The full account number never leaves the mainframe. This is a defense-in-depth measure — even if the MQ channel is compromised, the attacker gets masked data.
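The CHLAUTH restriction listed above can be sketched in MQSC roughly as follows. The inbound channel name and the address range here are hypothetical; the usual pattern is to block all addresses first, then allow only the known range.

```
* Block all inbound connections on this channel by default
SET CHLAUTH('SFCLOUD.TO.SFPQM01') TYPE(ADDRESSMAP) +
    ADDRESS('*') USERSRC(NOACCESS) ACTION(ADD)
* Then allow only the AWS VPC address range (illustrative range)
SET CHLAUTH('SFCLOUD.TO.SFPQM01') TYPE(ADDRESSMAP) +
    ADDRESS('10.20.*') USERSRC(CHANNEL) ACTION(ADD)
```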
Challenge 4: What Happens When AWS Is Down?
During the design review, Diane Okoye (visiting from Pinnacle Health as an external reviewer) asked the question that nobody wanted to answer: "What happens when AWS has an outage and the cloud queue manager is unreachable?"
The answer: messages accumulate on the z/OS transmission queue. This is the temporal decoupling benefit — the mainframe doesn't need AWS to be up. It puts messages to a local queue; MQ handles delivery when the channel is available.
But this creates a capacity planning problem. SecureFirst generates approximately 2 million notification messages per hour during peak. If AWS is down for 4 hours, that's 8 million messages on the transmission queue.
The team sized the transmission queue accordingly:
DEFINE QLOCAL(SFPQM01.TO.SFCLOUD.XMIT) +
USAGE(XMITQ) +
DEFPSIST(YES) +
MAXDEPTH(50000000) +
MAXMSGL(10000) +
STGCLASS(MQSTG01)
A MAXDEPTH of 50 million messages, at an average of 300 bytes each, works out to roughly 15 GB of page set storage. SecureFirst allocated 25 GB. "Over-provisioning storage is cheap," Yuki says. "Losing messages because a queue filled up is not."
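The sizing arithmetic is simple enough to sketch as a back-of-the-envelope calculation. The figures come from the text; the class and method names are illustrative, not part of SecureFirst's tooling.

```java
// Back-of-the-envelope transmission queue sizing, using the figures
// from the text: peak message rate, outage window, average message size.
public class XmitQueueSizing {

    // Messages that accumulate on the transmission queue during an outage.
    static long backlogMessages(long msgsPerHour, long outageHours) {
        return msgsPerHour * outageHours;
    }

    // Page set storage needed for a given depth, in decimal gigabytes.
    static double storageGB(long messages, long avgBytesPerMsg) {
        return messages * avgBytesPerMsg / 1_000_000_000.0;
    }

    public static void main(String[] args) {
        long backlog = backlogMessages(2_000_000L, 4);      // 4-hour AWS outage
        double fullQueueGB = storageGB(50_000_000L, 300);   // queue at MAXDEPTH
        System.out.println(backlog + " messages, " + fullQueueGB + " GB");
        // prints: 8000000 messages, 15.0 GB
    }
}
```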
They also implemented a circuit breaker pattern:
- The COBOL program checks the transmission queue depth before each put. If the depth exceeds 10 million (roughly 5 hours of accumulation at peak), it begins dropping low-priority notifications (priority 1-2: marketing and informational).
- Priority 7+ notifications (fraud alerts, wire transfers) are never dropped.
- When the queue depth returns below 5 million, normal operation resumes.
2800-CHECK-CIRCUIT-BREAKER.
*-------------------------------------------------------*
* Circuit breaker: check xmit queue depth *
*-------------------------------------------------------*
MOVE 'SFPQM01.TO.SFCLOUD.XMIT'
TO MQOD-OBJECTNAME
COMPUTE WS-OPTIONS = MQOO-INQUIRE
+ MQOO-FAIL-IF-QUIESCING
CALL 'MQOPEN' USING WS-HCONN
MQOD
WS-OPTIONS
WS-HOBJ-INQ
WS-COMPCODE
WS-REASON
IF WS-COMPCODE = MQCC-OK
MOVE MQIA-CURRENT-Q-DEPTH
TO WS-SELECTOR(1)
MOVE 1 TO WS-SELECTOR-COUNT
MOVE 1 TO WS-INT-ATTR-COUNT
MOVE 0 TO WS-CHAR-ATTR-LENGTH
CALL 'MQINQ' USING WS-HCONN
WS-HOBJ-INQ
WS-SELECTOR-COUNT
WS-SELECTOR
WS-INT-ATTR-COUNT
WS-INT-ATTRS
WS-CHAR-ATTR-LENGTH
WS-CHAR-ATTRS
WS-COMPCODE
WS-REASON
MOVE WS-INT-ATTRS(1) TO WS-XMIT-DEPTH
END-IF
CALL 'MQCLOSE' USING WS-HCONN
WS-HOBJ-INQ
MQCO-NONE
WS-COMPCODE
WS-REASON
IF WS-XMIT-DEPTH > 10000000
SET WS-CIRCUIT-OPEN TO TRUE
ELSE IF WS-XMIT-DEPTH < 5000000
SET WS-CIRCUIT-CLOSED TO TRUE
END-IF.
Challenge 5: Notification Deduplication
A subtle problem emerged in production: when the MQ channel between z/OS and AWS went down and recovered, some messages were delivered twice. MQ guarantees at-least-once delivery for persistent messages, not exactly-once. During channel recovery, messages that were in flight might be re-sent.
The cloud consumer had to implement idempotency. Each notification message includes WS-NTF-REF-NUM — a unique reference number generated by the mainframe. The cloud consumer maintains a DynamoDB table of recently processed reference numbers (TTL: 24 hours). Before sending a push notification, it checks whether the reference number has already been processed.
This deduplication cost added 3-5 ms to the cloud-side processing — negligible in the context of the overall latency budget, but it required a DynamoDB table, TTL management, and another failure mode to handle (what if DynamoDB is unavailable?).
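The shape of that idempotency check can be sketched as follows. A ConcurrentHashMap stands in for the DynamoDB table here, and the class and method names are illustrative; the real consumer would rely on a conditional write (e.g. an attribute_not_exists condition) so the check-and-record step is atomic on the server side, and on DynamoDB's TTL to expire old entries.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the notification idempotency check. An in-memory map stands
// in for the DynamoDB dedup table; names are illustrative.
public class NotificationDeduplicator {

    private final Map<String, Long> processed = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public NotificationDeduplicator(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Returns true the first time a reference number is seen within the TTL. */
    public boolean shouldProcess(String refNum, long nowMillis) {
        // Expire stale entries lazily; DynamoDB's TTL does this server-side.
        processed.entrySet().removeIf(e -> nowMillis - e.getValue() > ttlMillis);
        // putIfAbsent is atomic: the first caller records the entry and gets
        // null back; a duplicate sees the existing entry and is skipped.
        return processed.putIfAbsent(refNum, nowMillis) == null;
    }

    public static void main(String[] args) {
        NotificationDeduplicator dedup =
            new NotificationDeduplicator(24L * 60 * 60 * 1000);   // 24-hour TTL
        System.out.println(dedup.shouldProcess("TXN2025031514230", 0));  // true
        System.out.println(dedup.shouldProcess("TXN2025031514230", 5));  // false
    }
}
```

Note that this sketch also exhibits the edge case raised in the discussion questions: once an entry ages past the TTL, a late duplicate would pass the check again.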
Production Results
Volume (Monthly)
| Metric | Value |
|---|---|
| Total notifications sent | 142 million |
| Fraud alerts | 340,000 |
| Transaction postings | 128 million |
| Wire transfer confirmations | 2.1 million |
| Account change alerts | 8.4 million |
| Other (marketing, info) | 3.2 million |
Reliability
| Metric | Value |
|---|---|
| End-to-end delivery rate | 99.97% |
| Duplicate notifications (post-dedup) | 0.001% |
| Notifications lost (z/OS to cloud) | 0 |
| Notifications dropped (circuit breaker) | 0.02% (all priority 1-2) |
| AWS outage impact events (12 months) | 3 |
| Longest AWS-related delay | 47 minutes |
Customer Impact
| Metric | Before MQ Bridge | After |
|---|---|---|
| Notification delivery (p50) | N/A (no real-time) | 0.7 sec |
| Notification delivery (p95) | N/A | 1.8 sec |
| Customer complaints (notifications) | 230/month | 12/month |
| Fraud alert delivery | Next business day (email) | < 3 seconds |
| Mobile app NPS score | 32 | 61 |
The fraud alert improvement was the biggest business win. Before MQ, fraud alerts were batch-processed and emailed overnight. Customers didn't learn about suspicious activity until the next morning — by which time additional fraudulent charges had often accumulated. With real-time MQ-based alerts, customers can freeze their card within seconds of the first suspicious transaction. Fraud losses dropped 34% in the first year.
Lessons Learned
Lesson 1: MQ's data conversion is good but not sufficient
MQ handles EBCDIC-to-ASCII conversion for character data. But COMP-3 packed decimal fields, COMP binary fields, and COBOL's fixed-format record layouts need explicit handling on the cloud side. SecureFirst spent three weeks building and testing the format translation layer. Budget time for this.
Lesson 2: At-least-once means you need idempotency
"Exactly-once delivery is a distributed systems myth," Carlos says. "MQ gives you at-least-once for persistent messages, which is the strongest practical guarantee. But it means your consumer must handle duplicates. We learned this the hard way when a customer got the same $500 wire transfer notification four times and called in a panic thinking they'd been charged four times."
Lesson 3: The circuit breaker saved them during a real outage
In month 6, an AWS region had a 3-hour partial outage. The transmission queue hit 7 million messages. The circuit breaker didn't activate (threshold was 10 million), but the team watched the queue depth climb and was prepared to manually stop low-priority flows. When the channel recovered, the queue drained in 22 minutes. Without the circuit breaker design and oversized transmission queue, they would have lost messages.
Lesson 4: Monitor both sides
SecureFirst initially only monitored the z/OS side (queue depths, channel status). They learned they also needed cloud-side monitoring: consumer lag, SNS delivery failures, DynamoDB dedup table size. A complete picture requires end-to-end visibility. They now have a dashboard showing the full pipeline from MQPUT to push notification delivery.
Lesson 5: The mainframe team and the cloud team need a shared language
The biggest non-technical challenge was communication. The mainframe team spoke in terms of queue managers, channels, and syncpoints. The cloud team spoke in terms of topics, consumers, and event buses. Carlos organized a series of joint sessions where both teams walked through the architecture together. "Once the cloud team understood that MQ gives you guaranteed delivery with transactional semantics — something SNS/SQS doesn't — they stopped asking why we weren't 'just using SQS.'"
Discussion Questions
- SecureFirst masks account numbers before sending notifications through MQ. What other data masking or tokenization strategies would you recommend for financial data crossing the mainframe-to-cloud boundary?
- The circuit breaker drops low-priority notifications when the transmission queue exceeds 10 million. How would you determine the correct threshold? What factors influence this decision?
- The deduplication layer uses DynamoDB with a 24-hour TTL. What happens if a message is delayed by more than 24 hours (e.g., a very long AWS outage followed by a slow drain)? How would you handle this edge case?
- SecureFirst chose a containerized IBM MQ instance on ECS as the cloud-side queue manager. What are the alternatives (Amazon MQ, Amazon SQS, direct MQ client connections)? What are the tradeoffs of each?
- The latency budget shows that push notification delivery (APNs/FCM) adds the most latency. If the business requirement changed to "notification within 1 second," what architectural changes would be needed? Is this achievable?
- Carlos mentioned that the cloud team initially wanted to "just use SQS." Compare IBM MQ and Amazon SQS for this use case. What does MQ provide that SQS doesn't? Under what circumstances would SQS be sufficient?