Case Study 2: SecureFirst's MQ-to-Cloud Bridge for Mobile Notifications


Background

SecureFirst Retail Bank launched its mobile banking app in 2022. The app was built on AWS — API Gateway, Lambda functions, DynamoDB, SNS for push notifications. The backend of record remained on the mainframe: CICS transactions, DB2 databases, VSAM files, and 30 years of COBOL business logic.

The challenge was immediate: customers expected real-time notifications. When a transaction posted, the mobile app should buzz within seconds. When a wire transfer cleared, the customer should know. When a suspicious transaction triggered a fraud alert, the customer's phone should light up.

The problem: the mainframe's COBOL programs knew about transactions the instant they happened. The mobile app lived in AWS. Between them lay a chasm of protocols, data formats, security boundaries, and organizational silos.

Carlos Vega, SecureFirst's API architect, proposed the solution: IBM MQ as the bridge. "MQ already speaks mainframe," Carlos said. "And IBM MQ has native connectors for cloud platforms. We don't need to build the bridge — we need to configure it."

Yuki Nakamura, the DevOps lead, was skeptical. "You want to put a 30-year-old messaging product in the critical path of a mobile experience? Our users expect sub-second response times."

"MQ is 30 years old the way a steel bridge is 100 years old," Carlos replied. "It's been carrying traffic the entire time."


The Architecture

The End-to-End Flow

z/OS Mainframe                    Network              AWS Cloud
+------------------+              |              +-------------------+
| COBOL Program    |              |              | MQ Client         |
| (CICS/Batch)     |              |              | (EC2/ECS)         |
|   ↓              |              |              |   ↓               |
| MQPUT to         |              |              | Consumes from     |
| SF.NOTIFY.       |              |              | SF.NOTIFY.INBOUND |
| OUTBOUND         |              |              |   ↓               |
|   ↓              |              |              | Transform to JSON |
| SFPQM01          |   TLS/1414   |              |   ↓               |
| (z/OS QM)        |→→→→→→→→→→→→→→|→→→→→→→→→→→→→→| Route to SNS      |
|                  |  Sender ch.  |  Receiver ch.| (push notif.)     |
+------------------+              |              |   ↓               |
                                  |              | Customer's phone  |
                                  |              +-------------------+

Components

On z/OS:

  • Queue manager: SFPQM01 (SecureFirst Production Queue Manager 01)
  • Outbound notification queue: SF.NOTIFY.OUTBOUND (local queue, persistent, triggered)
  • Sender channel: SFPQM01.TO.SFCLOUD — TLS-encrypted, connected to the cloud-side MQ instance
  • Transmission queue: SFPQM01.TO.SFCLOUD.XMIT

In AWS:

  • IBM MQ on Amazon ECS (containerized queue manager): SFCLOUD01
  • Inbound queue: SF.NOTIFY.INBOUND (receives messages from z/OS)
  • MQ consumer: Java/Spring Boot application running on ECS, consuming from SF.NOTIFY.INBOUND
  • Amazon SNS: distributes push notifications to iOS (APNs) and Android (FCM)
  • Amazon CloudWatch: monitoring and alerting

On the z/OS side, the COBOL notification program:

       IDENTIFICATION DIVISION.
       PROGRAM-ID. SFNOTIFY.
      *-------------------------------------------------------*
      * SecureFirst Notification Publisher                      *
      * Publishes transaction events for mobile notifications  *
      *-------------------------------------------------------*

       DATA DIVISION.
       WORKING-STORAGE SECTION.
           COPY CMQV.
           COPY CMQODV.
           COPY CMQMDV.
           COPY CMQPMOV.

       01  WS-NOTIFY-MSG.
           05  WS-NTF-HEADER.
               10  WS-NTF-FORMAT    PIC X(8) VALUE 'SFNTFY01'.
               10  WS-NTF-VERSION   PIC 9(4) VALUE 0002.
               10  WS-NTF-TIMESTAMP PIC X(26).
               10  WS-NTF-CUST-ID   PIC X(12).
               10  WS-NTF-EVENT-TYPE PIC X(4).
                   88  NTF-TXN-POST     VALUE 'TXNP'.
                   88  NTF-WIRE-CLEAR   VALUE 'WIRC'.
                   88  NTF-FRAUD-ALERT  VALUE 'FRDA'.
                   88  NTF-ACCT-CHANGE  VALUE 'ACCH'.
                   88  NTF-BILL-DUE     VALUE 'BILD'.
               10  WS-NTF-PRIORITY   PIC 9(1).
           05  WS-NTF-BODY.
               10  WS-NTF-ACCT-NUM  PIC X(16).
               10  WS-NTF-AMOUNT    PIC S9(13)V99 COMP-3.
               10  WS-NTF-CURRENCY  PIC X(3).
               10  WS-NTF-DESC      PIC X(80).
               10  WS-NTF-MERCHANT  PIC X(40).
               10  WS-NTF-LOCATION  PIC X(40).
               10  WS-NTF-REF-NUM   PIC X(20).

       01  WS-MQ-FIELDS.
           05  WS-HCONN            PIC S9(9) COMP.
           05  WS-HOBJ             PIC S9(9) COMP.
           05  WS-COMPCODE         PIC S9(9) COMP.
           05  WS-REASON           PIC S9(9) COMP.
           05  WS-OPTIONS          PIC S9(9) COMP.
           05  WS-MSG-LENGTH       PIC S9(9) COMP.

The program is called by any CICS transaction that needs to send a notification. It receives the notification data in the COMMAREA, builds the MQ message, and puts it to the outbound queue:

       PROCEDURE DIVISION.
       0000-MAIN.
           PERFORM 1000-OPEN-QUEUE
           PERFORM 2000-BUILD-MESSAGE
           PERFORM 3000-PUT-MESSAGE
           PERFORM 4000-CLOSE-QUEUE
           EXEC CICS RETURN END-EXEC.

       2000-BUILD-MESSAGE.
           MOVE FUNCTION CURRENT-DATE
                                 TO WS-NTF-TIMESTAMP
           MOVE CA-CUSTOMER-ID   TO WS-NTF-CUST-ID
           MOVE CA-EVENT-TYPE    TO WS-NTF-EVENT-TYPE
           MOVE CA-ACCOUNT-NUM   TO WS-NTF-ACCT-NUM
           MOVE CA-AMOUNT        TO WS-NTF-AMOUNT
           MOVE CA-CURRENCY      TO WS-NTF-CURRENCY
           MOVE CA-DESCRIPTION   TO WS-NTF-DESC
           MOVE CA-MERCHANT      TO WS-NTF-MERCHANT
           MOVE CA-LOCATION      TO WS-NTF-LOCATION
           MOVE CA-REF-NUM       TO WS-NTF-REF-NUM

      *    Set priority based on event type
           EVALUATE TRUE
               WHEN NTF-FRAUD-ALERT
                   MOVE 9 TO WS-NTF-PRIORITY
                   MOVE 9 TO MQMD-PRIORITY
               WHEN NTF-WIRE-CLEAR
                   MOVE 7 TO WS-NTF-PRIORITY
                   MOVE 7 TO MQMD-PRIORITY
               WHEN NTF-TXN-POST
                   MOVE 5 TO WS-NTF-PRIORITY
                   MOVE 5 TO MQMD-PRIORITY
               WHEN OTHER
                   MOVE 3 TO WS-NTF-PRIORITY
                   MOVE 3 TO MQMD-PRIORITY
           END-EVALUATE.

       3000-PUT-MESSAGE.
           MOVE MQMT-DATAGRAM    TO MQMD-MSGTYPE
           MOVE MQPER-PERSISTENT  TO MQMD-PERSISTENCE
           MOVE MQFMT-STRING     TO MQMD-FORMAT
            MOVE 36000            TO MQMD-EXPIRY
      *    MQMD-EXPIRY is in tenths of a second:
      *    36000 tenths = 3600 seconds = 1 hour
      *    Stale notifications are worse than no notification

           COMPUTE MQPMO-OPTIONS =
               MQPMO-SYNCPOINT +
               MQPMO-NEW-MSG-ID +
               MQPMO-FAIL-IF-QUIESCING

           MOVE LENGTH OF WS-NOTIFY-MSG
                                 TO WS-MSG-LENGTH

           CALL 'MQPUT' USING WS-HCONN
                               WS-HOBJ
                               MQMD
                               MQPMO
                               WS-MSG-LENGTH
                               WS-NOTIFY-MSG
                               WS-COMPCODE
                               WS-REASON

           EVALUATE WS-COMPCODE
               WHEN MQCC-OK
                   CONTINUE
               WHEN MQCC-WARNING
                   PERFORM 8100-LOG-WARNING
               WHEN MQCC-FAILED
                   PERFORM 8000-HANDLE-MQ-ERROR
           END-EVALUATE.

The Challenges

Challenge 1: Latency Budget

Yuki's concern about latency was legitimate. The customer experience requirement was: notification within 5 seconds of the transaction posting. Here's how the latency budget broke down:

Segment                          Target       Actual (p95)
COBOL MQPUT (z/OS)               < 5 ms       1.2 ms
Transmission queue to channel    < 10 ms      3.8 ms
Network transit (z/OS → AWS)     < 50 ms      28 ms
Cloud QM receive + queue         < 10 ms      6.1 ms
Consumer MQGET + transform       < 50 ms      31 ms
SNS publish                      < 100 ms     72 ms
APNs/FCM delivery                < 2000 ms    800 ms
Total                            < 2225 ms    ~940 ms

The end-to-end latency came in well under the 5-second target. "MQ wasn't the bottleneck," Yuki admitted. "Push notification delivery was. Apple and Google add more latency than our entire mainframe-to-cloud pipeline."

Challenge 2: Message Format Translation

The mainframe sends EBCDIC-encoded fixed-format COBOL records. The cloud consumer expects UTF-8 JSON. The translation happens in the cloud-side consumer:

{
  "formatId": "SFNTFY01",
  "version": 2,
  "timestamp": "2025-03-15T14:23:07.123456Z",
  "customerId": "CUST00084721",
  "eventType": "TXNP",
  "priority": 5,
  "account": "XXXXXXXXXXXX7891",
  "amount": -47.52,
  "currency": "USD",
  "description": "PURCHASE - WHOLE FOODS MARKET",
  "merchant": "WHOLE FOODS MARKET #10234",
  "location": "AUSTIN TX",
  "referenceNumber": "TXN2025031514230"
}

The cloud consumer receives the message as character data: with the message format set to MQFMT-STRING, MQ's built-in data conversion translates EBCDIC to ASCII (on the sender channel when it specifies CONVERT(YES), or at the consumer when the MQGET requests the MQGMO-CONVERT option). The consumer then maps the fixed-format fields to JSON.

A critical design decision: the consumer must handle format versioning. When WS-NTF-VERSION is 0001, the body has a shorter layout (no merchant/location fields). When it's 0002, the full layout applies. The consumer checks the version and maps accordingly.
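As a sketch, the version-aware mapping on the consumer side might look like the following, in plain Java. The class and method names are hypothetical, the offsets are derived from the copybook shown earlier (55-byte header), and the packed-decimal amount field is skipped because it needs separate binary handling.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of version-aware field mapping for the SFNTFY01 record.
// Offsets follow the copybook shown earlier; the 8-byte COMP-3 amount
// at offset 71 is skipped here (packed decimal needs binary decoding).
// Assumes the record has already been converted from EBCDIC to a String.
public class NotificationParser {
    public static Map<String, String> parse(String rec) {
        Map<String, String> m = new LinkedHashMap<>();
        int version = Integer.parseInt(rec.substring(8, 12).trim());
        m.put("version", Integer.toString(version));
        m.put("customerId", rec.substring(38, 50).trim());
        m.put("eventType", rec.substring(50, 54).trim());
        m.put("account", rec.substring(55, 71).trim());
        m.put("currency", rec.substring(79, 82).trim());
        m.put("description", rec.substring(82, 162).trim());
        if (version >= 2) {
            // Version 2 added merchant and location to the body.
            m.put("merchant", rec.substring(162, 202).trim());
            m.put("location", rec.substring(202, 242).trim());
        }
        return m;
    }
}
```

Real code would also validate the record length before reading: a version 1 record simply ends before the merchant field, and reading past it is exactly the failure mode Carlos describes below.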

"We broke this once," Carlos says. "Version 2 added fields but the cloud consumer wasn't updated. It parsed the extra bytes as garbage and sent customers notifications with corrupted merchant names. We now have a contract test suite that validates both ends against the same message schema."

Challenge 3: Security Across Boundaries

The MQ channel between z/OS and AWS crosses a significant security boundary. SecureFirst's security requirements:

  • Encryption in transit: TLS 1.2 with AES-256 on the MQ channel. Certificates managed by the z/OS PKI and AWS Certificate Manager.
  • Mutual authentication: Both sides present certificates. The z/OS queue manager verifies the cloud queue manager's certificate, and vice versa.
  • Channel authentication records (CHLAUTH): The z/OS queue manager only accepts connections from the specific IP range of the AWS VPC.
  • Message-level security: Sensitive fields (account number, amount) are masked before leaving the mainframe. The notification contains only the last four digits of the account number. The cloud side never sees full account numbers.
      *-------------------------------------------------------*
      * Mask sensitive data before notification                 *
      *-------------------------------------------------------*
       2500-MASK-SENSITIVE-DATA.
           MOVE CA-ACCOUNT-NUM   TO WS-FULL-ACCT
           MOVE ALL 'X'          TO WS-NTF-ACCT-NUM(1:12)
           MOVE WS-FULL-ACCT(13:4)
                                 TO WS-NTF-ACCT-NUM(13:4).

The COBOL program masks the account number before the MQPUT. The full account number never leaves the mainframe. This is a defense-in-depth measure — even if the MQ channel is compromised, the attacker gets masked data.
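The same rule can be expressed in Java for the cloud team's contract tests. This is a hypothetical helper for illustration; the production masking runs in the COBOL above, before the message ever leaves z/OS.

```java
// Illustrative Java equivalent of the COBOL 2500-MASK-SENSITIVE-DATA
// paragraph: every character except the last four becomes 'X'.
// Hypothetical helper, not SecureFirst's production code.
public class AccountMasker {
    public static String mask(String accountNumber) {
        int keep = 4;
        if (accountNumber == null || accountNumber.length() <= keep) {
            return accountNumber; // defensive; real PANs are 16 digits
        }
        StringBuilder masked = new StringBuilder();
        for (int i = 0; i < accountNumber.length() - keep; i++) {
            masked.append('X');
        }
        masked.append(accountNumber.substring(accountNumber.length() - keep));
        return masked.toString();
    }
}
```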

Challenge 4: What Happens When AWS Is Down?

During the design review, Diane Okoye (visiting from Pinnacle Health as an external reviewer) asked the question that nobody wanted to answer: "What happens when AWS has an outage and the cloud queue manager is unreachable?"

The answer: messages accumulate on the z/OS transmission queue. This is the temporal decoupling benefit — the mainframe doesn't need AWS to be up. It puts messages to a local queue; MQ handles delivery when the channel is available.

But this creates a capacity planning problem. SecureFirst generates approximately 2 million notification messages per hour during peak. If AWS is down for 4 hours, that's 8 million messages on the transmission queue.

The team sized the transmission queue accordingly:

DEFINE QLOCAL(SFPQM01.TO.SFCLOUD.XMIT)  +
       USAGE(XMITQ)  +
       DEFPSIST(YES)  +
       MAXDEPTH(50000000)  +
       MAXMSGL(10000)  +
       STGCLASS(MQSTG01)

A MAXDEPTH of 50 million messages, with each message averaging 300 bytes, requires approximately 15 GB of page set storage. SecureFirst allocated 25 GB. "Over-provisioning storage is cheap," Yuki says. "Losing messages because a queue filled up is not."

They also implemented a circuit breaker pattern:

  • If the transmission queue depth exceeds 10 million (roughly 5 hours of accumulation), the COBOL program, which checks the depth before each put, begins dropping low-priority notifications (priority 1-2: marketing and informational).
  • Priority 7+ notifications (fraud alerts, wire transfers) are never dropped.
  • When the queue depth returns below 5 million, normal operation resumes.
       2800-CHECK-CIRCUIT-BREAKER.
      *-------------------------------------------------------*
      * Circuit breaker: check xmit queue depth                *
      *-------------------------------------------------------*
           MOVE 'SFPQM01.TO.SFCLOUD.XMIT'
                                 TO MQOD-OBJECTNAME
           COMPUTE WS-OPTIONS = MQOO-INQUIRE
                               + MQOO-FAIL-IF-QUIESCING

           CALL 'MQOPEN' USING WS-HCONN
                                MQOD
                                WS-OPTIONS
                                WS-HOBJ-INQ
                                WS-COMPCODE
                                WS-REASON

           IF WS-COMPCODE = MQCC-OK
               MOVE MQIA-CURRENT-Q-DEPTH
                                 TO WS-SELECTOR(1)
               MOVE 1            TO WS-SELECTOR-COUNT
               CALL 'MQINQ' USING WS-HCONN
                                   WS-HOBJ-INQ
                                   WS-SELECTOR-COUNT
                                   WS-SELECTOR
                                   WS-INT-ATTR-COUNT
                                   WS-INT-ATTRS
                                   WS-CHAR-ATTR-LENGTH
                                   WS-CHAR-ATTRS
                                   WS-COMPCODE
                                   WS-REASON
               MOVE WS-INT-ATTRS(1) TO WS-XMIT-DEPTH
           END-IF

           CALL 'MQCLOSE' USING WS-HCONN
                                 WS-HOBJ-INQ
                                 MQCO-NONE
                                 WS-COMPCODE
                                 WS-REASON

           IF WS-XMIT-DEPTH > 10000000
               SET WS-CIRCUIT-OPEN TO TRUE
           ELSE IF WS-XMIT-DEPTH < 5000000
               SET WS-CIRCUIT-CLOSED TO TRUE
           END-IF.
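Note the hysteresis in that depth check: the breaker opens above 10 million, closes only below 5 million, and holds its previous state in the band between, so it doesn't flap while the queue hovers near a single threshold. The same state machine, sketched in Java with hypothetical names:

```java
// Sketch of the circuit-breaker hysteresis: open above OPEN_THRESHOLD,
// close below CLOSE_THRESHOLD, hold state in the band between them.
// Thresholds follow the case study; names are illustrative.
public class XmitCircuitBreaker {
    static final long OPEN_THRESHOLD = 10_000_000L;
    static final long CLOSE_THRESHOLD = 5_000_000L;
    private boolean open = false;

    // Called with the current transmission queue depth (from MQINQ).
    public void update(long queueDepth) {
        if (queueDepth > OPEN_THRESHOLD) {
            open = true;          // start shedding low-priority messages
        } else if (queueDepth < CLOSE_THRESHOLD) {
            open = false;         // resume normal operation
        }                         // in between: keep previous state
    }

    // Low-priority (1-2) notifications are dropped while open;
    // fraud alerts and wires (priority 7+) are always sent.
    public boolean shouldDrop(int priority) {
        return open && priority <= 2;
    }
}
```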

Challenge 5: Notification Deduplication

A subtle problem emerged in production: when the MQ channel between z/OS and AWS went down and recovered, some messages were delivered twice. MQ guarantees at-least-once delivery for persistent messages, not exactly-once. During channel recovery, messages that were in flight might be re-sent.

The cloud consumer had to implement idempotency. Each notification message includes WS-NTF-REF-NUM — a unique reference number generated by the mainframe. The cloud consumer maintains a DynamoDB table of recently processed reference numbers (TTL: 24 hours). Before sending a push notification, it checks whether the reference number has already been processed.

This deduplication cost added 3-5 ms to the cloud-side processing — negligible in the context of the overall latency budget, but it required a DynamoDB table, TTL management, and another failure mode to handle (what if DynamoDB is unavailable?).
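A minimal sketch of that idempotency check, with an in-memory map standing in for the DynamoDB table (the class name and the explicit expiry sweep are illustrative; in production DynamoDB's native TTL handles expiry, and real code must also decide what to do when the store is unreachable):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the reference-number dedup check. A ConcurrentHashMap with
// stored timestamps stands in for the DynamoDB table with a 24-hour TTL.
public class NotificationDeduplicator {
    private static final long TTL_MILLIS = 24L * 60 * 60 * 1000;
    private final Map<String, Long> seen = new ConcurrentHashMap<>();

    // Returns true if this reference number has not been processed yet
    // (and records it); false if it is a duplicate inside the TTL window.
    public boolean markIfNew(String refNum, long nowMillis) {
        // Drop entries older than the TTL, mimicking DynamoDB TTL expiry.
        seen.entrySet().removeIf(e -> nowMillis - e.getValue() > TTL_MILLIS);
        return seen.putIfAbsent(refNum, nowMillis) == null;
    }
}
```

The TTL choice is exactly the edge case raised in the discussion questions: a message delayed past the window is treated as new again.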


Production Results

Volume (Monthly)

Metric                           Value
Total notifications sent         142 million
Fraud alerts                     340,000
Transaction postings             128 million
Wire transfer confirmations      2.1 million
Account change alerts            8.4 million
Other (marketing, info)          3.2 million

Reliability

Metric                                   Value
End-to-end delivery rate                 99.97%
Duplicate notifications (post-dedup)     0.001%
Notifications lost (z/OS to cloud)       0
Notifications dropped (circuit breaker)  0.02% (all priority 1-2)
AWS outage impact events (12 months)     3
Longest AWS-related delay                47 minutes

Customer Impact

Metric                               Before MQ Bridge           After
Notification delivery (p50)          N/A (no real-time)         0.7 sec
Notification delivery (p95)          N/A                        1.8 sec
Customer complaints (notifications)  230/month                  12/month
Fraud alert delivery                 Next business day (email)  < 3 seconds
Mobile app NPS score                 32                         61

The fraud alert improvement was the biggest business win. Before MQ, fraud alerts were batch-processed and emailed overnight. Customers didn't learn about suspicious activity until the next morning — by which time additional fraudulent charges had often accumulated. With real-time MQ-based alerts, customers can freeze their card within seconds of the first suspicious transaction. Fraud losses dropped 34% in the first year.


Lessons Learned

Lesson 1: MQ's data conversion is good but not sufficient

MQ handles EBCDIC-to-ASCII conversion for character data. But COMP-3 packed decimal fields, COMP binary fields, and COBOL's fixed-format record layouts need explicit handling on the cloud side. SecureFirst spent three weeks building and testing the format translation layer. Budget time for this.
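For example, the S9(13)V99 COMP-3 amount arrives as 8 bytes of packed decimal: two decimal digits per byte, with the final nibble carrying the sign. A sketch of the decode, standard library only (the class name is illustrative, not SecureFirst's translation layer):

```java
import java.math.BigDecimal;

// Sketch: decode a COBOL COMP-3 (packed decimal) field into a BigDecimal.
// Each nibble holds one decimal digit except the last, which is the sign
// (0xD = negative; 0xC or 0xF = positive). `scale` is the number of
// implied decimal places (2 for S9(13)V99).
public class PackedDecimal {
    public static BigDecimal decode(byte[] bytes, int scale) {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            int hi = (bytes[i] >> 4) & 0x0F;
            int lo = bytes[i] & 0x0F;
            digits.append(hi);
            if (i < bytes.length - 1) {
                digits.append(lo);      // ordinary digit nibble
            } else if (lo == 0x0D) {
                digits.insert(0, '-');  // last nibble is the sign
            }
        }
        return new BigDecimal(digits.toString()).movePointLeft(scale);
    }
}
```

The key point for planning: none of this comes for free from MQ's character conversion; every COMP/COMP-3 field in the record needs code like this on the consumer side.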

Lesson 2: At-least-once means you need idempotency

"Exactly-once delivery is a distributed systems myth," Carlos says. "MQ gives you at-least-once for persistent messages, which is the strongest practical guarantee. But it means your consumer must handle duplicates. We learned this the hard way when a customer got the same $500 wire transfer notification four times and called in a panic thinking they'd been charged four times."

Lesson 3: The circuit breaker design paid off during a real outage

In month 6, an AWS region had a 3-hour partial outage. The transmission queue hit 7 million messages. The circuit breaker didn't activate (the threshold was 10 million), but the team watched the queue depth climb and was prepared to manually stop low-priority flows. When the channel recovered, the queue drained in 22 minutes. Without the oversized transmission queue they would have lost messages, and the circuit breaker stood ready as the backstop had the outage run longer.

Lesson 4: Monitor both sides

SecureFirst initially only monitored the z/OS side (queue depths, channel status). They learned they also needed cloud-side monitoring: consumer lag, SNS delivery failures, DynamoDB dedup table size. A complete picture requires end-to-end visibility. They now have a dashboard showing the full pipeline from MQPUT to push notification delivery.

Lesson 5: The mainframe team and the cloud team need a shared language

The biggest non-technical challenge was communication. The mainframe team spoke in terms of queue managers, channels, and syncpoints. The cloud team spoke in terms of topics, consumers, and event buses. Carlos organized a series of joint sessions where both teams walked through the architecture together. "Once the cloud team understood that MQ gives you guaranteed delivery with transactional semantics — something SNS/SQS doesn't — they stopped asking why we weren't 'just using SQS.'"


Discussion Questions

  1. SecureFirst masks account numbers before sending notifications through MQ. What other data masking or tokenization strategies would you recommend for financial data crossing the mainframe-to-cloud boundary?

  2. The circuit breaker drops low-priority notifications when the transmission queue exceeds 10 million. How would you determine the correct threshold? What factors influence this decision?

  3. The deduplication layer uses DynamoDB with a 24-hour TTL. What happens if a message is delayed by more than 24 hours (e.g., a very long AWS outage followed by a slow drain)? How would you handle this edge case?

  4. SecureFirst chose a containerized IBM MQ instance on ECS as the cloud-side queue manager. What are the alternatives (Amazon MQ, Amazon SQS, direct MQ client connections)? What are the tradeoffs of each?

  5. The latency budget shows that push notification delivery (APNs/FCM) adds the most latency. If the business requirement changed to "notification within 1 second," what architectural changes would be needed? Is this achievable?

  6. Carlos mentioned that the cloud team initially wanted to "just use SQS." Compare IBM MQ and Amazon SQS for this use case. What does MQ provide that SQS doesn't? Under what circumstances would SQS be sufficient?