Case Study 2: SecureFirst's Production Hybrid — Two Years In

Background

SecureFirst Retail Bank's hybrid architecture has been in production for two years. What started as "Project Velocity" — the mobile-first initiative described in Chapter 1's case study — has evolved into a permanent hybrid architecture that serves 3 million customers through a mobile app backed by COBOL core banking on z/OS.

This case study is different from the CNB vision case study. CNB is a Tier-1 bank with a four-LPAR Parallel Sysplex, a $180M integration budget, and a dedicated 12-person integration squad. SecureFirst is a mid-size retail bank with a single LPAR, a $2M annual modernization budget, and a combined team of 11 people (5 mainframe, 6 cloud). SecureFirst's story is about doing hybrid right with limited resources — and the production incidents that taught them what "right" actually means.

Yuki Nakamura (DevOps lead) and Carlos Vega (mobile API architect) are the central characters. Two years ago, Carlos didn't know what CICS was. Today, he reads CICS trace output to debug API latency issues. That transformation — a cloud engineer learning to navigate the mainframe — is itself a case study in the organizational change that hybrid demands.

The Architecture (Deployed March 2024)

SecureFirst's hybrid architecture is a simplified version of the CNB reference architecture, adapted for their scale:

┌──────────────────────────── CLOUD (AWS) ────────────────────────────┐
│                                                                      │
│  ┌──────────┐   ┌───────────────┐   ┌──────────────┐               │
│  │  Mobile   │──▶│  AWS API       │──▶│  Cloud        │              │
│  │  App (iOS │   │  Gateway       │   │  Microservices│              │
│  │  Android) │   │  (Auth, Rate   │   │  (ECS Fargate)│              │
│  └──────────┘   │   Limit, Route)│   │  - Notif.     │              │
│                  └──────┬────────┘   │  - Prefs      │              │
│                         │            │  - Statements  │              │
│                         │            └───────┬────────┘              │
│                         │                    │                       │
│                  ┌──────▼────────┐   ┌───────▼────────┐             │
│                  │  z/OS Connect │   │  PostgreSQL    │             │
│                  │  (via VPN)    │   │  (CDC replica) │             │
│                  └──────┬────────┘   └───────▲────────┘             │
│                         │                    │                       │
└─────────────────────────┼────────────────────┼───────────────────────┘
                          │                    │
                   ┌──────▼────────────────────┼──────┐
                   │          z/OS (Single LPAR)       │
                   │                                    │
                   │  ┌────────┐  ┌────────┐  ┌──────┐│
                   │  │CICS TS │  │ DB2 12 │  │ MQ   ││
                   │  │(2 AORs │  │(single │  │(QM)  ││
                   │  │ 1 TOR) │  │instance│  │      ││
                   │  └────────┘  └────────┘  └──────┘│
                   │                                    │
                   │  CDC: IBM InfoSphere Data Rep.     │
                   │  → MQ → AWS MQ → PostgreSQL       │
                   │                                    │
                   └────────────────────────────────────┘

Key differences from CNB:

| Dimension | CNB | SecureFirst | Why Different |
|---|---|---|---|
| z/OS topology | 4-LPAR Parallel Sysplex | Single LPAR | Scale: 500M vs. 3M txns/day |
| API gateway | Kong (self-managed) | AWS API Gateway (managed) | Budget: SecureFirst can't staff a gateway team |
| Event mesh | MQ + Kafka + custom connector | MQ → Amazon MQ → SQS/SNS | Simplicity: managed services reduce operational burden |
| CDC target | Cloud data warehouse (Snowflake) | PostgreSQL on RDS | Scale: SecureFirst's data volume fits in a relational DB |
| Integration team | 12-person dedicated squad | Yuki + Carlos (2 people, part-time) | Budget: SecureFirst can't afford a dedicated squad |
| Identity bridge | IBM ISAM on z/OS | Custom — AWS Cognito + RACF PassTicket generation | Cost: ISAM licensing exceeded SecureFirst's budget |

Year 1: The Launch and the Learning (2024)

The Good

The mobile app launched in March 2024 with three API-backed features: balance inquiry, transaction history, and funds transfer. All three called CICS programs through z/OS Connect.

Carlos's initial skepticism about COBOL — "it's a 60-year-old language" — transformed during the load test. "We simulated 5,000 concurrent users hitting the balance inquiry API. The CICS program responded in 12 milliseconds. Twelve. My Spring Boot microservice that does a SELECT from PostgreSQL takes 45 milliseconds on a good day. I stopped making jokes about COBOL after that load test."

The CDC pipeline — using IBM InfoSphere Data Replication to read the DB2 log, publish changes to an MQ queue, bridge to Amazon MQ via a VPN-connected MQ channel, and apply to PostgreSQL on RDS — was operational within 8 weeks. Average replication lag: 15 seconds. The mobile app's "recent transactions" view reads from PostgreSQL (eventual consistency), while the "current balance" reads from the mainframe API (strong consistency).
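The consistency split described above can be stated as an explicit routing rule. A minimal Python sketch, not SecureFirst's actual code — and the fall-back-when-stale behavior is our assumption, not something the case study describes:

```python
def route_read(query: str, replica_lag_seconds: float,
               sla_seconds: float = 60.0) -> str:
    """Route a read to the data source matching its consistency need."""
    if query == "current_balance":
        return "mainframe_api"      # strong consistency: read through to CICS
    if replica_lag_seconds > sla_seconds:
        return "mainframe_api"      # assumed fallback: replica is too stale
    return "postgres_replica"       # eventual consistency is acceptable
```

With the Year 1 average lag of 15 seconds, `route_read("recent_transactions", 15)` would return `"postgres_replica"`, keeping read traffic off the mainframe.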

The Bad: Incident #1 — The VPN Tunnel Collapse (June 2024)

At 2:14 AM on a Thursday, the AWS site-to-site VPN connection between SecureFirst's cloud VPC and their on-premises z/OS LPAR went down. The VPN provider's BGP session timed out due to a router firmware bug.

Impact: Every mobile app API call that required mainframe data failed. Balance inquiry: down. Funds transfer: down. Transaction history from PostgreSQL: still working (data was replicated before the outage, now going stale).

Duration: 3 hours and 47 minutes.

Root cause: Single VPN tunnel with no redundancy. The architecture had a single point of failure in the network layer between cloud and mainframe.

Yuki's post-incident analysis: "We had redundancy in every layer except the one that mattered most — the network link between platforms. We had two CICS AORs for application redundancy. We had RDS Multi-AZ for database redundancy. But one VPN tunnel. That's like building a bridge with two lanes and one support column."

Remediation:

1. Added a second VPN tunnel through a different provider and a different physical path.
2. Implemented health-check-based failover between tunnels (30-second failover time).
3. Added a circuit breaker in the API gateway: when the mainframe is unreachable, return cached data (balance as of last known good) with a "data may be stale" flag rather than a hard error. This degrades gracefully instead of failing completely.
4. Added "mainframe connectivity" as a golden signal in the monitoring dashboard with 1-minute alerting.
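The stale-data fallback in step 3 is a standard circuit-breaker pattern. A minimal Python sketch of the idea — the class, method names, and thresholds here are illustrative, not SecureFirst's gateway configuration:

```python
import time

class MainframeCircuitBreaker:
    """Fall back to the last known good value, flagged as stale,
    when calls to the mainframe start failing."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after   # seconds before retrying
        self.failures = 0
        self.opened_at = None
        self.cache = {}                  # account_id -> last known good balance

    def call(self, account_id, fetch_balance):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return self._stale(account_id)   # circuit open: serve cache
            self.opened_at = None                # half-open: try one request
        try:
            balance = fetch_balance(account_id)
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return self._stale(account_id)
        self.failures = 0
        self.cache[account_id] = balance
        return {"balance": balance, "stale": False}

    def _stale(self, account_id):
        if account_id in self.cache:
            return {"balance": self.cache[account_id], "stale": True}
        raise ConnectionError("mainframe unreachable and no cached data")
```

During a tunnel outage like Incident #1, balance inquiries would return the cached value with `"stale": True` instead of a hard error.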

Cost of incident: Estimated $180K (customer service costs, regulatory notification, engineering time for remediation). Cost of redundant VPN: $24K/year. "The math was obvious in hindsight," Yuki says. "We should have spent the $24K on day one."

The Ugly: Incident #2 — The Packed Decimal Precision Bug (September 2024)

A customer reported that their balance displayed in the mobile app was $0.01 different from the balance shown at the ATM. Investigation revealed that the discrepancy existed for 12% of accounts.

Root cause: The z/OS Connect JSON transformation was converting COBOL packed decimal amounts (PIC S9(9)V99 COMP-3) to JSON numbers (IEEE 754 double-precision floating point). For most amounts, the conversion was exact. But for amounts that cannot be exactly represented in binary floating point — like $1234.56, which in binary is 1234.5599999999999... — the JSON number had a rounding error of $0.01.

The ATM system, which reads DB2 directly via a CICS transaction (no JSON conversion), displayed the exact packed decimal value. The mobile app, which received the JSON number, displayed the rounded value.

Carlos's reaction: "I've been writing financial software for six years and I've never had a penny rounding error. Because I've always worked with BigDecimal in Java. But JSON doesn't have BigDecimal — it has IEEE 754 doubles. I didn't even think about it when I designed the API schema. And z/OS Connect's default JSON conversion uses doubles. That's technically correct per the JSON spec, but it's wrong for banking."
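Carlos's point is easy to reproduce. A few lines of Python (illustrative, not SecureFirst's code) show why IEEE 754 doubles are unsafe for money and exact decimal types are not:

```python
from decimal import Decimal

# 1234.56 has no exact binary representation: the nearest double is
# slightly off, which is invisible at 2 decimal places but can surface
# as a penny error after arithmetic or reformatting.
assert Decimal(1234.56) != Decimal("1234.56")

# Accumulating binary-rounded cents drifts away from the exact total...
assert sum([0.01] * 100) != 1.0
# ...while exact decimal arithmetic does not.
assert sum([Decimal("0.01")] * 100) == Decimal("1.00")
```

`Decimal(1234.56)` captures the exact binary value the double actually holds, which is how the mismatch with the packed decimal source value becomes visible.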

Remediation:

1. Changed all monetary values in the API schema from JSON number type to JSON string type with a documented precision contract: "balance": "1234.56" (string, always 2 decimal places, no floating-point representation).
2. Updated the z/OS Connect service mapping to emit monetary values as strings.
3. Updated all mobile app and cloud microservice parsers to handle monetary values as strings, converting to language-appropriate decimal types (BigDecimal in Java/Kotlin, Decimal in Python).
4. Added a reconciliation check: a nightly batch compares the PostgreSQL CDC replica balances against DB2 source balances, alerting on any discrepancy greater than $0.00.
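On the consuming side, the string-based contract from step 1 looks like this. A Python sketch with an illustrative payload:

```python
import json
from decimal import Decimal

# Monetary values travel as strings under a fixed 2-decimal contract.
payload = '{"balance": "1234.56"}'
balance = Decimal(json.loads(payload)["balance"])
assert balance == Decimal("1234.56")   # exact: no IEEE 754 rounding

# Defense in depth for any legacy payload still using JSON numbers:
# have the parser build Decimals instead of doubles.
legacy = json.loads('{"balance": 1234.56}', parse_float=Decimal)
assert legacy["balance"] == Decimal("1234.56")
```

The `parse_float=Decimal` hook means the literal text of the JSON number is handed to `Decimal` directly, so the value never passes through a double at all.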

Yuki's lesson: "This is exactly the kind of bug the anti-corruption layer is supposed to prevent. Our ACL — z/OS Connect in this case — was doing the conversion wrong. Not wrong per the JSON spec, but wrong per our business requirements. The ACL needs to understand the business semantics of the data it's converting, not just the technical format."

Year 2: Maturation and Expansion (2025)

Expanding the API Surface

By January 2025, SecureFirst had 14 CICS transactions exposed as REST APIs:

| API Endpoint | CICS Program | Avg Response | Daily Volume |
|---|---|---|---|
| GET /accounts/{id}/balance | SFBAL01 | 18ms | 2.1M |
| GET /accounts/{id}/transactions | SFTXN01 | 32ms | 1.8M |
| POST /transfers | SFXFR01 | 45ms | 340K |
| POST /bills/pay | SFBIL01 | 52ms | 180K |
| GET /loans/{id}/status | SFLNS01 | 22ms | 95K |
| POST /loans/apply | SFLNA01 | 120ms | 12K |
| GET /accounts/{id}/statements | SFSTM01 | 85ms | 450K |
| POST /accounts/open | SFACC01 | 210ms | 8K |
| GET /cards/{id}/details | SFCRD01 | 25ms | 380K |
| POST /cards/{id}/lock | SFCLK01 | 38ms | 2.5K |
| GET /rates/current | SFRAT01 | 8ms | 520K |
| POST /auth/verify | SFAUT01 | 15ms | 4.2M |
| GET /alerts/settings | SFALR01 | 12ms | 180K |
| POST /disputes/file | SFDSP01 | 280ms | 1.2K |

Total daily API volume: approximately 10 million requests, of which 60% hit the mainframe and 40% are served from cloud caches or cloud-native services.

The Saga Implementation (Loan Origination)

In Q1 2025, SecureFirst implemented their first cross-platform saga: loan origination. The saga spans six steps across both platforms:

Saga: LOAN_ORIGINATION
Orchestrator: AWS Step Functions

Step 1 (Mainframe): Create loan application → SFLNA01 via API
  Compensate: Cancel loan application → SFLNC01 via API

Step 2 (Cloud): Run credit check → Equifax API
  Compensate: N/A (read-only, no state change)

Step 3 (Mainframe): Calculate terms → SFLNT01 via API
  Compensate: N/A (read-only, no state change)

Step 4 (Mainframe): Create loan account → SFLNF01 via API
  Compensate: Close loan account → SFLND01 via API

Step 5 (Cloud): Send approval notification → SNS + SES
  Compensate: Send cancellation notification

Step 6 (Cloud): Enable loan dashboard → PostgreSQL + S3
  Compensate: Disable loan dashboard

Idempotency implementation: Each mainframe API call includes an X-Request-ID header (a UUID generated by the saga orchestrator). The COBOL programs check a DB2 control table (SFREQUEST_LOG) for the request ID before executing. If the request ID exists, the program returns the cached response without re-executing. This makes every step safely retriable.
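The SFREQUEST_LOG pattern can be sketched in a few lines of Python, using SQLite as a stand-in for DB2 (the real check lives in the COBOL programs; this only illustrates the control-table logic):

```python
import sqlite3

# In-memory stand-in for the DB2 control table.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE SFREQUEST_LOG (
    REQUEST_ID TEXT PRIMARY KEY,
    RESPONSE   TEXT NOT NULL)""")

def handle_request(request_id, execute):
    """Run the business logic at most once per X-Request-ID."""
    row = conn.execute(
        "SELECT RESPONSE FROM SFREQUEST_LOG WHERE REQUEST_ID = ?",
        (request_id,)).fetchone()
    if row:
        return row[0]                 # duplicate: return cached response
    response = execute()              # first delivery: do the work
    conn.execute(
        "INSERT INTO SFREQUEST_LOG (REQUEST_ID, RESPONSE) VALUES (?, ?)",
        (request_id, response))
    conn.commit()
    return response
```

A retried saga step presents the same UUID, hits the primary-key lookup, and gets the original response back — the transfer or account creation never runs twice.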

Production statistics (first 3 months):

- 36,000 loan applications processed through the saga
- 34,800 completed successfully (96.7%)
- 1,040 rejected at the credit check step (2.9%) — normal business rejection, no compensation needed
- 140 failed at Step 4 (0.4%) — DB2 constraint violations; compensating transaction executed for Step 1
- 20 failed at Step 5/6 (0.06%) — cloud service failures; compensating transactions executed for Steps 4 and 1
- 0 unresolvable saga failures requiring manual intervention
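The compensation behavior behind those numbers — run forward, unwind completed steps in reverse, skip read-only steps — can be sketched compactly. This is a simplification of the Step Functions state machine, not its actual definition:

```python
def run_saga(steps):
    """Each step is (name, action, compensate); compensate is None
    for read-only steps. On failure, compensate completed steps in
    reverse order (LIFO), skipping the read-only ones."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            completed.append((name, compensate))
        except Exception:
            for done_name, comp in reversed(completed):
                if comp is not None:
                    comp()
            return "compensated"
    return "completed"
```

A failure at the notification step unwinds Step 4 and then Step 1 — exactly the "Steps 4 and 1" ordering in the statistics above — while the read-only credit check and terms calculation need no undo.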

Carlos designed the saga: "The Step Functions state machine gives us complete visibility. Every loan application has a trace showing exactly what happened at each step — the request sent, the response received, the timestamp, and for failures, the compensating transaction that was executed. When the compliance team asked 'how do you ensure every failed application is properly unwound?', I showed them the Step Functions execution history. They were satisfied."

The CDC Evolution

The original CDC pipeline (InfoSphere → MQ → Amazon MQ → PostgreSQL) handled 4 tables. By Year 2, it handles 18 tables with significantly higher volume:

| Metric | Year 1 | Year 2 | Change |
|---|---|---|---|
| Tables replicated | 4 | 18 | +350% |
| Daily change events | 2.1M | 8.7M | +314% |
| Average replication lag | 15 sec | 22 sec | +47% |
| P99 replication lag | 45 sec | 68 sec | +51% |
| CDC-related incidents | 3 | 1 | -67% |

The lag increase worried Yuki: "We're approaching our 60-second SLA at the 99th percentile. The root cause is MQ channel throughput — the VPN-based MQ channel between z/OS and Amazon MQ is the bottleneck. We're evaluating two options: (1) increase MQ channel batch size to reduce per-message overhead, or (2) add a second MQ channel pair for high-volume tables."

Monitoring Evolution

SecureFirst's monitoring evolved from two separate dashboards (OMEGAMON for z/OS, CloudWatch for AWS) to a unified Grafana dashboard pulling from both sources:

z/OS metrics (collected via OMEGAMON → Prometheus exporter):

- CICS transaction response time (per-transaction-ID)
- CICS task count and MAXT utilization
- DB2 buffer pool hit ratio, thread utilization, lock wait time
- MQ queue depth, channel status, message throughput
- CPU utilization, paging rate

Cloud metrics (collected via CloudWatch → Prometheus):

- API Gateway latency (per-endpoint, p50/p95/p99)
- ECS task health, CPU, memory
- PostgreSQL connections, query latency, replication lag
- Amazon MQ queue depth, consumer lag

Cross-platform metrics (calculated):

- End-to-end transaction latency (API Gateway response time minus CICS response time = integration layer overhead)
- CDC replication lag (DB2 commit timestamp minus PostgreSQL apply timestamp)
- VPN tunnel health and latency
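Both calculated metrics reduce to simple arithmetic over paired samples. A sketch (function names are illustrative; the percentile uses the nearest-rank method, which may differ from what Grafana computes):

```python
import math

def integration_overhead_ms(gateway_ms: float, cics_ms: float) -> float:
    """API Gateway response time minus CICS response time: the cost
    of z/OS Connect, the VPN, and everything else in between."""
    return gateway_ms - cics_ms

def percentile(samples, p):
    """Nearest-rank percentile, e.g. p=99 for the P99 lag panel."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]
```

For the balance API, a 45 ms gateway latency against an 18 ms CICS response attributes 27 ms to the integration layer — the number that tells Yuki and Carlos whether to look at the mainframe or at everything around it.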

The Grafana dashboard has four panels: a top-level health summary (green/yellow/red per business service), a latency breakdown panel (showing where time is spent in each layer), a volume panel (requests per second across both platforms), and an alerts panel (active and recently resolved alerts).

"The single dashboard changed how we think about incidents," Yuki says. "Before, when a customer complained about slow transfers, the mainframe person checked CICS and said 'fine', the cloud person checked the API gateway and said 'fine', and nobody looked at the MQ channel between them. Now we see the whole path on one screen. When the transfer latency spike happened last month, we identified the MQ channel backlog in under 3 minutes because it was right there on the dashboard between the CICS panel and the API gateway panel."

The Organizational Journey

Carlos's Transformation

Carlos Vega joined SecureFirst knowing nothing about mainframes. Two years later, he is the de facto hybrid architect — the person both the mainframe team and the cloud team call when something doesn't work.

His learning path:

- Months 1-3: "I thought CICS was just another REST endpoint. I designed API schemas without understanding COMMAREA layouts. The packed decimal bug was my wake-up call."
- Months 4-6: "I learned to read COBOL source code. Not write it — I'll leave that to the experts. But read it well enough to understand what the program does, what data it expects, and how it handles errors. This changed everything about how I designed API contracts."
- Months 7-12: "I learned to read CICS traces and DB2 EXPLAIN output. When the loan origination API was slow, I could look at the CICS trace and see that the COBOL program was doing a tablespace scan because a new index hadn't been created. I filed the request to Lisa — I mean, to our DBA — with the EXPLAIN output attached. She was impressed."
- Months 13-24: "I started thinking about the mainframe's constraints as design parameters, not limitations. The rate limiter in the API gateway isn't because the mainframe is weak — it's because the mainframe has finite, precious resources optimized for transaction processing, and my job is to protect them from cloud traffic patterns that the mainframe was never designed to handle."

Yuki's Perspective on Team Evolution

Yuki Nakamura came from cloud DevOps and initially viewed the mainframe as an obstacle. Her perspective shifted:

"The mainframe team taught me what 'production discipline' really means. In cloud, we deploy 10 times a day and roll back if something breaks. The mainframe team deploys once a week after extensive testing because a failed deployment can take down a system processing $100M in daily transactions. Both approaches are valid for their context. Hybrid means learning when to move fast and when to move carefully — and where the boundary between those two modes lives."

The team structure evolved organically:

| Role | Year 1 | Year 2 |
|---|---|---|
| Yuki Nakamura | DevOps lead (cloud focus) | Hybrid operations lead (both platforms) |
| Carlos Vega | Mobile API architect (cloud only) | Hybrid architect (integration layer owner) |
| Janet Kim (mainframe) | COBOL developer | COBOL developer + z/OS Connect administrator |
| David Park (mainframe) | CICS systems programmer | CICS SP + monitoring integration lead |
| Maria Santos (cloud) | Cloud developer | Cloud developer + CDC pipeline owner |

"We don't have a formal integration squad," Yuki says. "We can't afford one. Instead, Carlos and I are the integration function. We own the API gateway configuration, the z/OS Connect mappings, the CDC pipeline monitoring, and the saga orchestrator. It's a lot for two people, but it works because we've automated everything we can and we've documented everything we can't automate."

Cost Analysis: Two Years of Hybrid

| Category | Year 1 | Year 2 | Notes |
|---|---|---|---|
| z/OS operating cost | $1.8M | $1.7M | MIPS reduced 6% through API-driven offloading |
| AWS operating cost | $420K | $580K | Increased as more cloud services deployed |
| z/OS Connect licensing | $85K | $85K | Annual license |
| CDC licensing | $60K | $60K | Annual license |
| VPN (dual tunnel) | $24K | $24K | Added after Incident #1 |
| Staff (mainframe) | $650K | $650K | 5 FTEs, no change |
| Staff (cloud) | $780K | $780K | 6 FTEs, no change |
| Training | $45K | $30K | Cross-training slowed in Year 2 |
| Total | $3.86M | $3.91M | +1.3% YoY |

For comparison, the pre-hybrid cost (2023) was $3.2M — entirely mainframe. The hybrid architecture costs $700K/year more, but delivers mobile banking (previously impossible), analytics capability (previously batch-only), and architectural resilience (reduced single-platform risk). The mobile banking app generates an estimated $1.2M/year in reduced branch transaction costs and new customer acquisition.

ROI calculation: $1.2M benefit - $700K incremental cost = $500K net annual benefit, achieved in 18 months. The board approved Year 3 funding unanimously.

Discussion Questions

  1. SecureFirst's hybrid architecture uses AWS managed services (API Gateway, Amazon MQ, ECS Fargate) where CNB uses self-managed equivalents (Kong, Kafka, Kubernetes). What are the tradeoffs? At what organizational size does the transition from managed to self-managed make sense?

  2. The packed decimal precision bug (Incident #2) is a data type translation error in the ACL. How could this have been caught before production? Design a testing strategy specifically for ACL data type translation.

  3. Carlos evolved from "CICS is a black box" to reading CICS traces in 12 months. What organizational conditions enabled this transformation? Could this be replicated at an organization where mainframe and cloud teams are in separate buildings or separate cities?

  4. SecureFirst's CDC replication lag is increasing (15 sec → 22 sec avg, 45 sec → 68 sec p99) as they add more tables. Using the CDC patterns from Section 37.3, design a remediation plan that keeps lag under the 60-second SLA as they continue to expand replication.

  5. The dual-VPN remediation after Incident #1 provides network redundancy, but the architecture still has a single LPAR. What other single points of failure exist in SecureFirst's architecture, and how would you address them within their $2M annual budget constraint?

  6. Yuki says "We don't have a formal integration squad — Carlos and I are the integration function." This works at SecureFirst's scale, but identify the risks. What happens if Carlos or Yuki leaves? Design a knowledge transfer and bus-factor mitigation plan.