Case Study 2: SecureFirst's Production Hybrid — Two Years In
Background
SecureFirst Retail Bank's hybrid architecture has been in production for two years. What started as "Project Velocity" — the mobile-first initiative described in Chapter 1's case study — has evolved into a permanent hybrid architecture that serves 3 million customers through a mobile app backed by COBOL core banking on z/OS.
This case study is different from the CNB vision case study. CNB is a Tier-1 bank with a four-LPAR Parallel Sysplex, a $180M integration budget, and a dedicated 12-person integration squad. SecureFirst is a mid-size retail bank with a single LPAR, a $2M annual modernization budget, and a combined team of 11 people (5 mainframe, 6 cloud). SecureFirst's story is about doing hybrid right with limited resources — and the production incidents that taught them what "right" actually means.
Yuki Nakamura (DevOps lead) and Carlos Vega (mobile API architect) are the central characters. Two years ago, Carlos didn't know what CICS was. Today, he reads CICS trace output to debug API latency issues. That transformation — a cloud engineer learning to navigate the mainframe — is itself a case study in the organizational change that hybrid demands.
The Architecture (Deployed March 2024)
SecureFirst's hybrid architecture is a simplified version of the CNB reference architecture, adapted for their scale:
┌──────────────────────── CLOUD (AWS) ────────────────────────┐
│                                                             │
│  ┌──────────┐    ┌───────────────┐    ┌──────────────┐      │
│  │ Mobile   │───▶│ AWS API       │───▶│ Cloud        │      │
│  │ App (iOS │    │ Gateway       │    │ Microservices│      │
│  │ Android) │    │ (Auth, Rate   │    │ (ECS Fargate)│      │
│  └──────────┘    │ Limit, Route) │    │ - Notif.     │      │
│                  └───────┬───────┘    │ - Prefs      │      │
│                          │            │ - Statements │      │
│                          │            └───────┬──────┘      │
│                          │                    │             │
│                   ┌──────▼────────┐    ┌──────▼────────┐    │
│                   │ z/OS Connect  │    │ PostgreSQL    │    │
│                   │ (via VPN)     │    │ (CDC replica) │    │
│                   └──────┬────────┘    └──────▲────────┘    │
│                          │                    │             │
└──────────────────────────┼────────────────────┼─────────────┘
                           │                    │
                    ┌──────▼────────────────────┼─────┐
                    │       z/OS (Single LPAR)        │
                    │                                 │
                    │ ┌────────┐  ┌────────┐  ┌──────┐│
                    │ │CICS TS │  │ DB2 12 │  │ MQ   ││
                    │ │(2 AORs │  │(single │  │(QM)  ││
                    │ │ 1 TOR) │  │instance│  │      ││
                    │ └────────┘  └────────┘  └──────┘│
                    │                                 │
                    │ CDC: IBM InfoSphere Data Rep.   │
                    │   → MQ → AWS MQ → PostgreSQL    │
                    │                                 │
                    └─────────────────────────────────┘
Key differences from CNB:
| Dimension | CNB | SecureFirst | Why Different |
|---|---|---|---|
| z/OS topology | 4-LPAR Parallel Sysplex | Single LPAR | Scale: 3M vs. 500M txns/day |
| API gateway | Kong (self-managed) | AWS API Gateway (managed) | Budget: SecureFirst can't staff a gateway team |
| Event mesh | MQ + Kafka + custom connector | MQ → Amazon MQ → SQS/SNS | Simplicity: managed services reduce operational burden |
| CDC target | Cloud data warehouse (Snowflake) | PostgreSQL on RDS | Scale: SecureFirst's data volume fits in a relational DB |
| Integration team | 12-person dedicated squad | Yuki + Carlos (2 people, part-time) | Budget: SecureFirst can't afford a dedicated squad |
| Identity bridge | IBM ISAM on z/OS | Custom — AWS Cognito + RACF PassTicket generation | Cost: ISAM licensing exceeded SecureFirst's budget |
Year 1: The Launch and the Learning (2024)
The Good
The mobile app launched in March 2024 with three API-backed features: balance inquiry, transaction history, and funds transfer. All three called CICS programs through z/OS Connect.
Carlos's initial skepticism about COBOL — "it's a 60-year-old language" — transformed during the load test. "We simulated 5,000 concurrent users hitting the balance inquiry API. The CICS program responded in 12 milliseconds. Twelve. My Spring Boot microservice that does a SELECT from PostgreSQL takes 45 milliseconds on a good day. I stopped making jokes about COBOL after that load test."
The CDC pipeline — using IBM InfoSphere Data Replication to read the DB2 log, publish changes to an MQ queue, bridge to Amazon MQ via a VPN-connected MQ channel, and apply to PostgreSQL on RDS — was operational within 8 weeks. Average replication lag: 15 seconds. The mobile app's "recent transactions" view reads from PostgreSQL (eventual consistency), while the "current balance" reads from the mainframe API (strong consistency).
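The final hop of that pipeline — applying replicated changes to PostgreSQL — can be sketched as follows. This is a hypothetical illustration: the `transactions` table, its columns, and the message format are invented for the example, not SecureFirst's actual schema. Using an upsert keeps the apply step idempotent if a message is redelivered by MQ.

```python
import json

# Hypothetical sketch of the CDC apply step. Table name, column names,
# and message format are illustrative, not SecureFirst's actual schema.
def apply_change(cursor, msg):
    change = json.loads(msg)
    if change["op"] in ("INSERT", "UPDATE"):
        # Upsert keeps the apply idempotent under message redelivery.
        cursor.execute(
            "INSERT INTO transactions (txn_id, account_id, amount) "
            "VALUES (%(txn_id)s, %(account_id)s, %(amount)s) "
            "ON CONFLICT (txn_id) DO UPDATE SET amount = EXCLUDED.amount",
            change["row"],
        )
    elif change["op"] == "DELETE":
        cursor.execute(
            "DELETE FROM transactions WHERE txn_id = %(txn_id)s",
            change["row"],
        )
```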
The Bad: Incident #1 — The VPN Tunnel Collapse (June 2024)
At 2:14 AM on a Thursday, the AWS site-to-site VPN connection between SecureFirst's cloud VPC and their on-premises z/OS LPAR went down. The VPN provider's BGP session timed out due to a router firmware bug.
Impact: Every mobile app API call that required mainframe data failed. Balance inquiry: down. Funds transfer: down. Transaction history from PostgreSQL: still working (data was replicated before the outage, now going stale).
Duration: 3 hours and 47 minutes.
Root cause: Single VPN tunnel with no redundancy. The architecture had a single point of failure in the network layer between cloud and mainframe.
Yuki's post-incident analysis: "We had redundancy in every layer except the one that mattered most — the network link between platforms. We had two CICS AORs for application redundancy. We had RDS Multi-AZ for database redundancy. But one VPN tunnel. That's like building a bridge with two lanes and one support column."
Remediation:
1. Added a second VPN tunnel through a different provider and a different physical path.
2. Implemented health-check-based failover between tunnels (30-second failover time).
3. Added a circuit breaker in the API gateway: when the mainframe is unreachable, return cached data (balance as of last known good) with a "data may be stale" flag rather than a hard error. This degrades gracefully instead of failing completely.
4. Added "mainframe connectivity" as a golden signal in the monitoring dashboard with 1-minute alerting.
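The graceful-degradation behavior of the circuit breaker can be sketched in a few lines of Python. This is a minimal illustration, not the actual gateway configuration: the class name, thresholds, and the use of `ConnectionError` as the failure signal are all assumptions for the example.

```python
import time

# Minimal circuit-breaker sketch: when mainframe calls fail repeatedly,
# trip open and serve the last known good balance with a staleness flag.
# Names and thresholds are illustrative.
class BalanceCircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds before probing the mainframe again
        self.failures = 0
        self.opened_at = None
        self.cache = {}  # account_id -> last known good balance

    def get_balance(self, account_id, fetch_from_mainframe):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Breaker open: degrade gracefully instead of failing hard.
                return {"balance": self.cache.get(account_id), "stale": True}
            self.opened_at = None  # half-open: try the mainframe again
        try:
            balance = fetch_from_mainframe(account_id)
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return {"balance": self.cache.get(account_id), "stale": True}
        self.failures = 0
        self.cache[account_id] = balance
        return {"balance": balance, "stale": False}
```

The key design point is the `stale` flag: the mobile app can keep rendering a balance during a VPN outage while telling the customer the figure may be out of date.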
Cost of incident: Estimated $180K (customer service costs, regulatory notification, engineering time for remediation). Cost of redundant VPN: $24K/year. "The math was obvious in hindsight," Yuki says. "We should have spent the $24K on day one."
The Ugly: Incident #2 — The Packed Decimal Precision Bug (September 2024)
A customer reported that their balance displayed in the mobile app was $0.01 different from the balance shown at the ATM. Investigation revealed that the discrepancy existed for 12% of accounts.
Root cause: The z/OS Connect JSON transformation was converting COBOL packed decimal amounts (PIC S9(9)V99 COMP-3) to JSON numbers (IEEE 754 double-precision floating point). For most amounts, the conversion was exact. But for amounts that cannot be exactly represented in binary floating point — like $1234.56, which in binary is 1234.5599999999999... — the JSON number had a rounding error of $0.01.
The ATM system, which reads DB2 directly via a CICS transaction (no JSON conversion), displayed the exact packed decimal value. The mobile app, which received the JSON number, displayed the rounded value.
Carlos's reaction: "I've been writing financial software for six years and I've never had a penny rounding error. Because I've always worked with BigDecimal in Java. But JSON doesn't have BigDecimal — it has IEEE 754 doubles. I didn't even think about it when I designed the API schema. And z/OS Connect's default JSON conversion uses doubles. That's technically correct per the JSON spec, but it's wrong for banking."
Remediation:
1. Changed all monetary values in the API schema from JSON number type to JSON string type with a documented precision contract: "balance": "1234.56" (string, always 2 decimal places, no floating-point representation).
2. Updated the z/OS Connect service mapping to emit monetary values as strings.
3. Updated all mobile app and cloud microservice parsers to handle monetary values as strings, converting to language-appropriate decimal types (BigDecimal in Java/Kotlin, Decimal in Python).
4. Added a reconciliation check: nightly batch compares the PostgreSQL CDC replica balances against DB2 source balances, alerting on any discrepancy greater than $0.00.
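Both the bug and the string-based fix can be demonstrated in a few lines of Python. The `"balance"` field name follows the API contract above; the payloads themselves are invented for illustration.

```python
import json
from decimal import Decimal

# The bug: JSON numbers deserialize to IEEE 754 doubles, and 1234.56
# has no exact binary representation.
buggy = json.loads('{"balance": 1234.56}')
assert isinstance(buggy["balance"], float)
assert Decimal(buggy["balance"]) != Decimal("1234.56")  # tiny binary error

# The fix: monetary values travel as strings and are parsed into exact
# decimal types at the edges (BigDecimal in Java/Kotlin, Decimal in Python).
fixed = json.loads('{"balance": "1234.56"}')
balance = Decimal(fixed["balance"])
assert balance == Decimal("1234.56")  # exact
```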
Yuki's lesson: "This is exactly the kind of bug the anti-corruption layer is supposed to prevent. Our ACL — z/OS Connect in this case — was doing the conversion wrong. Not wrong per the JSON spec, but wrong per our business requirements. The ACL needs to understand the business semantics of the data it's converting, not just the technical format."
Year 2: Maturation and Expansion (2025)
Expanding the API Surface
By January 2025, SecureFirst had 14 CICS transactions exposed as REST APIs:
| API Endpoint | CICS Program | Avg Response | Daily Volume |
|---|---|---|---|
| GET /accounts/{id}/balance | SFBAL01 | 18ms | 2.1M |
| GET /accounts/{id}/transactions | SFTXN01 | 32ms | 1.8M |
| POST /transfers | SFXFR01 | 45ms | 340K |
| POST /bills/pay | SFBIL01 | 52ms | 180K |
| GET /loans/{id}/status | SFLNS01 | 22ms | 95K |
| POST /loans/apply | SFLNA01 | 120ms | 12K |
| GET /accounts/{id}/statements | SFSTM01 | 85ms | 450K |
| POST /accounts/open | SFACC01 | 210ms | 8K |
| GET /cards/{id}/details | SFCRD01 | 25ms | 380K |
| POST /cards/{id}/lock | SFCLK01 | 38ms | 2.5K |
| GET /rates/current | SFRAT01 | 8ms | 520K |
| POST /auth/verify | SFAUT01 | 15ms | 4.2M |
| GET /alerts/settings | SFALR01 | 12ms | 180K |
| POST /disputes/file | SFDSP01 | 280ms | 1.2K |
Total daily API volume: approximately 10 million requests, of which 60% hit the mainframe and 40% are served from cloud caches or cloud-native services.
The Saga Implementation (Loan Origination)
In Q1 2025, SecureFirst implemented their first cross-platform saga: loan origination. The saga spans six steps across both platforms:
Saga: LOAN_ORIGINATION
Orchestrator: AWS Step Functions
Step 1 (Mainframe): Create loan application → SFLNA01 via API
Compensate: Cancel loan application → SFLNC01 via API
Step 2 (Cloud): Run credit check → Equifax API
Compensate: N/A (read-only, no state change)
Step 3 (Mainframe): Calculate terms → SFLNT01 via API
Compensate: N/A (read-only, no state change)
Step 4 (Mainframe): Create loan account → SFLNF01 via API
Compensate: Close loan account → SFLND01 via API
Step 5 (Cloud): Send approval notification → SNS + SES
Compensate: Send cancellation notification
Step 6 (Cloud): Enable loan dashboard → PostgreSQL + S3
Compensate: Disable loan dashboard
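The real orchestrator is AWS Step Functions; the execute-then-compensate pattern it implements can be sketched in Python. Step names mirror the saga definition above, but the step bodies and the simulated Step 4 failure are invented for illustration.

```python
# Sketch of the saga pattern: execute steps in order; on failure, run the
# compensations of completed steps in reverse order, skipping read-only
# steps that have none. Illustrative only — not the Step Functions definition.
def run_saga(steps):
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            for _, comp in reversed(completed):
                if comp is not None:  # read-only steps have no compensation
                    comp()
            return ("FAILED_AT", name)
        completed.append((name, compensate))
    return ("COMPLETED", None)

log = []

def simulated_db2_failure():
    raise RuntimeError("DB2 constraint violation")  # like the Step 4 failures

steps = [
    ("create_application", lambda: log.append("SFLNA01"), lambda: log.append("SFLNC01")),
    ("credit_check",       lambda: log.append("equifax"), None),  # read-only
    ("create_account",     simulated_db2_failure,         lambda: log.append("SFLND01")),
]
status, failed_step = run_saga(steps)
# create_application and credit_check ran, then the failure unwound Step 1.
```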
Idempotency implementation: Each mainframe API call includes an X-Request-ID header (UUID generated by the saga orchestrator). The COBOL programs check a DB2 control table (SFREQUEST_LOG) for the request ID before executing. If the request ID exists, the program returns the cached response without re-executing. This makes every step safely retriable.
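The control-table check can be sketched as follows, using SQLite as a stand-in for DB2 and a Python function as a stand-in for the COBOL program. Only the SFREQUEST_LOG table name comes from the text; the schema, handler, and loan payload are hypothetical.

```python
import json
import sqlite3
import uuid

# Hypothetical stand-ins: SQLite for DB2, a Python function for the COBOL
# program. Only the SFREQUEST_LOG table name comes from the case study.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SFREQUEST_LOG (request_id TEXT PRIMARY KEY, response TEXT)"
)

def handle(request_id, payload, execute):
    # If this request ID was already processed, return the cached response.
    row = conn.execute(
        "SELECT response FROM SFREQUEST_LOG WHERE request_id = ?", (request_id,)
    ).fetchone()
    if row:
        return json.loads(row[0])
    result = execute(payload)  # run the business logic exactly once
    conn.execute(
        "INSERT INTO SFREQUEST_LOG VALUES (?, ?)", (request_id, json.dumps(result))
    )
    conn.commit()
    return result

executions = []

def create_loan(payload):
    executions.append(payload)  # track how many times the step really ran
    return {"loan_id": "LN-0001", "status": "CREATED"}

request_id = str(uuid.uuid4())
first = handle(request_id, {"amount": 25000}, create_loan)
retry = handle(request_id, {"amount": 25000}, create_loan)  # safe retry: cached
```

A retry with the same request ID returns the cached response and never re-executes the business logic, which is what makes the saga's automatic retries safe.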
Production statistics (first 3 months):
- 36,000 loan applications processed through the saga
- 34,800 completed successfully (96.7%)
- 1,040 rejected at the credit check step (2.9%) — normal business rejections, no compensation needed
- 140 failed at Step 4 (0.4%) — DB2 constraint violations; the compensating transaction for Step 1 was executed
- 20 failed at Step 5/6 (0.06%) — cloud service failures; compensating transactions for Steps 4 and 1 were executed
- 0 unresolvable saga failures requiring manual intervention
Carlos designed the saga: "The Step Functions state machine gives us complete visibility. Every loan application has a trace showing exactly what happened at each step — the request sent, the response received, the timestamp, and for failures, the compensating transaction that was executed. When the compliance team asked 'how do you ensure every failed application is properly unwound?', I showed them the Step Functions execution history. They were satisfied."
The CDC Evolution
The original CDC pipeline (InfoSphere → MQ → Amazon MQ → PostgreSQL) handled 4 tables. By Year 2, it handles 18 tables with significantly higher volume:
| Metric | Year 1 | Year 2 | Change |
|---|---|---|---|
| Tables replicated | 4 | 18 | +350% |
| Daily change events | 2.1M | 8.7M | +314% |
| Average replication lag | 15 sec | 22 sec | +47% |
| P99 replication lag | 45 sec | 68 sec | +51% |
| CDC-related incidents | 3 | 1 | -67% |
The lag increase worried Yuki: "We're approaching our 60-second SLA at the 99th percentile. The root cause is MQ channel throughput — the VPN-based MQ channel between z/OS and Amazon MQ is the bottleneck. We're evaluating two options: (1) increase MQ channel batch size to reduce per-message overhead, or (2) add a second MQ channel pair for high-volume tables."
Monitoring Evolution
SecureFirst's monitoring evolved from two separate dashboards (OMEGAMON for z/OS, CloudWatch for AWS) to a unified Grafana dashboard pulling from both sources:
z/OS metrics (collected via OMEGAMON → Prometheus exporter):
- CICS transaction response time (per transaction ID)
- CICS task count and MAXT utilization
- DB2 buffer pool hit ratio, thread utilization, lock wait time
- MQ queue depth, channel status, message throughput
- CPU utilization, paging rate

Cloud metrics (collected via CloudWatch → Prometheus):
- API Gateway latency (per endpoint, p50/p95/p99)
- ECS task health, CPU, memory
- PostgreSQL connections, query latency, replication lag
- Amazon MQ queue depth, consumer lag

Cross-platform metrics (calculated):
- End-to-end transaction latency (API Gateway response time minus CICS response time = integration layer overhead)
- CDC replication lag (DB2 commit timestamp minus PostgreSQL apply timestamp)
- VPN tunnel health and latency
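The two calculated metrics are simple subtractions over values the two platforms already export. A sketch, with hypothetical numbers rather than dashboard measurements:

```python
# Illustrative helpers for the two derived cross-platform metrics.
def integration_overhead_ms(api_gateway_ms, cics_ms):
    # Time spent outside CICS: API gateway, VPN, and z/OS Connect layers.
    return api_gateway_ms - cics_ms

def cdc_lag_seconds(db2_commit_epoch, pg_apply_epoch):
    # Positive lag: PostgreSQL applied the change after the DB2 commit.
    return pg_apply_epoch - db2_commit_epoch

overhead = integration_overhead_ms(45.0, 18.0)  # hypothetical: 27 ms of overhead
lag = cdc_lag_seconds(1_700_000_000.0, 1_700_000_022.0)  # hypothetical: 22 s behind
```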
The Grafana dashboard has four panels: a top-level health summary (green/yellow/red per business service), a latency breakdown panel (showing where time is spent in each layer), a volume panel (requests per second across both platforms), and an alerts panel (active and recently resolved alerts).
"The single dashboard changed how we think about incidents," Yuki says. "Before, when a customer complained about slow transfers, the mainframe person checked CICS and said 'fine', the cloud person checked the API gateway and said 'fine', and nobody looked at the MQ channel between them. Now we see the whole path on one screen. When the transfer latency spike happened last month, we identified the MQ channel backlog in under 3 minutes because it was right there on the dashboard between the CICS panel and the API gateway panel."
The Organizational Journey
Carlos's Transformation
Carlos Vega joined SecureFirst knowing nothing about mainframes. Two years later, he is the de facto hybrid architect — the person both the mainframe team and the cloud team call when something doesn't work.
His learning path:
- Months 1-3: "I thought CICS was just another REST endpoint. I designed API schemas without understanding COMMAREA layouts. The packed decimal bug was my wake-up call."
- Months 4-6: "I learned to read COBOL source code. Not write it — I'll leave that to the experts. But read it well enough to understand what the program does, what data it expects, and how it handles errors. This changed everything about how I designed API contracts."
- Months 7-12: "I learned to read CICS traces and DB2 EXPLAIN output. When the loan origination API was slow, I could look at the CICS trace and see that the COBOL program was doing a tablespace scan because a new index hadn't been created. I filed the request to Lisa — I mean, to our DBA — with the EXPLAIN output attached. She was impressed."
- Months 13-24: "I started thinking about the mainframe's constraints as design parameters, not limitations. The rate limiter in the API gateway isn't because the mainframe is weak — it's because the mainframe has finite, precious resources optimized for transaction processing, and my job is to protect them from cloud traffic patterns that the mainframe was never designed to handle."
Yuki's Perspective on Team Evolution
Yuki Nakamura came from cloud DevOps and initially viewed the mainframe as an obstacle. Her perspective shifted:
"The mainframe team taught me what 'production discipline' really means. In cloud, we deploy 10 times a day and roll back if something breaks. The mainframe team deploys once a week after extensive testing because a failed deployment can take down a system processing $100M in daily transactions. Both approaches are valid for their context. Hybrid means learning when to move fast and when to move carefully — and where the boundary between those two modes lives."
The team structure evolved organically:
| Role | Year 1 | Year 2 |
|---|---|---|
| Yuki Nakamura | DevOps lead (cloud focus) | Hybrid operations lead (both platforms) |
| Carlos Vega | Mobile API architect (cloud only) | Hybrid architect (integration layer owner) |
| Janet Kim (mainframe) | COBOL developer | COBOL developer + z/OS Connect administrator |
| David Park (mainframe) | CICS systems programmer | CICS SP + monitoring integration lead |
| Maria Santos (cloud) | Cloud developer | Cloud developer + CDC pipeline owner |
"We don't have a formal integration squad," Yuki says. "We can't afford one. Instead, Carlos and I are the integration function. We own the API gateway configuration, the z/OS Connect mappings, the CDC pipeline monitoring, and the saga orchestrator. It's a lot for two people, but it works because we've automated everything we can and we've documented everything we can't automate."
Cost Analysis: Two Years of Hybrid
| Category | Year 1 | Year 2 | Notes |
|---|---|---|---|
| z/OS operating cost | $1.8M | $1.7M | MIPS reduced 6% through API-driven offloading |
| AWS operating cost | $420K | $580K | Increased as more cloud services deployed |
| z/OS Connect licensing | $85K | $85K | Annual license |
| CDC licensing | $60K | $60K | Annual license |
| VPN (dual tunnel) | $24K | $24K | Added after Incident #1 |
| Staff (mainframe) | $650K | $650K | 5 FTEs, no change |
| Staff (cloud) | $780K | $780K | 6 FTEs, no change |
| Training | $45K | $30K | Cross-training slowed in Year 2 |
| **Total** | **$3.86M** | **$3.91M** | +1.3% YoY |
For comparison, the pre-hybrid cost (2023) was $3.2M — entirely mainframe. The hybrid architecture costs $700K/year more, but delivers mobile banking (previously impossible), analytics capability (previously batch-only), and architectural resilience (reduced single-platform risk). The mobile banking app generates an estimated $1.2M/year in reduced branch transaction costs and new customer acquisition.
ROI calculation: $1.2M benefit - $700K incremental cost = $500K net annual benefit, achieved in 18 months. The board approved Year 3 funding unanimously.
Discussion Questions
1. SecureFirst's hybrid architecture uses AWS managed services (API Gateway, Amazon MQ, ECS Fargate) where CNB uses self-managed equivalents (Kong, Kafka, Kubernetes). What are the tradeoffs? At what organizational size does the transition from managed to self-managed make sense?

2. The packed decimal precision bug (Incident #2) is a data type translation error in the ACL. How could this have been caught before production? Design a testing strategy specifically for ACL data type translation.

3. Carlos evolved from "CICS is a black box" to reading CICS traces in 12 months. What organizational conditions enabled this transformation? Could it be replicated at an organization where the mainframe and cloud teams are in separate buildings or separate cities?

4. SecureFirst's CDC replication lag is increasing (15 sec → 22 sec avg, 45 sec → 68 sec p99) as they add more tables. Using the CDC patterns from Section 37.3, design a remediation plan that keeps lag under the 60-second SLA as replication continues to expand.

5. The dual-VPN remediation after Incident #1 provides network redundancy, but the architecture still has a single LPAR. What other single points of failure exist in SecureFirst's architecture, and how would you address them within the $2M annual budget constraint?

6. Yuki says "We don't have a formal integration squad — Carlos and I are the integration function." This works at SecureFirst's scale, but identify the risks. What happens if Carlos or Yuki leaves? Design a knowledge transfer and bus-factor mitigation plan.