Case Study 1: SecureFirst's Strangler Fig for Mobile Banking
Background
SecureFirst Retail Bank is a mid-size retail bank with 1.2 million customers, $18 billion in assets, and a core banking system that runs on an IBM z15 with 3,200 MIPS. The COBOL/CICS/DB2 core banking system — 1.8 million lines of COBOL across 412 programs — has been in production since 1994 and processes 28 million transactions per day.
In 2022, SecureFirst's board approved a mobile-first strategy. The competitive landscape had shifted: digital-only banks were capturing younger customers with slick mobile apps and instant everything, while SecureFirst's mobile banking was a thin wrapper around green-screen functionality. The CEO's mandate: "Our mobile app should be indistinguishable from a fintech in user experience, but backed by the safety and depth of a chartered bank."
The team:

- Yuki Nakamura — DevOps lead. Brought Jenkins, Git, Zowe, and VS Code to the mainframe team. Former distributed systems engineer who learned to respect the mainframe's strengths. Her mantra: "Measure everything, assume nothing, automate anything that hurts."
- Carlos Vega — Mobile API architect. Java/Kotlin background, six years at a Silicon Valley startup before joining SecureFirst. Designed the microservices architecture for the new mobile platform. His initial attitude toward COBOL ("why haven't they rewritten this?") evolved into grudging respect after witnessing CICS process 28 million transactions in a day without breaking a sweat.
The Initial Architecture (March 2023)
Carlos's initial design was a clean microservices architecture on OpenShift:
Mobile App → API Gateway (Kong) → Microservices (Kotlin/Spring Boot)
                                      ├── PostgreSQL (account data replica)
                                      ├── Redis (session cache)
                                      └── Kafka (event streaming)
The plan: replicate all account data from DB2 to PostgreSQL via CDC, build Kotlin microservices that read/write to PostgreSQL, and eventually stop using the CICS system entirely. Timeline: 18 months. Budget: $4.2 million.
Yuki raised concerns from day one. "You're building a parallel banking system. What happens when the two systems disagree?" Carlos's answer — "We'll make sure they don't disagree" — was the kind of confidence that comes from not having worked with 30-year-old batch processing systems.
The Incident (March 2024 — 9 Hours After Go-Live)
The mobile app launched to 100% of customers on a Thursday morning. Everything worked beautifully for six and a half hours. Then the nightly batch window opened.
Timeline:
- 6:00 PM — Nightly batch cycle begins. Job ACCRPOST runs first, posting daily interest accruals to 800,000 savings accounts. Each posting is a DB2 UPDATE to the ACCOUNT_MASTER table.
- 6:12 PM — CDC pipeline begins replicating the accrual postings to PostgreSQL. Normal lag: 2-3 seconds. But with 800,000 rapid-fire UPDATEs, the DB2 recovery log is generating data faster than the CDC pipeline can process it.
- 6:47 PM — CDC lag reaches 47 seconds. The pipeline is 47 seconds behind the DB2 master.
- 7:15 PM — First customer complaint. A customer checks their savings balance on the mobile app, sees $14,200. Calls the branch. The teller's green screen shows $14,247.33 (after interest accrual posting). The customer asks: "Where's my interest?"
- 7:22 PM — Second complaint. Different customer, same pattern. Pre-posting balance on the app, post-posting balance on the green screen.
- 7:30 PM — The customer service manager escalates to IT. "The mobile app is showing wrong balances."
- 7:45 PM — Carlos is paged. He checks the PostgreSQL database — balances look correct to him. He checks CICS — different numbers. His first thought: "The mainframe has a bug." (It didn't.)
- 8:30 PM — Yuki is paged. She checks the CDC dashboard immediately and sees the lag. "It's not a bug. The app is showing stale data. The CDC pipeline can't keep up with the batch window."
- 9:15 PM — The CEO is briefed. The decision: keep the mobile app running but add a banner: "Balances may be temporarily delayed during nightly processing."
- 11:42 PM — Carlos and Yuki begin root cause analysis.
- 4:00 AM — Root cause confirmed: CDC latency during batch window. Not a logic bug. Not a data corruption. A timing issue caused by the fundamental architectural decision to replicate data and serve it from a separate database.
- 4:15 AM — Yuki's recommendation: "Strangler fig. Monday morning."
The Redesign (April 2024)
Yuki and Carlos spent two weeks redesigning the architecture. The key change: instead of building a complete parallel banking system and cutting over, they would incrementally extract services from CICS, one at a time, with the CICS system remaining the source of truth until each extracted service was proven equivalent.
Architecture: Before and After
Before (big-bang):
Mobile App → Kong → Kotlin Services → PostgreSQL (replica of DB2)
                                          ↑
                                     CDC (entire database)
After (strangler fig):
Mobile App → Kong → Routes to either:
├── Kotlin Service (for extracted services)
│ └── PostgreSQL (CDC-fed read replica)
└── z/OS Connect → CICS (for everything else)
└── DB2 (source of truth)
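The routing decision at the heart of the redesign can be sketched in a few lines (a Java sketch with hypothetical hostnames; in production this logic lives in Kong's route configuration, not in application code):

```java
import java.util.Set;

// Sketch of the strangler-fig routing rule: extracted services go to the
// modern side, everything else falls through to CICS via z/OS Connect.
// Hostnames and service names are illustrative.
public class StranglerRouter {
    // Services already extracted to the Kotlin/PostgreSQL side.
    private static final Set<String> EXTRACTED = Set.of("balance-inquiry");

    public static String backendFor(String service) {
        return EXTRACTED.contains(service)
                ? "http://kotlin-services.internal/" + service
                : "http://zosconnect.internal/cics/" + service;
    }

    public static void main(String[] args) {
        System.out.println(backendFor("balance-inquiry")); // modern side
        System.out.println(backendFor("fund-transfer"));   // stays on CICS
    }
}
```

Growing the `EXTRACTED` set is the entire migration story from the facade's point of view: consumers never see which backend answered.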
Extraction Order
Using the extraction scorecard from Section 33.3:
| Phase | Service | Priority Score | Timeline |
|---|---|---|---|
| 1 | Balance Inquiry | 7.25 | Months 1-4 |
| 2 | Transaction History | 7.00 | Months 4-7 |
| 3 | Account Summary | 4.80 | Months 7-10 |
| 4 | Account Alerts/Notifications | 4.50 | Months 10-12 |
| 5 | Check Image Retrieval | 3.80 | Months 12-14 |
| — | Fund Transfer | 2.30 | NOT PLANNED — stays on CICS |
| — | Bill Payment | 2.13 | NOT PLANNED — stays on CICS |
| — | Wire Transfer | 1.20 | NOT PLANNED — stays on CICS |
Phase 1: Balance Inquiry Extraction (April–July 2024)
Week 1-2: Preparation
Yuki and Carlos mapped the balance inquiry service:

- CICS Program: BALINQ (3,200 lines of COBOL)
- COMMAREA: 248 bytes — account number in, balance data out (available balance, ledger balance, hold amount, currency code, as-of timestamp, return code)
- DB2 Access: Two SELECT statements — one from ACCOUNT_MASTER (for balances), one from ACCOUNT_HOLDS (for hold amounts)
- Dependencies: BALINQ CALLs one subroutine (HOLDCALC, 800 lines) for the hold amount calculation. HOLDCALC handles seven types of holds, including the judicial garnishment hold with partial release that would later prove important.
- Seams: Clean. Dedicated CICS transaction (BALINQ). Self-contained COMMAREA. Isolated DB2 access through views (V_ACCT_BAL and V_ACCT_HOLDS). No MQ, no IMS, no shared WORKING-STORAGE with other programs.
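On the modern side, the COMMAREA's reply fields map naturally onto a typed structure (a Java sketch; field names follow the case study, but byte offsets and widths come from the real copybook, which is not shown here):

```java
import java.math.BigDecimal;

// Sketch of the BALINQ reply as a typed record. On the CICS side these are
// fixed-position COMMAREA fields; packed-decimal (COMP-3) amounts map to
// BigDecimal, never to double. The helper method name is illustrative.
public record BalanceReply(
        String accountNumber,   // echoed input
        BigDecimal available,   // available balance (COMP-3 on the host)
        BigDecimal ledger,      // ledger balance
        BigDecimal holdAmount,  // computed by HOLDCALC
        String currencyCode,    // e.g. "USD"
        String asOfTimestamp,   // as-of timestamp from DB2
        int returnCode) {       // 0 = success

    public boolean ok() { return returnCode == 0; }
}
```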
Week 3: CICS Web Service Wrapper
The team wrote BALWSSRV — the CICS web service wrapper described in Section 33.4.1 of the chapter. The wrapper translates JSON to COMMAREA, LINKs to BALINQ, and translates the COMMAREA response back to JSON. No business logic. Pure plumbing.
They deployed the wrapper via a CICS web service pipeline and tested it with z/OS Connect, confirming that the balance inquiry was now accessible as a REST API: GET /api/v2/accounts/{accountId}/balance.
Week 4-6: Modern Service Build
Carlos built the Kotlin microservice. Key decisions:
- BigDecimal for all monetary values as the standard. A few Double-based calculations nonetheless slipped into the first build — shadow mode would catch them — and after that, Carlos would never use double or float for money again.
- Same API contract. The Kotlin service exposes the identical REST API (GET /api/v2/accounts/{accountId}/balance) with the identical JSON response schema. The facade routes to one or the other; the consumer can't tell the difference.
- PostgreSQL read replica. The Kotlin service reads from PostgreSQL, which is fed by CDC from DB2. The service does not write to any database.
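The stakes of the BigDecimal decision are easy to demonstrate: binary floating point cannot represent most decimal cent values exactly, so repeated postings drift (a minimal, self-contained Java sketch):

```java
import java.math.BigDecimal;

// Demonstrates why double is unsafe for money: 0.10 has no exact binary
// representation, so repeated postings accumulate error.
public class PennyDrift {
    public static void main(String[] args) {
        // Post a $0.10 accrual 100 times with double.
        double d = 0.0;
        for (int i = 0; i < 100; i++) d += 0.10;
        System.out.println(d == 10.0);           // false

        // The same postings with BigDecimal are exact.
        BigDecimal b = BigDecimal.ZERO;
        for (int i = 0; i < 100; i++) b = b.add(new BigDecimal("0.10"));
        System.out.println(b);                   // 10.00
    }
}
```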
Week 7-9: Shadow Mode
The facade (Kong) was configured to route all balance-inquiry traffic to CICS (production) and simultaneously send a copy of each request to the Kotlin service (shadow). The comparison engine logged both responses.
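The core of such a comparison engine can be sketched as follows (Java, with illustrative field names; financial fields must match exactly, while the as-of timestamp is excluded because sub-second skew between the two queries is expected):

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Map;

// Sketch of the shadow-mode comparison: the three financial fields must
// match numerically; timestamps are deliberately not compared. Field names
// are illustrative, not the actual response schema.
public class ShadowComparator {
    private static final List<String> FINANCIAL =
            List.of("availableBalance", "ledgerBalance", "holdAmount");

    public static boolean financialMatch(Map<String, String> legacy,
                                         Map<String, String> modern) {
        // compareTo, not equals: "14247.33" and "14247.330" should match.
        return FINANCIAL.stream().allMatch(f ->
                new BigDecimal(legacy.get(f))
                        .compareTo(new BigDecimal(modern.get(f))) == 0);
    }
}
```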
Results after three weeks of shadow mode:
| Week | Requests | Financial Match | Non-Financial Match | Issues Found |
|---|---|---|---|---|
| 1 | 1,847,293 | 96.8% | 94.2% | COMP-3 vs. float rounding (3.2% of accounts) |
| 2 | 1,923,441 | 99.97% | 98.1% | Timezone discrepancy (UTC vs. ET), 7 garnishment hold accounts |
| 3 | 1,891,006 | 99.998% | 99.96% | 3 timing issues during batch window (expected) |
Week 1 discovery: The COMP-3 vs. floating-point rounding issue. The Kotlin service was using Double for balance arithmetic. 3.2% of accounts showed a penny discrepancy. Fix: replaced all Double types with BigDecimal. Redeployed. Discrepancy dropped to 0.03%.
Week 2 discovery: The timezone issue. The CICS program returned timestamps in Eastern Time (the mainframe's local time). The Kotlin service returned UTC. Consumers expected Eastern Time. Fix: configured the Kotlin service to return Eastern Time with an explicit timezone indicator (2024-06-15T14:23:01-04:00). Also discovered 7 accounts with judicial garnishment holds with partial release — a hold type that HOLDCALC handled but the Kotlin service hadn't implemented. Fix: implemented the garnishment hold logic in the Kotlin service, validated against HOLDCALC's output for all 7 accounts.
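The timezone fix amounts to rendering the instant in Eastern Time with an explicit offset (a Java sketch using java.time; the formatter reflects daylight saving time automatically):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Sketch of the timezone fix: emit timestamps in Eastern Time with an
// explicit offset (e.g. 2024-06-15T14:23:01-04:00) instead of bare UTC.
public class EasternTimestamps {
    private static final ZoneId EASTERN = ZoneId.of("America/New_York");

    public static String asOf(Instant instant) {
        return ZonedDateTime.ofInstant(instant, EASTERN)
                .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }

    public static void main(String[] args) {
        // June falls in daylight saving time, so the offset is -04:00.
        System.out.println(asOf(Instant.parse("2024-06-15T18:23:01Z")));
        // → 2024-06-15T14:23:01-04:00
    }
}
```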
Week 3: Clean. The three discrepancies were all timing issues — a deposit posted in DB2 between the legacy query and the modern query. Expected and acceptable for a comparison engine with sub-second timing differences.
Week 10-14: Canary Deployment
| Ring | Users | Duration | Result |
|---|---|---|---|
| Ring 0 | 247 SecureFirst employees | 2 weeks | Zero discrepancies on financial fields |
| Ring 1 | 5% external (60,000 customers, lowest balances) | 1 week | Zero discrepancies. Response time: modern 28ms p95, legacy 42ms p95 |
| Ring 2 | 25% external (300,000 customers) | 2 weeks | 2 discrepancies, both timing-related (batch window). No logic issues |
| Ring 3 | 50% external (600,000 customers) | 2 weeks | Zero discrepancies |
| Ring 4 | 100% external (1.2M customers) | Ongoing | Legacy on standby |
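Ring-based rollout like this is typically implemented by bucketing a stable customer identifier, so a customer never flips between backends as the ring widens (a Java sketch with illustrative names; the real routing lived in Kong):

```java
// Sketch of deterministic canary bucketing: hash the customer ID into 100
// buckets so each customer stays on the same side as the ring expands.
public class CanaryRouter {
    public static boolean routeToModern(String customerId, int ringPercent) {
        int bucket = Math.floorMod(customerId.hashCode(), 100);
        return bucket < ringPercent; // ring 1 = 5, ring 2 = 25, ring 3 = 50 ...
    }

    public static void main(String[] args) {
        System.out.println(routeToModern("CUST-000123", 0));   // false: 0% rollout
        System.out.println(routeToModern("CUST-000123", 100)); // true: full rollout
    }
}
```

Because the bucket is derived from the ID rather than drawn at random per request, widening the ring only adds customers to the modern side; nobody who was already migrated moves back.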
Week 15: Full Migration
All balance-inquiry traffic routed to the Kotlin service. CICS BALINQ remained deployed and available for rollback. The shadow comparison continued, now in reverse — the CICS service received copies of requests for comparison, but the Kotlin service's response went to consumers.
Week 15 + 90 days: Decommission
After 90 days of standby with zero rollback triggers, BALINQ was removed from the CICS CSD. Source code was archived. The z/OS Connect service definition for balance inquiry was removed. The BALWSSRV wrapper was archived.
Outcomes
Quantitative Results (12 months after strangler fig redesign)
| Metric | Before Strangler Fig | After (12 months) |
|---|---|---|
| Services extracted | 0 | 5 (balance, history, summary, alerts, check images) |
| Mobile API response time (p95) | 180ms (via z/OS Connect for all) | 28-45ms (extracted services), 120ms (CICS services) |
| Development velocity for extracted services | 8-12 weeks per feature | 1-2 weeks per feature |
| Mainframe MIPS consumption | 3,200 | 2,650 (17% reduction) |
| Production incidents related to mobile banking | 4 (in first month of big-bang) | 0 (in 12 months of strangler fig) |
| Rollback events triggered | N/A | 2 (both resolved within 30 minutes) |
Qualitative Results
Carlos's perspective: "The strangler fig forced me to respect the legacy system. I couldn't just build something new and hope it worked — I had to prove, with data, that my new service was equivalent to something that had been running correctly for 30 years. That's humbling, and it should be."
Yuki's perspective: "The biggest win isn't technical. It's that the mainframe team and the cloud team are now one team. They pair on every extraction. The mainframe people teach the cloud people about COBOL's strengths, and the cloud people teach the mainframe people about CI/CD. The strangler fig forced collaboration."
The CEO's perspective: "I don't know which services run on the mainframe and which run on OpenShift. I just know the mobile app is fast and it hasn't been wrong since we switched to the strangler fig approach. That's what I wanted."
Lessons Learned
1. The big-bang approach failed in nine hours. The strangler fig approach has been running for twelve months with zero customer-facing incidents. The difference is not the technology — it's the methodology.

2. CDC latency is the silent killer. It works perfectly 99.9% of the time. The 0.1% when it doesn't — during batch windows, during high-volume INSERT bursts, during DB2 REORG — is when customers see wrong numbers. Design for the 0.1%.

3. COMP-3 vs. floating-point is not a theoretical problem. It's a guaranteed production incident if you use floating-point for money. Use BigDecimal. Always. No exceptions.

4. Shadow mode testing is non-negotiable. It found the garnishment hold edge case that no unit test or QA cycle would have discovered. Seven accounts out of 1.2 million — 0.0006% — and it would have been a regulatory finding.

5. Fund transfer stays on CICS. And that's fine. The strangler fig's goal is not "zero mainframe." It's "the right service on the right platform." CICS handles the bank's 28 million daily transactions, fund transfers included, with sub-second response times and five-nines availability. The Kotlin service doesn't need to compete with that — it needs to compete with the fintech apps, and it does.
Discussion Questions
1. Carlos's initial architecture replicated the entire database to PostgreSQL. The strangler fig approach replicates only the tables needed by the currently extracted services. What are the implications for CDC pipeline complexity, storage costs, and data governance as more services are extracted?

2. SecureFirst chose the hybrid facade (Kong + z/OS Connect). If they had chosen Kong alone (Option A), what additional work would have been required to route traffic to CICS? What would they have lost?

3. The strangler fig left fund transfer, bill payment, and wire transfer on CICS. If a competitor launches a feature that requires fund transfers to be processed with 50ms latency (current CICS latency is 200ms via z/OS Connect), how would SecureFirst respond? Does the strangler fig architecture help or hinder this response?

4. Yuki describes the biggest win as "the mainframe team and the cloud team are now one team." How did the strangler fig methodology facilitate this cultural change, compared to a big-bang approach where the cloud team builds a replacement and the mainframe team maintains the legacy?

5. The CDC pipeline had a 47-second lag spike that caused the original incident. After the strangler fig redesign, the same CDC pipeline is still in use. What has changed — and what hasn't — about the risk of CDC latency during batch processing?