> "You don't replace a forty-year-old system by ripping it out. You replace it the way a strangler fig replaces a tree — slowly, from the outside in, until one day the original is gone and nobody noticed the transition."
Learning Objectives
- Design a strangler fig migration plan for incrementally extracting COBOL services
- Implement facade/proxy patterns that route between legacy COBOL and new services
- Manage the transition period where old and new systems coexist
- Design data synchronization strategies during migration
- Apply the strangler fig to the HA banking system
In This Chapter
- Chapter Overview
- 33.1 The Strangler Fig Metaphor
- 33.2 Architecture of the Strangler Fig Pattern
- 33.3 Identifying Extraction Candidates
- 33.4 The Facade Layer: Implementation
- 33.5 Data Synchronization During Migration
- 33.6 Testing the Transition
- 33.7 When to Stop Strangling
- 33.8 Applying the Strangler Fig to the HA Banking System
- 33.9 Real-World Patterns and Anti-Patterns
- Chapter Summary
- What's Next
"You don't replace a forty-year-old system by ripping it out. You replace it the way a strangler fig replaces a tree — slowly, from the outside in, until one day the original is gone and nobody noticed the transition." — Yuki Nakamura, DevOps Lead, SecureFirst Retail Bank
Chapter Overview
Carlos Vega remembers the exact moment he understood why the strangler fig pattern exists.
It was 11:42 PM on a Thursday in March 2024, and he was staring at his laptop in SecureFirst's operations center, surrounded by cold pizza boxes and the quiet hum of people trying very hard not to panic. SecureFirst's mobile banking app — the one Carlos had spent eighteen months building as a Java/Kotlin microservices architecture — had been live for exactly nine hours. And for the last two of those hours, balance inquiries had been returning numbers that didn't match what the CICS core banking system showed.
Not by pennies. By thousands of dollars.
The root cause, which took Carlos and Yuki Nakamura another four hours to find, was a timing issue in the batch cutover. SecureFirst's new microservice was querying a replicated PostgreSQL database that was supposed to mirror the DB2 master. But the CDC pipeline had a 47-second lag during peak batch processing, and the mobile app was showing pre-posting balances while the CICS green-screen tellers were seeing post-posting balances. Same accounts, same moment, different numbers. A customer who'd checked their balance on the app at 9:15 PM saw $14,200. The teller who looked it up thirty seconds later saw $11,847 — after the nightly mortgage payment posted.
"That," Carlos told Yuki at 4 AM, "is why you don't do big-bang cutovers for banking systems."
Yuki, who had been saying exactly this for months, was too tired to say "I told you so." Instead she said: "Monday morning, we redesign this as a strangler fig. We keep the CICS system as the source of truth and we peel services off one at a time, with parallel running, until we trust each one enough to let it fly alone."
That conversation — born of a near-disaster that could have triggered regulatory action if a customer had disputed the discrepancy — is why this chapter exists. The strangler fig pattern is the single most important execution pattern for mainframe modernization. Chapter 32 taught you what to modernize and why. This chapter teaches you how — incrementally, safely, and with the kind of paranoid attention to data consistency that banking regulators demand.
What you will learn in this chapter:
- How to design a strangler fig migration plan that incrementally extracts COBOL services into modern implementations without disrupting the running system
- How to build facade and proxy layers that route traffic between legacy COBOL and new services — using CICS web services, z/OS Connect, and API gateways
- How to manage the coexistence period where old and new systems run simultaneously, including data synchronization, conflict resolution, and rollback strategies
- How to test the transition using parallel running, canary deployments, and dark launching
- How to apply the strangler fig pattern to the HA banking system's balance-inquiry service
Learning Path Annotations:
- 🏃 Fast Track: If you've already done API-first modernization (Chapter 21), skip to Section 33.3 — you know the facade layer; what you need is the extraction decision framework and the data synchronization patterns.
- 📚 Deep Dive: Sections 33.5 and 33.6 are where the real complexity lives. Budget extra time for the data synchronization patterns — they're the difference between a smooth migration and a 4 AM incident.
- 🔗 Connection: This chapter operationalizes the "Refactor" strategy from Chapter 32's decision framework. If you scored an application as "Refactor — API-wrap and incrementally extract," this chapter is your execution playbook.
Spaced Review — Concepts from Earlier Chapters:
📊 Review from Chapter 13 (CICS Architecture): CICS region topology — AOR/TOR/FOR — is the foundation of the facade pattern. The TOR (Terminal-Owning Region) already acts as a routing layer, directing transactions to the appropriate AOR. The strangler fig's facade extends this concept: instead of routing only within CICS, we route between CICS and external services. If you're fuzzy on MRO routing and Sysplex-wide workload distribution, revisit Chapter 13, Section 13.4 before proceeding.
📊 Review from Chapter 21 (API-First COBOL): z/OS Connect and the API mediation layer you built in Chapter 21 are the technical foundation for the strangler fig's facade. The OpenAPI specifications, rate limiting, and versioning strategy from that chapter become the contract layer that both the legacy and modern services must honor. If you skipped Chapter 21, you'll need at least Sections 21.2 and 21.4 to follow the facade implementation in this chapter.
📊 Review from Chapter 32 (Modernization Strategy): The threshold concept — modernization is not migration — applies directly. The strangler fig is a modernization pattern, not a migration pattern. We're not moving everything off the mainframe. We're selectively extracting services where extraction provides clear business value, while leaving high-throughput, mission-critical transaction processing on the platform that was built for it. Sandra Chen's decision framework from Chapter 32 tells you which services to extract; this chapter tells you how.
33.1 The Strangler Fig Metaphor
Martin Fowler named the pattern in 2004, borrowing from the strangler fig trees he'd seen in the rainforests of Queensland. The Ficus genus includes species that germinate in the canopy of a host tree, send roots down along the host's trunk, and gradually — over decades — envelop the host tree entirely. The host tree eventually dies and decomposes, leaving the strangler fig standing in its place, its roots forming a hollow lattice where the host tree used to be.
The metaphor maps to software with uncomfortable precision:
| Biological Strangler Fig | Software Strangler Fig |
|---|---|
| Seeds germinate in the host's canopy | New services are built alongside the legacy system |
| Roots grow down alongside the host trunk | New services share the legacy system's data and interfaces |
| The fig gradually intercepts sunlight and nutrients | The facade gradually routes traffic to new services |
| The host tree dies slowly, from within | Legacy modules are decommissioned one by one |
| The fig stands where the tree stood | The modern system occupies the same business function |
| Nobody notices the exact moment the tree dies | Nobody notices the exact moment the last COBOL module is decommissioned |
But here's where the metaphor breaks and the reality of mainframe modernization diverges from a tidy biological analogy: the host tree doesn't fight back. A COBOL/CICS system running 500 million transactions a day has forty years of accumulated business rules, undocumented edge cases, batch dependencies, and regulatory compliance requirements that actively resist extraction. The strangler fig pattern for mainframes is less "peaceful vine grows alongside tree" and more "perform open-heart surgery on a patient who must continue running a marathon throughout the procedure."
The Pattern in Three Sentences
- Build a facade that sits in front of both the legacy system and any new services, presenting a single interface to consumers.
- Incrementally route traffic from the facade to new service implementations, one functional area at a time, while the legacy system continues handling everything else.
- Decommission legacy modules only after the new service has been proven equivalent through parallel running and validated in production.
That's it. Three sentences. The rest of this chapter is about why each of those sentences hides six months of engineering work.
Why Strangler Fig for Mainframes?
The alternative patterns — big-bang cutover, parallel build-and-switch, phased module replacement — all share a fatal flaw for mainframe systems: they require a moment where you switch from old to new. That moment is the single point of failure that has killed more modernization projects than any technical challenge.
The strangler fig eliminates that moment. There is no "go-live day." There is no "cutover weekend." There is a gradual, measurable, reversible transfer of traffic from legacy to modern, one service at a time, over months or years. If the new balance-inquiry service has a bug, you route traffic back to CICS in seconds. If the new payment service can't handle peak load, CICS absorbs the overflow. The legacy system is always there, always running, always ready to catch you when you fall.
Kwame Mensah at CNB puts it this way: "I've seen three big-bang migrations in my career. Two of them are still running on the mainframe. The third one is a Harvard Business School case study in how to destroy $800 million."
💡 Key Insight: The strangler fig pattern's greatest advantage is not technical — it's psychological and organizational. It turns modernization from a single high-stakes bet into a series of small, reversible experiments. Each experiment either succeeds (and you keep going) or fails (and you learn something). The total risk of the project is the sum of many small risks, not one catastrophic one.
What You're Really Building
Let me be precise about the architecture. A strangler fig implementation for a mainframe COBOL system has four layers:
┌─────────────────────────────────────────────────┐
│ External Consumers │
│ (Mobile App, Web Portal, Partner APIs) │
└─────────────────────┬───────────────────────────┘
│
┌─────────────────────▼───────────────────────────┐
│ FACADE / API GATEWAY │
│ (Routes requests to legacy OR modern) │
│ - Feature toggles │
│ - Traffic splitting rules │
│ - Consumer-driven contract validation │
│ - Logging / comparison engine │
└────────────┬────────────────────┬───────────────┘
│ │
┌────────▼────────┐ ┌───────▼────────────┐
│ LEGACY (COBOL) │ │ MODERN SERVICES │
│ CICS/DB2/IMS │ │ Java/Node/Go │
│ Batch/MQ │ │ Containers/K8s │
│ z/OS │ │ Cloud or zLinux │
└────────┬────────┘ └───────┬────────────┘
│ │
┌────────▼────────────────────▼────────────┐
│ DATA SYNCHRONIZATION │
│ (CDC, dual-write, event sourcing) │
│ Keeps legacy and modern data consistent │
└─────────────────────────────────────────┘
The facade is the strangler. The legacy system is the host tree. The modern services are the new roots. And the data synchronization layer — the one that kept Carlos up until 4 AM — is the part that nobody thinks about until it's too late.
33.2 Architecture of the Strangler Fig Pattern
Let's get specific. The strangler fig pattern for mainframe COBOL systems has three architectural components: the facade, the routing engine, and the extraction pipeline. Each one has design decisions that will determine whether your migration succeeds or produces a 4 AM phone call.
33.2.1 The Facade
The facade is the single entry point for all consumers. Before the strangler fig, consumers talked directly to CICS (via 3270, web services, or z/OS Connect). After the facade is in place, consumers talk to the facade, and the facade decides whether to route to CICS or to a modern service.
The facade must satisfy four requirements:
- Transparent to consumers. No consumer should need to change their integration when you switch a service from legacy to modern. The facade's external contract is stable; only the internal routing changes.
- Stateless. The facade routes requests but doesn't hold state. State lives in the services (legacy or modern) and in the data layer. A stateless facade can be scaled horizontally and doesn't create a single point of failure.
- Observable. Every request through the facade must be logged with enough detail to compare legacy and modern responses during parallel running. You need to know: which service handled the request, what the response was, how long it took, and whether the response matches what the other service would have returned.
- Reversible. Routing changes must be instantaneous. If the modern balance-inquiry service starts returning errors at 2 PM, you need to route all traffic back to CICS by 2:01 PM. No deployments, no restarts, no approvals — a configuration change that takes effect in seconds.
⚠️ Common Pitfall: Many teams build the facade as a "smart" layer that transforms data, applies business rules, or orchestrates calls to multiple backends. Don't. The facade should be dumb. Its only job is routing. Every piece of logic you put in the facade is a piece of logic you'll need to maintain, test, and debug when something goes wrong at 2 AM. Keep the facade thin.
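The observability requirement implies a concrete log schema. Here is a minimal sketch of the per-request record a facade might emit — the field names are illustrative, not SecureFirst's actual schema — capturing exactly what the comparison engine needs later:

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class FacadeLogRecord:
    """One log line per request through the facade. Captures what
    the comparison engine needs to match legacy and modern responses."""
    correlation_id: str   # propagated unchanged to both backends
    route: str            # API path, e.g. /api/v2/accounts/balance
    backend: str          # "legacy" or "modern"
    status: str           # backend status code ("00", "04", "99", ...)
    latency_ms: int
    response_digest: str  # hash of the canonical response body

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

record = FacadeLogRecord(
    correlation_id="00000000-0000-0000-0000-000000000001",  # fake UUID
    route="/api/v2/accounts/balance",
    backend="legacy",
    status="00",
    latency_ms=42,
    response_digest="a1b2c3",
)
```

Note what is absent: no account data, no balances. The facade logs routing metadata and a digest; the payloads themselves stay in the services.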
33.2.2 Implementation Options for the Facade
There are three viable implementation options for mainframe strangler fig facades. Each has tradeoffs:
Option A: API Gateway (Kong, Apigee, AWS API Gateway)
An external API gateway sits in front of both the mainframe and modern services. Traffic from consumers hits the gateway, which routes based on path, header, or percentage-based rules.
- Pros: Mature tooling, built-in traffic splitting, no z/OS changes required, works with any backend.
- Cons: Adds network hop for mainframe traffic that was previously internal, requires the mainframe to expose HTTP/REST endpoints (Chapter 21), may introduce latency for high-throughput transactions.
- Best for: Consumer-facing APIs (mobile, web), partner integrations, systems where the mainframe already exposes REST via z/OS Connect.
Option B: z/OS Connect as Facade
z/OS Connect EE (covered in Chapter 21) can act as the facade layer running on the mainframe. It already provides API mediation for CICS and IMS transactions. With routing rules, it can direct some requests to CICS and others to external services.
- Pros: Runs on z/OS — no additional network hop for legacy requests, leverages existing z/OS Connect investment, maintains mainframe security model (RACF, SSL).
- Cons: Routing to external services requires outbound HTTP from z/OS (possible but adds complexity), limited traffic-splitting features compared to dedicated API gateways, IBM licensing costs.
- Best for: Organizations that already use z/OS Connect, systems where most traffic stays on the mainframe, regulatory environments that require all routing to be auditable on z/OS.
Option C: Hybrid — External Gateway + z/OS Connect
The external API gateway handles consumer-facing routing decisions (which service gets the request). z/OS Connect handles the mainframe-side mediation (transforming the request into a CICS LINK or IMS transaction). This is what SecureFirst implemented after Carlos's 4 AM incident.
- Pros: Best of both worlds — external gateway gets mature traffic management, z/OS Connect gets native mainframe integration. Clean separation of concerns.
- Cons: Two components to manage, monitor, and troubleshoot. More infrastructure. More potential failure points.
- Best for: Most enterprise strangler fig implementations. This is the pattern I recommend unless you have a compelling reason to choose A or B.
33.2.3 The Routing Engine
The routing engine is the decision-making component within the facade. It answers one question: "Should this request go to the legacy system or the modern service?"
The routing decision can be based on:
| Routing Strategy | Description | Use Case |
|---|---|---|
| Path-based | Route by API endpoint path (/api/v2/balance → modern, /api/v2/transfer → legacy) | Service-by-service extraction |
| Header-based | Route by custom header (X-Backend: modern) | Testing and debugging |
| Percentage-based | Route N% of traffic to modern, (100-N)% to legacy | Canary deployments |
| User-based | Route specific users/accounts to modern | Beta testing with internal users |
| Time-based | Route to modern during low-traffic periods, legacy during peak | Building confidence gradually |
| Feature toggle | Route based on toggle state in configuration service | Instant rollback capability |
In practice, you'll use a combination. SecureFirst's routing for balance inquiry started with:
- Internal employees only (user-based) — two weeks
- 5% of external users (percentage-based) — one week
- 25% of external users — one week
- 50% with comparison logging (percentage + path-based) — two weeks
- 100% modern, CICS on standby — ongoing until decommission
💡 Key Insight: The routing engine is where the strangler fig pattern delivers its key value — reversibility. If Step 4 reveals a discrepancy, you drop back to Step 2 in seconds. No rollback deployment, no database restoration, no emergency change advisory board meeting. Just a configuration change.
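The combined strategies can be sketched as a single decision function. This is illustrative — the toggle names and hashing scheme are my assumptions, not a real product's API — but it shows the two properties that matter: the decision is driven entirely by configuration (so rerouting is a config change, not a deployment), and per-account hashing pins each account to one backend so a customer never sees two different balances mid-session:

```python
import hashlib

def choose_backend(account_id: str, user_group: str, config: dict) -> str:
    """Route one request to 'legacy' or 'modern'.

    `config` is reloaded from a configuration service, so changing it
    reroutes traffic in seconds -- no deployment, no restart.
    """
    if not config.get("modern_enabled", False):       # feature toggle
        return "legacy"
    if user_group in config.get("pilot_groups", ()):  # user-based routing
        return "modern"
    # Percentage-based split, hashed on account number so the same
    # account always lands on the same backend during the ramp-up.
    bucket = int(hashlib.sha256(account_id.encode()).hexdigest(), 16) % 100
    return "modern" if bucket < config.get("modern_percent", 0) else "legacy"

# Phase 2 of SecureFirst's ramp-up: internal users to modern,
# everyone else still on CICS.
cfg = {"modern_enabled": True, "pilot_groups": {"internal"}, "modern_percent": 0}
```

Rolling back from Step 4 to Step 2 is one change: set modern_percent back to 0 in the configuration service.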
33.2.4 The Extraction Pipeline
The extraction pipeline is the process — not a technology — by which you identify a COBOL service, build the modern replacement, validate it through parallel running, and decommission the legacy code. Each extraction follows the same lifecycle:
IDENTIFY → UNDERSTAND → BUILD → SHADOW → PARALLEL → CANARY → MIGRATE → DECOMMISSION
│ │ │ │ │ │ │ │
│ │ │ │ │ │ │ └─ Remove legacy code,
│ │ │ │ │ │ │ update docs
│ │ │ │ │ │ └─ 100% traffic to modern,
│ │ │ │ │ │ legacy on standby
│ │ │ │ │ └─ Percentage-based traffic split,
│ │ │ │ │ comparison still active
│ │ │ │ └─ Both services handle real traffic,
│ │ │ │ responses compared automatically
│ │ │ └─ Modern service gets copy of traffic,
│ │ │ responses discarded (shadow mode)
│ │ └─ Implement modern service,
│ │ unit + integration tests
│ └─ Map all business rules, edge cases,
│ data dependencies
└─ Score extraction candidates (Section 33.3)
A typical extraction takes 3-6 months for a medium-complexity COBOL service (5,000-20,000 LOC, DB2 back-end, CICS front-end). High-complexity services with IMS dependencies, multiple copybook hierarchies, and undocumented business rules can take 12-18 months.
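The lifecycle is strictly ordered: each extraction advances one stage at a time, and any stage before DECOMMISSION can fall back. A minimal sketch of that state machine, using the stage names from the diagram (the function names are mine):

```python
STAGES = ["IDENTIFY", "UNDERSTAND", "BUILD", "SHADOW",
          "PARALLEL", "CANARY", "MIGRATE", "DECOMMISSION"]

def advance(stage: str) -> str:
    """Move an extraction to the next lifecycle stage."""
    i = STAGES.index(stage)
    if i == len(STAGES) - 1:
        raise ValueError("DECOMMISSION is terminal")
    return STAGES[i + 1]

def fall_back(stage: str) -> str:
    """Drop back one stage, e.g. CANARY -> PARALLEL when the
    comparison engine flags a discrepancy. DECOMMISSION is excluded:
    once the legacy code is removed, there is nothing to fall back to."""
    if stage == "DECOMMISSION":
        raise ValueError("cannot fall back after decommission")
    return STAGES[max(STAGES.index(stage) - 1, 0)]
```

The asymmetry is the point: advancing requires evidence (passing comparison thresholds); falling back requires nothing but a routing change.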
33.3 Identifying Extraction Candidates
Not every COBOL service should be extracted. The strangler fig pattern is powerful precisely because it's selective — you extract what benefits from extraction and leave the rest alone. The decision framework from Chapter 32 (portfolio assessment, three-axis scoring) gives you the strategic view. This section gives you the tactical criteria for choosing which services to extract first.
33.3.1 The Extraction Scorecard
Score each candidate service on five dimensions:
| Dimension | Score Range | What You're Measuring |
|---|---|---|
| Business Value of Extraction | 1-5 | How much does the business gain from having this as a modern service? (Mobile access, faster change velocity, new capabilities) |
| Technical Complexity | 1-5 (lower is better) | How many dependencies, edge cases, and undocumented rules does this service have? |
| Data Coupling | 1-5 (lower is better) | How tightly is this service's data coupled to other services? Does it share DB2 tables with 15 other programs? |
| Change Frequency | 1-5 | How often does the business need to change this service? High-change services benefit most from modern development practices. |
| Risk Tolerance | 1-5 | How much tolerance does the business have for errors in this service? (A display-only service has high tolerance; a payment service has near-zero tolerance.) |
Calculate the extraction priority score:
Priority = (Business Value × 3) + (Change Frequency × 2) + Risk Tolerance
─────────────────────────────────────────────────────────────────
Technical Complexity + Data Coupling
Higher scores indicate better extraction candidates. The weighting reflects the practical reality: you want high business value, high change frequency, and high risk tolerance (meaning you can afford some errors during transition), divided by the difficulty factors.
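The formula translates directly to code. A small sketch (the function name is mine) that reproduces the scorecard arithmetic:

```python
def extraction_priority(biz_value: int, change_freq: int, risk_tol: int,
                        tech_complexity: int, data_coupling: int) -> float:
    """Extraction priority score from Section 33.3.1.
    Higher scores indicate better first candidates."""
    numerator = biz_value * 3 + change_freq * 2 + risk_tol
    return round(numerator / (tech_complexity + data_coupling), 2)

# Balance Inquiry: high value, simple, read-only, low coupling
print(extraction_priority(5, 3, 4, 2, 2))   # 6.25
# Fund Transfer: high value, but complex, tightly coupled, zero risk tolerance
print(extraction_priority(5, 3, 1, 5, 5))   # 2.2
```

Note how Fund Transfer's high business value (5) is swamped by its denominator — exactly the dynamic the scorecard is designed to surface.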
33.3.2 SecureFirst's Extraction Scorecard
Yuki and Carlos scored SecureFirst's CICS services using this framework:
| Service | Biz Value | Tech Complex | Data Coupling | Change Freq | Risk Toler. | Priority |
|---|---|---|---|---|---|---|
| Balance Inquiry | 5 | 2 | 2 | 3 | 4 | 6.25 |
| Transaction History | 5 | 2 | 3 | 3 | 4 | 5.00 |
| Account Summary | 4 | 2 | 3 | 2 | 4 | 4.00 |
| Fund Transfer | 5 | 5 | 5 | 3 | 1 | 2.20 |
| Bill Payment | 4 | 4 | 4 | 2 | 1 | 2.13 |
| Loan Origination | 3 | 5 | 4 | 1 | 1 | 1.33 |
| Wire Transfer | 3 | 5 | 5 | 1 | 1 | 1.20 |
The results matched their intuition but now had numbers behind them. Balance Inquiry scored highest because it's read-only (high risk tolerance — a wrong balance display is bad but doesn't lose money), has relatively simple logic (account lookup, hold calculation, available balance computation), low data coupling (reads from the account master table and the hold table, doesn't write anything), and delivers immediate business value (the mobile app's most-used function).
Fund Transfer scored lowest despite high business value, because the technical complexity and data coupling are extreme (it touches account masters, transaction logs, general ledger, hold tables, and regulatory tables, all within a single UOW), and the risk tolerance is effectively zero — a wrong transfer loses real money or violates regulations.
⚠️ Common Pitfall: Teams often pick Fund Transfer first because it has the highest business value. This is the wrong metric. The strangler fig pattern is about building confidence incrementally. Your first extraction must succeed — it sets the pattern, builds organizational confidence, and trains the team. Pick the service with the highest priority score, not the highest business value. You'll get to Fund Transfer eventually, but by then you'll have extracted five simpler services and learned from each one.
33.3.3 Finding the Seams
Michael Feathers (author of Working Effectively with Legacy Code) introduced the concept of a "seam" — a place in the code where you can alter behavior without editing the code itself. In a COBOL/CICS system, the natural seams are:
- CICS transaction boundaries. Each CICS transaction (EXEC CICS LINK, EXEC CICS XCTL) is a seam. The TOR routes to a transaction; you can route that same transaction to a modern service instead.
- COMMAREA / Channel boundaries. The data passed between CICS programs via COMMAREA or channels/containers defines the service contract. If you can replicate the COMMAREA contract in a REST API, you have a clean extraction point.
- Copybook boundaries. Shared copybooks define data structures. If a service has its own copybook that isn't shared with unrelated services, that's a clean data boundary — a seam.
- DB2 view boundaries. If the service reads data through a DB2 view rather than directly from base tables, the view is a seam — you can change what's behind the view without changing the service.
- MQ queue boundaries. Services that communicate via MQ queues have explicit message contracts. The queue is a natural seam — you can put a different service on the receiving end of the queue.
At SecureFirst, the balance-inquiry extraction succeeded because it had clean seams on three dimensions: a dedicated CICS transaction (BALINQ), a well-defined COMMAREA (account number in, balance data out), and a focused DB2 access pattern (SELECT from ACCOUNT_MASTER and ACCOUNT_HOLDS). No IMS, no MQ, no shared copybooks with unrelated services.
💡 Key Insight: If you can't find clean seams, you may need to create them before extraction. This is the "prepare the patient for surgery" phase. Refactor the COBOL first — separate the service's logic into its own paragraph or subprogram, isolate its DB2 access through a data-access copybook, and ensure the COMMAREA is self-contained. This pre-extraction refactoring is valid modernization work that pays for itself even if you never extract the service.
33.4 The Facade Layer: Implementation
Let's build it. This section walks through the facade layer implementation using SecureFirst's balance-inquiry extraction as the reference. We'll cover the CICS web service wrapper, the API gateway configuration, and the routing rules.
33.4.1 The Legacy Side: CICS Web Service Wrapper
Before the strangler fig, SecureFirst's balance inquiry was a CICS transaction (BALINQ) invoked via a 3270 terminal or an internal CICS LINK. To make it accessible through the facade, Yuki's team wrapped it as a CICS web service using the CICS web services pipeline (covered in detail in Chapter 14).
The wrapper program — BALWSSRV — receives an HTTP JSON request, transforms it to the existing COMMAREA layout, LINKs to the original BALINQ program, transforms the COMMAREA response back to JSON, and returns it. The wrapper adds no business logic — it's a pure translation layer.
IDENTIFICATION DIVISION.
PROGRAM-ID. BALWSSRV.
*================================================================
* BALANCE INQUIRY WEB SERVICE WRAPPER
* Translates JSON REST request to COMMAREA for BALINQ program.
* Part of the strangler fig facade layer.
*
* This program is a TRANSLATION layer only.
* NO business logic belongs here.
* NO data access belongs here.
* If you're tempted to add a DB2 query, stop and refactor
* the underlying BALINQ program instead.
*================================================================
DATA DIVISION.
WORKING-STORAGE SECTION.
01 WS-RESP PIC S9(8) COMP VALUE 0.
01 WS-RESP2 PIC S9(8) COMP VALUE 0.
01 WS-JSON-REQUEST.
05 WS-JSON-ACCT-NUM PIC X(12).
05 WS-JSON-REQ-TYPE PIC X(8).
05 WS-JSON-CORRELATION-ID PIC X(36).
01 WS-JSON-RESPONSE.
05 WS-RESP-ACCT-NUM PIC X(12).
05 WS-RESP-AVAIL-BAL PIC S9(13)V99 COMP-3.
05 WS-RESP-LEDGER-BAL PIC S9(13)V99 COMP-3.
05 WS-RESP-HOLD-AMT PIC S9(13)V99 COMP-3.
05 WS-RESP-CURRENCY PIC X(3).
05 WS-RESP-AS-OF-TS PIC X(26).
05 WS-RESP-STATUS PIC X(2).
88 RESP-OK VALUE '00'.
88 RESP-ACCT-NOT-FOUND VALUE '04'.
88 RESP-ACCT-CLOSED VALUE '08'.
88 RESP-SYSTEM-ERROR VALUE '99'.
* COMMAREA layout for the existing BALINQ program
COPY BALCOMM.
01 WS-CONTAINER-NAME PIC X(16)
VALUE 'BALINQ-JSON-REQ'.
01 WS-CHANNEL-NAME PIC X(16)
VALUE 'BALINQ-CHANNEL'.
PROCEDURE DIVISION.
MAIN-LOGIC.
PERFORM 1000-RECEIVE-REQUEST
PERFORM 2000-MAP-TO-COMMAREA
PERFORM 3000-LINK-TO-BALINQ
PERFORM 4000-MAP-TO-RESPONSE
PERFORM 5000-SEND-RESPONSE
EXEC CICS RETURN END-EXEC
.
1000-RECEIVE-REQUEST.
* Receive JSON from the CICS web service pipeline.
* The pipeline's DFHJS2LS transformation has already
* converted JSON to the WS-JSON-REQUEST structure.
EXEC CICS GET CONTAINER(WS-CONTAINER-NAME)
CHANNEL(WS-CHANNEL-NAME)
INTO(WS-JSON-REQUEST)
RESP(WS-RESP)
RESP2(WS-RESP2)
END-EXEC
IF WS-RESP NOT = DFHRESP(NORMAL)
MOVE '99' TO WS-RESP-STATUS
PERFORM 5000-SEND-RESPONSE
EXEC CICS RETURN END-EXEC
END-IF
.
2000-MAP-TO-COMMAREA.
* Map the JSON request fields to the COMMAREA layout
* that BALINQ expects. This is the ONLY place where
* field mapping happens.
INITIALIZE BALINQ-COMMAREA
MOVE WS-JSON-ACCT-NUM
TO BAL-ACCT-NUMBER
MOVE 'INQ' TO BAL-REQUEST-TYPE
MOVE WS-JSON-CORRELATION-ID
TO BAL-CORRELATION-ID
.
3000-LINK-TO-BALINQ.
* LINK to the existing BALINQ program.
* The COMMAREA contract is the seam.
EXEC CICS LINK PROGRAM('BALINQ')
COMMAREA(BALINQ-COMMAREA)
LENGTH(LENGTH OF BALINQ-COMMAREA)
RESP(WS-RESP)
RESP2(WS-RESP2)
END-EXEC
IF WS-RESP NOT = DFHRESP(NORMAL)
MOVE '99' TO WS-RESP-STATUS
END-IF
.
4000-MAP-TO-RESPONSE.
* Map COMMAREA response back to JSON response structure.
MOVE BAL-ACCT-NUMBER TO WS-RESP-ACCT-NUM
MOVE BAL-AVAIL-BALANCE TO WS-RESP-AVAIL-BAL
MOVE BAL-LEDGER-BALANCE TO WS-RESP-LEDGER-BAL
MOVE BAL-HOLD-AMOUNT TO WS-RESP-HOLD-AMT
MOVE BAL-CURRENCY-CODE TO WS-RESP-CURRENCY
MOVE BAL-TIMESTAMP TO WS-RESP-AS-OF-TS
MOVE BAL-RETURN-CODE TO WS-RESP-STATUS
.
5000-SEND-RESPONSE.
* Place response in container for the pipeline to
* convert back to JSON.
EXEC CICS PUT CONTAINER(WS-CONTAINER-NAME)
CHANNEL(WS-CHANNEL-NAME)
FROM(WS-JSON-RESPONSE)
RESP(WS-RESP)
END-EXEC
.
The key design decisions in this wrapper:
- No business logic. The wrapper is pure plumbing. If someone wants to change how balance inquiry works, they change BALINQ, not BALWSSRV.
- Correlation ID propagation. The JSON request includes a correlation ID that flows through to the COMMAREA. This is essential for parallel running — you need to match legacy and modern responses for the same request.
- Standard error mapping. The wrapper maps CICS response codes to the JSON response's status field. The facade layer needs consistent error reporting from both legacy and modern services.
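To make the seam concrete from the modern side, here is a sketch of what honoring the COMMAREA contract looks like: fixed-width fields, fixed order. The widths follow the PIC clauses visible in BALWSSRV — X(12) account, X(8) request type, X(36) correlation ID — but the BALCOMM copybook itself isn't shown in this chapter, so treat the layout as illustrative:

```python
def pack_balinq_request(acct: str, req_type: str, corr_id: str) -> bytes:
    """Build a fixed-width request area matching the BALINQ seam.
    Widths mirror the wrapper's JSON fields: X(12), X(8), X(36).
    (The real BALCOMM layout may differ -- illustrative only.)
    COBOL pads alphanumeric fields with trailing spaces; a real z/OS
    transport would also convert to EBCDIC (Python codec 'cp037') --
    ASCII is used here for readability."""
    if len(acct) > 12 or len(req_type) > 8 or len(corr_id) > 36:
        raise ValueError("field exceeds its PIC length")
    return (acct.ljust(12) + req_type.ljust(8) + corr_id.ljust(36)).encode("ascii")

area = pack_balinq_request(
    "000123456789", "INQ", "c0ffee00-0000-0000-0000-000000000001")
assert len(area) == 56  # 12 + 8 + 36
```

Whether the bytes cross the wire as a COMMAREA or as JSON through the pipeline, the contract is the same: positions and lengths, not field names. That is why a stable COMMAREA is such a clean extraction point.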
33.4.2 The API Gateway Configuration
With the CICS web service wrapper in place, the legacy balance-inquiry service is accessible via HTTP REST. Now we configure the API gateway to route between legacy and modern.
SecureFirst chose Kong as their API gateway (deployed on OpenShift alongside the mainframe). The routing configuration uses Kong's traffic-splitting plugin combined with feature toggles stored in a configuration database.
Here's the routing configuration for the balance-inquiry strangler fig:
# Kong API Gateway — Balance Inquiry Strangler Fig Routing
# SecureFirst Retail Bank
# Managed by: Yuki Nakamura / Carlos Vega
# Last updated: 2024-11-15
#
# ROUTING STRATEGY:
# Phase 1 (current): 100% legacy (CICS via z/OS Connect)
# Phase 2: Internal users → modern, external → legacy
# Phase 3: 10% → modern (canary), 90% → legacy
# Phase 4: 50/50 with comparison logging
# Phase 5: 100% modern, legacy on standby
# Phase 6: Legacy decommissioned
_format_version: "3.0"
services:
# Legacy service: CICS balance inquiry via z/OS Connect
- name: balance-inquiry-legacy
url: https://zosconnect.securefirst.internal:9443/zosConnect/apis/balanceinquiry/v1
protocol: https
connect_timeout: 5000
write_timeout: 10000
read_timeout: 15000
retries: 2
tags:
- strangler-fig
- legacy
- balance-inquiry
routes:
- name: balance-inquiry-legacy-route
paths:
- /api/v2/accounts/balance
methods:
- GET
headers:
X-Route-Override:
- legacy
strip_path: false
# Modern service: Kotlin microservice on OpenShift
- name: balance-inquiry-modern
url: http://balance-inquiry-svc.banking.svc.cluster.local:8080/api/v2/accounts/balance
protocol: http
connect_timeout: 3000
write_timeout: 5000
read_timeout: 10000
retries: 3
tags:
- strangler-fig
- modern
- balance-inquiry
routes:
- name: balance-inquiry-modern-route
paths:
- /api/v2/accounts/balance
methods:
- GET
headers:
X-Route-Override:
- modern
strip_path: false
# Traffic splitting plugin — controls the percentage split
plugins:
- name: canary
service: balance-inquiry-legacy
config:
# Percentage of traffic routed to the modern (upstream) service
# Change this value to control the strangler fig phase
percentage: 0
upstream_host: balance-inquiry-modern
upstream_uri: /api/v2/accounts/balance
upstream_port: 8080
# Hash-based routing ensures the same account always goes
# to the same backend during a session (prevents the
# confusion Carlos experienced with inconsistent balances)
hash: consumer
start: "2024-12-01T00:00:00Z"
duration: 2592000 # 30 days for gradual ramp-up
steps: 100
# Comparison logging — logs both legacy and modern responses
# for offline analysis during parallel running
- name: request-transformer
service: balance-inquiry-legacy
config:
add:
headers:
- "X-Correlation-ID:$(uuid)"
- "X-Strangler-Phase:parallel-run"
- "X-Timestamp:$(now)"
# Rate limiting — protect the legacy system from being
# overwhelmed during the transition
- name: rate-limiting
service: balance-inquiry-legacy
config:
minute: 6000
hour: 200000
policy: redis
redis_host: redis.securefirst.internal
redis_port: 6379
33.4.3 The Comparison Engine
During parallel running (Phase 4), both legacy and modern services handle real requests. The comparison engine captures both responses and compares them field by field. Discrepancies are logged, categorized, and flagged for investigation.
SecureFirst's comparison engine runs as a sidecar service alongside the API gateway. For each request:
- Route to the primary backend (whichever is handling production traffic)
- Asynchronously forward a copy to the secondary backend (shadow mode)
- Compare the two responses field by field
- Log the comparison result with the correlation ID
- Alert if the discrepancy rate exceeds a threshold
The comparison rules for balance inquiry:
FIELD: available_balance
MATCH: Exact to the penny (0.00 tolerance)
SEVERITY: Critical
ACTION: Alert immediately, halt canary ramp-up
FIELD: ledger_balance
MATCH: Exact to the penny (0.00 tolerance)
SEVERITY: Critical
ACTION: Alert immediately, halt canary ramp-up
FIELD: hold_amount
MATCH: Exact to the penny (0.00 tolerance)
SEVERITY: Critical
ACTION: Alert immediately, halt canary ramp-up
FIELD: currency_code
MATCH: Exact string match
SEVERITY: Critical
ACTION: Alert immediately
FIELD: as_of_timestamp
MATCH: Within 5 seconds (accounts for processing time difference)
SEVERITY: Warning
ACTION: Log for investigation if >1% of requests
FIELD: response_time_ms
MATCH: Modern must be within 150% of legacy
SEVERITY: Warning
ACTION: Log; investigate if modern is consistently slower
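The financial-field, timestamp, and latency rules above can be expressed as simple predicates. This is a minimal Java sketch; the class and method names are illustrative, not SecureFirst's actual comparison engine:

```java
import java.math.BigDecimal;
import java.time.Duration;
import java.time.Instant;

public class ComparisonRules {
    // Financial fields: exact to the penny, 0.00 tolerance. compareTo (not
    // equals) is used so 14200.0 and 14200.00 compare as equal despite
    // differing scales.
    public static boolean financialMatch(BigDecimal legacy, BigDecimal modern) {
        return legacy.compareTo(modern) == 0;
    }

    // as_of_timestamp: within 5 seconds, in either direction.
    public static boolean timestampMatch(Instant legacy, Instant modern) {
        return Duration.between(legacy, modern).abs().getSeconds() <= 5;
    }

    // response_time_ms: modern must be within 150% of legacy.
    public static boolean latencyAcceptable(long legacyMs, long modernMs) {
        return modernMs <= legacyMs * 1.5;
    }
}
```

Note the design choice: `compareTo` rather than `equals`, because `BigDecimal.equals` is scale-sensitive and would flag a spurious mismatch between two representations of the same amount.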
The critical insight here: for financial data, the comparison must be exact to the penny. Carlos learned this the hard way. A one-cent rounding difference in balance inquiry might seem trivial, but it means the systems disagree about the state of an account. If they disagree on balances, they'll disagree on overdraft calculations, interest accruals, and regulatory reports. A penny today is a regulatory finding tomorrow.
🔴 Production War Story: During SecureFirst's parallel running, the comparison engine flagged a 0.01 discrepancy on 3.2% of accounts. Root cause: the modern Kotlin service used IEEE 754 double-precision floating-point for balance calculations, while the COBOL program used COMP-3 (packed decimal). A double cannot represent $1234.56 exactly; the stored value differs from 1234.56 at roughly the thirteenth decimal place, and the penny appears when enough decimal operations accumulate. The fix: the Kotlin service was rewritten to use BigDecimal for all monetary calculations. This is a known problem, documented in every COBOL migration guide, and the team still made the mistake. Use packed decimal equivalents. Always.
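The failure mode is easy to reproduce. A minimal Java sketch (the $0.10 loop is illustrative, not SecureFirst's code): accumulate one hundred ten-cent credits with doubles and with BigDecimal, and watch the penny drift appear only in the former.

```java
import java.math.BigDecimal;

public class PennyDrift {
    // Sum one hundred $0.10 credits with IEEE 754 doubles: the
    // representation error accumulates with every addition.
    public static double doubleSum() {
        double total = 0.0;
        for (int i = 0; i < 100; i++) total += 0.10;
        return total;  // close to, but not exactly, 10.0
    }

    // Same accumulation with BigDecimal (the fix SecureFirst applied): exact.
    public static BigDecimal decimalSum() {
        BigDecimal total = BigDecimal.ZERO;
        for (int i = 0; i < 100; i++) total = total.add(new BigDecimal("0.10"));
        return total;  // exactly 10.00
    }

    public static void main(String[] args) {
        System.out.println(doubleSum());   // not 10.0
        System.out.println(decimalSum());  // prints 10.00
    }
}
```

Note that the BigDecimal is constructed from the string "0.10", not the double literal 0.10; constructing from a double would bake the representation error right back in.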
33.5 Data Synchronization During Migration
This is the section that matters most. Architecture diagrams are neat and routing configurations are straightforward, but data synchronization during the coexistence period is where strangler fig implementations succeed or fail. Every production incident I've seen in strangler fig migrations traces back to data synchronization.
The fundamental problem: during the coexistence period, you have two systems that both need current, consistent data. The legacy CICS system is still processing some transactions against DB2 on z/OS. The modern service is processing other transactions against its own data store (PostgreSQL, MongoDB, whatever). If either system has stale data, customers see wrong numbers, and wrong numbers in banking means regulators, lawyers, and headlines.
33.5.1 Data Synchronization Patterns
There are four patterns for keeping data synchronized during the strangler fig transition. Each comes with serious tradeoffs.
Pattern 1: Legacy as Source of Truth (Recommended for Phase 1-3)
The legacy DB2 database is the single source of truth. The modern service reads from a replica that's kept in sync via Change Data Capture (CDC). The modern service does not write to its own database — all writes go through the legacy system.
┌─────────────────────┐
│ Modern Service │
│ (read-only) │
│ │
│ Reads from ──────►│── PostgreSQL
│ replica │ (read replica)
└─────────────────────┘
▲
│ CDC Stream
│ (near real-time)
┌─────────────────────┐ │
│ Legacy CICS │ │
│ (read + write) │ │
│ │ │
│ DB2 Master ───────┼────┘
│ │
└─────────────────────┘
- Pros: Simple. One source of truth. No conflict resolution needed. Consistent. If the CDC stream lags, the modern service shows slightly stale data, but it's never wrong — just behind.
- Cons: The modern service can't do writes. Limits extraction to read-only services (balance inquiry, transaction history, account summary). Not viable for fund transfer, payments, or any service that modifies data.
- CDC Tools: IBM InfoSphere Data Replication (IIDR, formerly InfoSphere CDC), Debezium with the Db2 connector, IBM Data Gate.
Pattern 2: Dual-Write with Legacy Priority
Both the legacy and modern services can write. Writes go to the legacy system first, then are replicated to the modern system's data store. If there's a conflict, the legacy system wins.
┌─────────────────────┐
│ Modern Service │
│ (read + write) │
│ │
│ Writes to ───────►│── PostgreSQL
│ own DB │
└────────┬────────────┘
│ Write also sent
│ to legacy (sync)
▼
┌─────────────────────┐
│ Legacy CICS │
│ (read + write) │
│ DB2 Master ───────┼──► CDC to PostgreSQL
└─────────────────────┘
- Pros: Modern service can handle writes. Path to full extraction.
- Cons: Complex. Two databases to keep in sync. Dual-write creates a distributed transaction problem — if the legacy write succeeds but the PostgreSQL write fails (or vice versa), you have an inconsistency. Requires saga pattern or compensating transactions.
- Danger: This is where most teams get into trouble. Dual-write is not a database pattern; it's a distributed systems problem. Treat it with the respect you'd give a distributed transaction.
⚠️ Common Pitfall: "We'll just write to both databases in the same transaction." No, you won't. DB2 on z/OS and PostgreSQL on Linux cannot participate in the same two-phase commit without an XA-capable distributed transaction coordinator spanning both platforms. And even if they could, the performance penalty of a cross-platform two-phase commit would make your 200ms SLA impossible. Use asynchronous replication with compensating transactions, not distributed two-phase commit.
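A hedged sketch of what "dual-write with legacy priority plus compensating transaction" looks like in code. The interface names and the MQ hand-off are my illustration of the pattern, not a real SecureFirst implementation:

```java
public class DualWriteCoordinator {
    public interface ModernStore {
        void write(String transferId);       // record the transfer in PostgreSQL
        void compensate(String transferId);  // reverse a prior write
    }

    public interface LegacyChannel {
        boolean send(String transferId);     // e.g. an MQ put toward CICS; false on failure
    }

    // Dual-write with legacy priority: the modern store records the transfer,
    // the legacy system remains the source of truth, and a failed hand-off
    // triggers a compensating transaction instead of a cross-platform 2PC.
    public static boolean transfer(String transferId, ModernStore store, LegacyChannel legacy) {
        store.write(transferId);
        if (!legacy.send(transferId)) {
            store.compensate(transferId);    // undo the modern write; no inconsistency survives
            return false;
        }
        return true;  // the posted result flows back later via CDC
    }
}
```

The key property: after either outcome, the two systems agree. Either both recorded the transfer (and CDC will confirm the posting), or neither did.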
Pattern 3: Event Sourcing
Instead of synchronizing databases, both systems consume and produce events. Every state change is an event (AccountCredited, AccountDebited, HoldPlaced, HoldReleased). Both systems derive their current state from the event log.
- Pros: Elegant. No direct database coupling. Natural audit trail. Both systems can independently maintain their own materialized view of the data.
- Cons: Requires a complete rearchitecture of the data model. Existing COBOL programs don't produce events — they do SQL UPDATEs. Retrofitting event sourcing onto a 40-year-old COBOL system is a multi-year project in itself.
- Verdict: Theoretically ideal, practically impossible for most mainframe brownfield environments. Consider this for greenfield services that are being built alongside the legacy system, not for extracting existing services.
Pattern 4: Shared Database
Both the legacy and modern services access the same DB2 database on z/OS. No replication, no synchronization — they share the data.
- Pros: Simple. No synchronization issues. Data is always consistent because there's only one copy.
- Cons: The modern service must be able to connect to DB2 on z/OS, which means DRDA or z/OS Connect data services. Performance is limited by the network hop between the modern service and z/OS. The modern service is tightly coupled to the DB2 schema, which prevents you from evolving the data model. And you're still paying for DB2 on z/OS, which means you haven't reduced mainframe costs.
- Verdict: Viable as a transitional pattern (Phase 2-3) but not a long-term architecture. The modern service needs its own data store eventually. Use this pattern to get the modern service running, then migrate to Pattern 1 or 2.
33.5.2 SecureFirst's Data Synchronization Strategy
SecureFirst chose a phased approach that moved through the patterns over twelve months:
Months 1-3 (Balance Inquiry extraction): Pattern 1 — Legacy as Source of Truth. The modern Kotlin balance-inquiry service reads from a PostgreSQL replica kept in sync via IBM InfoSphere CDC. Latency target: <5 seconds. Achieved: 2-3 seconds average, 47-second spike during batch window (the incident that started this chapter).
Resolution of the 47-second spike: Yuki's team added a "freshness indicator" to the API response. If the CDC lag exceeds 10 seconds, the response includes a header (X-Data-Freshness: delayed) and the mobile app displays "Balance as of [timestamp]" instead of implying the balance is real-time. This is honest engineering — don't pretend data is fresh when it isn't.
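The freshness check itself is a one-liner. A minimal sketch, assuming a "current" value for the fresh case (the chapter only specifies that lag over 10 seconds is flagged as delayed):

```java
import java.time.Duration;

public class FreshnessHeader {
    static final Duration MAX_FRESH_LAG = Duration.ofSeconds(10);

    // Returns the X-Data-Freshness header value for a given CDC lag.
    // "current" is an assumed value for the in-tolerance case.
    public static String freshness(Duration cdcLag) {
        return cdcLag.compareTo(MAX_FRESH_LAG) > 0 ? "delayed" : "current";
    }
}
```

With the 47-second batch-window lag from the opening incident, this returns "delayed", and the mobile app falls back to "Balance as of [timestamp]".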
Months 4-8 (Transaction History extraction): Pattern 1 continues. Transaction history is also read-only. The same CDC pipeline feeds both balance data and transaction data to PostgreSQL.
Months 9-12 (Fund Transfer extraction — planned): Pattern 2 — Dual-Write with Legacy Priority. When the modern service handles a fund transfer, it writes to PostgreSQL first (for its own use), then sends the transfer instruction to the legacy CICS system via an MQ message. CICS processes the transfer against DB2 (the source of truth), and the result flows back to the modern service via CDC. If the CICS processing fails, the modern service initiates a compensating transaction to reverse its PostgreSQL write.
💡 Key Insight: Notice the progression. You don't start with the hardest synchronization pattern. You start with Pattern 1 (read-only, CDC), prove that the CDC pipeline is reliable, learn its failure modes, understand its latency characteristics, and then move to Pattern 2 (dual-write). Each phase builds on the infrastructure and the team's experience from the previous phase. This is the strangler fig philosophy applied to the data layer — incremental, reversible, confidence-building.
33.5.3 Change Data Capture in Practice
CDC for DB2 on z/OS is well-established technology. IBM InfoSphere Data Replication (IIDR) reads the DB2 recovery log (locating it through the bootstrap data set, or BSDS) and streams row-level changes to a target. Debezium, the open-source CDC framework, has a Db2 connector that works for Db2 on distributed platforms, but for Db2 on z/OS you typically need IIDR or IBM Data Gate.
Key operational considerations for CDC in a strangler fig context:
Latency. CDC from DB2 z/OS to a cloud-hosted PostgreSQL has three latency components: log read latency (how quickly the CDC tool reads from the DB2 log), network transfer latency (z/OS to cloud), and apply latency (how quickly the target database applies the changes). Under normal load, total latency is 2-5 seconds. During peak batch processing, when the DB2 log is generating gigabytes of changes per hour, latency can spike to 30-60 seconds.
Schema evolution. When the DB2 table schema changes (ALTER TABLE ADD COLUMN), the CDC pipeline must handle the schema change without data loss and without stopping replication. This requires coordination between the DBA, the CDC team, and the modern service team. Add schema change coordination to your strangler fig runbook.
Initial load. Before CDC can stream changes, the target database needs a full copy of the source data. For a large DB2 table (billions of rows), this initial load can take hours or days. Plan for this. Run the initial load during a maintenance window, and start CDC replication from a known log position after the load completes.
Monitoring. CDC pipelines fail silently. The DB2 log wraps, a network hiccup drops a batch of changes, a schema mismatch causes the apply to stop — and nobody notices until a customer complains about stale data. Monitor three metrics:
- Replication lag (seconds behind source) — alert if >30 seconds
- Apply throughput (rows/second) — alert if drops to zero
- Error count — alert on any non-zero value
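The three alert rules above reduce to a small health check. A sketch (names and message formats are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class CdcMonitor {
    // Evaluate the three CDC health metrics; return the alerts to raise.
    public static List<String> check(double lagSeconds, double rowsPerSecond, long errorCount) {
        List<String> alerts = new ArrayList<>();
        if (lagSeconds > 30) alerts.add("REPLICATION_LAG: " + lagSeconds + "s behind source");
        if (rowsPerSecond == 0) alerts.add("APPLY_STALLED: throughput dropped to zero");
        if (errorCount > 0) alerts.add("APPLY_ERRORS: " + errorCount + " non-zero error count");
        return alerts;
    }
}
```

Run this on a schedule (every minute is typical) and page on any non-empty result; silent CDC failure is exactly the scenario these three rules exist to catch.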
33.6 Testing the Transition
You've built the facade. You've configured the routing. You've set up CDC. Now comes the most important phase of the strangler fig: proving that the new service is equivalent to the old one. In banking, "equivalent" means "produces identical results for every possible input, including edge cases that haven't occurred in three years but could occur tomorrow."
33.6.1 Shadow Mode Testing
Shadow mode (also called "dark launching") is the first testing phase. The facade routes all production traffic to the legacy system, but also sends a copy of each request to the modern service. The modern service processes the request and returns a response, but the response is discarded — only the legacy response goes back to the consumer. The comparison engine logs both responses for offline analysis.
Shadow mode answers one question: does the modern service produce the same result as the legacy service for real production traffic?
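The routing mechanics are worth seeing in miniature. A hedged Java sketch of a shadow-mode handler: the legacy response is returned to the caller, the modern service processes an asynchronous copy, and both responses go to the comparison log (the functional-interface shape is mine, not the facade's real API):

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.BiConsumer;
import java.util.function.Function;

public class ShadowRouter {
    // Shadow mode: only the legacy response reaches the consumer; the
    // modern response is captured for comparison, then discarded.
    public static String handle(String request,
                                Function<String, String> legacy,
                                Function<String, String> modern,
                                BiConsumer<String, String> comparisonLog) {
        String legacyResponse = legacy.apply(request);
        CompletableFuture.supplyAsync(() -> modern.apply(request))
                .thenAccept(modernResponse -> comparisonLog.accept(legacyResponse, modernResponse));
        return legacyResponse;  // the consumer never sees the modern response
    }
}
```

The asynchronous forward matters: a slow or failing modern service must never add latency to, or break, the production path.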
Duration: 2-4 weeks minimum. SecureFirst ran shadow mode for three weeks on balance inquiry and discovered:
- The COMP-3 vs. floating-point rounding issue (3.2% of accounts, described in Section 33.4.3)
- A timezone discrepancy where the modern service returned timestamps in UTC but the legacy service returned them in ET (Eastern Time). Both were "correct" by their own conventions, but consumers expected ET.
- Seven accounts with data that triggered an edge case in the legacy COBOL program (accounts with a specific combination of hold types) that the modern service hadn't implemented. These seven accounts represented 0.0003% of the customer base and used an obscure hold type (judicial garnishment hold with partial release) that the extraction team hadn't encountered in their analysis.
That last one is why shadow mode testing with production traffic is non-negotiable. You cannot discover every edge case through unit tests, integration tests, or even months of QA testing with synthetic data. Only real production traffic — with its full diversity of account states, hold types, fee structures, and regulatory flags — reveals the edge cases that matter.
⚠️ Common Pitfall: "We'll run shadow mode for a few days and if everything matches, we'll go to canary." A few days isn't enough. You need to cover at least one full monthly cycle (to catch month-end processing effects), one full statement period, and ideally one quarter-end (to catch quarterly regulatory calculations). Three weeks is the minimum; a full month is better.
33.6.2 Parallel Running
Once shadow mode testing achieves a target match rate (SecureFirst's threshold: 99.99% exact match on financial fields, 99.9% on non-financial fields for 14 consecutive days), you move to parallel running.
In parallel running, both services handle real production traffic. The facade routes each request to one service (the "primary") and shadows it to the other. The primary's response goes to the consumer. The comparison engine continues logging discrepancies.
The difference from shadow mode: in parallel running, you gradually shift which service is primary. Week one: legacy is primary for 100% of requests. Week two: modern is primary for 10%. Week three: 25%. Week four: 50%. Each increase happens only if the discrepancy rate stays below threshold.
Parallel Running Dashboard Metrics:
╔══════════════════════════════════════════════════════════╗
║ STRANGLER FIG DASHBOARD — Balance Inquiry ║
║ Phase: Parallel Running — Week 3 (25% modern primary) ║
╠══════════════════════════════════════════════════════════╣
║ ║
║ Traffic Split: ║
║ Legacy primary: 75.0% (148,293 requests/hour) ║
║ Modern primary: 25.0% ( 49,431 requests/hour) ║
║ ║
║ Match Rate (last 24h): ║
║ Financial fields: 99.997% (6 mismatches / 1.97M) ║
║ Non-financial: 99.94% (118 mismatches / 1.97M) ║
║ ║
║ Response Time (p95): ║
║ Legacy: 42ms ║
║ Modern: 28ms ║
║ ║
║ Error Rate: ║
║ Legacy: 0.002% ║
║ Modern: 0.003% ║
║ ║
║ CDC Lag: 3.1 seconds (target: <30s) ║
║ ║
║ Ramp-Up Decision: ║
║ ✅ Financial match rate > 99.99% ║
║ ⚠️ Non-financial match rate < 99.95% (investigate) ║
║ ✅ Response time within 150% ║
║ ✅ Error rate within tolerance ║
║ ✅ CDC lag within tolerance ║
║ ║
║ RECOMMENDATION: Hold at 25% — investigate ║
║ non-financial mismatches before ramp-up ║
╚══════════════════════════════════════════════════════════╝
The six financial field mismatches in the dashboard above? Three were caused by a race condition where a deposit posted between the legacy and modern queries (a timing issue, not a logic issue — expected and acceptable). Two were caused by a CDC lag spike during batch processing. One was a genuine bug in the modern service's handling of accounts with negative available balances (overdrawn accounts with pending deposits). That last one was fixed, and the ramp-up continued after a 48-hour observation period.
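The ramp-up decision at the bottom of the dashboard is mechanical enough to automate. A sketch of the gate as a predicate, using the thresholds shown on the dashboard; the error-rate check is omitted because the chapter gives no numeric ramp-up tolerance for it:

```java
public class RampUpGate {
    // Hold the canary at its current percentage unless every gate passes.
    public static boolean canRampUp(double financialMatchPct,
                                    double nonFinancialMatchPct,
                                    long legacyP95Ms, long modernP95Ms,
                                    double cdcLagSeconds) {
        return financialMatchPct > 99.99        // financial match rate
            && nonFinancialMatchPct >= 99.95    // non-financial match rate
            && modernP95Ms <= legacyP95Ms * 1.5 // response time within 150% of legacy
            && cdcLagSeconds < 30;              // CDC lag within tolerance
    }
}
```

Fed the week-3 numbers above (99.997% financial, 99.94% non-financial), the gate returns false, matching the dashboard's "Hold at 25%" recommendation.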
33.6.3 Canary Deployment
Canary deployment is a refinement of parallel running where you expose the modern service to a small, controlled subset of real users. Unlike shadow mode (where the modern service's response is discarded), canary users actually see the modern service's response.
SecureFirst's canary strategy for balance inquiry:
| Canary Ring | Users | Duration | Rollback Trigger |
|---|---|---|---|
| Ring 0 | SecureFirst employees (internal) | 2 weeks | Any financial mismatch |
| Ring 1 | 5% of external users (lowest-balance accounts) | 1 week | >0.01% financial mismatch rate |
| Ring 2 | 25% of external users | 2 weeks | >0.001% financial mismatch rate |
| Ring 3 | 50% of external users | 2 weeks | >0.001% financial mismatch rate |
| Ring 4 | 100% of external users | Indefinite | >0.0001% financial mismatch rate |
Why lowest-balance accounts for Ring 1? Because if there's a display error, the dollar impact is smallest. Showing a customer with $47 in their account a balance of $46.99 is a problem, but showing a customer with $470,000 a balance of $469,999.99 is a much bigger problem — and a much bigger regulatory exposure.
💡 Key Insight: Canary rings are not just about percentage of traffic — they're about controlling blast radius. Route canary traffic to accounts where the impact of an error is lowest. This means: smallest balances, fewest pending transactions, simplest account structures. Save the complex accounts (trusts, joint accounts with multiple signers, business accounts with sub-accounts) for Ring 3 or later, after you've built confidence with simple accounts.
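The blast-radius rule can be sketched as a ring-eligibility function. The dollar and pending-transaction thresholds here are hypothetical, chosen only to illustrate the shape of the decision:

```java
import java.math.BigDecimal;

public class CanaryRingPlanner {
    // Earliest canary ring an account may enter. Smaller balances, no
    // pending activity, and simple structures go first; complex accounts
    // (trusts, joint accounts, business sub-accounts) wait until Ring 3.
    public static int earliestRing(BigDecimal balance, int pendingTransactions, boolean complexStructure) {
        if (complexStructure) return 3;
        if (balance.compareTo(new BigDecimal("1000")) < 0 && pendingTransactions == 0) return 1;
        if (balance.compareTo(new BigDecimal("50000")) < 0) return 2;
        return 3;
    }
}
```

Under these illustrative thresholds, the $47 account from the paragraph above is Ring 1 material; the $470,000 account waits for Ring 3.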
33.6.4 Rollback Strategy
Every phase of the strangler fig must have a rollback plan that can be executed in under 60 seconds. The facade is the rollback mechanism — changing the routing configuration instantly redirects traffic to the legacy system.
SecureFirst's rollback procedure:
STRANGLER FIG ROLLBACK PROCEDURE — Balance Inquiry
═══════════════════════════════════════════════════
TRIGGER CONDITIONS (any one triggers rollback):
- Financial field mismatch rate > threshold for current phase
- Modern service error rate > 0.1%
- Modern service p99 response time > 500ms
- CDC replication lag > 60 seconds for > 5 minutes
- Manual trigger by on-call engineer
ROLLBACK STEPS:
1. Execute: kong config set canary.percentage=0
Effect: All traffic immediately routes to legacy
Time: < 5 seconds
Verify: kong config get canary.percentage returns 0
2. Verify legacy service health:
- Check CICS BALINQ transaction rate (should increase to 100%)
- Check DB2 thread utilization (should handle full load)
- Check response times (should be normal)
3. Notify:
- Page on-call engineer (if not already paged)
- Email strangler-fig-team distribution list
- Update #strangler-fig Slack channel
4. DO NOT:
- Shut down the modern service (leave it running for diagnostics)
- Reset the CDC pipeline (it's still needed for other services)
- Panic (the legacy system has been handling this traffic for 15 years)
POST-ROLLBACK:
- Conduct root cause analysis within 24 hours
- Document the failure in the strangler fig journal
- Fix the issue and re-enter shadow mode for the affected service
- Resume canary ramp-up only after 7 consecutive days of clean shadow
33.7 When to Stop Strangling
This is the question nobody talks about, and it's arguably more important than how to start. The strangler fig pattern has an implicit assumption: eventually, you'll extract everything and the legacy system will be decommissioned. But that assumption is wrong for most mainframe systems.
33.7.1 The Asymptotic Problem
In practice, strangler fig migrations follow an asymptotic curve. The first 20% of services (by transaction volume) are extracted in the first 30% of the timeline. The next 30% take another 40% of the timeline. The last 50% — the complex, tightly coupled, mission-critical services — would take another 80% of the original timeline, except that by now the project is over budget and behind schedule, and the business has stopped seeing ROI from the migration.
This is what Diane Okoye at Pinnacle Health calls "the last mile problem." At Pinnacle, the strangler fig successfully extracted eligibility checking, provider lookup, and claims status inquiry to modern microservices in eighteen months. But claims adjudication — the core engine, 340,000 lines of COBOL with 40 years of health insurance business rules — resisted extraction. Two years of analysis produced a 200-page specification document and the conclusion that rewriting the adjudication engine would cost $45 million and take three years, with no guarantee of regulatory compliance.
Diane's recommendation: "Stop strangling. The adjudication engine stays on CICS. It works. It's fast. It's compliant. We've extracted everything around it that benefits from extraction. The strangler fig's job is done — and the fig has a hollow center where the tree used to be, and that's fine."
33.7.2 Completion Criteria
Design your strangler fig plan with explicit completion criteria. Don't define success as "everything is off the mainframe." Define success as "the business outcomes that motivated the modernization have been achieved."
SecureFirst's completion criteria:
| Business Outcome | Metric | Target | Status |
|---|---|---|---|
| Mobile banking parity | Features available on mobile vs. branch | 90% | In progress |
| API response time | p95 latency for mobile APIs | <200ms | Achieved for balance, history |
| Development velocity | Time to deliver a new feature | <2 weeks (down from 3 months) | Achieved for extracted services |
| Operational cost | Annual mainframe MIPS cost reduction | 30% | 18% achieved |
| Staff flexibility | Developers who can work on both mainframe and modern | 60% of team | 45% achieved |
Notice: "decommission the mainframe" is not on this list. The mainframe stays. Fund transfer, wire transfer, and loan origination stay on CICS/DB2 — they're working, they're fast, they're compliant, and extracting them would cost more than the savings would justify.
33.7.3 The Hybrid Steady State
The endpoint of most mainframe strangler fig implementations is not "no mainframe." It's a hybrid architecture where:
- The mainframe handles high-throughput OLTP (fund transfer, payment processing, ledger updates), complex batch processing (nightly posting, regulatory reporting), and services where z/OS provides unique value (Parallel Sysplex availability, WLM, RACF security).
- Modern services handle consumer-facing APIs (mobile, web), analytics and reporting (data warehouse, ML models), services with high change frequency (promotions, notifications, personalization), and new capabilities that don't exist on the mainframe.
- The facade/gateway mediates between them, providing a single API surface to consumers regardless of which backend serves the request.
This is the "hybrid architecture" that Chapter 37 will elaborate on. The strangler fig doesn't kill the tree — it creates a symbiotic architecture where each platform handles what it does best.
💡 Key Insight: Know when to stop. A strangler fig migration that extracts 60% of services and achieves 90% of the business value is a success. A strangler fig migration that tries to extract 100% and runs over budget by 200% is a failure — even if it eventually completes. The goal is business value, not architectural purity.
33.7.4 The Decommissioning Decision
For each legacy COBOL module, the decommission decision follows this flowchart:
Is all traffic routed to the modern service?
├── No → Keep the module active
└── Yes → Has parallel running shown 99.99%+ match for 30+ days?
├── No → Continue parallel running
└── Yes → Is any other module dependent on this one?
├── Yes → Can the dependent module be modified to use the modern service?
│ ├── No → Keep the module active as an internal dependency
│ └── Yes → Modify the dependent, then reassess
└── No → Move to standby
└── Has the module been in standby for 90+ days with no issues?
├── No → Continue standby
└── Yes → DECOMMISSION
├── Remove from CICS CSD
├── Archive source code (do NOT delete)
├── Update documentation
└── Update RACF profiles
The 90-day standby period is non-negotiable. Sandra Chen at FBA learned this the hard way when a decommissioned module turned out to be called by a quarterly regulatory reporting batch job. The module had been "unused" for 88 days — two days short of the quarter-end batch run that needed it.
⚠️ Common Pitfall: "The module has been in standby for a month with no calls. Let's decommission." One month is not enough. You need to cover all periodic cycles: monthly close, quarterly reporting, annual regulatory submissions, leap year calculations, and fiscal year-end processing. The minimum standby period should be the longest periodic cycle plus a buffer — typically 90-180 days.
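The decommissioning flowchart translates directly into code. A sketch of the decision function (the enum names are mine; the logic follows the flowchart above, including the non-negotiable 90-day standby):

```java
public class DecommissionGate {
    public enum Decision {
        KEEP_ACTIVE, CONTINUE_PARALLEL, MODIFY_DEPENDENT,
        MOVE_TO_STANDBY, CONTINUE_STANDBY, DECOMMISSION
    }

    public static Decision evaluate(boolean allTrafficModern,
                                    boolean parallelMatch30Days,   // 99.99%+ match for 30+ days
                                    boolean hasDependents,
                                    boolean dependentsModifiable,
                                    boolean inStandby,
                                    int standbyDaysClean) {
        if (!allTrafficModern) return Decision.KEEP_ACTIVE;
        if (!parallelMatch30Days) return Decision.CONTINUE_PARALLEL;
        if (hasDependents) {
            // An unmodifiable dependent keeps the module alive indefinitely.
            return dependentsModifiable ? Decision.MODIFY_DEPENDENT : Decision.KEEP_ACTIVE;
        }
        if (!inStandby) return Decision.MOVE_TO_STANDBY;
        return standbyDaysClean >= 90 ? Decision.DECOMMISSION : Decision.CONTINUE_STANDBY;
    }
}
```

With Sandra Chen's 88 days of clean standby, this returns CONTINUE_STANDBY, not DECOMMISSION, which is exactly the point: the quarter-end batch job two days away is why the 90-day floor exists.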
33.8 Applying the Strangler Fig to the HA Banking System
Let's apply everything from this chapter to your progressive project. In Chapter 32, you assessed the HA Banking Transaction Processing System and identified which components to modernize. Now you'll design the strangler fig plan for the first extraction: balance inquiry.
33.8.1 Extraction Scorecard for the HA Banking System
Using the extraction scorecard from Section 33.3, score each component of your HA banking system:
| Component | Biz Value | Tech Complex | Data Coupling | Change Freq | Risk Toler. | Priority |
|---|---|---|---|---|---|---|
| Balance Inquiry | 5 | 2 | 2 | 4 | 4 | 8.00 |
| Transaction History | 5 | 2 | 3 | 3 | 4 | 7.00 |
| Account Summary | 4 | 2 | 2 | 2 | 4 | 5.50 |
| Statement Generation | 3 | 3 | 3 | 1 | 3 | 2.33 |
| Fund Transfer | 5 | 5 | 5 | 3 | 1 | 2.30 |
| ACH Processing | 4 | 5 | 5 | 2 | 1 | 2.00 |
| Interest Calculation | 3 | 4 | 4 | 1 | 1 | 1.50 |
| Regulatory Reporting | 2 | 5 | 5 | 1 | 1 | 0.90 |
The extraction order is clear: Balance Inquiry first, then Transaction History, then Account Summary. Fund Transfer and everything below it stays on CICS for now — the complexity and risk aren't justified by the business value of extraction.
33.8.2 The Balance Inquiry Strangler Fig Plan
Here's the specific plan for extracting balance inquiry from your HA banking system. This is what you'll implement in the project checkpoint.
Phase 0: Prepare (Weeks 1-4)
- Wrap the existing BALINQ CICS transaction as a REST web service using z/OS Connect (you built this in Chapter 21's project checkpoint)
- Deploy an API gateway (Kong on OpenShift or AWS API Gateway)
- Set up CDC from DB2 z/OS to PostgreSQL on the target platform
- Define the API contract (OpenAPI 3.0 specification) for balance inquiry
- Establish the comparison engine and monitoring dashboard
Phase 1: Build the Modern Service (Weeks 5-10)
- Implement the balance-inquiry microservice in the target language/platform
- Use BigDecimal (or equivalent packed decimal) for all monetary calculations
- Implement the same COMMAREA-equivalent contract as the CICS service
- Unit test against the BALINQ copybook's field specifications
- Integration test against the PostgreSQL replica
Phase 2: Shadow Mode (Weeks 11-14)
- Deploy the modern service alongside the legacy
- Configure the facade to shadow all balance-inquiry traffic to the modern service
- Run the comparison engine, targeting 99.99% financial field match
- Investigate and fix all discrepancies
- Minimum duration: cover one full month-end cycle
Phase 3: Canary Deployment (Weeks 15-20)
- Ring 0: Internal users, 2 weeks
- Ring 1: 5% of external users (lowest balances), 1 week
- Ring 2: 25%, 2 weeks
- Ring 3: 50%, 2 weeks (with comparison logging)
- Rollback triggers defined and tested
Phase 4: Full Migration (Weeks 21-24)
- 100% of traffic to modern service
- Legacy CICS BALINQ on standby (still receiving shadow traffic for comparison)
- CDC pipeline continues (other services still use DB2 master)
Phase 5: Decommission (Week 24 + 90 days)
- After 90 days of standby with no issues, decommission the CICS BALINQ program
- Archive source code and COMMAREA copybook
- Update CSD, RACF profiles, and documentation
- Remove the z/OS Connect service definition for balance inquiry
- Celebrate — but quietly, because fund transfer is next
33.8.3 Risk Register
Every strangler fig plan needs a risk register. Here's the one for the HA banking system balance-inquiry extraction:
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| CDC lag spike during batch window causes stale balances in modern service | High | Medium | Implement freshness indicator; alert on lag >30s; route to legacy during batch window if lag exceeds SLA |
| COMP-3 vs. floating-point rounding differences | High | High | Mandate BigDecimal/packed-decimal equivalents in modern service; validate in shadow mode before canary |
| Edge case in balance calculation not discovered until canary | Medium | High | Extend shadow mode to cover month-end, quarter-end; compare 100% of requests, not a sample |
| API gateway becomes single point of failure | Low | Critical | Deploy gateway in HA configuration (multiple instances, health checks, auto-failover); define bypass route directly to z/OS Connect |
| Team lacks experience with both mainframe and modern | Medium | Medium | Pair Kwame (mainframe) with Carlos-equivalent (modern) on every extraction; knowledge transfer is a project deliverable, not a side effect |
| Regulatory concern about data residency/sovereignty during CDC | Low | High | Confirm with compliance that CDC replication complies with data residency requirements; document the data flow for auditors |
33.9 Real-World Patterns and Anti-Patterns
Twenty-five years of watching modernization projects has given me a fairly reliable list of what works and what doesn't. Let me share the patterns that have saved projects and the anti-patterns that have killed them.
Patterns That Work
Pattern: Start with the API, not the service.
Before extracting a single service, expose the entire legacy system through an API facade (Chapter 21). This gives consumers a modern interface immediately, decouples them from the mainframe's internal structure, and establishes the contract layer that the strangler fig's routing engine will use. Many organizations discover that exposing COBOL through APIs is 80% of the business value they were looking for, and the urgency to extract individual services decreases.
Pattern: Extract read services first, write services later.
Read-only services (balance inquiry, transaction history, account summary) have dramatically simpler data synchronization requirements than write services (fund transfer, payment processing). Extract all the read services first. By the time you get to write services, your CDC pipeline is battle-tested, your team has learned the extraction lifecycle, and your comparison engine is proven.
Pattern: Keep the legacy system funded and maintained.
The worst thing you can do during a strangler fig migration is treat the legacy system as "dead." It's not dead — it's the safety net. Keep maintaining it. Keep patching it. Keep the COBOL developers on staff. The moment the legacy system becomes a neglected afterthought, you lose your rollback capability, and the strangler fig becomes a high-wire act without a net.
Pattern: Celebrate each extraction as a milestone.
Strangler fig migrations take years. If the only milestone is "done," the team will lose morale two years before they get there. Celebrate each service extraction as a completed project. SecureFirst threw a team dinner when balance inquiry went to 100% modern. Yuki bought Carlos a bonsai tree with a note: "One service down. It's growing."
Anti-Patterns That Kill
Anti-pattern: Extracting multiple services simultaneously.
"We'll save time by extracting balance inquiry, transaction history, and account summary in parallel." No, you won't. You'll triple your complexity, triple your data synchronization challenges, and triple your debugging surface area when something goes wrong. Extract one service at a time. The second extraction will go twice as fast as the first because you've learned the process.
Anti-pattern: Skipping shadow mode.
"Our unit tests have 95% coverage, so we'll go straight to canary." Unit tests cover the cases you thought of. Shadow mode covers the cases you didn't. The seven accounts with judicial garnishment holds at SecureFirst would never have been caught by unit tests — no one thought to write a test for that edge case because no one knew it existed.
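The essential property of shadow mode is that the modern path can never hurt the caller: the legacy answer is always the one returned, and disagreements are only recorded. A minimal sketch of that comparison logic (hypothetical names; real implementations would also normalize timestamps and redact PII before logging):

```python
# Shadow-mode comparison sketch: the legacy result is always served to the
# caller; the modern result is computed on the side and any mismatch or crash
# is recorded for investigation. No customer ever sees the modern output.

mismatches = []

def shadow_compare(request, legacy_fn, modern_fn):
    legacy_result = legacy_fn(request)
    try:
        modern_result = modern_fn(request)
        if modern_result != legacy_result:
            mismatches.append({"request": request,
                               "legacy": legacy_result,
                               "modern": modern_result})
    except Exception as exc:
        # A crash in the modern path must never affect the caller.
        mismatches.append({"request": request, "error": repr(exc)})
    return legacy_result  # legacy remains the system of record

# A garnishment hold is exactly the edge case only production traffic finds:
legacy = lambda req: {"balance": 0, "hold": "JUDICIAL"}
modern = lambda req: {"balance": 11847, "hold": None}  # modern code misses it
served = shadow_compare({"account": "777"}, legacy, modern)
```

Running production traffic through this harness is what surfaced SecureFirst's seven garnishment accounts: the divergence shows up in the mismatch log long before any customer could be affected.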
Anti-pattern: Building the facade as a monolith.
If your API gateway is a hand-built Java application with custom routing logic, business rule validation, data transformation, and monitoring — congratulations, you've built a new legacy system. The facade should be as thin as possible, using commodity infrastructure (Kong, Apigee, z/OS Connect) that your team doesn't have to maintain.
Anti-pattern: Defining success as "mainframe off."
This one kills more projects than all the technical anti-patterns combined. If your definition of success requires the mainframe to be decommissioned, you will either fail (because the last 30% of services resist extraction) or succeed at catastrophic cost. Define success in business terms: API response times, development velocity, operational cost, staff flexibility. Let the architecture be whatever achieves those outcomes.
🔗 Forward Reference: Chapter 37 (Hybrid Architecture) will bring together the strangler fig's endpoint, the cloud integration patterns from Chapter 34, and the DevOps pipeline from Chapter 36 into a unified architectural vision. The strangler fig is one execution pattern; Chapter 37 shows how all the patterns compose into a sustainable long-term architecture.
Chapter Summary
The strangler fig pattern is the most important execution pattern for mainframe modernization. It transforms a single, catastrophic bet (big-bang migration) into a series of small, reversible experiments (incremental service extraction). The pattern has four components:
- The facade — a stateless, observable, reversible routing layer that sits in front of both legacy and modern services
- The routing engine — the decision-making component that controls which service handles each request
- The extraction pipeline — the lifecycle from identification through shadow mode, parallel running, canary deployment, and decommission
- The data synchronization layer — the hardest part, using CDC, dual-write, or shared database patterns to keep legacy and modern data consistent
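The routing engine's canary logic can be sketched in a few lines. The version below (a Python illustration, not SecureFirst's production code) hashes the account number so a given customer always takes the same path, and makes rollback a pure configuration change:

```python
# Routing-engine sketch: a deterministic hash of the account number decides
# legacy vs modern, so the same customer always takes the same path and the
# canary percentage can be dialed up, or rolled back, without a deploy.

import zlib

def route(account_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of accounts to 'modern', rest to 'legacy'."""
    bucket = zlib.crc32(account_id.encode()) % 100  # stable bucket 0..99
    return "modern" if bucket < canary_percent else "legacy"

# Rollback is a config change: set canary_percent to 0 and every request
# returns to the legacy path on its very next call.
assert route("ACCT-0042", 0) == "legacy"
```

The deterministic hash matters more than it looks: with random routing, a customer could see the modern balance on one request and the legacy balance on the next, recreating exactly the inconsistency from the chapter opening.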
The key lessons from SecureFirst's implementation:
- Start with read-only services — they have simpler data synchronization and higher risk tolerance
- Use the extraction scorecard to choose candidates based on priority (business value and change frequency divided by technical complexity and data coupling), not just business value alone
- Shadow mode testing with production traffic is non-negotiable — it discovers the edge cases that unit tests can't
- Design for rollback — every phase must be reversible in under 60 seconds
- Know when to stop — the hybrid steady state is the realistic endpoint for most organizations
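The scorecard formula from the second lesson can be written out directly. The scores and candidate names below are illustrative, not SecureFirst's actual data; the point is how the formula behaves:

```python
# Extraction-scorecard sketch, following the formula in the text:
#   priority = (business value * change frequency)
#              / (technical complexity * data coupling)
# All scores here are made-up 1-10 ratings for illustration.

def priority(value, change_freq, complexity, coupling):
    return (value * change_freq) / (complexity * coupling)

candidates = {
    "balance inquiry":     priority(8, 6, 2, 2),  # read-only, low coupling
    "transaction history": priority(7, 5, 3, 2),
    "fund transfer":       priority(9, 7, 8, 9),  # high value, write-heavy
}

# Read services float to the top even though fund transfer has the highest
# raw business value -- complexity and coupling in the denominator dominate.
best = max(candidates, key=candidates.get)
```

This is why "business value alone" is the wrong selector: fund transfer scores highest on value but lands last on priority once its data coupling is priced in.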
The strangler fig doesn't kill the tree. It creates a partnership between the old and the new, where each component runs on the platform that serves it best. That's not a compromise — it's good architecture.
What's Next
Chapter 34 (COBOL-to-Cloud Integration) builds on the strangler fig by addressing the specific challenges of running extracted services in cloud environments — container packaging, cloud-native data stores, network architecture between z/OS and cloud, and the cost models that determine whether cloud hosting actually saves money. The strangler fig gets the service out of CICS; Chapter 34 gets it into the cloud.
Chapter 33 of 40. Progressive project checkpoint: code/project-checkpoint.md. Estimated study time: 5 hours. Recommended approach: Read Sections 33.1-33.3 in one session, 33.4-33.5 in a second session, 33.6-33.9 in a third session.