
Learning Objectives

  • Design a strangler fig migration plan for incrementally extracting COBOL services
  • Implement facade/proxy patterns that route between legacy COBOL and new services
  • Manage the transition period where old and new systems coexist
  • Design data synchronization strategies during migration
  • Apply the strangler fig to the HA banking system

"You don't replace a forty-year-old system by ripping it out. You replace it the way a strangler fig replaces a tree — slowly, from the outside in, until one day the original is gone and nobody noticed the transition." — Yuki Nakamura, DevOps Lead, SecureFirst Retail Bank

Chapter Overview

Carlos Vega remembers the exact moment he understood why the strangler fig pattern exists.

It was 11:42 PM on a Thursday in March 2024, and he was staring at his laptop in SecureFirst's operations center, surrounded by cold pizza boxes and the quiet hum of people trying very hard not to panic. SecureFirst's mobile banking app — the one Carlos had spent eighteen months building as a Java/Kotlin microservices architecture — had been live for exactly nine hours. And for the last two of those hours, balance inquiries had been returning numbers that didn't match what the CICS core banking system showed.

Not by pennies. By thousands of dollars.

The root cause, which took Carlos and Yuki Nakamura another four hours to find, was a timing issue in the batch cutover. SecureFirst's new microservice was querying a replicated PostgreSQL database that was supposed to mirror the DB2 master. But the CDC pipeline had a 47-second lag during peak batch processing, and the mobile app was showing pre-posting balances while the CICS green-screen tellers were seeing post-posting balances. Same accounts, same moment, different numbers. A customer who'd checked their balance on the app at 9:15 PM saw $14,200. The teller who looked it up thirty seconds later saw $11,847 — after the nightly mortgage payment posted.

"That," Carlos told Yuki at 4 AM, "is why you don't do big-bang cutovers for banking systems."

Yuki, who had been saying exactly this for months, was too tired to say "I told you so." Instead she said: "Monday morning, we redesign this as a strangler fig. We keep the CICS system as the source of truth and we peel services off one at a time, with parallel running, until we trust each one enough to let it fly alone."

That conversation — born of a near-disaster that could have triggered regulatory action if a customer had disputed the discrepancy — is why this chapter exists. The strangler fig pattern is the single most important execution pattern for mainframe modernization. Chapter 32 taught you what to modernize and why. This chapter teaches you how — incrementally, safely, and with the kind of paranoid attention to data consistency that banking regulators demand.

What you will learn in this chapter:

  1. How to design a strangler fig migration plan that incrementally extracts COBOL services into modern implementations without disrupting the running system
  2. How to build facade and proxy layers that route traffic between legacy COBOL and new services — using CICS web services, z/OS Connect, and API gateways
  3. How to manage the coexistence period where old and new systems run simultaneously, including data synchronization, conflict resolution, and rollback strategies
  4. How to test the transition using parallel running, canary deployments, and dark launching
  5. How to apply the strangler fig pattern to the HA banking system's balance-inquiry service

Learning Path Annotations:

  • 🏃 Fast Track: If you've already done API-first modernization (Chapter 21), skip to Section 33.3 — you know the facade layer; what you need is the extraction decision framework and the data synchronization patterns.
  • 📚 Deep Dive: Sections 33.5 and 33.6 are where the real complexity lives. Budget extra time for the data synchronization patterns — they're the difference between a smooth migration and a 4 AM incident.
  • 🔗 Connection: This chapter operationalizes the "Refactor" strategy from Chapter 32's decision framework. If you scored an application as "Refactor — API-wrap and incrementally extract," this chapter is your execution playbook.

Spaced Review — Concepts from Earlier Chapters:

📊 Review from Chapter 13 (CICS Architecture): CICS region topology — AOR/TOR/FOR — is the foundation of the facade pattern. The TOR (Terminal-Owning Region) already acts as a routing layer, directing transactions to the appropriate AOR. The strangler fig's facade extends this concept: instead of routing only within CICS, we route between CICS and external services. If you're fuzzy on MRO routing and Sysplex-wide workload distribution, revisit Chapter 13, Section 13.4 before proceeding.

📊 Review from Chapter 21 (API-First COBOL): z/OS Connect and the API mediation layer you built in Chapter 21 are the technical foundation for the strangler fig's facade. The OpenAPI specifications, rate limiting, and versioning strategy from that chapter become the contract layer that both the legacy and modern services must honor. If you skipped Chapter 21, you'll need at least Sections 21.2 and 21.4 to follow the facade implementation in this chapter.

📊 Review from Chapter 32 (Modernization Strategy): The threshold concept — modernization is not migration — applies directly. The strangler fig is a modernization pattern, not a migration pattern. We're not moving everything off the mainframe. We're selectively extracting services where extraction provides clear business value, while leaving high-throughput, mission-critical transaction processing on the platform that was built for it. Sandra Chen's decision framework from Chapter 32 tells you which services to extract; this chapter tells you how.


33.1 The Strangler Fig Metaphor

Martin Fowler named the pattern in 2004, borrowing from the strangler fig trees he'd seen in the rainforests of Queensland. The Ficus genus includes species that germinate in the canopy of a host tree, send roots down along the host's trunk, and gradually — over decades — envelop the host tree entirely. The host tree eventually dies and decomposes, leaving the strangler fig standing in its place, its roots forming a hollow lattice where the host tree used to be.

The metaphor maps to software with uncomfortable precision:

| Biological Strangler Fig | Software Strangler Fig |
| --- | --- |
| Seeds germinate in the host's canopy | New services are built alongside the legacy system |
| Roots grow down alongside the host trunk | New services share the legacy system's data and interfaces |
| The fig gradually intercepts sunlight and nutrients | The facade gradually routes traffic to new services |
| The host tree dies slowly, from within | Legacy modules are decommissioned one by one |
| The fig stands where the tree stood | The modern system occupies the same business function |
| Nobody notices the exact moment the tree dies | Nobody notices the exact moment the last COBOL module is decommissioned |

But here's where the metaphor breaks and the reality of mainframe modernization diverges from a tidy biological analogy: the host tree doesn't fight back. A COBOL/CICS system running 500 million transactions a day has forty years of accumulated business rules, undocumented edge cases, batch dependencies, and regulatory compliance requirements that actively resist extraction. The strangler fig pattern for mainframes is less "peaceful vine grows alongside tree" and more "perform open-heart surgery on a patient who must continue running a marathon throughout the procedure."

The Pattern in Three Sentences

  1. Build a facade that sits in front of both the legacy system and any new services, presenting a single interface to consumers.
  2. Incrementally route traffic from the facade to new service implementations, one functional area at a time, while the legacy system continues handling everything else.
  3. Decommission legacy modules only after the new service has been proven equivalent through parallel running and validated in production.

That's it. Three sentences. The rest of this chapter is about why each of those sentences hides six months of engineering work.

Why Strangler Fig for Mainframes?

The alternative patterns — big-bang cutover, parallel build-and-switch, phased module replacement — all share a fatal flaw for mainframe systems: they require a moment where you switch from old to new. That moment is the single point of failure that has killed more modernization projects than any technical challenge.

The strangler fig eliminates that moment. There is no "go-live day." There is no "cutover weekend." There is a gradual, measurable, reversible transfer of traffic from legacy to modern, one service at a time, over months or years. If the new balance-inquiry service has a bug, you route traffic back to CICS in seconds. If the new payment service can't handle peak load, CICS absorbs the overflow. The legacy system is always there, always running, always ready to catch you when you fall.

Kwame Mensah at CNB puts it this way: "I've seen three big-bang migrations in my career. Two of them are still running on the mainframe. The third one is a Harvard Business School case study in how to destroy $800 million."

💡 Key Insight: The strangler fig pattern's greatest advantage is not technical — it's psychological and organizational. It turns modernization from a single high-stakes bet into a series of small, reversible experiments. Each experiment either succeeds (and you keep going) or fails (and you learn something). The total risk of the project is the sum of many small risks, not one catastrophic one.

What You're Really Building

Let me be precise about the architecture. A strangler fig implementation for a mainframe COBOL system has four layers:

┌─────────────────────────────────────────────────┐
│              External Consumers                 │
│    (Mobile App, Web Portal, Partner APIs)       │
└────────────────────────┬────────────────────────┘
                         │
┌────────────────────────▼────────────────────────┐
│              FACADE / API GATEWAY               │
│    (Routes requests to legacy OR modern)        │
│    - Feature toggles                            │
│    - Traffic splitting rules                    │
│    - Consumer-driven contract validation        │
│    - Logging / comparison engine                │
└────────────┬───────────────────────┬────────────┘
             │                       │
    ┌────────▼────────┐     ┌────────▼────────┐
    │  LEGACY (COBOL) │     │ MODERN SERVICES │
    │  CICS/DB2/IMS   │     │ Java/Node/Go    │
    │  Batch/MQ       │     │ Containers/K8s  │
    │  z/OS           │     │ Cloud or zLinux │
    └────────┬────────┘     └────────┬────────┘
             │                       │
    ┌────────▼───────────────────────▼────────┐
    │          DATA SYNCHRONIZATION           │
    │    (CDC, dual-write, event sourcing)    │
    │ Keeps legacy and modern data consistent │
    └─────────────────────────────────────────┘

The facade is the strangler. The legacy system is the host tree. The modern services are the new roots. And the data synchronization layer — the one that kept Carlos up until 4 AM — is the part that nobody thinks about until it's too late.


33.2 Architecture of the Strangler Fig Pattern

Let's get specific. The strangler fig pattern for mainframe COBOL systems has three architectural components: the facade, the routing engine, and the extraction pipeline. Each one has design decisions that will determine whether your migration succeeds or produces a 4 AM phone call.

33.2.1 The Facade

The facade is the single entry point for all consumers. Before the strangler fig, consumers talked directly to CICS (via 3270, web services, or z/OS Connect). After the facade is in place, consumers talk to the facade, and the facade decides whether to route to CICS or to a modern service.

The facade must satisfy four requirements:

  1. Transparent to consumers. No consumer should need to change their integration when you switch a service from legacy to modern. The facade's external contract is stable; only the internal routing changes.

  2. Stateless. The facade routes requests but doesn't hold state. State lives in the services (legacy or modern) and in the data layer. A stateless facade can be scaled horizontally and doesn't create a single point of failure.

  3. Observable. Every request through the facade must be logged with enough detail to compare legacy and modern responses during parallel running. You need to know: which service handled the request, what the response was, how long it took, and whether the response matches what the other service would have returned.

  4. Reversible. Routing changes must be instantaneous. If the modern balance-inquiry service starts returning errors at 2 PM, you need to route all traffic back to CICS by 2:01 PM. No deployments, no restarts, no approvals — a configuration change that takes effect in seconds.
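Requirement 4 deserves a concrete illustration. The sketch below (Java, with hypothetical class and method names) shows the essential mechanism: the routing target lives in mutable in-memory state that an operator can flip at runtime, so a rollback is a method call, not a deployment. In production this state would come from your gateway's admin API or a configuration service rather than a hard-coded field.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of requirement 4 (reversibility). The routing target is
// held in an AtomicReference, so a change takes effect on the very next
// request: no deployment, no restart. Names here are hypothetical.
public class ReversibleRouting {
    private final AtomicReference<String> backend =
            new AtomicReference<>("legacy");

    // Called by an operator or a configuration-service watcher.
    public void switchTo(String target) {
        if (!target.equals("legacy") && !target.equals("modern")) {
            throw new IllegalArgumentException("unknown backend: " + target);
        }
        backend.set(target);
    }

    // Called on every request; always reads the current value.
    public String backendForNextRequest() {
        return backend.get();
    }
}
```

Rolling the modern balance-inquiry service back to CICS is then a single `switchTo("legacy")`, effective for the next request.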

⚠️ Common Pitfall: Many teams build the facade as a "smart" layer that transforms data, applies business rules, or orchestrates calls to multiple backends. Don't. The facade should be dumb. Its only job is routing. Every piece of logic you put in the facade is a piece of logic you'll need to maintain, test, and debug when something goes wrong at 2 AM. Keep the facade thin.

33.2.2 Implementation Options for the Facade

There are three viable implementation options for mainframe strangler fig facades. Each has tradeoffs:

Option A: API Gateway (Kong, Apigee, AWS API Gateway)

An external API gateway sits in front of both the mainframe and modern services. Traffic from consumers hits the gateway, which routes based on path, header, or percentage-based rules.

  • Pros: Mature tooling, built-in traffic splitting, no z/OS changes required, works with any backend.
  • Cons: Adds network hop for mainframe traffic that was previously internal, requires the mainframe to expose HTTP/REST endpoints (Chapter 21), may introduce latency for high-throughput transactions.
  • Best for: Consumer-facing APIs (mobile, web), partner integrations, systems where the mainframe already exposes REST via z/OS Connect.

Option B: z/OS Connect as Facade

z/OS Connect EE (covered in Chapter 21) can act as the facade layer running on the mainframe. It already provides API mediation for CICS and IMS transactions. With routing rules, it can direct some requests to CICS and others to external services.

  • Pros: Runs on z/OS — no additional network hop for legacy requests, leverages existing z/OS Connect investment, maintains mainframe security model (RACF, SSL).
  • Cons: Routing to external services requires outbound HTTP from z/OS (possible but adds complexity), limited traffic-splitting features compared to dedicated API gateways, IBM licensing costs.
  • Best for: Organizations that already use z/OS Connect, systems where most traffic stays on the mainframe, regulatory environments that require all routing to be auditable on z/OS.

Option C: Hybrid — External Gateway + z/OS Connect

The external API gateway handles consumer-facing routing decisions (which service gets the request). z/OS Connect handles the mainframe-side mediation (transforming the request into a CICS LINK or IMS transaction). This is what SecureFirst implemented after Carlos's 4 AM incident.

  • Pros: Best of both worlds — external gateway gets mature traffic management, z/OS Connect gets native mainframe integration. Clean separation of concerns.
  • Cons: Two components to manage, monitor, and troubleshoot. More infrastructure. More potential failure points.
  • Best for: Most enterprise strangler fig implementations. This is the pattern I recommend unless you have a compelling reason to choose A or B.

33.2.3 The Routing Engine

The routing engine is the decision-making component within the facade. It answers one question: "Should this request go to the legacy system or the modern service?"

The routing decision can be based on:

| Routing Strategy | Description | Use Case |
| --- | --- | --- |
| Path-based | Route by API endpoint path (/api/v2/balance → modern, /api/v2/transfer → legacy) | Service-by-service extraction |
| Header-based | Route by custom header (X-Backend: modern) | Testing and debugging |
| Percentage-based | Route N% of traffic to modern, (100-N)% to legacy | Canary deployments |
| User-based | Route specific users/accounts to modern | Beta testing with internal users |
| Time-based | Route to modern during low-traffic periods, legacy during peak | Building confidence gradually |
| Feature toggle | Route based on toggle state in configuration service | Instant rollback capability |

In practice, you'll use a combination. SecureFirst's routing for balance inquiry started with:

  1. Internal employees only (user-based) — two weeks
  2. 5% of external users (percentage-based) — one week
  3. 25% of external users — one week
  4. 50% with comparison logging (percentage + path-based) — two weeks
  5. 100% modern, CICS on standby — ongoing until decommission
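Those phases can be expressed as one routing function. Here's a sketch in Java (class and parameter names are mine, not SecureFirst's actual code) that composes three of the strategies: the header override for debugging, the user-based allow-list for Phase 1, and the stable percentage split for the canary phases. The hash on account number ensures the same account always lands on the same backend during a rollout.

```java
import java.util.Set;

// Sketch of a combined routing decision (hypothetical names).
// Strategies compose in priority order: explicit header override,
// then user-based allow-list, then stable percentage-based splitting.
public class BalanceInquiryRouter {
    private final Set<String> internalUsers;
    private final int modernPercentage;   // 0..100; mutable at runtime in real life

    public BalanceInquiryRouter(Set<String> internalUsers, int modernPercentage) {
        this.internalUsers = internalUsers;
        this.modernPercentage = modernPercentage;
    }

    public String route(String userId, String accountNumber, String overrideHeader) {
        // 1. Header-based override for testing and debugging
        if ("modern".equals(overrideHeader)) return "modern";
        if ("legacy".equals(overrideHeader)) return "legacy";
        // 2. User-based: internal employees go to modern first (Phase 1)
        if (internalUsers.contains(userId)) return "modern";
        // 3. Percentage-based canary with a stable hash, so the same
        //    account sees the same backend on every request
        int bucket = Math.floorMod(accountNumber.hashCode(), 100);
        return bucket < modernPercentage ? "modern" : "legacy";
    }
}
```

Moving from phase to phase is then just a change to the allow-list or the percentage, which is exactly what makes the rollout reversible.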

💡 Key Insight: The routing engine is where the strangler fig pattern delivers its key value — reversibility. If Step 4 reveals a discrepancy, you drop back to Step 2 in seconds. No rollback deployment, no database restoration, no emergency change advisory board meeting. Just a configuration change.

33.2.4 The Extraction Pipeline

The extraction pipeline is the process — not a technology — by which you identify a COBOL service, build the modern replacement, validate it through parallel running, and decommission the legacy code. Each extraction follows the same lifecycle:

IDENTIFY → UNDERSTAND → BUILD → SHADOW → PARALLEL → CANARY → MIGRATE → DECOMMISSION
   │            │          │        │         │          │         │           │
   │            │          │        │         │          │         │           └─ Remove legacy code,
   │            │          │        │         │          │         │              update docs
   │            │          │        │         │          │         └─ 100% traffic to modern,
   │            │          │        │         │          │            legacy on standby
   │            │          │        │         │          └─ Percentage-based traffic split,
   │            │          │        │         │             comparison still active
   │            │          │        │         └─ Both services handle real traffic,
   │            │          │        │            responses compared automatically
   │            │          │        └─ Modern service gets copy of traffic,
   │            │          │           responses discarded (shadow mode)
   │            │          └─ Implement modern service,
   │            │             unit + integration tests
   │            └─ Map all business rules, edge cases,
   │               data dependencies
   └─ Score extraction candidates (Section 33.3)

A typical extraction takes 3-6 months for a medium-complexity COBOL service (5,000-20,000 LOC, DB2 back-end, CICS front-end). High-complexity services with IMS dependencies, multiple copybook hierarchies, and undocumented business rules can take 12-18 months.
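The SHADOW stage deserves a sketch, because it's the stage teams most often skip. The invariant is simple: the consumer only ever receives the legacy response, and a failure in the shadow path must never affect them. A minimal version in Java (names are illustrative; a real implementation would live in the facade and write the shadow response to the comparison log):

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Shadow-mode dispatch (sketch, hypothetical names): the legacy response
// is always what the consumer receives; the modern service gets a copy
// of the request asynchronously, and its response is discarded.
public class ShadowDispatcher {
    private final Function<String, String> legacy;   // request -> response
    private final Function<String, String> modern;

    public ShadowDispatcher(Function<String, String> legacy,
                            Function<String, String> modern) {
        this.legacy = legacy;
        this.modern = modern;
    }

    public String handle(String request) {
        // Fire-and-forget copy to the modern service. A failure here must
        // never reach the consumer: swallow it and log it.
        CompletableFuture.runAsync(() -> {
            try {
                modern.apply(request);
                // in real life: record the shadow response for comparison
            } catch (Exception e) {
                // log and ignore: shadow traffic must be harmless
            }
        });
        // In shadow mode the consumer only ever sees the legacy answer.
        return legacy.apply(request);
    }
}
```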


33.3 Identifying Extraction Candidates

Not every COBOL service should be extracted. The strangler fig pattern is powerful precisely because it's selective — you extract what benefits from extraction and leave the rest alone. The decision framework from Chapter 32 (portfolio assessment, three-axis scoring) gives you the strategic view. This section gives you the tactical criteria for choosing which services to extract first.

33.3.1 The Extraction Scorecard

Score each candidate service on five dimensions:

| Dimension | Score Range | What You're Measuring |
| --- | --- | --- |
| Business Value of Extraction | 1-5 | How much does the business gain from having this as a modern service? (Mobile access, faster change velocity, new capabilities) |
| Technical Complexity | 1-5 (lower is better) | How many dependencies, edge cases, and undocumented rules does this service have? |
| Data Coupling | 1-5 (lower is better) | How tightly is this service's data coupled to other services? Does it share DB2 tables with 15 other programs? |
| Change Frequency | 1-5 | How often does the business need to change this service? High-change services benefit most from modern development practices. |
| Risk Tolerance | 1-5 | How much tolerance does the business have for errors in this service? (A display-only service has high tolerance; a payment service has near-zero tolerance.) |

Calculate the extraction priority score:

Priority = (Business Value × 3) + (Change Frequency × 2) + Risk Tolerance
           ─────────────────────────────────────────────────────────────────
                        Technical Complexity + Data Coupling

Higher scores indicate better extraction candidates. The weighting reflects the practical reality: you want high business value, high change frequency, and high risk tolerance (meaning you can afford some errors during transition), divided by the difficulty factors.
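The formula translates directly into code. A small sketch (Java; the class and method names are mine) that you can use in a portfolio-scoring script:

```java
// Extraction priority score, transcribed from the formula above.
// Inputs are the five 1-5 scorecard dimensions.
public class ExtractionScorecard {
    public static double priority(int bizValue, int techComplexity,
                                  int dataCoupling, int changeFreq,
                                  int riskTolerance) {
        double numerator = bizValue * 3 + changeFreq * 2 + riskTolerance;
        double denominator = techComplexity + dataCoupling;
        return numerator / denominator;
    }
}
```

Plugging in Balance Inquiry's scores (5, 2, 2, 3, 4) gives (15 + 6 + 4) / 4 = 6.25; Wire Transfer's (3, 5, 5, 1, 1) gives 12 / 10 = 1.20.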

33.3.2 SecureFirst's Extraction Scorecard

Yuki and Carlos scored SecureFirst's CICS services using this framework:

| Service | Biz Value | Tech Complex | Data Coupling | Change Freq | Risk Toler. | Priority |
| --- | --- | --- | --- | --- | --- | --- |
| Balance Inquiry | 5 | 2 | 2 | 3 | 4 | 6.25 |
| Transaction History | 5 | 2 | 3 | 3 | 4 | 5.00 |
| Account Summary | 4 | 2 | 3 | 2 | 4 | 4.00 |
| Fund Transfer | 5 | 5 | 5 | 3 | 1 | 2.20 |
| Bill Payment | 4 | 4 | 4 | 2 | 1 | 2.13 |
| Loan Origination | 3 | 5 | 4 | 1 | 1 | 1.33 |
| Wire Transfer | 3 | 5 | 5 | 1 | 1 | 1.20 |

The results matched their intuition but now had numbers behind them. Balance Inquiry scored highest because it's read-only (high risk tolerance — a wrong balance display is bad but doesn't lose money), has relatively simple logic (account lookup, hold calculation, available balance computation), low data coupling (reads from the account master table and the hold table, doesn't write anything), and delivers immediate business value (the mobile app's most-used function).

Fund Transfer scored lowest despite high business value, because the technical complexity and data coupling are extreme (it touches account masters, transaction logs, general ledger, hold tables, and regulatory tables, all within a single UOW), and the risk tolerance is effectively zero — a wrong transfer loses real money or violates regulations.

⚠️ Common Pitfall: Teams often pick Fund Transfer first because it has the highest business value. This is the wrong metric. The strangler fig pattern is about building confidence incrementally. Your first extraction must succeed — it sets the pattern, builds organizational confidence, and trains the team. Pick the service with the highest priority score, not the highest business value. You'll get to Fund Transfer eventually, but by then you'll have extracted five simpler services and learned from each one.

33.3.3 Finding the Seams

Michael Feathers (author of Working Effectively with Legacy Code) introduced the concept of a "seam" — a place in the code where you can alter behavior without editing the code itself. In a COBOL/CICS system, the natural seams are:

  1. CICS transaction boundaries. Each CICS transaction (EXEC CICS LINK, EXEC CICS XCTL) is a seam. The TOR routes to a transaction; you can route that same transaction to a modern service instead.

  2. COMMAREA / Channel boundaries. The data passed between CICS programs via COMMAREA or channels/containers defines the service contract. If you can replicate the COMMAREA contract in a REST API, you have a clean extraction point.

  3. Copybook boundaries. Shared copybooks define data structures. If a service has its own copybook that isn't shared with unrelated services, that's a clean data boundary — a seam.

  4. DB2 view boundaries. If the service reads data through a DB2 view rather than directly from base tables, the view is a seam — you can change what's behind the view without changing the service.

  5. MQ queue boundaries. Services that communicate via MQ queues have explicit message contracts. The queue is a natural seam — you can put a different service on the receiving end of the queue.

At SecureFirst, the balance-inquiry extraction succeeded because it had clean seams on three dimensions: a dedicated CICS transaction (BALINQ), a well-defined COMMAREA (account number in, balance data out), and a focused DB2 access pattern (SELECT from ACCOUNT_MASTER and ACCOUNT_HOLDS). No IMS, no MQ, no shared copybooks with unrelated services.

💡 Key Insight: If you can't find clean seams, you may need to create them before extraction. This is the "prepare the patient for surgery" phase. Refactor the COBOL first — separate the service's logic into its own paragraph or subprogram, isolate its DB2 access through a data-access copybook, and ensure the COMMAREA is self-contained. This pre-extraction refactoring is valid modernization work that pays for itself even if you never extract the service.


33.4 The Facade Layer: Implementation

Let's build it. This section walks through the facade layer implementation using SecureFirst's balance-inquiry extraction as the reference. We'll cover the CICS web service wrapper, the API gateway configuration, and the routing rules.

33.4.1 The Legacy Side: CICS Web Service Wrapper

Before the strangler fig, SecureFirst's balance inquiry was a CICS transaction (BALINQ) invoked via a 3270 terminal or an internal CICS LINK. To make it accessible through the facade, Yuki's team wrapped it as a CICS web service using the CICS web services pipeline (covered in detail in Chapter 14).

The wrapper program — BALWSSRV — receives an HTTP JSON request, transforms it to the existing COMMAREA layout, LINKs to the original BALINQ program, transforms the COMMAREA response back to JSON, and returns it. The wrapper adds no business logic — it's a pure translation layer.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. BALWSSRV.
      *================================================================
      * BALANCE INQUIRY WEB SERVICE WRAPPER
      * Translates JSON REST request to COMMAREA for BALINQ program.
      * Part of the strangler fig facade layer.
      *
      * This program is a TRANSLATION layer only.
      * NO business logic belongs here.
      * NO data access belongs here.
      * If you're tempted to add a DB2 query, stop and refactor
      * the underlying BALINQ program instead.
      *================================================================
       DATA DIVISION.
       WORKING-STORAGE SECTION.

       01  WS-RESP                     PIC S9(8) COMP VALUE 0.
       01  WS-RESP2                    PIC S9(8) COMP VALUE 0.

       01  WS-JSON-REQUEST.
           05  WS-JSON-ACCT-NUM        PIC X(12).
           05  WS-JSON-REQ-TYPE        PIC X(8).
           05  WS-JSON-CORRELATION-ID   PIC X(36).

       01  WS-JSON-RESPONSE.
           05  WS-RESP-ACCT-NUM        PIC X(12).
           05  WS-RESP-AVAIL-BAL       PIC S9(13)V99 COMP-3.
           05  WS-RESP-LEDGER-BAL      PIC S9(13)V99 COMP-3.
           05  WS-RESP-HOLD-AMT        PIC S9(13)V99 COMP-3.
           05  WS-RESP-CURRENCY        PIC X(3).
           05  WS-RESP-AS-OF-TS        PIC X(26).
           05  WS-RESP-STATUS          PIC X(2).
               88  RESP-OK             VALUE '00'.
               88  RESP-ACCT-NOT-FOUND VALUE '04'.
               88  RESP-ACCT-CLOSED    VALUE '08'.
               88  RESP-SYSTEM-ERROR   VALUE '99'.

      * COMMAREA layout for the existing BALINQ program
       COPY BALCOMM.

       01  WS-CONTAINER-NAME          PIC X(16)
                                      VALUE 'BALINQ-JSON-REQ'.
       01  WS-CHANNEL-NAME            PIC X(16)
                                      VALUE 'BALINQ-CHANNEL'.

       PROCEDURE DIVISION.
       MAIN-LOGIC.
           PERFORM 1000-RECEIVE-REQUEST
           PERFORM 2000-MAP-TO-COMMAREA
           PERFORM 3000-LINK-TO-BALINQ
           PERFORM 4000-MAP-TO-RESPONSE
           PERFORM 5000-SEND-RESPONSE
           EXEC CICS RETURN END-EXEC
           .

       1000-RECEIVE-REQUEST.
      *    Receive JSON from the CICS web service pipeline.
      *    The pipeline's DFHJS2LS transformation has already
      *    converted JSON to the WS-JSON-REQUEST structure.
           EXEC CICS GET CONTAINER(WS-CONTAINER-NAME)
                CHANNEL(WS-CHANNEL-NAME)
                INTO(WS-JSON-REQUEST)
                RESP(WS-RESP)
                RESP2(WS-RESP2)
           END-EXEC

           IF WS-RESP NOT = DFHRESP(NORMAL)
               MOVE '99' TO WS-RESP-STATUS
               PERFORM 5000-SEND-RESPONSE
               EXEC CICS RETURN END-EXEC
           END-IF
           .

       2000-MAP-TO-COMMAREA.
      *    Map the JSON request fields to the COMMAREA layout
      *    that BALINQ expects. This is the ONLY place where
      *    field mapping happens.
           INITIALIZE BALINQ-COMMAREA
           MOVE WS-JSON-ACCT-NUM
                                   TO BAL-ACCT-NUMBER
           MOVE 'INQ'              TO BAL-REQUEST-TYPE
           MOVE WS-JSON-CORRELATION-ID
                                   TO BAL-CORRELATION-ID
           .

       3000-LINK-TO-BALINQ.
      *    LINK to the existing BALINQ program.
      *    The COMMAREA contract is the seam.
           EXEC CICS LINK PROGRAM('BALINQ')
                COMMAREA(BALINQ-COMMAREA)
                LENGTH(LENGTH OF BALINQ-COMMAREA)
                RESP(WS-RESP)
                RESP2(WS-RESP2)
           END-EXEC

           IF WS-RESP NOT = DFHRESP(NORMAL)
               MOVE '99' TO WS-RESP-STATUS
           END-IF
           .

       4000-MAP-TO-RESPONSE.
      *    Map COMMAREA response back to JSON response structure.
           MOVE BAL-ACCT-NUMBER    TO WS-RESP-ACCT-NUM
           MOVE BAL-AVAIL-BALANCE  TO WS-RESP-AVAIL-BAL
           MOVE BAL-LEDGER-BALANCE TO WS-RESP-LEDGER-BAL
           MOVE BAL-HOLD-AMOUNT    TO WS-RESP-HOLD-AMT
           MOVE BAL-CURRENCY-CODE  TO WS-RESP-CURRENCY
           MOVE BAL-TIMESTAMP      TO WS-RESP-AS-OF-TS
           MOVE BAL-RETURN-CODE    TO WS-RESP-STATUS
           .

       5000-SEND-RESPONSE.
      *    Place response in container for the pipeline to
      *    convert back to JSON.
           EXEC CICS PUT CONTAINER(WS-CONTAINER-NAME)
                CHANNEL(WS-CHANNEL-NAME)
                FROM(WS-JSON-RESPONSE)
                RESP(WS-RESP)
           END-EXEC
           .

The key design decisions in this wrapper:

  1. No business logic. The wrapper is pure plumbing. If someone wants to change how balance inquiry works, they change BALINQ, not BALWSSRV.
  2. Correlation ID propagation. The JSON request includes a correlation ID that flows through to the COMMAREA. This is essential for parallel running — you need to match legacy and modern responses for the same request.
  3. Standard error mapping. The wrapper maps CICS response codes to the JSON response's status field. The facade layer needs consistent error reporting from both legacy and modern services.
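Design decision 2 is what makes automated comparison possible during parallel running. A sketch of the comparison engine's core (Java, hypothetical names): both backends report their response under the same correlation ID, and whichever arrives second triggers the comparison. Mismatches are recorded for investigation, never surfaced to the consumer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a parallel-running comparison engine (hypothetical names).
// Responses from both backends are keyed by the correlation ID that
// the facade propagates through both the legacy and modern paths.
public class ParallelRunComparator {
    private final Map<String, String> pending = new ConcurrentHashMap<>();
    private final Map<String, String[]> mismatches = new ConcurrentHashMap<>();

    // Called once per backend per request; the second arrival for a
    // correlation ID triggers the comparison.
    public void record(String correlationId, String response) {
        String other = pending.putIfAbsent(correlationId, response);
        if (other != null) {
            pending.remove(correlationId);
            if (!other.equals(response)) {
                mismatches.put(correlationId, new String[] { other, response });
            }
        }
    }

    public int mismatchCount() {
        return mismatches.size();
    }
}
```

In a real deployment the comparison would normalize fields that legitimately differ (timestamps, for example) before comparing, and every mismatch would page someone long before it became a 4 AM incident.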

33.4.2 The API Gateway Configuration

With the CICS web service wrapper in place, the legacy balance-inquiry service is accessible via HTTP REST. Now we configure the API gateway to route between legacy and modern.

SecureFirst chose Kong as their API gateway (deployed on OpenShift alongside the mainframe). The routing configuration uses Kong's traffic-splitting plugin combined with feature toggles stored in a configuration database.

Here's the routing configuration for the balance-inquiry strangler fig:

# Kong API Gateway — Balance Inquiry Strangler Fig Routing
# SecureFirst Retail Bank
# Managed by: Yuki Nakamura / Carlos Vega
# Last updated: 2024-11-15
#
# ROUTING STRATEGY:
# Phase 1 (current): 100% legacy (CICS via z/OS Connect)
# Phase 2: Internal users → modern, external → legacy
# Phase 3: 10% → modern (canary), 90% → legacy
# Phase 4: 50/50 with comparison logging
# Phase 5: 100% modern, legacy on standby
# Phase 6: Legacy decommissioned

_format_version: "3.0"

services:
  # Legacy service: CICS balance inquiry via z/OS Connect
  - name: balance-inquiry-legacy
    url: https://zosconnect.securefirst.internal:9443/zosConnect/apis/balanceinquiry/v1
    protocol: https
    connect_timeout: 5000
    write_timeout: 10000
    read_timeout: 15000
    retries: 2
    tags:
      - strangler-fig
      - legacy
      - balance-inquiry
    routes:
      - name: balance-inquiry-legacy-route
        paths:
          - /api/v2/accounts/balance
        methods:
          - GET
        # No header match here: this is the default route, so any
        # request without an X-Route-Override header lands on legacy.
        # (The canary plugin below controls the gradual split.)
        strip_path: false

  # Modern service: Kotlin microservice on OpenShift
  - name: balance-inquiry-modern
    url: http://balance-inquiry-svc.banking.svc.cluster.local:8080/api/v2/accounts/balance
    protocol: http
    connect_timeout: 3000
    write_timeout: 5000
    read_timeout: 10000
    retries: 3
    tags:
      - strangler-fig
      - modern
      - balance-inquiry
    routes:
      - name: balance-inquiry-modern-route
        paths:
          - /api/v2/accounts/balance
        methods:
          - GET
        headers:
          X-Route-Override:
            - modern
        strip_path: false

# Traffic splitting plugin — controls the percentage split
plugins:
  - name: canary
    service: balance-inquiry-legacy
    config:
      # Percentage of traffic routed to the modern (upstream) service
      # Change this value to control the strangler fig phase
      percentage: 0
      upstream_host: balance-inquiry-modern
      upstream_uri: /api/v2/accounts/balance
      upstream_port: 8080
      # Hash-based routing ensures the same account always goes
      # to the same backend during a session (prevents the
      # confusion Carlos experienced with inconsistent balances)
      hash: consumer
      start: "2024-12-01T00:00:00Z"
      duration: 2592000  # 30 days for gradual ramp-up
      steps: 100

  # Comparison logging — logs both legacy and modern responses
  # for offline analysis during parallel running
  - name: request-transformer
    service: balance-inquiry-legacy
    config:
      add:
        headers:
          - "X-Correlation-ID:$(uuid)"
          - "X-Strangler-Phase:parallel-run"
          - "X-Timestamp:$(now)"

  # Rate limiting — protect the legacy system from being
  # overwhelmed during the transition
  - name: rate-limiting
    service: balance-inquiry-legacy
    config:
      minute: 6000
      hour: 200000
      policy: redis
      redis_host: redis.securefirst.internal
      redis_port: 6379

33.4.3 The Comparison Engine

During parallel running (Phase 4), both legacy and modern services handle real requests. The comparison engine captures both responses and compares them field by field. Discrepancies are logged, categorized, and flagged for investigation.

SecureFirst's comparison engine runs as a sidecar service alongside the API gateway. For each request:

  1. Route to the primary backend (whichever is handling production traffic)
  2. Asynchronously forward a copy to the secondary backend (shadow mode)
  3. Compare the two responses field by field
  4. Log the comparison result with the correlation ID
  5. Alert if the discrepancy rate exceeds a threshold

The comparison rules for balance inquiry:

FIELD: available_balance
  MATCH: Exact to the penny (0.00 tolerance)
  SEVERITY: Critical
  ACTION: Alert immediately, halt canary ramp-up

FIELD: ledger_balance
  MATCH: Exact to the penny (0.00 tolerance)
  SEVERITY: Critical
  ACTION: Alert immediately, halt canary ramp-up

FIELD: hold_amount
  MATCH: Exact to the penny (0.00 tolerance)
  SEVERITY: Critical
  ACTION: Alert immediately, halt canary ramp-up

FIELD: currency_code
  MATCH: Exact string match
  SEVERITY: Critical
  ACTION: Alert immediately

FIELD: as_of_timestamp
  MATCH: Within 5 seconds (accounts for processing time difference)
  SEVERITY: Warning
  ACTION: Log for investigation if >1% of requests

FIELD: response_time_ms
  MATCH: Modern must be within 150% of legacy
  SEVERITY: Warning
  ACTION: Log; investigate if modern is consistently slower

The critical insight here: for financial data, the comparison must be exact to the penny. Carlos learned this the hard way. A one-cent rounding difference in balance inquiry might seem trivial, but it means the systems disagree about the state of an account. If they disagree on balances, they'll disagree on overdraft calculations, interest accruals, and regulatory reports. A penny today is a regulatory finding tomorrow.
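A comparator honoring these rules can be sketched in the modern stack (Java here; the class and method names are illustrative, not the comparison engine's actual API). Note `compareTo` rather than `equals`, so 12.50 and 12.5 compare equal as amounts:

```java
import java.math.BigDecimal;

// Illustrative sketch of the field-comparison rules: financial fields
// exact to the penny, timestamps within a 5-second window.
public class BalanceComparator {
    public static boolean moneyMatches(BigDecimal legacy, BigDecimal modern) {
        // compareTo, not equals: BigDecimal.equals also compares scale,
        // so 12.50 and 12.5 would spuriously mismatch.
        return legacy.compareTo(modern) == 0;
    }

    public static boolean timestampMatches(long legacyEpochMs, long modernEpochMs) {
        // as_of_timestamp rule: within 5 seconds is acceptable
        return Math.abs(legacyEpochMs - modernEpochMs) <= 5_000;
    }
}
```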

🔴 Production War Story: During SecureFirst's parallel running, the comparison engine flagged a $0.01 discrepancy on 3.2% of accounts. Root cause: the modern Kotlin service used IEEE 754 double-precision floating-point for balance calculations, while the COBOL program used COMP-3 (packed decimal). The nearest double to $1234.56 is actually 1234.559999999999945... — the penny appears when enough decimal operations accumulate. The fix: the Kotlin service was rewritten to use BigDecimal for all monetary calculations. This is a known problem, documented in every COBOL migration guide, and the team still made the mistake. Use packed decimal equivalents. Always.
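The failure mode is easy to reproduce. A minimal Java sketch (method names invented for illustration) that accumulates one hundred $0.01 operations:

```java
import java.math.BigDecimal;

// Illustrative reproduction of the COMP-3 vs. floating-point drift:
// accumulate one hundred $0.01 operations both ways.
public class MoneyDemo {
    public static double sumFeesDouble(int n) {
        double total = 0.0;
        for (int i = 0; i < n; i++) total += 0.01;  // binary double cannot hold 0.01 exactly
        return total;                               // not exactly 1.0 for n=100
    }

    public static BigDecimal sumFeesDecimal(int n) {
        BigDecimal total = BigDecimal.ZERO;
        for (int i = 0; i < n; i++) {
            total = total.add(new BigDecimal("0.01"));  // string constructor: exact decimal
        }
        return total;                                   // exactly 1.00 for n=100
    }
}
```

`BigDecimal` constructed from a string is the Java-side analogue of COMP-3's exact decimal arithmetic; `new BigDecimal(0.01)` would reintroduce the binary representation error.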


33.5 Data Synchronization During Migration

This is the section that matters most. Architecture diagrams are neat and routing configurations are straightforward, but data synchronization during the coexistence period is where strangler fig implementations succeed or fail. Every production incident I've seen in strangler fig migrations traces back to data synchronization.

The fundamental problem: during the coexistence period, you have two systems that both need current, consistent data. The legacy CICS system is still processing some transactions against DB2 on z/OS. The modern service is processing other transactions against its own data store (PostgreSQL, MongoDB, whatever). If either system has stale data, customers see wrong numbers, and wrong numbers in banking means regulators, lawyers, and headlines.

33.5.1 Data Synchronization Patterns

There are four patterns for keeping data synchronized during the strangler fig transition. Each has severe tradeoffs.

Pattern 1: Legacy as Source of Truth (Recommended for Phase 1-3)

The legacy DB2 database is the single source of truth. The modern service reads from a replica that's kept in sync via Change Data Capture (CDC). The modern service does not write to its own database — all writes go through the legacy system.

                ┌─────────────────────┐
                │   Modern Service    │
                │   (read-only)       │
                │                     │
                │   Reads from ──────►│── PostgreSQL
                │   replica           │   (read replica)
                └─────────────────────┘
                                           ▲
                                           │  CDC Stream
                                           │  (near real-time)
                ┌─────────────────────┐    │
                │   Legacy CICS       │    │
                │   (read + write)    │    │
                │                     │    │
                │   DB2 Master ───────┼────┘
                │                     │
                └─────────────────────┘
  • Pros: Simple. One source of truth. No conflict resolution needed. Consistent. If the CDC stream lags, the modern service shows slightly stale data, but it's never wrong — just behind.
  • Cons: The modern service can't do writes. Limits extraction to read-only services (balance inquiry, transaction history, account summary). Not viable for fund transfer, payments, or any service that modifies data.
  • CDC Tools: IBM InfoSphere CDC, Debezium with the Db2 connector, IBM Data Replication (formerly IIDR).

Pattern 2: Dual-Write with Legacy Priority

Both the legacy and modern services can write. Writes go to the legacy system first, then are replicated to the modern system's data store. If there's a conflict, the legacy system wins.

                ┌─────────────────────┐
                │   Modern Service    │
                │   (read + write)    │
                │                     │
                │   Writes to ───────►│── PostgreSQL
                │   own DB            │
                └────────┬────────────┘
                         │ Write also sent
                         │ to legacy (sync)
                         ▼
                ┌─────────────────────┐
                │   Legacy CICS       │
                │   (read + write)    │
                │   DB2 Master ───────┼──► CDC to PostgreSQL
                └─────────────────────┘
  • Pros: Modern service can handle writes. Path to full extraction.
  • Cons: Complex. Two databases to keep in sync. Dual-write creates a distributed transaction problem — if the legacy write succeeds but the PostgreSQL write fails (or vice versa), you have an inconsistency. Requires saga pattern or compensating transactions.
  • Danger: This is where most teams get into trouble. Dual-write is not a database pattern; it's a distributed systems problem. Treat it with the respect you'd give a distributed transaction.

⚠️ Common Pitfall: "We'll just write to both databases in the same transaction." No, you won't. DB2 on z/OS and PostgreSQL on Linux cannot participate in the same two-phase commit without a distributed transaction coordinator (like CICS UOW with XA). And even if they could, the performance penalty of a cross-platform two-phase commit would make your 200ms SLA impossible. Use asynchronous replication with compensating transactions, not distributed two-phase commit.
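Under those constraints, the dual-write collapses to an optimistic local write plus a compensating reversal. A hedged Java sketch (the functional-interface shape and names are hypothetical; a production saga also needs idempotency keys, retry, and an audit trail):

```java
import java.util.function.Consumer;
import java.util.function.Predicate;

// Illustrative dual-write with legacy priority: apply locally,
// post to the system of record, compensate on failure.
public class DualWriter {
    /**
     * Returns true only when both sides committed. If the legacy post
     * fails or throws, the local write is reversed (compensating txn).
     */
    public static boolean transfer(String txnId,
                                   Consumer<String> applyLocal,
                                   Predicate<String> postToLegacy,
                                   Consumer<String> reverseLocal) {
        applyLocal.accept(txnId);              // optimistic local write (e.g. PostgreSQL)
        boolean ok;
        try {
            ok = postToLegacy.test(txnId);     // DB2 via CICS remains the source of truth
        } catch (RuntimeException e) {
            ok = false;                        // treat a failed call as a rejected write
        }
        if (!ok) reverseLocal.accept(txnId);   // compensating transaction
        return ok;
    }
}
```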

Pattern 3: Event Sourcing

Instead of synchronizing databases, both systems consume and produce events. Every state change is an event (AccountCredited, AccountDebited, HoldPlaced, HoldReleased). Both systems derive their current state from the event log.

  • Pros: Elegant. No direct database coupling. Natural audit trail. Both systems can independently maintain their own materialized view of the data.
  • Cons: Requires a complete rearchitecture of the data model. Existing COBOL programs don't produce events — they do SQL UPDATEs. Retrofitting event sourcing onto a 40-year-old COBOL system is a multi-year project in itself.
  • Verdict: Theoretically ideal, practically impossible for most mainframe brownfield environments. Consider this for greenfield services that are being built alongside the legacy system, not for extracting existing services.

Pattern 4: Shared Database

Both the legacy and modern services access the same DB2 database on z/OS. No replication, no synchronization — they share the data.

  • Pros: Simple. No synchronization issues. Data is always consistent because there's only one copy.
  • Cons: The modern service must be able to connect to DB2 on z/OS, which means DRDA or z/OS Connect data services. Performance is limited by the network hop between the modern service and z/OS. The modern service is tightly coupled to the DB2 schema, which prevents you from evolving the data model. And you're still paying for DB2 on z/OS, which means you haven't reduced mainframe costs.
  • Verdict: Viable as a transitional pattern (Phase 2-3) but not a long-term architecture. The modern service needs its own data store eventually. Use this pattern to get the modern service running, then migrate to Pattern 1 or 2.

33.5.2 SecureFirst's Data Synchronization Strategy

SecureFirst chose a phased approach that moved through the patterns over twelve months:

Months 1-3 (Balance Inquiry extraction): Pattern 1 — Legacy as Source of Truth. The modern Kotlin balance-inquiry service reads from a PostgreSQL replica kept in sync via IBM InfoSphere CDC. Latency target: <5 seconds. Achieved: 2-3 seconds average, 47-second spike during batch window (the incident that started this chapter).

Resolution of the 47-second spike: Yuki's team added a "freshness indicator" to the API response. If the CDC lag exceeds 10 seconds, the response includes a header (X-Data-Freshness: delayed) and the mobile app displays "Balance as of [timestamp]" instead of implying the balance is real-time. This is honest engineering — don't pretend data is fresh when it isn't.
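The freshness indicator itself is a one-line threshold check in the facade. A hedged Java sketch (the 10-second threshold and header name come from the text; the `Freshness` class and `Optional`-based shape are illustrative):

```java
import java.util.Optional;

// Illustrative freshness check: emit X-Data-Freshness: delayed when
// CDC lag exceeds the 10-second threshold, so the app can render
// "Balance as of [timestamp]" instead of implying real-time data.
public class Freshness {
    static final long MAX_FRESH_LAG_SECONDS = 10;

    /** Header value for X-Data-Freshness, or empty when data is fresh. */
    public static Optional<String> dataFreshnessHeader(long cdcLagSeconds) {
        return cdcLagSeconds > MAX_FRESH_LAG_SECONDS
                ? Optional.of("delayed")
                : Optional.empty();
    }
}
```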

Months 4-8 (Transaction History extraction): Pattern 1 continues. Transaction history is also read-only. The same CDC pipeline feeds both balance data and transaction data to PostgreSQL.

Months 9-12 (Fund Transfer extraction — planned): Pattern 2 — Dual-Write with Legacy Priority. When the modern service handles a fund transfer, it writes to PostgreSQL first (for its own use), then sends the transfer instruction to the legacy CICS system via an MQ message. CICS processes the transfer against DB2 (the source of truth), and the result flows back to the modern service via CDC. If the CICS processing fails, the modern service initiates a compensating transaction to reverse its PostgreSQL write.

💡 Key Insight: Notice the progression. You don't start with the hardest synchronization pattern. You start with Pattern 1 (read-only, CDC), prove that the CDC pipeline is reliable, learn its failure modes, understand its latency characteristics, and then move to Pattern 2 (dual-write). Each phase builds on the infrastructure and the team's experience from the previous phase. This is the strangler fig philosophy applied to the data layer — incremental, reversible, confidence-building.

33.5.3 Change Data Capture in Practice

CDC for DB2 on z/OS is well-established technology. IBM InfoSphere Data Replication (IIDR) reads the DB2 recovery log (locating the active and archive log data sets via the bootstrap data set, the BSDS) and streams row-level changes to a target. Debezium, the open-source CDC framework, has a Db2 connector that works for Db2 on distributed platforms, but for Db2 on z/OS, you typically need IIDR or IBM Data Gate.

Key operational considerations for CDC in a strangler fig context:

Latency. CDC from DB2 z/OS to a cloud-hosted PostgreSQL has three latency components: log read latency (how quickly the CDC tool reads from the DB2 log), network transfer latency (z/OS to cloud), and apply latency (how quickly the target database applies the changes). Under normal load, total latency is 2-5 seconds. During peak batch processing, when the DB2 log is generating gigabytes of changes per hour, latency can spike to 30-60 seconds.

Schema evolution. When the DB2 table schema changes (ALTER TABLE ADD COLUMN), the CDC pipeline must handle the schema change without data loss and without stopping replication. This requires coordination between the DBA, the CDC team, and the modern service team. Add schema change coordination to your strangler fig runbook.

Initial load. Before CDC can stream changes, the target database needs a full copy of the source data. For a large DB2 table (billions of rows), this initial load can take hours or days. Plan for this. Run the initial load during a maintenance window, and start CDC replication from a known log position after the load completes.

Monitoring. CDC pipelines fail silently. The DB2 log wraps, a network hiccup drops a batch of changes, a schema mismatch causes the apply to stop — and nobody notices until a customer complains about stale data. Monitor three metrics:

  1. Replication lag (seconds behind source) — alert if >30 seconds
  2. Apply throughput (rows/second) — alert if drops to zero
  3. Error count — alert on any non-zero value
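Those three alert rules can be sketched as a single predicate (Java; the thresholds follow the list above, and the class and parameter names are illustrative):

```java
// Illustrative CDC alert predicate covering the three rules:
// lag behind source, stalled apply throughput, and apply errors.
public class CdcMonitor {
    public static boolean shouldAlert(long lagSeconds,
                                      long applyRowsPerSecond,
                                      long errorCount) {
        return lagSeconds > 30            // replication lag behind source
            || applyRowsPerSecond == 0    // apply throughput dropped to zero
            || errorCount > 0;            // any apply error is actionable
    }
}
```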

33.6 Testing the Transition

You've built the facade. You've configured the routing. You've set up CDC. Now comes the most important phase of the strangler fig: proving that the new service is equivalent to the old one. In banking, "equivalent" means "produces identical results for every possible input, including edge cases that haven't occurred in three years but could occur tomorrow."

33.6.1 Shadow Mode Testing

Shadow mode (also called "dark launching") is the first testing phase. The facade routes all production traffic to the legacy system, but also sends a copy of each request to the modern service. The modern service processes the request and returns a response, but the response is discarded — only the legacy response goes back to the consumer. The comparison engine logs both responses for offline analysis.

Shadow mode answers one question: does the modern service produce the same result as the legacy service for real production traffic?

Duration: 2-4 weeks minimum. SecureFirst ran shadow mode for three weeks on balance inquiry and discovered:

  • The COMP-3 vs. floating-point rounding issue (3.2% of accounts, described in Section 33.4.3)
  • A timezone discrepancy where the modern service returned timestamps in UTC but the legacy service returned them in ET (Eastern Time). Both were "correct" by their own conventions, but consumers expected ET.
  • Seven accounts with data that triggered an edge case in the legacy COBOL program (accounts with a specific combination of hold types) that the modern service hadn't implemented. These seven accounts represented 0.0003% of the customer base and used an obscure hold type (judicial garnishment hold with partial release) that the extraction team hadn't encountered in their analysis.

That last one is why shadow mode testing with production traffic is non-negotiable. You cannot discover every edge case through unit tests, integration tests, or even months of QA testing with synthetic data. Only real production traffic — with its full diversity of account states, hold types, fee structures, and regulatory flags — reveals the edge cases that matter.

⚠️ Common Pitfall: "We'll run shadow mode for a few days and if everything matches, we'll go to canary." A few days isn't enough. You need to cover at least one full monthly cycle (to catch month-end processing effects), one full statement period, and ideally one quarter-end (to catch quarterly regulatory calculations). Three weeks is the minimum; a full month is better.

33.6.2 Parallel Running

Once shadow mode testing achieves a target match rate (SecureFirst's threshold: 99.99% exact match on financial fields, 99.9% on non-financial fields for 14 consecutive days), you move to parallel running.

In parallel running, both services handle real production traffic. The facade routes each request to one service (the "primary") and shadows it to the other. The primary's response goes to the consumer. The comparison engine continues logging discrepancies.

The difference from shadow mode: in parallel running, you gradually shift which service is primary. Week one: legacy is primary for 100% of requests. Week two: modern is primary for 10%. Week three: 25%. Week four: 50%. Each increase happens only if the discrepancy rate stays below threshold.
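The weekly go/no-go can be sketched as a pure function over the dashboard metrics (Java; the thresholds follow SecureFirst's criteria in this chapter, while the `RampGate` method shape is illustrative):

```java
// Illustrative ramp-up gate: hold the current split unless every
// criterion clears its threshold (99.99% financial match, 99.95%
// non-financial, modern p95 within 150% of legacy, CDC lag < 30s).
public class RampGate {
    public static boolean mayIncrease(double financialMatchPct,
                                      double nonFinancialMatchPct,
                                      double modernP95Ms,
                                      double legacyP95Ms,
                                      double cdcLagSeconds) {
        return financialMatchPct    >= 99.99
            && nonFinancialMatchPct >= 99.95
            && modernP95Ms          <= legacyP95Ms * 1.5
            && cdcLagSeconds        <  30.0;
    }
}
```

A single failing criterion holds the ramp; there is deliberately no "mostly green" override.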

Parallel Running Dashboard Metrics:

╔══════════════════════════════════════════════════════════╗
║  STRANGLER FIG DASHBOARD — Balance Inquiry              ║
║  Phase: Parallel Running — Week 3 (25% modern primary)  ║
╠══════════════════════════════════════════════════════════╣
║                                                          ║
║  Traffic Split:                                          ║
║    Legacy primary:  75.0% (148,293 requests/hour)        ║
║    Modern primary:  25.0% ( 49,431 requests/hour)        ║
║                                                          ║
║  Match Rate (last 24h):                                  ║
║    Financial fields:  99.9997% (6 mismatches / 1.97M)    ║
║    Non-financial:     99.94%  (1,182 mismatches / 1.97M) ║
║                                                          ║
║  Response Time (p95):                                    ║
║    Legacy:   42ms                                        ║
║    Modern:   28ms                                        ║
║                                                          ║
║  Error Rate:                                             ║
║    Legacy:   0.002%                                      ║
║    Modern:   0.003%                                      ║
║                                                          ║
║  CDC Lag:   3.1 seconds (target: <30s)                   ║
║                                                          ║
║  Ramp-Up Decision:                                       ║
║    ✅ Financial match rate > 99.99%                       ║
║    ⚠️  Non-financial match rate < 99.95% (investigate)   ║
║    ✅ Response time within 150%                           ║
║    ✅ Error rate within tolerance                         ║
║    ✅ CDC lag within tolerance                            ║
║                                                          ║
║  RECOMMENDATION: Hold at 25% — investigate               ║
║  non-financial mismatches before ramp-up                  ║
╚══════════════════════════════════════════════════════════╝

The six financial field mismatches in the dashboard above? Three were caused by a race condition where a deposit posted between the legacy and modern queries (a timing issue, not a logic issue — expected and acceptable). Two were caused by a CDC lag spike during batch processing. One was a genuine bug in the modern service's handling of accounts with negative available balances (overdrawn accounts with pending deposits). That last one was fixed, and the ramp-up continued after a 48-hour observation period.

33.6.3 Canary Deployment

Canary deployment is a refinement of parallel running where you expose the modern service to a small, controlled subset of real users. Unlike shadow mode (where the modern service's response is discarded), canary users actually see the modern service's response.

SecureFirst's canary strategy for balance inquiry:

Canary Ring  Users                                           Duration    Rollback Trigger
Ring 0       SecureFirst employees (internal)                2 weeks     Any financial mismatch
Ring 1       5% of external users (lowest-balance accounts)  1 week      >0.01% financial mismatch rate
Ring 2       25% of external users                           2 weeks     >0.001% financial mismatch rate
Ring 3       50% of external users                           2 weeks     >0.001% financial mismatch rate
Ring 4       100% of external users                          Indefinite  >0.0001% financial mismatch rate

Why lowest-balance accounts for Ring 1? Because if there's a display error, the dollar impact is smallest. Showing a customer with $47 in their account a balance of $46.99 is a problem, but showing a customer with $470,000 a balance of $469,999.99 is a much bigger problem — and a much bigger regulatory exposure.

💡 Key Insight: Canary rings are not just about percentage of traffic — they're about controlling blast radius. Route canary traffic to accounts where the impact of an error is lowest. This means: smallest balances, fewest pending transactions, simplest account structures. Save the complex accounts (trusts, joint accounts with multiple signers, business accounts with sub-accounts) for Ring 3 or later, after you've built confidence with simple accounts.
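Blast-radius-first ring assignment can be sketched as a small classifier (Java; the $1,000 cutoff, the `complexStructure` flag, and the `RingAssigner` name are invented for illustration):

```java
import java.math.BigDecimal;

// Illustrative blast-radius ring assignment: complex account
// structures wait for Ring 3+; among simple accounts, the smallest
// balances (smallest dollar impact of a display error) go first.
public class RingAssigner {
    public static int earliestRing(BigDecimal balance, boolean complexStructure) {
        if (complexStructure) return 3;  // trusts, joint accounts, business sub-accounts
        if (balance.compareTo(new BigDecimal("1000")) < 0) return 1;
        return 2;
    }
}
```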

33.6.4 Rollback Strategy

Every phase of the strangler fig must have a rollback plan that can be executed in under 60 seconds. The facade is the rollback mechanism — changing the routing configuration instantly redirects traffic to the legacy system.

SecureFirst's rollback procedure:

STRANGLER FIG ROLLBACK PROCEDURE — Balance Inquiry
═══════════════════════════════════════════════════

TRIGGER CONDITIONS (any one triggers rollback):
  - Financial field mismatch rate > threshold for current phase
  - Modern service error rate > 0.1%
  - Modern service p99 response time > 500ms
  - CDC replication lag > 60 seconds for > 5 minutes
  - Manual trigger by on-call engineer

ROLLBACK STEPS:
  1. Execute:  curl -sX PATCH http://<kong-admin>:8001/plugins/<canary-plugin-id> \
                    --data config.percentage=0
     Effect:   All traffic immediately routes to legacy
     Time:     < 5 seconds
     Verify:   GET /plugins/<canary-plugin-id> shows "percentage": 0

  2. Verify legacy service health:
     - Check CICS BALINQ transaction rate (should increase to 100%)
     - Check DB2 thread utilization (should handle full load)
     - Check response times (should be normal)

  3. Notify:
     - Page on-call engineer (if not already paged)
     - Email strangler-fig-team distribution list
     - Update #strangler-fig Slack channel

  4. DO NOT:
     - Shut down the modern service (leave it running for diagnostics)
     - Reset the CDC pipeline (it's still needed for other services)
     - Panic (the legacy system has been handling this traffic for 15 years)

POST-ROLLBACK:
  - Conduct root cause analysis within 24 hours
  - Document the failure in the strangler fig journal
  - Fix the issue and re-enter shadow mode for the affected service
  - Resume canary ramp-up only after 7 consecutive days of clean shadow

33.7 When to Stop Strangling

This is the question nobody talks about, and it's arguably more important than how to start. The strangler fig pattern has an implicit assumption: eventually, you'll extract everything and the legacy system will be decommissioned. But that assumption is wrong for most mainframe systems.

33.7.1 The Asymptotic Problem

In practice, strangler fig migrations follow an asymptotic curve. The first 20% of services (by transaction volume) are extracted in the first 30% of the timeline. The next 30% take another 40% of the timeline. The last 50% — the complex, tightly coupled, mission-critical services — would take the remaining 80% of the timeline, except that the timeline is now over budget and over schedule, and the business has stopped seeing ROI from the migration.

This is what Diane Okoye at Pinnacle Health calls "the last mile problem." At Pinnacle, the strangler fig successfully extracted eligibility checking, provider lookup, and claims status inquiry to modern microservices in eighteen months. But claims adjudication — the core engine, 340,000 lines of COBOL with 40 years of health insurance business rules — resisted extraction. Two years of analysis produced a 200-page specification document and the conclusion that rewriting the adjudication engine would cost $45 million and take three years, with no guarantee of regulatory compliance.

Diane's recommendation: "Stop strangling. The adjudication engine stays on CICS. It works. It's fast. It's compliant. We've extracted everything around it that benefits from extraction. The strangler fig's job is done — and the fig has a hollow center where the tree used to be, and that's fine."

33.7.2 Completion Criteria

Design your strangler fig plan with explicit completion criteria. Don't define success as "everything is off the mainframe." Define success as "the business outcomes that motivated the modernization have been achieved."

SecureFirst's completion criteria:

Business Outcome       Metric                                                Target                         Status
Mobile banking parity  Features available on mobile vs. branch               90%                            In progress
API response time      p95 latency for mobile APIs                           <200ms                         Achieved for balance, history
Development velocity   Time to deliver a new feature                         <2 weeks (down from 3 months)  Achieved for extracted services
Operational cost       Annual mainframe MIPS cost reduction                  30%                            18% achieved
Staff flexibility      Developers who can work on both mainframe and modern  60% of team                    45% achieved

Notice: "decommission the mainframe" is not on this list. The mainframe stays. Fund transfer, wire transfer, and loan origination stay on CICS/DB2 — they're working, they're fast, they're compliant, and extracting them would cost more than the savings would justify.

33.7.3 The Hybrid Steady State

The endpoint of most mainframe strangler fig implementations is not "no mainframe." It's a hybrid architecture where:

  • The mainframe handles high-throughput OLTP (fund transfer, payment processing, ledger updates), complex batch processing (nightly posting, regulatory reporting), and services where z/OS provides unique value (Parallel Sysplex availability, WLM, RACF security).
  • Modern services handle consumer-facing APIs (mobile, web), analytics and reporting (data warehouse, ML models), services with high change frequency (promotions, notifications, personalization), and new capabilities that don't exist on the mainframe.
  • The facade/gateway mediates between them, providing a single API surface to consumers regardless of which backend serves the request.

This is the "hybrid architecture" that Chapter 37 will elaborate on. The strangler fig doesn't kill the tree — it creates a symbiotic architecture where each platform handles what it does best.

💡 Key Insight: Know when to stop. A strangler fig migration that extracts 60% of services and achieves 90% of the business value is a success. A strangler fig migration that tries to extract 100% and runs over budget by 200% is a failure — even if it eventually completes. The goal is business value, not architectural purity.

33.7.4 The Decommissioning Decision

For each legacy COBOL module, the decommission decision follows this flowchart:

Is all traffic routed to the modern service?
├── No → Keep the module active
└── Yes → Has parallel running shown 99.99%+ match for 30+ days?
    ├── No → Continue parallel running
    └── Yes → Is any other module dependent on this one?
        ├── Yes → Can the dependent module be modified to use the modern service?
        │   ├── No → Keep the module active as an internal dependency
        │   └── Yes → Modify the dependent, then reassess
        └── No → Move to standby
            └── Has the module been in standby for 90+ days with no issues?
                ├── No → Continue standby
                └── Yes → DECOMMISSION
                    ├── Remove from CICS CSD
                    ├── Archive source code (do NOT delete)
                    ├── Update documentation
                    └── Update RACF profiles

The 90-day standby period is non-negotiable. Sandra Chen at FBA learned this the hard way when a decommissioned module turned out to be called by a quarterly regulatory reporting batch job. The module had been "unused" for 88 days — two days short of the quarter-end batch run that needed it.

⚠️ Common Pitfall: "The module has been in standby for a month with no calls. Let's decommission." One month is not enough. You need to cover all periodic cycles: monthly close, quarterly reporting, annual regulatory submissions, leap year calculations, and fiscal year-end processing. The minimum standby period should be the longest periodic cycle plus a buffer — typically 90-180 days. For cycles longer than that (annual submissions, leap-year logic), no reasonable standby window will observe them, so supplement the wait with static analysis of batch schedules and call chains.


33.8 Applying the Strangler Fig to the HA Banking System

Let's apply everything from this chapter to your progressive project. In Chapter 32, you assessed the HA Banking Transaction Processing System and identified which components to modernize. Now you'll design the strangler fig plan for the first extraction: balance inquiry.

33.8.1 Extraction Scorecard for the HA Banking System

Using the extraction scorecard from Section 33.3, score each component of your HA banking system:

Component             Biz Value  Tech Complex  Data Coupling  Change Freq  Risk Toler.  Priority
Balance Inquiry           5           2             2             4            4          8.00
Transaction History       5           2             3             3            4          7.00
Account Summary           4           2             2             2            4          5.50
Statement Generation      3           3             3             1            3          2.33
Fund Transfer             5           5             5             3            1          2.30
ACH Processing            4           5             5             2            1          2.00
Interest Calculation      3           4             4             1            1          1.50
Regulatory Reporting      2           5             5             1            1          0.90

The extraction order is clear: Balance Inquiry first, then Transaction History, then Account Summary. Fund Transfer and everything below it stays on CICS for now — the complexity and risk aren't justified by the business value of extraction.

33.8.2 The Balance Inquiry Strangler Fig Plan

Here's the specific plan for extracting balance inquiry from your HA banking system. This is what you'll implement in the project checkpoint.

Phase 0: Prepare (Weeks 1-4)

  • Wrap the existing BALINQ CICS transaction as a REST web service using z/OS Connect (you built this in Chapter 21's project checkpoint)
  • Deploy an API gateway (Kong on OpenShift or AWS API Gateway)
  • Set up CDC from DB2 z/OS to PostgreSQL on the target platform
  • Define the API contract (OpenAPI 3.0 specification) for balance inquiry
  • Establish the comparison engine and monitoring dashboard

Phase 1: Build the Modern Service (Weeks 5-10)

  • Implement the balance-inquiry microservice in the target language/platform
  • Use BigDecimal (or a packed-decimal equivalent) for all monetary calculations
  • Implement the same COMMAREA-equivalent contract as the CICS service
  • Unit test against the BALINQ copybook's field specifications
  • Integration test against the PostgreSQL replica
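The BigDecimal mandate in Phase 1 is worth a concrete demonstration. This minimal, self-contained snippet shows the failure mode it prevents: binary doubles drift on decimal cents, while BigDecimal with an explicit scale behaves like packed decimal. (HALF_UP matches the default behavior of COBOL's ROUNDED phrase for positive amounts.)

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class MoneyMath {
    public static void main(String[] args) {
        // Binary floating point cannot represent 0.10 exactly, so repeated
        // addition drifts; this is the off-by-a-penny discrepancy that
        // shadow-mode comparison flags against COMP-3 results.
        double d = 0.0;
        for (int i = 0; i < 10; i++) d += 0.10;
        System.out.println(d);        // 0.9999999999999999, not 1.0

        // BigDecimal with an explicit scale mirrors packed-decimal semantics.
        BigDecimal b = BigDecimal.ZERO;
        for (int i = 0; i < 10; i++) b = b.add(new BigDecimal("0.10"));
        System.out.println(b);        // 1.00

        // HALF_UP rounds half away from zero, like COBOL's ROUNDED phrase
        // for positive amounts.
        BigDecimal interest = new BigDecimal("1234.567")
                .multiply(new BigDecimal("0.0125"))
                .setScale(2, RoundingMode.HALF_UP);
        System.out.println(interest); // 15.43
    }
}
```

Note the string constructor: `new BigDecimal("0.10")`, never `new BigDecimal(0.10)`, which would bake the binary approximation into the decimal value.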

Phase 2: Shadow Mode (Weeks 11-14)

  • Deploy the modern service alongside the legacy
  • Configure the facade to shadow all balance-inquiry traffic to the modern service
  • Run the comparison engine, targeting 99.99% financial field match
  • Investigate and fix all discrepancies
  • Minimum duration: cover one full month-end cycle
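A comparison engine does not need to be elaborate to be useful. The sketch below uses hypothetical class and method names (a production version would call the modern service asynchronously and log every mismatch with its request payload), but it captures the essential contract of shadow mode: the legacy answer is always the one returned, and the modern answer is only ever compared.

```java
import java.math.BigDecimal;
import java.util.function.Function;

/** Minimal shadow-mode comparator; names and wiring are illustrative. */
public class ShadowComparator {
    private long total = 0, matched = 0;

    /** Legacy stays authoritative; the modern result is compared, never returned. */
    public BigDecimal inquire(String account,
                              Function<String, BigDecimal> legacy,
                              Function<String, BigDecimal> modern) {
        BigDecimal legacyBalance = legacy.apply(account);
        try {
            BigDecimal modernBalance = modern.apply(account);
            total++;
            // compareTo ignores scale: 100.1 and 100.10 count as a match.
            if (legacyBalance.compareTo(modernBalance) == 0) matched++;
        } catch (RuntimeException e) {
            total++;  // a modern-side failure is a mismatch, not an outage
        }
        return legacyBalance;  // the caller always sees the legacy answer
    }

    public double matchRate() { return total == 0 ? 1.0 : (double) matched / total; }
}
```

Track `matchRate()` on the monitoring dashboard against the 99.99% target; a modern-side exception degrades the rate without ever touching the customer-facing response.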

Phase 3: Canary Deployment (Weeks 15-20)

  • Ring 0: Internal users, 2 weeks
  • Ring 1: 5% of external users (lowest balances), 1 week
  • Ring 2: 25%, 2 weeks
  • Ring 3: 50%, 2 weeks (with comparison logging)
  • Rollback triggers defined and tested
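Ring routing needs to be deterministic: a customer who lands in Ring 1 should stay on the modern service for the whole ring, not flip between implementations on every request. One common approach, sketched below with a CRC32 hash (any stable hash works; the class name is illustrative), buckets the account number into 0-99 and compares against the current percentage. Rollback is then a single configuration change: set the percentage to zero.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Deterministic percentage routing for canary rings (illustrative sketch). */
public class CanaryRouter {
    private volatile int modernPercent = 0;  // 0 = full rollback, instantly

    public void setModernPercent(int pct) { modernPercent = pct; }

    public boolean routeToModern(String accountNumber) {
        CRC32 crc = new CRC32();
        crc.update(accountNumber.getBytes(StandardCharsets.UTF_8));
        // Bucket 0-99; accounts below the threshold go to the modern service.
        return (crc.getValue() % 100) < modernPercent;
    }
}
```

Because the bucket depends only on the account number, moving from Ring 1 to Ring 2 strictly widens the modern population; no customer who was already on the modern service gets bounced back to legacy by the expansion.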

Phase 4: Full Migration (Weeks 21-24)

  • 100% of traffic to modern service
  • Legacy CICS BALINQ on standby (still receiving shadow traffic for comparison)
  • CDC pipeline continues (other services still use DB2 master)

Phase 5: Decommission (Week 24 + 90 days)

  • After 90 days of standby with no issues, decommission the CICS BALINQ program
  • Archive source code and COMMAREA copybook
  • Update CSD, RACF profiles, and documentation
  • Remove the z/OS Connect service definition for balance inquiry
  • Celebrate — but quietly, because fund transfer is next

33.8.3 Risk Register

Every strangler fig plan needs a risk register. Here's the one for the HA banking system balance-inquiry extraction:

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| CDC lag spike during batch window causes stale balances in modern service | High | Medium | Implement freshness indicator; alert on lag >30s; route to legacy during batch window if lag exceeds SLA |
| COMP-3 vs. floating-point rounding differences | High | High | Mandate BigDecimal/packed-decimal equivalents in modern service; validate in shadow mode before canary |
| Edge case in balance calculation not discovered until canary | Medium | High | Extend shadow mode to cover month-end and quarter-end; compare 100% of requests, not a sample |
| API gateway becomes single point of failure | Low | Critical | Deploy gateway in HA configuration (multiple instances, health checks, auto-failover); define bypass route directly to z/OS Connect |
| Team lacks experience with both mainframe and modern platforms | Medium | Medium | Pair Kwame (mainframe) with a Carlos-equivalent (modern) on every extraction; knowledge transfer is a project deliverable, not a side effect |
| Regulatory concern about data residency/sovereignty during CDC | Low | High | Confirm with compliance that CDC replication complies with data residency requirements; document the data flow for auditors |
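The first mitigation in the register, the freshness indicator, reduces to a small amount of code. The sketch below is illustrative (the class name and wiring are assumptions): the routing layer consults the replica's last-applied CDC timestamp and falls back to the legacy path whenever lag exceeds the 30-second alert threshold.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative freshness guard for the CDC-lag risk. */
public class FreshnessGuard {
    private final Duration maxLag;

    public FreshnessGuard(Duration maxLag) { this.maxLag = maxLag; }

    /** lastApplied: commit timestamp of the newest CDC event on the replica. */
    public boolean replicaIsFresh(Instant lastApplied, Instant now) {
        return Duration.between(lastApplied, now).compareTo(maxLag) <= 0;
    }

    /** Route to the modern service only when the replica is inside the SLA. */
    public String chooseRoute(Instant lastApplied, Instant now) {
        return replicaIsFresh(lastApplied, now) ? "modern" : "legacy";
    }
}
```

A 47-second lag like the one in the chapter-opening incident would fail this check against a 30-second SLA and send the inquiry back to CICS, which is exactly the behavior that would have prevented the mismatched balances.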

33.9 Real-World Patterns and Anti-Patterns

Twenty-five years of watching modernization projects has given me a fairly reliable list of what works and what doesn't. Let me share the patterns that have saved projects and the anti-patterns that have killed them.

Patterns That Work

Pattern: Start with the API, not the service.

Before extracting a single service, expose the entire legacy system through an API facade (Chapter 21). This gives consumers a modern interface immediately, decouples them from the mainframe's internal structure, and establishes the contract layer that the strangler fig's routing engine will use. Many organizations discover that exposing COBOL through APIs is 80% of the business value they were looking for, and the urgency to extract individual services decreases.

Pattern: Extract read services first, write services later.

Read-only services (balance inquiry, transaction history, account summary) have dramatically simpler data synchronization requirements than write services (fund transfer, payment processing). Extract all the read services first. By the time you get to write services, your CDC pipeline is battle-tested, your team has learned the extraction lifecycle, and your comparison engine is proven.

Pattern: Keep the legacy system funded and maintained.

The worst thing you can do during a strangler fig migration is treat the legacy system as "dead." It's not dead — it's the safety net. Keep maintaining it. Keep patching it. Keep the COBOL developers on staff. The moment the legacy system becomes a neglected afterthought, you lose your rollback capability, and the strangler fig becomes a high-wire act without a net.

Pattern: Celebrate each extraction as a milestone.

Strangler fig migrations take years. If the only milestone is "done," the team will lose morale two years before they get there. Celebrate each service extraction as a completed project. SecureFirst threw a team dinner when balance inquiry went to 100% modern. Yuki bought Carlos a bonsai tree with a note: "One service down. It's growing."

Anti-Patterns That Kill

Anti-pattern: Extracting multiple services simultaneously.

"We'll save time by extracting balance inquiry, transaction history, and account summary in parallel." No, you won't. You'll triple your complexity, triple your data synchronization challenges, and triple your debugging surface area when something goes wrong. Extract one service at a time. The second extraction will go twice as fast as the first because you've learned the process.

Anti-pattern: Skipping shadow mode.

"Our unit tests have 95% coverage, so we'll go straight to canary." Unit tests cover the cases you thought of. Shadow mode covers the cases you didn't. The seven accounts with judicial garnishment holds at SecureFirst would never have been caught by unit tests — no one thought to write a test for that edge case because no one knew it existed.

Anti-pattern: Building the facade as a monolith.

If your API gateway is a hand-built Java application with custom routing logic, business rule validation, data transformation, and monitoring — congratulations, you've built a new legacy system. The facade should be as thin as possible, using commodity infrastructure (Kong, Apigee, z/OS Connect) that your team doesn't have to maintain.

Anti-pattern: Defining success as "mainframe off."

This one kills more projects than all the technical anti-patterns combined. If your definition of success requires the mainframe to be decommissioned, you will either fail (because the last 30% of services resist extraction) or succeed at catastrophic cost. Define success in business terms: API response times, development velocity, operational cost, staff flexibility. Let the architecture be whatever achieves those outcomes.

🔗 Forward Reference: Chapter 37 (Hybrid Architecture) will bring together the strangler fig's endpoint, the cloud integration patterns from Chapter 34, and the DevOps pipeline from Chapter 36 into a unified architectural vision. The strangler fig is one execution pattern; Chapter 37 shows how all the patterns compose into a sustainable long-term architecture.


Chapter Summary

The strangler fig pattern is the most important execution pattern for mainframe modernization. It transforms a single, catastrophic bet (big-bang migration) into a series of small, reversible experiments (incremental service extraction). The pattern has four components:

  1. The facade — a stateless, observable, reversible routing layer that sits in front of both legacy and modern services
  2. The routing engine — the decision-making component that controls which service handles each request
  3. The extraction pipeline — the lifecycle from identification through shadow mode, parallel running, canary deployment, and decommission
  4. The data synchronization layer — the hardest part, using CDC, dual-write, or shared database patterns to keep legacy and modern data consistent

The key lessons from SecureFirst's implementation:

  • Start with read-only services — they have simpler data synchronization and higher risk tolerance
  • Use the extraction scorecard to choose candidates based on priority (business value and change frequency divided by technical complexity and data coupling), not just business value alone
  • Shadow mode testing with production traffic is non-negotiable — it discovers the edge cases that unit tests can't
  • Design for rollback — every phase must be reversible in under 60 seconds
  • Know when to stop — the hybrid steady state is the realistic endpoint for most organizations

The strangler fig doesn't kill the tree. It creates a partnership between the old and the new, where each component runs on the platform that serves it best. That's not a compromise — it's good architecture.


What's Next

Chapter 34 (COBOL-to-Cloud Integration) builds on the strangler fig by addressing the specific challenges of running extracted services in cloud environments — container packaging, cloud-native data stores, network architecture between z/OS and cloud, and the cost models that determine whether cloud hosting actually saves money. The strangler fig gets the service out of CICS; Chapter 34 gets it into the cloud.


Chapter 33 of 40. Progressive project checkpoint: code/project-checkpoint.md. Estimated study time: 5 hours. Recommended approach: Read Sections 33.1-33.3 in one session, 33.4-33.5 in a second session, 33.6-33.9 in a third session.