Chapter 33 Key Takeaways: Strangler Fig Pattern for Mainframes
Core Principle
The strangler fig pattern transforms mainframe modernization from a single, catastrophic bet into a series of small, reversible experiments. There is no "go-live day." There is no "cutover weekend." There is a gradual, measurable, reversible transfer of traffic from legacy to modern, one service at a time, over months or years. If a new service has a problem, you route traffic back to the legacy system in seconds.
The Three Architectural Components
| Component | Role | Key Property |
|---|---|---|
| Facade | Single entry point for all consumers; routes to legacy or modern | Transparent, stateless, observable, reversible |
| Routing Engine | Decides which service handles each request | Configurable (path, percentage, user, time, feature toggle) |
| Data Synchronization | Keeps legacy and modern data consistent during coexistence | The hardest part — source of most production incidents |
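The facade plus routing engine can be sketched as a small, stateless routing function. A minimal sketch, assuming illustrative paths, percentages, and names (none of these come from the chapter):

```python
import hashlib

# Illustrative routing table: percentage of traffic per path that goes to
# the modern service. Transfers stay fully on legacy in this example.
ROUTING_TABLE = {
    "/accounts/balance": 25,   # 25% of balance inquiries -> modern
    "/payments/transfer": 0,   # fund transfer stays on legacy
}

def route(path: str, user_id: str) -> str:
    """Decide which backend serves this request: 'legacy' or 'modern'."""
    percent_modern = ROUTING_TABLE.get(path, 0)  # default: legacy
    # Stable hash of the user ID pins each user to one backend at a given
    # percentage, so ramping 10% -> 25% -> 50% never flip-flops a user.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "modern" if bucket < percent_modern else "legacy"
```

Hashing on a stable key (rather than random sampling per request) keeps each consumer on one backend during the ramp, which makes comparison results and rollbacks much cleaner.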
Extraction Scorecard
```
           (Business Value × 3) + (Change Frequency × 2) + Risk Tolerance
Priority = ──────────────────────────────────────────────────────────────
                     Technical Complexity + Data Coupling
```
Extraction order rule: Start with read-only services that have high business value, low complexity, and low data coupling. Balance inquiry is almost always the first candidate. Fund transfer is almost always the last.
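As a sanity check on the scorecard, the formula can be computed directly. The 1-to-5 ratings below are made-up illustrations, not figures from the chapter:

```python
def priority_score(business_value, change_frequency, risk_tolerance,
                   technical_complexity, data_coupling):
    """Extraction priority from the scorecard formula (higher = extract sooner).
    All inputs are ratings on an illustrative 1-5 scale."""
    return (business_value * 3 + change_frequency * 2 + risk_tolerance) / \
           (technical_complexity + data_coupling)

# Balance inquiry: high value, low complexity, low coupling -> high score
balance_inquiry = priority_score(5, 4, 4, 2, 2)   # 6.75
# Fund transfer: high value but complex and tightly coupled -> low score
fund_transfer = priority_score(5, 3, 1, 5, 5)     # 2.2
```

Note how fund transfer's high business value is swamped by its complexity and coupling in the denominator, which is exactly why it goes last.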
Data Synchronization Patterns
| Pattern | When to Use | Data Consistency | Complexity |
|---|---|---|---|
| Legacy as Source of Truth | Phases 1-3, read-only services | Eventual (CDC lag) | Low |
| Dual-Write with Legacy Priority | Phase 3+, write services | Eventual + compensating transactions | High |
| Event Sourcing | Greenfield only (not brownfield) | Event-log-based | Very high |
| Shared Database | Transitional only | Immediate (same DB) | Low, but limits evolution |
Start with Pattern 1. Move to Pattern 2 only when you need write capability in the modern service, and only after the CDC pipeline has been battle-tested.
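A minimal sketch of Pattern 2's failure handling, assuming hypothetical write functions and an asynchronous repair queue (the real compensation logic depends entirely on the service):

```python
def dual_write(record, write_legacy, write_modern, repair_queue):
    """Dual-write with legacy priority: the legacy write is authoritative."""
    # If the legacy (source-of-truth) write fails, the whole request fails
    # and nothing needs compensating, because modern was never written.
    write_legacy(record)
    try:
        write_modern(record)
    except Exception:
        # Modern write failed after legacy succeeded: queue the record so a
        # reconciliation job can bring the modern store back in sync, rather
        # than failing a request the source of truth already accepted.
        repair_queue.append(record)
```

Ordering legacy-first is what makes this "legacy priority": a modern-side outage degrades to Pattern 1 behavior (legacy plus catch-up) instead of losing writes.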
The Extraction Pipeline Lifecycle
IDENTIFY → UNDERSTAND → BUILD → SHADOW → PARALLEL → CANARY → MIGRATE → DECOMMISSION
Critical phases:
- UNDERSTAND: Map every business rule, every edge case, every undocumented behavior. This is the knowledge capture phase. It may be the most valuable output of the entire project.
- SHADOW: Route copies of production traffic to the modern service; compare responses. Non-negotiable — it finds edge cases that no test suite can.
- DECOMMISSION: Only after 90+ days of standby with no issues. Archive source code; never delete it.
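The SHADOW phase's core mechanic can be sketched in a few lines, with hypothetical call and log objects standing in for real infrastructure:

```python
def handle_shadowed(request, call_legacy, call_modern, comparison_log):
    """Shadow mode: the caller only ever sees the legacy response."""
    legacy_resp = call_legacy(request)
    try:
        modern_resp = call_modern(request)
        # Log the pair plus a match flag for the comparison engine; the
        # modern response itself is discarded, never returned.
        comparison_log.append(
            (request, legacy_resp, modern_resp, legacy_resp == modern_resp))
    except Exception:
        # A modern-service failure must never affect the caller in shadow
        # mode; it is just recorded as a mismatch.
        comparison_log.append((request, legacy_resp, None, False))
    return legacy_resp
```

The invariant worth testing is that nothing the modern service does, including crashing, can change what the consumer receives.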
Five Natural Seams in COBOL/CICS
- CICS transaction boundaries — route the transaction elsewhere
- COMMAREA / channel boundaries — the service contract
- Copybook boundaries — isolated data structures = clean extraction
- DB2 view boundaries — change what's behind the view
- MQ queue boundaries — change the consumer
If you can't find clean seams, create them through pre-extraction refactoring.
Shadow Mode and Parallel Running
| Phase | What Happens | Duration |
|---|---|---|
| Shadow mode | Modern service gets copies of traffic; responses discarded; comparison engine logs results | 2-4 weeks minimum (one full monthly cycle) |
| Parallel running | Both services handle real traffic; primary's response goes to consumer; comparison continues | 4-8 weeks with gradual traffic shift |
| Canary deployment | Controlled subset of real users see modern service's response | 4-12 weeks across escalating rings |
Comparison rule for financial data: Exact to the penny. No tolerance. A one-cent rounding difference means the systems disagree about account state, and that disagreement compounds into regulatory findings.
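The exact-to-the-penny rule is straightforward to enforce if amounts are compared as decimals parsed from their wire (string) form, never as floats. A minimal sketch with an illustrative function name:

```python
from decimal import Decimal

def amounts_match(legacy_amount: str, modern_amount: str) -> bool:
    """Exact comparison of monetary amounts: no tolerance, no epsilon.

    Parsing from the string form avoids ever materializing the value as a
    binary float, so 100.10 really means 100.10 on both sides.
    """
    return Decimal(legacy_amount) == Decimal(modern_amount)
```

Decimal equality compares numeric value, so "100.1" and "100.10" match, while a one-cent difference never does.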
Critical Technical Lessons
COMP-3 vs. Floating-Point
COBOL COMP-3 (packed decimal) stores monetary values exactly. IEEE 754 floating-point does not. Use BigDecimal (Java/Kotlin) or Decimal (Python) for all monetary calculations in the modern service. This is a guaranteed production incident if you use float or double.
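The failure mode is easy to demonstrate: summing a few dimes in binary floating point already drifts, while Decimal (which, like COMP-3, is base-10 exact) does not:

```python
from decimal import Decimal

# IEEE 754 cannot represent 0.10 exactly, so repeated additions accumulate
# binary rounding error...
float_total = sum([0.10] * 3)                            # 0.30000000000000004

# ...while Decimal, like COBOL COMP-3 packed decimal, stays exact.
exact_total = sum([Decimal("0.10")] * 3, Decimal("0"))   # Decimal('0.30')
```

Note that the Decimal values are constructed from strings; `Decimal(0.10)` would faithfully preserve the float's error instead of avoiding it.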
CDC Latency During Batch Windows
CDC from DB2 works perfectly until the batch window opens and the recovery log generates gigabytes per hour. Plan for 30-60 second latency spikes during batch processing. Options: freshness indicators, time-based routing to legacy during batch, or accepting the latency.
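The time-based routing option reduces to a guard in the routing engine. A sketch with made-up batch-window boundaries:

```python
from datetime import datetime, time

# Illustrative batch window; in practice this comes from the scheduler.
BATCH_START = time(1, 0)   # 01:00
BATCH_END = time(5, 0)     # 05:00

def read_backend(now: datetime) -> str:
    """Serve reads from legacy during the batch window, when CDC lag can
    spike to 30-60 seconds, and from the modern service otherwise."""
    in_batch = BATCH_START <= now.time() < BATCH_END
    return "legacy" if in_batch else "modern"
```

The same shape works for a freshness indicator: replace the clock check with a comparison of the CDC pipeline's reported lag against a threshold.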
Correlation ID Propagation
Every request through the facade must carry a correlation ID that flows through to both legacy and modern services. Without it, you can't match responses in the comparison engine.
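At the facade this is a one-liner per request: reuse an inbound ID or mint one, then attach the same ID to both downstream calls. A sketch with an illustrative header name:

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # illustrative header name

def with_correlation_id(headers: dict) -> dict:
    """Return a copy of the request headers guaranteed to carry a
    correlation ID: reuse the caller's if present, otherwise mint one.
    The same header dict is then sent to both legacy and modern, which is
    what lets the comparison engine pair their responses."""
    out = dict(headers)
    out.setdefault(CORRELATION_HEADER, str(uuid.uuid4()))
    return out
```

Reusing an inbound ID (rather than always minting a new one) also lets upstream consumers trace a request end to end across the facade.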
When to Stop Strangling
- Define success in business terms (API response time, development velocity, cost reduction, staff flexibility) — not as "mainframe off"
- The hybrid steady state is the realistic endpoint: mainframe for high-throughput OLTP and complex batch; modern services for consumer-facing APIs and high-change-frequency capabilities
- The asymptotic problem: the first 50% of services take 30% of the effort; the last 30% would take 70% of the effort and may not be worth extracting
- Standby before decommission: 90-180 days minimum, covering all periodic cycles (monthly, quarterly, annual)
Patterns That Work
- Start with the API, not the service (expose everything through a facade first)
- Extract read services first, write services later
- Keep the legacy system funded and maintained (it's your safety net)
- Celebrate each extraction as a milestone
- Pair mainframe and modern developers on every extraction
Anti-Patterns That Kill
- Extracting multiple services simultaneously
- Skipping shadow mode
- Building the facade as a monolith (with business logic, transformations, orchestration)
- Defining success as "mainframe off"
- Picking the highest-value service first instead of the highest-priority-score service
The One Sentence Summary
The strangler fig doesn't kill the tree — it creates a hybrid where each component runs on the platform that serves it best, and that's not a compromise; it's good architecture.