Case Study 1: MidWest Mutual's Failed COBOL-to-Cloud Migration — The $47 Million Lesson

Background

MidWest Mutual Life Insurance is a Tier-2 insurer based in Des Moines, Iowa, with 4.2 million policyholders, $18 billion in assets under management, and a mainframe running IBM z15 hardware with 5,800 MIPS. The mainframe processes 1.8 million policy administration transactions per month through CICS, runs 6,400 batch jobs nightly, and manages 4.2TB of DB2 data and 1.8TB of VSAM files.

In January 2023, MidWest Mutual's board approved a three-year, $52 million program to "migrate the mainframe to the cloud." The stated goals were:

  1. Eliminate the $24 million annual mainframe operating cost
  2. Enable faster time-to-market for new insurance products
  3. Attract younger technology talent by offering a modern technology stack
  4. Reduce vendor lock-in with IBM

The program was led by Thomas Harrington, a newly hired CTO with a strong AWS background and no mainframe experience. The consulting partner was a major systems integrator whose mainframe modernization practice had grown 300% in three years. Neither Thomas nor the consulting partner had successfully completed a mainframe migration of this scale.

The Plan

The approved plan had four phases:

Phase 1 (Months 1-6): Foundation
  • Establish AWS landing zone
  • Set up Micro Focus Enterprise Server on EC2
  • Migrate 3 development environments to cloud
  • POC: migrate 10 batch programs

Phase 2 (Months 7-18): Batch Migration
  • Migrate all 6,400 batch jobs to AWS
  • Convert VSAM files to Amazon Aurora PostgreSQL
  • Implement AWS Step Functions for job scheduling
  • Parallel run for 3 months

Phase 3 (Months 19-30): Online Migration
  • Migrate all CICS transactions to Micro Focus Enterprise Server on EC2
  • Convert BMS maps to web-based UI
  • Implement API gateway for external integrations
  • Parallel run for 3 months

Phase 4 (Months 31-36): Decommission
  • Turn off the mainframe
  • Terminate IBM contracts
  • Achieve full cloud operations

The timeline, the budget, and the plan were all approved based on the vendor's proposal.

What Actually Happened

Phase 1: Success Breeds Overconfidence (Months 1-8)

Phase 1 went well — better than expected, in fact. The three development environments were running on AWS by month 4. The 10-program batch POC was completed by month 6. The POC programs compiled, ran, and produced correct output on Micro Focus Enterprise Server.

Thomas presented the Phase 1 results to the board with visible pride. "We've proven the platform. COBOL runs on cloud. Phase 2 is a scaling exercise."

Ellen Park, MidWest Mutual's head of actuarial systems and a 22-year mainframe veteran, raised a concern during the board meeting: "The 10 POC programs were all simple sequential read-and-calculate programs. They don't touch CICS, they don't use GDGs, and they don't have inter-program CALL chains. The hard batch programs are the 380 that use SORT exits, dynamic CALLs, and VSAM alternate indexes. Those weren't in the POC."

Thomas's response: "Phase 2 will handle those. The platform is proven."

Ellen later told her team: "He tested the bridge with a bicycle and concluded it would hold the truck."

Lesson 1: A POC that validates the easy 20% tells you nothing about the hard 80%. Design your POC to include the most complex programs, not the simplest.

Phase 2: The Batch Migration Wall (Months 7-22)

Phase 2 began on schedule and fell behind within six weeks. The problems were cumulative, technical, and relentless.

Problem 1: VSAM-to-PostgreSQL Conversion (4 months behind)

The plan called for converting 1.8TB of VSAM files to PostgreSQL tables. The conversion tool handled the data transformation — EBCDIC to UTF-8, packed decimal to numeric, COMP to integer — but could not handle:

  • Alternate indexes. 340 VSAM files had alternate indexes (AIX). The COBOL programs did READ with ALTERNATE KEY, which maps to a different access pattern than a PostgreSQL secondary index. The Micro Focus VSAM emulation supported primary keys and one alternate key per file. 47 files had two or more alternate keys. Each required manual redesign.

  • REDEFINES in VSAM records. 128 VSAM files used REDEFINES clauses in their record layouts — the same physical bytes interpreted differently depending on a record-type field. The PostgreSQL conversion had to either: (a) create separate tables for each record type (breaking programs that read all record types in a single COBOL READ), or (b) store the data as a BLOB and interpret it in the COBOL program (losing queryability). Neither option was clean.

  • Relative record datasets (RRDS). 23 VSAM files were RRDS — accessed by relative record number, not by key. PostgreSQL doesn't have a native relative-record concept. The conversion team implemented a workaround using auto-incrementing integer primary keys, but programs that calculated record numbers using arithmetic (COMPUTE WS-RECNUM = WS-ACCOUNT-NUM / 1000) produced wrong results because the PostgreSQL row IDs didn't match the VSAM relative record numbers.
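
The RRDS failure mode can be sketched in a few lines. This is a hypothetical illustration (the account numbers and dataset contents are invented): on VSAM, the program computes a record's slot directly from the account number, but after migration the PostgreSQL auto-increment IDs were assigned in load order, so the same arithmetic lands on the wrong row or on no row at all.

```python
# Hypothetical sketch of the RRDS mismatch; account numbers are invented.

def vsam_slot(account_num: int) -> int:
    """COBOL: COMPUTE WS-RECNUM = WS-ACCOUNT-NUM / 1000 (integer division)."""
    return account_num // 1000

# RRDS: the record physically lives at the computed slot number.
accounts = (1500, 2500, 9500)
vsam_file = {vsam_slot(a): f"policy-{a}" for a in accounts}

# Migrated table: rows keyed by auto-increment IDs assigned in load order.
pg_table = {row_id: f"policy-{a}" for row_id, a in enumerate(accounts, start=1)}

slot = vsam_slot(9500)          # 9
print(vsam_file[slot])          # policy-9500  -- correct on VSAM
print(pg_table.get(slot))       # None -- row IDs are 1..3; slot 9 doesn't exist
```

The workaround of auto-incrementing keys preserves uniqueness but not the arithmetic relationship the programs depended on, which is exactly why those programs produced wrong results.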

Problem 2: JCL Complexity (3 months behind)

Of the 6,400 batch jobs, 4,100 used only features that the Micro Focus JCL interpreter handled correctly. The remaining 2,300 required manual intervention:

  • GDG (Generation Data Group) dependencies. 890 jobs used GDGs with inter-job dependencies — Job A creates GDG(+1), Job B reads GDG(0). The Micro Focus GDG emulation worked for simple linear chains but failed for GDG management where downstream jobs conditionally referenced different generations based on return codes from upstream jobs.

  • DFSORT exit routines. 147 jobs used DFSORT with user-written exit routines (E15, E35) in COBOL or Assembler. Micro Focus's SORT emulation supported standard SORT control statements but not all exit routine interfaces. Each exit routine had to be manually analyzed and, in many cases, rewritten.

  • Conditional JCL. 412 jobs used complex IF/THEN/ELSE/ENDIF logic that referenced system symbols, return codes from previous steps, and generation data set names. The edge cases in conditional evaluation — particularly around COND parameters combined with IF statements — differed between JES2 and the Micro Focus interpreter.

  • Cataloged procedures with overrides. 680 jobs invoked cataloged procedures (PROCs) with DD statement overrides. The override resolution logic worked correctly for simple cases but produced different results when overriding concatenated DD statements or when the PROC itself invoked another PROC (nested PROC calls with overrides at multiple levels).
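
The GDG failure mode can be modeled in a few lines. This is a deliberately simplified sketch (not the Micro Focus implementation; the dataset name is invented): linear chains resolve cleanly, but the hard case is a downstream job whose choice between GDG(0) and GDG(-1) depends on an upstream job's return code — a decision that lives in scheduler and JCL condition logic, outside the GDG itself.

```python
# Simplified model of GDG relative references; dataset name is hypothetical.

class GDG:
    def __init__(self, base: str):
        self.base = base
        self.generations: list[str] = []          # oldest .. newest

    def resolve(self, rel: int) -> str:
        """GDG(+n) creates a new generation; GDG(0) is newest, GDG(-1) prior."""
        if rel > 0:
            name = f"{self.base}.G{len(self.generations) + rel:04d}V00"
            self.generations.append(name)
            return name
        return self.generations[rel - 1]          # 0 -> newest, -1 -> previous

gdg = GDG("PROD.CLAIMS.DAILY")
gdg.resolve(+1)                                   # yesterday's run made G0001
job_a_out = gdg.resolve(+1)                       # Job A creates GDG(+1): G0002
job_b_in = gdg.resolve(0)                         # Job B reads GDG(0): G0002

# The hard case: Job C reads GDG(0) on success but falls back to GDG(-1)
# when Job A failed -- a linear-chain GDG emulation cannot express this.
job_a_rc = 8
job_c_in = gdg.resolve(0 if job_a_rc == 0 else -1)   # falls back to G0001
```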

Problem 3: Data Conversion Errors (ongoing)

The EBCDIC-to-ASCII conversion was applied to all data files. Most conversions were correct. The exceptions were devastating:

  • Packed decimal with non-standard signs. 12 VSAM files contained packed decimal fields where the sign nibble was X'A' (EBCDIC positive variant) instead of the standard X'C'. These fields had been created by Assembler programs written in 1987. The conversion tool treated X'A' as an invalid sign and converted the fields to zero. The error wasn't detected until a monthly actuarial run produced reserve calculations that were $340 million too low. (Caught during parallel run, thankfully.)

  • Mixed EBCDIC/binary records. 8 VSAM files contained records with both EBCDIC text fields and binary data fields (COMP, COMP-1, COMP-2). The conversion tool applied EBCDIC-to-ASCII conversion to the entire record, corrupting the binary fields. The fix — applying conversion selectively based on the copybook field definitions — required custom conversion logic for each file.

  • Codepage-specific characters. The actuarial system used Code Page 037 (US English EBCDIC). The Canadian business used Code Page 037 for English and Code Page 500 for French. 14 programs switched between code pages using the SET CODEPAGE statement. The conversion tool assumed a single code page. French-language policy documents came through garbled.
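
The packed-decimal failure is the most mechanical of the three, and a tolerant decoder is short. This sketch assumes standard COMP-3 layout (two BCD digits per byte, sign in the final nibble): X'C' and X'D' are the preferred positive and negative signs, but X'A', X'E', and X'F' are also valid positive variants and X'B' a negative one — the tool that zeroed the X'A' fields rejected exactly these.

```python
# Tolerant packed-decimal (COMP-3) decoder accepting all valid sign nibbles.

POSITIVE = {0xA, 0xC, 0xE, 0xF}   # X'F' is the "unsigned" variant
NEGATIVE = {0xB, 0xD}

def unpack_comp3(data: bytes) -> int:
    digits = []
    for b in data:
        digits.append(b >> 4)
        digits.append(b & 0x0F)
    sign_nibble = digits.pop()            # last nibble carries the sign
    if any(d > 9 for d in digits):
        raise ValueError("non-decimal digit nibble")
    value = int("".join(map(str, digits)))
    if sign_nibble in NEGATIVE:
        return -value
    if sign_nibble in POSITIVE:
        return value
    raise ValueError(f"invalid sign nibble {sign_nibble:#x}")

# X'12345C' -> +12345 (standard); X'12345A' -> +12345 (the 1987 variant)
assert unpack_comp3(bytes.fromhex("12345C")) == 12345
assert unpack_comp3(bytes.fromhex("12345A")) == 12345
assert unpack_comp3(bytes.fromhex("12345D")) == -12345
```

A decoder like this makes the class of error visible early: treating X'A' as invalid (or silently as zero) is a one-line difference that, at MidWest Mutual, surfaced as a $340 million discrepancy months later.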

Problem 4: Batch Window Expansion (never resolved)

On z/OS, the nightly batch window was 4 hours 15 minutes, managed by CA7 with a finely tuned dependency graph and WLM-prioritized execution. On AWS:

  • Sequential batch programs ran 1.5-2x slower due to I/O differences (same pattern as CNB's reporting)
  • Parallel batch programs ran 2-4x slower because the cloud environment couldn't match the z/OS I/O subsystem's ability to service concurrent random I/O from multiple address spaces
  • The CA7-equivalent scheduling (implemented using a combination of AWS Step Functions and Apache Airflow) took 5 months to build and still couldn't express all the conditional dependencies in the original CA7 schedule

The nightly batch window on AWS: 11 hours 40 minutes. Nearly three times the mainframe window. This meant batch was still running when the online window started, creating contention for database resources that didn't exist on z/OS (where WLM separated batch and online workloads).

Phase 2 status at month 22: 60% of batch jobs migrated and validated. Original target: 100% by month 18.

Phase 3: The CICS Migration That Never Started (Months 19-28)

Phase 3 was supposed to begin at month 19. It didn't begin until month 24 because Phase 2 wasn't complete. When the team finally turned to CICS migration, they encountered problems that were fundamentally different from the batch challenges.

The Transaction Volume Problem.

MidWest Mutual's CICS environment processed 1.8 million transactions per month — an average of roughly 2,500 per hour, with peaks of 3,500/hour during enrollment periods. On z/OS CICS, the p99 response time was 4.2ms.

On Micro Focus Enterprise Server (EC2 r6i.8xlarge), the same transactions ran at 45-80ms p99. This was above MidWest Mutual's SLA of 50ms for agent-facing transactions.

The team attempted tuning: larger instances, faster storage, connection pooling, JVM memory optimization (for the Micro Focus runtime). After three months of tuning, p99 was 52ms — just barely above the SLA.

The BMS Map Problem.

MidWest Mutual had 1,247 BMS maps for its 3270 terminal interface. The plan was to convert these to web-based interfaces. The conversion tool produced HTML pages that functioned — fields appeared, data could be entered, transactions could be invoked — but:

  • The HTML rendering didn't match the 3270 layout precisely, which confused the 800 insurance agents who had memorized field positions
  • PF key shortcuts (PF3 = exit, PF5 = refresh, PF7/PF8 = page up/down) didn't translate cleanly to browser interactions
  • The conversion didn't handle SEND MAP ERASE/SEND MAP ACCUM patterns correctly — screens that were built incrementally with multiple SEND MAP calls rendered incorrectly
  • Performance: the BMS emulation added 15-20ms per screen interaction on top of the transaction latency

The Two-Phase Commit Problem.

34 critical CICS transactions coordinated updates across DB2 and MQ using two-phase commit. On z/OS, the CICS recovery manager coordinated with the DB2 and MQ resource managers through the RRS (Resource Recovery Services) protocol.

On Micro Focus Enterprise Server, two-phase commit was supported for local database connections but not for the combination of PostgreSQL (the migrated DB2) and a cloud-hosted MQ equivalent. The team investigated three options:

  1. Implement XA transactions across PostgreSQL and Amazon MQ — technically possible but untested at MidWest Mutual's scale
  2. Rewrite the 34 transactions to use eventual consistency (saga pattern) — a fundamental architectural change requiring business logic redesign
  3. Keep the 34 critical transactions on the mainframe and route to them via API — which meant keeping the mainframe running
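
Option 2 is worth a concrete sketch, because "saga pattern" hides the real cost. This is a minimal, hypothetical illustration (the step names are invented, not MidWest Mutual's code): each step pairs an action with a compensation, and on failure the completed steps are undone in reverse order — which means the business logic must define a valid "undo" for every step, the redesign work the team balked at.

```python
# Minimal saga sketch: (action, compensation) pairs, undone in reverse on failure.
from typing import Callable

def run_saga(steps: list[tuple[Callable[[], None], Callable[[], None]]]) -> bool:
    done: list[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for comp in reversed(done):   # undo completed steps, newest first
                comp()
            return False
        done.append(compensate)
    return True

log: list[str] = []

def debit() -> None:
    log.append("debit premium account")

def undo_debit() -> None:
    log.append("credit premium account back")

def put_mq() -> None:
    raise RuntimeError("MQ put failed")   # simulate the second resource failing

ok = run_saga([(debit, undo_debit), (put_mq, lambda: None)])
# ok is False; log == ["debit premium account", "credit premium account back"]
```

Unlike two-phase commit, intermediate states are visible to other readers until the compensation runs — acceptable for some workflows, but a genuine semantic change for premium payments and claims disbursements.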

Phase 3 status at month 28: 15% of CICS transactions migrated (the simple ones). The 34 two-phase-commit transactions were deferred. The BMS conversion was 40% complete. Thomas Harrington, under pressure from the board, authorized spending an additional $8M to bring in a second consulting firm specializing in CICS migration.

The Cancellation (Month 30)

At month 30, the board called for a full program review. The numbers were:

Category                  Budget    Actual Spend    Projected to Complete
Phase 1 (Foundation)      $4.2M     $3.8M           $3.8M (complete)
Phase 2 (Batch)           $18M      $22.4M          $26M (projected)
Phase 3 (Online)          $22M      $12.6M          $38M (projected, 15% done)
Phase 4 (Decommission)    $3M       $0              $3M (unchanged)
Additional consulting     $0        $4.2M           $8M (second firm)
TOTAL                     $47.2M    $43.0M          $75M+

The projected cost to complete had risen from $52M to $75M or more — and the revised timeline was 48-54 months, not 36. Meanwhile:

  • The mainframe was still running at full capacity (Phase 2 batch was running in parallel, not as a replacement)
  • Annual mainframe costs were still $24M/year (no savings yet)
  • The dual-environment operational cost (mainframe + cloud) was $31M/year
  • The batch window on cloud was 11 hours 40 minutes vs. 4 hours 15 minutes on mainframe
  • 34 critical CICS transactions had no cloud migration path without architectural redesign

Ellen Park presented her analysis to the board:

"We've spent $43 million. We have a partially working batch environment on cloud that runs three times slower than the mainframe. We have 15% of our CICS transactions migrated, all of them the simple ones. The 34 transactions that handle premium payments, claims disbursements, and policy issuance — the transactions that *are* our business — can't be migrated without redesigning our transaction architecture. The estimated cost to complete is $75M, and that estimate will go up because it's based on the same methodology that produced the original $52M estimate."

She paused. "I recommend we stop the cloud migration for online systems, keep the batch workloads that are working on cloud, bring CICS back to the mainframe, and use the remaining budget for API-wrapping the mainframe transactions so our digital channels can consume them."

The board adopted Ellen's recommendation. Thomas Harrington resigned two weeks later.

Post-Mortem: What Went Wrong

Ellen conducted a formal post-mortem with the team. The findings:

Root Cause 1: Wrong Mental Model

The program treated the mainframe as a single, monolithic system to be replaced. It was actually an ecosystem of tightly integrated subsystems — CICS transaction manager, DB2 database, MQ messaging, VSAM file system, JES2 job management, WLM workload management, RACF security — each providing capabilities that required separate replacement strategies on cloud. The "two boxes and an arrow" vendor slide had led to a "two boxes and an arrow" architecture assumption.

Root Cause 2: POC Selection Bias

The Phase 1 POC selected the easiest programs — simple sequential batch with no CICS, no GDGs, no SORT exits, no inter-program CALLs. This created false confidence about platform readiness. A better POC would have selected the hardest programs to identify the technical walls early.

Root Cause 3: No Mainframe Expertise on the Migration Team

The migration team had 42 people. Three had mainframe experience. Of those three, two were junior developers and one was a contractor who left at month 9. The team didn't understand what they were migrating from, only what they were migrating to. When they hit VSAM alternate index issues or DFSORT exit routines or GDG conditional references, they had to learn what these features were before they could design replacements.

Root Cause 4: TCO Based on Allocated Cost

The $24M/year mainframe cost was the total operating cost. The program assumed this entire cost would be eliminated. In reality, even if the migration had succeeded, the mainframe would need to remain operational during 12+ months of parallel running. The actual Year 1 savings (if the migration completed on time) would have been approximately $6M, not $24M.
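
The allocated-vs-marginal distinction is easy to show with a back-of-the-envelope model. The $24M figure is from the case; the cost split and migrated fraction below are illustrative assumptions, chosen only to show how the arithmetic lands near the $6M the post-mortem cited rather than the $24M the business case assumed.

```python
# Illustrative TCO model for Root Cause 4 -- the shares are assumptions,
# not MidWest Mutual's actual cost breakdown.

annual_mainframe_cost = 24.0     # $M/year, from the case study
fixed_share = 0.50               # assumed: licenses, staff, facilities persist
variable_share = 1 - fixed_share # assumed: MIPS/usage-driven charges
peak_mips_moved = 0.50           # assumed fraction of peak MIPS migrated in Year 1

# While the mainframe runs in parallel, only the usage-driven charges for
# the moved workloads go away; everything fixed keeps billing.
year1_savings = annual_mainframe_cost * variable_share * peak_mips_moved
print(f"Year 1 marginal savings: ${year1_savings:.1f}M")   # ~$6M, not $24M
```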

Root Cause 5: Underestimating Data Conversion

The EBCDIC-to-ASCII conversion was treated as a one-time, automated task. It was a months-long, iterative process requiring COBOL expertise to understand the data layouts, conversion expertise to handle edge cases (packed decimal signs, mixed binary/text records, multi-codepage data), and extensive validation to verify correctness.

The Salvage

MidWest Mutual spent $43 million and ended up with:

  1. Dev/test on cloud — the one clear win. Three dev environments running on AWS, saving approximately $2.1M/year in mainframe MIPS.
  2. Batch reporting on cloud — about 40% of the nightly batch (read-only reporting and analytics) successfully running on AWS, saving approximately $1.4M/year in marginal MIPS (these jobs DID run during peak hours, unlike CNB's reporting).
  3. A $43 million education in what cloud can and cannot do for mainframe COBOL workloads.
  4. An API modernization program — Ellen's recommendation, funded from the remaining budget, which wrapped 12 key CICS transactions in z/OS Connect APIs within 8 months and enabled the digital channel integration that the CTO had originally wanted.

Ellen's summary, delivered with the weary precision of someone who had been saying this for two years: "We spent $43 million to learn that the mainframe is good at what the mainframe does. We could have learned that by reading a book."

Key Takeaways for Practitioners

  1. Never run a POC with easy programs. The POC must include the hardest 10% — the SORT exits, the GDG conditional logic, the REDEFINES-heavy VSAM, the two-phase-commit CICS transactions. If those work, the rest will work. If those don't work, you know your boundaries before you've committed $52M.

  2. Every migration team needs mainframe expertise. Not optional. Not "nice to have." At least 30% of the team should be experienced mainframe professionals who understand what they're migrating from. You can't migrate what you don't understand.

  3. Calculate marginal cost, not allocated cost. The mainframe bill doesn't disappear when you move workloads. It decreases marginally, based on which workloads run during the peak MIPS window. Build the TCO model on marginal savings.

  4. Plan for the CICS wall. Batch migration is the easy part. CICS migration is where projects die, because CICS is not just a runtime — it's a transaction manager, a resource manager, a recovery manager, and a workload manager integrated into a single system. Nothing on cloud replicates that integration.

  5. Hybrid is not a fallback — it's the architecture. MidWest Mutual ended up with a hybrid architecture (dev/test + reporting on cloud, everything else on mainframe) after spending $43M to get there. If they had started with hybrid as the target, they could have achieved the same outcome for $8-12M.

Discussion Questions

  1. At what point during the program should the board have demanded a course correction? What early warning signals were available?

  2. Thomas Harrington had no mainframe experience. Should that have disqualified him from leading the program? What mainframe expertise should a CTO have to make informed migration decisions?

  3. Ellen Park raised concerns during the Phase 1 board meeting that were dismissed. What organizational dynamics allowed valid technical concerns to be overridden? How should architecture governance prevent this?

  4. The vendor's original proposal was $52M / 36 months. The actual cost was $43M spent + $32M+ remaining. What accountability should the vendor bear? How should the contract have been structured to share risk?

  5. Compare MidWest Mutual's outcome with CNB's outcome. Both started with COBOL-to-cloud ambitions. Why did CNB succeed (with a targeted hybrid approach) while MidWest Mutual failed (with a full migration approach)? What organizational differences drove the different outcomes?