Case Study 1: The Copybook That Broke Everything

Background

Heartland Insurance had been modernizing their claims processing system for eight months. The team, led by a senior developer with 12 years of COBOL experience, had successfully completed Phase 1 (documentation) and Phase 2 (refactoring) and was partway through Phase 3 (DB2 migration). Morale was high. Production incidents had dropped by 40%. The modernization was ahead of schedule.

Then, on a Tuesday morning, the claims team reported that provider payments were wrong. Not slightly wrong — catastrophically wrong. Some providers were being paid $0.00 for legitimate claims. Others were being paid the charged amount instead of the allowed amount. The total payment error for the overnight batch run was $2.3 million.

The Investigation

The team immediately switched to the pre-modernization version of the payment program (CLM-PAY), which had been preserved as a rollback option. The next overnight batch ran the legacy version, and payments were correct. The crisis was contained, but the root cause was unknown.

The investigation began with the payment calculation program. The modernized version of CLM-PAY had been in production for three weeks without incident. What changed?

The answer was a copybook. On Monday, the day before the payment errors, a different team member had updated the CLMREC copybook as part of the DB2 migration work. The change looked small: a new 16-byte field (CLM-DB2-TIMESTAMP) was added, and to keep the record length unchanged the developer reduced FILLER from 20 bytes to 4 bytes.

The problem: CLM-PAY had been compiled against the OLD version of the copybook. It had been modernized three weeks earlier, tested, and promoted to production. Nobody recompiled it when the copybook changed, so its field offsets were still those of the old layout.

With the old copybook, CLM-PAID-AMOUNT started at byte offset 142. With the new copybook, it started at byte offset 158, because the new field had been inserted ahead of it and shifted every later field by 16 bytes. CLM-PAY was still reading bytes 142-149 and interpreting them as the paid amount, but in records written with the new layout those bytes held the first half of the new timestamp field.
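
The offset mismatch can be reproduced in a few lines. The sketch below is illustrative only: the offsets 142 and 158, the 8-byte read, and the 16-byte timestamp come from the case study, while the record length, the character encoding (ASCII digits rather than EBCDIC or packed decimal), the zero-based offsets, and the sample values are assumptions made for the example.

```python
# Offsets taken from the case study; everything else is an assumption.
OLD_PAID_OFFSET = 142   # CLM-PAID-AMOUNT under the old CLMREC layout
NEW_PAID_OFFSET = 158   # CLM-PAID-AMOUNT under the new CLMREC layout
PAID_WIDTH = 8          # CLM-PAY read bytes 142-149
TIMESTAMP_WIDTH = 16    # CLM-DB2-TIMESTAMP, occupying bytes 142-157
RECORD_LENGTH = 200     # hypothetical; the real record length is not given


def build_new_layout_record(timestamp: str, paid_cents: int) -> bytes:
    """Build a record the way a program compiled with the NEW copybook would."""
    record = bytearray(b" " * RECORD_LENGTH)
    ts = timestamp.encode("ascii")
    if len(ts) != TIMESTAMP_WIDTH:
        raise ValueError("timestamp must be exactly 16 characters")
    # The new field sits where the paid amount used to start.
    record[OLD_PAID_OFFSET:OLD_PAID_OFFSET + TIMESTAMP_WIDTH] = ts
    paid = str(paid_cents).zfill(PAID_WIDTH).encode("ascii")
    record[NEW_PAID_OFFSET:NEW_PAID_OFFSET + PAID_WIDTH] = paid
    return bytes(record)


def read_paid_amount(record: bytes, offset: int) -> bytes:
    """Return the raw bytes a program would treat as CLM-PAID-AMOUNT."""
    return record[offset:offset + PAID_WIDTH]


# Made-up sample values: a 16-digit timestamp and a paid amount of $125.50.
record = build_new_layout_record("2024011502301500", 12550)

stale_view = read_paid_amount(record, OLD_PAID_OFFSET)  # what CLM-PAY saw
fresh_view = read_paid_amount(record, NEW_PAID_OFFSET)  # what it should see

print(stale_view)  # b'20240115' (timestamp digits, not money)
print(fresh_view)  # b'00012550'
```

The stale read does not fail; it simply returns the wrong bytes. That is why the error surfaced as wrong payment amounts rather than as an abend.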

The Root Cause Chain

  1. Developer A modified the CLMREC copybook (added 16 bytes, reduced FILLER by 16)
  2. Developer A recompiled the programs they were working on (CLM-INTAKE, CLM-ADJUD)
  3. Developer A did NOT recompile CLM-PAY because it was "not part of their project"
  4. The CLMREC copybook was promoted to production
  5. CLM-PAY continued to run from its existing load module, which had been compiled against the old copybook
  6. Data written by CLM-ADJUD (new layout) was read by CLM-PAY (old layout) — offset mismatch
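
Steps 3 through 5 of the chain can be caught mechanically if each build records which version of every copybook a program was compiled against. The following is a minimal sketch of that idea, not the team's actual tooling: the build-record structure, the function names, and the copybook fragments (including the PIC clauses) are hypothetical.

```python
import hashlib


def fingerprint(copybook_source: str) -> str:
    """Return a stable hash of a copybook's text."""
    return hashlib.sha256(copybook_source.encode("utf-8")).hexdigest()


def find_stale_programs(build_records, current_copybooks):
    """List (program, copybook) pairs compiled against an outdated copybook.

    build_records maps program name -> {copybook name: fingerprint recorded
    at compile time}. current_copybooks maps copybook name -> current source.
    """
    stale = []
    for program, used in sorted(build_records.items()):
        for copybook, compiled_fp in sorted(used.items()):
            if fingerprint(current_copybooks[copybook]) != compiled_fp:
                stale.append((program, copybook))
    return stale


# Hypothetical copybook fragments standing in for the old and new CLMREC.
old_clmrec = "05 FILLER PIC X(20)."
new_clmrec = "05 CLM-DB2-TIMESTAMP PIC X(16).\n05 FILLER PIC X(4)."

# State after Developer A's change: two programs rebuilt, CLM-PAY not.
build_records = {
    "CLM-INTAKE": {"CLMREC": fingerprint(new_clmrec)},
    "CLM-ADJUD": {"CLMREC": fingerprint(new_clmrec)},
    "CLM-PAY": {"CLMREC": fingerprint(old_clmrec)},
}

stale = find_stale_programs(build_records, {"CLMREC": new_clmrec})
print(stale)  # [('CLM-PAY', 'CLMREC')]
```

A promotion gate that refuses to move a copybook while this list is non-empty would have stopped the change at step 4.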

Lessons Learned

  1. Copybook changes affect every program that uses the copybook. When a copybook is modified, every program that COPYs it must be recompiled and retested, whether or not that program belongs to the current project. There are no exceptions. The team created a cross-reference list mapping every copybook to every program that COPYs it.

  2. Automated dependency tracking is essential. Manual tracking of copybook dependencies is error-prone. The team implemented a script that scans all COBOL source for COPY statements and generates a dependency matrix. Before any copybook promotion, the script identifies all programs that need recompilation.

  3. FILLER is not free space to consume. The FILLER at the end of a copybook is reserved for emergency expansion, not for routine field additions. When more space is needed, the record length should be increased — which forces a review of all affected programs and VSAM definitions.

  4. Parallel runs must be repeated after every change. The original modernization had been verified with parallel runs. But after the copybook change, no parallel run was performed. If it had been, the offset mismatch would have been caught immediately.

  5. Rollback saved the business. Because the legacy version of CLM-PAY was preserved, the team could revert to correct payments within one business day. Without the rollback option, the recovery would have taken a week.
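
The dependency script described in lesson 2 might look something like the sketch below. The team's actual script is not shown in this case study, so everything here is an assumption: fixed-format source (indicator in column 7, code in columns 8-72), simple `COPY name` statements, and program sources supplied as an in-memory mapping rather than read from a source library. The copybook PROVREC is hypothetical. A production version would also need to handle nested copybooks, COPY ... REPLACING, and the word COPY appearing inside string literals.

```python
import re
from collections import defaultdict

# COPY followed by a copybook name, optionally quoted. The lookbehind
# prevents matches inside data names such as WS-COPY.
COPY_RE = re.compile(
    r"(?<![A-Z0-9-])COPY\s+['\"]?([A-Z0-9][A-Z0-9-]*)", re.IGNORECASE
)


def copybooks_used(source):
    """Return the set of copybook names referenced by one program."""
    names = set()
    for line in source.splitlines():
        # Column 7 (index 6) holds '*' or '/' on comment lines.
        if len(line) > 6 and line[6] in "*/":
            continue
        code = line[7:72]  # columns 8-72; empty for short lines
        for match in COPY_RE.finditer(code):
            names.add(match.group(1).upper())
    return names


def dependency_matrix(sources):
    """Map each copybook to the sorted list of programs that COPY it."""
    matrix = defaultdict(list)
    for program in sorted(sources):
        for copybook in copybooks_used(sources[program]):
            matrix[copybook].append(program)
    return dict(matrix)


# Tiny made-up sources; the commented-out COPY must be ignored.
sources = {
    "CLM-PAY": "       COPY CLMREC.\n      * COPY OLDREC.\n",
    "CLM-ADJUD": "       COPY CLMREC.\n       COPY PROVREC.\n",
    "CLM-INTAKE": "       COPY CLMREC.\n",
}

matrix = dependency_matrix(sources)
print(matrix["CLMREC"])  # ['CLM-ADJUD', 'CLM-INTAKE', 'CLM-PAY']
```

Looking up CLMREC in the matrix before promotion would have shown CLM-PAY alongside the two programs Developer A recompiled.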

Discussion Questions

  1. How would you design a build system that automatically recompiles all programs affected by a copybook change?
  2. What testing could have caught this error before it reached production?
  3. The developer who modified the copybook was following a reasonable process — they recompiled the programs they were working on. What process change would prevent this from happening again?
  4. How would a DB2-based system (using DCLGEN copybooks) handle this differently than a VSAM-based system?