Case Study 1: The Timezone That Cost $2 Million

Background

Federal Benefits Corp (FBC) and National Pension Fund (NPF) had completed a migration from batch to real-time pension payment processing, similar to the GlobalBank-MedClaim project described in this chapter. The migration had passed a 30-day parallel run with zero discrepancies. Cutover was declared successful.

Three weeks after cutover, NPF's accounting department discovered that 847 pension payments totaling $2.1 million had been processed twice — once through the real-time path and once through a batch path that was supposed to be disabled.

The Investigation

The investigation revealed a subtle timing issue rooted in timezone handling. FBC's adjudication system ran on a mainframe in Denver (Mountain Time). NPF's payment system ran on a mainframe in New York (Eastern Time). The MQ infrastructure sat in a Chicago data center (Central Time).

When FBC adjudicated a claim at 11:30 PM Mountain Time on March 15, the MQ message was timestamped 11:30 PM MT. Eastern Time is two hours ahead of Mountain Time, so NPF's consumer program received and processed the message at 1:30 AM Eastern on March 16. The DB2 transaction log recorded it as March 16.
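The date rollover can be checked with a few lines of Python. This is a minimal sketch, not code from either organization: the year (2024) is an assumption because the case study gives only the day, and fixed daylight-time offsets stand in for full timezone rules. The two-hour gap between the sites is the same under standard time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed daylight-time offsets instead of real zone rules.
MOUNTAIN = timezone(timedelta(hours=-6), "MDT")  # FBC, Denver
EASTERN = timezone(timedelta(hours=-4), "EDT")   # NPF, New York

# Assumption: the year is illustrative; the case study states only March 15.
adjudicated_mt = datetime(2024, 3, 15, 23, 30, tzinfo=MOUNTAIN)

# The same instant, expressed on NPF's clock and in UTC.
received_et = adjudicated_mt.astimezone(EASTERN)
as_utc = adjudicated_mt.astimezone(timezone.utc)

print(adjudicated_mt.isoformat())  # 2024-03-15T23:30:00-06:00
print(received_et.isoformat())     # 2024-03-16T01:30:00-04:00
print(as_utc.isoformat())          # 2024-03-16T05:30:00+00:00

# One instant, two different calendar dates.
same_calendar_date = adjudicated_mt.date() == received_et.date()
print(same_calendar_date)          # False
```

The two aware datetimes compare equal as instants, yet their local calendar dates differ. Any logic keyed on a local date will therefore disagree between the two sites.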

FBC's batch extract, which was supposed to be disabled, had been reconfigured to extract only claims adjudicated "today" according to FBC's system date. At 11:30 PM Mountain Time, "today" was still March 15 in Denver, so the batch extract included the claim. The extract was transmitted to NPF, where the batch program processed it on the morning of March 16.

The reconciliation program compared March 15 batch records against March 15 real-time records. The real-time system had logged the transaction as March 16 (Eastern Time), so it did not appear in the March 15 reconciliation. The March 16 reconciliation showed the real-time transaction but not the batch transaction (which was in the March 15 batch file). Each day's reconciliation was clean — but the transaction was processed on both paths.
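The blind spot can be reproduced with a simplified model of a date-bucketed reconciliation. The sketch below is an assumption about how such a check might be structured, not NPF's actual program. The claim ID, the year, and the record layout are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (claim_id, business date as logged by that path).
batch_records = [("CLM-001", "2024-03-15")]     # FBC extract, Mountain Time date
realtime_records = [("CLM-001", "2024-03-16")]  # NPF DB2 log, Eastern Time date


def reconcile_by_date(batch, realtime):
    """Report claims that appear on both paths under the same date."""
    by_date = defaultdict(lambda: {"batch": set(), "realtime": set()})
    for claim_id, day in batch:
        by_date[day]["batch"].add(claim_id)
    for claim_id, day in realtime:
        by_date[day]["realtime"].add(claim_id)
    return {
        day: sorted(paths["batch"] & paths["realtime"])
        for day, paths in sorted(by_date.items())
    }


daily_duplicates = reconcile_by_date(batch_records, realtime_records)
print(daily_duplicates)  # {'2024-03-15': [], '2024-03-16': []}

# Ignoring the date bucket exposes the claim that both paths processed.
cross_date_duplicates = (
    {claim_id for claim_id, _ in batch_records}
    & {claim_id for claim_id, _ in realtime_records}
)
print(cross_date_duplicates)  # {'CLM-001'}
```

Each daily bucket is empty, so each day looks clean. The duplicate appears only when records are matched on claim identity across dates.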

Root Cause

The root cause was not a bug in any single program. It was a system-level failure: the migration design did not account for timezone differences between organizations. Specifically:

  1. The "batch disabled" mechanism was date-based, not absolute. Whether a claim went to batch depended on what counted as "today" under FBC's Mountain Time system date, so the batch path was never unconditionally off.
  2. Claims adjudicated between 10 PM and midnight Mountain Time (midnight to 2 AM Eastern) fell into a "twilight zone" where they were in "today" for FBC but "tomorrow" for NPF.
  3. The reconciliation was date-based, comparing same-date records. Cross-date records were invisible.
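The size of the exposure window in item 2 can be estimated by sampling one day of instants and checking where the two sites disagree on the calendar date. This is a rough sketch under stated assumptions: fixed daylight-time offsets, an illustrative year, and a hypothetical helper named dates_disagree.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed daylight-time offsets instead of real zone rules.
MOUNTAIN = timezone(timedelta(hours=-6), "MDT")
EASTERN = timezone(timedelta(hours=-4), "EDT")


def dates_disagree(instant_utc: datetime) -> bool:
    """True when Denver and New York are on different calendar dates."""
    mountain_date = instant_utc.astimezone(MOUNTAIN).date()
    eastern_date = instant_utc.astimezone(EASTERN).date()
    return mountain_date != eastern_date


# Sample one UTC day in 30-minute steps (the year is illustrative).
start = datetime(2024, 3, 15, 0, 0, tzinfo=timezone.utc)
samples = [start + timedelta(minutes=30 * i) for i in range(48)]

risky = [s for s in samples if dates_disagree(s)]
risky_mountain_times = [s.astimezone(MOUNTAIN).strftime("%H:%M") for s in risky]
print(risky_mountain_times)  # ['22:00', '22:30', '23:00', '23:30']
```

Four of the 48 half-hour samples fall in the window, matching the two hours per day described above. A parallel run whose test traffic never lands in that window would not exercise the defect.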

The Fix

  1. All timestamps were standardized to UTC. MQ messages, DB2 records, and batch extracts all used UTC timestamps. Local time was used only for display purposes.
  2. The batch disable mechanism was changed from date-based to flag-based. A control table in DB2 had a "BATCH_ENABLED" flag. When set to 'N', the batch extract produced an empty file regardless of date.
  3. The reconciliation window was expanded. Instead of comparing only same-date records, the reconciliation now compared records within a 48-hour window, so a payment logged under different calendar dates on the two paths would still be matched.
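The second and third fixes might look like the following in outline. This is a sketch under stated assumptions, not the production code: the function names and the claim ID key are hypothetical, the batch_enabled argument stands in for the BATCH_ENABLED value read from the DB2 control table, and all timestamps are taken to be UTC as in the first fix.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=48)


def extract_batch(claims, batch_enabled):
    """Flag-based gate: produce an empty extract unless the flag is 'Y'."""
    if batch_enabled != "Y":
        return []
    return list(claims)


def find_duplicates(batch, realtime, window=WINDOW):
    """Return claim IDs seen on both paths within `window` of each other."""
    realtime_by_id = {}
    for claim_id, ts in realtime:
        realtime_by_id.setdefault(claim_id, []).append(ts)

    duplicates = []
    for claim_id, batch_ts in batch:
        for realtime_ts in realtime_by_id.get(claim_id, []):
            if abs(batch_ts - realtime_ts) <= window:
                duplicates.append(claim_id)
                break
    return duplicates


# Hypothetical records with UTC timestamps (the year is illustrative).
realtime = [("CLM-001", datetime(2024, 3, 16, 5, 30, tzinfo=timezone.utc))]
batch = [("CLM-001", datetime(2024, 3, 16, 14, 0, tzinfo=timezone.utc))]

found = find_duplicates(batch, realtime)
print(found)  # ['CLM-001']

# With the flag set to 'N', nothing is extracted regardless of date.
gated = extract_batch(batch, batch_enabled="N")
print(gated)  # []
```

Matching on claim identity within a time window, rather than on a shared calendar date, is what lets the reconciliation see a transaction that the two sites logged under different dates. The gate ignores dates entirely, so no timezone edge case can re-enable the batch path.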

Recovery

The 847 duplicate payments were reversed over a two-week period. Each reversal required an explanation letter to the pension recipient and a correcting transaction in both systems. The total cost of the incident — including investigation, recovery, customer communication, and regulatory reporting — was approximately $350,000 beyond the $2.1 million in duplicate payments.

Discussion Questions

  1. How could the parallel-run period have been designed to catch this timezone issue before cutover?
  2. Why is UTC the standard for inter-system timestamps? What are the drawbacks of using local time?
  3. The reconciliation program compared same-date records. Design a reconciliation that handles cross-timezone, cross-date transactions. What additional data would you need?
  4. The batch disable mechanism was date-based. What are the advantages of a flag-based disable mechanism? Are there any disadvantages?
  5. What organizational process changes would prevent similar timezone-related issues in future cross-organization integrations?