Case Study 1: GlobalBank's Funds Transfer Incident

The Incident

On a Tuesday evening, GlobalBank's customer service center received 47 calls from customers reporting incorrect account balances. Investigation revealed that a batch funds transfer program (XFER-BATCH) had abended midway through processing, leaving 312 transfers in an inconsistent state: source accounts had been debited but destination accounts had not been credited. Total discrepancy: $2.8 million.

The Root Cause

The original XFER-BATCH program, written in 2003, processed transfers in this order:

       PERFORM UNTIL END-OF-TRANSFERS
           READ TRANSFER-FILE INTO WS-XFER-REC
               AT END SET END-OF-TRANSFERS TO TRUE
           END-READ

           IF NOT END-OF-TRANSFERS
      *--- Debit source account ---*
               EXEC SQL UPDATE ACCOUNT
                   SET BALANCE = BALANCE - :WS-AMOUNT
                   WHERE ACCT_NUMBER = :WS-SOURCE
               END-EXEC
               EXEC SQL COMMIT END-EXEC

      *--- Credit destination account ---*
               EXEC SQL UPDATE ACCOUNT
                   SET BALANCE = BALANCE + :WS-AMOUNT
                   WHERE ACCT_NUMBER = :WS-DEST
               END-EXEC
               EXEC SQL COMMIT END-EXEC

               ADD 1 TO WS-XFER-COUNT
           END-IF
       END-PERFORM.

The problem: each debit and credit was committed separately. When the program abended (due to a DB2 tablespace running out of space) between a debit commit and the corresponding credit commit, the debit was permanent but the credit never happened.

Furthermore, the program had no checkpoint/restart logic. After the fix, they had to manually identify which transfers had completed fully, which were half-done, and which had not started.

The Fix

Maria Chen redesigned the transfer program with proper transaction boundaries:

       PERFORM UNTIL END-OF-TRANSFERS
           READ TRANSFER-FILE INTO WS-XFER-REC
               AT END SET END-OF-TRANSFERS TO TRUE
           END-READ

           IF NOT END-OF-TRANSFERS
               SET XFER-OK TO TRUE

      *--- Both operations in ONE unit of work ---*
               EXEC SQL UPDATE ACCOUNT
                   SET BALANCE = BALANCE - :WS-AMOUNT
                   WHERE ACCT_NUMBER = :WS-SOURCE
               END-EXEC
               IF SQLCODE NOT = 0
                   SET XFER-FAILED TO TRUE
               END-IF

               IF XFER-OK
                   EXEC SQL UPDATE ACCOUNT
                       SET BALANCE = BALANCE + :WS-AMOUNT
                       WHERE ACCT_NUMBER = :WS-DEST
                   END-EXEC
                   IF SQLCODE NOT = 0
                       SET XFER-FAILED TO TRUE
                   END-IF
               END-IF

      *--- Single commit/rollback decision ---*
               IF XFER-OK
                   EXEC SQL COMMIT END-EXEC
                   ADD 1 TO WS-XFER-COUNT
               ELSE
                   EXEC SQL ROLLBACK END-EXEC
                   ADD 1 TO WS-ERROR-COUNT
                   PERFORM WRITE-ERROR-RECORD
               END-IF
           END-IF
       END-PERFORM.

She also added: - Idempotency checking (transfer ID logged to prevent reprocessing) - Checkpoint every 500 transfers with restart logic - A reconciliation step that verifies total debits equal total credits

The Aftermath

The incident prompted GlobalBank to audit all batch programs that update multiple tables. They found 14 other programs with similar "chatty transaction" patterns — separate commits for operations that should be atomic. A remediation project was launched to fix all 14 programs.

Discussion Questions

  1. The original programmer may have committed after each UPDATE to minimize lock duration. Was that a reasonable concern? How else could lock duration be managed without breaking atomicity?
  2. Why is checkpoint/restart logic essential for batch programs that process thousands of records?
  3. The fix includes writing failed transfers to an error file. Why is this better than simply stopping the program on the first error?
  4. How would you design an automated reconciliation process that detects debit/credit imbalances?
  5. This bug existed in production for 20 years before triggering an incident. What does that tell you about the nature of latent bugs?

Lessons Learned

  • A COMMIT after each individual UPDATE breaks the logical unit of work
  • Transaction boundaries must align with business operations, not individual SQL statements
  • Checkpoint/restart is not optional for large batch programs
  • Idempotency prevents reprocessing damage during restarts
  • Code review should specifically check transaction boundary correctness