Case Study 1: GlobalBank's Funds Transfer Incident
The Incident
On a Tuesday evening, GlobalBank's customer service center received 47 calls from customers reporting incorrect account balances. Investigation revealed that a batch funds transfer program (XFER-BATCH) had abended midway through processing, leaving 312 transfers in an inconsistent state: source accounts had been debited but destination accounts had not been credited. Total discrepancy: $2.8 million.
The Root Cause
The original XFER-BATCH program, written in 2003, processed transfers in this order:
PERFORM UNTIL END-OF-TRANSFERS
READ TRANSFER-FILE INTO WS-XFER-REC
AT END SET END-OF-TRANSFERS TO TRUE
END-READ
IF NOT END-OF-TRANSFERS
*--- Debit source account ---*
EXEC SQL UPDATE ACCOUNT
SET BALANCE = BALANCE - :WS-AMOUNT
WHERE ACCT_NUMBER = :WS-SOURCE
END-EXEC
EXEC SQL COMMIT END-EXEC
*--- Credit destination account ---*
EXEC SQL UPDATE ACCOUNT
SET BALANCE = BALANCE + :WS-AMOUNT
WHERE ACCT_NUMBER = :WS-DEST
END-EXEC
EXEC SQL COMMIT END-EXEC
ADD 1 TO WS-XFER-COUNT
END-IF
END-PERFORM.
The problem: each debit and credit was committed separately. When the program abended (due to a DB2 tablespace running out of space) between a debit commit and the corresponding credit commit, the debit was permanent but the credit never happened.
Furthermore, the program had no checkpoint/restart logic. After the fix, they had to manually identify which transfers had completed fully, which were half-done, and which had not started.
The Fix
Maria Chen redesigned the transfer program with proper transaction boundaries:
PERFORM UNTIL END-OF-TRANSFERS
READ TRANSFER-FILE INTO WS-XFER-REC
AT END SET END-OF-TRANSFERS TO TRUE
END-READ
IF NOT END-OF-TRANSFERS
SET XFER-OK TO TRUE
*--- Both operations in ONE unit of work ---*
EXEC SQL UPDATE ACCOUNT
SET BALANCE = BALANCE - :WS-AMOUNT
WHERE ACCT_NUMBER = :WS-SOURCE
END-EXEC
IF SQLCODE NOT = 0
SET XFER-FAILED TO TRUE
END-IF
IF XFER-OK
EXEC SQL UPDATE ACCOUNT
SET BALANCE = BALANCE + :WS-AMOUNT
WHERE ACCT_NUMBER = :WS-DEST
END-EXEC
IF SQLCODE NOT = 0
SET XFER-FAILED TO TRUE
END-IF
END-IF
*--- Single commit/rollback decision ---*
IF XFER-OK
EXEC SQL COMMIT END-EXEC
ADD 1 TO WS-XFER-COUNT
ELSE
EXEC SQL ROLLBACK END-EXEC
ADD 1 TO WS-ERROR-COUNT
PERFORM WRITE-ERROR-RECORD
END-IF
END-IF
END-PERFORM.
She also added: - Idempotency checking (transfer ID logged to prevent reprocessing) - Checkpoint every 500 transfers with restart logic - A reconciliation step that verifies total debits equal total credits
The Aftermath
The incident prompted GlobalBank to audit all batch programs that update multiple tables. They found 14 other programs with similar "chatty transaction" patterns — separate commits for operations that should be atomic. A remediation project was launched to fix all 14 programs.
Discussion Questions
- The original programmer may have committed after each UPDATE to minimize lock duration. Was that a reasonable concern? How else could lock duration be managed without breaking atomicity?
- Why is checkpoint/restart logic essential for batch programs that process thousands of records?
- The fix includes writing failed transfers to an error file. Why is this better than simply stopping the program on the first error?
- How would you design an automated reconciliation process that detects debit/credit imbalances?
- This bug existed in production for 20 years before triggering an incident. What does that tell you about the nature of latent bugs?
Lessons Learned
- A COMMIT after each individual UPDATE breaks the logical unit of work
- Transaction boundaries must align with business operations, not individual SQL statements
- Checkpoint/restart is not optional for large batch programs
- Idempotency prevents reprocessing damage during restarts
- Code review should specifically check transaction boundary correctness