Case Study 1: The Midnight Balance Break
Background
First National Credit Union runs a COBOL-based core banking system similar in architecture to the one Derek Washington built in this chapter. Their nightly batch job stream processes approximately 45,000 transactions against 120,000 member accounts. The system has been in production for 22 years and generally runs without incident.
On Tuesday night, the batch run completed normally — all programs set return code 0, no ABENDs, no error messages. But when the Accounting department ran their morning reconciliation on Wednesday, they discovered a problem: the total of all account balances was $147,283.50 less than expected.
"The debits and credits don't balance," said the Controller. "We're short $147,283.50 and I can't explain where it went."
The Investigation
The operations team began by examining the batch run output. Transaction processing had handled 44,872 transactions: 18,340 credits, 21,104 debits, 4,218 transfers, and 1,210 fee assessments. The summary report showed totals for each category, and the totals matched the input file. The audit trail contained 44,872 records — one for each transaction.
At first glance, everything looked correct.
Senior developer Anita Reyes was called in to investigate. Her first step was to compare the audit trail totals against the actual account balances. She wrote a utility program that read the account master sequentially, summing all CURRENT-BALANCE fields. The total was $2,341,456,789.22. The expected total (based on yesterday's balance plus today's credits minus today's debits) was $2,341,604,072.72.
The difference: exactly $147,283.50.
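The comparison Anita performed can be sketched in a few lines. This is an illustrative Python sketch rather than the credit union's COBOL utility; the function name `reconcile` is hypothetical, and the two totals are the figures reported above. Decimal arithmetic is used so the cents stay exact, much as packed-decimal fields would in COBOL.

```python
from decimal import Decimal


def reconcile(summed_balances: Decimal, expected_total: Decimal) -> Decimal:
    """Return expected minus actual; zero means the books balance."""
    return expected_total - summed_balances


# Figures reported in the case study.
actual = Decimal("2341456789.22")    # sum of CURRENT-BALANCE over the account master
expected = Decimal("2341604072.72")  # prior balance plus credits minus debits

shortfall = reconcile(actual, expected)
print(f"Shortfall: ${shortfall:,.2f}")  # prints "Shortfall: $147,283.50"
```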
"It's not a rounding error," Anita noted. "It's too precise. Something specific caused this."
She examined the audit trail for patterns. Each audit record contained the before-balance and after-balance for the account. She wrote another utility to verify that every after-balance equaled the before-balance plus or minus the transaction amount. For 44,863 records, the math was perfect. For 9 records, the before-balance was zero and the after-balance was zero — these were rejected transactions (account not found).
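A per-record check of this kind might look like the following Python sketch. The record layout, the "CR"/"DR" type codes, and the sample records are all hypothetical; the case study does not show the real audit record format. The rule that a zero before-balance and zero after-balance marks a rejected transaction follows the description above.

```python
from collections import Counter
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class AuditRecord:
    txn_type: str            # "CR" (credit) or "DR" (debit); hypothetical codes
    amount: Decimal
    before_balance: Decimal
    after_balance: Decimal


def classify(rec: AuditRecord) -> str:
    """Return 'ok', 'rejected', or 'mismatch' for one audit record."""
    if rec.before_balance == 0 and rec.after_balance == 0:
        return "rejected"    # e.g. account not found, nothing was updated
    signed = rec.amount if rec.txn_type == "CR" else -rec.amount
    if rec.before_balance + signed == rec.after_balance:
        return "ok"
    return "mismatch"


# Hypothetical sample data, for illustration only.
sample = [
    AuditRecord("CR", Decimal("250.00"), Decimal("1000.00"), Decimal("1250.00")),
    AuditRecord("DR", Decimal("75.25"), Decimal("500.00"), Decimal("424.75")),
    AuditRecord("DR", Decimal("40.00"), Decimal("0.00"), Decimal("0.00")),
    AuditRecord("CR", Decimal("10.00"), Decimal("100.00"), Decimal("100.00")),
]
counts = Counter(classify(r) for r in sample)
print(dict(counts))
```

A check like this only validates the records that exist. It cannot say anything about an update that should have been recorded but never was.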
That left 44,863 records whose arithmetic checked out. If every one of them was correct, their combined effect should have produced the expected balance. So where was the discrepancy?
The Root Cause
After six hours of analysis, Anita found it. Her first discovery was a gap in the transfer processing logic: when a transfer processed successfully, the program wrote only ONE audit trail record, for the source account. It did not write a separate audit trail record for the target account credit.
This gap was not the root cause of the balance break, however. For successful transfers, the logic correctly debited the source and credited the target, so the balances were right. The audit trail was incomplete, but the file updates were correct.
The real problem was subtler. Among the day's 4,218 transfers, 3 had target accounts that were in "frozen" status. The transfer logic debited the source account first. Then it read the target account, found it frozen, and skipped the credit. But the source debit had already been committed via REWRITE, so the funds left the source account and were never credited anywhere.
The 3 failed target credits totaled $147,283.50: one transfer of $95,000.00 (a payroll funding), one of $42,283.50 (an insurance settlement), and one of $10,000.00 (a member transfer).
The program logged these failures to the console (DISPLAY statements), but the console output scrolled past in the overnight run and nobody reviewed it. The program set return code 0 because it treated the target credit failure as a "handled" condition — it did not increment the rejected counter.
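The failure sequence described above can be reproduced with a small in-memory model. This is a Python sketch under stated assumptions, not the production COBOL: the account IDs, the starting balances, the dictionary layout, and the function name `flawed_transfer` are hypothetical. Only the order of operations and the three transfer amounts come from the case study.

```python
from decimal import Decimal

# Hypothetical accounts: one funded source, one frozen target.
accounts = {
    "SRC": {"balance": Decimal("200000.00"), "status": "ACTIVE"},
    "TGT": {"balance": Decimal("0.00"), "status": "FROZEN"},
}


def total_balance(accts):
    return sum(a["balance"] for a in accts.values())


def flawed_transfer(accts, source, target, amount):
    """Debit first, check the target afterwards (the order that caused the break)."""
    accts[source]["balance"] -= amount            # stands in for the committed REWRITE
    if accts[target]["status"] == "FROZEN":
        print(f"TARGET {target} FROZEN, CREDIT SKIPPED")  # console message only
        return 0                                  # treated as a "handled" condition
    accts[target]["balance"] += amount
    return 0


before = total_balance(accounts)
codes = [
    flawed_transfer(accounts, "SRC", "TGT", Decimal(amt))
    for amt in ("95000.00", "42283.50", "10000.00")  # amounts from the case study
]
lost = before - total_balance(accounts)
print(f"Return codes: {codes}, funds lost: ${lost:,.2f}")
# prints "Return codes: [0, 0, 0], funds lost: $147,283.50"
```

Every call reports success, yet the total across all accounts drops by exactly the amount the Controller could not explain.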
Lessons Learned
- Transfers must be atomic. The two-phase nature of transfers — debit source, credit target — means both must succeed or both must fail. This program debited the source but did not roll back when the target credit failed.
- The audit trail must be complete. One audit record per transfer was insufficient. Production systems should write one record for the source debit and one for the target credit. The absence of a target credit record was a clue, but it was invisible without a specific check.
- Return codes must reflect reality. Setting return code 0 when 3 transfers partially failed was misleading. A return code of 4 (warning) would have triggered investigation.
- Console messages are not error handling. DISPLAY statements that scroll past unread are equivalent to no error handling at all. Errors must be logged to files that are automatically reviewed.
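One way these four lessons might fit together is sketched below in Python. It is an illustration under assumptions, not a fix taken from the actual system: the function names, the tuple layouts for audit and exception records, and the sample accounts are hypothetical, and the in-memory lists stand in for an audit file and an exception file. The sketch validates both accounts before touching either balance, writes two audit records per transfer, records rejections where a job can review them, and raises the return code to 4 when anything is rejected.

```python
from decimal import Decimal

RC_OK, RC_WARNING = 0, 4


def transfer(accts, source, target, amount, audit, exceptions):
    """Apply a transfer only if both sides can succeed; return True on success."""
    src, tgt = accts.get(source), accts.get(target)
    if src is None or tgt is None:
        reason = "ACCOUNT NOT FOUND"
    elif tgt["status"] == "FROZEN":
        reason = "TARGET FROZEN"
    else:
        reason = None
    if reason:
        # Nothing has been updated yet, so there is nothing to roll back.
        exceptions.append((source, target, amount, reason))
        return False
    # One audit record per side: (account, type, amount, before, after).
    audit.append((source, "DR", amount, src["balance"], src["balance"] - amount))
    audit.append((target, "CR", amount, tgt["balance"], tgt["balance"] + amount))
    src["balance"] -= amount
    tgt["balance"] += amount
    return True


def run_batch(accts, transfers):
    audit, exceptions = [], []
    rc = RC_OK
    for source, target, amount in transfers:
        if not transfer(accts, source, target, amount, audit, exceptions):
            rc = RC_WARNING  # the return code now reflects the rejection
    return rc, audit, exceptions


# Hypothetical data: one good transfer, one aimed at a frozen account.
accounts = {
    "A": {"balance": Decimal("100000.00"), "status": "ACTIVE"},
    "B": {"balance": Decimal("500.00"), "status": "ACTIVE"},
    "C": {"balance": Decimal("0.00"), "status": "FROZEN"},
}
start_total = sum(a["balance"] for a in accounts.values())
rc, audit, exceptions = run_batch(
    accounts,
    [("A", "B", Decimal("250.00")), ("A", "C", Decimal("95000.00"))],
)
end_total = sum(a["balance"] for a in accounts.values())
print(rc, len(audit), len(exceptions), start_total == end_total)  # prints "4 2 1 True"
```

Validating before updating avoids the need for a rollback in this simple model. In a file-based system, a failure between the two updates would still require compensating logic or transactional support.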
Discussion Questions
- How would you modify the transfer logic in TXN-PROC to prevent this problem?
- Design an automated reconciliation check that would have caught this error before the Accounting department discovered it.
- In a DB2 environment, how would you use COMMIT and ROLLBACK to make transfers atomic?
- What organizational changes (not just technical changes) would help prevent console messages from being ignored?