Case Study 10.1: The Transaction That Broke TXN-PROC
Background
GlobalBank's TXN-PROC program processes 2.3 million transactions every night. In March 2021, an unusual transaction entered the system: a wire transfer for $99,999,999.99, to be added to an existing balance held in the ACCT-CURR-BALANCE field (PIC S9(11)V99 COMP-3).
The Incident
At 2:17 AM on March 15, the overnight batch operator noticed that TXN-PROC had ABENDed with an S0C7 (data exception). The ABEND occurred at offset +002A4E in the program, which corresponded to a COMPUTE statement that calculated the new account balance.
The Root Cause Chain
1. The trigger: A wire transfer of $99,999,999.99 was processed for account 7834291002, which already had a balance of $1,247,832.50.
2. The overflow: The COMPUTE statement
           COMPUTE ACCT-CURR-BALANCE = ACCT-CURR-BALANCE + TXN-AMOUNT
   produced a result of $101,247,832.49. (Strictly, this value still fits in PIC S9(11)V99, whose maximum is $99,999,999,999.99; the failure surfaced in the program's secondary calculation.)
3. The secondary calculation: A downstream calculation attempted to multiply the new balance by an interest rate factor. The overflow from step 2 corrupted the COMP-3 field (producing an invalid packed-decimal sign nibble), and the subsequent MULTIPLY on the corrupted field caused the S0C7 ABEND.
4. The missing check: There was no ON SIZE ERROR on the COMPUTE statement, and no validation that the result was within acceptable bounds.
5. The cascade: When TXN-PROC ABENDed, it left the account master file in an inconsistent state: some transactions had been posted, others had not. Recovery required restoring the file from the previous night's backup and rerunning the entire batch.
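The failure mode described in the chain above, arithmetic on a packed-decimal field whose sign nibble is invalid, can be guarded against with a NUMERIC class test before the field is used. The following is a minimal sketch, not TXN-PROC's actual code: the program name, WS-BALANCE-RAW, WS-RATE-FACTOR, and WS-INTEREST-RESULT are hypothetical, and the corruption is simulated by moving spaces over the field's bytes.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. PACKCHK.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Same picture as ACCT-CURR-BALANCE in the case study.
       01  WS-BALANCE          PIC S9(11)V99 COMP-3 VALUE 0.
      * Byte-level view of the same 7 bytes (hypothetical name).
       01  WS-BALANCE-RAW REDEFINES WS-BALANCE PIC X(7).
      * Illustrative interest factor; not from the case study.
       01  WS-RATE-FACTOR      PIC 9V9(4) VALUE 1.0125.
       01  WS-INTEREST-RESULT  PIC S9(11)V99 COMP-3 VALUE 0.
       PROCEDURE DIVISION.
       0000-MAIN.
      * Simulate corruption: space bytes leave a low-order nibble
      * that is not a valid packed-decimal sign.
           MOVE SPACES TO WS-BALANCE-RAW
      * Test the field before any arithmetic touches it.
           IF WS-BALANCE IS NUMERIC
               MULTIPLY WS-BALANCE BY WS-RATE-FACTOR
                   GIVING WS-INTEREST-RESULT
                   ON SIZE ERROR
                       DISPLAY 'SIZE ERROR ON MULTIPLY'
               END-MULTIPLY
           ELSE
               DISPLAY 'BALANCE FIELD IS NOT VALID PACKED DATA'
           END-IF
           STOP RUN.
```

With the corrupted bytes in place, the class test should fail and the MULTIPLY is never attempted, so the program reports the bad field instead of taking a data exception. Which sign nibbles count as valid can depend on compiler options, so the test is a guard, not a guarantee.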
Timeline
| Time | Event |
|---|---|
| 2:17 AM | TXN-PROC ABENDs with S0C7 |
| 2:20 AM | Operator pages on-call support (Derek Washington) |
| 2:45 AM | Derek identifies the ABEND offset and the corrupted field |
| 3:15 AM | Derek calls Maria Chen for guidance on recovery |
| 3:30 AM | Decision made to restore from backup and rerun |
| 4:00 AM | Backup restore begins |
| 4:45 AM | Restore complete, batch restarted from TXN-PROC |
| 5:17 AM | TXN-PROC ABENDs again on the same transaction |
| 5:20 AM | Derek manually removes the offending transaction from the input file |
| 5:25 AM | Batch restarted, TXN-PROC completes successfully |
| 5:47 AM | Remaining batch steps complete |
Total outage: 3 hours 30 minutes. The batch window ran 2 hours late, delaying the start of online processing.
The Fix
Maria implemented a comprehensive defensive upgrade to TXN-PROC over the following week:
1. ON SIZE ERROR on All Arithmetic
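      * WS-ERR-AMT-EDIT is an assumed numeric-edited work field,
      * for example PIC -(11)9.99, used because STRING needs
      * DISPLAY-usage operands. The literal is strung directly so
      * WS-ERR-MSG is not both a source and the target.
           COMPUTE ACCT-CURR-BALANCE =
                   ACCT-CURR-BALANCE + TXN-AMOUNT
               ON SIZE ERROR
                   MOVE TXN-AMOUNT TO WS-ERR-AMT-EDIT
                   MOVE SPACES TO WS-ERR-MSG
                   STRING 'BALANCE OVERFLOW' DELIMITED BY SIZE
                          ' ACCT='           DELIMITED BY SIZE
                          ACCT-NUMBER        DELIMITED BY SIZE
                          ' AMT='            DELIMITED BY SIZE
                          WS-ERR-AMT-EDIT    DELIMITED BY SIZE
                       INTO WS-ERR-MSG
                   END-STRING
                   PERFORM 9800-LOG-ERROR
                   PERFORM 4500-WRITE-REJECT
           END-COMPUTE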
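To see why ON SIZE ERROR matters, it helps to watch it fire in isolation. The sketch below is a standalone illustration, not part of TXN-PROC: it uses a deliberately small hypothetical field (WS-SMALL-BALANCE, PIC S9(3)V99) so that an ordinary addition overflows, and all names in it are invented for the example.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. SIZEERR.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Small field so that 900.00 + 250.00 cannot fit (max 999.99).
       01  WS-SMALL-BALANCE   PIC S9(3)V99 COMP-3 VALUE 900.00.
       01  WS-TXN-AMOUNT      PIC S9(3)V99 COMP-3 VALUE 250.00.
      * Edited field used only to display the balance.
       01  WS-SHOW            PIC -ZZ9.99.
       PROCEDURE DIVISION.
       0000-MAIN.
           COMPUTE WS-SMALL-BALANCE =
                   WS-SMALL-BALANCE + WS-TXN-AMOUNT
               ON SIZE ERROR
                   DISPLAY 'SIZE ERROR: TRANSACTION REJECTED'
               NOT ON SIZE ERROR
                   DISPLAY 'POSTED'
           END-COMPUTE
      * Show what the receiving field holds after the COMPUTE.
           MOVE WS-SMALL-BALANCE TO WS-SHOW
           DISPLAY 'BALANCE AFTER COMPUTE: ' WS-SHOW
           STOP RUN.
```

When the size error condition is raised and an ON SIZE ERROR phrase is present, the receiving field is left unchanged, so the balance shown after the COMPUTE should still be the original 900.00. That is the property the fix relies on: the account balance is not altered by a transaction that cannot be posted.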
2. Pre-Validation of Transaction Amounts
           IF TXN-AMOUNT > WS-MAX-TXN-AMOUNT
               MOVE 'TXN AMOUNT EXCEEDS MAXIMUM' TO WS-ERR-MSG
               PERFORM 4500-WRITE-REJECT
           END-IF
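A rejection only helps if the rejected transaction is then kept away from the posting logic. One way to arrange that is a status flag set by a validation paragraph and tested before posting. The sketch below assumes this structure; the 88-level names, the paragraph names, the NUMERIC check, and the 10,000,000.00 limit are all hypothetical and are not taken from TXN-PROC.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. TXNVALID.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Hypothetical limit; the real value is a business decision.
       01  WS-MAX-TXN-AMOUNT PIC S9(9)V99 COMP-3 VALUE 10000000.00.
      * Sample input matching the wire transfer in the case study.
       01  TXN-AMOUNT        PIC S9(9)V99 COMP-3 VALUE 99999999.99.
       01  WS-TXN-STATUS     PIC X VALUE 'V'.
           88  TXN-VALID     VALUE 'V'.
           88  TXN-REJECTED  VALUE 'R'.
       01  WS-ERR-MSG        PIC X(40) VALUE SPACES.
       PROCEDURE DIVISION.
       0000-MAIN.
           PERFORM 2000-VALIDATE-TXN
      * Posting happens only when validation left the flag valid.
           IF TXN-VALID
               DISPLAY 'POSTING TRANSACTION'
           ELSE
               DISPLAY 'REJECTED: ' WS-ERR-MSG
           END-IF
           STOP RUN.
       2000-VALIDATE-TXN.
           SET TXN-VALID TO TRUE
      * EVALUATE stops at the first true WHEN, so the comparison
      * is never run against a field that failed the class test.
           EVALUATE TRUE
               WHEN TXN-AMOUNT IS NOT NUMERIC
                   MOVE 'TXN AMOUNT NOT NUMERIC' TO WS-ERR-MSG
                   SET TXN-REJECTED TO TRUE
               WHEN TXN-AMOUNT > WS-MAX-TXN-AMOUNT
                   MOVE 'TXN AMOUNT EXCEEDS MAXIMUM' TO WS-ERR-MSG
                   SET TXN-REJECTED TO TRUE
           END-EVALUATE.
```

With the sample values, the amount exceeds the assumed limit, so the program should take the rejection branch and skip posting.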
3. Balance Reasonableness Check
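      * The limit check sits under NOT ON SIZE ERROR because a
      * size error leaves WS-PROJECTED-BALANCE unchanged, so
      * testing it afterwards would test a stale value.
           COMPUTE WS-PROJECTED-BALANCE =
                   ACCT-CURR-BALANCE + TXN-AMOUNT
               ON SIZE ERROR
                   PERFORM HANDLE-OVERFLOW
               NOT ON SIZE ERROR
                   IF WS-PROJECTED-BALANCE > WS-MAX-BALANCE-LIMIT
                       MOVE 'PROJECTED BALANCE EXCEEDS LIMIT'
                           TO WS-ERR-MSG
                       PERFORM 9800-LOG-ERROR
                       PERFORM 4500-WRITE-REJECT
                   END-IF
           END-COMPUTE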
4. VSAM File Recovery Protection
Maria added logic to checkpoint the program's position every 10,000 transactions, writing the last-processed transaction ID to a restart file. If TXN-PROC ABENDs and is restarted, it reads the restart file and skips transactions that were already processed.
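The checkpoint logic described above might look roughly like the following. This is a simplified sketch under several assumptions, not Maria's implementation: transactions are simulated as sequential IDs instead of being read from a file, the restart file is assigned to a literal file name (GnuCOBOL style; a mainframe program would normally assign a DD name), and every data name and paragraph name is hypothetical.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. CHKPT.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
      * OPTIONAL lets the first run start with no restart file.
           SELECT OPTIONAL RESTART-FILE
               ASSIGN TO 'RESTART.DAT'
               ORGANIZATION IS SEQUENTIAL
               FILE STATUS IS WS-RST-STATUS.
       DATA DIVISION.
       FILE SECTION.
       FD  RESTART-FILE.
       01  RESTART-RECORD.
           05  RST-LAST-TXN-ID    PIC 9(9).
       WORKING-STORAGE SECTION.
       01  WS-RST-STATUS          PIC XX.
       01  WS-RESTART-ID          PIC 9(9) VALUE 0.
       01  WS-CKPT-ID             PIC 9(9) VALUE 0.
       01  WS-TXN-ID              PIC 9(9) VALUE 0.
       01  WS-SINCE-CHECKPOINT    PIC 9(5) VALUE 0.
       01  WS-CHECKPOINT-INTERVAL PIC 9(5) VALUE 10000.
      * Simulated input size; a real run would read until EOF.
       01  WS-TOTAL-TXNS          PIC 9(9) VALUE 25000.
       PROCEDURE DIVISION.
       0000-MAIN.
           PERFORM 1000-READ-RESTART-POINT
           PERFORM VARYING WS-TXN-ID FROM 1 BY 1
                   UNTIL WS-TXN-ID > WS-TOTAL-TXNS
      * Skip anything at or before the last checkpointed ID.
               IF WS-TXN-ID > WS-RESTART-ID
                   PERFORM 2000-PROCESS-TXN
               END-IF
           END-PERFORM
      * Clean finish: reset the restart point for the next run.
           MOVE 0 TO WS-CKPT-ID
           PERFORM 3000-WRITE-CHECKPOINT
           STOP RUN.
       1000-READ-RESTART-POINT.
           OPEN INPUT RESTART-FILE
           READ RESTART-FILE
               AT END
                   MOVE 0 TO WS-RESTART-ID
               NOT AT END
                   MOVE RST-LAST-TXN-ID TO WS-RESTART-ID
           END-READ
           CLOSE RESTART-FILE.
       2000-PROCESS-TXN.
      * Real posting logic would go here.
           ADD 1 TO WS-SINCE-CHECKPOINT
           IF WS-SINCE-CHECKPOINT >= WS-CHECKPOINT-INTERVAL
               MOVE WS-TXN-ID TO WS-CKPT-ID
               PERFORM 3000-WRITE-CHECKPOINT
               MOVE 0 TO WS-SINCE-CHECKPOINT
           END-IF.
       3000-WRITE-CHECKPOINT.
      * Rewrite the one-record restart file with the latest ID.
           MOVE WS-CKPT-ID TO RST-LAST-TXN-ID
           OPEN OUTPUT RESTART-FILE
           WRITE RESTART-RECORD
           CLOSE RESTART-FILE.
```

Two design points are worth noting. Skipping by ID assumes transactions arrive in ascending ID order. And transactions posted after the last checkpoint but before an ABEND would be posted again on restart unless the master file updates are also backed out to the checkpoint, so a production version needs to coordinate the two.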
Lessons Learned
- Every arithmetic operation is a potential ABEND. ON SIZE ERROR is not optional for production code.
- Validate before you calculate. Check that input values are within reasonable bounds before performing arithmetic.
- Recovery planning is part of defensive programming. The checkpoint/restart pattern reduced the recovery time from 3.5 hours (full restore + rerun) to under 30 minutes in subsequent incidents.
- The second ABEND was preventable. If the program had rejected the transaction on the first run instead of ABENDing, there would have been no outage at all.
Discussion Questions
- Why did the S0C7 occur on the MULTIPLY statement rather than the COMPUTE that caused the overflow? What does this tell you about how COMP-3 corruption manifests?
- Derek's instinct was to remove the offending transaction from the input file. Maria later told him this was the right short-term fix but the wrong long-term approach. Why?
- How would you design a comprehensive test case set to verify the defensive measures Maria implemented? What boundary values would you test?
- The checkpoint/restart pattern adds complexity. Under what circumstances is this complexity justified? When is it overkill?