Case Study 10.1: The Transaction That Broke TXN-PROC

Background

GlobalBank's TXN-PROC program processes 2.3 million transactions every night. In March 2021, an unusual transaction entered the system: a wire transfer for $99,999,999.99. The amount is within the range of the ACCT-CURR-BALANCE field (PIC S9(11)V99 COMP-3, maximum $99,999,999,999.99), but it was far larger than anything the program's downstream arithmetic had been written to handle.

The Incident

At 2:17 AM on March 15, the overnight batch operator noticed that TXN-PROC had ABENDed with an S0C7 (data exception). The ABEND occurred at offset +002A4E in the program, which corresponded to a COMPUTE statement that calculated the new account balance.

The Root Cause Chain

  1. The trigger: A wire transfer of $99,999,999.99 was processed for account 7834291002, which already had a balance of $1,247,832.50.

  2. The overflow: The statement COMPUTE ACCT-CURR-BALANCE = ACCT-CURR-BALANCE + TXN-AMOUNT produced a result of $101,247,832.49. That sum fits in PIC S9(11)V99 (maximum $99,999,999,999.99), so the declared picture alone does not explain the failure; the problem surfaced in a secondary calculation.

  3. The secondary calculation: A downstream calculation attempted to multiply the new balance by an interest rate factor. The overflow from step 2 corrupted the COMP-3 field (producing an invalid packed-decimal sign nibble), and the subsequent MULTIPLY on the corrupted field caused the S0C7 ABEND.

  4. The missing check: There was no ON SIZE ERROR on the COMPUTE statement, and no validation that the result was within acceptable bounds.

  5. The cascade: When TXN-PROC ABENDed, it left the account master file in an inconsistent state — some transactions had been posted, others had not. Recovery required restoring the file from the previous night's backup and rerunning the entire batch.
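
The role of the missing ON SIZE ERROR phrase in step 4 can be illustrated with a minimal sketch. It is not the TXN-PROC code: the program name and the field WS-SMALL-BALANCE are hypothetical, and the balance field is deliberately declared narrower than the real one so that the same two amounts trigger the condition. With the phrase coded, the receiving field keeps its prior value and the program regains control; without it, the COBOL standard leaves the result undefined, and compilers typically truncate high-order digits.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. SIZEDEMO.
      * Sketch only. WS-SMALL-BALANCE is a hypothetical field,
      * narrower than the case study's PIC S9(11)V99, so that the
      * overflow is easy to trigger.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-SMALL-BALANCE    PIC S9(8)V99 COMP-3
                               VALUE 1247832.50.
       01  WS-TXN-AMOUNT       PIC S9(8)V99 COMP-3
                               VALUE 99999999.99.
       01  WS-SHOW             PIC -(9)9.99.
       PROCEDURE DIVISION.
       MAIN-PARA.
      * The true sum, 101,247,832.49, needs nine integer digits.
      * WS-SMALL-BALANCE has eight, so SIZE ERROR is raised.
           COMPUTE WS-SMALL-BALANCE =
               WS-SMALL-BALANCE + WS-TXN-AMOUNT
               ON SIZE ERROR
                   DISPLAY 'SIZE ERROR: TRANSACTION REJECTED'
               NOT ON SIZE ERROR
                   DISPLAY 'POSTED'
           END-COMPUTE
      * Because ON SIZE ERROR is coded, the receiving field still
      * holds the balance it had before the COMPUTE.
           MOVE WS-SMALL-BALANCE TO WS-SHOW
           DISPLAY 'BALANCE AFTER: ' WS-SHOW
           STOP RUN.
```

Removing the ON SIZE ERROR and NOT ON SIZE ERROR phrases from this sketch is a quick way to see what a particular compiler does with the unguarded statement.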

Timeline

Time      Event
2:17 AM   TXN-PROC ABENDs with S0C7
2:20 AM   Operator pages on-call support (Derek Washington)
2:45 AM   Derek identifies the ABEND offset and the corrupted field
3:15 AM   Derek calls Maria Chen for guidance on recovery
3:30 AM   Decision made to restore from backup and rerun
4:00 AM   Backup restore begins
4:45 AM   Restore complete, batch restarted from TXN-PROC
5:17 AM   TXN-PROC ABENDs again on the same transaction
5:20 AM   Derek manually removes the offending transaction from the input file
5:25 AM   Batch restarted, TXN-PROC completes successfully
5:47 AM   Remaining batch steps complete

Total outage: 3 hours 30 minutes. The batch window ran 2 hours late, delaying the start of online processing.

The Fix

Maria implemented a comprehensive defensive upgrade to TXN-PROC over the following week:

1. ON SIZE ERROR on All Arithmetic

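      * STRING sources must be DISPLAY items and must not overlap
      * the receiving field, so the message is built from literals
      * and an edited copy of the amount. WS-TXN-AMT-EDIT is a
      * hypothetical numeric-edited field, e.g. PIC -(9)9.99.
           COMPUTE ACCT-CURR-BALANCE =
               ACCT-CURR-BALANCE + TXN-AMOUNT
               ON SIZE ERROR
                   MOVE TXN-AMOUNT TO WS-TXN-AMT-EDIT
                   MOVE SPACES TO WS-ERR-MSG
                   STRING 'BALANCE OVERFLOW' DELIMITED BY SIZE
                          ' ACCT=' DELIMITED BY SIZE
                          ACCT-NUMBER DELIMITED BY SIZE
                          ' AMT=' DELIMITED BY SIZE
                          WS-TXN-AMT-EDIT DELIMITED BY SIZE
                     INTO WS-ERR-MSG
                   END-STRING
                   PERFORM 9800-LOG-ERROR
                   PERFORM 4500-WRITE-REJECT
           END-COMPUTE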

2. Pre-Validation of Transaction Amounts

           IF TXN-AMOUNT > WS-MAX-TXN-AMOUNT
               MOVE 'TXN AMOUNT EXCEEDS MAXIMUM' TO WS-ERR-MSG
               PERFORM 4500-WRITE-REJECT
           END-IF
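
Because an S0C7 is raised when packed-decimal arithmetic meets invalid digits or an invalid sign, pre-validation can also confirm that the amount is valid numeric data before any arithmetic touches it. The following is a minimal sketch under stated assumptions: the program name, the paragraph 4100-VALIDATE-AMOUNT, the switch WS-TXN-STATUS, the picture of TXN-AMOUNT, and the limit value are all hypothetical, chosen only to make the example self-contained.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. VALDEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Picture and limit are placeholders, not GlobalBank values.
       01  TXN-AMOUNT          PIC S9(9)V99 COMP-3
                               VALUE 99999999.99.
       01  WS-MAX-TXN-AMOUNT   PIC S9(9)V99 COMP-3
                               VALUE 50000000.00.
       01  WS-ERR-MSG          PIC X(60) VALUE SPACES.
       01  WS-TXN-STATUS       PIC X VALUE 'Y'.
           88  TXN-IS-VALID    VALUE 'Y'.
           88  TXN-IS-REJECTED VALUE 'N'.
       PROCEDURE DIVISION.
       MAIN-PARA.
           PERFORM 4100-VALIDATE-AMOUNT
      * Posting happens only when validation passed, so a rejected
      * transaction never reaches the balance arithmetic.
           IF TXN-IS-VALID
               DISPLAY 'POST TRANSACTION'
           ELSE
               DISPLAY 'REJECTED: ' WS-ERR-MSG
           END-IF
           STOP RUN.

       4100-VALIDATE-AMOUNT.
           SET TXN-IS-VALID TO TRUE
      * EVALUATE stops at the first true WHEN, so the range test
      * is reached only if the field holds valid numeric data.
           EVALUATE TRUE
               WHEN TXN-AMOUNT IS NOT NUMERIC
                   MOVE 'TXN AMOUNT NOT NUMERIC' TO WS-ERR-MSG
                   SET TXN-IS-REJECTED TO TRUE
               WHEN TXN-AMOUNT > WS-MAX-TXN-AMOUNT
                   MOVE 'TXN AMOUNT EXCEEDS MAXIMUM' TO WS-ERR-MSG
                   SET TXN-IS-REJECTED TO TRUE
           END-EVALUATE.
```

Carrying the outcome in a switch, rather than only performing the reject paragraph, lets the caller skip the posting logic for a rejected transaction.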

3. Balance Reasonableness Check

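      * WS-PROJECTED-BALANCE is left unchanged when SIZE ERROR is
      * raised, so the limit test runs only on a good result.
           COMPUTE WS-PROJECTED-BALANCE =
               ACCT-CURR-BALANCE + TXN-AMOUNT
               ON SIZE ERROR
                   PERFORM HANDLE-OVERFLOW
               NOT ON SIZE ERROR
                   IF WS-PROJECTED-BALANCE > WS-MAX-BALANCE-LIMIT
                       MOVE 'PROJECTED BALANCE EXCEEDS LIMIT'
                           TO WS-ERR-MSG
                       PERFORM 9800-LOG-ERROR
                       PERFORM 4500-WRITE-REJECT
                   END-IF
           END-COMPUTE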
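
A reasonableness check of this kind depends on how the work fields are declared. One possible arrangement, sketched here with hypothetical pictures and a placeholder limit, gives the projected-balance field more integer digits than either operand, so the projection itself has room for any valid pair of inputs and the business limit does the rejecting.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. PROJDEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  ACCT-CURR-BALANCE     PIC S9(11)V99 COMP-3
                                 VALUE 1247832.50.
      * The picture of TXN-AMOUNT is an assumption.
       01  TXN-AMOUNT            PIC S9(9)V99 COMP-3
                                 VALUE 99999999.99.
      * Wider than both operands, so their sum always fits.
       01  WS-PROJECTED-BALANCE  PIC S9(13)V99 COMP-3.
      * Placeholder business limit, not a value from the case.
       01  WS-MAX-BALANCE-LIMIT  PIC S9(11)V99 COMP-3
                                 VALUE 100000000.00.
       PROCEDURE DIVISION.
       MAIN-PARA.
      * ON SIZE ERROR stays in place even though the wide field
      * should make it unreachable for valid inputs.
           COMPUTE WS-PROJECTED-BALANCE =
               ACCT-CURR-BALANCE + TXN-AMOUNT
               ON SIZE ERROR
                   DISPLAY 'REJECT: PROJECTION OVERFLOW'
               NOT ON SIZE ERROR
                   PERFORM 1000-CHECK-LIMIT
           END-COMPUTE
           STOP RUN.

       1000-CHECK-LIMIT.
      * The balance is updated only after the projection has
      * passed the limit test.
           IF WS-PROJECTED-BALANCE > WS-MAX-BALANCE-LIMIT
               DISPLAY 'REJECT: PROJECTED BALANCE EXCEEDS LIMIT'
           ELSE
               MOVE WS-PROJECTED-BALANCE TO ACCT-CURR-BALANCE
               DISPLAY 'POSTED'
           END-IF.
```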

4. VSAM File Recovery Protection

Maria added logic to checkpoint the program's position every 10,000 transactions, writing the last-processed transaction ID to a restart file. If TXN-PROC ABENDs and is restarted, it reads the restart file and skips transactions that were already processed.
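
The checkpoint logic described above might look roughly like the following sketch. It is not Maria's implementation: the program name, file names, record layouts, and paragraph names are hypothetical, and the posting logic is reduced to a comment. It assumes transactions arrive in a stable order with unique IDs, and it ignores the question of master-file updates made between the last checkpoint and the ABEND, which a real program would have to back out or make safe to reapply.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. CKPTDEMO.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
      * File names are placeholders for this sketch.
           SELECT TXN-FILE ASSIGN TO 'TXNIN'
               ORGANIZATION IS SEQUENTIAL
               FILE STATUS IS WS-TXN-STATUS.
           SELECT OPTIONAL RESTART-FILE ASSIGN TO 'RESTART'
               ORGANIZATION IS SEQUENTIAL
               FILE STATUS IS WS-RST-STATUS.
       DATA DIVISION.
       FILE SECTION.
       FD  TXN-FILE.
       01  TXN-RECORD.
           05  TXN-ID              PIC 9(10).
           05  FILLER              PIC X(70).
       FD  RESTART-FILE.
       01  RESTART-RECORD.
           05  RST-LAST-TXN-ID     PIC 9(10).
       WORKING-STORAGE SECTION.
       01  WS-TXN-STATUS           PIC XX.
       01  WS-RST-STATUS           PIC XX.
       01  WS-RESTART-ID           PIC 9(10) VALUE ZERO.
       01  WS-SINCE-CKPT           PIC 9(5)  VALUE ZERO.
       01  WS-CKPT-INTERVAL        PIC 9(5)  VALUE 10000.
       01  WS-EOF-SW               PIC X     VALUE 'N'.
           88  END-OF-TXNS         VALUE 'Y'.
       01  WS-SKIP-SW              PIC X     VALUE 'N'.
           88  SKIPPING            VALUE 'Y'.
           88  NOT-SKIPPING        VALUE 'N'.
       PROCEDURE DIVISION.
       0000-MAIN.
           PERFORM 1000-READ-RESTART
           OPEN INPUT TXN-FILE
           IF WS-TXN-STATUS NOT = '00'
               DISPLAY 'CANNOT OPEN TXN FILE, STATUS ' WS-TXN-STATUS
               STOP RUN
           END-IF
           PERFORM 2000-READ-TXN
           PERFORM UNTIL END-OF-TXNS
      * While skipping, discard records up to and including the
      * checkpointed ID; processing resumes with the next record.
               IF SKIPPING
                   IF TXN-ID = WS-RESTART-ID
                       SET NOT-SKIPPING TO TRUE
                   END-IF
               ELSE
                   PERFORM 3000-PROCESS-TXN
               END-IF
               PERFORM 2000-READ-TXN
           END-PERFORM
           CLOSE TXN-FILE
      * Clean finish: empty the restart file so the next run
      * starts from the first transaction.
           OPEN OUTPUT RESTART-FILE
           CLOSE RESTART-FILE
           STOP RUN.

       1000-READ-RESTART.
      * Status 05 means the optional file was not present.
           OPEN INPUT RESTART-FILE
           IF WS-RST-STATUS = '00' OR '05'
               READ RESTART-FILE
                   AT END
                       CONTINUE
                   NOT AT END
                       MOVE RST-LAST-TXN-ID TO WS-RESTART-ID
                       SET SKIPPING TO TRUE
               END-READ
               CLOSE RESTART-FILE
           END-IF.

       2000-READ-TXN.
           READ TXN-FILE
               AT END SET END-OF-TXNS TO TRUE
           END-READ
           IF WS-TXN-STATUS NOT = '00' AND NOT = '10'
               DISPLAY 'TXN READ ERROR, STATUS ' WS-TXN-STATUS
               STOP RUN
           END-IF.

       3000-PROCESS-TXN.
      * Posting logic would go here.
           ADD 1 TO WS-SINCE-CKPT
           IF WS-SINCE-CKPT >= WS-CKPT-INTERVAL
               PERFORM 3500-WRITE-CHECKPOINT
           END-IF.

       3500-WRITE-CHECKPOINT.
      * OPEN OUTPUT replaces the previous checkpoint record.
           OPEN OUTPUT RESTART-FILE
           MOVE TXN-ID TO RST-LAST-TXN-ID
           WRITE RESTART-RECORD
           CLOSE RESTART-FILE
           MOVE ZERO TO WS-SINCE-CKPT.
```

One weakness of this sketch is that if the checkpointed ID never appears in the input, for example because the input file was edited between runs, every record is skipped. A production version would want to detect that case and stop with an error.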

Lessons Learned

  1. Every arithmetic operation is a potential ABEND. ON SIZE ERROR is not optional for production code.

  2. Validate before you calculate. Check that input values are within reasonable bounds before performing arithmetic.

  3. Recovery planning is part of defensive programming. The checkpoint/restart pattern reduced the recovery time from 3.5 hours (full restore + rerun) to under 30 minutes in subsequent incidents.

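  4. The second ABEND was preventable. The batch was rerun with the offending transaction still in the input file, so it failed at the same point. Had the program rejected the transaction on the first run instead of ABENDing, there would have been no outage at all.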

Discussion Questions

  1. Why did the S0C7 occur on the MULTIPLY statement rather than the COMPUTE that caused the overflow? What does this tell you about how COMP-3 corruption manifests?

  2. Derek's instinct was to remove the offending transaction from the input file. Maria later told him this was the right short-term fix but the wrong long-term approach. Why?

  3. How would you design a comprehensive test case set to verify the defensive measures Maria implemented? What boundary values would you test?

  4. The checkpoint/restart pattern adds complexity. Under what circumstances is this complexity justified? When is it overkill?