Case Study 1: The Phantom Balance Corruption

The Incident

At 2:47 AM on a Wednesday, GlobalBank's nightly BAL-CALC batch job abended with S0C7 after processing 47,293 of 3.2 million account records. The job had been running without issues for three years. The on-call developer, Derek Washington, escalated to Maria Chen.

The Investigation

Step 1: Gather Diagnostic Information

Maria pulled up the job log:

IEA995I SYMPTOM DUMP:
SYSTEM COMPLETION CODE=0C7  REASON CODE=00000000
  PSW AT TIME OF ERROR  078D1000  A001A3F0
  ACTIVE LOAD MODULE    BAL-CALC
  OFFSET                0001A3F0

The CEEDUMP provided:

CEE3207S The system detected a data exception.
Location:
  Program Unit:  BAL-CALC
  Statement:     3848
  Offset:        +0001A3F0

Variables at Statement 3848:
  WS-ACCT-NUMBER    = "ACCT00047294"
  WS-GROSS-BALANCE  = +1234567.89
  WS-HOLD-AMOUNT    = (INVALID DATA)

Step 2: Identify the Failing Statement

From the compiler listing:

003847  01A3E8  COMPUTE WS-NET-BALANCE =
003848  01A3F0      WS-GROSS-BALANCE - WS-HOLD-AMOUNT

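Offset 01A3F0 in the listing matched the abend offset, which placed the failure in the COMPUTE subtracting WS-HOLD-AMOUNT from WS-GROSS-BALANCE. An S0C7 is a data exception: a decimal instruction met an operand whose digits or sign were invalid. WS-GROSS-BALANCE displayed a valid value and the CEEDUMP flagged WS-HOLD-AMOUNT as invalid data, so WS-HOLD-AMOUNT had to be the operand holding non-numeric data.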

Step 3: Examine the Data

Using the MAP listing, Maria found WS-HOLD-AMOUNT at displacement +186 from the base register. The dump showed:

BL=01+186: 40 40 40 40 40 40 40

X'40' is EBCDIC space. The entire field was spaces — not valid packed decimal data.
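
To see why a field full of X'40' fails, recall that packed decimal stores one digit per half-byte and reserves the final half-byte for the sign (typically C, D, or F). The last byte of the dumped field is X'40', so the sign position holds 0, which is not a sign. The following is a minimal sketch, not code from BAL-CALC: the program name PACKCHK, the WS-HOLD-RAW overlay, and the PIC S9(11)V99 definition are assumptions chosen only to match the 7-byte length seen in the dump.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. PACKCHK.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Seven bytes of storage viewed two ways: as raw characters
      * and as a packed decimal amount. The picture is an assumption;
      * 13 digits plus a sign occupy (13 + 1) / 2 = 7 bytes.
       01  WS-HOLD-RAW             PIC X(7).
       01  WS-HOLD-AMOUNT REDEFINES WS-HOLD-RAW
                                   PIC S9(11)V99 COMP-3.
       PROCEDURE DIVISION.
      *    Reproduce the bad data: every byte is a space.
           MOVE SPACES TO WS-HOLD-RAW
           IF WS-HOLD-AMOUNT IS NUMERIC
               DISPLAY 'SPACES: VALID PACKED DECIMAL'
           ELSE
               DISPLAY 'SPACES: NOT VALID PACKED DECIMAL'
           END-IF
      *    A numeric MOVE builds proper digits and a proper sign.
           MOVE ZEROS TO WS-HOLD-AMOUNT
           IF WS-HOLD-AMOUNT IS NUMERIC
               DISPLAY 'ZEROS:  VALID PACKED DECIMAL'
           ELSE
               DISPLAY 'ZEROS:  NOT VALID PACKED DECIMAL'
           END-IF
           GOBACK.
```

The class test only inspects the bytes and does no arithmetic, so it can report the bad field at the point where a COMPUTE would abend.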

Step 4: Trace the Data Source

WS-HOLD-AMOUNT was populated from the account master record:

003820  MOVE ACCT-HOLD-AMT TO WS-HOLD-AMOUNT.

Maria checked the actual VSAM record for ACCT00047294 using IDCAMS PRINT:

Position 147-153 (ACCT-HOLD-AMT): 40 40 40 40 40 40 40

Confirmed: the source data itself contained spaces in a numeric field.

Step 5: Find the Root Cause

Maria checked the change log and found that three weeks earlier, a data migration program (ACCT-CONV) had loaded 12,000 accounts from an acquired bank. She pulled up ACCT-CONV's source:

       IF OLD-HOLD-AMT = SPACES
           CONTINUE
       ELSE
           MOVE OLD-HOLD-AMT TO NEW-HOLD-AMT
       END-IF.

The bug: when the old system had no hold amount (spaces), the conversion program did nothing — leaving the field uninitialized. The correct code should have moved ZEROS to NEW-HOLD-AMT.
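
One way to close the gap is to make sure every branch assigns the target field. The sketch below wraps the corrected logic in a small test program. The program name CONVFIX, the field pictures, the alphanumeric view OLD-HOLD-AMT-X, and the WS-SHOW-AMT display field are all assumptions, since the case study does not show ACCT-CONV's data definitions.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. CONVFIX.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Assumed layouts. OLD-HOLD-AMT-X is a character view of the
      * old field so the SPACES test is a plain character compare.
       01  OLD-HOLD-AMT-X          PIC X(9).
       01  OLD-HOLD-AMT REDEFINES OLD-HOLD-AMT-X
                                   PIC 9(7)V99.
       01  NEW-HOLD-AMT            PIC S9(11)V99 COMP-3.
       01  WS-SHOW-AMT             PIC -(11)9.99.
       PROCEDURE DIVISION.
       MAIN-PARA.
      *    Case 1: the old system supplied no hold amount.
           MOVE SPACES TO OLD-HOLD-AMT-X
           PERFORM CONVERT-HOLD-AMT
      *    Case 2: the old system supplied 150.25.
           MOVE 150.25 TO OLD-HOLD-AMT
           PERFORM CONVERT-HOLD-AMT
           GOBACK.

       CONVERT-HOLD-AMT.
      *    Every path assigns NEW-HOLD-AMT, so the target can never
      *    keep whatever bytes were already in the output record.
           IF OLD-HOLD-AMT-X = SPACES
               MOVE ZEROS TO NEW-HOLD-AMT
           ELSE
               MOVE OLD-HOLD-AMT TO NEW-HOLD-AMT
           END-IF
           MOVE NEW-HOLD-AMT TO WS-SHOW-AMT
           DISPLAY 'NEW-HOLD-AMT = ' WS-SHOW-AMT.
```

Running both cases side by side is also the kind of test that would have exposed the original CONTINUE branch, because the first case exercises the "no hold amount" path.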

Step 6: Assess the Impact

IDCAMS PRINT INFILE(ACCTMAST) -
  COUNT(9999999) -
  SKIP(0) CHARACTER

Maria wrote a quick scan program that checked every account's ACCT-HOLD-AMT field. Result: 847 accounts had spaces in the hold amount field — all from the migration batch.
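
The case study does not reproduce the scan program, but a read-only pass of this kind might look like the sketch below. Only the 7-byte hold amount at positions 147-153 comes from the investigation; the program name HOLDSCAN, the key position, the 200-byte record length, the DD name ACCTMAST as the assignment, and all working-storage names are assumptions for illustration.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. HOLDSCAN.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
           SELECT ACCT-MASTER ASSIGN TO ACCTMAST
               ORGANIZATION IS INDEXED
               ACCESS MODE IS SEQUENTIAL
               RECORD KEY IS ACCT-NUMBER
               FILE STATUS IS WS-FILE-STATUS.
       DATA DIVISION.
       FILE SECTION.
       FD  ACCT-MASTER.
      * Hypothetical layout: 12 + 134 bytes place ACCT-HOLD-AMT at
      * positions 147-153; the trailing filler pads to 200 bytes.
       01  ACCT-RECORD.
           05  ACCT-NUMBER         PIC X(12).
           05  FILLER              PIC X(134).
           05  ACCT-HOLD-AMT       PIC S9(11)V99 COMP-3.
           05  FILLER              PIC X(47).
       WORKING-STORAGE SECTION.
       01  WS-FILE-STATUS          PIC XX.
       01  WS-EOF-SWITCH           PIC X VALUE 'N'.
           88  END-OF-FILE         VALUE 'Y'.
       01  WS-READ-COUNT           PIC 9(9) VALUE ZERO.
       01  WS-BAD-COUNT            PIC 9(9) VALUE ZERO.
       PROCEDURE DIVISION.
      *    Open read-only; the scan must not change the master file.
           OPEN INPUT ACCT-MASTER
           IF WS-FILE-STATUS NOT = '00'
              AND WS-FILE-STATUS NOT = '97'
               DISPLAY 'OPEN FAILED, STATUS ' WS-FILE-STATUS
               MOVE 12 TO RETURN-CODE
               GOBACK
           END-IF
           PERFORM UNTIL END-OF-FILE
               READ ACCT-MASTER
               EVALUATE WS-FILE-STATUS
                   WHEN '00'
                       ADD 1 TO WS-READ-COUNT
                       IF ACCT-HOLD-AMT IS NOT NUMERIC
                           ADD 1 TO WS-BAD-COUNT
                           DISPLAY 'BAD HOLD AMOUNT: ' ACCT-NUMBER
                       END-IF
                   WHEN '10'
                       SET END-OF-FILE TO TRUE
                   WHEN OTHER
      *                Stop on any unexpected I/O error.
                       DISPLAY 'READ FAILED, STATUS ' WS-FILE-STATUS
                       MOVE 12 TO RETURN-CODE
                       SET END-OF-FILE TO TRUE
               END-EVALUATE
           END-PERFORM
           CLOSE ACCT-MASTER
           DISPLAY 'RECORDS READ:     ' WS-READ-COUNT
           DISPLAY 'BAD HOLD AMOUNTS: ' WS-BAD-COUNT
           GOBACK.
```

Listing the account numbers as well as the count makes it possible to confirm that every affected record belongs to the migration batch before any data is changed.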

Step 7: Apply the Fixes

Immediate fix — data cleanup:

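      * ACCT-MASTER must be open I-O for REWRITE to be allowed.
       PERFORM UNTIL END-OF-FILE
           READ ACCT-MASTER
               AT END
                   SET END-OF-FILE TO TRUE
               NOT AT END
      *            Zero the field in the record area, then rewrite
      *            that same area so the correction reaches the file.
                   IF ACCT-HOLD-AMT IS NOT NUMERIC
                       MOVE ZEROS TO ACCT-HOLD-AMT
                       REWRITE ACCT-RECORD
                       ADD 1 TO WS-FIX-COUNT
                   END-IF
           END-READ
       END-PERFORM.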

Defensive fix — add validation to BAL-CALC:

       IF ACCT-HOLD-AMT IS NOT NUMERIC
           MOVE ZEROS TO WS-HOLD-AMOUNT
           ADD 1 TO WS-DATA-QUALITY-ERRORS
           PERFORM WRITE-DATA-QUALITY-LOG
       ELSE
           MOVE ACCT-HOLD-AMT TO WS-HOLD-AMOUNT
       END-IF.

Prevention fix — add data quality validation to all migration programs.

Timeline

Time   Action
02:47  BAL-CALC abends. On-call paged.
02:55  Derek escalates to Maria.
03:05  Maria identifies failing statement from CEEDUMP.
03:12  Maria confirms spaces in ACCT-HOLD-AMT from dump.
03:20  Maria traces to data migration as root cause.
03:35  Data cleanup program written and tested.
03:50  847 records fixed.
04:00  BAL-CALC restarted with defensive validation added.
04:05  BAL-CALC completes successfully.

Total resolution time: 78 minutes.

Discussion Questions

  1. Why did the bug take three weeks to manifest? (Hint: accounts are processed in account-number order, and the migrated accounts were numbered starting at ACCT00047001.)
  2. Maria's defensive fix logs data quality errors rather than abending. Is this the right approach? Under what circumstances should the program abend instead?
  3. The conversion programmer tested with data that always had hold amounts. What testing strategy would have caught this bug?
  4. Derek could not have resolved this alone. What skills and knowledge does Maria have that Derek is still developing?
  5. How does the "IS NOT NUMERIC" test work on a COMP-3 field? What exactly does it check?

Lessons Learned

  • Data migration is one of the most common sources of data quality bugs in mainframe systems
  • Defensive validation (IS NUMERIC checks) should be standard practice for all numeric fields populated from external sources
  • CEEDUMP with variable values dramatically reduces debugging time compared to raw hex dump analysis
  • The bug was in a different program (ACCT-CONV) than the one that abended (BAL-CALC) — root causes are often separated from symptoms by time and code