Case Study 1: The Phantom Balance Corruption
The Incident
At 2:47 AM on a Wednesday, GlobalBank's nightly BAL-CALC batch job abended with S0C7 after processing 47,293 of 3.2 million account records. The job had been running without issues for three years. The on-call developer, Derek Washington, escalated to Maria Chen.
The Investigation
Step 1: Gather Diagnostic Information
Maria pulled up the job log:
IEA995I SYMPTOM DUMP:
SYSTEM COMPLETION CODE=0C7 REASON CODE=00000000
PSW AT TIME OF ERROR 078D1000 A001A3F0
ACTIVE LOAD MODULE BAL-CALC
OFFSET 0001A3F0
The CEEDUMP provided:
CEE3207S The system detected a data exception.
Location:
Program Unit: BAL-CALC
Statement: 3848
Offset: +0001A3F0
Variables at Statement 3848:
WS-ACCT-NUMBER = "ACCT00047294"
WS-GROSS-BALANCE = +1234567.89
WS-HOLD-AMOUNT = (INVALID DATA)
Step 2: Identify the Failing Statement
From the compiler listing:
003847 01A3E8 COMPUTE WS-NET-BALANCE =
003848 01A3F0 WS-GROSS-BALANCE - WS-HOLD-AMOUNT
The COMPUTE that subtracts WS-HOLD-AMOUNT from WS-GROSS-BALANCE was failing. WS-GROSS-BALANCE displayed a valid value and the CEEDUMP flagged WS-HOLD-AMOUNT as (INVALID DATA), so WS-HOLD-AMOUNT had to be the operand containing non-numeric data.
Step 3: Examine the Data
Using the MAP listing, Maria found WS-HOLD-AMOUNT at displacement +186 from the base register. The dump showed:
BL=01+186: 40 40 40 40 40 40 40
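X'40' is the EBCDIC space character. All seven bytes of the field were spaces, which is not valid packed decimal data: the low-order nibble of the last byte must hold a sign code (such as C, D, or F), and here it is 0.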
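To make the byte-level point concrete, here is a minimal sketch of how a class test treats those seven bytes. It is not taken from BAL-CALC: the program name NUMCHK and the WS-HOLD-RAW redefinition are hypothetical, and the PICTURE clause is only an assumption consistent with a 7-byte packed field.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. NUMCHK.
      * Sketch: class test on a packed-decimal field.
      * The PICTURE and the names are assumptions.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-HOLD-AMOUNT      PIC S9(11)V99 COMP-3.
      * Byte-level view of the same 7 bytes of storage.
       01  WS-HOLD-RAW REDEFINES WS-HOLD-AMOUNT PIC X(7).
       PROCEDURE DIVISION.
       MAIN-PARA.
      * Valid: 13 digit nibbles followed by sign nibble C.
           MOVE X'0000000015075C' TO WS-HOLD-RAW
           PERFORM CHECK-FIELD
      * EBCDIC spaces: the final nibble is 0, not a sign.
           MOVE X'40404040404040' TO WS-HOLD-RAW
           PERFORM CHECK-FIELD
           STOP RUN.
       CHECK-FIELD.
           IF WS-HOLD-AMOUNT IS NUMERIC
               DISPLAY 'VALID PACKED DECIMAL'
           ELSE
               DISPLAY 'NOT NUMERIC'
           END-IF.
```

The first pattern should pass the class test and the second should fail it, which is the same condition the hardware reports as a data exception when the field is used in arithmetic.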
Step 4: Trace the Data Source
WS-HOLD-AMOUNT was populated from the account master record:
003820 MOVE ACCT-HOLD-AMT TO WS-HOLD-AMOUNT.
Maria checked the actual VSAM record for ACCT00047294 using IDCAMS PRINT:
Position 147-153 (ACCT-HOLD-AMT): 40 40 40 40 40 40 40
Confirmed: the source data itself contained spaces in a numeric field.
Step 5: Find the Root Cause
Maria checked the change log and found that three weeks earlier, a data migration program (ACCT-CONV) had loaded 12,000 accounts from an acquired bank. She pulled up ACCT-CONV's source:
IF OLD-HOLD-AMT = SPACES
    CONTINUE
ELSE
    MOVE OLD-HOLD-AMT TO NEW-HOLD-AMT
END-IF.
The bug: when the old system had no hold amount (spaces), the conversion program did nothing, so NEW-HOLD-AMT was never set and kept the bytes already in the output record, which were spaces. The conversion should have moved ZEROS to NEW-HOLD-AMT.
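One way the corrected conversion logic might look is sketched below. This is an illustration, not the actual ACCT-CONV fix: the field layouts, the OLD-HOLD-AMT-X redefinition, the WS-REJECT-COUNT counter, and the program name CONVFIX are all assumptions. It also treats non-blank garbage as a separate case so that it can be counted for review.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. CONVFIX.
      * Sketch of a corrected conversion. The field layouts
      * below are assumptions, not the real record formats.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  OLD-HOLD-AMT        PIC 9(11)V99.
       01  OLD-HOLD-AMT-X REDEFINES OLD-HOLD-AMT PIC X(13).
       01  NEW-HOLD-AMT        PIC S9(11)V99 COMP-3.
       01  WS-REJECT-COUNT     PIC 9(4) VALUE ZERO.
       PROCEDURE DIVISION.
       MAIN-PARA.
      * Case 1: the old system left the amount blank.
           MOVE SPACES TO OLD-HOLD-AMT-X
           PERFORM CONVERT-HOLD-AMT
           PERFORM SHOW-RESULT
      * Case 2: the old system supplied 150.75.
           MOVE '0000000015075' TO OLD-HOLD-AMT-X
           PERFORM CONVERT-HOLD-AMT
           PERFORM SHOW-RESULT
           STOP RUN.
       CONVERT-HOLD-AMT.
           EVALUATE TRUE
               WHEN OLD-HOLD-AMT-X = SPACES
      * No hold on the account: store an explicit zero.
                   MOVE ZEROS TO NEW-HOLD-AMT
               WHEN OLD-HOLD-AMT IS NUMERIC
                   MOVE OLD-HOLD-AMT TO NEW-HOLD-AMT
               WHEN OTHER
      * Garbage input: store zero and count it for review.
                   MOVE ZEROS TO NEW-HOLD-AMT
                   ADD 1 TO WS-REJECT-COUNT
           END-EVALUATE.
       SHOW-RESULT.
           IF NEW-HOLD-AMT IS NUMERIC
               DISPLAY 'NEW-HOLD-AMT IS VALID'
           ELSE
               DISPLAY 'NEW-HOLD-AMT IS INVALID'
           END-IF.
```

Every branch assigns NEW-HOLD-AMT, so the target field can no longer be left with whatever the output record happened to contain.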
Step 6: Assess the Impact
IDCAMS PRINT INFILE(ACCTMAST) -
COUNT(9999999) -
SKIP(0) CHARACTER
Maria wrote a quick scan program that checked every account's ACCT-HOLD-AMT field. Result: 847 accounts had spaces in the hold amount field — all from the migration batch.
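The case study does not show the scan program itself. The following is a minimal sketch of the idea, assuming a 7-byte packed hold amount. A small in-memory table stands in for the VSAM file so the example is self-contained, and the program name HOLDSCAN, the table, and the sample account numbers are made up for illustration.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. HOLDSCAN.
      * Sketch of an impact scan. A small in-memory table stands
      * in for the VSAM file; all data here is made up.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-SAMPLE-TABLE.
           05  WS-ENTRY OCCURS 4 TIMES.
               10  WS-ACCT-NO      PIC X(12).
               10  WS-HOLD-AMT     PIC S9(11)V99 COMP-3.
               10  WS-HOLD-RAW REDEFINES WS-HOLD-AMT
                                   PIC X(7).
       01  WS-IDX                  PIC 9(4) COMP.
       01  WS-BAD-COUNT            PIC 9(4) VALUE ZERO.
       PROCEDURE DIVISION.
       MAIN-PARA.
      * Entries 2 and 4 mimic the migrated records: spaces
      * where a packed-decimal amount is expected.
           MOVE 'SAMPLE000001' TO WS-ACCT-NO (1)
           MOVE 150.75         TO WS-HOLD-AMT (1)
           MOVE 'SAMPLE000002' TO WS-ACCT-NO (2)
           MOVE SPACES         TO WS-HOLD-RAW (2)
           MOVE 'SAMPLE000003' TO WS-ACCT-NO (3)
           MOVE ZEROS          TO WS-HOLD-AMT (3)
           MOVE 'SAMPLE000004' TO WS-ACCT-NO (4)
           MOVE SPACES         TO WS-HOLD-RAW (4)
      * Report and count every entry that fails the class test.
           PERFORM VARYING WS-IDX FROM 1 BY 1 UNTIL WS-IDX > 4
               IF WS-HOLD-AMT (WS-IDX) IS NOT NUMERIC
                   ADD 1 TO WS-BAD-COUNT
                   DISPLAY 'BAD HOLD AMOUNT: ' WS-ACCT-NO (WS-IDX)
               END-IF
           END-PERFORM
           DISPLAY 'BAD RECORDS: ' WS-BAD-COUNT
           STOP RUN.
```

A production version would read the account master sequentially instead of walking a table, but the test applied to each record would be the same.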
Step 7: Apply the Fixes
Immediate fix — data cleanup:
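PERFORM UNTIL END-OF-FILE
    READ ACCT-MASTER
        AT END
            *> Without this the loop would never terminate
            SET END-OF-FILE TO TRUE
        NOT AT END
            *> Fix the field in the record area and rewrite that
            *> area, so the corrected value reaches the file
            IF ACCT-HOLD-AMT IS NOT NUMERIC
                MOVE ZEROS TO ACCT-HOLD-AMT
                REWRITE ACCT-RECORD
                ADD 1 TO WS-FIX-COUNT
            END-IF
    END-READ
END-PERFORM.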
Defensive fix — add validation to BAL-CALC:
IF ACCT-HOLD-AMT IS NOT NUMERIC
    MOVE ZEROS TO WS-HOLD-AMOUNT
    ADD 1 TO WS-DATA-QUALITY-ERRORS
    PERFORM WRITE-DATA-QUALITY-LOG
ELSE
    MOVE ACCT-HOLD-AMT TO WS-HOLD-AMOUNT
END-IF.
Prevention fix — add data quality validation to all migration programs.
Timeline
| Time | Action |
|---|---|
| 02:47 | BAL-CALC abends. On-call paged. |
| 02:55 | Derek escalates to Maria. |
| 03:05 | Maria identifies failing statement from CEEDUMP. |
| 03:12 | Maria confirms spaces in WS-HOLD-AMOUNT from the dump. |
| 03:20 | Maria traces to data migration as root cause. |
| 03:35 | Data cleanup program written and tested. |
| 03:50 | 847 records fixed. |
| 04:00 | BAL-CALC restarted with defensive validation added. |
| 04:05 | BAL-CALC completes successfully. |
Total resolution time: 78 minutes.
Discussion Questions
- Why did the bug take three weeks to manifest? (Hint: accounts are processed in account-number order, and the migrated accounts were numbered starting at ACCT00047001.)
- Maria's defensive fix logs data quality errors rather than abending. Is this the right approach? Under what circumstances should the program abend instead?
- The conversion programmer tested with data that always had hold amounts. What testing strategy would have caught this bug?
- Derek could not have resolved this alone. What skills and knowledge does Maria have that Derek is still developing?
- How does the "IS NOT NUMERIC" test work on a COMP-3 field? What exactly does it check?
Lessons Learned
- Data migration is a major source of data quality bugs in mainframe systems
- Defensive validation (IS NUMERIC checks) should be standard practice for all numeric fields populated from external sources
- CEEDUMP with variable values dramatically reduces debugging time compared to raw hex dump analysis
- The bug was in a different program (ACCT-CONV) than the one that abended (BAL-CALC) — root causes are often separated from symptoms by time and code