Case Study 1: Diagnosing a Production Abend
Background
It is 3:12 AM on a Tuesday. Sarah Chen, the on-call COBOL programmer at Pacific Coast Savings Bank, receives a page on her phone: "CRITICAL: Job NIGHTLY07 step LOANPOST abended S0C7 at 03:07. Batch window deadline: 06:00 AM." She pulls up her laptop, connects to the mainframe through the VPN, and begins diagnosing the problem.
LOANPOST is a COBOL program that runs every night as part of the bank's end-of-day batch cycle. It reads a sequential file of loan payment transactions (approximately 85,000 records per night), matches each payment to the corresponding loan account in a VSAM KSDS master file, calculates interest accrual, applies the payment, and updates the master record. The program has been in production for seven years and runs reliably five nights a week. Tonight, it processed 42,617 records before abending.
Sarah knows from experience that S0C7 is a data exception -- the program attempted decimal arithmetic on a field containing non-numeric data. The challenge is finding which field, which record, and why. She has less than three hours before the batch window closes and the online system must be brought up for the business day.
Step 1: Gathering Initial Information
Sarah's first action is to open SDSF (System Display and Search Facility) on TSO and locate the job output for NIGHTLY07. She enters the following commands:
SDSF
ST
PREFIX NIGHTLY07
The job output shows the following messages at the end of STEP LOANPOST:
IEA995I SYMPTOM DUMP OUTPUT
SYSTEM COMPLETION CODE=0C7 REASON CODE=00000007
PSW AT TIME OF ERROR 078D1000 80049A2E
ILC 06 INTC 07
ACTIVE LOAD MODULE LOANPOST
ACTIVE CSECT LOANPOST
DATA AT PSW 00049A28 - D207D030 D110F020 90ECD00C
GPR 0-3 00000000 00049200 00045A00 00045B80
GPR 4-7 00045C00 00000001 00045DE0 0004A310
GPR 8-11 00045E60 00045F40 00046000 000460C0
GPR 12-15 00049000 00045800 80049A2E 00000000
The critical pieces of information are:
- System Completion Code = 0C7: Confirms this is a data exception (S0C7).
- PSW address = 80049A2E: This is the instruction address the PSW held when the abend occurred. The leading "8" is the high-order bit that flags 31-bit addressing mode; the address itself is 0049A2E.
- ILC 06 INTC 07: The Instruction Length Code is 6 bytes, which is consistent with an SS-type instruction (the format used by packed decimal operations such as AP, SP, MP, DP, CP, and ZAP). Interruption code 07 is the data exception.
- ACTIVE LOAD MODULE = LOANPOST: The program that abended.
Step 2: Calculating the Offset
To find the source line that failed, Sarah needs to calculate the offset of the failing instruction within the LOANPOST module. She looks at the dump header for the Entry Point Address (EPA):
LOAD MODULE LOANPOST
ENTRY POINT 00044000
LOAD ADDR 00044000
The offset calculation:
PSW address:   0049A2E
EPA:          -0044000
              --------
Offset:        0005A2E
The failing instruction is at offset X'5A2E' within the LOANPOST load module.
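The same subtraction can be sketched in a few lines of Python, which is handy when several dumps have to be checked in a hurry. This is only an illustrative helper, not part of any IBM tooling: the function name and the mask constant are assumptions made for the example, while the two addresses come from the dump above.

```python
ADDRESS_MASK_31 = 0x7FFFFFFF  # clears the high-order bit that flags 31-bit mode


def module_offset(psw_word: int, load_address: int) -> int:
    """Return the offset of the PSW instruction address within the load module."""
    instruction_address = psw_word & ADDRESS_MASK_31
    offset = instruction_address - load_address
    if offset < 0:
        raise ValueError("PSW address is below the module's load address")
    return offset


# Values taken from the symptom dump and the module header shown above.
offset = module_offset(0x80049A2E, 0x00044000)
print(f"Offset: X'{offset:04X}'")  # Offset: X'5A2E'
```

A negative result would mean the PSW address lies outside this module, which is a hint that the abend happened in a called routine rather than in LOANPOST itself.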
Step 3: Finding the Source Line in the Compiler Listing
Sarah opens the compiler listing for LOANPOST, which was compiled with the MAP, XREF, and OFFSET options. She navigates to the OFFSET section of the listing (the Procedure Division Map) and searches for offset 5A2E.
The OFFSET listing shows:
LINE #  OFFSET  VERB
   312  005A10  3200-CALC-INTEREST
   315  005A1C  COMPUTE
   318  005A2A  ADD
   320  005A34  COMPUTE
   325  005A48  3300-APPLY-PAYMENT
The abend offset X'5A2E' falls between the entries for line 318 (offset 005A2A) and line 320 (offset 005A34). Each entry in the OFFSET listing marks where the generated code for a statement begins, and the compiler often generates several machine instructions for a single COBOL statement. Everything from 005A2A up to, but not including, 005A34 therefore belongs to line 318, so the abend occurred inside the ADD statement.
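The lookup Sarah does by eye is a "last entry at or below the offset" search. A minimal Python sketch of that rule follows; the table rows are copied from the OFFSET listing above, and the function name is hypothetical.

```python
from bisect import bisect_right

# (starting offset, source line, verb or paragraph name) from the OFFSET listing
OFFSET_TABLE = [
    (0x5A10, 312, "3200-CALC-INTEREST"),
    (0x5A1C, 315, "COMPUTE"),
    (0x5A2A, 318, "ADD"),
    (0x5A34, 320, "COMPUTE"),
    (0x5A48, 325, "3300-APPLY-PAYMENT"),
]


def statement_at(offset: int) -> tuple:
    """Return (line, verb) for the statement whose generated code contains offset."""
    starts = [row[0] for row in OFFSET_TABLE]
    index = bisect_right(starts, offset) - 1  # last entry starting at or before offset
    if index < 0:
        raise ValueError("offset precedes the first listed statement")
    return OFFSET_TABLE[index][1:]


print(statement_at(0x5A2E))  # (318, 'ADD')
```

The table must be sorted by offset for the search to work, which is how the compiler prints it.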
Sarah looks at line 318 in the source listing:
312 3200-CALC-INTEREST.
313 MOVE WS-LOAN-RECORD TO WS-WORK-RECORD
314 COMPUTE WS-DAILY-RATE =
315 WS-ANNUAL-RATE / 365
316 COMPUTE WS-ACCRUED-INTEREST =
317 WS-OUTSTANDING-BALANCE * WS-DAILY-RATE
318 ADD WS-ACCRUED-INTEREST
319 TO WS-TOTAL-INTEREST-DUE
320 COMPUTE WS-PAYMENT-TO-PRINCIPAL =
321 WS-PAYMENT-AMOUNT - WS-TOTAL-INTEREST-DUE
The failing statement is:
ADD WS-ACCRUED-INTEREST
TO WS-TOTAL-INTEREST-DUE
Step 4: Identifying the Offending Field
An S0C7 on an ADD statement means that either WS-ACCRUED-INTEREST or WS-TOTAL-INTEREST-DUE contains non-numeric data. Sarah needs to examine both fields in the dump.
She goes to the Data Division Map to find the displacement of each field:
LINE  LVL  DATA NAME                 BASE   DISPL   USAGE
  84   05  WS-OUTSTANDING-BALANCE    BLW=0  000120  COMP-3
  85   05  WS-ANNUAL-RATE            BLW=0  000127  COMP-3
  86   05  WS-DAILY-RATE             BLW=0  00012B  COMP-3
  87   05  WS-ACCRUED-INTEREST       BLW=0  000131  COMP-3
  88   05  WS-TOTAL-INTEREST-DUE     BLW=0  000137  COMP-3
  89   05  WS-PAYMENT-AMOUNT         BLW=0  00013D  COMP-3
  90   05  WS-PAYMENT-TO-PRINCIPAL   BLW=0  000143  COMP-3
From the dump, the BLW=0 base address is X'00045A00'. (The BLW cells are recorded in the TGT -- the Task Global Table; in this dump the same value also appears in GPR 2.)
WS-ACCRUED-INTEREST is at BLW=0 + X'131' = X'00045B31':
Address 00045B30: .. 00 00 01 23 45 6C
The next field starts at displacement X'137', so WS-ACCRUED-INTEREST is 6 bytes long, which is consistent with PIC S9(9)V99 COMP-3: 11 digit nibbles plus 1 sign nibble = 12 nibbles. Reading the nibbles: 0-0-0-0-0-1-2-3-4-5-6-C. Every digit nibble is in the range 0-9 and the final nibble, C, is a valid positive sign, so the field holds +1234.56 (assuming two decimal places). This looks valid.
WS-TOTAL-INTEREST-DUE is at BLW=0 + X'137' = X'00045B37':
Address 00045B30: .. .. .. .. .. .. .. 40 40 40 40 40 40
Reading the bytes at offset 137: 40 40 40 40 40 40. In EBCDIC, X'40' is a space character. This is the problem. WS-TOTAL-INTEREST-DUE contains spaces, not a valid packed decimal number: the final nibble is 0, and the sign nibble of a packed decimal field must be in the range A-F.
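The two manual checks in this step, locating a field from its base and displacement and then testing whether its bytes form a valid packed decimal number, can be sketched in Python. This is an illustrative aid rather than a replacement for reading the dump: the function names are hypothetical, the scale of two decimal places is an assumption, and the first sample value is simply a six-byte field ending in the digits 123456 with sign C.

```python
from decimal import Decimal

VALID_SIGNS = {0xA, 0xB, 0xC, 0xD, 0xE, 0xF}
NEGATIVE_SIGNS = {0xB, 0xD}


def field_address(base: int, displacement: int) -> int:
    """Address of a field = base locator address + displacement from the data map."""
    return base + displacement


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode packed decimal bytes, raising ValueError if they are not valid."""
    if not raw:
        raise ValueError("empty field")
    nibbles = []
    for byte in raw:
        nibbles.extend((byte >> 4, byte & 0x0F))
    *digits, sign = nibbles  # the last nibble is the sign, the rest are digits
    if any(digit > 9 for digit in digits):
        raise ValueError("invalid digit nibble")
    if sign not in VALID_SIGNS:
        raise ValueError(f"invalid sign nibble X'{sign:X}'")
    magnitude = int("".join(str(digit) for digit in digits))
    value = -magnitude if sign in NEGATIVE_SIGNS else magnitude
    return Decimal(value).scaleb(-scale)


BLW0 = 0x00045A00  # base address from the dump
print(f"{field_address(BLW0, 0x131):08X}")  # 00045B31
print(f"{field_address(BLW0, 0x137):08X}")  # 00045B37

print(unpack_comp3(bytes.fromhex("00000123456C"), scale=2))  # 1234.56
try:
    unpack_comp3(bytes.fromhex("404040404040"), scale=2)  # six EBCDIC spaces
except ValueError as error:
    print(error)  # invalid sign nibble X'0'
```

Note that the digit nibbles of X'40' (4 and 0) are individually acceptable; it is the trailing sign position that gives the spaces away.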
Step 5: Tracing Back to the Root Cause
Now Sarah knows that WS-TOTAL-INTEREST-DUE contains spaces when it should contain a packed decimal value. The question is: how did spaces get into a COMP-3 field?
She checks the cross-reference listing for WS-TOTAL-INTEREST-DUE:
WS-TOTAL-INTEREST-DUE 88 M190 M252 318 386 M410
The M-prefixed entries show where the field is modified: lines 190, 252, and 410. Sarah examines each one.
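When a field has dozens of references, it can help to split the cross-reference entry mechanically. The sketch below assumes the simplified one-line form shown above (name, defining line, then references, with an M prefix marking a modification); real listings wrap long entries across lines, which this hypothetical helper does not handle.

```python
def parse_xref(entry: str) -> dict:
    """Split a one-line cross-reference entry into defining line, modifiers, and readers."""
    name, defined, *references = entry.split()
    return {
        "name": name,
        "defined": int(defined),
        "modified": [int(ref[1:]) for ref in references if ref.startswith("M")],
        "read_only": [int(ref) for ref in references if not ref.startswith("M")],
    }


xref = parse_xref("WS-TOTAL-INTEREST-DUE 88 M190 M252 318 386 M410")
print(xref["modified"])  # [190, 252, 410]
```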
Line 190 is in the initialization paragraph:
188 2000-INIT-LOAN-WORK.
189 INITIALIZE WS-LOAN-WORK-AREA
190 MOVE ZEROS TO WS-TOTAL-INTEREST-DUE
This looks correct -- the field is initialized to zeros.
Line 252 is in the master read paragraph:
248 2500-READ-MASTER.
249 MOVE WS-LOAN-NUM TO LR-LOAN-KEY
250 READ LOAN-MASTER INTO WS-LOAN-RECORD
251 IF WS-MASTER-STATUS = '00'
252 MOVE LR-INTEREST-DUE TO WS-TOTAL-INTEREST-DUE
253 END-IF
This moves the interest-due field from the master record into the working field. If the master record field is valid, this should be fine.
Line 410 is in the payment processing paragraph:
408 4000-PROCESS-PAYMENT.
409 MOVE WS-LOAN-RECORD TO WS-WORK-RECORD
410 MOVE WS-MASTER-INT-DUE TO WS-TOTAL-INTEREST-DUE
This is another move from the master record. Sarah notices something: WS-MASTER-INT-DUE is a different field from LR-INTEREST-DUE. She checks the data map:
71 05 WS-MASTER-INT-DUE BLW=0 0000A0 DISPLAY
The USAGE is DISPLAY, not COMP-3. WS-MASTER-INT-DUE is stored one character per byte (DISPLAY), while WS-TOTAL-INTEREST-DUE is packed decimal (COMP-3). When you MOVE a DISPLAY field to a COMP-3 field, COBOL converts the data, but the conversion assumes the DISPLAY field contains valid numeric characters. If the DISPLAY field contains spaces or other non-numeric characters, the MOVE itself does not fail; it quietly leaves an invalid packed decimal value in the COMP-3 field.
But wait -- why would WS-MASTER-INT-DUE contain spaces? Sarah traces the data flow and finds the answer. WS-WORK-RECORD is populated by a group MOVE at line 409:
MOVE WS-LOAN-RECORD TO WS-WORK-RECORD
WS-LOAN-RECORD is the record read from the master file. WS-WORK-RECORD is a working copy. But the layouts are different lengths:
60 01 WS-LOAN-RECORD LENGTH: 200
68 01 WS-WORK-RECORD LENGTH: 250
A group MOVE follows the rules for an alphanumeric MOVE: moving a 200-byte record into a 250-byte area fills the first 200 bytes with data and pads the remaining 50 bytes with spaces. If WS-MASTER-INT-DUE fell in that padded portion of WS-WORK-RECORD (beyond byte 200), it would contain spaces after every group MOVE.
Sarah checks the displacement of WS-MASTER-INT-DUE within WS-WORK-RECORD:
71 05 WS-MASTER-INT-DUE BLW=0 0000A0 DISPLAY
X'A0' = 160 decimal. This is within the 200-byte range, so it should be populated by the group move. But Sarah then realizes the real issue: the field positions in WS-WORK-RECORD do not match the field positions in WS-LOAN-RECORD. The group MOVE copies bytes, not field values. If the interest-due field is at a different offset in WS-LOAN-RECORD than in WS-WORK-RECORD, the wrong bytes end up in WS-MASTER-INT-DUE.
She compares the layouts and finds the mismatch. WS-LOAN-RECORD was updated six months ago to add a new 15-byte field (LR-EMAIL-ADDR) at offset 155, pushing all subsequent fields forward by 15 bytes. But WS-WORK-RECORD was not updated to match. The interest-due field that was at offset 160 in the old layout is now at offset 175 in WS-LOAN-RECORD, but WS-WORK-RECORD still expects it at offset 160.
The new email address field occupies offsets 155 through 169, so the bytes at offset 160 in the new master record layout fall within the last 10 bytes of the email address field -- which may contain spaces, letters, or special characters. When those bytes are moved to WS-MASTER-INT-DUE and then to WS-TOTAL-INTEREST-DUE, the result is non-numeric data in a packed decimal field.
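The effect of a byte-for-byte copy between mismatched layouts is easy to reproduce outside COBOL. The Python sketch below uses a deliberately tiny, hypothetical pair of layouts (the field names, offsets, lengths, and sample values are invented for illustration and are not LOANPOST's), and it uses ASCII text rather than EBCDIC to keep the output readable.

```python
# Hypothetical toy layouts: field name -> (offset, length)
OLD_LAYOUT = {"LOAN-KEY": (0, 10), "INT-DUE": (10, 6)}
NEW_LAYOUT = {"LOAN-KEY": (0, 10), "EMAIL": (10, 15), "INT-DUE": (25, 6)}
NEW_LENGTH = 31


def build_record(layout: dict, values: dict, length: int) -> bytes:
    """Lay field values out at the offsets the layout dictates, space-filled."""
    record = bytearray(b" " * length)
    for name, text in values.items():
        offset, size = layout[name]
        record[offset:offset + size] = text.encode("ascii").ljust(size)[:size]
    return bytes(record)


def read_field(layout: dict, record: bytes, name: str) -> str:
    """Interpret the bytes at the offset this layout assigns to the field."""
    offset, size = layout[name]
    return record[offset:offset + size].decode("ascii")


master = build_record(
    NEW_LAYOUT,
    {"LOAN-KEY": "0000012345", "EMAIL": "pat@example.com", "INT-DUE": "001250"},
    NEW_LENGTH,
)
work = bytes(master)  # the group MOVE: bytes are copied, field names play no part

print(read_field(NEW_LAYOUT, work, "INT-DUE"))  # 001250
print(read_field(OLD_LAYOUT, work, "INT-DUE"))  # pat@ex
```

The copy itself is flawless; the damage comes entirely from reading the copied bytes through the stale layout.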
Step 6: Understanding Why It Worked for 42,617 Records
Sarah wonders why the program processed 42,617 records successfully before abending. The answer lies in the data. What lands in WS-MASTER-INT-DUE is whatever the email address field holds at offsets 160-165, and that differs from account to account: letters and special characters when the address fills all 15 bytes, spaces when a shorter address is padded on the right.
For some byte combinations the conversion happens to yield a value that the packed decimal instructions accept, so the MOVE and the arithmetic succeed by coincidence, producing an incorrect (but valid) packed decimal value. The first 42,617 accounts all fell into that category: they did not abend, but that does not mean their interest-due amounts were right.
Record 42,618 is the first account where the email address bytes at offset 160 produce data that is unambiguously non-numeric in the packed decimal format, triggering the S0C7.
Step 7: The Fix
The immediate fix is to correct the field mapping. Sarah has two options:
Option A (Immediate -- one-line code fix): Modify the program to take the value directly from the master record (LR-INTEREST-DUE, the same field line 252 already uses) instead of going through the misaligned WS-WORK-RECORD. This requires changing line 410:
*--- OLD (BUGGY):
* MOVE WS-MASTER-INT-DUE TO WS-TOTAL-INTEREST-DUE
*--- NEW (FIXED):
MOVE LR-INTEREST-DUE TO WS-TOTAL-INTEREST-DUE
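Option B (Proper -- structural fix): Update WS-WORK-RECORD to match the current WS-LOAN-RECORD layout. Both layouts should use the same COPY member to prevent future drift. Because the two records then contain identically named fields, references must be qualified (for example, LR-INTEREST-DUE OF WS-WORK-RECORD), or one copy must rename its fields with COPY ... REPLACING: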
01 WS-LOAN-RECORD.
COPY LOANREC.
01 WS-WORK-RECORD.
COPY LOANREC.
Sarah applies Option A as an emergency fix (it requires changing only one line), compiles, tests with a sample of the failing data, and promotes the fix to production. She restarts the job at 4:45 AM, and it completes successfully at 5:23 AM, within the batch window. She files an incident report recommending Option B as a permanent fix for the next maintenance window.
Step 8: Prevention
In the incident report, Sarah recommends three preventive measures:
1. Use COPY members for all shared layouts. If WS-LOAN-RECORD and WS-WORK-RECORD both used COPY LOANREC, the compiler would have kept them synchronized. The bug occurred because the layouts were maintained independently -- one was updated and the other was not.
2. Avoid group MOVEs between structures. Group MOVEs copy bytes, not fields. They are inherently fragile because they depend on both structures having identical layouts. Instead, use MOVE CORRESPONDING (which moves matching field names) or individual field-level MOVEs (which are explicit and self-documenting).
3. Add numeric validation before arithmetic. Every field used in arithmetic whose contents cannot be trusted should be validated with an IF ... IS NUMERIC check. ON SIZE ERROR is a useful complement, but it traps only overflow and division by zero; it does not intercept a data exception. If WS-TOTAL-INTEREST-DUE had been validated before the ADD statement, the program would have rejected the bad record instead of abending:
    IF WS-TOTAL-INTEREST-DUE IS NUMERIC
        ADD WS-ACCRUED-INTEREST
            TO WS-TOTAL-INTEREST-DUE
            ON SIZE ERROR
                PERFORM 9100-ARITHMETIC-ERROR
        END-ADD
    ELSE
        DISPLAY 'NON-NUMERIC INTEREST DUE'
        DISPLAY 'LOAN: ' WS-LOAN-NUM
*--- DISPLAY SHOWS THE CONVERTED FIELD, NOT A HEX DUMP
        DISPLAY 'FIELD VALUE: ' WS-TOTAL-INTEREST-DUE
        PERFORM 9200-DATA-ERROR
    END-IF
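Until every shared layout comes from a single COPY member, the first recommendation can be backed by an automated comparison run at build time. The Python sketch below shows one possible shape for such a check. It assumes the layouts have already been extracted into (name, offset, length) tuples, for example from the compiler's data map, and both the function name and the sample layouts are hypothetical.

```python
def layout_drift(reference: list, candidate: list) -> list:
    """List (name, reference_offset, candidate_offset) for every field that is
    missing from the candidate layout or sits at a different offset in it."""
    candidate_offsets = {name: offset for name, offset, _length in candidate}
    drift = []
    for name, offset, _length in reference:
        other = candidate_offsets.get(name)  # None when the field is absent
        if other != offset:
            drift.append((name, offset, other))
    return drift


# Hypothetical toy layouts: the reference gained a field the copy never received.
reference = [("LOAN-KEY", 0, 10), ("EMAIL-ADDR", 10, 15), ("INTEREST-DUE", 25, 6)]
candidate = [("LOAN-KEY", 0, 10), ("INTEREST-DUE", 10, 6)]

for name, expected, actual in layout_drift(reference, candidate):
    print(f"{name}: expected offset {expected}, found {actual}")
```

An empty result means the two layouts agree on every field in the reference; any other result is a reason to fail the build before the mismatch reaches production.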
Lessons Learned
This case study illustrates several fundamental debugging principles:
Follow the offset. The PSW address minus the EPA gives the offset. The offset maps to a source line through the compiler listing. This three-step process is the foundation of all z/OS dump analysis.
Understand data representations. Knowing that X'40' is an EBCDIC space, that COMP-3 fields store two digits per byte with a sign nibble, and that group MOVEs are byte copies rather than field-level conversions -- this knowledge is essential for reading dumps.
Trace the data flow backward. Start at the failing statement, identify which field has bad data, then trace backward through every place that field is modified. The cross-reference listing is your roadmap.
Structural changes have distant effects. Adding a field to a record layout is a simple change, but its impact radiates to every program and every data structure that depends on that layout. COPY members are the mechanism that keeps these dependencies synchronized.
Coincidental success is dangerous. The program worked correctly for 42,617 records not because the code was right, but because the data happened to be tolerable. The bug existed from the moment the layout was changed six months ago -- it just took the right combination of data to trigger it.