Case Study 1: Diagnosing a Production Abend

Background

It is 3:12 AM on a Tuesday morning. Sarah Chen, the on-call COBOL programmer at Pacific Coast Savings Bank, receives a page on her phone: "CRITICAL: Job NIGHTLY07 step LOANPOST abended S0C7 at 03:07. Batch window deadline: 06:00 AM." She pulls up her laptop, connects to the mainframe through the VPN, and begins diagnosing the problem.

LOANPOST is a COBOL program that runs every night as part of the bank's end-of-day batch cycle. It reads a sequential file of loan payment transactions (approximately 85,000 records per night), matches each payment to the corresponding loan account in a VSAM KSDS master file, calculates interest accrual, applies the payment, and updates the master record. The program has been in production for seven years and runs reliably five nights a week. Tonight, it processed 42,617 records before abending.

Sarah knows from experience that S0C7 is a data exception -- the program attempted arithmetic on a field containing non-numeric data. The challenge is finding which field, which record, and why. She has three hours before the batch window closes and the online system must be brought up for the business day.


Step 1: Gathering Initial Information

Sarah's first action is to open SDSF (System Display and Search Facility) on TSO and locate the job output for NIGHTLY07. She enters the following commands:

SDSF
ST
PREFIX NIGHTLY07

The job output shows the following messages at the end of STEP LOANPOST:

IEA995I SYMPTOM DUMP OUTPUT
  SYSTEM COMPLETION CODE=0C7  REASON CODE=00000004
  PSW AT TIME OF ERROR  078D1000  80049A2E
  ILC 06  INTC 07
  ACTIVE LOAD MODULE      LOANPOST
  ACTIVE CSECT             LOANPOST
  DATA AT PSW  00049A28 - D207D030  D110F020  90ECD00C
  GPR 0-3     00000000  00049200  00045A00  00045B80
  GPR 4-7     00045C00  00000001  00045DE0  0004A310
  GPR 8-11    00045E60  00045F40  00046000  000460C0
  GPR 12-15   00049000  00045800  80049A2E  00000000

The critical pieces of information are:

  • System Completion Code = 0C7: Confirms this is a data exception (S0C7).
  • PSW address = 80049A2E: This is the instruction address held in the PSW when the abend occurred. The leading "8" is the addressing-mode bit (31-bit mode); the address itself is X'00049A2E'.
  • ILC 06: The Instruction Length Code is 6 bytes. That is consistent with an SS-format instruction, the format used by the packed decimal operations (AP, SP, MP, DP, CP, ZAP) that typically raise a data exception.
  • ACTIVE LOAD MODULE = LOANPOST: The program that abended.
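
The PSW reading in the list above can be double-checked with a few lines of Python. This is a minimal sketch for working away from the host, not part of any IBM tool; the function name split_psw_address is hypothetical, and the input is the second PSW word from the symptom dump.

```python
def split_psw_address(psw_word: int) -> tuple:
    """Split the second PSW word into (31-bit mode flag, instruction address)."""
    amode31 = bool(psw_word & 0x80000000)  # high-order bit set => 31-bit mode
    address = psw_word & 0x7FFFFFFF        # remaining 31 bits are the address
    return amode31, address


amode31, address = split_psw_address(0x80049A2E)
print(amode31, f"{address:08X}")  # prints: True 00049A2E
```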

Step 2: Calculating the Offset

To find the source line that failed, Sarah needs to calculate the offset of the failing instruction within the LOANPOST module. She looks at the dump header for the Entry Point Address (EPA):

LOAD MODULE    LOANPOST
  ENTRY POINT  00044000
  LOAD ADDR    00044000

The offset calculation:

PSW address:    0049A2E
EPA:           -0044000
               --------
Offset:         0005A2E

The failing instruction is at offset X'5A2E' within the LOANPOST load module.
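
The same subtraction can be sketched in Python to guard against hex arithmetic slips at 3 AM. The helper name module_offset is an assumption made for illustration; the two addresses are the ones shown in the dump above.

```python
def module_offset(psw_address: int, load_address: int) -> int:
    """Return the offset of an address within a load module."""
    offset = psw_address - load_address
    if offset < 0:
        # The address cannot belong to this module if it lies below its start.
        raise ValueError("PSW address is below the module load address")
    return offset


print(f"{module_offset(0x0049A2E, 0x0044000):X}")  # prints: 5A2E
```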


Step 3: Finding the Source Line in the Compiler Listing

Sarah opens the compiler listing for LOANPOST, which was compiled with the MAP, XREF, and OFFSET options. She navigates to the OFFSET section of the listing (the Procedure Division Map) and searches for offset 5A2E.

The OFFSET listing shows:

LINE #   OFFSET
  312    005A10     3200-CALC-INTEREST
  315    005A1C         COMPUTE
  318    005A2A         ADD
  320    005A34         COMPUTE
  325    005A48     3300-APPLY-PAYMENT

The abend offset X'5A2E' falls between the entries for line 318 (offset 005A2A) and line 320 (offset 005A34). Each entry marks where the generated code for a statement begins, so every offset from 005A2A up to, but not including, 005A34 belongs to line 318 -- the ADD statement. The compiler usually generates several machine instructions for a single COBOL statement, which is why the abend offset lies inside the ADD's code rather than exactly at its starting offset.
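
The lookup rule (find the last listing entry whose starting offset is less than or equal to the abend offset) can be sketched as a small Python search. The table below is copied from the OFFSET listing excerpt above; the names OFFSET_TABLE and statement_at are hypothetical.

```python
from bisect import bisect_right

# (starting offset, source line, verb or paragraph) from the listing excerpt
OFFSET_TABLE = [
    (0x5A10, 312, "3200-CALC-INTEREST"),
    (0x5A1C, 315, "COMPUTE"),
    (0x5A2A, 318, "ADD"),
    (0x5A34, 320, "COMPUTE"),
    (0x5A48, 325, "3300-APPLY-PAYMENT"),
]


def statement_at(offset: int) -> tuple:
    """Return the listing entry whose generated code contains the offset."""
    starts = [entry[0] for entry in OFFSET_TABLE]
    index = bisect_right(starts, offset) - 1  # last entry starting at or before
    if index < 0:
        raise ValueError("offset precedes the first listed statement")
    return OFFSET_TABLE[index]


print(statement_at(0x5A2E)[1:])  # prints: (318, 'ADD')
```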

Sarah looks at line 318 in the source listing:

  312    3200-CALC-INTEREST.
  313        MOVE WS-LOAN-RECORD TO WS-WORK-RECORD
  314        COMPUTE WS-DAILY-RATE =
  315            WS-ANNUAL-RATE / 365
  316        COMPUTE WS-ACCRUED-INTEREST =
  317            WS-OUTSTANDING-BALANCE * WS-DAILY-RATE
  318        ADD WS-ACCRUED-INTEREST
  319            TO WS-TOTAL-INTEREST-DUE
  320        COMPUTE WS-PAYMENT-TO-PRINCIPAL =
  321            WS-PAYMENT-AMOUNT - WS-TOTAL-INTEREST-DUE

The failing statement is:

           ADD WS-ACCRUED-INTEREST
               TO WS-TOTAL-INTEREST-DUE

Step 4: Identifying the Offending Field

An S0C7 on an ADD statement means that either WS-ACCRUED-INTEREST or WS-TOTAL-INTEREST-DUE contains non-numeric data. Sarah needs to examine both fields in the dump.

She goes to the Data Division Map to find the displacement of each field:

LINE  LVL  DATA NAME                   BASE   DISPL   USAGE
 84    05  WS-OUTSTANDING-BALANCE      BLW=0  000120  COMP-3
 85    05  WS-ANNUAL-RATE              BLW=0  000127  COMP-3
 86    05  WS-DAILY-RATE               BLW=0  00012B  COMP-3
 87    05  WS-ACCRUED-INTEREST         BLW=0  000131  COMP-3
 88    05  WS-TOTAL-INTEREST-DUE       BLW=0  000137  COMP-3
 89    05  WS-PAYMENT-AMOUNT           BLW=0  00013D  COMP-3
 90    05  WS-PAYMENT-TO-PRINCIPAL     BLW=0  000143  COMP-3

From the dump, the BLW=0 base address is X'00045A00'. (The BLW cells are kept in the TGT -- Task Global Table; in this dump the same value also appears in GPR 2. GPR 13, by contrast, normally addresses the register save area.)

WS-ACCRUED-INTEREST is at BLW=0 + X'131' = X'00045B31':

Address   00045B30: .. 00 00 01 23 45 6C

Reading the six bytes at offset 131: 00 00 01 23 45 6C. Assuming the field is PIC S9(9)V99 COMP-3, it holds 11 digits plus a sign, which is 12 nibbles or 6 bytes; that matches the 6-byte gap to the next field in the map. The nibbles read 0-0-0-0-0-1-2-3-4-5-6-C, which is +1234.56. Every digit nibble is in the range 0-9 and the final C nibble is a valid positive sign, so this field is valid.

WS-TOTAL-INTEREST-DUE is at BLW=0 + X'137' = X'00045B37':

Address   00045B30: .. .. .. .. .. .. .. 40 40 40 40 40 40

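Reading the bytes at offset 137: 40 40 40 40 40 40. In EBCDIC, X'40' is a space character. This is the problem. WS-TOTAL-INTEREST-DUE contains spaces, not a valid packed decimal number: the low-order nibble of the last byte must be a sign code (X'A' through X'F'), and here it is X'0'.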
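
The validity rule Sarah applies by eye (every nibble except the last must be a digit 0-9, and the last nibble must be a sign code A-F) can be sketched in Python. The function name check_packed is hypothetical, and the three-byte value 12 34 5C is an illustrative example, not data from the dump.

```python
def check_packed(raw: bytes) -> tuple:
    """Return (is_valid, reason) for the bytes of a COMP-3 field."""
    if not raw:
        return False, "empty field"
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)    # high-order nibble
        nibbles.append(byte & 0x0F)  # low-order nibble
    *digits, sign = nibbles          # the last nibble is the sign
    for position, digit in enumerate(digits):
        if digit > 9:
            return False, f"digit nibble {position} is X'{digit:X}'"
    if sign < 0xA:
        return False, f"sign nibble is X'{sign:X}'"
    return True, "valid"


print(check_packed(bytes.fromhex("12345C")))        # illustrative +12345
print(check_packed(bytes.fromhex("404040404040")))  # six EBCDIC spaces
```

The second call reports the sign nibble as the problem: the digit nibbles of X'40' (4 and 0) are acceptable on their own, but a field made of spaces ends in a 0 where a sign code is required.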


Step 5: Tracing Back to the Root Cause

Now Sarah knows that WS-TOTAL-INTEREST-DUE contains spaces when it should contain a packed decimal value. The question is: how did spaces get into a COMP-3 field?

She checks the cross-reference listing for WS-TOTAL-INTEREST-DUE:

WS-TOTAL-INTEREST-DUE    88    M190  M252  M318  386  M410

The M-prefixed entries show where the field is modified: lines 190, 252, 318, and 410. Line 318 is the failing ADD itself, so Sarah examines the other three.

Line 190 is in the initialization paragraph:

  188    2000-INIT-LOAN-WORK.
  189        INITIALIZE WS-LOAN-WORK-AREA
  190        MOVE ZEROS TO WS-TOTAL-INTEREST-DUE

This looks correct -- the field is initialized to zeros.

Line 252 is in the master read paragraph:

  248    2500-READ-MASTER.
  249        MOVE WS-LOAN-NUM TO LR-LOAN-KEY
  250        READ LOAN-MASTER INTO WS-LOAN-RECORD
  251        IF WS-MASTER-STATUS = '00'
  252            MOVE LR-INTEREST-DUE TO WS-TOTAL-INTEREST-DUE
  253        END-IF

This moves the interest-due field from the master record into the working field. If the master record field is valid, this should be fine.

Line 410 is in the payment processing paragraph:

  408    4000-PROCESS-PAYMENT.
  409        MOVE WS-LOAN-RECORD TO WS-WORK-RECORD
  410        MOVE WS-MASTER-INT-DUE TO WS-TOTAL-INTEREST-DUE

This is another move from the master record. Sarah notices something: WS-MASTER-INT-DUE is a different field from LR-INTEREST-DUE. She checks the data map:

 71    05  WS-MASTER-INT-DUE       BLW=0  0000A0  DISPLAY

The USAGE is DISPLAY, not COMP-3. WS-MASTER-INT-DUE is stored in character form (DISPLAY), while WS-TOTAL-INTEREST-DUE is packed decimal (COMP-3). When you MOVE a DISPLAY field to a COMP-3 field, COBOL converts the data, but the conversion does not validate it: the result is meaningful only if the DISPLAY field contains valid numeric characters. If the DISPLAY field contains spaces or other non-numeric bytes, the COMP-3 field can end up holding an invalid value.

But wait -- why would WS-MASTER-INT-DUE contain spaces? Sarah traces the data flow and finds the answer. WS-WORK-RECORD is populated by a group MOVE at line 409:

           MOVE WS-LOAN-RECORD TO WS-WORK-RECORD

WS-LOAN-RECORD is the record read from the master file. WS-WORK-RECORD is a working copy. But the layouts are different lengths:

 60    01  WS-LOAN-RECORD                  LENGTH: 200
 68    01  WS-WORK-RECORD                  LENGTH: 250

The group MOVE of a 200-byte record into a 250-byte area fills the first 200 bytes with data and pads the remaining 50 bytes with spaces, because a group MOVE follows the rules for an alphanumeric MOVE. If WS-MASTER-INT-DUE fell in that padded portion of WS-WORK-RECORD (beyond byte 200), it would contain spaces.

Sarah checks the displacement of WS-MASTER-INT-DUE within WS-WORK-RECORD:

 71    05  WS-MASTER-INT-DUE       BLW=0  0000A0  DISPLAY

X'A0' = 160 decimal. This is within the 200-byte range, so it should be populated by the group move. But Sarah then realizes the real issue: the field positions in WS-WORK-RECORD do not match the field positions in WS-LOAN-RECORD. The group MOVE copies bytes, not field values. If the interest-due field is at a different offset in WS-LOAN-RECORD than in WS-WORK-RECORD, the wrong bytes end up in WS-MASTER-INT-DUE.

She compares the layouts and finds the mismatch. WS-LOAN-RECORD was updated six months ago to add a new 15-byte field (LR-EMAIL-ADDR) at offset 155, pushing all subsequent fields forward by 15 bytes. But WS-WORK-RECORD was not updated to match. The interest-due field that was at offset 160 in the old layout is now at offset 175 in WS-LOAN-RECORD, but WS-WORK-RECORD still expects it at offset 160.

The email address field occupies offsets 155-169, so the bytes starting at offset 160 in the new master record layout fall within the last 10 bytes of that field -- which may contain spaces, letters, or special characters. When those bytes are moved to WS-MASTER-INT-DUE and then to WS-TOTAL-INTEREST-DUE, the result is non-numeric data in a packed decimal field.
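
The effect of a byte-level copy between drifted layouts can be reproduced on a toy scale. The sketch below is an illustration under stated assumptions: the field names, offsets, and lengths are invented and far smaller than the real records, group_move is a hypothetical helper that mimics a COBOL group MOVE (left-to-right byte copy, space padding on the right), and Python's cp037 codec stands in for the site's EBCDIC code page.

```python
def group_move(sender: bytes, receiver_length: int) -> bytes:
    """Mimic a group MOVE: copy bytes, truncate or pad with X'40'."""
    return sender[:receiver_length].ljust(receiver_length, b"\x40")


def field(record: bytes, layout: dict, name: str) -> bytes:
    """Slice one field out of a record using an (offset, length) layout."""
    offset, length = layout[name]
    return record[offset:offset + length]


# Invented toy layouts: the sender gained EMAIL, the work copy never did.
SENDER_LAYOUT = {"LOAN-NUM": (0, 4), "EMAIL": (4, 5), "INT-DUE": (9, 3)}
WORK_LAYOUT = {"LOAN-NUM": (0, 4), "INT-DUE": (4, 3)}

sender = (
    "1234".encode("cp037")      # loan number, zoned characters
    + "A@B  ".encode("cp037")   # email, space-padded
    + bytes.fromhex("12345C")   # interest due, valid packed decimal
)
work = group_move(sender, 16)   # receiver is longer than the sender

print(field(sender, SENDER_LAYOUT, "INT-DUE").hex().upper())
print(field(work, WORK_LAYOUT, "INT-DUE").hex().upper())
```

The first line printed is the genuine packed value; the second is the first three bytes of the email field, which is what the stale layout believes to be the interest due.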


Step 6: Understanding Why It Worked for 42,617 Records

Sarah wonders why the program processed 42,617 records successfully before abending. The answer lies in the data. For every account, the bytes that reach WS-MASTER-INT-DUE are really the tail of the email address field, and what happens next depends on what those bytes contain.

For many accounts, the bytes happen to convert into something the hardware accepts as packed decimal. The MOVE and the ADD then run without error, but they produce an incorrect (though valid) interest figure. The first 42,617 accounts evidently fell into this category: the program did not fail on them, but that does not mean it calculated the right values.

Record 42,618 is the first account where the bytes at offset 160-165 yield a value that is not valid packed decimal, triggering the S0C7. The X'40' bytes in the dump are consistent with an email address short enough to leave space padding at those offsets.


Step 7: The Fix

The immediate fix is to correct the field mapping. Sarah has two options:

Option A (Immediate -- code fix): Modify the program to take the value directly from the master record field (LR-INTEREST-DUE, as line 252 already does) instead of going through the misaligned WS-WORK-RECORD. This requires changing line 410:

      *--- OLD (BUGGY):
      *    MOVE WS-MASTER-INT-DUE TO WS-TOTAL-INTEREST-DUE
      *--- NEW (FIXED):
           MOVE LR-INTEREST-DUE TO WS-TOTAL-INTEREST-DUE

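Option B (Proper -- structural fix): Update WS-WORK-RECORD to match the current WS-LOAN-RECORD layout. Both layouts should come from the same COPY member to prevent future drift. Because the two groups then contain identically named fields, references must be qualified (for example, LR-INTEREST-DUE OF WS-WORK-RECORD), or the second COPY can use the REPLACING phrase to give its fields a distinct prefix: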

       01  WS-LOAN-RECORD.
           COPY LOANREC.
       01  WS-WORK-RECORD.
           COPY LOANREC.

Sarah applies Option A as an emergency fix (it requires changing only one line), compiles, tests with a sample of the failing data, and promotes the fix to production. She restarts the job at 4:45 AM, and it completes successfully at 5:23 AM, within the batch window. She files an incident report recommending Option B as a permanent fix for the next maintenance window.


Step 8: Prevention

In the incident report, Sarah recommends three preventive measures:

1. Use COPY members for all shared layouts. If WS-LOAN-RECORD and WS-WORK-RECORD both used COPY LOANREC, the compiler would have kept them synchronized. The bug occurred because the layouts were maintained independently -- one was updated and the other was not.

2. Avoid group MOVEs between structures. Group MOVEs copy bytes, not fields. They are inherently fragile because they depend on both structures having identical layouts. Instead, use MOVE CORRESPONDING (which moves matching field names) or individual field-level MOVEs (which are explicit and self-documenting).

3. Add numeric validation before arithmetic. Every field used in arithmetic that is populated from external data should be validated with an IF ... IS NUMERIC check. ON SIZE ERROR is a useful complement, but it only traps conditions such as overflow; it does not intercept a data exception caused by invalid packed data. If WS-TOTAL-INTEREST-DUE had been validated before the ADD statement, the program would have rejected the bad record instead of abending:

           IF WS-TOTAL-INTEREST-DUE IS NUMERIC
               ADD WS-ACCRUED-INTEREST
                   TO WS-TOTAL-INTEREST-DUE
                   ON SIZE ERROR
                       PERFORM 9100-ARITHMETIC-ERROR
               END-ADD
           ELSE
               DISPLAY 'NON-NUMERIC INTEREST DUE'
               DISPLAY 'LOAN: ' WS-LOAN-NUM
               DISPLAY 'RAW VALUE: ' WS-TOTAL-INTEREST-DUE
               PERFORM 9200-DATA-ERROR
           END-IF

Lessons Learned

This case study illustrates several fundamental debugging principles:

Follow the offset. Take the PSW address, subtract the EPA to get the offset, and map the offset to a source line through the compiler listing. This three-step process is the starting point for most abend analysis of compiled programs on z/OS.

Understand data representations. Knowing that X'40' is an EBCDIC space, that COMP-3 fields store two digits per byte with a sign nibble, and that group MOVEs are byte copies rather than field-level conversions -- this knowledge is essential for reading dumps.

Trace the data flow backward. Start at the failing statement, identify which field has bad data, then trace backward through every place that field is modified. The cross-reference listing is your roadmap.

Structural changes have distant effects. Adding a field to a record layout is a simple change, but its impact radiates to every program and every data structure that depends on that layout. COPY members are the mechanism that keeps these dependencies synchronized.

Coincidental success is dangerous. The program worked correctly for 42,617 records not because the code was right, but because the data happened to be tolerable. The bug existed from the moment the layout was changed six months ago -- it just took the right combination of data to trigger it.