Case Study 2: MedClaim Claim Line Item Processing and the Infinite Loop Incident

The Incident

At 2:47 AM on a Thursday in March 2023, the MedClaim on-call operations analyst received an automated alert: the nightly claims adjudication batch job, CLM-ADJUD, had exceeded its maximum CPU time allocation. The job had been running for 4 hours and 12 minutes — more than three times its normal 80-minute duration.

James Okafor, MedClaim's team lead, was paged at 3:15 AM. By the time he connected remotely, the operations team had already cancelled the job. The damage: 187,000 claims queued for adjudication were unprocessed, and the downstream payment job could not run. Provider payments for 43,000 claims would be delayed by at least one business day.

Root Cause

The investigation traced the problem to a single claim: claim number MC-2023-0847291, a claim with 247 line items submitted by one provider. The claim line processing loop was:

       4000-PROCESS-CLAIM-LINES.
           MOVE 1 TO WS-LINE-IDX
           PERFORM 4100-PROCESS-ONE-LINE
               UNTIL WS-LINE-IDX > WS-LINE-COUNT
           .

       4100-PROCESS-ONE-LINE.
           PERFORM 4110-PRICE-LINE
           IF WS-PRICING-OK = 'Y'
               PERFORM 4120-APPLY-BENEFIT
               ADD 1 TO WS-LINE-IDX
           ELSE
               PERFORM 4130-REPRICE-LINE
           END-IF
           .

       4130-REPRICE-LINE.
           ADD 1 TO WS-REPRICE-ATTEMPT
           IF WS-REPRICE-ATTEMPT <= 3
               PERFORM 4110-PRICE-LINE
           ELSE
               SET LINE-DENIED TO TRUE
               ADD 1 TO WS-LINE-IDX
           END-IF
           .

The bug was subtle. 4130-REPRICE-LINE calls 4110-PRICE-LINE to retry pricing, but it never checks whether the retry succeeded: it returns without testing WS-PRICING-OK and without advancing WS-LINE-IDX. By the time control returns to 4100-PROCESS-ONE-LINE, the IF on WS-PRICING-OK has already taken its ELSE branch, so the result of the retry is never acted on. The line is still handled as if the 'N' from the first failed attempt were final, and the outer loop runs 4100-PROCESS-ONE-LINE again for the same line.

Furthermore, WS-REPRICE-ATTEMPT was never reset between line items. Once the first line item that required repricing had exhausted its 3 attempts, WS-REPRICE-ATTEMPT stood at 4. For every subsequent line item that needed repricing, the counter was already greater than 3, so the line was denied immediately, without a single retry, and WS-LINE-IDX was advanced. This masked the bug for most claims.

But claim MC-2023-0847291 was different. Its first line item had a service code that triggered a pricing error. The code entered the reprice loop, attempted 3 reprices (all failing), and set the line to denied. But because WS-PRICING-OK was never updated after the repricing, the outer loop's IF WS-PRICING-OK = 'Y' was still 'N', and WS-LINE-IDX was not incremented — the increment in 4130-REPRICE-LINE only happens when WS-REPRICE-ATTEMPT > 3, which had already been processed. The code was stuck on line 1 forever.

The Fix

Immediate Fix

       4100-PROCESS-ONE-LINE.
           MOVE 0 TO WS-REPRICE-ATTEMPT
           PERFORM 4110-PRICE-LINE

           IF WS-PRICING-OK NOT = 'Y'
               PERFORM 4130-REPRICE-LINE
           END-IF

           IF WS-PRICING-OK = 'Y'
               PERFORM 4120-APPLY-BENEFIT
           ELSE
               SET LINE-DENIED(WS-LINE-IDX) TO TRUE
           END-IF

           ADD 1 TO WS-LINE-IDX
           .

       4130-REPRICE-LINE.
           PERFORM UNTIL WS-PRICING-OK = 'Y'
               OR WS-REPRICE-ATTEMPT >= 3
               ADD 1 TO WS-REPRICE-ATTEMPT
               PERFORM 4110-PRICE-LINE
           END-PERFORM
           .

Key changes:

  1. WS-REPRICE-ATTEMPT is reset at the start of each line item.

  2. WS-LINE-IDX is always incremented, unconditionally, at the end of 4100-PROCESS-ONE-LINE.

  3. The repricing loop properly checks WS-PRICING-OK after each attempt.
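
One way to confirm that the corrected paragraphs terminate is a small stand-alone harness. The program below is a sketch under assumptions, not MedClaim's code: the program name LOOPTEST, the three-entry line table, the WS-PRICE-CALLS counter, and the stubbed 4110-PRICE-LINE (which always fails for line 2) are hypothetical. Only 4100-PROCESS-ONE-LINE and 4130-REPRICE-LINE are taken from the fix above.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. LOOPTEST.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      *    Hypothetical test data: three lines; line 2 never prices.
       01  WS-LINE-COUNT           PIC 9(03) VALUE 3.
       01  WS-LINE-IDX             PIC 9(03) VALUE 0.
       01  WS-REPRICE-ATTEMPT      PIC 9(02) VALUE 0.
       01  WS-PRICING-OK           PIC X     VALUE 'N'.
       01  WS-PRICE-CALLS          PIC 9(05) VALUE 0.
       01  WS-LINE-TABLE.
           05  WS-LINE-ENTRY OCCURS 3 TIMES.
               10  WS-LINE-STATUS  PIC X.
                   88  LINE-PAID   VALUE 'P'.
                   88  LINE-DENIED VALUE 'D'.

       PROCEDURE DIVISION.
       0000-MAIN.
           MOVE SPACES TO WS-LINE-TABLE
           MOVE 1 TO WS-LINE-IDX
           PERFORM 4100-PROCESS-ONE-LINE
               UNTIL WS-LINE-IDX > WS-LINE-COUNT
           DISPLAY 'PRICE CALLS:   ' WS-PRICE-CALLS
           DISPLAY 'LINE 2 STATUS: ' WS-LINE-STATUS(2)
           STOP RUN
           .

      *    Copied from the immediate fix.
       4100-PROCESS-ONE-LINE.
           MOVE 0 TO WS-REPRICE-ATTEMPT
           PERFORM 4110-PRICE-LINE

           IF WS-PRICING-OK NOT = 'Y'
               PERFORM 4130-REPRICE-LINE
           END-IF

           IF WS-PRICING-OK = 'Y'
               PERFORM 4120-APPLY-BENEFIT
           ELSE
               SET LINE-DENIED(WS-LINE-IDX) TO TRUE
           END-IF

           ADD 1 TO WS-LINE-IDX
           .

      *    Stub: counts every call and fails whenever line 2 is priced.
       4110-PRICE-LINE.
           ADD 1 TO WS-PRICE-CALLS
           IF WS-LINE-IDX = 2
               MOVE 'N' TO WS-PRICING-OK
           ELSE
               MOVE 'Y' TO WS-PRICING-OK
           END-IF
           .

      *    Stub: marks the line paid instead of applying benefits.
       4120-APPLY-BENEFIT.
           SET LINE-PAID(WS-LINE-IDX) TO TRUE
           .

      *    Copied from the immediate fix.
       4130-REPRICE-LINE.
           PERFORM UNTIL WS-PRICING-OK = 'Y'
               OR WS-REPRICE-ATTEMPT >= 3
               ADD 1 TO WS-REPRICE-ATTEMPT
               PERFORM 4110-PRICE-LINE
           END-PERFORM
           .

With this stub the run should end after six pricing calls: one each for lines 1 and 3, plus one initial attempt and three retries for line 2, which finishes with status D (denied). Changing the stub so that a retry succeeds is a quick way to check that a repriced line goes on to 4120-APPLY-BENEFIT.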

Defensive Controls Added

       01  WS-LINE-SAFETY.
           05  WS-MAX-LINE-ITERATIONS  PIC 9(05) VALUE 9999.
           05  WS-LINE-LOOP-COUNT      PIC 9(05) VALUE 0.
           05  WS-LAST-LINE-IDX        PIC 9(03) VALUE 0.
           05  WS-NO-PROGRESS-COUNT    PIC 9(03) VALUE 0.

       4000-PROCESS-CLAIM-LINES.
      *    Reset per-claim safety state, including last index seen.
           MOVE 0 TO WS-LINE-LOOP-COUNT
                     WS-NO-PROGRESS-COUNT
                     WS-LAST-LINE-IDX
           MOVE 1 TO WS-LINE-IDX

           PERFORM 4100-PROCESS-ONE-LINE
               UNTIL WS-LINE-IDX > WS-LINE-COUNT
               OR WS-LINE-LOOP-COUNT >= WS-MAX-LINE-ITERATIONS

           IF WS-LINE-LOOP-COUNT >= WS-MAX-LINE-ITERATIONS
               DISPLAY 'SAFETY: Line loop exceeded max for '
                   'claim ' WS-CLAIM-NUMBER
               SET CLM-PENDED TO TRUE
               SET PEND-PROCESSING-ERROR TO TRUE
           END-IF
           .

       4100-PROCESS-ONE-LINE.
           ADD 1 TO WS-LINE-LOOP-COUNT

      *    Progress check: is line index advancing?
           IF WS-LINE-IDX = WS-LAST-LINE-IDX
               ADD 1 TO WS-NO-PROGRESS-COUNT
               IF WS-NO-PROGRESS-COUNT > 5
                   DISPLAY 'SAFETY: Stuck on line '
                       WS-LINE-IDX ' of claim '
                       WS-CLAIM-NUMBER
      *            Pend the claim, then jump to the last line so
      *            the ADD 1 below ends the outer loop.
                   SET CLM-PENDED TO TRUE
                   SET PEND-PROCESSING-ERROR TO TRUE
                   MOVE WS-LINE-COUNT TO WS-LINE-IDX
               END-IF
           ELSE
               MOVE 0 TO WS-NO-PROGRESS-COUNT
               MOVE WS-LINE-IDX TO WS-LAST-LINE-IDX
           END-IF

           MOVE 0 TO WS-REPRICE-ATTEMPT
           PERFORM 4110-PRICE-LINE
           ...
           ADD 1 TO WS-LINE-IDX
           .

Impact and Recovery

The delayed claims were processed in an emergency daytime run:

  Metric                         Value
  Claims delayed                 187,000
  Provider payments delayed      $12.4 million
  Recovery processing time       2.5 hours (emergency run)
  Root cause identification      3.5 hours
  Fix development and testing    4 hours
  Total incident duration        14 hours (2:47 AM to 4:45 PM)

Systemic Changes

James used this incident to mandate three standards across all MedClaim batch programs:

  1. Every PERFORM UNTIL must have a safety counter. The counter limit should be set to at least 10x the expected maximum iterations.

  2. Every loop that modifies an index must verify progress. If the index does not advance within N iterations, the loop must terminate with an error.

  3. Loop control variables must be initialized at the correct scope. Variables used across loop iterations (like WS-REPRICE-ATTEMPT) must be explicitly initialized at the start of each logical unit (each line item, each claim, etc.).

Sarah Kim added a fourth rule from the business perspective: No single claim should be able to block processing of all other claims. If a claim causes an error, it should be pended and processing should continue with the next claim.
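
The fourth rule can be illustrated with a claim loop in which line processing reports trouble through a status flag rather than stalling the job. The program below is one possible shape, not MedClaim's implementation: the program name ISOLATE, the in-memory claim counter standing in for the claim file, the CLM-ADJUDICATED condition, the two totals, and the stubbed 4000-PROCESS-CLAIM-LINES (which pends claim 2) are all hypothetical. Only CLM-PENDED comes from the listings above.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. ISOLATE.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      *    Hypothetical stand-in for the claim input file.
       01  WS-CLAIM-COUNT          PIC 9(03) VALUE 3.
       01  WS-CLAIM-IDX            PIC 9(03) VALUE 0.
       01  WS-CLAIM-STATUS         PIC X     VALUE SPACE.
           88  CLM-ADJUDICATED     VALUE 'A'.
           88  CLM-PENDED          VALUE 'P'.
       01  WS-ADJUDICATED-TOTAL    PIC 9(05) VALUE 0.
       01  WS-PENDED-TOTAL         PIC 9(05) VALUE 0.

       PROCEDURE DIVISION.
       0000-MAIN.
           PERFORM 3100-PROCESS-ONE-CLAIM
               VARYING WS-CLAIM-IDX FROM 1 BY 1
               UNTIL WS-CLAIM-IDX > WS-CLAIM-COUNT
           DISPLAY 'ADJUDICATED: ' WS-ADJUDICATED-TOTAL
           DISPLAY 'PENDED:      ' WS-PENDED-TOTAL
           STOP RUN
           .

       3100-PROCESS-ONE-CLAIM.
      *    Reset per-claim status so nothing leaks from the last claim.
           SET CLM-ADJUDICATED TO TRUE
           PERFORM 4000-PROCESS-CLAIM-LINES

      *    A pended claim is counted and skipped; the loop moves on.
           IF CLM-PENDED
               ADD 1 TO WS-PENDED-TOTAL
           ELSE
               ADD 1 TO WS-ADJUDICATED-TOTAL
           END-IF
           .

      *    Stub: claim 2 behaves as if a safety check had tripped.
       4000-PROCESS-CLAIM-LINES.
           IF WS-CLAIM-IDX = 2
               SET CLM-PENDED TO TRUE
           END-IF
           .

With three claims and claim 2 pended, the run should report two claims adjudicated and one pended. In a production job the pended branch would presumably also write the claim to a pend queue or error file for follow-up, which is left out here.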

Discussion Questions

  1. The original bug involved the loop counter (WS-LINE-IDX) being incremented in multiple places under different conditions. How does the fixed version avoid this problem? What design principle does this illustrate?

  2. The WS-REPRICE-ATTEMPT counter was not reset between line items. This is an example of a broader class of bugs involving stale state. How would you systematically prevent this type of bug?

  3. Sarah Kim's rule — that one bad claim should not block all claims — suggests an error isolation pattern. How would you implement this in the outer claim processing loop?

  4. The safety counter for line items is set to 9,999. Claims typically have 1-50 line items. Is 9,999 the right value? What factors should determine the safety limit?

  5. Could this bug have been caught by code review? What specific review checklist items would have flagged it?