Case Study 2: MedClaim Claim Line Item Processing and the Infinite Loop Incident
The Incident
At 2:47 AM on a Thursday in March 2023, the MedClaim on-call operations analyst received an automated alert: the nightly claims adjudication batch job, CLM-ADJUD, had exceeded its maximum CPU time allocation. The job had been running for 4 hours and 12 minutes — more than three times its normal 80-minute duration.
James Okafor, MedClaim's team lead, was paged at 3:15 AM. By the time he connected remotely, the operations team had already cancelled the job. The damage: 187,000 claims queued for adjudication were unprocessed, and the downstream payment job could not run. Provider payments for 43,000 claims would be delayed by at least one business day.
Root Cause
The investigation traced the problem to a single claim: claim number MC-2023-0847291, a provider submission with 247 line items. The claim line processing loop was:
       4000-PROCESS-CLAIM-LINES.
           MOVE 1 TO WS-LINE-IDX
           PERFORM 4100-PROCESS-ONE-LINE
               UNTIL WS-LINE-IDX > WS-LINE-COUNT
           .

       4100-PROCESS-ONE-LINE.
           PERFORM 4110-PRICE-LINE
           IF WS-PRICING-OK = 'Y'
               PERFORM 4120-APPLY-BENEFIT
               ADD 1 TO WS-LINE-IDX
           ELSE
               PERFORM 4130-REPRICE-LINE
           END-IF
           .

       4130-REPRICE-LINE.
           ADD 1 TO WS-REPRICE-ATTEMPT
           IF WS-REPRICE-ATTEMPT <= 3
               PERFORM 4110-PRICE-LINE
           ELSE
               SET LINE-DENIED TO TRUE
               ADD 1 TO WS-LINE-IDX
           END-IF
           .
The bug was subtle. 4130-REPRICE-LINE calls 4110-PRICE-LINE to retry pricing, but it does not re-check whether pricing succeeded. After calling 4110-PRICE-LINE, it falls through without updating WS-PRICING-OK. Control returns to 4100-PROCESS-ONE-LINE, which checks WS-PRICING-OK — but WS-PRICING-OK is still 'N' from the first failed attempt (because 4130-REPRICE-LINE did not update it after the retry).
Furthermore, WS-REPRICE-ATTEMPT was never reset between line items. After the first line item that required repricing exhausted its 3 attempts, WS-REPRICE-ATTEMPT was 4. For every subsequent line item that needed repricing, WS-REPRICE-ATTEMPT was already > 3, so the repricing was immediately denied and WS-LINE-IDX was advanced. This masked the bug for most claims.
But claim MC-2023-0847291 was different. Its first line item had a service code that triggered a pricing error. The code entered the reprice loop, attempted 3 reprices (all failing), and set the line to denied. But because WS-PRICING-OK was never updated after the repricing, the test IF WS-PRICING-OK = 'Y' in 4100-PROCESS-ONE-LINE still failed, and WS-LINE-IDX was not incremented: the increment in 4130-REPRICE-LINE happens only on the branch where WS-REPRICE-ATTEMPT > 3, and that branch had already run. The code was stuck on line 1 forever.
The Fix
Immediate Fix
       4100-PROCESS-ONE-LINE.
           MOVE 0 TO WS-REPRICE-ATTEMPT
           PERFORM 4110-PRICE-LINE
           IF WS-PRICING-OK NOT = 'Y'
               PERFORM 4130-REPRICE-LINE
           END-IF
           IF WS-PRICING-OK = 'Y'
               PERFORM 4120-APPLY-BENEFIT
           ELSE
               SET LINE-DENIED(WS-LINE-IDX) TO TRUE
           END-IF
           ADD 1 TO WS-LINE-IDX
           .

       4130-REPRICE-LINE.
           PERFORM UNTIL WS-PRICING-OK = 'Y'
                      OR WS-REPRICE-ATTEMPT >= 3
               ADD 1 TO WS-REPRICE-ATTEMPT
               PERFORM 4110-PRICE-LINE
           END-PERFORM
           .
Key changes:
1. WS-REPRICE-ATTEMPT is reset at the start of each line item
2. WS-LINE-IDX is always incremented, unconditionally, at the end of 4100-PROCESS-ONE-LINE
3. The repricing loop properly checks WS-PRICING-OK after each attempt
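To see the corrected control flow terminate on a line that never prices, the fixed paragraphs can be dropped into a small stand-alone program. The sketch below is a hypothetical test harness, not MedClaim's production code: the program name LINEDEMO, the counters WS-PRICE-CALLS, WS-PAID-COUNT, and WS-DENIED-COUNT, and the stubbed 4110-PRICE-LINE (which always fails line 1 and succeeds for every other line) are assumptions made for illustration. The counters stand in for 4120-APPLY-BENEFIT and the LINE-DENIED flag. It is written in fixed-format COBOL and assumes a compiler such as GnuCOBOL.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. LINEDEMO.
      * Hypothetical demo harness, not MedClaim production code.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-LINE-COUNT        PIC 9(03) VALUE 3.
       01  WS-LINE-IDX          PIC 9(03) VALUE 0.
       01  WS-REPRICE-ATTEMPT   PIC 9(02) VALUE 0.
       01  WS-PRICING-OK        PIC X(01) VALUE 'N'.
       01  WS-PRICE-CALLS       PIC 9(03) VALUE 0.
       01  WS-DENIED-COUNT      PIC 9(03) VALUE 0.
       01  WS-PAID-COUNT        PIC 9(03) VALUE 0.
       PROCEDURE DIVISION.
       0000-MAIN.
           PERFORM 4000-PROCESS-CLAIM-LINES
           DISPLAY 'PAID=' WS-PAID-COUNT
                   ' DENIED=' WS-DENIED-COUNT
                   ' PRICE-CALLS=' WS-PRICE-CALLS
           STOP RUN
           .
       4000-PROCESS-CLAIM-LINES.
           MOVE 1 TO WS-LINE-IDX
           PERFORM 4100-PROCESS-ONE-LINE
               UNTIL WS-LINE-IDX > WS-LINE-COUNT
           .
       4100-PROCESS-ONE-LINE.
      * Retry counter is reset for every line item.
           MOVE 0 TO WS-REPRICE-ATTEMPT
           PERFORM 4110-PRICE-LINE
           IF WS-PRICING-OK NOT = 'Y'
               PERFORM 4130-REPRICE-LINE
           END-IF
      * Counters stand in for APPLY-BENEFIT and LINE-DENIED.
           IF WS-PRICING-OK = 'Y'
               ADD 1 TO WS-PAID-COUNT
           ELSE
               ADD 1 TO WS-DENIED-COUNT
           END-IF
      * The index advances on every path through the paragraph.
           ADD 1 TO WS-LINE-IDX
           .
       4110-PRICE-LINE.
      * Stub: line 1 never prices; other lines price at once.
           ADD 1 TO WS-PRICE-CALLS
           IF WS-LINE-IDX = 1
               MOVE 'N' TO WS-PRICING-OK
           ELSE
               MOVE 'Y' TO WS-PRICING-OK
           END-IF
           .
       4130-REPRICE-LINE.
           PERFORM UNTIL WS-PRICING-OK = 'Y'
                      OR WS-REPRICE-ATTEMPT >= 3
               ADD 1 TO WS-REPRICE-ATTEMPT
               PERFORM 4110-PRICE-LINE
           END-PERFORM
           .
```

Under these assumptions the run should report two priced lines, one denied line, and six pricing calls: one initial attempt plus three retries for line 1, then one call each for lines 2 and 3. Because ADD 1 TO WS-LINE-IDX sits outside every conditional, no pricing outcome can leave the loop on the same line.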
Defensive Controls Added
       01  WS-LINE-SAFETY.
           05  WS-MAX-LINE-ITERATIONS  PIC 9(05) VALUE 9999.
           05  WS-LINE-LOOP-COUNT      PIC 9(05) VALUE 0.
           05  WS-LAST-LINE-IDX        PIC 9(03) VALUE 0.
           05  WS-NO-PROGRESS-COUNT    PIC 9(03) VALUE 0.

       4000-PROCESS-CLAIM-LINES.
      * Reset all per-claim safety state, including the last-seen
      * index, so nothing carries over from the previous claim.
           MOVE 0 TO WS-LINE-LOOP-COUNT
                     WS-NO-PROGRESS-COUNT
                     WS-LAST-LINE-IDX
           MOVE 1 TO WS-LINE-IDX
           PERFORM 4100-PROCESS-ONE-LINE
               UNTIL WS-LINE-IDX > WS-LINE-COUNT
                  OR WS-LINE-LOOP-COUNT >= WS-MAX-LINE-ITERATIONS
           IF WS-LINE-LOOP-COUNT >= WS-MAX-LINE-ITERATIONS
               DISPLAY 'SAFETY: Line loop exceeded max for '
                       'claim ' WS-CLAIM-NUMBER
               SET CLM-PENDED TO TRUE
               SET PEND-PROCESSING-ERROR TO TRUE
           END-IF
           .

       4100-PROCESS-ONE-LINE.
           ADD 1 TO WS-LINE-LOOP-COUNT
      * Progress check: is line index advancing?
           IF WS-LINE-IDX = WS-LAST-LINE-IDX
               ADD 1 TO WS-NO-PROGRESS-COUNT
               IF WS-NO-PROGRESS-COUNT > 5
                   DISPLAY 'SAFETY: Stuck on line '
                           WS-LINE-IDX ' of claim '
                           WS-CLAIM-NUMBER
                   MOVE WS-LINE-COUNT TO WS-LINE-IDX
               END-IF
           ELSE
               MOVE 0 TO WS-NO-PROGRESS-COUNT
               MOVE WS-LINE-IDX TO WS-LAST-LINE-IDX
           END-IF
           MOVE 0 TO WS-REPRICE-ATTEMPT
           PERFORM 4110-PRICE-LINE
           ...
           ADD 1 TO WS-LINE-IDX
           .
Impact and Recovery
The delayed claims were processed in an emergency daytime run:
| Metric | Value |
|---|---|
| Claims delayed | 187,000 |
| Provider payments delayed | $12.4 million |
| Recovery processing time | 2.5 hours (emergency run) |
| Root cause identification | 3.5 hours |
| Fix development and testing | 4 hours |
| Total incident duration | 14 hours (2:47 AM to 4:45 PM) |
Systemic Changes
James used this incident to mandate three standards across all MedClaim batch programs:
- Every PERFORM UNTIL must have a safety counter. The counter limit should be set to at least 10x the expected maximum iterations.
- Every loop that modifies an index must verify progress. If the index does not advance within N iterations, the loop must terminate with an error.
- Loop control variables must be initialized at the correct scope. Variables used across loop iterations (like WS-REPRICE-ATTEMPT) must be explicitly initialized at the start of each logical unit (each line item, each claim, etc.).
Sarah Kim added a fourth rule from the business perspective: No single claim should be able to block processing of all other claims. If a claim causes an error, it should be pended and processing should continue with the next claim.
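Sarah Kim's rule can be sketched at the claim level in the same style. The following is a minimal illustration under stated assumptions, not the CLM-ADJUD implementation: the program name ISOLDEMO, the paragraph 3000-PROCESS-ONE-CLAIM, the fields WS-CLAIM-IDX, WS-CLAIM-COUNT, and WS-CLAIM-STATUS, the condition name CLM-OK, and both counters are hypothetical. 4000-PROCESS-CLAIM-LINES is reduced to a stub that pends claim 2, standing in for a claim that trips a safety limit. CLM-PENDED is borrowed from the defensive code above, but its definition here is assumed.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. ISOLDEMO.
      * Hypothetical sketch; all names here are illustrative.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-CLAIM-COUNT       PIC 9(03) VALUE 4.
       01  WS-CLAIM-IDX         PIC 9(03) VALUE 0.
       01  WS-CLAIM-STATUS      PIC X(01) VALUE SPACE.
           88  CLM-OK                     VALUE 'A'.
           88  CLM-PENDED                 VALUE 'P'.
       01  WS-ADJUDICATED-COUNT PIC 9(03) VALUE 0.
       01  WS-PENDED-COUNT      PIC 9(03) VALUE 0.
       PROCEDURE DIVISION.
       0000-MAIN.
      * VARYING owns the claim index, so no path can skip it.
           PERFORM 3000-PROCESS-ONE-CLAIM
               VARYING WS-CLAIM-IDX FROM 1 BY 1
               UNTIL WS-CLAIM-IDX > WS-CLAIM-COUNT
           DISPLAY 'ADJUDICATED=' WS-ADJUDICATED-COUNT
                   ' PENDED=' WS-PENDED-COUNT
           STOP RUN
           .
       3000-PROCESS-ONE-CLAIM.
      * Reset per-claim state so nothing leaks from the prior claim.
           SET CLM-OK TO TRUE
           PERFORM 4000-PROCESS-CLAIM-LINES
           IF CLM-PENDED
      * Record the failure, then let the loop move to the next claim.
               ADD 1 TO WS-PENDED-COUNT
               DISPLAY 'PENDED CLAIM AT INDEX ' WS-CLAIM-IDX
           ELSE
               ADD 1 TO WS-ADJUDICATED-COUNT
           END-IF
           .
       4000-PROCESS-CLAIM-LINES.
      * Stub: claim 2 stands in for one that trips a safety limit.
           IF WS-CLAIM-IDX = 2
               SET CLM-PENDED TO TRUE
           END-IF
           .
```

With four claims and the stub as written, the run should pend one claim and adjudicate the other three. The design point is that a pended claim is an ordinary outcome of 3000-PROCESS-ONE-CLAIM rather than a reason to stop the job, and the claim index is advanced by PERFORM VARYING instead of by code inside the paragraph.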
Discussion Questions
- The original bug involved the loop counter (WS-LINE-IDX) being incremented in multiple places under different conditions. How does the fixed version avoid this problem? What design principle does this illustrate?
- The WS-REPRICE-ATTEMPT counter was not reset between line items. This is an example of a broader class of bugs involving stale state. How would you systematically prevent this type of bug?
- Sarah Kim's rule, that one bad claim should not block all claims, suggests an error isolation pattern. How would you implement this in the outer claim processing loop?
- The safety counter for line items is set to 9,999. Claims typically have 1-50 line items. Is 9,999 the right value? What factors should determine the safety limit?
- Could this bug have been caught by code review? What specific review checklist items would have flagged it?