Case Study 2: MedClaim's Batch Adjudication Performance Crisis
Background
MedClaim Health Services processes approximately 500,000 insurance claims per month through their batch adjudication system — program CLM-ADJUD. The program reads submitted claims, validates member eligibility and provider network status, applies coverage rules, and sets each claim's status to APPROVED, DENIED, or PENDING REVIEW.
The batch window for CLM-ADJUD is 4 hours, running every weeknight starting at 10:00 PM. For two years, the program completed well within this window — typically finishing in about 2.5 hours.
The Crisis
In Q3 2024, MedClaim acquired a regional health plan, increasing monthly claim volume from 500,000 to 820,000. The first month after the acquisition, CLM-ADJUD ran for 7 hours and 12 minutes — nearly doubling the batch window and delaying the morning online system availability by 3 hours. Customer service representatives could not access claim status until 11:00 AM instead of the usual 8:00 AM.
James Okafor's team had 30 days to fix the problem before the next month-end run.
Investigation
Tomás Rivera began by analyzing the program's DB2 activity using the DB2 Performance Monitor (PM):
Finding 1: The N+1 Query Problem
For each of the 820,000 claims, the program executed three separate SQL statements:

1. FETCH the claim from a cursor on the CLAIM table
2. SELECT member data from the MEMBER table using MEMBER_ID
3. SELECT provider data from the PROVIDER table using PROVIDER_ID
Total SQL calls: 820,000 x 3 = 2,460,000 SQL statements. Each call involved context switching between the COBOL runtime and the DB2 engine.
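The per-claim logic followed the classic N+1 pattern. A reconstructed sketch (cursor, paragraph, and host-variable names such as ADJUD-CURSOR and CLM-MEMBER-ID are illustrative, not taken from the actual program):

```cobol
      *    Reconstructed sketch of the original N+1 pattern.
           PERFORM UNTIL SQLCODE = +100
               EXEC SQL
                   FETCH ADJUD-CURSOR
                     INTO :CLM-NUMBER, :CLM-MEMBER-ID, :CLM-PROVIDER-ID
               END-EXEC
               IF SQLCODE = 0
      *            One extra DB2 call per claim for member data...
                   EXEC SQL
                       SELECT MEMBER_NAME, PLAN_CODE
                         INTO :MBR-NAME, :MBR-PLAN
                         FROM MEMBER
                        WHERE MEMBER_ID = :CLM-MEMBER-ID
                   END-EXEC
      *            ...and another per claim for provider data
                   EXEC SQL
                       SELECT PROVIDER_NAME, NETWORK_STATUS
                         INTO :PRV-NAME, :PRV-NET
                         FROM PROVIDER
                        WHERE PROVIDER_ID = :CLM-PROVIDER-ID
                   END-EXEC
                   PERFORM ADJUDICATE-CLAIM
               END-IF
           END-PERFORM
```

Each iteration pays the COBOL-to-DB2 call overhead three times, even though the member and provider lookups could be resolved inside DB2 in a single statement.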
Finding 2: Single-Row Processing
The cursor used standard single-row FETCH. For 820,000 rows, this meant 820,000 FETCH calls — each with its own DB2 engine overhead.
Finding 3: Commit-per-Row
The original developer had placed EXEC SQL COMMIT after every claim was processed. This meant 820,000 COMMIT operations, each forcing a log write to disk.
Finding 4: Missing Index
The CLAIM table had an index on CLAIM_NUMBER (primary key) but not on CLAIM_STATUS. The cursor's WHERE clause WHERE CLAIM_STATUS = 'SUBMITTED' was causing a full tablespace scan on a 12-million-row table.
Optimization Strategy
The team implemented four changes in priority order:
Change 1: Consolidate Lookups into a JOIN (Impact: 67% reduction in SQL calls)
Tomás rewrote the cursor to JOIN the CLAIM, MEMBER, and PROVIDER tables:
```sql
SELECT C.CLAIM_NUMBER, C.CLAIM_AMOUNT, C.SERVICE_CODE,
       M.MEMBER_NAME, M.PLAN_CODE, M.COVERAGE_LEVEL,
       P.PROVIDER_NAME, P.PROVIDER_TYPE, P.NETWORK_STATUS
  FROM CLAIM C
  JOIN MEMBER M   ON C.MEMBER_ID   = M.MEMBER_ID
  JOIN PROVIDER P ON C.PROVIDER_ID = P.PROVIDER_ID
 WHERE C.CLAIM_STATUS = 'SUBMITTED'
 ORDER BY C.CLAIM_NUMBER
```
This eliminated 1,640,000 individual SELECT calls. SQL calls dropped from 2,460,000 to 820,000.
Change 2: Multi-Row FETCH (Impact: 99% reduction in FETCH calls)
The team converted the cursor to use multi-row FETCH with a rowset size of 100:
```sql
FETCH NEXT ROWSET FROM ADJUD-CURSOR
  FOR 100 ROWS
 INTO :HOST-VARIABLE-ARRAYS
```
820,000 FETCH calls became 8,200 FETCH calls.
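On the COBOL side, multi-row FETCH needs the cursor declared WITH ROWSET POSITIONING and one OCCURS array per column in the select list; the last rowset is usually shorter than 100, so the program checks SQLERRD(3) for the rows actually returned. A minimal sketch (column list abbreviated; array and counter names are illustrative):

```cobol
       01  HV-ROWSET.
           05  HV-CLAIM-NUMBER  PIC X(12)            OCCURS 100 TIMES.
           05  HV-CLAIM-AMOUNT  PIC S9(7)V99 COMP-3  OCCURS 100 TIMES.
           05  HV-MEMBER-NAME   PIC X(30)            OCCURS 100 TIMES.
      *    ...one OCCURS 100 array for each remaining column
       77  WS-ROWS-FETCHED      PIC S9(9) COMP.
       77  WS-I                 PIC S9(9) COMP.

           EXEC SQL
               FETCH NEXT ROWSET FROM ADJUD-CURSOR
                 FOR 100 ROWS
                INTO :HV-CLAIM-NUMBER, :HV-CLAIM-AMOUNT, :HV-MEMBER-NAME
           END-EXEC
      *    SQLERRD(3) holds the number of rows actually fetched;
      *    the final rowset typically returns fewer than 100
           MOVE SQLERRD(3) TO WS-ROWS-FETCHED
           PERFORM VARYING WS-I FROM 1 BY 1
                   UNTIL WS-I > WS-ROWS-FETCHED
               PERFORM ADJUDICATE-CLAIM
           END-PERFORM
```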
Change 3: Batch COMMIT (Impact: 99.9% reduction in COMMIT calls)
The team changed from commit-per-row to commit-per-500-rows with a WITH HOLD cursor. The last-committed CLAIM_NUMBER was saved for restart purposes:
```cobol
IF FUNCTION MOD(WS-CLAIM-COUNT, 500) = 0
    MOVE WS-CURRENT-CLAIM TO WS-CHECKPOINT-KEY
    EXEC SQL COMMIT END-EXEC
END-IF
```
820,000 COMMITs became 1,640 COMMITs.
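For the checkpoint to survive an abend, a working-storage field alone is not enough — the key has to be persisted inside the same unit of work, so that one COMMIT covers both the claim updates and the restart position. One common pattern, sketched here with a hypothetical restart table BATCH_RESTART(JOB_NAME, LAST_CLAIM_NUMBER):

```cobol
      *    BATCH_RESTART and its columns are hypothetical; the
      *    UPDATE and the claim updates share one unit of work,
      *    so the COMMIT hardens both together.
           IF FUNCTION MOD(WS-CLAIM-COUNT, 500) = 0
               EXEC SQL
                   UPDATE BATCH_RESTART
                      SET LAST_CLAIM_NUMBER = :WS-CURRENT-CLAIM
                    WHERE JOB_NAME = 'CLMADJUD'
               END-EXEC
               EXEC SQL COMMIT END-EXEC
           END-IF

      *    On restart, the saved key is read back and the cursor is
      *    reopened past the last committed claim, e.g.:
      *        WHERE C.CLAIM_STATUS = 'SUBMITTED'
      *          AND C.CLAIM_NUMBER > :WS-CHECKPOINT-KEY
      *        ORDER BY C.CLAIM_NUMBER
```

Because the cursor is ordered by CLAIM_NUMBER, the extra predicate repositions processing exactly where the last commit left off.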
Change 4: Status Index (Impact: Tablespace scan eliminated)
```sql
CREATE INDEX IX_CLAIM_STATUS ON CLAIM (CLAIM_STATUS, CLAIM_NUMBER)
```
The EXPLAIN output changed from ACCESSTYPE='R' (tablespace scan) to ACCESSTYPE='I' with MATCHCOLS=1.
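An access-path change like this is verified with EXPLAIN against PLAN_TABLE. A typical check (the QUERYNO value is arbitrary; the predicate mirrors the cursor's WHERE clause):

```sql
-- Populate PLAN_TABLE for the adjudication cursor's SELECT
EXPLAIN PLAN SET QUERYNO = 101 FOR
  SELECT C.CLAIM_NUMBER
    FROM CLAIM C
   WHERE C.CLAIM_STATUS = 'SUBMITTED';

-- ACCESSTYPE 'I' with ACCESSNAME = IX_CLAIM_STATUS and
-- MATCHCOLS = 1 confirms a matching index scan replaced the
-- ACCESSTYPE 'R' tablespace scan
SELECT QUERYNO, ACCESSTYPE, ACCESSNAME, MATCHCOLS
  FROM PLAN_TABLE
 WHERE QUERYNO = 101;
```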
Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| SQL calls | 2,460,000 | ~18,000 | 99.3% reduction |
| COMMIT calls | 820,000 | 1,640 | 99.8% reduction |
| Elapsed time | 7h 12m | 42m | 90.3% reduction |
| CPU time | 3h 45m | 28m | 87.6% reduction |
| Log volume | 4.2 GB | 180 MB | 95.7% reduction |
The program now completes well within the 4-hour batch window, even with projected volume growth to 1.5 million claims per month.
Post-Mortem: Why Was the Original Program So Inefficient?
The original developer was a skilled COBOL programmer but had limited DB2 experience. Three common misconceptions led to the performance problems:
- "One table per query is simpler." The developer treated DB2 like a file system, reading one table at a time. JOINs felt complex, but joining is exactly what a relational engine is optimized to do.
- "Commit after every row is safest." The developer equated frequent commits with data safety. In reality, commit-per-row adds a synchronous log write for every claim while gaining almost nothing in recoverability; a sensible checkpoint interval preserves restartability at a fraction of the cost.
- "Indexes are the DBA's job." The developer never checked the access path. The missing CLAIM_STATUS index meant DB2 was reading all 12 million rows to find the ~820,000 with status 'SUBMITTED'.
Discussion Questions
- The team chose a rowset size of 100 for multi-row FETCH. What factors would you consider when choosing a rowset size? What are the trade-offs of using 10, 100, 500, or 1000?
- The commit frequency of 500 was chosen as a "sweet spot." How would you determine the optimal commit frequency for a specific batch job? What metrics would you monitor?
- After the index was created, new INSERTs to the CLAIM table became slightly slower. Why? How do you balance read performance against write performance?
- The team briefly considered using UR (Uncommitted Read) isolation to avoid lock contention with the online system. James Okafor rejected this. Why might UR be inappropriate for an adjudication program that makes payment decisions?
- If the claim volume grows to 5 million per month, what additional optimization strategies might the team consider?