Case Study 2: MedClaim's Batch Adjudication Performance Crisis
Background
MedClaim Health Services processes approximately 500,000 insurance claims per month through their batch adjudication system — program CLM-ADJUD. The program reads submitted claims, validates member eligibility and provider network status, applies coverage rules, and sets each claim's status to APPROVED, DENIED, or PENDING REVIEW.
The batch window for CLM-ADJUD is 4 hours, running every weeknight starting at 10:00 PM. For two years, the program completed well within this window — typically finishing in about 2.5 hours.
The Crisis
In Q3 2024, MedClaim acquired a regional health plan, increasing monthly claim volume from 500,000 to 820,000. The first month after the acquisition, CLM-ADJUD ran for 7 hours and 12 minutes — nearly doubling the batch window and delaying the morning online system availability by 3 hours. Customer service representatives could not access claim status until 11:00 AM instead of the usual 8:00 AM.
James Okafor's team had 30 days to fix the problem before the next month-end run.
Investigation
Tomás Rivera began by analyzing the program's DB2 activity using the DB2 Performance Monitor (PM):
Finding 1: The N+1 Query Problem
For each of the 820,000 claims, the program executed three separate SQL statements:

1. FETCH the claim from a cursor on the CLAIM table
2. SELECT member data from the MEMBER table using MEMBER_ID
3. SELECT provider data from the PROVIDER table using PROVIDER_ID
Total SQL calls: 820,000 x 3 = 2,460,000 SQL statements. Each call involved context switching between the COBOL runtime and the DB2 engine.
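The per-claim logic followed the classic N+1 pattern. A reconstructed sketch (cursor, paragraph, and host-variable names such as ADJUD-CURSOR and CLM-MEMBER-ID are illustrative, not taken from the actual program):

```cobol
      *    Reconstructed sketch of the original N+1 pattern.
           PERFORM UNTIL SQLCODE = +100
               EXEC SQL
                   FETCH ADJUD-CURSOR
                     INTO :CLM-NUMBER, :CLM-MEMBER-ID, :CLM-PROVIDER-ID
               END-EXEC
               IF SQLCODE = 0
      *            One extra DB2 call per claim for member data...
                   EXEC SQL
                       SELECT MEMBER_NAME, PLAN_CODE
                         INTO :MBR-NAME, :MBR-PLAN
                         FROM MEMBER
                        WHERE MEMBER_ID = :CLM-MEMBER-ID
                   END-EXEC
      *            ...and another per claim for provider data
                   EXEC SQL
                       SELECT PROVIDER_NAME, NETWORK_STATUS
                         INTO :PRV-NAME, :PRV-NET
                         FROM PROVIDER
                        WHERE PROVIDER_ID = :CLM-PROVIDER-ID
                   END-EXEC
                   PERFORM ADJUDICATE-CLAIM
               END-IF
           END-PERFORM
```

Each iteration pays the COBOL-to-DB2 call overhead three times, even though the member and provider lookups could be resolved inside DB2 in a single statement.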
Finding 2: Single-Row Processing
The cursor used standard single-row FETCH. For 820,000 rows, this meant 820,000 FETCH calls — each with its own DB2 engine overhead.
Finding 3: Commit-per-Row
The original developer had placed EXEC SQL COMMIT after every claim was processed. This meant 820,000 COMMIT operations, each forcing a log write to disk.
Finding 4: Missing Index
The CLAIM table had an index on CLAIM_NUMBER (primary key) but not on CLAIM_STATUS. The cursor's WHERE clause WHERE CLAIM_STATUS = 'SUBMITTED' was causing a full tablespace scan on a 12-million-row table.
Optimization Strategy
The team implemented four changes in priority order:
Change 1: Consolidate Lookups into a JOIN (Impact: 67% reduction in SQL calls)
Tomás rewrote the cursor to JOIN the CLAIM, MEMBER, and PROVIDER tables:
```sql
SELECT C.CLAIM_NUMBER, C.CLAIM_AMOUNT, C.SERVICE_CODE,
       M.MEMBER_NAME, M.PLAN_CODE, M.COVERAGE_LEVEL,
       P.PROVIDER_NAME, P.PROVIDER_TYPE, P.NETWORK_STATUS
  FROM CLAIM C
  JOIN MEMBER M   ON C.MEMBER_ID   = M.MEMBER_ID
  JOIN PROVIDER P ON C.PROVIDER_ID = P.PROVIDER_ID
 WHERE C.CLAIM_STATUS = 'SUBMITTED'
 ORDER BY C.CLAIM_NUMBER
```
This eliminated 1,640,000 individual SELECT calls. SQL calls dropped from 2,460,000 to 820,000.
Change 2: Multi-Row FETCH (Impact: 99% reduction in FETCH calls)
The team converted the cursor to use multi-row FETCH with a rowset size of 100:
```sql
FETCH NEXT ROWSET FROM ADJUD-CURSOR
  FOR 100 ROWS
 INTO :HOST-VARIABLE-ARRAYS
```
820,000 FETCH calls became 8,200 FETCH calls.
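On the COBOL side, multi-row FETCH needs the cursor declared WITH ROWSET POSITIONING and one OCCURS array per column in the select list; the last rowset is usually shorter than 100, so the program checks SQLERRD(3) for the rows actually returned. A minimal sketch (column list abbreviated; array and counter names are illustrative):

```cobol
       01  HV-ROWSET.
           05  HV-CLAIM-NUMBER  PIC X(12)            OCCURS 100 TIMES.
           05  HV-CLAIM-AMOUNT  PIC S9(7)V99 COMP-3  OCCURS 100 TIMES.
           05  HV-MEMBER-NAME   PIC X(30)            OCCURS 100 TIMES.
      *    ...one OCCURS 100 array for each remaining column
       77  WS-ROWS-FETCHED      PIC S9(9) COMP.
       77  WS-I                 PIC S9(9) COMP.

           EXEC SQL
               FETCH NEXT ROWSET FROM ADJUD-CURSOR
                 FOR 100 ROWS
                INTO :HV-CLAIM-NUMBER, :HV-CLAIM-AMOUNT, :HV-MEMBER-NAME
           END-EXEC
      *    SQLERRD(3) holds the number of rows actually fetched;
      *    the final rowset typically returns fewer than 100
           MOVE SQLERRD(3) TO WS-ROWS-FETCHED
           PERFORM VARYING WS-I FROM 1 BY 1
                   UNTIL WS-I > WS-ROWS-FETCHED
               PERFORM ADJUDICATE-CLAIM
           END-PERFORM
```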
Change 3: Batch COMMIT (Impact: 99.9% reduction in COMMIT calls)
The team changed from commit-per-row to commit-per-500-rows with a WITH HOLD cursor. The last-committed CLAIM_NUMBER was saved for restart purposes:
```cobol
IF FUNCTION MOD(WS-CLAIM-COUNT, 500) = 0
    MOVE WS-CURRENT-CLAIM TO WS-CHECKPOINT-KEY
    EXEC SQL COMMIT END-EXEC
END-IF
```
820,000 COMMITs became 1,640 COMMITs.
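For the checkpoint to survive an abend, a working-storage field alone is not enough — the key has to be persisted inside the same unit of work, so that one COMMIT covers both the claim updates and the restart position. One common pattern, sketched here with a hypothetical restart table BATCH_RESTART(JOB_NAME, LAST_CLAIM_NUMBER):

```cobol
      *    BATCH_RESTART and its columns are hypothetical; the
      *    UPDATE and the claim updates share one unit of work,
      *    so the COMMIT hardens both together.
           IF FUNCTION MOD(WS-CLAIM-COUNT, 500) = 0
               EXEC SQL
                   UPDATE BATCH_RESTART
                      SET LAST_CLAIM_NUMBER = :WS-CURRENT-CLAIM
                    WHERE JOB_NAME = 'CLMADJUD'
               END-EXEC
               EXEC SQL COMMIT END-EXEC
           END-IF

      *    On restart, the saved key is read back and the cursor is
      *    reopened past the last committed claim, e.g.:
      *        WHERE C.CLAIM_STATUS = 'SUBMITTED'
      *          AND C.CLAIM_NUMBER > :WS-CHECKPOINT-KEY
      *        ORDER BY C.CLAIM_NUMBER
```

Because the cursor is ordered by CLAIM_NUMBER, the extra predicate repositions processing exactly where the last commit left off.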
Change 4: Status Index (Impact: Tablespace scan eliminated)
```sql
CREATE INDEX IX_CLAIM_STATUS ON CLAIM (CLAIM_STATUS, CLAIM_NUMBER)
```
The EXPLAIN output changed from ACCESSTYPE='R' (tablespace scan) to ACCESSTYPE='I' with MATCHCOLS=1.
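An access-path change like this is verified with EXPLAIN against PLAN_TABLE. A typical check (the QUERYNO value is arbitrary; the predicate mirrors the cursor's WHERE clause):

```sql
-- Populate PLAN_TABLE for the adjudication cursor's SELECT
EXPLAIN PLAN SET QUERYNO = 101 FOR
  SELECT C.CLAIM_NUMBER
    FROM CLAIM C
   WHERE C.CLAIM_STATUS = 'SUBMITTED';

-- ACCESSTYPE 'I' with ACCESSNAME = IX_CLAIM_STATUS and
-- MATCHCOLS = 1 confirms a matching index scan replaced the
-- ACCESSTYPE 'R' tablespace scan
SELECT QUERYNO, ACCESSTYPE, ACCESSNAME, MATCHCOLS
  FROM PLAN_TABLE
 WHERE QUERYNO = 101;
```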
Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| SQL calls | 2,460,000 | ~18,000 | 99.3% reduction |
| COMMIT calls | 820,000 | 1,640 | 99.8% reduction |
| Elapsed time | 7h 12m | 42m | 90.3% reduction |
| CPU time | 3h 45m | 28m | 87.6% reduction |
| Log volume | 4.2 GB | 180 MB | 95.7% reduction |
The program now completes well within the 4-hour batch window, even with projected volume growth to 1.5 million claims per month.
Post-Mortem: Why Was the Original Program So Inefficient?
The original developer was a skilled COBOL programmer but had limited DB2 experience. Three common misconceptions led to the performance problems:
- "One table per query is simpler." The developer treated DB2 like a file system, reading one table at a time. JOINs felt complex, but joining is exactly what a relational engine is optimized to do.
- "Commit after every row is safest." The developer equated frequent commits with data safety. In reality, commit-per-row adds a synchronous log write for every claim while gaining almost nothing in recoverability; a sensible checkpoint interval preserves restartability at a fraction of the cost.
- "Indexes are the DBA's job." The developer never checked the access path. The missing CLAIM_STATUS index meant DB2 was reading all 12 million rows to find the ~820,000 with status 'SUBMITTED'.
Discussion Questions
- The team chose a rowset size of 100 for multi-row FETCH. What factors would you consider when choosing a rowset size? What are the trade-offs of using 10, 100, 500, or 1000?
- The commit frequency of 500 was chosen as a "sweet spot." How would you determine the optimal commit frequency for a specific batch job? What metrics would you monitor?
- After the index was created, new INSERTs to the CLAIM table became slightly slower. Why? How do you balance read performance against write performance?
- The team briefly considered using UR (Uncommitted Read) isolation to avoid lock contention with the online system. James Okafor rejected this. Why might UR be inappropriate for an adjudication program that makes payment decisions?
- If the claim volume grows to 5 million per month, what additional optimization strategies might the team consider?