Case Study 2: MedClaim Diagnosis Code Parser and Claim Description Formatter
Background
MedClaim Health Services receives claims from thousands of healthcare providers. Each claim contains diagnosis codes (ICD-10 format), procedure codes (CPT format), and free-text descriptions. The claims arrive in multiple formats:
- EDI 837 format: Pipe-delimited electronic submissions
- CMS-1500 format: Fixed-position fields based on the paper form
- Legacy format: A comma-delimited format from MedClaim's older providers
Sarah Kim's team must normalize all these formats into MedClaim's internal fixed-format record structure. This normalization is almost entirely a string handling challenge.
The Problem
Three specific string handling challenges have emerged:
Challenge 1: ICD-10 Code Parsing
ICD-10 codes like "E11.65" must be parsed into category (E), etiology (11), and detail (65) for reporting and analytics. The detail portion is optional and variable-length (0 to 4 characters).
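The case study does not show code for this challenge. As a minimal sketch, assuming the code arrives left-justified in an 8-character field and using hypothetical data names (WS-ICD-CODE, WS-ICD-HEAD, and so on), a single UNSTRING with COUNT IN can separate the parts and report the length of the optional detail:

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. ICDPARSE.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * All data names here are hypothetical.
       01  WS-ICD-CODE       PIC X(8) VALUE 'E11.65'.
       01  WS-ICD-HEAD       PIC X(3).
       01  WS-ICD-CATEGORY   PIC X(1).
       01  WS-ICD-ETIOLOGY   PIC X(2).
       01  WS-ICD-DETAIL     PIC X(4).
       01  WS-HEAD-LEN       PIC 9(2).
       01  WS-DETAIL-LEN     PIC 9(2).
       PROCEDURE DIVISION.
       0000-MAIN.
           MOVE SPACES TO WS-ICD-HEAD WS-ICD-DETAIL
           MOVE ZERO TO WS-HEAD-LEN WS-DETAIL-LEN
      * Split at the decimal point. Treating SPACE as a second
      * delimiter stops the detail at the first trailing blank,
      * so WS-DETAIL-LEN holds the true length (0 to 4).
           UNSTRING WS-ICD-CODE
               DELIMITED BY '.' OR SPACE
               INTO WS-ICD-HEAD   COUNT IN WS-HEAD-LEN
                    WS-ICD-DETAIL COUNT IN WS-DETAIL-LEN
           END-UNSTRING
      * The part before the decimal point must be 3 characters.
           IF WS-HEAD-LEN NOT = 3
               DISPLAY 'INVALID ICD-10 CODE: ' WS-ICD-CODE
           ELSE
               MOVE WS-ICD-HEAD(1:1) TO WS-ICD-CATEGORY
               MOVE WS-ICD-HEAD(2:2) TO WS-ICD-ETIOLOGY
               DISPLAY 'CATEGORY: ' WS-ICD-CATEGORY
               DISPLAY 'ETIOLOGY: ' WS-ICD-ETIOLOGY
               DISPLAY 'DETAIL:   ' WS-ICD-DETAIL
               DISPLAY 'DETAIL LENGTH: ' WS-DETAIL-LEN
           END-IF
           STOP RUN.
```

For "E11.65" this yields category E, etiology 11, and detail 65 with a detail length of 2. A code with no decimal point, such as "E11", leaves the detail blank with a length of 0. The sketch checks only the length of the leading part; validating the characters themselves would be a separate step.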
Challenge 2: Multi-Format Claim Descriptions
Claim descriptions arrive in different formats and must be normalized:
- EDI: "OFFICE VISIT|EST PATIENT|LEVEL 3" (pipe-delimited components)
- CMS-1500: "OFFICE VISIT, EST PATIENT, LEVEL 3" (comma-separated)
- Legacy: "OV-EST-L3" (abbreviated codes)
All must produce: "OFFICE VISIT - ESTABLISHED PATIENT - LEVEL 3"
Challenge 3: Provider Name Standardization
Provider names arrive in inconsistent formats:
- "Dr. John Smith, MD"
- "SMITH, JOHN A MD"
- "john smith md"
- "Smith,John"
All must be normalized to the form "SMITH, JOHN A": uppercase, Last-First order, middle initial kept when present, titles and credentials removed.
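The case study does not show code for this challenge either. One possible approach, sketched here with hypothetical data names and the simplifying assumptions that the last name is a single word and that DR and MD are the only titles or credentials to strip, is to remember where the first comma was, tokenize on spaces, and then decide which kept token is the last name:

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. PROVNAME.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * All data names are hypothetical.
       01  WS-RAW-NAME       PIC X(40) VALUE 'Dr. John Smith, MD'.
       01  WS-WORK           PIC X(40).
       01  WS-NORM-NAME      PIC X(40).
       01  WS-PRE-COMMA      PIC 9(2).
       01  WS-PTR            PIC 9(2).
       01  WS-START          PIC 9(2).
       01  WS-OUT-PTR        PIC 9(2).
       01  WS-TOKEN          PIC X(20).
           88  WS-IS-NOISE   VALUE 'DR' 'MD' SPACES.
       01  WS-KEPT-TABLE.
           05  WS-KEPT       PIC X(20) OCCURS 4 TIMES.
       01  WS-KEPT-CNT       PIC 9 VALUE ZERO.
       01  WS-BEFORE-CNT     PIC 9 VALUE ZERO.
       01  WS-LAST-IDX       PIC 9.
       01  WS-IDX            PIC 9.
       PROCEDURE DIVISION.
       0000-MAIN.
           MOVE FUNCTION UPPER-CASE(WS-RAW-NAME) TO WS-WORK
      * Characters before the first comma (40 if there is none).
           MOVE ZERO TO WS-PRE-COMMA
           INSPECT WS-WORK TALLYING WS-PRE-COMMA
               FOR CHARACTERS BEFORE INITIAL ','
           INSPECT WS-WORK REPLACING ALL '.' BY SPACE
                                     ALL ',' BY SPACE
      * Tokenize on spaces, dropping titles, credentials, blanks.
           MOVE 1 TO WS-PTR
           PERFORM UNTIL WS-PTR > FUNCTION LENGTH(WS-WORK)
                      OR WS-KEPT-CNT = 4
               MOVE WS-PTR TO WS-START
               MOVE SPACES TO WS-TOKEN
               UNSTRING WS-WORK DELIMITED BY ALL SPACE
                   INTO WS-TOKEN
                   WITH POINTER WS-PTR
               END-UNSTRING
               IF NOT WS-IS-NOISE
                   ADD 1 TO WS-KEPT-CNT
                   MOVE WS-TOKEN TO WS-KEPT(WS-KEPT-CNT)
                   IF WS-START <= WS-PRE-COMMA
                       ADD 1 TO WS-BEFORE-CNT
                   END-IF
               END-IF
           END-PERFORM
           IF WS-KEPT-CNT = 0
               DISPLAY 'NO NAME FOUND'
               STOP RUN
           END-IF
      * Name words after the comma mean "LAST, FIRST" input;
      * otherwise the last kept word is taken as the last name.
           IF WS-BEFORE-CNT < WS-KEPT-CNT
               MOVE 1 TO WS-LAST-IDX
           ELSE
               MOVE WS-KEPT-CNT TO WS-LAST-IDX
           END-IF
           MOVE SPACES TO WS-NORM-NAME
           MOVE 1 TO WS-OUT-PTR
           STRING WS-KEPT(WS-LAST-IDX) DELIMITED BY SPACE
                  ','                  DELIMITED BY SIZE
               INTO WS-NORM-NAME
               WITH POINTER WS-OUT-PTR
           END-STRING
           PERFORM VARYING WS-IDX FROM 1 BY 1
                   UNTIL WS-IDX > WS-KEPT-CNT
               IF WS-IDX NOT = WS-LAST-IDX
                   STRING ' '             DELIMITED BY SIZE
                          WS-KEPT(WS-IDX) DELIMITED BY SPACE
                       INTO WS-NORM-NAME
                       WITH POINTER WS-OUT-PTR
                   END-STRING
               END-IF
           END-PERFORM
           DISPLAY WS-NORM-NAME
           STOP RUN.
```

Tracing the four sample inputs by hand, the first, third, and fourth come out as "SMITH, JOHN" and the second as "SMITH, JOHN A". The sketch makes no attempt at multi-word last names, longer credential lists, or accented characters, which is where the discussion questions at the end of this case study pick up.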
Solution: The Normalization Pipeline
James Okafor designs a three-stage pipeline, each stage using different string handling facilities:
Stage 1: Format Detection (INSPECT + Reference Modification)
       2000-DETECT-FORMAT.
      * Count pipes to detect EDI format
           MOVE ZERO TO WS-PIPE-CNT
           INSPECT WS-RAW-DESC
               TALLYING WS-PIPE-CNT FOR ALL '|'
           IF WS-PIPE-CNT > 0
               MOVE 'EDI' TO WS-FORMAT-TYPE
               EXIT PARAGRAPH
           END-IF
      * Check for comma-space pattern (CMS-1500)
           MOVE ZERO TO WS-COMMA-CNT
           INSPECT WS-RAW-DESC
               TALLYING WS-COMMA-CNT FOR ALL ', '
           IF WS-COMMA-CNT > 0
               MOVE 'CMS' TO WS-FORMAT-TYPE
               EXIT PARAGRAPH
           END-IF
      * Check for known abbreviation prefixes (Legacy)
           IF WS-RAW-DESC(1:2) = 'OV'
              OR WS-RAW-DESC(1:2) = 'ER'
              OR WS-RAW-DESC(1:3) = 'INP'
               MOVE 'LEG' TO WS-FORMAT-TYPE
           ELSE
               MOVE 'UNK' TO WS-FORMAT-TYPE
           END-IF.
Stage 2: Parsing by Format (UNSTRING)
       2100-PARSE-EDI.
           INITIALIZE WS-DESC-PARTS
      * TALLYING IN adds to the counter, so reset it first
      * (harmless if INITIALIZE above already covers it)
           MOVE ZERO TO WS-PART-COUNT
           UNSTRING WS-RAW-DESC
               DELIMITED BY '|'
               INTO WS-DESC-PART1
                    WS-DESC-PART2
                    WS-DESC-PART3
               TALLYING IN WS-PART-COUNT
           END-UNSTRING.

       2200-PARSE-CMS.
           INITIALIZE WS-DESC-PARTS
           MOVE ZERO TO WS-PART-COUNT
           UNSTRING WS-RAW-DESC
               DELIMITED BY ', '
               INTO WS-DESC-PART1
                    WS-DESC-PART2
                    WS-DESC-PART3
               TALLYING IN WS-PART-COUNT
           END-UNSTRING.

       2300-PARSE-LEGACY.
           INITIALIZE WS-DESC-PARTS
           MOVE ZERO TO WS-PART-COUNT
           UNSTRING WS-RAW-DESC
               DELIMITED BY '-'
               INTO WS-ABBREV1
                    WS-ABBREV2
                    WS-ABBREV3
               TALLYING IN WS-PART-COUNT
           END-UNSTRING
           PERFORM 2310-EXPAND-ABBREVIATIONS.
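The paragraph 2310-EXPAND-ABBREVIATIONS is referenced above but not shown. A minimal sketch of what it might look like, assuming a small in-memory lookup table searched with SEARCH, follows. The table layout, the helper paragraph 2320-LOOKUP-ONE, the field sizes, and the WS-LOOKUP-* names are all hypothetical; only the codes OV, EST, and L3 and their expansions come from the sample input.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. ABBREXP.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Hypothetical table: 4-character code, 20-character text.
       01  WS-ABBREV-VALUES.
           05  FILLER PIC X(24) VALUE 'OV  OFFICE VISIT'.
           05  FILLER PIC X(24) VALUE 'EST ESTABLISHED PATIENT'.
           05  FILLER PIC X(24) VALUE 'L3  LEVEL 3'.
       01  WS-ABBREV-TABLE REDEFINES WS-ABBREV-VALUES.
           05  WS-ABBREV-ENTRY OCCURS 3 TIMES
                               INDEXED BY WS-AX.
               10  WS-ABBREV-CODE  PIC X(4).
               10  WS-ABBREV-TEXT  PIC X(20).
      * Sample values as 2300-PARSE-LEGACY would leave them.
       01  WS-ABBREV1        PIC X(4)  VALUE 'OV'.
       01  WS-ABBREV2        PIC X(4)  VALUE 'EST'.
       01  WS-ABBREV3        PIC X(4)  VALUE 'L3'.
       01  WS-DESC-PART1     PIC X(30).
       01  WS-DESC-PART2     PIC X(30).
       01  WS-DESC-PART3     PIC X(30).
       01  WS-LOOKUP-CODE    PIC X(4).
       01  WS-LOOKUP-TEXT    PIC X(20).
       PROCEDURE DIVISION.
       0000-MAIN.
           PERFORM 2310-EXPAND-ABBREVIATIONS
           DISPLAY WS-DESC-PART1
           DISPLAY WS-DESC-PART2
           DISPLAY WS-DESC-PART3
           STOP RUN.
       2310-EXPAND-ABBREVIATIONS.
           MOVE WS-ABBREV1 TO WS-LOOKUP-CODE
           PERFORM 2320-LOOKUP-ONE
           MOVE WS-LOOKUP-TEXT TO WS-DESC-PART1
           MOVE WS-ABBREV2 TO WS-LOOKUP-CODE
           PERFORM 2320-LOOKUP-ONE
           MOVE WS-LOOKUP-TEXT TO WS-DESC-PART2
           MOVE WS-ABBREV3 TO WS-LOOKUP-CODE
           PERFORM 2320-LOOKUP-ONE
           MOVE WS-LOOKUP-TEXT TO WS-DESC-PART3.
       2320-LOOKUP-ONE.
      * Serial search; an unknown code passes through unchanged.
           SET WS-AX TO 1
           SEARCH WS-ABBREV-ENTRY
               AT END
                   MOVE WS-LOOKUP-CODE TO WS-LOOKUP-TEXT
               WHEN WS-ABBREV-CODE(WS-AX) = WS-LOOKUP-CODE
                   MOVE WS-ABBREV-TEXT(WS-AX) TO WS-LOOKUP-TEXT
           END-SEARCH.
```

Passing unknown codes through unchanged is one design choice among several; a production version might instead flag the claim for review. In practice the table would more likely be loaded from a file or database than hard-coded.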
Stage 3: Normalization (STRING + INSPECT)
       3000-NORMALIZE-DESCRIPTION.
      * Uppercase all parts
           INSPECT WS-DESC-PART1
               CONVERTING 'abcdefghijklmnopqrstuvwxyz'
                       TO 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
           INSPECT WS-DESC-PART2
               CONVERTING 'abcdefghijklmnopqrstuvwxyz'
                       TO 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
           INSPECT WS-DESC-PART3
               CONVERTING 'abcdefghijklmnopqrstuvwxyz'
                       TO 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
      * Build normalized description
      * (each part below is delimited by TWO spaces)
           MOVE SPACES TO WS-NORM-DESC
           MOVE 1 TO WS-DESC-PTR
           STRING WS-DESC-PART1 DELIMITED BY '  '
                  ' - '         DELIMITED BY SIZE
                  WS-DESC-PART2 DELIMITED BY '  '
               INTO WS-NORM-DESC
               WITH POINTER WS-DESC-PTR
           END-STRING
           IF WS-PART-COUNT > 2
               STRING ' - '         DELIMITED BY SIZE
                      WS-DESC-PART3 DELIMITED BY '  '
                   INTO WS-NORM-DESC
                   WITH POINTER WS-DESC-PTR
               END-STRING
           END-IF.
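The case study does not show the paragraph that ties the three stages together. One plausible driver, sketched here with a hypothetical paragraph name (1000-NORMALIZE-CLAIM) and a hypothetical reject counter, routes on WS-FORMAT-TYPE and never parses input whose format was not recognized. The stage paragraphs are reduced to stubs so the sketch compiles on its own.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. NORMDRV.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-FORMAT-TYPE    PIC X(3).
      * Hypothetical counter for claims that could not be routed.
       01  WS-REJECT-COUNT   PIC 9(7) VALUE ZERO.
       PROCEDURE DIVISION.
       1000-NORMALIZE-CLAIM.
           PERFORM 2000-DETECT-FORMAT
           EVALUATE WS-FORMAT-TYPE
               WHEN 'EDI'
                   PERFORM 2100-PARSE-EDI
                   PERFORM 3000-NORMALIZE-DESCRIPTION
               WHEN 'CMS'
                   PERFORM 2200-PARSE-CMS
                   PERFORM 3000-NORMALIZE-DESCRIPTION
               WHEN 'LEG'
                   PERFORM 2300-PARSE-LEGACY
                   PERFORM 3000-NORMALIZE-DESCRIPTION
               WHEN OTHER
      * Unknown input is counted and skipped, never parsed.
                   ADD 1 TO WS-REJECT-COUNT
           END-EVALUATE
           DISPLAY 'REJECTED SO FAR: ' WS-REJECT-COUNT
           STOP RUN.
      * Stubs standing in for the stage paragraphs shown earlier.
       2000-DETECT-FORMAT.
           MOVE 'EDI' TO WS-FORMAT-TYPE.
       2100-PARSE-EDI.
           DISPLAY 'PARSING EDI'.
       2200-PARSE-CMS.
           DISPLAY 'PARSING CMS-1500'.
       2300-PARSE-LEGACY.
           DISPLAY 'PARSING LEGACY'.
       3000-NORMALIZE-DESCRIPTION.
           DISPLAY 'NORMALIZING'.
```

Keeping the routing in one EVALUATE means a new format needs one new WHEN branch and one new parse paragraph, while Stage 3 stays unchanged.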
Results
After deploying the normalization pipeline:
| Metric | Before | After |
|---|---|---|
| Format variants in database | 847 unique description patterns | 312 normalized patterns |
| Reporting accuracy | 72% (duplicate categories due to format variations) | 96% (clean categorization) |
| Provider name matches | 81% (missed due to format differences) | 98% (normalized matching) |
| Processing time | N/A (new feature) | 3.2 seconds for 500K claims |
Key Insight: Format Detection Before Parsing
The most important design decision was Stage 1 — detecting the format before parsing. Rather than trying a single UNSTRING and hoping it works, the program uses INSPECT TALLYING to characterize the input, then applies the appropriate UNSTRING pattern.
Sarah Kim's observation: "Never assume your input is in the format you expect. Check first, parse second. INSPECT TALLYING is cheap — a failed parse that corrupts downstream data is expensive."
Discussion Questions
- The format detection in Stage 1 uses simple heuristics (counting pipes, checking prefixes). What could go wrong? How would you make it more robust?
- How would you handle a claim description that contains both pipes and commas? Which format takes priority?
- The provider name standardization must handle names from many cultures. "Dr. Priya Kapoor" and "Kim, Sarah J." follow Western conventions, but what about names like "Tomás Rivera" (accent) or "James Okafor-Smith" (hyphenated)? How would you extend the parsing?
- If MedClaim added a fourth format (JSON from a REST API), how would you extend the pipeline?
- Why is DELIMITED BY '  ' (two spaces) used instead of DELIMITED BY SPACE when building the normalized description?