Case Study 2: MedClaim Diagnosis Code Parser and Claim Description Formatter

Background

MedClaim Health Services receives claims from thousands of healthcare providers. Each claim contains diagnosis codes (ICD-10 format), procedure codes (CPT format), and free-text descriptions. The claims arrive in multiple formats:

EDI 837 format: Pipe-delimited electronic submissions
CMS-1500 format: Fixed-position fields based on the paper form
Legacy format: A comma-delimited format from MedClaim's older providers

Sarah Kim's team must normalize all these formats into MedClaim's internal fixed-format record structure. This normalization is almost entirely a string handling challenge.

The Problem

Three specific string handling challenges have emerged:

Challenge 1: ICD-10 Code Parsing

ICD-10 codes like "E11.65" must be parsed into category (E), etiology (11), and detail (65) for reporting and analytics. The detail portion is optional and variable-length (0 to 4 characters).
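
One way to approach this split is UNSTRING on the decimal point followed by reference modification on the three-character base. The sketch below is only an illustration under stated assumptions: every field name and size (WS-ICD-RAW, WS-ICD-BASE, and so on) is hypothetical, and the category/etiology split follows the simplified breakdown described above.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. ICD10DEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      *    All names and sizes are hypothetical, for illustration
       01  WS-ICD-RAW          PIC X(8)  VALUE 'E11.65'.
       01  WS-ICD-BASE         PIC X(3).
       01  WS-ICD-CATEGORY     PIC X(1).
       01  WS-ICD-ETIOLOGY     PIC X(2).
       01  WS-ICD-DETAIL       PIC X(4).
       PROCEDURE DIVISION.
       1000-PARSE-ICD10.
      *    Clear receivers: UNSTRING leaves unused ones untouched
           MOVE SPACES TO WS-ICD-BASE WS-ICD-DETAIL
      *    Split at the decimal point; a space ends the code
           UNSTRING WS-ICD-RAW
               DELIMITED BY '.' OR SPACE
               INTO WS-ICD-BASE
                    WS-ICD-DETAIL
           END-UNSTRING
      *    Reference modification splits the three-character base
           MOVE WS-ICD-BASE(1:1) TO WS-ICD-CATEGORY
           MOVE WS-ICD-BASE(2:2) TO WS-ICD-ETIOLOGY
           DISPLAY 'CATEGORY=' WS-ICD-CATEGORY
                   ' ETIOLOGY=' WS-ICD-ETIOLOGY
                   ' DETAIL=' WS-ICD-DETAIL
           STOP RUN.
```

For the sample value this should report category E, etiology 11, and detail 65. A code with no decimal point, such as "E11", leaves the detail field as spaces, which covers the zero-length case.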

Challenge 2: Multi-Format Claim Descriptions

Claim descriptions arrive in different formats and must be normalized:

EDI: "OFFICE VISIT|EST PATIENT|LEVEL 3" (pipe-delimited components)
CMS-1500: "OFFICE VISIT, EST PATIENT, LEVEL 3" (comma-separated)
Legacy: "OV-EST-L3" (abbreviated codes)

All must produce: "OFFICE VISIT - ESTABLISHED PATIENT - LEVEL 3"

Challenge 3: Provider Name Standardization

Provider names arrive in inconsistent formats:

"Dr. John Smith, MD"
"SMITH, JOHN A MD"
"john smith md"
"Smith,John"

All must be normalized to the form "SMITH, JOHN A" (uppercase, Last-First format, titles and credentials removed, middle initial kept when present).
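
The case study does not show the name-handling code, so the following is a minimal sketch rather than MedClaim's implementation. It assumes hypothetical field names and sizes, handles only the "DR." title and the "MD" credential, and treats a comma-free name as exactly "First Last". Middle names without a comma, multi-word surnames, and other credentials would need more rules.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. PROVNAME.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      *    All names and sizes are hypothetical, for illustration
       01  WS-RAW-NAME         PIC X(40) VALUE 'Dr. John Smith, MD'.
       01  WS-WORK-NAME        PIC X(40).
       01  WS-TEMP-NAME        PIC X(40).
       01  WS-LAST-NAME        PIC X(20).
       01  WS-FIRST-NAME       PIC X(20).
       01  WS-COMMA-CNT        PIC 9(2).
       01  WS-NORM-NAME        PIC X(40).
       PROCEDURE DIVISION.
       1000-NORMALIZE-NAME.
           MOVE FUNCTION UPPER-CASE(WS-RAW-NAME) TO WS-WORK-NAME
      *    Blank out the one credential handled here (MD)
           INSPECT WS-WORK-NAME
               REPLACING ALL ', MD' BY '    '
                         ALL ' MD ' BY '    '
      *    Drop a leading title; the temp field avoids an
      *    overlapping MOVE
           IF WS-WORK-NAME(1:4) = 'DR. '
               MOVE WS-WORK-NAME(5:) TO WS-TEMP-NAME
               MOVE WS-TEMP-NAME     TO WS-WORK-NAME
           END-IF
      *    A remaining comma means the name is already Last-First
           MOVE ZERO TO WS-COMMA-CNT
           INSPECT WS-WORK-NAME
               TALLYING WS-COMMA-CNT FOR ALL ','
           MOVE SPACES TO WS-LAST-NAME WS-FIRST-NAME
           IF WS-COMMA-CNT > 0
               UNSTRING WS-WORK-NAME
                   DELIMITED BY ', ' OR ','
                   INTO WS-LAST-NAME WS-FIRST-NAME
               END-UNSTRING
           ELSE
      *        Assumes exactly "FIRST LAST" when no comma remains
               UNSTRING WS-WORK-NAME
                   DELIMITED BY ALL SPACE
                   INTO WS-FIRST-NAME WS-LAST-NAME
               END-UNSTRING
           END-IF
      *    Rebuild as LAST, FIRST; two spaces end the first name so
      *    an embedded middle initial survives
           MOVE SPACES TO WS-NORM-NAME
           STRING WS-LAST-NAME  DELIMITED BY SPACE
                  ', '          DELIMITED BY SIZE
                  WS-FIRST-NAME DELIMITED BY '  '
                  INTO WS-NORM-NAME
           END-STRING
           DISPLAY WS-NORM-NAME
           STOP RUN.
```

With the sample value the program should display SMITH, JOHN. The credential is stripped before commas are counted because the comma in "Smith, MD" belongs to the credential, not to a Last-First separator.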

Solution: The Normalization Pipeline

James Okafor designs a three-stage pipeline, each stage using different string handling facilities:

Stage 1: Format Detection (INSPECT + Reference Modification)

       2000-DETECT-FORMAT.
      *    Count pipes to detect EDI format
           MOVE ZERO TO WS-PIPE-CNT
           INSPECT WS-RAW-DESC
               TALLYING WS-PIPE-CNT FOR ALL '|'
           IF WS-PIPE-CNT > 0
               MOVE 'EDI' TO WS-FORMAT-TYPE
               EXIT PARAGRAPH
           END-IF

      *    Check for comma-space pattern (CMS-1500)
           MOVE ZERO TO WS-COMMA-CNT
           INSPECT WS-RAW-DESC
               TALLYING WS-COMMA-CNT FOR ALL ', '
           IF WS-COMMA-CNT > 0
               MOVE 'CMS' TO WS-FORMAT-TYPE
               EXIT PARAGRAPH
           END-IF

      *    Check for known legacy abbreviation prefixes (Legacy)
           IF WS-RAW-DESC(1:2) = 'OV'
           OR WS-RAW-DESC(1:2) = 'ER'
           OR WS-RAW-DESC(1:3) = 'INP'
               MOVE 'LEG' TO WS-FORMAT-TYPE
           ELSE
               MOVE 'UNK' TO WS-FORMAT-TYPE
           END-IF.

Stage 2: Parsing by Format (UNSTRING)

       2100-PARSE-EDI.
           INITIALIZE WS-DESC-PARTS
      *    TALLYING IN adds to the counter, so reset it first
           MOVE ZERO TO WS-PART-COUNT
           UNSTRING WS-RAW-DESC
               DELIMITED BY '|'
               INTO WS-DESC-PART1
                    WS-DESC-PART2
                    WS-DESC-PART3
               TALLYING IN WS-PART-COUNT
           END-UNSTRING.

       2200-PARSE-CMS.
           INITIALIZE WS-DESC-PARTS
      *    TALLYING IN adds to the counter, so reset it first
           MOVE ZERO TO WS-PART-COUNT
           UNSTRING WS-RAW-DESC
               DELIMITED BY ', '
               INTO WS-DESC-PART1
                    WS-DESC-PART2
                    WS-DESC-PART3
               TALLYING IN WS-PART-COUNT
           END-UNSTRING.

       2300-PARSE-LEGACY.
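           INITIALIZE WS-DESC-PARTS
      *    Clear stale abbreviations and reset the counter, since
      *    UNSTRING skips unused receivers and TALLYING only adds
           MOVE SPACES TO WS-ABBREV1 WS-ABBREV2 WS-ABBREV3
           MOVE ZERO TO WS-PART-COUNT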
           UNSTRING WS-RAW-DESC
               DELIMITED BY '-'
               INTO WS-ABBREV1
                    WS-ABBREV2
                    WS-ABBREV3
               TALLYING IN WS-PART-COUNT
           END-UNSTRING
           PERFORM 2310-EXPAND-ABBREVIATIONS.
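
The 2310-EXPAND-ABBREVIATIONS paragraph is referenced above but not shown. One plausible shape for it is an EVALUATE lookup applied to each abbreviation in turn. In the sketch below, the field sizes, the work fields WS-ABBREV-IN and WS-EXPANDED, the helper paragraph 2311-EXPAND-ONE, and the three-entry code list are all assumptions made for illustration. A production version would more likely load the codes from a table.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. EXPANDAB.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      *    Sizes, work fields, and the code list are assumptions
       01  WS-RAW-DESC         PIC X(60) VALUE 'OV-EST-L3'.
       01  WS-PART-COUNT       PIC 9(2).
       01  WS-ABBREV1          PIC X(10).
       01  WS-ABBREV2          PIC X(10).
       01  WS-ABBREV3          PIC X(10).
       01  WS-DESC-PARTS.
           05  WS-DESC-PART1   PIC X(30).
           05  WS-DESC-PART2   PIC X(30).
           05  WS-DESC-PART3   PIC X(30).
       01  WS-ABBREV-IN        PIC X(10).
       01  WS-EXPANDED         PIC X(30).
       PROCEDURE DIVISION.
       2300-PARSE-LEGACY.
           INITIALIZE WS-DESC-PARTS
           MOVE SPACES TO WS-ABBREV1 WS-ABBREV2 WS-ABBREV3
           MOVE ZERO TO WS-PART-COUNT
           UNSTRING WS-RAW-DESC
               DELIMITED BY '-'
               INTO WS-ABBREV1
                    WS-ABBREV2
                    WS-ABBREV3
               TALLYING IN WS-PART-COUNT
           END-UNSTRING
           PERFORM 2310-EXPAND-ABBREVIATIONS
           DISPLAY WS-DESC-PART1
           DISPLAY WS-DESC-PART2
           DISPLAY WS-DESC-PART3
           STOP RUN.

       2310-EXPAND-ABBREVIATIONS.
      *    Run each abbreviation through the same lookup
           MOVE WS-ABBREV1 TO WS-ABBREV-IN
           PERFORM 2311-EXPAND-ONE
           MOVE WS-EXPANDED TO WS-DESC-PART1
           MOVE WS-ABBREV2 TO WS-ABBREV-IN
           PERFORM 2311-EXPAND-ONE
           MOVE WS-EXPANDED TO WS-DESC-PART2
           MOVE WS-ABBREV3 TO WS-ABBREV-IN
           PERFORM 2311-EXPAND-ONE
           MOVE WS-EXPANDED TO WS-DESC-PART3.

       2311-EXPAND-ONE.
           EVALUATE WS-ABBREV-IN
               WHEN 'OV'
                   MOVE 'OFFICE VISIT' TO WS-EXPANDED
               WHEN 'EST'
                   MOVE 'ESTABLISHED PATIENT' TO WS-EXPANDED
               WHEN 'L3'
                   MOVE 'LEVEL 3' TO WS-EXPANDED
               WHEN OTHER
      *            Unknown or blank code: pass through unchanged
                   MOVE WS-ABBREV-IN TO WS-EXPANDED
           END-EVALUATE.
```

For "OV-EST-L3" the three DISPLAY statements should show OFFICE VISIT, ESTABLISHED PATIENT, and LEVEL 3. Passing unknown codes through unchanged keeps the data visible for later review instead of silently dropping it.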

Stage 3: Normalization (STRING + INSPECT)

       3000-NORMALIZE-DESCRIPTION.
      *    Uppercase all parts
           INSPECT WS-DESC-PART1
               CONVERTING 'abcdefghijklmnopqrstuvwxyz'
                       TO 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
           INSPECT WS-DESC-PART2
               CONVERTING 'abcdefghijklmnopqrstuvwxyz'
                       TO 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
           INSPECT WS-DESC-PART3
               CONVERTING 'abcdefghijklmnopqrstuvwxyz'
                       TO 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

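      *    Build normalized description; the pointer carries the
      *    write position from one STRING to the next
           MOVE SPACES TO WS-NORM-DESC
           MOVE 1 TO WS-DESC-PTR
           STRING WS-DESC-PART1 DELIMITED BY '  '
                  INTO WS-NORM-DESC
                  WITH POINTER WS-DESC-PTR
           END-STRING

      *    Append later parts only when present, so a one-part
      *    description does not end in a dangling ' - '
           IF WS-PART-COUNT > 1
               STRING ' - '         DELIMITED BY SIZE
                      WS-DESC-PART2 DELIMITED BY '  '
                      INTO WS-NORM-DESC
                      WITH POINTER WS-DESC-PTR
               END-STRING
           END-IF

           IF WS-PART-COUNT > 2
               STRING ' - '         DELIMITED BY SIZE
                      WS-DESC-PART3 DELIMITED BY '  '
                      INTO WS-NORM-DESC
                      WITH POINTER WS-DESC-PTR
               END-STRING
           END-IF.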

Results

After deploying the normalization pipeline:

Metric	Before	After
Format variants in database	847 unique description patterns	312 normalized patterns
Reporting accuracy	72% (duplicate categories due to format variations)	96% (clean categorization)
Provider name matches	81% (missed due to format differences)	98% (normalized matching)
Processing time	N/A (new feature)	3.2 seconds for 500K claims

Key Insight: Format Detection Before Parsing

The most important design decision was Stage 1 — detecting the format before parsing. Rather than trying a single UNSTRING and hoping it works, the program uses INSPECT TALLYING to characterize the input, then applies the appropriate UNSTRING pattern.

Sarah Kim's observation: "Never assume your input is in the format you expect. Check first, parse second. INSPECT TALLYING is cheap — a failed parse that corrupts downstream data is expensive."
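
The "check first, parse second" rule can be sketched as a small driver that runs detection and then dispatches on the result. The driver paragraph 1000-PROCESS-DESCRIPTION, the field sizes, and the DISPLAY stubs standing in for the parse paragraphs are assumptions, and the detection paragraph is a condensed variant of Stage 1, not the original code.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. DISPATCH.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      *    Field sizes are assumptions; the case study omits them
       01  WS-RAW-DESC         PIC X(60)
               VALUE 'OFFICE VISIT|EST PATIENT|LEVEL 3'.
       01  WS-FORMAT-TYPE      PIC X(3).
       01  WS-PIPE-CNT         PIC 9(3).
       01  WS-COMMA-CNT        PIC 9(3).
       PROCEDURE DIVISION.
       1000-PROCESS-DESCRIPTION.
      *    Check first ...
           PERFORM 2000-DETECT-FORMAT
      *    ... parse second, and never parse an unknown format
           EVALUATE WS-FORMAT-TYPE
               WHEN 'EDI'
                   DISPLAY 'WOULD PERFORM 2100-PARSE-EDI'
               WHEN 'CMS'
                   DISPLAY 'WOULD PERFORM 2200-PARSE-CMS'
               WHEN 'LEG'
                   DISPLAY 'WOULD PERFORM 2300-PARSE-LEGACY'
               WHEN OTHER
                   DISPLAY 'REJECTED, UNKNOWN FORMAT: ' WS-RAW-DESC
           END-EVALUATE
           STOP RUN.

       2000-DETECT-FORMAT.
      *    Condensed Stage 1: one INSPECT pass fills both counters
           MOVE ZERO TO WS-PIPE-CNT WS-COMMA-CNT
           INSPECT WS-RAW-DESC TALLYING
               WS-PIPE-CNT  FOR ALL '|'
               WS-COMMA-CNT FOR ALL ', '
      *    WHEN order encodes the priority: pipes win over commas
           EVALUATE TRUE
               WHEN WS-PIPE-CNT > 0
                   MOVE 'EDI' TO WS-FORMAT-TYPE
               WHEN WS-COMMA-CNT > 0
                   MOVE 'CMS' TO WS-FORMAT-TYPE
               WHEN WS-RAW-DESC(1:2) = 'OV'
               WHEN WS-RAW-DESC(1:2) = 'ER'
               WHEN WS-RAW-DESC(1:3) = 'INP'
                   MOVE 'LEG' TO WS-FORMAT-TYPE
               WHEN OTHER
                   MOVE 'UNK' TO WS-FORMAT-TYPE
           END-EVALUATE.
```

With the sample EDI value the driver should take the 'EDI' branch. The WHEN OTHER branch is the important design choice: an unrecognized description is rejected explicitly rather than being pushed through a parser that was never meant for it.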

Discussion Questions

The format detection in Stage 1 uses simple heuristics (counting pipes, checking prefixes). What could go wrong? How would you make it more robust?
How would you handle a claim description that contains both pipes and commas? Which format takes priority?
The provider name standardization must handle names from many cultures. "Dr. Priya Kapoor" and "Kim, Sarah J." follow Western conventions, but what about names like "Tomás Rivera" (accent) or "James Okafor-Smith" (hyphenated)? How would you extend the parsing?
If MedClaim added a fourth format (JSON from a REST API), how would you extend the pipeline?
Why is DELIMITED BY '  ' (two spaces) used instead of DELIMITED BY SPACE when building the normalized description?