Case Study 2: Name and Address Standardization at Heritage Life Insurance
Background
Heritage Life Insurance Company services 4.8 million policyholders across the United States. Their policyholder master file, maintained on an IBM z/OS mainframe, has accumulated data quality problems over three decades of manual entry, optical character recognition (OCR) ingestion, and batch merges from acquired companies. The same customer might appear as "JOHN Q. SMITH", "Smith, John Q", "JOHN QUINCY SMITH", or "J Q SMITH" depending on which data entry operator handled the original application.
In 2024, Heritage Life was ordered by their state insurance regulator to perform a comprehensive policyholder reconciliation. This required matching records across multiple systems -- and matching requires standardized names and addresses. A customer entered as "Robert J. McTavish Jr." in the policy system and "MCTAVISH, ROBERT J JR" in the claims system would not match unless both were normalized to a common format.
Priya Sharma, a senior COBOL batch developer, was assigned to build the Name and Address Standardization Program (NASP). The program would read the policyholder file, parse each name and address into component fields, normalize the components using consistent rules, and write a standardized output file suitable for matching.
The Problem
Priya cataloged the data quality issues in the policyholder file:
Name Field Problems
The name field is a single 50-character field (PIC X(50)) with no structure. Names appear in multiple formats:
| Input Name | Format Type |
|---|---|
| JOHN Q SMITH | First Middle-Initial Last |
| SMITH, JOHN Q | Last, First Middle-Initial |
| Smith, John Quincy | Last, First Middle (mixed case) |
| DR. MARIA L. GONZALEZ-REYES | Prefix, First, MI, Hyphenated Last |
| JAMES MCALLISTER III | First, Last with Mc prefix, Suffix |
| PATRICIA ANN O'BRIEN | First, Middle, Last with apostrophe |
| MRS ALICE B WONDERLAND-JONES JR | Prefix, First, MI, Hyphenated, Suffix |
Address Field Problems
The address occupies three lines of 30 characters each:
| Field | Example Problems |
|---|---|
| Address Line 1 | "123 N. Main St."; "123 NORTH MAIN STREET"; "123 N MAIN ST" |
| Address Line 2 | "Apt. 4B"; "APT 4B"; "#4B"; "UNIT 4-B"; or blank |
| City-State-ZIP | "SPRINGFIELD, IL 62704"; "Springfield IL 62704"; "SPRINGFIELD,IL62704" |
Standardization Rules
The regulator specified these normalization requirements:
- All output must be uppercase
- Names must be parsed into: Prefix, First, Middle, Last, Suffix
- Prefixes (MR, MRS, MS, DR, REV, HON) must be removed but recorded
- Suffixes (JR, SR, II, III, IV, ESQ, MD, PHD) must be separated from the last name
- Street abbreviations must be expanded (ST -> STREET, AVE -> AVENUE, etc.)
- Directional abbreviations must be standardized (N -> NORTH, S -> SOUTH, etc.)
- Periods must be removed from abbreviations
- Multiple spaces must be collapsed to single spaces
- Leading and trailing spaces must be removed
The Solution
IDENTIFICATION DIVISION.
PROGRAM-ID. NASP.
AUTHOR. PRIYA SHARMA.
DATE-WRITTEN. 2024-10-05.
*================================================================
* PROGRAM: NASP - NAME AND ADDRESS STANDARDIZATION
* PURPOSE: Parse and normalize policyholder names and
* addresses for matching across systems.
* Demonstrates UNSTRING, STRING, INSPECT,
* FUNCTION UPPER-CASE, and reference modification.
*================================================================
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
SELECT POLICY-INPUT-FILE
ASSIGN TO "POLICYIN"
ORGANIZATION IS SEQUENTIAL
FILE STATUS IS WS-INPUT-STATUS.
SELECT STANDARD-OUTPUT-FILE
ASSIGN TO "STDRDOUT"
ORGANIZATION IS SEQUENTIAL
FILE STATUS IS WS-OUTPUT-STATUS.
DATA DIVISION.
FILE SECTION.
FD POLICY-INPUT-FILE
RECORDING MODE IS F
RECORD CONTAINS 200 CHARACTERS.
01 FS-INPUT-RECORD.
05 FS-IN-POLICY-NO PIC X(10).
05 FS-IN-NAME PIC X(50).
05 FS-IN-ADDR-LINE1 PIC X(30).
05 FS-IN-ADDR-LINE2 PIC X(30).
05 FS-IN-CITY-ST-ZIP PIC X(30).
05 FS-IN-PHONE PIC X(10).
05 FS-IN-DOB PIC 9(8).
05 FILLER PIC X(32).
FD STANDARD-OUTPUT-FILE
RECORDING MODE IS F
RECORD CONTAINS 250 CHARACTERS.
01 FS-OUTPUT-RECORD.
05 FS-OUT-POLICY-NO PIC X(10).
05 FS-OUT-NAME-PREFIX PIC X(4).
05 FS-OUT-FIRST-NAME PIC X(20).
05 FS-OUT-MIDDLE-NAME PIC X(20).
05 FS-OUT-LAST-NAME PIC X(25).
05 FS-OUT-NAME-SUFFIX PIC X(5).
05 FS-OUT-ADDR-NUMBER PIC X(10).
05 FS-OUT-ADDR-STREET PIC X(30).
05 FS-OUT-ADDR-UNIT PIC X(10).
05 FS-OUT-CITY PIC X(20).
05 FS-OUT-STATE PIC X(2).
05 FS-OUT-ZIP PIC X(10).
05 FS-OUT-PHONE PIC X(10).
05 FS-OUT-DOB PIC 9(8).
05 FS-OUT-STD-NAME-KEY PIC X(40).
05 FS-OUT-STD-STATUS PIC X(1).
05 FILLER PIC X(25).
WORKING-STORAGE SECTION.
*----------------------------------------------------------------
* FILE STATUS
*----------------------------------------------------------------
01 WS-INPUT-STATUS PIC X(2).
88 INPUT-OK VALUE "00".
88 INPUT-EOF VALUE "10".
01 WS-OUTPUT-STATUS PIC X(2).
88 OUTPUT-OK VALUE "00".
*----------------------------------------------------------------
* NAME PARSING WORK FIELDS
*----------------------------------------------------------------
01 WS-NAME-WORK PIC X(50).
01 WS-NAME-UPPER PIC X(50).
01 WS-NAME-PARTS.
05 WS-WORD PIC X(25)
OCCURS 8 TIMES.
01 WS-WORD-COUNT PIC 9(2).
01 WS-UNSTR-PTR PIC 9(3).
01 WS-STR-PTR PIC 9(3).
01 WS-COMMA-FOUND PIC X(1).
88 HAS-COMMA VALUE 'Y'.
88 NO-COMMA VALUE 'N'.
*----------------------------------------------------------------
* PARSED NAME COMPONENTS
*----------------------------------------------------------------
01 WS-PARSED-NAME.
05 WS-P-PREFIX PIC X(4).
05 WS-P-FIRST PIC X(20).
05 WS-P-MIDDLE PIC X(20).
05 WS-P-LAST PIC X(25).
05 WS-P-SUFFIX PIC X(5).
*----------------------------------------------------------------
* PREFIX AND SUFFIX TABLES
*----------------------------------------------------------------
01 WS-PREFIX-TABLE-DATA.
05 FILLER PIC X(4) VALUE "MR ".
05 FILLER PIC X(4) VALUE "MRS ".
05 FILLER PIC X(4) VALUE "MS ".
05 FILLER PIC X(4) VALUE "DR ".
05 FILLER PIC X(4) VALUE "REV ".
05 FILLER PIC X(4) VALUE "HON ".
01 WS-PREFIX-TABLE REDEFINES WS-PREFIX-TABLE-DATA.
05 WS-KNOWN-PREFIX PIC X(4)
OCCURS 6 TIMES.
01 WS-PFX-IDX PIC 9(2).
01 WS-SUFFIX-TABLE-DATA.
05 FILLER PIC X(5) VALUE "JR ".
05 FILLER PIC X(5) VALUE "SR ".
05 FILLER PIC X(5) VALUE "II ".
05 FILLER PIC X(5) VALUE "III ".
05 FILLER PIC X(5) VALUE "IV ".
05 FILLER PIC X(5) VALUE "ESQ ".
05 FILLER PIC X(5) VALUE "MD ".
05 FILLER PIC X(5) VALUE "PHD ".
01 WS-SUFFIX-TABLE REDEFINES WS-SUFFIX-TABLE-DATA.
05 WS-KNOWN-SUFFIX PIC X(5)
OCCURS 8 TIMES.
01 WS-SFX-IDX PIC 9(2).
*----------------------------------------------------------------
* ADDRESS PARSING WORK FIELDS
*----------------------------------------------------------------
01 WS-ADDR-WORK PIC X(30).
01 WS-ADDR-PARTS.
05 WS-ADDR-WORD PIC X(15)
OCCURS 6 TIMES.
01 WS-ADDR-WORD-COUNT PIC 9(2).
*----------------------------------------------------------------
* STREET ABBREVIATION EXPANSION TABLE
* Format: 5-char abbreviation + 15-char expansion
*----------------------------------------------------------------
01 WS-STREET-ABBREV-DATA.
05 FILLER PIC X(20) VALUE "ST   STREET         ".
05 FILLER PIC X(20) VALUE "AVE  AVENUE         ".
05 FILLER PIC X(20) VALUE "BLVD BOULEVARD      ".
05 FILLER PIC X(20) VALUE "DR   DRIVE          ".
05 FILLER PIC X(20) VALUE "LN   LANE           ".
05 FILLER PIC X(20) VALUE "RD   ROAD           ".
05 FILLER PIC X(20) VALUE "CT   COURT          ".
05 FILLER PIC X(20) VALUE "PL   PLACE          ".
05 FILLER PIC X(20) VALUE "CIR  CIRCLE         ".
05 FILLER PIC X(20) VALUE "PKY  PARKWAY        ".
05 FILLER PIC X(20) VALUE "PKWY PARKWAY        ".
05 FILLER PIC X(20) VALUE "HWY  HIGHWAY        ".
01 WS-STREET-ABBREV-TABLE
REDEFINES WS-STREET-ABBREV-DATA.
05 WS-STREET-ENTRY OCCURS 12 TIMES.
10 WS-ABBREV PIC X(5).
10 WS-EXPANSION PIC X(15).
*----------------------------------------------------------------
* DIRECTIONAL ABBREVIATION TABLE
*----------------------------------------------------------------
01 WS-DIR-ABBREV-DATA.
05 FILLER PIC X(10) VALUE "N    NORTH".
05 FILLER PIC X(10) VALUE "S    SOUTH".
05 FILLER PIC X(10) VALUE "E    EAST ".
05 FILLER PIC X(10) VALUE "W    WEST ".
05 FILLER PIC X(10) VALUE "NE   NE   ".
05 FILLER PIC X(10) VALUE "NW   NW   ".
05 FILLER PIC X(10) VALUE "SE   SE   ".
05 FILLER PIC X(10) VALUE "SW   SW   ".
01 WS-DIR-TABLE REDEFINES WS-DIR-ABBREV-DATA.
05 WS-DIR-ENTRY OCCURS 8 TIMES.
10 WS-DIR-ABBREV PIC X(5).
10 WS-DIR-EXPAND PIC X(5).
*----------------------------------------------------------------
* CITY-STATE-ZIP PARSING
*----------------------------------------------------------------
01 WS-CSZ-WORK PIC X(30).
01 WS-CSZ-CITY PIC X(20).
01 WS-CSZ-STATE PIC X(2).
01 WS-CSZ-ZIP PIC X(10).
*----------------------------------------------------------------
* GENERAL WORK FIELDS
*----------------------------------------------------------------
01 WS-IDX PIC 9(3).
01 WS-IDX-2 PIC 9(3).
01 WS-TEMP-WORD PIC X(25).
01 WS-SCAN-POS PIC 9(3).
01 WS-TRIM-RESULT PIC X(50).
01 WS-IS-PREFIX PIC X(1).
88 WORD-IS-PREFIX VALUE 'Y'.
88 WORD-NOT-PREFIX VALUE 'N'.
01 WS-IS-SUFFIX PIC X(1).
88 WORD-IS-SUFFIX VALUE 'Y'.
88 WORD-NOT-SUFFIX VALUE 'N'.
*----------------------------------------------------------------
* COUNTERS
*----------------------------------------------------------------
01 WS-COUNTERS.
05 WS-TOTAL-READ PIC S9(7) COMP-3 VALUE 0.
05 WS-TOTAL-WRITTEN PIC S9(7) COMP-3 VALUE 0.
05 WS-NAMES-STANDARDIZED PIC S9(7) COMP-3 VALUE 0.
05 WS-ADDRS-STANDARDIZED PIC S9(7) COMP-3 VALUE 0.
05 WS-PARSE-ERRORS PIC S9(7) COMP-3 VALUE 0.
01 WS-DISP-COUNT PIC Z,ZZZ,ZZ9.
PROCEDURE DIVISION.
0000-MAIN-CONTROL.
PERFORM 1000-INITIALIZE
PERFORM 2000-PROCESS-RECORDS
UNTIL INPUT-EOF
PERFORM 8000-PRINT-STATISTICS
PERFORM 9000-FINALIZE
STOP RUN
.
1000-INITIALIZE.
DISPLAY "============================================="
DISPLAY " HERITAGE LIFE INSURANCE COMPANY"
DISPLAY " NAME AND ADDRESS STANDARDIZATION"
DISPLAY "============================================="
OPEN INPUT POLICY-INPUT-FILE
OUTPUT STANDARD-OUTPUT-FILE
IF NOT INPUT-OK
DISPLAY "ERROR: Cannot open input. Status: "
WS-INPUT-STATUS
STOP RUN
END-IF
IF NOT OUTPUT-OK
DISPLAY "ERROR: Cannot open output. Status: "
WS-OUTPUT-STATUS
STOP RUN
END-IF
PERFORM 2100-READ-INPUT
.
2000-PROCESS-RECORDS.
ADD 1 TO WS-TOTAL-READ
INITIALIZE FS-OUTPUT-RECORD
MOVE FS-IN-POLICY-NO TO FS-OUT-POLICY-NO
MOVE FS-IN-PHONE TO FS-OUT-PHONE
MOVE FS-IN-DOB TO FS-OUT-DOB
* Parse and standardize the name
PERFORM 3000-STANDARDIZE-NAME
* Parse and standardize the address
PERFORM 4000-STANDARDIZE-ADDRESS
* Build the matching key
PERFORM 5000-BUILD-MATCH-KEY
* Write the output record
PERFORM 6000-WRITE-OUTPUT
PERFORM 2100-READ-INPUT
.
2100-READ-INPUT.
READ POLICY-INPUT-FILE
AT END SET INPUT-EOF TO TRUE
END-READ
.
3000-STANDARDIZE-NAME.
* -------------------------------------------------------
* Step 1: Convert to uppercase and remove periods
* -------------------------------------------------------
MOVE FUNCTION UPPER-CASE(FS-IN-NAME)
TO WS-NAME-UPPER
* Remove all periods (DR. -> DR, J. -> J)
INSPECT WS-NAME-UPPER
REPLACING ALL "." BY " "
* Replace apostrophes with spaces (O'BRIEN -> O BRIEN).
* INSPECT REPLACING cannot delete a character, so the
* surname becomes two words; nothing later rejoins them.
INSPECT WS-NAME-UPPER
REPLACING ALL "'" BY " "
* -------------------------------------------------------
* Step 2: Determine format (comma = Last, First)
* -------------------------------------------------------
MOVE ZERO TO WS-SCAN-POS
INSPECT WS-NAME-UPPER
TALLYING WS-SCAN-POS FOR ALL ","
IF WS-SCAN-POS > 0
SET HAS-COMMA TO TRUE
ELSE
SET NO-COMMA TO TRUE
END-IF
* Remove commas after detecting format
INSPECT WS-NAME-UPPER
REPLACING ALL "," BY " "
* -------------------------------------------------------
* Step 3: Split into individual words using UNSTRING
* -------------------------------------------------------
INITIALIZE WS-NAME-PARTS
MOVE ZERO TO WS-WORD-COUNT
MOVE 1 TO WS-UNSTR-PTR
UNSTRING WS-NAME-UPPER
DELIMITED BY ALL SPACES
INTO WS-WORD(1) WS-WORD(2) WS-WORD(3)
WS-WORD(4) WS-WORD(5) WS-WORD(6)
WS-WORD(7) WS-WORD(8)
WITH POINTER WS-UNSTR-PTR
TALLYING IN WS-WORD-COUNT
END-UNSTRING
* -------------------------------------------------------
* Step 4: Identify and extract prefix (first word)
* -------------------------------------------------------
INITIALIZE WS-PARSED-NAME
SET WORD-NOT-PREFIX TO TRUE
IF WS-WORD-COUNT > 0
PERFORM VARYING WS-PFX-IDX FROM 1 BY 1
UNTIL WS-PFX-IDX > 6
OR WORD-IS-PREFIX
* Compare the whole word (space-padded by COBOL) so
* that "MRS" does not match "MR" and a first name
* such as "DREW" does not match "DR".
IF WS-WORD(1) = WS-KNOWN-PREFIX(WS-PFX-IDX)
SET WORD-IS-PREFIX TO TRUE
MOVE WS-KNOWN-PREFIX(WS-PFX-IDX)
TO WS-P-PREFIX
END-IF
END-PERFORM
END-IF
* -------------------------------------------------------
* Step 5: Identify and extract suffix (last word)
* -------------------------------------------------------
SET WORD-NOT-SUFFIX TO TRUE
IF WS-WORD-COUNT > 1
PERFORM VARYING WS-SFX-IDX FROM 1 BY 1
UNTIL WS-SFX-IDX > 8
OR WORD-IS-SUFFIX
* Compare the whole word (space-padded by COBOL) so
* that "III" does not match the shorter entry "II".
IF WS-WORD(WS-WORD-COUNT) =
WS-KNOWN-SUFFIX(WS-SFX-IDX)
SET WORD-IS-SUFFIX TO TRUE
MOVE WS-KNOWN-SUFFIX(WS-SFX-IDX)
TO WS-P-SUFFIX
END-IF
END-PERFORM
END-IF
* -------------------------------------------------------
* Step 6: Assign remaining words based on format
* -------------------------------------------------------
* Calculate first and last name word positions
MOVE 1 TO WS-IDX
IF WORD-IS-PREFIX
ADD 1 TO WS-IDX
END-IF
MOVE WS-WORD-COUNT TO WS-IDX-2
IF WORD-IS-SUFFIX
SUBTRACT 1 FROM WS-IDX-2
END-IF
IF HAS-COMMA
* Comma format: first word(s) = last name
* remaining words = first name, middle
MOVE WS-WORD(WS-IDX) TO WS-P-LAST
IF WS-IDX + 1 <= WS-IDX-2
MOVE WS-WORD(WS-IDX + 1) TO WS-P-FIRST
END-IF
IF WS-IDX + 2 <= WS-IDX-2
MOVE WS-WORD(WS-IDX + 2) TO WS-P-MIDDLE
END-IF
ELSE
* Standard format: first middle last
IF WS-IDX <= WS-IDX-2
MOVE WS-WORD(WS-IDX) TO WS-P-FIRST
END-IF
IF WS-IDX-2 > WS-IDX
MOVE WS-WORD(WS-IDX-2) TO WS-P-LAST
END-IF
IF WS-IDX + 1 < WS-IDX-2
MOVE WS-WORD(WS-IDX + 1) TO WS-P-MIDDLE
END-IF
IF WS-IDX = WS-IDX-2
* Only one name word -- treat as last name
MOVE WS-P-FIRST TO WS-P-LAST
MOVE SPACES TO WS-P-FIRST
END-IF
END-IF
* Move parsed fields to output
MOVE WS-P-PREFIX TO FS-OUT-NAME-PREFIX
MOVE WS-P-FIRST TO FS-OUT-FIRST-NAME
MOVE WS-P-MIDDLE TO FS-OUT-MIDDLE-NAME
MOVE WS-P-LAST TO FS-OUT-LAST-NAME
MOVE WS-P-SUFFIX TO FS-OUT-NAME-SUFFIX
ADD 1 TO WS-NAMES-STANDARDIZED
.
4000-STANDARDIZE-ADDRESS.
* -------------------------------------------------------
* Step 1: Standardize Address Line 1
* -------------------------------------------------------
MOVE FUNCTION UPPER-CASE(FS-IN-ADDR-LINE1)
TO WS-ADDR-WORK
* Remove periods from abbreviations
INSPECT WS-ADDR-WORK
REPLACING ALL "." BY " "
* Split address line into words
INITIALIZE WS-ADDR-PARTS
MOVE ZERO TO WS-ADDR-WORD-COUNT
MOVE 1 TO WS-UNSTR-PTR
UNSTRING WS-ADDR-WORK
DELIMITED BY ALL SPACES
INTO WS-ADDR-WORD(1) WS-ADDR-WORD(2)
WS-ADDR-WORD(3) WS-ADDR-WORD(4)
WS-ADDR-WORD(5) WS-ADDR-WORD(6)
WITH POINTER WS-UNSTR-PTR
TALLYING IN WS-ADDR-WORD-COUNT
END-UNSTRING
* First word is usually the street number
IF WS-ADDR-WORD-COUNT > 0
MOVE WS-ADDR-WORD(1) TO FS-OUT-ADDR-NUMBER
END-IF
* Expand directional abbreviations (2nd word)
IF WS-ADDR-WORD-COUNT > 1
PERFORM VARYING WS-IDX FROM 1 BY 1
UNTIL WS-IDX > 8
IF FUNCTION TRIM(WS-ADDR-WORD(2)) =
FUNCTION TRIM(WS-DIR-ABBREV(WS-IDX))
MOVE WS-DIR-EXPAND(WS-IDX)
TO WS-ADDR-WORD(2)
END-IF
END-PERFORM
END-IF
* Expand street type abbreviation (last word)
IF WS-ADDR-WORD-COUNT > 2
PERFORM VARYING WS-IDX FROM 1 BY 1
UNTIL WS-IDX > 12
IF FUNCTION TRIM(
WS-ADDR-WORD(WS-ADDR-WORD-COUNT)) =
FUNCTION TRIM(WS-ABBREV(WS-IDX))
MOVE WS-EXPANSION(WS-IDX)
TO WS-ADDR-WORD(WS-ADDR-WORD-COUNT)
END-IF
END-PERFORM
END-IF
* Reassemble the street address (without house number)
MOVE SPACES TO FS-OUT-ADDR-STREET
MOVE 1 TO WS-STR-PTR
PERFORM VARYING WS-IDX FROM 2 BY 1
UNTIL WS-IDX > WS-ADDR-WORD-COUNT
IF WS-IDX > 2
STRING " " DELIMITED BY SIZE
INTO FS-OUT-ADDR-STREET
WITH POINTER WS-STR-PTR
END-STRING
END-IF
STRING FUNCTION TRIM(WS-ADDR-WORD(WS-IDX))
DELIMITED BY SIZE
INTO FS-OUT-ADDR-STREET
WITH POINTER WS-STR-PTR
END-STRING
END-PERFORM
* -------------------------------------------------------
* Step 2: Standardize Address Line 2 (unit/apartment)
* -------------------------------------------------------
IF FS-IN-ADDR-LINE2 NOT = SPACES
MOVE FUNCTION UPPER-CASE(FS-IN-ADDR-LINE2)
TO WS-ADDR-WORK
INSPECT WS-ADDR-WORK
REPLACING ALL "." BY " "
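* Standardize unit designators. INSPECT REPLACING needs
* equal-length operands, so "#" (1 char) cannot become
* "UNIT " (5 chars) there; it is rebuilt with STRING.
INSPECT WS-ADDR-WORK
REPLACING ALL "APT " BY "UNIT"
ALL "SUITE" BY "UNIT "
IF WS-ADDR-WORK(1:1) = "#"
MOVE WS-ADDR-WORK(2:25) TO WS-TRIM-RESULT
MOVE SPACES TO WS-ADDR-WORK
STRING "UNIT " DELIMITED BY SIZE
WS-TRIM-RESULT(1:25) DELIMITED BY SIZE
INTO WS-ADDR-WORK
END-STRING
END-IF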
MOVE FUNCTION TRIM(WS-ADDR-WORK)
TO FS-OUT-ADDR-UNIT
END-IF
* -------------------------------------------------------
* Step 3: Parse City, State, ZIP
* -------------------------------------------------------
PERFORM 4100-PARSE-CITY-STATE-ZIP
ADD 1 TO WS-ADDRS-STANDARDIZED
.
4100-PARSE-CITY-STATE-ZIP.
* -------------------------------------------------------
* Parse the city-state-zip field which may appear as:
* "SPRINGFIELD, IL 62704"
* "SPRINGFIELD IL 62704"
* "SPRINGFIELD,IL62704"
* Uses a right-to-left scan with reference modification.
* -------------------------------------------------------
MOVE FUNCTION UPPER-CASE(FS-IN-CITY-ST-ZIP)
TO WS-CSZ-WORK
INITIALIZE WS-CSZ-CITY
INITIALIZE WS-CSZ-STATE
INITIALIZE WS-CSZ-ZIP
* Replace commas with spaces for uniform parsing
INSPECT WS-CSZ-WORK
REPLACING ALL "," BY " "
* Scan from the right using reference modification:
* ZIP is last, state is second-to-last, city is the rest.
* Extract ZIP code (the rightmost run of digits).
* The loop must keep scanning after the first digit is
* seen, so it ends through EXIT PERFORM rather than by
* testing WS-SCAN-POS in the UNTIL clause.
MOVE ZERO TO WS-SCAN-POS
PERFORM VARYING WS-IDX FROM 30 BY -1
UNTIL WS-IDX < 1
IF WS-CSZ-WORK(WS-IDX:1) >= "0"
AND WS-CSZ-WORK(WS-IDX:1) <= "9"
IF WS-SCAN-POS = 0
MOVE WS-IDX TO WS-SCAN-POS
END-IF
ELSE
IF WS-SCAN-POS > 0
* Found end of ZIP; extract it
MOVE WS-CSZ-WORK(WS-IDX + 1:
WS-SCAN-POS - WS-IDX)
TO WS-CSZ-ZIP
MOVE SPACES TO
WS-CSZ-WORK(WS-IDX + 1:
WS-SCAN-POS - WS-IDX)
MOVE WS-IDX TO WS-SCAN-POS
EXIT PERFORM
END-IF
END-IF
END-PERFORM
* Extract state code (2 uppercase letters)
MOVE ZERO TO WS-SCAN-POS
PERFORM VARYING WS-IDX FROM 28 BY -1
UNTIL WS-IDX < 1 OR WS-SCAN-POS > 0
IF WS-CSZ-WORK(WS-IDX:1) >= "A"
AND WS-CSZ-WORK(WS-IDX:1) <= "Z"
AND WS-CSZ-WORK(WS-IDX + 1:1) >= "A"
AND WS-CSZ-WORK(WS-IDX + 1:1) <= "Z"
AND (WS-IDX = 1
OR WS-CSZ-WORK(WS-IDX - 1:1) = SPACE)
MOVE WS-CSZ-WORK(WS-IDX:2) TO WS-CSZ-STATE
MOVE SPACES TO WS-CSZ-WORK(WS-IDX:2)
MOVE WS-IDX TO WS-SCAN-POS
END-IF
END-PERFORM
* Everything remaining is the city name
MOVE FUNCTION TRIM(WS-CSZ-WORK)
TO WS-CSZ-CITY
MOVE WS-CSZ-CITY TO FS-OUT-CITY
MOVE WS-CSZ-STATE TO FS-OUT-STATE
MOVE WS-CSZ-ZIP TO FS-OUT-ZIP
.
5000-BUILD-MATCH-KEY.
* -------------------------------------------------------
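* Build the matching key from the standardized
* components, separated by pipes:
* LAST | FIRST-INITIAL | MIDDLE-INITIAL | ZIP | DOB
* The key field is 40 characters; a long last name or
* a 10-character ZIP can exceed that, in which case
* STRING stops at the end of the field.
* This key enables efficient matching across systems.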
* -------------------------------------------------------
MOVE SPACES TO FS-OUT-STD-NAME-KEY
MOVE 1 TO WS-STR-PTR
STRING
FUNCTION TRIM(FS-OUT-LAST-NAME)
DELIMITED BY SIZE
"|" DELIMITED BY SIZE
FS-OUT-FIRST-NAME(1:1)
DELIMITED BY SIZE
"|" DELIMITED BY SIZE
FS-OUT-MIDDLE-NAME(1:1)
DELIMITED BY SIZE
"|" DELIMITED BY SIZE
FUNCTION TRIM(FS-OUT-ZIP)
DELIMITED BY SIZE
"|" DELIMITED BY SIZE
FS-OUT-DOB
DELIMITED BY SIZE
INTO FS-OUT-STD-NAME-KEY
WITH POINTER WS-STR-PTR
END-STRING
MOVE "S" TO FS-OUT-STD-STATUS
.
6000-WRITE-OUTPUT.
WRITE FS-OUTPUT-RECORD
IF OUTPUT-OK
ADD 1 TO WS-TOTAL-WRITTEN
ELSE
DISPLAY "WRITE ERROR: " WS-OUTPUT-STATUS
" Policy: " FS-OUT-POLICY-NO
ADD 1 TO WS-PARSE-ERRORS
END-IF
.
8000-PRINT-STATISTICS.
DISPLAY " "
DISPLAY "============================================="
DISPLAY " STANDARDIZATION STATISTICS"
DISPLAY "============================================="
MOVE WS-TOTAL-READ TO WS-DISP-COUNT
DISPLAY " Records read: " WS-DISP-COUNT
MOVE WS-NAMES-STANDARDIZED TO WS-DISP-COUNT
DISPLAY " Names standardized: " WS-DISP-COUNT
MOVE WS-ADDRS-STANDARDIZED TO WS-DISP-COUNT
DISPLAY " Addresses standardized: " WS-DISP-COUNT
MOVE WS-TOTAL-WRITTEN TO WS-DISP-COUNT
DISPLAY " Records written: " WS-DISP-COUNT
MOVE WS-PARSE-ERRORS TO WS-DISP-COUNT
DISPLAY " Parse errors: " WS-DISP-COUNT
DISPLAY "============================================="
.
9000-FINALIZE.
CLOSE POLICY-INPUT-FILE
STANDARD-OUTPUT-FILE
DISPLAY " "
DISPLAY "NASP processing complete."
.
Solution Walkthrough
Name Parsing Strategy: Detect Format, Then Decompose
The name parser first determines whether the name is in "Last, First" format (contains a comma) or "First Last" format (no comma). The detection uses INSPECT TALLYING and has to run while the commas are still present, before they are replaced with spaces:
INSPECT WS-NAME-UPPER
TALLYING WS-SCAN-POS FOR ALL ","
After format detection, the commas are replaced with spaces by INSPECT REPLACING (periods and apostrophes were already replaced in Step 1), and the name is split into individual words with UNSTRING using DELIMITED BY ALL SPACES. The ALL keyword is critical -- it treats consecutive spaces as a single delimiter, preventing empty words from appearing in the result.
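The detect-then-split sequence can be tried in isolation. The following is a minimal sketch, not part of NASP: the program name SPLITDEMO, the data names, and the sample value are all hypothetical, and the table is cut down to four words for brevity.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. SPLITDEMO.
      * Hypothetical demo program; all names are illustrative.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-NAME            PIC X(50)
                              VALUE "SMITH,  JOHN   Q.".
       01  WS-COMMAS          PIC 9(3) VALUE 0.
       01  WS-WORDS.
           05  WS-WORD        PIC X(25) OCCURS 4 TIMES.
       01  WS-COUNT           PIC 9(2) VALUE 0.
       01  WS-I               PIC 9(2).
       PROCEDURE DIVISION.
       MAIN-PARA.
      * Count commas first, while they are still present.
           INSPECT WS-NAME TALLYING WS-COMMAS FOR ALL ","
      * Then turn punctuation into spaces.
           INSPECT WS-NAME REPLACING ALL "," BY " "
                                     ALL "." BY " "
      * ALL SPACES treats each run of blanks as one delimiter.
           INITIALIZE WS-WORDS
           UNSTRING WS-NAME DELIMITED BY ALL SPACES
               INTO WS-WORD(1) WS-WORD(2)
                    WS-WORD(3) WS-WORD(4)
               TALLYING IN WS-COUNT
           END-UNSTRING
           DISPLAY "COMMAS: " WS-COMMAS " WORDS: " WS-COUNT
           PERFORM VARYING WS-I FROM 1 BY 1
                   UNTIL WS-I > WS-COUNT
               DISPLAY WS-I ": " WS-WORD(WS-I)
           END-PERFORM
           STOP RUN.
```

Swapping the order of the first two INSPECT statements would leave the comma count at zero, which is why the tally has to come first.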
Prefix and Suffix Identification: Table-Driven Matching
Rather than hard-coding prefix and suffix checks, the program uses lookup tables. The first word is compared against the prefix table and the last word against the suffix table. This table-driven approach means adding "CPT" as a new prefix requires only a new table entry (with the OCCURS count and the loop limit raised to match), not new comparison logic.
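The lookup works best as a whole-word comparison. The sketch below assumes a cut-down three-entry table and hypothetical names (PFXDEMO, WS-PREFIX, WS-FOUND); it relies on the standard COBOL rule that alphanumeric operands of unequal length are compared as if the shorter one were padded with spaces.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. PFXDEMO.
      * Hypothetical demo program; all names are illustrative.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-PREFIX-DATA.
           05  FILLER         PIC X(4) VALUE "MR".
           05  FILLER         PIC X(4) VALUE "MRS".
           05  FILLER         PIC X(4) VALUE "DR".
       01  WS-PREFIX-TABLE REDEFINES WS-PREFIX-DATA.
           05  WS-PREFIX      PIC X(4) OCCURS 3 TIMES.
       01  WS-WORD            PIC X(25) VALUE "MRS".
       01  WS-I               PIC 9(2).
       01  WS-FOUND           PIC X(4) VALUE SPACES.
       PROCEDURE DIVISION.
       MAIN-PARA.
           PERFORM VARYING WS-I FROM 1 BY 1
                   UNTIL WS-I > 3 OR WS-FOUND NOT = SPACES
      * The 25-byte word and the 4-byte entry are compared
      * as if the entry were padded with spaces.
               IF WS-WORD = WS-PREFIX(WS-I)
                   MOVE WS-PREFIX(WS-I) TO WS-FOUND
               END-IF
           END-PERFORM
           DISPLAY "PREFIX: [" WS-FOUND "]"
           STOP RUN.
```

Because the whole word is compared, "MRS" is matched by the MRS entry and not by the shorter MR entry that precedes it in the table. A leading-characters comparison would stop at MR.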
Address Standardization: INSPECT REPLACING for Bulk Transformation
The address standardization uses INSPECT REPLACING ALL to perform bulk character substitutions in a single statement:
INSPECT WS-ADDR-WORK
REPLACING ALL "APT " BY "UNIT"
ALL "SUITE" BY "UNIT "
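This replaces two variations of the unit designator with the standard form "UNIT" in a single pass. Note the careful sizing of replacement strings to match the original -- INSPECT REPLACING requires the replacement string to be the same length as the search string. For that reason a one-character designator such as "#" cannot be replaced by the five-character "UNIT " in the same statement; it has to be handled separately, for example by rebuilding the field with STRING. The equal-length rule also leaves the results slightly uneven: "APT 4B" becomes "UNIT4B", while "SUITE 4B" becomes "UNIT  4B" with two spaces.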
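A minimal sketch of the equal-length rule, together with one possible way to handle a designator that is shorter than its replacement. The program name UNITDEMO and the data names are hypothetical. It assumes the "#" sits in the first position of the field and that the unit text after it fits in 25 characters.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. UNITDEMO.
      * Hypothetical demo program; all names are illustrative.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-LINE2           PIC X(30) VALUE "#4B".
       01  WS-REST            PIC X(25).
       PROCEDURE DIVISION.
       MAIN-PARA.
      * Equal-length pairs are fine for INSPECT REPLACING.
           INSPECT WS-LINE2
               REPLACING ALL "APT " BY "UNIT"
                         ALL "SUITE" BY "UNIT "
      * "#" is 1 character and "UNIT " is 5, so rebuild the
      * field with STRING instead.
           IF WS-LINE2(1:1) = "#"
               MOVE WS-LINE2(2:25) TO WS-REST
               MOVE SPACES TO WS-LINE2
               STRING "UNIT " DELIMITED BY SIZE
                      WS-REST DELIMITED BY SIZE
                   INTO WS-LINE2
               END-STRING
           END-IF
           DISPLAY "[" WS-LINE2 "]"
           STOP RUN.
```

The five-character literal plus the 25-character remainder fills the 30-character field exactly, so the STRING cannot overflow. Anything beyond position 26 of the original line is dropped.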
City-State-ZIP Parsing: Right-to-Left Scanning
The city-state-ZIP parser demonstrates a technique that compensates for UNSTRING's limitation of left-to-right scanning only. Since the ZIP code and state code are always at the end of the field, the parser scans from right to left using reference modification, extracting the ZIP first, then the state, and leaving the city as whatever remains.
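The backward scan can be sketched on its own as follows. This is an illustration under stated assumptions rather than the NASP paragraph itself: ZIPDEMO, WS-ZIP-START, and WS-ZIP-END are hypothetical names, and the ZIP is taken to be the rightmost unbroken run of digits.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. ZIPDEMO.
      * Hypothetical demo program; all names are illustrative.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-CSZ             PIC X(30)
                              VALUE "SPRINGFIELD IL62704".
       01  WS-ZIP             PIC X(10) VALUE SPACES.
       01  WS-I               PIC 9(3).
       01  WS-ZIP-END         PIC 9(3) VALUE 0.
       01  WS-ZIP-START       PIC 9(3) VALUE 0.
       PROCEDURE DIVISION.
       MAIN-PARA.
      * Walk backward one byte at a time.
           PERFORM VARYING WS-I FROM 30 BY -1 UNTIL WS-I < 1
               IF WS-CSZ(WS-I:1) IS NUMERIC
      * The first digit seen is the right edge of the ZIP.
                   IF WS-ZIP-END = 0
                       MOVE WS-I TO WS-ZIP-END
                   END-IF
                   MOVE WS-I TO WS-ZIP-START
               ELSE
      * A non-digit after digits ends the run.
                   IF WS-ZIP-END > 0
                       EXIT PERFORM
                   END-IF
               END-IF
           END-PERFORM
           IF WS-ZIP-END > 0
               MOVE WS-CSZ(WS-ZIP-START:
                           WS-ZIP-END - WS-ZIP-START + 1)
                   TO WS-ZIP
               MOVE SPACES TO WS-CSZ(WS-ZIP-START:
                           WS-ZIP-END - WS-ZIP-START + 1)
           END-IF
           DISPLAY "ZIP:  [" WS-ZIP "]"
           DISPLAY "REST: [" WS-CSZ "]"
           STOP RUN.
```

Tracking both edges of the digit run keeps the extraction outside the loop, so the case where the digits reach the first position of the field is handled the same way as any other. A hyphenated ZIP+4 would still yield only its last four digits under this assumption.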
Matching Key Construction: STRING with Delimiters
The matching key is built using STRING to concatenate standardized components with pipe delimiters. This produces a key like SMITH|J|Q|62704|19750315 that can be compared across systems regardless of how the original name was formatted.
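A small stand-alone sketch of the key construction, using the sample values from the key shown above. KEYDEMO and the data names are hypothetical. The ON OVERFLOW phrase is an addition for illustration that reports when the assembled key does not fit the 40-character field.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. KEYDEMO.
      * Hypothetical demo program; all names are illustrative.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-LAST            PIC X(25) VALUE "SMITH".
       01  WS-FIRST           PIC X(20) VALUE "JOHN".
       01  WS-MIDDLE          PIC X(20) VALUE "Q".
       01  WS-ZIP             PIC X(10) VALUE "62704".
       01  WS-DOB             PIC 9(8)  VALUE 19750315.
       01  WS-KEY             PIC X(40) VALUE SPACES.
       01  WS-PTR             PIC 9(3)  VALUE 1.
       PROCEDURE DIVISION.
       MAIN-PARA.
      * Concatenate the components with pipe separators.
           STRING FUNCTION TRIM(WS-LAST) DELIMITED BY SIZE
                  "|"                    DELIMITED BY SIZE
                  WS-FIRST(1:1)          DELIMITED BY SIZE
                  "|"                    DELIMITED BY SIZE
                  WS-MIDDLE(1:1)         DELIMITED BY SIZE
                  "|"                    DELIMITED BY SIZE
                  FUNCTION TRIM(WS-ZIP)  DELIMITED BY SIZE
                  "|"                    DELIMITED BY SIZE
                  WS-DOB                 DELIMITED BY SIZE
               INTO WS-KEY
               WITH POINTER WS-PTR
      * Raised when the key would run past 40 characters.
               ON OVERFLOW
                   DISPLAY "KEY TRUNCATED"
           END-STRING
           DISPLAY "[" WS-KEY "]"
           STOP RUN.
```

With these sample values the key is 24 characters long, so the overflow branch is not taken. A 25-character last name combined with a 10-character ZIP would need 49 characters and would trigger it.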
Lessons Learned
1. FUNCTION UPPER-CASE Simplifies Case Normalization
Converting to uppercase as the first step in every parsing operation eliminates case-sensitivity from all subsequent comparisons. Without this, every comparison would need to handle mixed case.
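A short sketch of the effect, comparing two spellings of the same name before and after case folding. CASEDEMO and the data names are hypothetical.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. CASEDEMO.
      * Hypothetical demo program; all names are illustrative.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-A               PIC X(20) VALUE "Smith, John Q".
       01  WS-B               PIC X(20) VALUE "SMITH, JOHN Q".
       PROCEDURE DIVISION.
       MAIN-PARA.
      * A raw comparison is case-sensitive.
           IF WS-A = WS-B
               DISPLAY "RAW COMPARE: EQUAL"
           ELSE
               DISPLAY "RAW COMPARE: DIFFERENT"
           END-IF
      * Folding both sides to uppercase removes the difference.
           IF FUNCTION UPPER-CASE(WS-A) = FUNCTION UPPER-CASE(WS-B)
               DISPLAY "UPPERCASED:  EQUAL"
           ELSE
               DISPLAY "UPPERCASED:  DIFFERENT"
           END-IF
           STOP RUN.
```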
2. INSPECT REPLACING Requires Equal-Length Strings
The restriction that replacement strings must be the same length as search strings is a common source of bugs. "APT " (4 chars) can be replaced by "UNIT" (4 chars), but "APT" (3 chars) cannot be replaced by "UNIT" (4 chars). Padding with spaces is necessary.
3. UNSTRING TALLYING Counts Fields, Not Delimiters
The TALLYING IN clause on UNSTRING counts the number of receiving fields populated, not the number of delimiters found. This is essential for knowing how many name words were extracted.
4. Right-to-Left Scanning Requires Reference Modification
UNSTRING always works left to right. When the structure of a field is known from the right (such as ZIP codes and state codes at the end of a city-state-ZIP field), reference modification with a backward loop is the practical solution.
5. Matching Keys Reduce Complex Comparisons to Simple Ones
By building a normalized matching key in the output record, downstream matching programs can use simple string comparison instead of repeating the normalization logic.
Discussion Questions
- The name parser handles "Last, First" and "First Last" formats but not "First MI Last, Suffix" (where the comma precedes the suffix rather than separating last from first). How would you extend the parser to detect and handle this additional format?
- INSPECT REPLACING ALL has a subtle behavior: replacements proceed left to right, and each character position is examined only once. What would happen if you tried to replace "  " (double space) with " " (single space) in a string containing four consecutive spaces? How would you collapse all multiple spaces to single spaces?
- The street abbreviation table stores both the abbreviation and its expansion. An alternative design would use INSPECT CONVERTING to map abbreviations. Why is INSPECT CONVERTING not suitable for this task?
- The matching key uses the first character of the first name and middle name. This means "JAMES" and "JOHN" would both produce "J". What are the trade-offs of using more characters? How would you balance match precision against false-positive rates?
- The program removes apostrophes from names (O'BRIEN becomes O BRIEN). This aids matching but loses information. How would you preserve the original name while still enabling apostrophe-insensitive matching?
- The city-state-ZIP parser assumes the state code is a two-letter abbreviation. How would it behave if the input contained a fully spelled state name like "ILLINOIS"? How would you extend the parser to handle this case?
- The program processes one record at a time. If Heritage Life needed to detect and merge duplicate policyholders (not just standardize records), what additional data structures and logic would be needed? Could this be done within a single-pass COBOL batch program, or would it require multiple passes?