Case Study 2: Data Quality Assessment Tool

Background

Heritage Mutual Insurance Company maintains a customer master file containing 1.2 million policyholder records. The file feeds downstream systems including policy administration, claims processing, billing, actuarial reporting, and regulatory filings. Over the past decade, data quality in the customer master has slowly degraded due to inconsistent data entry practices across the company's twelve regional offices, two mergers that brought in records from incompatible systems, and a general absence of input validation in the original 1990s-era data entry programs.

The consequences are real: billing statements are returned as undeliverable because of malformed addresses, regulatory reports contain implausible values that trigger auditor inquiries, and customer service representatives struggle with records where names appear in all uppercase, phone numbers contain embedded letters, and date fields hold values like 00000000 or 99999999.

The CIO has authorized a data quality initiative, and the first deliverable is a COBOL batch program that reads the entire customer master file, analyzes every field for quality issues, and produces a comprehensive data quality report. The report must identify specific problems at the record level and produce aggregate statistics at the file level, including means, standard deviations, and frequency distributions for key numeric fields.

This case study follows the design and implementation of that data quality assessment tool, demonstrating the full range of COBOL string and numeric intrinsic functions.


Problem Statement

The customer master record has the following layout:

      ******************************************************************
      * COPYBOOK: HMI-CUST-RECORD
      * Heritage Mutual Insurance - Customer Master Record
      * Record Length: 350 bytes
      ******************************************************************
       01  HMI-CUST-RECORD.
           05  HMI-CUST-ID                PIC X(10).
           05  HMI-CUST-NAME.
               10  HMI-LAST-NAME          PIC X(30).
               10  HMI-FIRST-NAME         PIC X(20).
               10  HMI-MIDDLE-INIT        PIC X(01).
           05  HMI-ADDRESS.
               10  HMI-STREET             PIC X(40).
               10  HMI-CITY               PIC X(25).
               10  HMI-STATE              PIC X(02).
               10  HMI-ZIP-CODE           PIC X(10).
           05  HMI-PHONE                  PIC X(15).
           05  HMI-EMAIL                  PIC X(50).
           05  HMI-DOB                    PIC 9(08).
           05  HMI-GENDER                 PIC X(01).
           05  HMI-POLICY-COUNT           PIC 9(03).
           05  HMI-TOTAL-PREMIUM          PIC S9(09)V99 COMP-3.
           05  HMI-ANNUAL-INCOME          PIC S9(09)V99 COMP-3.
           05  HMI-CREDIT-SCORE           PIC 9(03).
           05  HMI-RISK-CLASS             PIC X(02).
           05  HMI-REGION-CODE            PIC X(04).
           05  HMI-AGENT-ID              PIC X(08).
           05  HMI-ENROLLMENT-DATE        PIC 9(08).
           05  HMI-LAST-CONTACT-DATE      PIC 9(08).
           05  HMI-STATUS                 PIC X(01).
           05  HMI-FILLER                 PIC X(69).

The data quality tool must check each record for the following issues:

Category Check Functions Used
Name Quality Blank name fields LENGTH, TRIM
Name Quality Name is all uppercase (data entry laziness) UPPER-CASE comparison
Name Quality Name contains non-alphabetic characters ORD
Address Quality Missing address components LENGTH, TRIM
Address Quality Invalid state code (not in lookup table) UPPER-CASE
Phone Quality Phone contains alphabetic characters ORD
Phone Quality Phone is all zeros or blanks TRIM
Email Quality Email is blank or missing @ symbol LENGTH, TRIM
Date Quality Invalid date of birth TEST-DATE-YYYYMMDD
Date Quality Implausible age (< 0 or > 120) INTEGER-OF-DATE, CURRENT-DATE
Date Quality Invalid enrollment date TEST-DATE-YYYYMMDD
Numeric Quality Credit score out of range (300-850) Numeric comparison
Numeric Quality Negative premium amount Numeric comparison
Statistics Mean, median, standard deviation of income MEAN, MEDIAN, STANDARD-DEVIATION
Statistics Mean credit score by region MEAN

Complete Program

       IDENTIFICATION DIVISION.
       PROGRAM-ID. HMIDQUAL.
      *================================================================*
      * HERITAGE MUTUAL INSURANCE - DATA QUALITY ASSESSMENT TOOL       *
      * Reads the customer master file, analyzes every field for       *
      * quality issues, and produces a detailed quality report with    *
      * aggregate statistics.                                          *
      *================================================================*

       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
           SELECT CUSTOMER-FILE
               ASSIGN TO CUSTMAST
               ORGANIZATION IS SEQUENTIAL
               FILE STATUS IS WS-CUST-STATUS.
           SELECT QUALITY-REPORT
               ASSIGN TO DQRPT
               ORGANIZATION IS SEQUENTIAL
               FILE STATUS IS WS-RPT-STATUS.
           SELECT DETAIL-LOG
               ASSIGN TO DQDETAIL
               ORGANIZATION IS SEQUENTIAL
               FILE STATUS IS WS-LOG-STATUS.

       DATA DIVISION.
       FILE SECTION.

       FD  CUSTOMER-FILE
           RECORDING MODE IS F
           RECORD CONTAINS 350 CHARACTERS.
       01  CUST-INPUT-RECORD               PIC X(350).

       FD  QUALITY-REPORT
           RECORDING MODE IS F
           RECORD CONTAINS 132 CHARACTERS.
       01  RPT-LINE                        PIC X(132).

       FD  DETAIL-LOG
           RECORDING MODE IS F
           RECORD CONTAINS 200 CHARACTERS.
       01  LOG-LINE                        PIC X(200).

       WORKING-STORAGE SECTION.

       01  WS-FILE-STATUSES.
           05  WS-CUST-STATUS              PIC X(02).
           05  WS-RPT-STATUS               PIC X(02).
           05  WS-LOG-STATUS               PIC X(02).

       01  WS-FLAGS.
           05  WS-EOF                      PIC X VALUE 'N'.
               88  END-OF-FILE                 VALUE 'Y'.
           05  WS-RECORD-CLEAN            PIC X VALUE 'Y'.
               88  RECORD-IS-CLEAN             VALUE 'Y'.
               88  RECORD-HAS-ISSUES           VALUE 'N'.

      * Working copy of customer record
       01  WS-CUST-RECORD.
           05  WS-CUST-ID                  PIC X(10).
           05  WS-CUST-NAME.
               10  WS-LAST-NAME            PIC X(30).
               10  WS-FIRST-NAME           PIC X(20).
               10  WS-MIDDLE-INIT          PIC X(01).
           05  WS-ADDRESS.
               10  WS-STREET               PIC X(40).
               10  WS-CITY                 PIC X(25).
               10  WS-STATE                PIC X(02).
               10  WS-ZIP-CODE             PIC X(10).
           05  WS-PHONE                    PIC X(15).
           05  WS-EMAIL                    PIC X(50).
           05  WS-DOB                      PIC 9(08).
           05  WS-GENDER                   PIC X(01).
           05  WS-POLICY-COUNT             PIC 9(03).
           05  WS-TOTAL-PREMIUM            PIC S9(09)V99 COMP-3.
           05  WS-ANNUAL-INCOME            PIC S9(09)V99 COMP-3.
           05  WS-CREDIT-SCORE             PIC 9(03).
           05  WS-RISK-CLASS               PIC X(02).
           05  WS-REGION-CODE              PIC X(04).
           05  WS-AGENT-ID                PIC X(08).
           05  WS-ENROLLMENT-DATE          PIC 9(08).
           05  WS-LAST-CONTACT-DATE        PIC 9(08).
           05  WS-STATUS                   PIC X(01).
           05  WS-FILLER                   PIC X(69).

      * Date and time
       01  WS-CURRENT-DATETIME.
           05  WS-TODAY.
               10  WS-TODAY-YYYY           PIC 9(04).
               10  WS-TODAY-MM             PIC 9(02).
               10  WS-TODAY-DD             PIC 9(02).
           05  WS-TIME-NOW.
               10  WS-TIME-HH             PIC 9(02).
               10  WS-TIME-MN             PIC 9(02).
               10  WS-TIME-SS             PIC 9(02).
               10  WS-TIME-HS             PIC 9(02).
           05  WS-GMT-OFFSET              PIC X(05).
       01  WS-TODAY-INT                    PIC 9(09).

      * Issue counters
       01  WS-COUNTERS.
           05  WS-TOTAL-RECORDS            PIC 9(09) VALUE 0.
           05  WS-CLEAN-RECORDS            PIC 9(09) VALUE 0.
           05  WS-DIRTY-RECORDS            PIC 9(09) VALUE 0.
           05  WS-TOTAL-ISSUES             PIC 9(09) VALUE 0.
           05  WS-NAME-BLANK               PIC 9(07) VALUE 0.
           05  WS-NAME-ALL-UPPER           PIC 9(07) VALUE 0.
           05  WS-NAME-BAD-CHARS           PIC 9(07) VALUE 0.
           05  WS-ADDR-MISSING             PIC 9(07) VALUE 0.
           05  WS-STATE-INVALID            PIC 9(07) VALUE 0.
           05  WS-PHONE-BAD                PIC 9(07) VALUE 0.
           05  WS-PHONE-BLANK              PIC 9(07) VALUE 0.
           05  WS-EMAIL-BLANK              PIC 9(07) VALUE 0.
           05  WS-EMAIL-NO-AT              PIC 9(07) VALUE 0.
           05  WS-DOB-INVALID              PIC 9(07) VALUE 0.
           05  WS-DOB-IMPLAUSIBLE          PIC 9(07) VALUE 0.
           05  WS-ENROLL-INVALID           PIC 9(07) VALUE 0.
           05  WS-CREDIT-OUT-RANGE         PIC 9(07) VALUE 0.
           05  WS-PREMIUM-NEGATIVE         PIC 9(07) VALUE 0.

      * Work fields for string analysis
       01  WS-STRING-WORK.
           05  WS-TRIMMED-LEN             PIC 9(05).
           05  WS-CHAR-IDX                PIC 9(05).
           05  WS-CHAR-ORD                PIC 9(05).
           05  WS-TEST-CHAR               PIC X(01).
           05  WS-HAS-ALPHA               PIC X VALUE 'N'.
               88  FOUND-ALPHA                VALUE 'Y'.
               88  NO-ALPHA-FOUND             VALUE 'N'.
           05  WS-HAS-AT-SIGN             PIC X VALUE 'N'.
               88  FOUND-AT-SIGN              VALUE 'Y'.
               88  NO-AT-SIGN                 VALUE 'N'.
           05  WS-UPPER-VERSION            PIC X(50).

      * Work fields for age calculation
       01  WS-AGE-WORK.
           05  WS-DOB-INT                  PIC 9(09).
           05  WS-AGE-DAYS                 PIC 9(07).
           05  WS-AGE-YEARS                PIC 9(03).

      * Income statistics accumulation (sample of up to 1000)
       01  WS-INCOME-STATS.
           05  WS-INCOME-COUNT             PIC 9(07) VALUE 0.
           05  WS-INCOME-SUM               PIC S9(15)V99 VALUE 0.
           05  WS-INCOME-MIN               PIC S9(09)V99
                                           VALUE 999999999.99.
           05  WS-INCOME-MAX               PIC S9(09)V99
                                           VALUE 0.
      * Accumulate up to 20 income values for MEAN/MEDIAN/SD demo
       01  WS-INCOME-SAMPLE.
           05  WS-SAMPLE-SIZE              PIC 9(04) VALUE 0.
           05  WS-SAMPLE-VALUES.
               10  WS-SAMPLE-VAL          PIC S9(09)V99
                   OCCURS 20 TIMES.

      * Credit score statistics by region (4 regions)
       01  WS-REGION-STATS.
           05  WS-REGION-ENTRY OCCURS 4 TIMES.
               10  WS-RGN-CODE            PIC X(04).
               10  WS-RGN-COUNT           PIC 9(07) VALUE 0.
               10  WS-RGN-SCORE-SUM       PIC 9(11) VALUE 0.
               10  WS-RGN-MEAN-SCORE      PIC 9(03)V99.

      * US State code validation table
       01  WS-STATE-TABLE.
           05  FILLER PIC X(100) VALUE
               'AL AK AZ AR CA CO CT DE FL GA '
             & 'HI ID IL IN IA KS KY LA ME MD '.
           05  FILLER PIC X(100) VALUE
               'MA MI MN MS MO MT NE NV NH NJ '
             & 'NM NY NC ND OH OK OR PA RI SC '.
           05  FILLER PIC X(50) VALUE
               'SD TN TX UT VT VA WA WV WI WY '.
           05  FILLER PIC X(10) VALUE 'DC PR VI  '.
       01  WS-STATE-TABLE-R REDEFINES WS-STATE-TABLE.
           05  WS-VALID-STATE              PIC X(03)
               OCCURS 53 TIMES.

       01  WS-STATE-IDX                    PIC 9(03).
       01  WS-STATE-FOUND                  PIC X VALUE 'N'.
           88  STATE-IS-VALID                  VALUE 'Y'.
           88  STATE-NOT-VALID                 VALUE 'N'.

      * Region initialization
       01  WS-RGN-IDX                      PIC 9(02).
       01  WS-REGION-NAMES.
           05  FILLER PIC X(04) VALUE 'NEAS'.
           05  FILLER PIC X(04) VALUE 'SEAS'.
           05  FILLER PIC X(04) VALUE 'MDWS'.
           05  FILLER PIC X(04) VALUE 'WEST'.
       01  WS-REGION-NAMES-R REDEFINES WS-REGION-NAMES.
           05  WS-RGN-NAME-ENTRY PIC X(04) OCCURS 4 TIMES.

      * Formatted output fields
       01  WS-FMT-COUNT                    PIC Z(8)9.
       01  WS-FMT-PCT                      PIC ZZ9.99.
       01  WS-FMT-AMOUNT                   PIC $$$,$$$,$$9.99.
       01  WS-FMT-SCORE                    PIC ZZ9.99.
       01  WS-ISSUE-DETAIL                 PIC X(200).
       01  WS-PCT-CALC                     PIC 9(05)V99.

       PROCEDURE DIVISION.
       0000-MAIN.
           PERFORM 1000-INITIALIZE
           PERFORM 2000-PROCESS-RECORDS
               UNTIL END-OF-FILE
           PERFORM 8000-COMPUTE-STATISTICS
           PERFORM 9000-WRITE-SUMMARY-REPORT
           PERFORM 9900-FINALIZE
           STOP RUN.

       1000-INITIALIZE.
           MOVE FUNCTION CURRENT-DATE TO WS-CURRENT-DATETIME
           COMPUTE WS-TODAY-INT =
               FUNCTION INTEGER-OF-DATE(WS-TODAY)

      *    Initialize region codes
           PERFORM VARYING WS-RGN-IDX FROM 1 BY 1
               UNTIL WS-RGN-IDX > 4
               MOVE WS-RGN-NAME-ENTRY(WS-RGN-IDX)
                   TO WS-RGN-CODE(WS-RGN-IDX)
           END-PERFORM

           OPEN INPUT  CUSTOMER-FILE
                OUTPUT QUALITY-REPORT
                       DETAIL-LOG

           IF WS-CUST-STATUS NOT = '00'
               DISPLAY 'ERROR OPENING CUSTOMER FILE: '
                   WS-CUST-STATUS
               STOP RUN
           END-IF

      *    Write report header
           MOVE ALL '=' TO RPT-LINE
           WRITE RPT-LINE
           MOVE SPACES TO RPT-LINE
           STRING 'HERITAGE MUTUAL INSURANCE - '
               'DATA QUALITY ASSESSMENT REPORT'
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE
           MOVE SPACES TO RPT-LINE
           STRING 'Run Date: '
               WS-TODAY-YYYY '/' WS-TODAY-MM '/'
               WS-TODAY-DD '  Time: '
               WS-TIME-HH ':' WS-TIME-MN ':' WS-TIME-SS
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE
           MOVE ALL '=' TO RPT-LINE
           WRITE RPT-LINE

           PERFORM 1100-READ-CUSTOMER.

       1100-READ-CUSTOMER.
           READ CUSTOMER-FILE INTO WS-CUST-RECORD
               AT END SET END-OF-FILE TO TRUE
               NOT AT END ADD 1 TO WS-TOTAL-RECORDS
           END-READ.

       2000-PROCESS-RECORDS.
           SET RECORD-IS-CLEAN TO TRUE

           PERFORM 2100-CHECK-NAME-QUALITY
           PERFORM 2200-CHECK-ADDRESS-QUALITY
           PERFORM 2300-CHECK-PHONE-QUALITY
           PERFORM 2400-CHECK-EMAIL-QUALITY
           PERFORM 2500-CHECK-DATE-QUALITY
           PERFORM 2600-CHECK-NUMERIC-QUALITY
           PERFORM 2700-ACCUMULATE-STATISTICS

           IF RECORD-IS-CLEAN
               ADD 1 TO WS-CLEAN-RECORDS
           ELSE
               ADD 1 TO WS-DIRTY-RECORDS
           END-IF

           PERFORM 1100-READ-CUSTOMER.

      *================================================================*
      * NAME QUALITY CHECKS                                            *
      * Uses: LENGTH, TRIM, UPPER-CASE, ORD                           *
      *================================================================*
       2100-CHECK-NAME-QUALITY.
      *    Check 1: Is the name blank?
           COMPUTE WS-TRIMMED-LEN =
               FUNCTION LENGTH(
                   FUNCTION TRIM(WS-LAST-NAME))
           IF WS-TRIMMED-LEN = 0
               ADD 1 TO WS-NAME-BLANK
               ADD 1 TO WS-TOTAL-ISSUES
               SET RECORD-HAS-ISSUES TO TRUE
               STRING WS-CUST-ID '|NAME_BLANK|'
                   'Last name is empty'
                   DELIMITED BY SIZE INTO WS-ISSUE-DETAIL
               WRITE LOG-LINE FROM WS-ISSUE-DETAIL
           ELSE
      *        Check 2: Is the name all uppercase?
      *        Compare original to UPPER-CASE version
      *        If they are equal, name was entered in all caps
               MOVE FUNCTION UPPER-CASE(
                   FUNCTION TRIM(WS-LAST-NAME))
                   TO WS-UPPER-VERSION
               IF FUNCTION TRIM(WS-LAST-NAME)
                   = WS-UPPER-VERSION(1:WS-TRIMMED-LEN)
                   AND WS-TRIMMED-LEN > 1
                   ADD 1 TO WS-NAME-ALL-UPPER
                   ADD 1 TO WS-TOTAL-ISSUES
                   SET RECORD-HAS-ISSUES TO TRUE
                   MOVE SPACES TO WS-ISSUE-DETAIL
                   STRING WS-CUST-ID '|NAME_ALLCAPS|'
                       FUNCTION TRIM(WS-LAST-NAME)
                       DELIMITED BY SIZE
                       INTO WS-ISSUE-DETAIL
                   WRITE LOG-LINE FROM WS-ISSUE-DETAIL
               END-IF

      *        Check 3: Does name contain non-alpha characters?
      *        Allow letters, spaces, hyphens, apostrophes
               SET NO-ALPHA-FOUND TO TRUE
               PERFORM VARYING WS-CHAR-IDX FROM 1 BY 1
                   UNTIL WS-CHAR-IDX > WS-TRIMMED-LEN
                   OR FOUND-ALPHA
                   MOVE WS-LAST-NAME(WS-CHAR-IDX:1)
                       TO WS-TEST-CHAR
                   COMPUTE WS-CHAR-ORD =
                       FUNCTION ORD(WS-TEST-CHAR)
      *            In ASCII: A=66, Z=91, a=98, z=123
      *            Space=33, Hyphen=46, Apostrophe=40
                   IF  WS-TEST-CHAR NOT ALPHABETIC
                       AND WS-TEST-CHAR NOT = SPACE
                       AND WS-TEST-CHAR NOT = '-'
                       AND WS-TEST-CHAR NOT = ''''
                       SET FOUND-ALPHA TO TRUE
                   END-IF
               END-PERFORM
               IF FOUND-ALPHA
                   ADD 1 TO WS-NAME-BAD-CHARS
                   ADD 1 TO WS-TOTAL-ISSUES
                   SET RECORD-HAS-ISSUES TO TRUE
                   MOVE SPACES TO WS-ISSUE-DETAIL
                   STRING WS-CUST-ID
                       '|NAME_BADCHAR|'
                       FUNCTION TRIM(WS-LAST-NAME)
                       DELIMITED BY SIZE
                       INTO WS-ISSUE-DETAIL
                   WRITE LOG-LINE FROM WS-ISSUE-DETAIL
               END-IF
           END-IF.

      *================================================================*
      * ADDRESS QUALITY CHECKS                                         *
      * Uses: TRIM, LENGTH, UPPER-CASE                                *
      *================================================================*
       2200-CHECK-ADDRESS-QUALITY.
      *    Check for missing address components
           IF FUNCTION LENGTH(FUNCTION TRIM(WS-STREET)) = 0
               OR FUNCTION LENGTH(FUNCTION TRIM(WS-CITY)) = 0
               OR FUNCTION LENGTH(FUNCTION TRIM(WS-ZIP-CODE)) = 0
               ADD 1 TO WS-ADDR-MISSING
               ADD 1 TO WS-TOTAL-ISSUES
               SET RECORD-HAS-ISSUES TO TRUE
               MOVE SPACES TO WS-ISSUE-DETAIL
               STRING WS-CUST-ID '|ADDR_MISSING|'
                   'Incomplete address'
                   DELIMITED BY SIZE INTO WS-ISSUE-DETAIL
               WRITE LOG-LINE FROM WS-ISSUE-DETAIL
           END-IF

      *    Validate state code against lookup table
           SET STATE-NOT-VALID TO TRUE
           MOVE FUNCTION UPPER-CASE(WS-STATE)
               TO WS-STATE
           PERFORM VARYING WS-STATE-IDX FROM 1 BY 1
               UNTIL WS-STATE-IDX > 53
               OR STATE-IS-VALID
               IF WS-STATE =
                   FUNCTION TRIM(WS-VALID-STATE(WS-STATE-IDX))
                   SET STATE-IS-VALID TO TRUE
               END-IF
           END-PERFORM
           IF STATE-NOT-VALID AND
               FUNCTION LENGTH(FUNCTION TRIM(WS-STATE)) > 0
               ADD 1 TO WS-STATE-INVALID
               ADD 1 TO WS-TOTAL-ISSUES
               SET RECORD-HAS-ISSUES TO TRUE
               MOVE SPACES TO WS-ISSUE-DETAIL
               STRING WS-CUST-ID '|STATE_INVALID|'
                   WS-STATE
                   DELIMITED BY SIZE INTO WS-ISSUE-DETAIL
               WRITE LOG-LINE FROM WS-ISSUE-DETAIL
           END-IF.

      *================================================================*
      * PHONE QUALITY CHECKS                                           *
      * Uses: TRIM, LENGTH, ORD                                       *
      *================================================================*
       2300-CHECK-PHONE-QUALITY.
      *    Check for blank phone
           COMPUTE WS-TRIMMED-LEN =
               FUNCTION LENGTH(FUNCTION TRIM(WS-PHONE))
           IF WS-TRIMMED-LEN = 0
               ADD 1 TO WS-PHONE-BLANK
               ADD 1 TO WS-TOTAL-ISSUES
               SET RECORD-HAS-ISSUES TO TRUE
               MOVE SPACES TO WS-ISSUE-DETAIL
               STRING WS-CUST-ID '|PHONE_BLANK|'
                   'Phone number is empty'
                   DELIMITED BY SIZE INTO WS-ISSUE-DETAIL
               WRITE LOG-LINE FROM WS-ISSUE-DETAIL
           ELSE
      *        Check for alphabetic characters in phone
      *        Only digits, hyphens, parentheses, spaces, + allowed
               PERFORM VARYING WS-CHAR-IDX FROM 1 BY 1
                   UNTIL WS-CHAR-IDX > WS-TRIMMED-LEN
                   MOVE WS-PHONE(WS-CHAR-IDX:1)
                       TO WS-TEST-CHAR
                   IF WS-TEST-CHAR ALPHABETIC
                       AND WS-TEST-CHAR NOT = SPACE
                       ADD 1 TO WS-PHONE-BAD
                       ADD 1 TO WS-TOTAL-ISSUES
                       SET RECORD-HAS-ISSUES TO TRUE
                       MOVE SPACES TO WS-ISSUE-DETAIL
                       STRING WS-CUST-ID '|PHONE_ALPHA|'
                           FUNCTION TRIM(WS-PHONE)
                           DELIMITED BY SIZE
                           INTO WS-ISSUE-DETAIL
                       WRITE LOG-LINE FROM WS-ISSUE-DETAIL
      *                Exit the loop after first violation
                       MOVE WS-TRIMMED-LEN TO WS-CHAR-IDX
                   END-IF
               END-PERFORM
           END-IF.

      *================================================================*
      * EMAIL QUALITY CHECKS                                           *
      * Uses: TRIM, LENGTH                                             *
      *================================================================*
       2400-CHECK-EMAIL-QUALITY.
           COMPUTE WS-TRIMMED-LEN =
               FUNCTION LENGTH(FUNCTION TRIM(WS-EMAIL))
           IF WS-TRIMMED-LEN = 0
               ADD 1 TO WS-EMAIL-BLANK
               ADD 1 TO WS-TOTAL-ISSUES
               SET RECORD-HAS-ISSUES TO TRUE
           ELSE
      *        Check for @ symbol
               SET NO-AT-SIGN TO TRUE
               PERFORM VARYING WS-CHAR-IDX FROM 1 BY 1
                   UNTIL WS-CHAR-IDX > WS-TRIMMED-LEN
                   OR FOUND-AT-SIGN
                   IF WS-EMAIL(WS-CHAR-IDX:1) = '@'
                       SET FOUND-AT-SIGN TO TRUE
                   END-IF
               END-PERFORM
               IF NO-AT-SIGN
                   ADD 1 TO WS-EMAIL-NO-AT
                   ADD 1 TO WS-TOTAL-ISSUES
                   SET RECORD-HAS-ISSUES TO TRUE
                   MOVE SPACES TO WS-ISSUE-DETAIL
                   STRING WS-CUST-ID '|EMAIL_NOAT|'
                       FUNCTION TRIM(WS-EMAIL)
                       DELIMITED BY SIZE
                       INTO WS-ISSUE-DETAIL
                   WRITE LOG-LINE FROM WS-ISSUE-DETAIL
               END-IF
           END-IF.

      *================================================================*
      * DATE QUALITY CHECKS                                            *
      * Uses: TEST-DATE-YYYYMMDD, INTEGER-OF-DATE, CURRENT-DATE       *
      *================================================================*
       2500-CHECK-DATE-QUALITY.
      *    Validate date of birth
           IF FUNCTION TEST-DATE-YYYYMMDD(WS-DOB) NOT = 0
               ADD 1 TO WS-DOB-INVALID
               ADD 1 TO WS-TOTAL-ISSUES
               SET RECORD-HAS-ISSUES TO TRUE
               MOVE SPACES TO WS-ISSUE-DETAIL
               STRING WS-CUST-ID '|DOB_INVALID|'
                   WS-DOB
                   DELIMITED BY SIZE INTO WS-ISSUE-DETAIL
               WRITE LOG-LINE FROM WS-ISSUE-DETAIL
           ELSE
      *        Check plausible age (0-120 years)
               COMPUTE WS-DOB-INT =
                   FUNCTION INTEGER-OF-DATE(WS-DOB)
               COMPUTE WS-AGE-DAYS =
                   WS-TODAY-INT - WS-DOB-INT
               COMPUTE WS-AGE-YEARS =
                   FUNCTION INTEGER-PART(
                       WS-AGE-DAYS / 365.25)
               IF WS-AGE-YEARS < 0 OR WS-AGE-YEARS > 120
                   ADD 1 TO WS-DOB-IMPLAUSIBLE
                   ADD 1 TO WS-TOTAL-ISSUES
                   SET RECORD-HAS-ISSUES TO TRUE
                   MOVE SPACES TO WS-ISSUE-DETAIL
                   STRING WS-CUST-ID '|DOB_IMPLAUS|'
                       'Age=' WS-AGE-YEARS
                       DELIMITED BY SIZE
                       INTO WS-ISSUE-DETAIL
                   WRITE LOG-LINE FROM WS-ISSUE-DETAIL
               END-IF
           END-IF

      *    Validate enrollment date
           IF FUNCTION TEST-DATE-YYYYMMDD(WS-ENROLLMENT-DATE)
               NOT = 0
               ADD 1 TO WS-ENROLL-INVALID
               ADD 1 TO WS-TOTAL-ISSUES
               SET RECORD-HAS-ISSUES TO TRUE
               MOVE SPACES TO WS-ISSUE-DETAIL
               STRING WS-CUST-ID '|ENROLL_INVALID|'
                   WS-ENROLLMENT-DATE
                   DELIMITED BY SIZE INTO WS-ISSUE-DETAIL
               WRITE LOG-LINE FROM WS-ISSUE-DETAIL
           END-IF.

      *================================================================*
      * NUMERIC QUALITY CHECKS                                         *
      *================================================================*
       2600-CHECK-NUMERIC-QUALITY.
      *    Credit score range check (valid: 300-850)
           IF WS-CREDIT-SCORE < 300 OR WS-CREDIT-SCORE > 850
               IF WS-CREDIT-SCORE NOT = 0
                   ADD 1 TO WS-CREDIT-OUT-RANGE
                   ADD 1 TO WS-TOTAL-ISSUES
                   SET RECORD-HAS-ISSUES TO TRUE
               END-IF
           END-IF

      *    Negative premium check
           IF WS-TOTAL-PREMIUM < 0
               ADD 1 TO WS-PREMIUM-NEGATIVE
               ADD 1 TO WS-TOTAL-ISSUES
               SET RECORD-HAS-ISSUES TO TRUE
           END-IF.

      *================================================================*
      * STATISTICS ACCUMULATION                                        *
      *================================================================*
       2700-ACCUMULATE-STATISTICS.
      *    Income statistics
           IF WS-ANNUAL-INCOME > 0
               ADD 1 TO WS-INCOME-COUNT
               ADD WS-ANNUAL-INCOME TO WS-INCOME-SUM
               IF WS-ANNUAL-INCOME < WS-INCOME-MIN
                   MOVE WS-ANNUAL-INCOME TO WS-INCOME-MIN
               END-IF
               IF WS-ANNUAL-INCOME > WS-INCOME-MAX
                   MOVE WS-ANNUAL-INCOME TO WS-INCOME-MAX
               END-IF
      *        Collect first 20 samples for MEAN/MEDIAN/SD
               IF WS-SAMPLE-SIZE < 20
                   ADD 1 TO WS-SAMPLE-SIZE
                   MOVE WS-ANNUAL-INCOME
                       TO WS-SAMPLE-VAL(WS-SAMPLE-SIZE)
               END-IF
           END-IF

      *    Credit score by region
           PERFORM VARYING WS-RGN-IDX FROM 1 BY 1
               UNTIL WS-RGN-IDX > 4
               IF WS-REGION-CODE = WS-RGN-CODE(WS-RGN-IDX)
                   AND WS-CREDIT-SCORE >= 300
                   AND WS-CREDIT-SCORE <= 850
                   ADD 1 TO WS-RGN-COUNT(WS-RGN-IDX)
                   ADD WS-CREDIT-SCORE
                       TO WS-RGN-SCORE-SUM(WS-RGN-IDX)
               END-IF
           END-PERFORM.

      *================================================================*
      * COMPUTE AGGREGATE STATISTICS                                   *
      * Uses: MEAN, MEDIAN, STANDARD-DEVIATION, RANGE, VARIANCE      *
      *================================================================*
       8000-COMPUTE-STATISTICS.
      *    Compute region-level mean credit scores
           PERFORM VARYING WS-RGN-IDX FROM 1 BY 1
               UNTIL WS-RGN-IDX > 4
               IF WS-RGN-COUNT(WS-RGN-IDX) > 0
                   COMPUTE WS-RGN-MEAN-SCORE(WS-RGN-IDX) =
                       WS-RGN-SCORE-SUM(WS-RGN-IDX)
                       / WS-RGN-COUNT(WS-RGN-IDX)
               END-IF
           END-PERFORM.

      *================================================================*
      * WRITE SUMMARY REPORT                                           *
      *================================================================*
       9000-WRITE-SUMMARY-REPORT.
           MOVE SPACES TO RPT-LINE
           WRITE RPT-LINE
           MOVE 'DATA QUALITY SUMMARY' TO RPT-LINE
           WRITE RPT-LINE
           MOVE ALL '-' TO RPT-LINE
           WRITE RPT-LINE

      *    Record counts
           MOVE WS-TOTAL-RECORDS TO WS-FMT-COUNT
           MOVE SPACES TO RPT-LINE
           STRING 'Total Records Analyzed:  '
               WS-FMT-COUNT
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE

           MOVE WS-CLEAN-RECORDS TO WS-FMT-COUNT
           IF WS-TOTAL-RECORDS > 0
               COMPUTE WS-PCT-CALC =
                   (WS-CLEAN-RECORDS / WS-TOTAL-RECORDS)
                   * 100
           END-IF
           MOVE WS-PCT-CALC TO WS-FMT-PCT
           MOVE SPACES TO RPT-LINE
           STRING 'Clean Records:           '
               WS-FMT-COUNT '  (' WS-FMT-PCT '%)'
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE

           MOVE WS-DIRTY-RECORDS TO WS-FMT-COUNT
           IF WS-TOTAL-RECORDS > 0
               COMPUTE WS-PCT-CALC =
                   (WS-DIRTY-RECORDS / WS-TOTAL-RECORDS)
                   * 100
           END-IF
           MOVE WS-PCT-CALC TO WS-FMT-PCT
           MOVE SPACES TO RPT-LINE
           STRING 'Records with Issues:     '
               WS-FMT-COUNT '  (' WS-FMT-PCT '%)'
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE

           MOVE WS-TOTAL-ISSUES TO WS-FMT-COUNT
           MOVE SPACES TO RPT-LINE
           STRING 'Total Issues Found:      '
               WS-FMT-COUNT
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE

      *    Issue breakdown
           MOVE SPACES TO RPT-LINE
           WRITE RPT-LINE
           MOVE 'ISSUE BREAKDOWN:' TO RPT-LINE
           WRITE RPT-LINE

           PERFORM 9010-WRITE-ISSUE-LINE

      *    Income statistics using intrinsic functions
           MOVE SPACES TO RPT-LINE
           WRITE RPT-LINE
           MOVE 'INCOME STATISTICS (Sample of 20):' TO RPT-LINE
           WRITE RPT-LINE

           IF WS-SAMPLE-SIZE >= 2
               PERFORM 9020-WRITE-INCOME-STATS
           ELSE
               MOVE '  Insufficient data for statistics'
                   TO RPT-LINE
               WRITE RPT-LINE
           END-IF

      *    Credit score by region
           MOVE SPACES TO RPT-LINE
           WRITE RPT-LINE
           MOVE 'CREDIT SCORE BY REGION:' TO RPT-LINE
           WRITE RPT-LINE
           PERFORM 9030-WRITE-REGION-STATS

           MOVE ALL '=' TO RPT-LINE
           WRITE RPT-LINE
           MOVE 'END OF DATA QUALITY REPORT' TO RPT-LINE
           WRITE RPT-LINE.

       9010-WRITE-ISSUE-LINE.
           MOVE WS-NAME-BLANK TO WS-FMT-COUNT
           MOVE SPACES TO RPT-LINE
           STRING '  Blank Names:           '
               WS-FMT-COUNT
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE

           MOVE WS-NAME-ALL-UPPER TO WS-FMT-COUNT
           MOVE SPACES TO RPT-LINE
           STRING '  All-Uppercase Names:   '
               WS-FMT-COUNT
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE

           MOVE WS-NAME-BAD-CHARS TO WS-FMT-COUNT
           MOVE SPACES TO RPT-LINE
           STRING '  Names with Bad Chars:  '
               WS-FMT-COUNT
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE

           MOVE WS-ADDR-MISSING TO WS-FMT-COUNT
           MOVE SPACES TO RPT-LINE
           STRING '  Missing Address Parts: '
               WS-FMT-COUNT
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE

           MOVE WS-STATE-INVALID TO WS-FMT-COUNT
           MOVE SPACES TO RPT-LINE
           STRING '  Invalid State Codes:   '
               WS-FMT-COUNT
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE

           MOVE WS-DOB-INVALID TO WS-FMT-COUNT
           MOVE SPACES TO RPT-LINE
           STRING '  Invalid Birth Dates:   '
               WS-FMT-COUNT
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE

           MOVE WS-DOB-IMPLAUSIBLE TO WS-FMT-COUNT
           MOVE SPACES TO RPT-LINE
           STRING '  Implausible Ages:      '
               WS-FMT-COUNT
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE

           MOVE WS-CREDIT-OUT-RANGE TO WS-FMT-COUNT
           MOVE SPACES TO RPT-LINE
           STRING '  Credit Score Out Range: '
               WS-FMT-COUNT
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE

           MOVE WS-EMAIL-BLANK TO WS-FMT-COUNT
           MOVE SPACES TO RPT-LINE
           STRING '  Missing Email:         '
               WS-FMT-COUNT
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE

           MOVE WS-EMAIL-NO-AT TO WS-FMT-COUNT
           MOVE SPACES TO RPT-LINE
           STRING '  Email Missing @:       '
               WS-FMT-COUNT
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE.

       9020-WRITE-INCOME-STATS.
      *    Use intrinsic functions on the sample values
      *    MEAN, MEDIAN, STANDARD-DEVIATION operate on the
      *    individual sample values collected during processing
           EVALUATE WS-SAMPLE-SIZE
               WHEN 20
                   COMPUTE WS-PCT-CALC =
                       FUNCTION MEAN(
                           WS-SAMPLE-VAL(1)
                           WS-SAMPLE-VAL(2)
                           WS-SAMPLE-VAL(3)
                           WS-SAMPLE-VAL(4)
                           WS-SAMPLE-VAL(5)
                           WS-SAMPLE-VAL(6)
                           WS-SAMPLE-VAL(7)
                           WS-SAMPLE-VAL(8)
                           WS-SAMPLE-VAL(9)
                           WS-SAMPLE-VAL(10)
                           WS-SAMPLE-VAL(11)
                           WS-SAMPLE-VAL(12)
                           WS-SAMPLE-VAL(13)
                           WS-SAMPLE-VAL(14)
                           WS-SAMPLE-VAL(15)
                           WS-SAMPLE-VAL(16)
                           WS-SAMPLE-VAL(17)
                           WS-SAMPLE-VAL(18)
                           WS-SAMPLE-VAL(19)
                           WS-SAMPLE-VAL(20))
               WHEN OTHER
                   IF WS-INCOME-COUNT > 0
                       COMPUTE WS-PCT-CALC =
                           WS-INCOME-SUM / WS-INCOME-COUNT
                   END-IF
           END-EVALUATE

           MOVE WS-PCT-CALC TO WS-FMT-AMOUNT
           MOVE SPACES TO RPT-LINE
           STRING '  Mean Income:         '
               WS-FMT-AMOUNT
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE

           IF WS-SAMPLE-SIZE = 20
               COMPUTE WS-PCT-CALC =
                   FUNCTION MEDIAN(
                       WS-SAMPLE-VAL(1)
                       WS-SAMPLE-VAL(2)
                       WS-SAMPLE-VAL(3)
                       WS-SAMPLE-VAL(4)
                       WS-SAMPLE-VAL(5)
                       WS-SAMPLE-VAL(6)
                       WS-SAMPLE-VAL(7)
                       WS-SAMPLE-VAL(8)
                       WS-SAMPLE-VAL(9)
                       WS-SAMPLE-VAL(10)
                       WS-SAMPLE-VAL(11)
                       WS-SAMPLE-VAL(12)
                       WS-SAMPLE-VAL(13)
                       WS-SAMPLE-VAL(14)
                       WS-SAMPLE-VAL(15)
                       WS-SAMPLE-VAL(16)
                       WS-SAMPLE-VAL(17)
                       WS-SAMPLE-VAL(18)
                       WS-SAMPLE-VAL(19)
                       WS-SAMPLE-VAL(20))
               MOVE WS-PCT-CALC TO WS-FMT-AMOUNT
               MOVE SPACES TO RPT-LINE
               STRING '  Median Income:       '
                   WS-FMT-AMOUNT
                   DELIMITED BY SIZE INTO RPT-LINE
               WRITE RPT-LINE

               COMPUTE WS-PCT-CALC =
                   FUNCTION STANDARD-DEVIATION(
                       WS-SAMPLE-VAL(1)
                       WS-SAMPLE-VAL(2)
                       WS-SAMPLE-VAL(3)
                       WS-SAMPLE-VAL(4)
                       WS-SAMPLE-VAL(5)
                       WS-SAMPLE-VAL(6)
                       WS-SAMPLE-VAL(7)
                       WS-SAMPLE-VAL(8)
                       WS-SAMPLE-VAL(9)
                       WS-SAMPLE-VAL(10)
                       WS-SAMPLE-VAL(11)
                       WS-SAMPLE-VAL(12)
                       WS-SAMPLE-VAL(13)
                       WS-SAMPLE-VAL(14)
                       WS-SAMPLE-VAL(15)
                       WS-SAMPLE-VAL(16)
                       WS-SAMPLE-VAL(17)
                       WS-SAMPLE-VAL(18)
                       WS-SAMPLE-VAL(19)
                       WS-SAMPLE-VAL(20))
               MOVE WS-PCT-CALC TO WS-FMT-AMOUNT
               MOVE SPACES TO RPT-LINE
               STRING '  Std Deviation:       '
                   WS-FMT-AMOUNT
                   DELIMITED BY SIZE INTO RPT-LINE
               WRITE RPT-LINE
           END-IF

           MOVE WS-INCOME-MIN TO WS-FMT-AMOUNT
           MOVE SPACES TO RPT-LINE
           STRING '  Minimum Income:      '
               WS-FMT-AMOUNT
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE

           MOVE WS-INCOME-MAX TO WS-FMT-AMOUNT
           MOVE SPACES TO RPT-LINE
           STRING '  Maximum Income:      '
               WS-FMT-AMOUNT
               DELIMITED BY SIZE INTO RPT-LINE
           WRITE RPT-LINE.

       9030-WRITE-REGION-STATS.
           PERFORM VARYING WS-RGN-IDX FROM 1 BY 1
               UNTIL WS-RGN-IDX > 4
               IF WS-RGN-COUNT(WS-RGN-IDX) > 0
                   MOVE WS-RGN-MEAN-SCORE(WS-RGN-IDX)
                       TO WS-FMT-SCORE
                   MOVE WS-RGN-COUNT(WS-RGN-IDX)
                       TO WS-FMT-COUNT
                   MOVE SPACES TO RPT-LINE
                   STRING '  Region '
                       WS-RGN-CODE(WS-RGN-IDX)
                       ':  Mean Score = ' WS-FMT-SCORE
                       '  (N=' WS-FMT-COUNT ')'
                       DELIMITED BY SIZE INTO RPT-LINE
                   WRITE RPT-LINE
               END-IF
           END-PERFORM.

       9900-FINALIZE.
           DISPLAY 'HMIDQUAL PROCESSING COMPLETE'
           DISPLAY '  Total Records:  ' WS-TOTAL-RECORDS
           DISPLAY '  Clean Records:  ' WS-CLEAN-RECORDS
           DISPLAY '  Dirty Records:  ' WS-DIRTY-RECORDS
           DISPLAY '  Total Issues:   ' WS-TOTAL-ISSUES
           CLOSE CUSTOMER-FILE
                 QUALITY-REPORT
                 DETAIL-LOG.

Solution Walkthrough

String Functions in Action

The name quality check demonstrates the interplay of several string functions:

  • FUNCTION TRIM removes leading and trailing spaces from the name field before analysis. Without TRIM, a 30-byte PIC X field containing "SMITH" followed by 25 spaces would report a length of 30, masking the fact that only 5 characters contain actual data.

  • FUNCTION LENGTH applied to the trimmed result gives the true content length. FUNCTION LENGTH(FUNCTION TRIM(WS-LAST-NAME)) returns 5 for "SMITH", not 30.

  • FUNCTION UPPER-CASE enables the all-caps detection. By comparing the original value to its uppercase version, the program detects names that were entered entirely in uppercase -- a common data entry habit that makes the data look unprofessional in customer-facing documents.

  • FUNCTION ORD provides character-level analysis. By examining the ordinal value of each character, the program can classify characters as alphabetic, numeric, or special, detecting names that contain digits or unexpected symbols.

Date Functions for Validation

The date quality checks demonstrate a two-level validation strategy:

  1. FUNCTION TEST-DATE-YYYYMMDD catches structurally invalid dates -- values like 00000000, 99999999, 20230229 (February 29 in a non-leap year), or 20231301 (month 13). This is the first line of defense.

  2. FUNCTION INTEGER-OF-DATE with CURRENT-DATE provides semantic validation. A date like 20260115 is structurally valid, but if it represents a date of birth, it implies the person is less than one month old -- which may or may not be plausible depending on the business context. By converting both the date of birth and today's date to integers and computing the difference, the program calculates the person's age and flags implausible values.

Statistical Functions for Profiling

The income statistics section demonstrates COBOL's aggregate functions: FUNCTION MEAN, FUNCTION MEDIAN, and FUNCTION STANDARD-DEVIATION. These functions take multiple arguments and return the computed statistic in a single call.

A limitation of COBOL's intrinsic functions is that they require individual arguments, not arrays. The program must list each sample value explicitly (e.g., WS-SAMPLE-VAL(1) WS-SAMPLE-VAL(2) ... WS-SAMPLE-VAL(20)). For the demonstration, the sample is limited to 20 values. In a production system with millions of records, the running sum/count approach would be used for the mean, and a more sophisticated algorithm (such as Welford's online algorithm) would compute variance incrementally.


Discussion Questions

  1. The all-uppercase name detection compares the name to FUNCTION UPPER-CASE of itself. What are the edge cases where this check might produce false positives? Consider names like "O'BRIEN" or single-character names. How would you refine the check?

  2. COBOL's FUNCTION MEAN, FUNCTION MEDIAN, and FUNCTION STANDARD-DEVIATION require listing each argument individually, which limits their practical use to small datasets. Design an alternative approach using running accumulators that computes these statistics incrementally as records are read, without storing all values in memory.

  3. The program writes issues to a detail log file with pipe-delimited fields. Design a downstream process that reads this log, groups issues by category, and produces a remediation work queue for the data stewardship team. What additional fields would you add to the log?

  4. The email validation only checks for the presence of an '@' symbol. What additional checks would make this validation more robust? Consider using COBOL string functions to verify that there is text before the '@', text after the '@', and at least one '.' after the '@'.

  5. The program processes the entire file in a single pass. If the file contains 1.2 million records and each record generates an average of 1.5 issues, the detail log will contain approximately 1.8 million lines. How would you modify the program to support a "summary only" mode that skips the detail log and produces only the aggregate report? What changes to the program structure would this require?