Case Study 2: Data Quality Assessment Tool
Background
Heritage Mutual Insurance Company maintains a customer master file containing 1.2 million policyholder records. The file feeds downstream systems including policy administration, claims processing, billing, actuarial reporting, and regulatory filings. Over the past decade, data quality in the customer master has slowly degraded due to inconsistent data entry practices across the company's twelve regional offices, two mergers that brought in records from incompatible systems, and a general absence of input validation in the original 1990s-era data entry programs.
The consequences are real: billing statements are returned as undeliverable because of malformed addresses, regulatory reports contain implausible values that trigger auditor inquiries, and customer service representatives struggle with records where names appear in all uppercase, phone numbers contain embedded letters, and date fields hold values like 00000000 or 99999999.
The CIO has authorized a data quality initiative, and the first deliverable is a COBOL batch program that reads the entire customer master file, analyzes every field for quality issues, and produces a comprehensive data quality report. The report must identify specific problems at the record level and produce aggregate statistics at the file level, including means, standard deviations, and frequency distributions for key numeric fields.
This case study follows the design and implementation of that data quality assessment tool, demonstrating the full range of COBOL string and numeric intrinsic functions.
Problem Statement
The customer master record has the following layout:
******************************************************************
* COPYBOOK: HMI-CUST-RECORD
* Heritage Mutual Insurance - Customer Master Record
* Record Length: 350 bytes
******************************************************************
01 HMI-CUST-RECORD.
05 HMI-CUST-ID PIC X(10).
05 HMI-CUST-NAME.
10 HMI-LAST-NAME PIC X(30).
10 HMI-FIRST-NAME PIC X(20).
10 HMI-MIDDLE-INIT PIC X(01).
05 HMI-ADDRESS.
10 HMI-STREET PIC X(40).
10 HMI-CITY PIC X(25).
10 HMI-STATE PIC X(02).
10 HMI-ZIP-CODE PIC X(10).
05 HMI-PHONE PIC X(15).
05 HMI-EMAIL PIC X(50).
05 HMI-DOB PIC 9(08).
05 HMI-GENDER PIC X(01).
05 HMI-POLICY-COUNT PIC 9(03).
05 HMI-TOTAL-PREMIUM PIC S9(09)V99 COMP-3.
05 HMI-ANNUAL-INCOME PIC S9(09)V99 COMP-3.
05 HMI-CREDIT-SCORE PIC 9(03).
05 HMI-RISK-CLASS PIC X(02).
05 HMI-REGION-CODE PIC X(04).
05 HMI-AGENT-ID PIC X(08).
05 HMI-ENROLLMENT-DATE PIC 9(08).
05 HMI-LAST-CONTACT-DATE PIC 9(08).
05 HMI-STATUS PIC X(01).
05 HMI-FILLER PIC X(69).
The data quality tool must check each record for the following issues:
| Category | Check | Functions Used |
|---|---|---|
| Name Quality | Blank name fields | LENGTH, TRIM |
| Name Quality | Name is all uppercase (data entry laziness) | UPPER-CASE comparison |
| Name Quality | Name contains non-alphabetic characters | ORD |
| Address Quality | Missing address components | LENGTH, TRIM |
| Address Quality | Invalid state code (not in lookup table) | UPPER-CASE |
| Phone Quality | Phone contains alphabetic characters | ORD |
| Phone Quality | Phone is all zeros or blanks | TRIM |
| Email Quality | Email is blank or missing @ symbol | LENGTH, TRIM |
| Date Quality | Invalid date of birth | TEST-DATE-YYYYMMDD |
| Date Quality | Implausible age (< 0 or > 120) | INTEGER-OF-DATE, CURRENT-DATE |
| Date Quality | Invalid enrollment date | TEST-DATE-YYYYMMDD |
| Numeric Quality | Credit score out of range (300-850) | Numeric comparison |
| Numeric Quality | Negative premium amount | Numeric comparison |
| Statistics | Mean, median, standard deviation of income | MEAN, MEDIAN, STANDARD-DEVIATION |
| Statistics | Mean credit score by region | MEAN |
Complete Program
IDENTIFICATION DIVISION.
PROGRAM-ID. HMIDQUAL.
*================================================================*
* HERITAGE MUTUAL INSURANCE - DATA QUALITY ASSESSMENT TOOL *
* Reads the customer master file, analyzes every field for *
* quality issues, and produces a detailed quality report with *
* aggregate statistics. *
*================================================================*
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
SELECT CUSTOMER-FILE
ASSIGN TO CUSTMAST
ORGANIZATION IS SEQUENTIAL
FILE STATUS IS WS-CUST-STATUS.
SELECT QUALITY-REPORT
ASSIGN TO DQRPT
ORGANIZATION IS SEQUENTIAL
FILE STATUS IS WS-RPT-STATUS.
SELECT DETAIL-LOG
ASSIGN TO DQDETAIL
ORGANIZATION IS SEQUENTIAL
FILE STATUS IS WS-LOG-STATUS.
DATA DIVISION.
FILE SECTION.
FD CUSTOMER-FILE
RECORDING MODE IS F
RECORD CONTAINS 350 CHARACTERS.
01 CUST-INPUT-RECORD PIC X(350).
FD QUALITY-REPORT
RECORDING MODE IS F
RECORD CONTAINS 132 CHARACTERS.
01 RPT-LINE PIC X(132).
FD DETAIL-LOG
RECORDING MODE IS F
RECORD CONTAINS 200 CHARACTERS.
01 LOG-LINE PIC X(200).
WORKING-STORAGE SECTION.
01 WS-FILE-STATUSES.
05 WS-CUST-STATUS PIC X(02).
05 WS-RPT-STATUS PIC X(02).
05 WS-LOG-STATUS PIC X(02).
01 WS-FLAGS.
05 WS-EOF PIC X VALUE 'N'.
88 END-OF-FILE VALUE 'Y'.
05 WS-RECORD-CLEAN PIC X VALUE 'Y'.
88 RECORD-IS-CLEAN VALUE 'Y'.
88 RECORD-HAS-ISSUES VALUE 'N'.
* Working copy of customer record
01 WS-CUST-RECORD.
05 WS-CUST-ID PIC X(10).
05 WS-CUST-NAME.
10 WS-LAST-NAME PIC X(30).
10 WS-FIRST-NAME PIC X(20).
10 WS-MIDDLE-INIT PIC X(01).
05 WS-ADDRESS.
10 WS-STREET PIC X(40).
10 WS-CITY PIC X(25).
10 WS-STATE PIC X(02).
10 WS-ZIP-CODE PIC X(10).
05 WS-PHONE PIC X(15).
05 WS-EMAIL PIC X(50).
05 WS-DOB PIC 9(08).
05 WS-GENDER PIC X(01).
05 WS-POLICY-COUNT PIC 9(03).
05 WS-TOTAL-PREMIUM PIC S9(09)V99 COMP-3.
05 WS-ANNUAL-INCOME PIC S9(09)V99 COMP-3.
05 WS-CREDIT-SCORE PIC 9(03).
05 WS-RISK-CLASS PIC X(02).
05 WS-REGION-CODE PIC X(04).
05 WS-AGENT-ID PIC X(08).
05 WS-ENROLLMENT-DATE PIC 9(08).
05 WS-LAST-CONTACT-DATE PIC 9(08).
05 WS-STATUS PIC X(01).
05 WS-FILLER PIC X(69).
* Date and time
01 WS-CURRENT-DATETIME.
05 WS-TODAY.
10 WS-TODAY-YYYY PIC 9(04).
10 WS-TODAY-MM PIC 9(02).
10 WS-TODAY-DD PIC 9(02).
05 WS-TIME-NOW.
10 WS-TIME-HH PIC 9(02).
10 WS-TIME-MN PIC 9(02).
10 WS-TIME-SS PIC 9(02).
10 WS-TIME-HS PIC 9(02).
05 WS-GMT-OFFSET PIC X(05).
01 WS-TODAY-INT PIC 9(09).
* Issue counters
01 WS-COUNTERS.
05 WS-TOTAL-RECORDS PIC 9(09) VALUE 0.
05 WS-CLEAN-RECORDS PIC 9(09) VALUE 0.
05 WS-DIRTY-RECORDS PIC 9(09) VALUE 0.
05 WS-TOTAL-ISSUES PIC 9(09) VALUE 0.
05 WS-NAME-BLANK PIC 9(07) VALUE 0.
05 WS-NAME-ALL-UPPER PIC 9(07) VALUE 0.
05 WS-NAME-BAD-CHARS PIC 9(07) VALUE 0.
05 WS-ADDR-MISSING PIC 9(07) VALUE 0.
05 WS-STATE-INVALID PIC 9(07) VALUE 0.
05 WS-PHONE-BAD PIC 9(07) VALUE 0.
05 WS-PHONE-BLANK PIC 9(07) VALUE 0.
05 WS-EMAIL-BLANK PIC 9(07) VALUE 0.
05 WS-EMAIL-NO-AT PIC 9(07) VALUE 0.
05 WS-DOB-INVALID PIC 9(07) VALUE 0.
05 WS-DOB-IMPLAUSIBLE PIC 9(07) VALUE 0.
05 WS-ENROLL-INVALID PIC 9(07) VALUE 0.
05 WS-CREDIT-OUT-RANGE PIC 9(07) VALUE 0.
05 WS-PREMIUM-NEGATIVE PIC 9(07) VALUE 0.
* Work fields for string analysis
01 WS-STRING-WORK.
05 WS-TRIMMED-LEN PIC 9(05).
05 WS-CHAR-IDX PIC 9(05).
05 WS-CHAR-ORD PIC 9(05).
05 WS-TEST-CHAR PIC X(01).
05 WS-HAS-ALPHA PIC X VALUE 'N'.
88 FOUND-ALPHA VALUE 'Y'.
88 NO-ALPHA-FOUND VALUE 'N'.
05 WS-HAS-AT-SIGN PIC X VALUE 'N'.
88 FOUND-AT-SIGN VALUE 'Y'.
88 NO-AT-SIGN VALUE 'N'.
05 WS-UPPER-VERSION PIC X(50).
* Work fields for age calculation
01 WS-AGE-WORK.
05 WS-DOB-INT PIC 9(09).
05 WS-AGE-DAYS PIC 9(07).
05 WS-AGE-YEARS PIC 9(03).
* Income statistics accumulation (sample of up to 1000)
01 WS-INCOME-STATS.
05 WS-INCOME-COUNT PIC 9(07) VALUE 0.
05 WS-INCOME-SUM PIC S9(15)V99 VALUE 0.
05 WS-INCOME-MIN PIC S9(09)V99
VALUE 999999999.99.
05 WS-INCOME-MAX PIC S9(09)V99
VALUE 0.
* Accumulate up to 20 income values for MEAN/MEDIAN/SD demo
01 WS-INCOME-SAMPLE.
05 WS-SAMPLE-SIZE PIC 9(04) VALUE 0.
05 WS-SAMPLE-VALUES.
10 WS-SAMPLE-VAL PIC S9(09)V99
OCCURS 20 TIMES.
* Credit score statistics by region (4 regions)
01 WS-REGION-STATS.
05 WS-REGION-ENTRY OCCURS 4 TIMES.
10 WS-RGN-CODE PIC X(04).
10 WS-RGN-COUNT PIC 9(07) VALUE 0.
10 WS-RGN-SCORE-SUM PIC 9(11) VALUE 0.
10 WS-RGN-MEAN-SCORE PIC 9(03)V99.
* US State code validation table
01 WS-STATE-TABLE.
05 FILLER PIC X(100) VALUE
'AL AK AZ AR CA CO CT DE FL GA '
& 'HI ID IL IN IA KS KY LA ME MD '.
05 FILLER PIC X(100) VALUE
'MA MI MN MS MO MT NE NV NH NJ '
& 'NM NY NC ND OH OK OR PA RI SC '.
05 FILLER PIC X(50) VALUE
'SD TN TX UT VT VA WA WV WI WY '.
05 FILLER PIC X(10) VALUE 'DC PR VI '.
01 WS-STATE-TABLE-R REDEFINES WS-STATE-TABLE.
05 WS-VALID-STATE PIC X(03)
OCCURS 53 TIMES.
01 WS-STATE-IDX PIC 9(03).
01 WS-STATE-FOUND PIC X VALUE 'N'.
88 STATE-IS-VALID VALUE 'Y'.
88 STATE-NOT-VALID VALUE 'N'.
* Region initialization
01 WS-RGN-IDX PIC 9(02).
01 WS-REGION-NAMES.
05 FILLER PIC X(04) VALUE 'NEAS'.
05 FILLER PIC X(04) VALUE 'SEAS'.
05 FILLER PIC X(04) VALUE 'MDWS'.
05 FILLER PIC X(04) VALUE 'WEST'.
01 WS-REGION-NAMES-R REDEFINES WS-REGION-NAMES.
05 WS-RGN-NAME-ENTRY PIC X(04) OCCURS 4 TIMES.
* Formatted output fields
01 WS-FMT-COUNT PIC Z(8)9.
01 WS-FMT-PCT PIC ZZ9.99.
01 WS-FMT-AMOUNT PIC $$$,$$$,$$9.99.
01 WS-FMT-SCORE PIC ZZ9.99.
01 WS-ISSUE-DETAIL PIC X(200).
01 WS-PCT-CALC PIC 9(05)V99.
PROCEDURE DIVISION.
0000-MAIN.
PERFORM 1000-INITIALIZE
PERFORM 2000-PROCESS-RECORDS
UNTIL END-OF-FILE
PERFORM 8000-COMPUTE-STATISTICS
PERFORM 9000-WRITE-SUMMARY-REPORT
PERFORM 9900-FINALIZE
STOP RUN.
1000-INITIALIZE.
MOVE FUNCTION CURRENT-DATE TO WS-CURRENT-DATETIME
COMPUTE WS-TODAY-INT =
FUNCTION INTEGER-OF-DATE(WS-TODAY)
* Initialize region codes
PERFORM VARYING WS-RGN-IDX FROM 1 BY 1
UNTIL WS-RGN-IDX > 4
MOVE WS-RGN-NAME-ENTRY(WS-RGN-IDX)
TO WS-RGN-CODE(WS-RGN-IDX)
END-PERFORM
OPEN INPUT CUSTOMER-FILE
OUTPUT QUALITY-REPORT
DETAIL-LOG
IF WS-CUST-STATUS NOT = '00'
DISPLAY 'ERROR OPENING CUSTOMER FILE: '
WS-CUST-STATUS
STOP RUN
END-IF
* Write report header
MOVE ALL '=' TO RPT-LINE
WRITE RPT-LINE
MOVE SPACES TO RPT-LINE
STRING 'HERITAGE MUTUAL INSURANCE - '
'DATA QUALITY ASSESSMENT REPORT'
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
MOVE SPACES TO RPT-LINE
STRING 'Run Date: '
WS-TODAY-YYYY '/' WS-TODAY-MM '/'
WS-TODAY-DD ' Time: '
WS-TIME-HH ':' WS-TIME-MN ':' WS-TIME-SS
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
MOVE ALL '=' TO RPT-LINE
WRITE RPT-LINE
PERFORM 1100-READ-CUSTOMER.
1100-READ-CUSTOMER.
READ CUSTOMER-FILE INTO WS-CUST-RECORD
AT END SET END-OF-FILE TO TRUE
NOT AT END ADD 1 TO WS-TOTAL-RECORDS
END-READ.
2000-PROCESS-RECORDS.
SET RECORD-IS-CLEAN TO TRUE
PERFORM 2100-CHECK-NAME-QUALITY
PERFORM 2200-CHECK-ADDRESS-QUALITY
PERFORM 2300-CHECK-PHONE-QUALITY
PERFORM 2400-CHECK-EMAIL-QUALITY
PERFORM 2500-CHECK-DATE-QUALITY
PERFORM 2600-CHECK-NUMERIC-QUALITY
PERFORM 2700-ACCUMULATE-STATISTICS
IF RECORD-IS-CLEAN
ADD 1 TO WS-CLEAN-RECORDS
ELSE
ADD 1 TO WS-DIRTY-RECORDS
END-IF
PERFORM 1100-READ-CUSTOMER.
*================================================================*
* NAME QUALITY CHECKS *
* Uses: LENGTH, TRIM, UPPER-CASE, ORD *
*================================================================*
2100-CHECK-NAME-QUALITY.
* Check 1: Is the name blank?
COMPUTE WS-TRIMMED-LEN =
FUNCTION LENGTH(
FUNCTION TRIM(WS-LAST-NAME))
IF WS-TRIMMED-LEN = 0
ADD 1 TO WS-NAME-BLANK
ADD 1 TO WS-TOTAL-ISSUES
SET RECORD-HAS-ISSUES TO TRUE
STRING WS-CUST-ID '|NAME_BLANK|'
'Last name is empty'
DELIMITED BY SIZE INTO WS-ISSUE-DETAIL
WRITE LOG-LINE FROM WS-ISSUE-DETAIL
ELSE
* Check 2: Is the name all uppercase?
* Compare original to UPPER-CASE version
* If they are equal, name was entered in all caps
MOVE FUNCTION UPPER-CASE(
FUNCTION TRIM(WS-LAST-NAME))
TO WS-UPPER-VERSION
IF FUNCTION TRIM(WS-LAST-NAME)
= WS-UPPER-VERSION(1:WS-TRIMMED-LEN)
AND WS-TRIMMED-LEN > 1
ADD 1 TO WS-NAME-ALL-UPPER
ADD 1 TO WS-TOTAL-ISSUES
SET RECORD-HAS-ISSUES TO TRUE
MOVE SPACES TO WS-ISSUE-DETAIL
STRING WS-CUST-ID '|NAME_ALLCAPS|'
FUNCTION TRIM(WS-LAST-NAME)
DELIMITED BY SIZE
INTO WS-ISSUE-DETAIL
WRITE LOG-LINE FROM WS-ISSUE-DETAIL
END-IF
* Check 3: Does name contain non-alpha characters?
* Allow letters, spaces, hyphens, apostrophes
SET NO-ALPHA-FOUND TO TRUE
PERFORM VARYING WS-CHAR-IDX FROM 1 BY 1
UNTIL WS-CHAR-IDX > WS-TRIMMED-LEN
OR FOUND-ALPHA
MOVE WS-LAST-NAME(WS-CHAR-IDX:1)
TO WS-TEST-CHAR
COMPUTE WS-CHAR-ORD =
FUNCTION ORD(WS-TEST-CHAR)
* In ASCII: A=66, Z=91, a=98, z=123
* Space=33, Hyphen=46, Apostrophe=40
IF WS-TEST-CHAR NOT ALPHABETIC
AND WS-TEST-CHAR NOT = SPACE
AND WS-TEST-CHAR NOT = '-'
AND WS-TEST-CHAR NOT = ''''
SET FOUND-ALPHA TO TRUE
END-IF
END-PERFORM
IF FOUND-ALPHA
ADD 1 TO WS-NAME-BAD-CHARS
ADD 1 TO WS-TOTAL-ISSUES
SET RECORD-HAS-ISSUES TO TRUE
MOVE SPACES TO WS-ISSUE-DETAIL
STRING WS-CUST-ID
'|NAME_BADCHAR|'
FUNCTION TRIM(WS-LAST-NAME)
DELIMITED BY SIZE
INTO WS-ISSUE-DETAIL
WRITE LOG-LINE FROM WS-ISSUE-DETAIL
END-IF
END-IF.
*================================================================*
* ADDRESS QUALITY CHECKS *
* Uses: TRIM, LENGTH, UPPER-CASE *
*================================================================*
2200-CHECK-ADDRESS-QUALITY.
* Check for missing address components
IF FUNCTION LENGTH(FUNCTION TRIM(WS-STREET)) = 0
OR FUNCTION LENGTH(FUNCTION TRIM(WS-CITY)) = 0
OR FUNCTION LENGTH(FUNCTION TRIM(WS-ZIP-CODE)) = 0
ADD 1 TO WS-ADDR-MISSING
ADD 1 TO WS-TOTAL-ISSUES
SET RECORD-HAS-ISSUES TO TRUE
MOVE SPACES TO WS-ISSUE-DETAIL
STRING WS-CUST-ID '|ADDR_MISSING|'
'Incomplete address'
DELIMITED BY SIZE INTO WS-ISSUE-DETAIL
WRITE LOG-LINE FROM WS-ISSUE-DETAIL
END-IF
* Validate state code against lookup table
SET STATE-NOT-VALID TO TRUE
MOVE FUNCTION UPPER-CASE(WS-STATE)
TO WS-STATE
PERFORM VARYING WS-STATE-IDX FROM 1 BY 1
UNTIL WS-STATE-IDX > 53
OR STATE-IS-VALID
IF WS-STATE =
FUNCTION TRIM(WS-VALID-STATE(WS-STATE-IDX))
SET STATE-IS-VALID TO TRUE
END-IF
END-PERFORM
IF STATE-NOT-VALID AND
FUNCTION LENGTH(FUNCTION TRIM(WS-STATE)) > 0
ADD 1 TO WS-STATE-INVALID
ADD 1 TO WS-TOTAL-ISSUES
SET RECORD-HAS-ISSUES TO TRUE
MOVE SPACES TO WS-ISSUE-DETAIL
STRING WS-CUST-ID '|STATE_INVALID|'
WS-STATE
DELIMITED BY SIZE INTO WS-ISSUE-DETAIL
WRITE LOG-LINE FROM WS-ISSUE-DETAIL
END-IF.
*================================================================*
* PHONE QUALITY CHECKS *
* Uses: TRIM, LENGTH, ORD *
*================================================================*
2300-CHECK-PHONE-QUALITY.
* Check for blank phone
COMPUTE WS-TRIMMED-LEN =
FUNCTION LENGTH(FUNCTION TRIM(WS-PHONE))
IF WS-TRIMMED-LEN = 0
ADD 1 TO WS-PHONE-BLANK
ADD 1 TO WS-TOTAL-ISSUES
SET RECORD-HAS-ISSUES TO TRUE
MOVE SPACES TO WS-ISSUE-DETAIL
STRING WS-CUST-ID '|PHONE_BLANK|'
'Phone number is empty'
DELIMITED BY SIZE INTO WS-ISSUE-DETAIL
WRITE LOG-LINE FROM WS-ISSUE-DETAIL
ELSE
* Check for alphabetic characters in phone
* Only digits, hyphens, parentheses, spaces, + allowed
PERFORM VARYING WS-CHAR-IDX FROM 1 BY 1
UNTIL WS-CHAR-IDX > WS-TRIMMED-LEN
MOVE WS-PHONE(WS-CHAR-IDX:1)
TO WS-TEST-CHAR
IF WS-TEST-CHAR ALPHABETIC
AND WS-TEST-CHAR NOT = SPACE
ADD 1 TO WS-PHONE-BAD
ADD 1 TO WS-TOTAL-ISSUES
SET RECORD-HAS-ISSUES TO TRUE
MOVE SPACES TO WS-ISSUE-DETAIL
STRING WS-CUST-ID '|PHONE_ALPHA|'
FUNCTION TRIM(WS-PHONE)
DELIMITED BY SIZE
INTO WS-ISSUE-DETAIL
WRITE LOG-LINE FROM WS-ISSUE-DETAIL
* Exit the loop after first violation
MOVE WS-TRIMMED-LEN TO WS-CHAR-IDX
END-IF
END-PERFORM
END-IF.
*================================================================*
* EMAIL QUALITY CHECKS *
* Uses: TRIM, LENGTH *
*================================================================*
2400-CHECK-EMAIL-QUALITY.
COMPUTE WS-TRIMMED-LEN =
FUNCTION LENGTH(FUNCTION TRIM(WS-EMAIL))
IF WS-TRIMMED-LEN = 0
ADD 1 TO WS-EMAIL-BLANK
ADD 1 TO WS-TOTAL-ISSUES
SET RECORD-HAS-ISSUES TO TRUE
ELSE
* Check for @ symbol
SET NO-AT-SIGN TO TRUE
PERFORM VARYING WS-CHAR-IDX FROM 1 BY 1
UNTIL WS-CHAR-IDX > WS-TRIMMED-LEN
OR FOUND-AT-SIGN
IF WS-EMAIL(WS-CHAR-IDX:1) = '@'
SET FOUND-AT-SIGN TO TRUE
END-IF
END-PERFORM
IF NO-AT-SIGN
ADD 1 TO WS-EMAIL-NO-AT
ADD 1 TO WS-TOTAL-ISSUES
SET RECORD-HAS-ISSUES TO TRUE
MOVE SPACES TO WS-ISSUE-DETAIL
STRING WS-CUST-ID '|EMAIL_NOAT|'
FUNCTION TRIM(WS-EMAIL)
DELIMITED BY SIZE
INTO WS-ISSUE-DETAIL
WRITE LOG-LINE FROM WS-ISSUE-DETAIL
END-IF
END-IF.
*================================================================*
* DATE QUALITY CHECKS *
* Uses: TEST-DATE-YYYYMMDD, INTEGER-OF-DATE, CURRENT-DATE *
*================================================================*
2500-CHECK-DATE-QUALITY.
* Validate date of birth
IF FUNCTION TEST-DATE-YYYYMMDD(WS-DOB) NOT = 0
ADD 1 TO WS-DOB-INVALID
ADD 1 TO WS-TOTAL-ISSUES
SET RECORD-HAS-ISSUES TO TRUE
MOVE SPACES TO WS-ISSUE-DETAIL
STRING WS-CUST-ID '|DOB_INVALID|'
WS-DOB
DELIMITED BY SIZE INTO WS-ISSUE-DETAIL
WRITE LOG-LINE FROM WS-ISSUE-DETAIL
ELSE
* Check plausible age (0-120 years)
COMPUTE WS-DOB-INT =
FUNCTION INTEGER-OF-DATE(WS-DOB)
COMPUTE WS-AGE-DAYS =
WS-TODAY-INT - WS-DOB-INT
COMPUTE WS-AGE-YEARS =
FUNCTION INTEGER-PART(
WS-AGE-DAYS / 365.25)
IF WS-AGE-YEARS < 0 OR WS-AGE-YEARS > 120
ADD 1 TO WS-DOB-IMPLAUSIBLE
ADD 1 TO WS-TOTAL-ISSUES
SET RECORD-HAS-ISSUES TO TRUE
MOVE SPACES TO WS-ISSUE-DETAIL
STRING WS-CUST-ID '|DOB_IMPLAUS|'
'Age=' WS-AGE-YEARS
DELIMITED BY SIZE
INTO WS-ISSUE-DETAIL
WRITE LOG-LINE FROM WS-ISSUE-DETAIL
END-IF
END-IF
* Validate enrollment date
IF FUNCTION TEST-DATE-YYYYMMDD(WS-ENROLLMENT-DATE)
NOT = 0
ADD 1 TO WS-ENROLL-INVALID
ADD 1 TO WS-TOTAL-ISSUES
SET RECORD-HAS-ISSUES TO TRUE
MOVE SPACES TO WS-ISSUE-DETAIL
STRING WS-CUST-ID '|ENROLL_INVALID|'
WS-ENROLLMENT-DATE
DELIMITED BY SIZE INTO WS-ISSUE-DETAIL
WRITE LOG-LINE FROM WS-ISSUE-DETAIL
END-IF.
*================================================================*
* NUMERIC QUALITY CHECKS *
*================================================================*
2600-CHECK-NUMERIC-QUALITY.
* Credit score range check (valid: 300-850)
IF WS-CREDIT-SCORE < 300 OR WS-CREDIT-SCORE > 850
IF WS-CREDIT-SCORE NOT = 0
ADD 1 TO WS-CREDIT-OUT-RANGE
ADD 1 TO WS-TOTAL-ISSUES
SET RECORD-HAS-ISSUES TO TRUE
END-IF
END-IF
* Negative premium check
IF WS-TOTAL-PREMIUM < 0
ADD 1 TO WS-PREMIUM-NEGATIVE
ADD 1 TO WS-TOTAL-ISSUES
SET RECORD-HAS-ISSUES TO TRUE
END-IF.
*================================================================*
* STATISTICS ACCUMULATION *
*================================================================*
2700-ACCUMULATE-STATISTICS.
* Income statistics
IF WS-ANNUAL-INCOME > 0
ADD 1 TO WS-INCOME-COUNT
ADD WS-ANNUAL-INCOME TO WS-INCOME-SUM
IF WS-ANNUAL-INCOME < WS-INCOME-MIN
MOVE WS-ANNUAL-INCOME TO WS-INCOME-MIN
END-IF
IF WS-ANNUAL-INCOME > WS-INCOME-MAX
MOVE WS-ANNUAL-INCOME TO WS-INCOME-MAX
END-IF
* Collect first 20 samples for MEAN/MEDIAN/SD
IF WS-SAMPLE-SIZE < 20
ADD 1 TO WS-SAMPLE-SIZE
MOVE WS-ANNUAL-INCOME
TO WS-SAMPLE-VAL(WS-SAMPLE-SIZE)
END-IF
END-IF
* Credit score by region
PERFORM VARYING WS-RGN-IDX FROM 1 BY 1
UNTIL WS-RGN-IDX > 4
IF WS-REGION-CODE = WS-RGN-CODE(WS-RGN-IDX)
AND WS-CREDIT-SCORE >= 300
AND WS-CREDIT-SCORE <= 850
ADD 1 TO WS-RGN-COUNT(WS-RGN-IDX)
ADD WS-CREDIT-SCORE
TO WS-RGN-SCORE-SUM(WS-RGN-IDX)
END-IF
END-PERFORM.
*================================================================*
* COMPUTE AGGREGATE STATISTICS *
* Uses: MEAN, MEDIAN, STANDARD-DEVIATION, RANGE, VARIANCE *
*================================================================*
8000-COMPUTE-STATISTICS.
* Compute region-level mean credit scores
PERFORM VARYING WS-RGN-IDX FROM 1 BY 1
UNTIL WS-RGN-IDX > 4
IF WS-RGN-COUNT(WS-RGN-IDX) > 0
COMPUTE WS-RGN-MEAN-SCORE(WS-RGN-IDX) =
WS-RGN-SCORE-SUM(WS-RGN-IDX)
/ WS-RGN-COUNT(WS-RGN-IDX)
END-IF
END-PERFORM.
*================================================================*
* WRITE SUMMARY REPORT *
*================================================================*
9000-WRITE-SUMMARY-REPORT.
MOVE SPACES TO RPT-LINE
WRITE RPT-LINE
MOVE 'DATA QUALITY SUMMARY' TO RPT-LINE
WRITE RPT-LINE
MOVE ALL '-' TO RPT-LINE
WRITE RPT-LINE
* Record counts
MOVE WS-TOTAL-RECORDS TO WS-FMT-COUNT
MOVE SPACES TO RPT-LINE
STRING 'Total Records Analyzed: '
WS-FMT-COUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
MOVE WS-CLEAN-RECORDS TO WS-FMT-COUNT
IF WS-TOTAL-RECORDS > 0
COMPUTE WS-PCT-CALC =
(WS-CLEAN-RECORDS / WS-TOTAL-RECORDS)
* 100
END-IF
MOVE WS-PCT-CALC TO WS-FMT-PCT
MOVE SPACES TO RPT-LINE
STRING 'Clean Records: '
WS-FMT-COUNT ' (' WS-FMT-PCT '%)'
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
MOVE WS-DIRTY-RECORDS TO WS-FMT-COUNT
IF WS-TOTAL-RECORDS > 0
COMPUTE WS-PCT-CALC =
(WS-DIRTY-RECORDS / WS-TOTAL-RECORDS)
* 100
END-IF
MOVE WS-PCT-CALC TO WS-FMT-PCT
MOVE SPACES TO RPT-LINE
STRING 'Records with Issues: '
WS-FMT-COUNT ' (' WS-FMT-PCT '%)'
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
MOVE WS-TOTAL-ISSUES TO WS-FMT-COUNT
MOVE SPACES TO RPT-LINE
STRING 'Total Issues Found: '
WS-FMT-COUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
* Issue breakdown
MOVE SPACES TO RPT-LINE
WRITE RPT-LINE
MOVE 'ISSUE BREAKDOWN:' TO RPT-LINE
WRITE RPT-LINE
PERFORM 9010-WRITE-ISSUE-LINE
* Income statistics using intrinsic functions
MOVE SPACES TO RPT-LINE
WRITE RPT-LINE
MOVE 'INCOME STATISTICS (Sample of 20):' TO RPT-LINE
WRITE RPT-LINE
IF WS-SAMPLE-SIZE >= 2
PERFORM 9020-WRITE-INCOME-STATS
ELSE
MOVE ' Insufficient data for statistics'
TO RPT-LINE
WRITE RPT-LINE
END-IF
* Credit score by region
MOVE SPACES TO RPT-LINE
WRITE RPT-LINE
MOVE 'CREDIT SCORE BY REGION:' TO RPT-LINE
WRITE RPT-LINE
PERFORM 9030-WRITE-REGION-STATS
MOVE ALL '=' TO RPT-LINE
WRITE RPT-LINE
MOVE 'END OF DATA QUALITY REPORT' TO RPT-LINE
WRITE RPT-LINE.
9010-WRITE-ISSUE-LINE.
MOVE WS-NAME-BLANK TO WS-FMT-COUNT
MOVE SPACES TO RPT-LINE
STRING ' Blank Names: '
WS-FMT-COUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
MOVE WS-NAME-ALL-UPPER TO WS-FMT-COUNT
MOVE SPACES TO RPT-LINE
STRING ' All-Uppercase Names: '
WS-FMT-COUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
MOVE WS-NAME-BAD-CHARS TO WS-FMT-COUNT
MOVE SPACES TO RPT-LINE
STRING ' Names with Bad Chars: '
WS-FMT-COUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
MOVE WS-ADDR-MISSING TO WS-FMT-COUNT
MOVE SPACES TO RPT-LINE
STRING ' Missing Address Parts: '
WS-FMT-COUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
MOVE WS-STATE-INVALID TO WS-FMT-COUNT
MOVE SPACES TO RPT-LINE
STRING ' Invalid State Codes: '
WS-FMT-COUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
MOVE WS-DOB-INVALID TO WS-FMT-COUNT
MOVE SPACES TO RPT-LINE
STRING ' Invalid Birth Dates: '
WS-FMT-COUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
MOVE WS-DOB-IMPLAUSIBLE TO WS-FMT-COUNT
MOVE SPACES TO RPT-LINE
STRING ' Implausible Ages: '
WS-FMT-COUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
MOVE WS-CREDIT-OUT-RANGE TO WS-FMT-COUNT
MOVE SPACES TO RPT-LINE
STRING ' Credit Score Out Range: '
WS-FMT-COUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
MOVE WS-EMAIL-BLANK TO WS-FMT-COUNT
MOVE SPACES TO RPT-LINE
STRING ' Missing Email: '
WS-FMT-COUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
MOVE WS-EMAIL-NO-AT TO WS-FMT-COUNT
MOVE SPACES TO RPT-LINE
STRING ' Email Missing @: '
WS-FMT-COUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE.
9020-WRITE-INCOME-STATS.
* Use intrinsic functions on the sample values
* MEAN, MEDIAN, STANDARD-DEVIATION operate on the
* individual sample values collected during processing
EVALUATE WS-SAMPLE-SIZE
WHEN 20
COMPUTE WS-PCT-CALC =
FUNCTION MEAN(
WS-SAMPLE-VAL(1)
WS-SAMPLE-VAL(2)
WS-SAMPLE-VAL(3)
WS-SAMPLE-VAL(4)
WS-SAMPLE-VAL(5)
WS-SAMPLE-VAL(6)
WS-SAMPLE-VAL(7)
WS-SAMPLE-VAL(8)
WS-SAMPLE-VAL(9)
WS-SAMPLE-VAL(10)
WS-SAMPLE-VAL(11)
WS-SAMPLE-VAL(12)
WS-SAMPLE-VAL(13)
WS-SAMPLE-VAL(14)
WS-SAMPLE-VAL(15)
WS-SAMPLE-VAL(16)
WS-SAMPLE-VAL(17)
WS-SAMPLE-VAL(18)
WS-SAMPLE-VAL(19)
WS-SAMPLE-VAL(20))
WHEN OTHER
IF WS-INCOME-COUNT > 0
COMPUTE WS-PCT-CALC =
WS-INCOME-SUM / WS-INCOME-COUNT
END-IF
END-EVALUATE
MOVE WS-PCT-CALC TO WS-FMT-AMOUNT
MOVE SPACES TO RPT-LINE
STRING ' Mean Income: '
WS-FMT-AMOUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
IF WS-SAMPLE-SIZE = 20
COMPUTE WS-PCT-CALC =
FUNCTION MEDIAN(
WS-SAMPLE-VAL(1)
WS-SAMPLE-VAL(2)
WS-SAMPLE-VAL(3)
WS-SAMPLE-VAL(4)
WS-SAMPLE-VAL(5)
WS-SAMPLE-VAL(6)
WS-SAMPLE-VAL(7)
WS-SAMPLE-VAL(8)
WS-SAMPLE-VAL(9)
WS-SAMPLE-VAL(10)
WS-SAMPLE-VAL(11)
WS-SAMPLE-VAL(12)
WS-SAMPLE-VAL(13)
WS-SAMPLE-VAL(14)
WS-SAMPLE-VAL(15)
WS-SAMPLE-VAL(16)
WS-SAMPLE-VAL(17)
WS-SAMPLE-VAL(18)
WS-SAMPLE-VAL(19)
WS-SAMPLE-VAL(20))
MOVE WS-PCT-CALC TO WS-FMT-AMOUNT
MOVE SPACES TO RPT-LINE
STRING ' Median Income: '
WS-FMT-AMOUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
COMPUTE WS-PCT-CALC =
FUNCTION STANDARD-DEVIATION(
WS-SAMPLE-VAL(1)
WS-SAMPLE-VAL(2)
WS-SAMPLE-VAL(3)
WS-SAMPLE-VAL(4)
WS-SAMPLE-VAL(5)
WS-SAMPLE-VAL(6)
WS-SAMPLE-VAL(7)
WS-SAMPLE-VAL(8)
WS-SAMPLE-VAL(9)
WS-SAMPLE-VAL(10)
WS-SAMPLE-VAL(11)
WS-SAMPLE-VAL(12)
WS-SAMPLE-VAL(13)
WS-SAMPLE-VAL(14)
WS-SAMPLE-VAL(15)
WS-SAMPLE-VAL(16)
WS-SAMPLE-VAL(17)
WS-SAMPLE-VAL(18)
WS-SAMPLE-VAL(19)
WS-SAMPLE-VAL(20))
MOVE WS-PCT-CALC TO WS-FMT-AMOUNT
MOVE SPACES TO RPT-LINE
STRING ' Std Deviation: '
WS-FMT-AMOUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
END-IF
MOVE WS-INCOME-MIN TO WS-FMT-AMOUNT
MOVE SPACES TO RPT-LINE
STRING ' Minimum Income: '
WS-FMT-AMOUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
MOVE WS-INCOME-MAX TO WS-FMT-AMOUNT
MOVE SPACES TO RPT-LINE
STRING ' Maximum Income: '
WS-FMT-AMOUNT
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE.
9030-WRITE-REGION-STATS.
PERFORM VARYING WS-RGN-IDX FROM 1 BY 1
UNTIL WS-RGN-IDX > 4
IF WS-RGN-COUNT(WS-RGN-IDX) > 0
MOVE WS-RGN-MEAN-SCORE(WS-RGN-IDX)
TO WS-FMT-SCORE
MOVE WS-RGN-COUNT(WS-RGN-IDX)
TO WS-FMT-COUNT
MOVE SPACES TO RPT-LINE
STRING ' Region '
WS-RGN-CODE(WS-RGN-IDX)
': Mean Score = ' WS-FMT-SCORE
' (N=' WS-FMT-COUNT ')'
DELIMITED BY SIZE INTO RPT-LINE
WRITE RPT-LINE
END-IF
END-PERFORM.
9900-FINALIZE.
DISPLAY 'HMIDQUAL PROCESSING COMPLETE'
DISPLAY ' Total Records: ' WS-TOTAL-RECORDS
DISPLAY ' Clean Records: ' WS-CLEAN-RECORDS
DISPLAY ' Dirty Records: ' WS-DIRTY-RECORDS
DISPLAY ' Total Issues: ' WS-TOTAL-ISSUES
CLOSE CUSTOMER-FILE
QUALITY-REPORT
DETAIL-LOG.
Solution Walkthrough
String Functions in Action
The name quality check demonstrates the interplay of several string functions:
-
FUNCTION TRIM removes leading and trailing spaces from the name field before analysis. Without TRIM, a 30-byte PIC X field containing "SMITH" followed by 25 spaces would report a length of 30, masking the fact that only 5 characters contain actual data.
-
FUNCTION LENGTH applied to the trimmed result gives the true content length.
FUNCTION LENGTH(FUNCTION TRIM(WS-LAST-NAME))returns 5 for "SMITH", not 30. -
FUNCTION UPPER-CASE enables the all-caps detection. By comparing the original value to its uppercase version, the program detects names that were entered entirely in uppercase -- a common data entry habit that makes the data look unprofessional in customer-facing documents.
-
FUNCTION ORD provides character-level analysis. By examining the ordinal value of each character, the program can classify characters as alphabetic, numeric, or special, detecting names that contain digits or unexpected symbols.
Date Functions for Validation
The date quality checks demonstrate a two-level validation strategy:
-
FUNCTION TEST-DATE-YYYYMMDD catches structurally invalid dates -- values like
00000000,99999999,20230229(February 29 in a non-leap year), or20231301(month 13). This is the first line of defense. -
FUNCTION INTEGER-OF-DATE with CURRENT-DATE provides semantic validation. A date like
20260115is structurally valid, but if it represents a date of birth, it implies the person is less than one month old -- which may or may not be plausible depending on the business context. By converting both the date of birth and today's date to integers and computing the difference, the program calculates the person's age and flags implausible values.
Statistical Functions for Profiling
The income statistics section demonstrates COBOL's aggregate functions: FUNCTION MEAN, FUNCTION MEDIAN, and FUNCTION STANDARD-DEVIATION. These functions take multiple arguments and return the computed statistic in a single call.
A limitation of COBOL's intrinsic functions is that they require individual arguments, not arrays. The program must list each sample value explicitly (e.g., WS-SAMPLE-VAL(1) WS-SAMPLE-VAL(2) ... WS-SAMPLE-VAL(20)). For the demonstration, the sample is limited to 20 values. In a production system with millions of records, the running sum/count approach would be used for the mean, and a more sophisticated algorithm (such as Welford's online algorithm) would compute variance incrementally.
Discussion Questions
-
The all-uppercase name detection compares the name to
FUNCTION UPPER-CASEof itself. What are the edge cases where this check might produce false positives? Consider names like "O'BRIEN" or single-character names. How would you refine the check? -
COBOL's
FUNCTION MEAN,FUNCTION MEDIAN, andFUNCTION STANDARD-DEVIATIONrequire listing each argument individually, which limits their practical use to small datasets. Design an alternative approach using running accumulators that computes these statistics incrementally as records are read, without storing all values in memory. -
The program writes issues to a detail log file with pipe-delimited fields. Design a downstream process that reads this log, groups issues by category, and produces a remediation work queue for the data stewardship team. What additional fields would you add to the log?
-
The email validation only checks for the presence of an '@' symbol. What additional checks would make this validation more robust? Consider using COBOL string functions to verify that there is text before the '@', text after the '@', and at least one '.' after the '@'.
-
The program processes the entire file in a single pass. If the file contains 1.2 million records and each record generates an average of 1.5 issues, the detail log will contain approximately 1.8 million lines. How would you modify the program to support a "summary only" mode that skips the detail log and produces only the aggregate report? What changes to the program structure would this require?