Case Study 2: Data Quality Investigation Using SQL
Background
Meridian National Bank is preparing for a regulatory examination. The Office of the Comptroller of the Currency (OCC) has notified the bank that examiners will review customer data completeness as part of their Know Your Customer (KYC) compliance assessment. The bank's Chief Data Officer, Marcus Webb, has asked the data governance team to conduct a pre-examination data quality audit.
The team needs to answer a fundamental question: How complete and consistent is our customer and account data?
This case study demonstrates how the SELECT, WHERE, CASE, and scalar function skills from Chapter 5 can be applied to a real-world data quality investigation. You will not need aggregate functions or joins for most of these queries — the emphasis is on clever use of NULL detection, pattern matching, CASE expressions, and string functions to surface data issues.
Phase 1: Identifying Missing Data
The first step in any data quality audit is determining what is missing. Regulators expect certain fields to be populated for every customer record.
Required Fields Check
For KYC compliance, every individual customer should have: first name, last name, date of birth, address, and at least one phone number. Let us check:
SELECT CUSTOMER_ID,
FIRST_NAME,
LAST_NAME,
CASE WHEN DATE_OF_BIRTH IS NULL THEN 'MISSING' ELSE 'OK' END AS DOB_STATUS,
CASE WHEN EMAIL_ADDRESS IS NULL THEN 'MISSING' ELSE 'OK' END AS EMAIL_STATUS,
CASE WHEN PHONE_PRIMARY IS NULL THEN 'MISSING' ELSE 'OK' END AS PHONE_STATUS,
CASE WHEN SSN_LAST_FOUR IS NULL THEN 'MISSING' ELSE 'OK' END AS SSN4_STATUS
FROM MERIDIAN.CUSTOMERS
WHERE CUSTOMER_TYPE = 'I'
AND STATUS = 'A'
AND (DATE_OF_BIRTH IS NULL
OR EMAIL_ADDRESS IS NULL
OR PHONE_PRIMARY IS NULL
OR SSN_LAST_FOUR IS NULL)
ORDER BY CUSTOMER_ID;
In our sample data, all individual customers have these fields populated. That is good news. But a real production database with thousands of records would almost certainly have gaps. The query structure above is the template the team would use.
Building a Completeness Score
For a more nuanced view, the team builds a completeness score for each customer — counting how many of several optional fields are populated:
SELECT CUSTOMER_ID,
FIRST_NAME || ' ' || LAST_NAME AS CUSTOMER_NAME,
-- Count populated optional fields (0 or 1 for each)
CASE WHEN MIDDLE_NAME IS NOT NULL THEN 1 ELSE 0 END +
CASE WHEN EMAIL_ADDRESS IS NOT NULL THEN 1 ELSE 0 END +
CASE WHEN PHONE_PRIMARY IS NOT NULL THEN 1 ELSE 0 END +
CASE WHEN PHONE_SECONDARY IS NOT NULL THEN 1 ELSE 0 END +
CASE WHEN DATE_OF_BIRTH IS NOT NULL THEN 1 ELSE 0 END +
CASE WHEN SSN_LAST_FOUR IS NOT NULL THEN 1 ELSE 0 END +
CASE WHEN ADDRESS_LINE2 IS NOT NULL THEN 1 ELSE 0 END
AS FIELDS_POPULATED,
-- Total possible optional fields
7 AS TOTAL_FIELDS,
-- Completeness percentage (approximate, using integer math)
CAST(
(CASE WHEN MIDDLE_NAME IS NOT NULL THEN 1 ELSE 0 END +
CASE WHEN EMAIL_ADDRESS IS NOT NULL THEN 1 ELSE 0 END +
CASE WHEN PHONE_PRIMARY IS NOT NULL THEN 1 ELSE 0 END +
CASE WHEN PHONE_SECONDARY IS NOT NULL THEN 1 ELSE 0 END +
CASE WHEN DATE_OF_BIRTH IS NOT NULL THEN 1 ELSE 0 END +
CASE WHEN SSN_LAST_FOUR IS NOT NULL THEN 1 ELSE 0 END +
CASE WHEN ADDRESS_LINE2 IS NOT NULL THEN 1 ELSE 0 END)
* 100 / 7 AS INTEGER
) AS COMPLETENESS_PCT
FROM MERIDIAN.CUSTOMERS
WHERE CUSTOMER_TYPE = 'I'
ORDER BY COMPLETENESS_PCT, LAST_NAME;
Customers with a low completeness percentage should be flagged for outreach — a teller or relationship manager can ask for the missing information during the customer's next visit.
Phase 2: Format Consistency Checks
Data can be present but still problematic if it is not formatted consistently. The team checks several format-related concerns.
Phone Number Format
The bank's standard phone format is XXX-XXX-XXXX (12 characters with dashes). Are all phone numbers formatted consistently?
SELECT CUSTOMER_ID,
FIRST_NAME || ' ' || LAST_NAME AS CUSTOMER_NAME,
PHONE_PRIMARY,
LENGTH(PHONE_PRIMARY) AS PHONE_LENGTH,
CASE
WHEN PHONE_PRIMARY IS NULL THEN 'NULL'
WHEN LENGTH(PHONE_PRIMARY) <> 12 THEN 'Wrong Length'
WHEN SUBSTR(PHONE_PRIMARY, 4, 1) <> '-' THEN 'Missing Dash at Pos 4'
WHEN SUBSTR(PHONE_PRIMARY, 8, 1) <> '-' THEN 'Missing Dash at Pos 8'
ELSE 'Valid Format'
END AS FORMAT_CHECK
FROM MERIDIAN.CUSTOMERS
ORDER BY FORMAT_CHECK, CUSTOMER_ID;
In our sample data, all phone numbers follow the standard format. In a production system, you might find entries like 8605551001 (no dashes), (860) 555-1001 (parenthetical area code), or +1-860-555-1001 (country code prefix). Each format variation must be identified and normalized.
Email Address Validation
A basic email validation checks for the presence of an @ sign and at least one period after it:
SELECT CUSTOMER_ID,
FIRST_NAME || ' ' || LAST_NAME AS CUSTOMER_NAME,
EMAIL_ADDRESS,
CASE
WHEN EMAIL_ADDRESS IS NULL THEN 'NULL'
WHEN LOCATE('@', EMAIL_ADDRESS) = 0 THEN 'Missing @'
WHEN LOCATE('.', EMAIL_ADDRESS, LOCATE('@', EMAIL_ADDRESS)) = 0
THEN 'No domain dot'
WHEN LENGTH(EMAIL_ADDRESS) - LENGTH(REPLACE(EMAIL_ADDRESS, '@', '')) > 1
THEN 'Multiple @ signs'
WHEN SUBSTR(EMAIL_ADDRESS, LENGTH(EMAIL_ADDRESS), 1) = '.'
THEN 'Ends with dot'
ELSE 'Appears Valid'
END AS EMAIL_CHECK
FROM MERIDIAN.CUSTOMERS
WHERE EMAIL_ADDRESS IS NOT NULL
ORDER BY EMAIL_CHECK, CUSTOMER_ID;
This is a pragmatic validation — it does not enforce the full email RFC specification, but it catches the most common errors. The LOCATE function returns the position of a substring within a string (0 if not found). The trick of comparing the string length before and after REPLACE detects multiple occurrences of a character.
ZIP Code Format
US ZIP codes should be either 5 digits or 5+4 format (XXXXX or XXXXX-XXXX):
SELECT CUSTOMER_ID,
FIRST_NAME || ' ' || LAST_NAME AS CUSTOMER_NAME,
ZIP_CODE,
LENGTH(ZIP_CODE) AS ZIP_LENGTH,
CASE
WHEN LENGTH(ZIP_CODE) = 5 THEN 'Standard 5-digit'
WHEN LENGTH(ZIP_CODE) = 10
AND SUBSTR(ZIP_CODE, 6, 1) = '-' THEN 'ZIP+4 format'
ELSE 'Non-standard'
END AS ZIP_FORMAT
FROM MERIDIAN.CUSTOMERS
ORDER BY ZIP_FORMAT DESC, CUSTOMER_ID;
All our sample ZIP codes are 5-digit format. A real audit might uncover entries like 6103 (leading zero dropped), 06103-1234 (ZIP+4), or even non-numeric entries from data entry errors.
Phase 3: Logical Consistency Checks
Beyond format, data should be logically consistent. The team checks for values that are technically valid but logically suspicious.
Future Dates
No customer should have a CUSTOMER_SINCE date in the future, and no account should have an OPEN_DATE in the future:
SELECT 'CUSTOMERS' AS TABLE_NAME,
CUSTOMER_ID AS RECORD_ID,
FIRST_NAME || ' ' || LAST_NAME AS NAME,
CUSTOMER_SINCE AS SUSPICIOUS_DATE,
'CUSTOMER_SINCE in the future' AS ISSUE
FROM MERIDIAN.CUSTOMERS
WHERE CUSTOMER_SINCE > CURRENT DATE
UNION ALL
SELECT 'ACCOUNTS',
ACCOUNT_ID,
ACCOUNT_NUMBER,
OPEN_DATE,
'OPEN_DATE in the future'
FROM MERIDIAN.ACCOUNTS
WHERE OPEN_DATE > CURRENT DATE
ORDER BY TABLE_NAME, RECORD_ID;
Preview: This query uses UNION ALL, which combines the results of two SELECT statements. We will cover UNION formally in Chapter 8. For now, understand that it stacks one result set on top of another, and the columns must be compatible in number and type.
Age Validation
Customers should be at least 18 years old (for individual accounts) and not impossibly old:
SELECT CUSTOMER_ID,
FIRST_NAME || ' ' || LAST_NAME AS CUSTOMER_NAME,
DATE_OF_BIRTH,
YEAR(CURRENT DATE) - YEAR(DATE_OF_BIRTH) AS APPROX_AGE,
CASE
WHEN DATE_OF_BIRTH IS NULL THEN 'DOB Missing'
WHEN YEAR(CURRENT DATE) - YEAR(DATE_OF_BIRTH) < 18 THEN 'Under 18 - Investigate'
WHEN YEAR(CURRENT DATE) - YEAR(DATE_OF_BIRTH) > 120 THEN 'Impossibly Old - Data Error'
ELSE 'Reasonable'
END AS AGE_CHECK
FROM MERIDIAN.CUSTOMERS
WHERE CUSTOMER_TYPE = 'I'
ORDER BY APPROX_AGE DESC;
Franklin Wu, born in 1955, is the oldest individual customer at approximately 71 years old. All ages are within reasonable bounds.
Account Balance vs. Overdraft Logic
An active account with a negative balance that exceeds its overdraft limit suggests a problem:
SELECT ACCOUNT_NUMBER,
CURRENT_BALANCE,
OVERDRAFT_LIMIT,
CASE
WHEN CURRENT_BALANCE < 0
AND ABS(CURRENT_BALANCE) > OVERDRAFT_LIMIT
THEN 'Over Overdraft Limit'
WHEN CURRENT_BALANCE < 0 THEN 'In Overdraft'
WHEN CURRENT_BALANCE = 0 AND OVERDRAFT_LIMIT = 0 THEN 'Zero Balance, No Cushion'
ELSE 'OK'
END AS BALANCE_CHECK
FROM MERIDIAN.ACCOUNTS
WHERE STATUS = 'A'
ORDER BY CURRENT_BALANCE;
Closed Accounts with Activity
A closed account should not have recent transactions. This check flags any inconsistency:
SELECT A.ACCOUNT_NUMBER,
A.STATUS,
A.CLOSE_DATE,
A.LAST_ACTIVITY_DATE,
CASE
WHEN A.STATUS = 'C' AND A.LAST_ACTIVITY_DATE > A.CLOSE_DATE
THEN 'Activity after closure!'
WHEN A.STATUS = 'C' AND A.CLOSE_DATE IS NULL
THEN 'Closed but no close date'
ELSE 'Consistent'
END AS CONSISTENCY_CHECK
FROM MERIDIAN.ACCOUNTS A
WHERE A.STATUS = 'C'
OR (A.STATUS <> 'C' AND A.CLOSE_DATE IS NOT NULL)
ORDER BY A.ACCOUNT_NUMBER;
The second condition in the WHERE clause catches the reverse issue: an account that has a CLOSE_DATE but is not in 'C' status.
Phase 4: Building the Audit Summary
The team compiles their findings into a summary. Using only the tools from Chapter 5, they can produce a categorized list of issues:
SELECT 'Missing Data' AS CATEGORY,
'Customers without secondary phone' AS FINDING,
CAST(NULL AS INTEGER) AS EXAMPLE_ID
FROM SYSIBM.SYSDUMMY1
WHERE EXISTS (
SELECT 1 FROM MERIDIAN.CUSTOMERS
WHERE PHONE_SECONDARY IS NULL
)
UNION ALL
SELECT 'Missing Data',
'Customers without middle name',
NULL
FROM SYSIBM.SYSDUMMY1
WHERE EXISTS (
SELECT 1 FROM MERIDIAN.CUSTOMERS
WHERE MIDDLE_NAME IS NULL AND CUSTOMER_TYPE = 'I'
)
UNION ALL
SELECT 'Missing Data',
'Customers without address line 2',
NULL
FROM SYSIBM.SYSDUMMY1
WHERE EXISTS (
SELECT 1 FROM MERIDIAN.CUSTOMERS
WHERE ADDRESS_LINE2 IS NULL
)
ORDER BY CATEGORY, FINDING;
Note
This query uses EXISTS subqueries and UNION ALL, which are previewed here but formally covered in Chapters 7 and 8. The pattern is straightforward: each subquery checks whether a specific issue exists in the data, and the results are combined into a single report.
Lessons Learned
This investigation demonstrates several important principles:
1. SQL is a data quality tool, not just a reporting tool. The same SELECT/WHERE/CASE skills that produce business reports can systematically audit millions of records for completeness, consistency, and correctness.
2. NULL is the primary indicator of missing data. Every quality check begins with IS NULL / IS NOT NULL. Understanding NULL behavior (three-valued logic, NULL propagation in expressions) is essential for accurate auditing.
3. CASE expressions are the Swiss Army knife of data classification. Every check in this study used CASE to translate raw data conditions into human-readable findings.
4. String functions enable format validation. LENGTH, SUBSTR, LOCATE, and REPLACE allow pattern-based validation without regular expressions (which DB2 does support through the REGEXP_LIKE function, covered in a later chapter).
5. Date arithmetic catches temporal inconsistencies. Comparing dates against CURRENT DATE and against each other reveals future dates, impossible ages, and timeline violations.
6. Data quality work often reveals modeling issues. The business customer name problem (names stored in FIRST_NAME/LAST_NAME fields designed for individuals) is a schema design concern, not a SQL concern. Data quality audits frequently surface these deeper structural issues.
Discussion Questions
-
The completeness score query in Phase 1 uses integer arithmetic (
* 100 / 7), which truncates the decimal. How would you modify the query to produce a more precise percentage (e.g., 71.4% instead of 71%)? -
The email validation in Phase 2 catches common errors but misses edge cases like double periods (
user@domain..com) or very short domains (user@d.c). Write additional CASE conditions to detect these. -
In a production database with 500,000 customer records, several of these queries would scan the entire CUSTOMERS table. Which columns would benefit from indexes to speed up the data quality checks? Consider which WHERE clause conditions are used most frequently.
-
The regulatory team asks: "What percentage of our customer records are fully complete?" You cannot use COUNT or aggregate functions yet. How would you approach answering this question using only Chapter 5 techniques?
-
Consider the "closed account with activity" check. What business processes might legitimately cause a transaction to post to an account after its close date? (Think about interest calculations, fee reversals, or regulatory holds.)
Case Study 2 for Chapter 5 of "IBM DB2: From First Query to Enterprise Architecture"