Learning Objectives
- Distinguish data governance from data protection law and explain why both are necessary
- Describe the DAMA-DMBOK framework and identify its eleven knowledge areas
- Design a data governance council structure with appropriate roles, authority, and reporting lines
- Apply data quality dimensions — accuracy, completeness, consistency, timeliness, validity, and uniqueness — to evaluate a dataset
- Implement a Python DataQualityAuditor class that programmatically assesses data quality across multiple dimensions
- Evaluate an organization's data governance maturity and recommend improvements
In This Chapter
- Chapter Overview
- 22.1 What Is Data Governance?
- 22.2 The DAMA-DMBOK Framework
- 22.3 Data Governance Councils and Committees
- 22.4 Data Quality Management
- 22.5 Python in Practice: Building a DataQualityAuditor
- 22.6 Metadata Management and Data Catalogs
- 22.7 Data Governance Maturity Models
- 22.8 NovaCorp's Data Governance Journey
- 22.9 Data Governance in the Public Sector
- 22.10 Chapter Summary
- What's Next
- Chapter 22 Exercises → exercises.md
- Chapter 22 Quiz → quiz.md
- Case Study: NovaCorp's Data Governance Journey → case-study-01.md
- Case Study: Data Governance in Government: The UK's National Data Strategy → case-study-02.md
Chapter 22: Data Governance Frameworks and Institutions
"Data governance is not a project. It is a permanent function that requires ongoing authority, accountability, and stewardship." — DAMA International, DAMA-DMBOK: Data Management Body of Knowledge
Chapter Overview
When Ray Zhao joined NovaCorp as its first Chief Data Officer in 2022, he asked to see the company's data governance framework. The response was a folder containing three documents: a data classification policy written four years earlier and never updated, a spreadsheet listing database administrators with contact information, and a two-page document titled "Data Governance Principles" that consisted entirely of aspirational statements — "NovaCorp treats data as a strategic asset" — with no implementation details.
"This is not data governance," Ray told the CIO. "This is a wish list."
Ray's experience is not unusual. Most organizations recognize that data is important. Far fewer have implemented the structures, processes, and roles necessary to manage data systematically — to ensure that data is accurate, accessible, secure, compliant, and used responsibly. Establishing the authority and accountability for that systematic management is what data governance means.
This chapter is about building that governance. It is distinct from the regulatory compliance examined in Chapters 20, 21, and 25 — though good data governance makes regulatory compliance far easier. Data governance is the internal organizational framework that determines how data is managed throughout its lifecycle. It is about authority (who makes decisions about data), accountability (who is responsible for data quality and compliance), processes (how data is collected, stored, used, and retired), and culture (whether the organization genuinely values data stewardship or merely claims to).
This is also a Python chapter. Data quality — a central component of data governance — can be assessed programmatically, and building a DataQualityAuditor class will make the abstract dimensions of data quality concrete and measurable.
In this chapter, you will learn to:
- Design data governance structures appropriate to an organization's size and complexity
- Apply the DAMA-DMBOK framework to evaluate and improve data management practices
- Assess data quality across six dimensions using both conceptual analysis and code
- Build a Python tool that audits a dataset for quality issues
- Evaluate data governance maturity and develop improvement roadmaps
22.1 What Is Data Governance?
22.1.1 A Definition
Data governance is the exercise of authority, control, and shared decision-making over the management of data assets. It encompasses the policies, standards, processes, roles, and metrics that ensure data is managed as a valuable organizational resource.
A simpler way to think about it: data governance answers the question, "Who gets to do what with which data, according to what rules, and how do we know the rules are being followed?"
22.1.2 Data Governance vs. Data Management
Data governance and data management are related but distinct:
| Concept | Focus | Question |
|---|---|---|
| Data governance | Authority and accountability | Who decides how data is managed? |
| Data management | Operational execution | How is data actually managed day-to-day? |
Data governance sets the rules. Data management implements them. A data governance policy might state: "All customer data must be at least 99.5% accurate, measured quarterly." Data management is the set of operational practices — data entry validation, reconciliation processes, quality monitoring — that achieves that standard.
22.1.3 Data Governance vs. Data Protection Law
Data governance is also distinct from data protection regulation, though they overlap:
| Concept | Scope | Motivation | Enforced By |
|---|---|---|---|
| Data protection law (GDPR, CCPA, etc.) | Legal compliance | Protect individual rights | Government regulators |
| Data governance | Organizational management | Manage data as strategic asset | Internal governance structures |
An organization can be legally compliant — meeting GDPR requirements — while having poor data governance: inaccurate databases, inconsistent definitions, undocumented data flows, no clear ownership. Conversely, an organization with excellent data governance will find regulatory compliance far more achievable because it already knows what data it has, where it is, who is responsible for it, and how it flows.
Connection: The AI Act's requirement for data governance in high-risk AI systems (Chapter 21, Article 10) explicitly links regulatory compliance to organizational data governance practices. You cannot demonstrate that your training data is "relevant, sufficiently representative, and as free of errors as possible" without robust internal data governance.
22.1.4 Why Data Governance Matters
Dr. Adeyemi introduced the governance unit with a thought experiment: "Imagine a hospital that has no governance over its pharmaceutical supplies. Drugs are stored wherever someone finds space. There's no inventory system. Nobody checks expiration dates systematically. Different departments use different names for the same medication. When a doctor prescribes a drug, there's no reliable way to verify what the patient has already been given."
"That's terrifying," Mira said.
"It is. And yet, that's essentially how many organizations manage their data — scattered across systems with no central inventory, no quality controls, no consistent definitions, no clear ownership. We would never accept that for pharmaceuticals. Why do we accept it for data?"
The consequences of poor data governance are concrete and measurable:
- Financial loss. Gartner estimated that poor data quality costs organizations an average of $12.9 million annually. IBM's estimate is higher: $3.1 trillion per year for the US economy.
- Regulatory penalties. Data breaches, inaccurate reporting, and non-compliance with data protection requirements all become more likely without governance.
- Failed analytics. Machine learning models trained on poorly governed data produce unreliable results — a direct link to the bias concerns examined in Chapter 14.
- Erosion of trust. When customers, patients, or citizens discover that their data is poorly managed, trust erodes — and trust, once lost, is difficult to rebuild.
22.2 The DAMA-DMBOK Framework
The most widely referenced data management framework is the Data Management Body of Knowledge (DAMA-DMBOK), published by DAMA International. Now in its second edition, the DMBOK identifies eleven knowledge areas that collectively constitute data management, with data governance at the center.
22.2.1 The Eleven Knowledge Areas
                    ┌─────────────────────┐
                    │   DATA GOVERNANCE   │
                    │    (The Center)     │
                    └─────────┬───────────┘
                              │
          ┌───────────────────┼───────────────────┐
          │                   │                   │
   ┌──────┴──────┐     ┌──────┴──────┐     ┌──────┴──────┐
   │    Data     │     │    Data     │     │    Data     │
   │Architecture │     │ Modeling &  │     │  Storage &  │
   │  & Design   │     │   Design    │     │ Operations  │
   └─────────────┘     └─────────────┘     └─────────────┘
   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
   │    Data     │     │    Data     │     │  Document   │
   │  Security   │     │ Integration │     │  & Content  │
   │             │     │  & Interop  │     │ Management  │
   └─────────────┘     └─────────────┘     └─────────────┘
   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
   │ Reference & │     │    Data     │     │  Metadata   │
   │ Master Data │     │Warehousing &│     │ Management  │
   │    Mgmt     │     │     BI      │     │             │
   └─────────────┘     └─────────────┘     └─────────────┘
                  ┌───────────────┐
                  │ Data Quality  │
                  │  Management   │
                  └───────────────┘
| # | Knowledge Area | Description |
|---|---|---|
| 1 | Data Governance | The central function: establishing authority, accountability, policies, and standards |
| 2 | Data Architecture | Defining the blueprint for data assets as part of enterprise architecture |
| 3 | Data Modeling & Design | Designing data structures (conceptual, logical, physical models) |
| 4 | Data Storage & Operations | Managing the infrastructure for storing and retrieving data |
| 5 | Data Security | Ensuring appropriate authentication, authorization, access, and audit |
| 6 | Data Integration & Interoperability | Moving and combining data across systems and organizations |
| 7 | Document & Content Management | Managing unstructured and semi-structured data (documents, images, etc.) |
| 8 | Reference & Master Data Management | Establishing and maintaining authoritative versions of shared data entities |
| 9 | Data Warehousing & Business Intelligence | Managing analytical data repositories and reporting systems |
| 10 | Metadata Management | Managing data about data — definitions, lineage, quality metrics |
| 11 | Data Quality Management | Ensuring data meets fitness-for-use standards |
22.2.2 The Framework in Practice
The DMBOK is a reference framework, not a prescriptive checklist. Organizations are expected to adapt it to their context. A twelve-person startup like VitraMed will not implement all eleven knowledge areas with the same depth as a multinational financial services company like NovaCorp. But even the smallest organization benefits from explicitly addressing governance, quality, security, and metadata.
Ray Zhao used the DMBOK as a diagnostic tool when he arrived at NovaCorp: "I assessed each of the eleven areas on a simple red/amber/green scale. Five were red. Three were amber. Three were green — and the green ones were mostly green because they fell under IT operations, which at least had established processes. Everything that required cross-functional collaboration — governance, quality, metadata — was red."
Real-World Application: If you work in or with any data-intensive organization, try the same exercise. For each DMBOK knowledge area, assess whether your organization has explicit policies, defined roles, documented processes, and measurable outcomes. The gaps you identify are your data governance improvement roadmap.
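Ray's red/amber/green diagnostic can be run as a few lines of code. The sketch below is illustrative: the knowledge-area names come from the table above, while the `summarize_assessment` helper and the sample ratings are hypothetical.

```python
# Illustrative sketch of a red/amber/green (RAG) self-assessment across the
# eleven DMBOK knowledge areas. Area names follow the chapter's table;
# the ratings below are hypothetical.

DMBOK_AREAS = [
    "Data Governance", "Data Architecture", "Data Modeling & Design",
    "Data Storage & Operations", "Data Security",
    "Data Integration & Interoperability", "Document & Content Management",
    "Reference & Master Data Management",
    "Data Warehousing & Business Intelligence",
    "Metadata Management", "Data Quality Management",
]

def summarize_assessment(ratings: dict) -> dict:
    """Count how many knowledge areas fall into each RAG category."""
    counts = {"red": 0, "amber": 0, "green": 0}
    for area in DMBOK_AREAS:
        # An area that was never assessed defaults to red: an unknown
        # state is itself a governance gap.
        counts[ratings.get(area, "red")] += 1
    return counts

# Hypothetical ratings for a fictional organization
ratings = {
    "Data Security": "green",
    "Data Storage & Operations": "green",
    "Data Warehousing & Business Intelligence": "amber",
    "Data Governance": "red",
    "Data Quality Management": "red",
    "Metadata Management": "red",
}
print(summarize_assessment(ratings))  # → {'red': 8, 'amber': 1, 'green': 2}
```

Every red area the assessment surfaces is a candidate item for the improvement roadmap.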
22.3 Data Governance Councils and Committees
Data governance requires organizational structure — people with defined roles, authority, and accountability. The most common structure involves a multi-layered governance council.
22.3.1 A Three-Tier Model
Tier 1: Executive Data Governance Council
- Composition: CDO (chair), CIO, CISO, CFO, heads of major business units, Chief Privacy Officer, General Counsel
- Frequency: Quarterly
- Authority: Sets data strategy, approves policies, allocates resources, resolves escalated disputes
- Accountability: Reports to the CEO and/or board of directors

Tier 2: Data Governance Working Committee
- Composition: Data governance program manager (chair), domain data stewards, representatives from IT, legal, compliance, and key business units
- Frequency: Monthly
- Authority: Develops and implements policies approved by Tier 1, manages data quality programs, oversees data stewardship
- Accountability: Reports to the Executive Data Governance Council

Tier 3: Domain Data Stewardship Teams
- Composition: Subject matter experts within specific data domains (customer data, financial data, product data, etc.)
- Frequency: Weekly or bi-weekly
- Authority: Manage data quality within their domain, define business rules, maintain data definitions, resolve operational data issues
- Accountability: Report to the Working Committee
22.3.2 Key Roles
| Role | Responsibility | Reports To |
|---|---|---|
| Chief Data Officer (CDO) | Overall accountability for data strategy, governance, and value realization | CEO or C-suite |
| Data Governance Program Manager | Day-to-day management of the governance program | CDO |
| Data Steward | Responsible for data quality and governance within a specific domain | Governance Program Manager (functionally) |
| Data Custodian | Technical management of data storage, security, and infrastructure | CIO/IT leadership |
| Data Owner | Business executive accountable for a specific data domain | Business unit leadership |
22.3.3 Authority vs. Influence
One of the most critical — and most frequently neglected — design decisions in data governance is the question of authority.
A data governance council with advisory authority can recommend that a business unit clean up its customer data. A council with decision-making authority can require it, with escalation paths and consequences for non-compliance.
"I've seen governance programs fail because they had influence but not authority," Ray told Dr. Adeyemi's class during a guest lecture. "People say, 'Data governance is everyone's responsibility,' which sounds nice but in practice means it's nobody's responsibility. Governance needs teeth — budgetary authority, the power to block a project launch if data quality standards aren't met, the ability to say 'no, you cannot use that dataset for that purpose.' Without teeth, it's a committee that meets monthly and accomplishes nothing."
Accountability Gap: The tension between authority and organizational culture is a microcosm of the broader Accountability Gap. When data governance structures lack enforcement power, accountability becomes diffuse — everyone is "responsible," but no one is accountable when things go wrong. Effective governance closes this gap by assigning specific accountability to specific roles for specific data domains.
22.4 Data Quality Management
Data quality is the governance domain with the most direct, measurable impact. Poor data quality leads to poor decisions, failed analytics, regulatory violations, and eroded trust. Understanding — and measuring — data quality requires defining what "quality" means.
22.4.1 The Six Dimensions of Data Quality
| Dimension | Definition | Example of Failure |
|---|---|---|
| Accuracy | Data correctly represents the real-world entity or event it describes | A patient's blood pressure is recorded as 120/80 when it was actually 140/90 |
| Completeness | All required data values are present | 15% of customer records are missing email addresses |
| Consistency | Data does not contradict itself across systems or within a dataset | A patient is listed as "female" in the demographics table but "male" in the billing system |
| Timeliness | Data is sufficiently current for its intended use | A credit risk model uses income data that is three years old |
| Validity | Data conforms to the defined format, type, and range | A date field contains the value "13/32/2024" |
| Uniqueness | Each real-world entity is represented once | The same customer appears four times in the database with slightly different name spellings |
22.4.2 Measuring Data Quality
Data quality is not a binary — data is not simply "good" or "bad." It exists on a continuum, and the acceptable threshold depends on the use case. A marketing email campaign might tolerate 5% incomplete addresses. A medical records system cannot tolerate 5% incomplete allergy information.
Measurement requires:
- Define metrics for each dimension (e.g., "percentage of records with non-null values for required fields")
- Set thresholds based on business requirements (e.g., "completeness must exceed 99% for required clinical fields")
- Measure regularly (continuously or on a defined schedule)
- Report and act on results (dashboards, alerts, remediation processes)
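These four steps can be sketched in a few lines of pandas. The field names, the 99% threshold, and the toy data below are illustrative, not prescriptive:

```python
import pandas as pd

# Step 1: define a metric — percentage of non-null values per required field.
# Step 2: set a threshold based on business requirements (here, 99%).
REQUIRED_FIELDS = ["patient_id", "allergies"]
COMPLETENESS_THRESHOLD = 99.0  # percent; illustrative clinical requirement

# Toy dataset standing in for a real clinical table
df = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003", "P004"],
    "allergies": ["penicillin", None, "none known", "latex"],
})

# Steps 3 and 4: measure (on a schedule, in practice) and report/act.
for column in REQUIRED_FIELDS:
    pct = df[column].notna().mean() * 100
    status = "OK" if pct >= COMPLETENESS_THRESHOLD else "REMEDIATE"
    print(f"{column}: {pct:.1f}% complete -> {status}")
```

On this toy data, `patient_id` passes at 100.0% while `allergies` fails at 75.0%, triggering remediation — exactly the kind of finding a quality dashboard or alert would surface.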
22.4.3 The Data Quality Lifecycle
DEFINE standards → MEASURE current state → MONITOR continuously →
REPORT findings → REMEDIATE issues → PREVENT recurrence
This lifecycle is continuous, not one-time. Data quality degrades over time as people move, businesses change, systems are updated, and manual entry errors accumulate. Governance must include ongoing monitoring, not just initial cleansing.
22.5 Python in Practice: Building a DataQualityAuditor
Data quality assessment can — and should — be automated. In this section, we build a DataQualityAuditor class in Python that evaluates a dataset across three core quality dimensions: completeness, consistency, and uniqueness. This tool demonstrates how governance principles translate into executable code.
22.5.1 Design Goals
Our DataQualityAuditor will:
- Accept a pandas DataFrame as input
- Check completeness — the percentage of non-null values per column
- Check consistency — validate that values in specified columns match expected formats (e.g., email addresses, dates, phone numbers)
- Check uniqueness — detect duplicate rows based on specified key columns
- Generate a structured quality report with dimension scores and an overall quality score
22.5.2 The Complete Implementation
"""
DataQualityAuditor: A data quality assessment tool for governance practitioners.
This module provides a DataQualityAuditor class that evaluates a pandas DataFrame
across three core data quality dimensions: completeness, consistency, and uniqueness.
It generates a structured quality report with per-dimension scores and an overall
quality score.
Part of: Data, Society, and Responsibility — Chapter 22
License: Educational use
Requirements: pandas, re (standard library)
"""
import re
from dataclasses import dataclass, field
import pandas as pd
@dataclass
class QualityReport:
"""Structured report of data quality assessment results.
Attributes:
dataset_name: Identifier for the dataset being audited.
total_rows: Number of rows in the dataset.
total_columns: Number of columns in the dataset.
completeness_scores: Dict mapping column names to their completeness
percentage (0.0 to 100.0).
consistency_results: Dict mapping column names to dicts with 'valid_count',
'invalid_count', 'invalid_examples', and 'score' (0.0 to 100.0).
uniqueness_results: Dict with 'total_rows', 'duplicate_rows',
'duplicate_examples', and 'score' (0.0 to 100.0).
        overall_completeness: Mean completeness across all columns.
        overall_consistency: Mean consistency across checked columns.
        overall_uniqueness: Uniqueness score for the dataset.
overall_score: Combined quality score (0.0 to 100.0).
"""
dataset_name: str
total_rows: int
total_columns: int
completeness_scores: dict = field(default_factory=dict)
consistency_results: dict = field(default_factory=dict)
uniqueness_results: dict = field(default_factory=dict)
overall_completeness: float = 0.0
overall_consistency: float = 0.0
overall_uniqueness: float = 0.0
overall_score: float = 0.0
def summary(self) -> str:
"""Return a human-readable summary of the quality report."""
lines = [
f"=== Data Quality Report: {self.dataset_name} ===",
f"Dataset size: {self.total_rows} rows x {self.total_columns} columns",
"",
"--- Completeness (% non-null per column) ---",
]
for col, score in self.completeness_scores.items():
flag = " [!]" if score < 95.0 else ""
lines.append(f" {col}: {score:.1f}%{flag}")
lines.append(f" OVERALL COMPLETENESS: {self.overall_completeness:.1f}%")
lines.append("")
if self.consistency_results:
lines.append("--- Consistency (format validation) ---")
for col, result in self.consistency_results.items():
flag = " [!]" if result["score"] < 95.0 else ""
lines.append(
f" {col}: {result['score']:.1f}% valid "
f"({result['invalid_count']} invalid){flag}"
)
if result["invalid_examples"]:
examples = result["invalid_examples"][:3]
lines.append(f" Examples of invalid values: {examples}")
lines.append(
f" OVERALL CONSISTENCY: {self.overall_consistency:.1f}%"
)
lines.append("")
if self.uniqueness_results:
lines.append("--- Uniqueness (duplicate detection) ---")
lines.append(
f" Total rows: {self.uniqueness_results['total_rows']}"
)
lines.append(
f" Duplicate rows: {self.uniqueness_results['duplicate_rows']}"
)
if self.uniqueness_results["duplicate_examples"]:
lines.append(" Example duplicate key values:")
for ex in self.uniqueness_results["duplicate_examples"][:3]:
lines.append(f" {ex}")
lines.append(
f" OVERALL UNIQUENESS: {self.overall_uniqueness:.1f}%"
)
lines.append("")
lines.append(f"=== OVERALL QUALITY SCORE: {self.overall_score:.1f}% ===")
grade = (
"EXCELLENT" if self.overall_score >= 95
else "GOOD" if self.overall_score >= 85
else "FAIR" if self.overall_score >= 70
else "POOR"
)
lines.append(f" Grade: {grade}")
return "\n".join(lines)
class DataQualityAuditor:
"""Audits a pandas DataFrame for data quality across multiple dimensions.
The auditor checks three core dimensions:
- Completeness: percentage of non-null values per column
- Consistency: format validation against expected patterns
- Uniqueness: duplicate detection based on key columns
Usage:
>>> auditor = DataQualityAuditor(df, dataset_name="patient_records")
>>> auditor.add_consistency_check("email", r'^[\\w.+-]+@[\\w-]+\\.[\\w.]+$')
>>> auditor.add_consistency_check("phone", r'^\\d{3}-\\d{3}-\\d{4}$')
>>> auditor.set_uniqueness_keys(["patient_id"])
>>> report = auditor.run_audit()
>>> print(report.summary())
"""
# Common format patterns for convenience
PATTERNS = {
"email": r"^[\w.+-]+@[\w-]+\.[\w.]+$",
"phone_us": r"^\d{3}-\d{3}-\d{4}$",
        "date_iso": r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$",  # also validates month/day ranges
"zip_us": r"^\d{5}(-\d{4})?$",
"ssn_masked": r"^\*{3}-\*{2}-\d{4}$",
}
def __init__(self, df: pd.DataFrame, dataset_name: str = "unnamed_dataset"):
"""Initialize the auditor with a DataFrame.
Args:
df: The pandas DataFrame to audit.
dataset_name: A human-readable name for the dataset.
"""
self.df = df.copy()
self.dataset_name = dataset_name
self._consistency_checks: dict[str, str] = {}
self._uniqueness_keys: list[str] = []
def add_consistency_check(self, column: str, pattern: str) -> None:
"""Register a regex pattern to validate values in a column.
Args:
column: The column name to validate.
pattern: A regex pattern that valid values should match.
Raises:
ValueError: If the column does not exist in the DataFrame.
"""
if column not in self.df.columns:
raise ValueError(
f"Column '{column}' not found. "
f"Available columns: {list(self.df.columns)}"
)
self._consistency_checks[column] = pattern
def set_uniqueness_keys(self, key_columns: list[str]) -> None:
"""Set the columns that define a unique record.
Args:
key_columns: List of column names that together should uniquely
identify each row.
Raises:
ValueError: If any column does not exist in the DataFrame.
"""
missing = [c for c in key_columns if c not in self.df.columns]
if missing:
raise ValueError(
f"Columns not found: {missing}. "
f"Available columns: {list(self.df.columns)}"
)
self._uniqueness_keys = key_columns
def _check_completeness(self) -> tuple[dict, float]:
"""Check completeness: percentage of non-null values per column.
Returns:
Tuple of (per-column scores dict, overall completeness score).
"""
scores = {}
for col in self.df.columns:
non_null = self.df[col].notna().sum()
total = len(self.df)
scores[col] = (non_null / total * 100) if total > 0 else 0.0
overall = sum(scores.values()) / len(scores) if scores else 0.0
return scores, overall
def _check_consistency(self) -> tuple[dict, float]:
"""Check consistency: validate column values against registered patterns.
Returns:
Tuple of (per-column results dict, overall consistency score).
"""
if not self._consistency_checks:
return {}, 100.0
results = {}
scores = []
for col, pattern in self._consistency_checks.items():
# Only check non-null values (nulls are a completeness issue)
non_null_mask = self.df[col].notna()
non_null_values = self.df.loc[non_null_mask, col].astype(str)
if len(non_null_values) == 0:
results[col] = {
"valid_count": 0,
"invalid_count": 0,
"invalid_examples": [],
"score": 100.0,
}
scores.append(100.0)
continue
valid_mask = non_null_values.apply(
lambda x: bool(re.match(pattern, x))
)
valid_count = valid_mask.sum()
invalid_count = len(non_null_values) - valid_count
invalid_examples = (
non_null_values[~valid_mask].head(5).tolist()
)
score = (valid_count / len(non_null_values) * 100)
results[col] = {
"valid_count": int(valid_count),
"invalid_count": int(invalid_count),
"invalid_examples": invalid_examples,
"score": score,
}
scores.append(score)
overall = sum(scores) / len(scores) if scores else 100.0
return results, overall
def _check_uniqueness(self) -> tuple[dict, float]:
"""Check uniqueness: detect duplicate rows based on key columns.
Returns:
Tuple of (uniqueness results dict, uniqueness score).
"""
if not self._uniqueness_keys:
return {}, 100.0
duplicated_mask = self.df.duplicated(
subset=self._uniqueness_keys, keep=False
)
duplicate_rows = duplicated_mask.sum()
# Count unique duplicate groups (not individual rows)
duplicate_groups = (
self.df[duplicated_mask]
.groupby(self._uniqueness_keys)
.ngroups
) if duplicate_rows > 0 else 0
# Collect examples of duplicate key values
duplicate_examples = []
if duplicate_rows > 0:
dup_subset = (
self.df[duplicated_mask][self._uniqueness_keys]
.drop_duplicates()
.head(5)
)
for _, row in dup_subset.iterrows():
duplicate_examples.append(dict(row))
total = len(self.df)
        # Score: % of rows remaining after collapsing each duplicate
        # group to a single representative row
unique_count = total - (duplicate_rows - duplicate_groups)
score = (unique_count / total * 100) if total > 0 else 100.0
results = {
"total_rows": total,
"duplicate_rows": int(duplicate_rows),
"duplicate_groups": int(duplicate_groups),
"duplicate_examples": duplicate_examples,
"score": score,
}
return results, score
def run_audit(self) -> QualityReport:
"""Execute the full data quality audit.
Runs all registered checks (completeness, consistency, uniqueness)
and generates a QualityReport.
Returns:
A QualityReport dataclass containing all results and scores.
"""
completeness_scores, overall_completeness = self._check_completeness()
consistency_results, overall_consistency = self._check_consistency()
uniqueness_results, overall_uniqueness = self._check_uniqueness()
# Overall score: weighted average of all dimensions
# Weights: completeness 40%, consistency 30%, uniqueness 30%
overall_score = (
overall_completeness * 0.40
+ overall_consistency * 0.30
+ overall_uniqueness * 0.30
)
report = QualityReport(
dataset_name=self.dataset_name,
total_rows=len(self.df),
total_columns=len(self.df.columns),
completeness_scores=completeness_scores,
consistency_results=consistency_results,
uniqueness_results=uniqueness_results,
overall_completeness=overall_completeness,
overall_consistency=overall_consistency,
overall_uniqueness=overall_uniqueness,
overall_score=overall_score,
)
return report
22.5.3 Example: Auditing a Patient Records Dataset
Let's see the DataQualityAuditor in action with a sample dataset that contains realistic quality issues:
# --- Example: Auditing a sample patient records dataset ---
# Create a sample dataset with intentional quality issues
data = {
"patient_id": ["P001", "P002", "P003", "P004", "P005",
"P003", "P007", "P008", "P009", "P010"],
"name": ["Alice Chen", "Bob Martinez", "Carla Davis", "David Kim",
"Elena Popov", "Carla Davis", "Frank Obi", None,
"Hannah Lee", "Ivan Petrov"],
"email": ["alice@example.com", "bob@example", "carla.d@hospital.org",
"david.kim@clinic.net", None, "carla.d@hospital.org",
"frank obi@mail.com", "grace@hospital.org",
"hannah@clinic.net", "ivan@hospital.org"],
"phone": ["555-123-4567", "555-234-5678", "555-345-6789",
"5554567890", "555-567-8901", "555-345-6789",
"555-678-9012", "555-789-0123", None, "555-901-2345"],
"date_of_birth": ["1985-03-15", "1990-07-22", "1978-11-30",
"1995-01-10", "1982-06-05", "1978-11-30",
"2000-13-01", "1988-04-18", "1992-09-25",
"12/15/1975"],
"blood_type": ["A+", "O-", "B+", None, "AB+",
"B+", "A-", "O+", "B+", "A+"],
}
df = pd.DataFrame(data)
# Initialize the auditor
auditor = DataQualityAuditor(df, dataset_name="patient_records_q4_2025")
# Register consistency checks
auditor.add_consistency_check("email", DataQualityAuditor.PATTERNS["email"])
auditor.add_consistency_check("phone", DataQualityAuditor.PATTERNS["phone_us"])
auditor.add_consistency_check("date_of_birth", DataQualityAuditor.PATTERNS["date_iso"])
# Set uniqueness keys
auditor.set_uniqueness_keys(["patient_id"])
# Run the audit
report = auditor.run_audit()
# Print the report
print(report.summary())
Expected output:
=== Data Quality Report: patient_records_q4_2025 ===
Dataset size: 10 rows x 6 columns
--- Completeness (% non-null per column) ---
patient_id: 100.0%
name: 90.0% [!]
email: 90.0% [!]
phone: 90.0% [!]
date_of_birth: 100.0%
blood_type: 90.0% [!]
OVERALL COMPLETENESS: 93.3%
--- Consistency (format validation) ---
email: 77.8% valid (2 invalid) [!]
Examples of invalid values: ['bob@example', 'frank obi@mail.com']
phone: 88.9% valid (1 invalid) [!]
Examples of invalid values: ['5554567890']
date_of_birth: 80.0% valid (2 invalid) [!]
Examples of invalid values: ['2000-13-01', '12/15/1975']
OVERALL CONSISTENCY: 82.2%
--- Uniqueness (duplicate detection) ---
Total rows: 10
Duplicate rows: 2
Example duplicate key values:
{'patient_id': 'P003'}
OVERALL UNIQUENESS: 90.0%
=== OVERALL QUALITY SCORE: 89.0% ===
Grade: GOOD
22.5.4 Interpreting the Results
Let's walk through what the auditor found:
Completeness issues:
- One patient record (P008) is missing a name — a critical field in health records
- One patient (P005) has no email address
- One patient (P009) has no phone number
- One patient (P004) has no blood type recorded
Each of these gaps represents a governance failure. Were these fields required at data entry? Was there validation? Were the records migrated from a legacy system that lacked these fields?
Consistency issues:
- bob@example lacks a proper domain extension — likely a data entry error
- frank obi@mail.com contains a space — invalid email format
- 5554567890 is a phone number without dashes — valid number, inconsistent format
- 2000-13-01 contains month "13" — an invalid date
- 12/15/1975 uses MM/DD/YYYY format instead of the expected ISO format (YYYY-MM-DD) — inconsistent format, not inaccurate data
Uniqueness issues:
- Patient P003 (Carla Davis) appears twice — a duplicate record. This could lead to fragmented medical histories, duplicate billing, or conflicting treatment records.
Common Pitfall: Students sometimes assume that a high overall quality score means the data is "good enough." But quality thresholds are use-case dependent. A 90% completeness rate might be acceptable for a marketing database but catastrophic for clinical health records, where a missing allergy entry could endanger a patient's life. The auditor provides measurements; governance provides the judgment about what those measurements mean.
22.5.5 Extending the Auditor
The DataQualityAuditor is a foundation. In a production environment, you would extend it to include:
- Accuracy checks: Cross-referencing values against authoritative sources (e.g., validating zip codes against postal databases)
- Timeliness checks: Flagging records that haven't been updated within a specified window
- Cross-column consistency: Verifying that related fields are logically consistent (e.g., a date of birth and an age field should agree)
- Historical trending: Comparing quality scores over time to detect degradation
- Automated alerting: Triggering notifications when scores drop below defined thresholds
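As one example, a standalone timeliness check along the lines of the bullets above might look like the sketch below. The `last_updated` column name, the 365-day window, and the sample dates are assumptions for illustration.

```python
import pandas as pd

def check_timeliness(df: pd.DataFrame, column: str, max_age_days: int,
                     as_of: pd.Timestamp) -> float:
    """Return the percentage of rows updated within the allowed window."""
    updated = pd.to_datetime(df[column], errors="coerce")
    age_days = (as_of - updated).dt.days
    # Unparseable dates become NaT, fail the comparison, and count as stale
    fresh = (age_days <= max_age_days).sum()
    return fresh / len(df) * 100 if len(df) else 100.0

records = pd.DataFrame({
    "last_updated": ["2025-11-01", "2023-02-14", "2025-06-30"],
})
score = check_timeliness(records, "last_updated", max_age_days=365,
                         as_of=pd.Timestamp("2025-12-01"))
print(f"Timeliness: {score:.1f}%")  # one of three records is stale
```

Here one of the three records was last updated more than a year before the audit date, so the score is 66.7%. In a production auditor this would become another `_check_*` method contributing to the weighted overall score.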
Mira, who had built data quality scripts for the OIR, saw immediate relevance: "We have student records where the same student has three different spellings of their name across different systems. Registrar has 'Katherine,' financial aid has 'Katharine,' and the LMS has 'Kathy.' That's a uniqueness and consistency problem that a tool like this could catch — but only if someone bothers to run it."
"And only if someone has the authority to fix it," Ray added during his guest lecture. "Tools find problems. Governance fixes them."
22.6 Metadata Management and Data Catalogs
22.6.1 What Is Metadata Management?
Metadata management — the systematic management of data about data — is the invisible infrastructure that makes all other data governance functions possible. Without metadata, you cannot answer basic governance questions: What data do we have? Where is it stored? Who is responsible for it? Where did it come from? How is it being used?
There are three primary types of metadata:
Technical metadata: Information about data structures, formats, and systems. Database schemas, table definitions, column types, file formats, API specifications.
Business metadata: Information about what data means in business context. Business definitions, data ownership, usage policies, quality rules, regulatory classifications.
Operational metadata: Information about data processing and usage. Load times, access logs, transformation histories, query patterns, data lineage records.
22.6.2 Data Catalogs
A data catalog is a searchable inventory of an organization's data assets — a metadata-driven directory that enables users to find, understand, and assess the fitness of data for their purposes.
A well-implemented data catalog contains:
- Asset inventory: What datasets exist, where they are stored, and how to access them
- Business glossary: Standardized definitions of business terms and data elements
- Data lineage: Visual representation of where data comes from, how it is transformed, and where it goes
- Quality scores: Current data quality metrics for each asset
- Usage information: Who uses which datasets, how frequently, and for what purposes
- Classification: Sensitivity levels, regulatory applicability, retention schedules
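A single catalog entry can be modeled as a small record. The field names below mirror the elements just listed but are an illustrative assumption, not the schema of any particular catalog product:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One asset in a data catalog, mirroring the elements listed above."""
    name: str
    location: str                    # asset inventory: where the data lives
    business_definition: str         # business glossary entry
    upstream_sources: list = field(default_factory=list)  # lineage (inputs)
    quality_score: float = 0.0       # current quality metric (0-1)
    classification: str = "internal" # sensitivity / regulatory label
    owners: list = field(default_factory=list)            # accountable stewards

entry = CatalogEntry(
    name="customer_master",
    location="warehouse.sales.customer_master",
    business_definition="One row per unique customer across all lines of business",
    upstream_sources=["crm.accounts", "billing.customers"],
    quality_score=0.94,
    classification="confidential",
    owners=["Customer Data Steward"],
)
```

The point of the structure is searchability: once every asset carries these fields, questions like "which confidential datasets have a quality score below 0.9?" become simple queries rather than investigations.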
22.6.3 Data Lineage
Data lineage traces the path of data from its origin through its transformations to its current state and downstream uses. It answers the question: "Where did this number come from?"
```
 Source System A         Transformation            Destination
┌──────────────┐        ┌───────────────┐        ┌──────────────┐
│ Patient EHR  │───────>│ ETL Pipeline  │───────>│ Analytics DB │
│ (raw vitals) │        │ (normalize,   │        │ (aggregated  │
│              │        │ filter, join) │        │  risk scores)│
└──────────────┘        └───────────────┘        └──────────────┘
                                │
                                v
                        ┌───────────────┐
                        │  Audit Log    │
                        │ (who, when,   │
                        │ what changed) │
                        └───────────────┘
```
Data lineage matters for governance because:
- Regulatory compliance: GDPR's right to erasure requires knowing everywhere a person's data has been copied and transformed.
- Impact analysis: If a source system changes, lineage reveals which downstream systems and reports will be affected.
- Trust: Analysts and decision-makers need to understand the provenance of the data they rely on.
- Incident response: When a data quality issue is discovered, lineage helps trace it back to its origin.
VitraMed Thread: When VitraMed received its first data protection authority inquiry — a German patient requesting erasure of their data under GDPR Article 17 — the company discovered that it lacked comprehensive data lineage. The patient's data existed in the primary EHR database, a backup system, a de-identified research dataset, and as input to three different ML models. Without lineage tracking, identifying all these locations required a week of manual investigation. VitraMed invested in lineage tooling immediately afterward.
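VitraMed's week of manual investigation is exactly what lineage tooling automates: given a graph of copy-and-transform edges, finding every downstream location of a data subject's records is a reachability query. A minimal sketch, with a hypothetical lineage graph loosely modeled on the VitraMed scenario:

```python
from collections import deque

# Hypothetical lineage graph: each system maps to the systems it feeds.
LINEAGE = {
    "ehr_primary": ["ehr_backup", "deid_research"],
    "deid_research": ["ml_model_a", "ml_model_b", "ml_model_c"],
    "ehr_backup": [],
    "ml_model_a": [], "ml_model_b": [], "ml_model_c": [],
}

def downstream_of(source, graph):
    """All systems reachable from `source`: everywhere the data may have flowed."""
    seen, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# An Article 17 erasure request against the primary EHR must cover:
print(sorted(downstream_of("ehr_primary", LINEAGE)))
# → ['deid_research', 'ehr_backup', 'ml_model_a', 'ml_model_b', 'ml_model_c']
```

The hard part in practice is not the traversal but keeping the edge list accurate, which is why lineage capture is usually built into ETL tooling rather than maintained by hand.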
22.7 Data Governance Maturity Models
22.7.1 What Is a Maturity Model?
A data governance maturity model provides a structured framework for assessing an organization's current governance capabilities and planning improvements. Maturity models typically define stages from ad hoc (no formal governance) to optimized (governance is fully integrated into organizational culture and continuously improving).
22.7.2 A Five-Level Maturity Model
| Level | Name | Characteristics |
|---|---|---|
| 1 | Initial/Ad Hoc | No formal governance. Data management is reactive. Quality issues are addressed individually when they cause visible problems. No defined roles or processes. |
| 2 | Repeatable | Some governance processes exist but are inconsistent. Individual teams may have quality standards but they are not coordinated. Data ownership is informal. |
| 3 | Defined | Formal governance program established. Policies documented. Roles defined (CDO, stewards, custodians). Standards exist but adoption is uneven. Quality is measured but not systematically managed. |
| 4 | Managed | Governance is operational and measured. Quality metrics are tracked against targets. Data catalogs and lineage are maintained. Cross-functional governance council has decision-making authority. Issues are addressed proactively. |
| 5 | Optimized | Governance is embedded in organizational culture. Continuous improvement driven by metrics. Advanced automation (ML-driven quality monitoring, automated classification). Governance adapts to new data types, regulations, and business needs. |
22.7.3 Assessing and Advancing Maturity
Most organizations begin at Level 1 or 2. The journey to Level 3 — establishing a formal program — typically takes 12-18 months. Reaching Level 4 — operational governance with measurable outcomes — takes 2-4 years. Level 5 is aspirational for most organizations and requires sustained executive commitment.
Ray Zhao assessed NovaCorp at Level 1.5 when he arrived. After two years of focused work, he brought it to Level 3 with clear progress toward Level 4. The key investments were:
- Executive sponsorship: The CEO publicly endorsed the governance program and attended the quarterly governance council.
- Quick wins: Ray started with a data quality project in customer master data that produced visible results within 90 days, building organizational credibility.
- Dedicated resources: A governance team of four full-time staff, plus 12 domain data stewards embedded in business units.
- Technology: A commercial data catalog tool, automated quality monitoring, and a metadata management platform.
- Culture change: Monthly "data quality awards" recognizing teams that improved their quality scores, and governance training incorporated into onboarding for all new hires.
"The technology was 20% of the work," Ray told the class. "The organizational change — getting people to care about data quality, to accept governance authority, to participate in stewardship — was 80%. Data governance is a people problem that technology can support but never solve."
Reflection: Think about an organization you have been part of — a university, an employer, a club, a government agency. Where would you place it on the maturity model? What would need to change to move it one level higher? Who would need to champion that change?
22.8 NovaCorp's Data Governance Journey
22.8.1 The Starting Point
When Ray Zhao arrived at NovaCorp — a mid-size financial services company with 2,400 employees — he found a familiar landscape of data governance dysfunction:
- 87 databases across the organization, with no central inventory
- No agreed definition of basic terms. The sales team's "customer" and the finance team's "customer" were defined differently, leading to conflicting reports that eroded trust in analytics
- Data quality problems that cost the company an estimated $4.2 million annually in rework, failed reports, and regulatory findings
- Regulatory pressure: Financial services regulators (OCC, SEC) had cited NovaCorp for inadequate data management in its most recent exam
22.8.2 The Governance Program
Ray's governance program followed a three-phase approach:
Phase 1: Foundation (Months 1-6)
- Established the Executive Data Governance Council with CFO sponsorship
- Hired a Data Governance Program Manager and two Data Quality Analysts
- Conducted an initial data landscape assessment — inventorying all 87 databases
- Defined the top-20 critical data elements (customer ID, account balance, transaction date, etc.) and established authoritative sources

Phase 2: Operationalization (Months 7-18)
- Appointed 12 Domain Data Stewards across business units
- Implemented a commercial data catalog (Alation) and began populating it
- Deployed automated data quality monitoring for the top-20 critical data elements
- Established data quality dashboards visible to executive leadership
- Developed and published data governance policies: data classification, data access, data retention, data quality standards

Phase 3: Maturation (Months 19-36)
- Expanded quality monitoring to cover 200+ data elements
- Implemented data lineage tracking for regulatory reporting data flows
- Achieved passing marks on regulatory data management exams
- Measured ROI: $3.1 million in annual savings from reduced rework and regulatory findings
- Began integrating data governance requirements into new project initiation processes
22.8.3 Lessons Learned
Ray distilled his experience into five principles that he shared with Dr. Adeyemi's class:
1. Start with pain, not principles. "Don't begin by explaining DAMA-DMBOK to the CFO. Begin by showing them the $4.2 million they're losing to bad data. Governance follows from a business case, not an academic framework."
2. Governance needs authority, not just advisory status. "Our governance council can block a project launch if data quality standards aren't met. That power was essential — without it, governance is just recommendations."
3. Stewards must come from the business, not IT. "A data steward for customer data should be someone who understands the business meaning of customer data — a business analyst, a compliance officer — not a database administrator."
4. Measure everything. "If you can't measure data quality, you can't improve it. Dashboards create accountability."
5. This is never 'done.' "Data governance is not a project with an end date. It's a permanent organizational function, like financial accounting. You don't 'finish' accounting."
Consent Fiction: NovaCorp's governance journey reveals a version of the Consent Fiction operating within organizations. Before Ray's program, NovaCorp's data governance policies existed on paper — they "consented" to good governance by writing policies. But the policies were not implemented, not measured, and not enforced. The fiction of governance-by-document is as pervasive as the fiction of consent-by-privacy-policy.
22.9 Data Governance in the Public Sector
22.9.1 Government Data Governance Challenges
Government agencies face distinctive data governance challenges:
- Scale and legacy: Government systems often span decades, with data in mainframe systems, paper archives, and modern cloud platforms simultaneously.
- Interoperability: Data must be shared across agencies with different systems, standards, and cultures.
- Democratic accountability: Government data governance has a public dimension — citizens have a right to know how their data is managed and used.
- Political cycles: Government data governance programs must survive changes in political leadership, which can disrupt long-term initiatives.
22.9.2 The UK National Data Strategy
The UK's National Data Strategy (2020) provides an example of government-level data governance thinking. The strategy identified five "missions":
- Unlocking the value of data across the economy
- Securing a pro-growth data regime (post-Brexit data adequacy)
- Transforming government's use of data to improve public services
- Ensuring data security and resilience
- Building the data foundation — skills, infrastructure, standards
The strategy's emphasis on data as an economic asset drew criticism from privacy advocates who argued that it prioritized commercial value over rights protection. Sofia Reyes, reviewing the strategy for a DataRights Alliance briefing, noted: "When a government strategy uses the word 'value' forty-seven times and 'rights' twelve times, that tells you where the priorities lie."
22.10 Chapter Summary
Key Concepts
- Data governance is the organizational framework of policies, roles, and processes that ensures data is managed as a strategic asset — distinct from data protection law.
- The DAMA-DMBOK identifies eleven knowledge areas constituting data management, with data governance at the center.
- Data governance councils require clear authority (not just advisory status), defined roles (CDO, stewards, custodians), and executive sponsorship.
- Data quality is measured across six dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness.
- The DataQualityAuditor (Python) demonstrates that quality principles can be translated into automated, repeatable checks.
- Metadata management — including data catalogs and data lineage — is the infrastructure that makes governance operationally possible.
- Maturity models provide a framework for assessing current capabilities and planning governance improvements.
Key Debates
- Should data governance be centralized (a governance council with decision-making authority) or federated (distributed across business units with light coordination)?
- Is data governance a business function, a technology function, or both — and who should lead it?
- How should organizations balance the cost of governance (dedicated staff, technology, process overhead) against the cost of not governing (regulatory penalties, failed analytics, eroded trust)?
Applied Framework
When evaluating an organization's data governance:
1. Authority: Does the governance function have decision-making power, or only advisory status?
2. Roles: Are data stewards, custodians, and owners clearly identified for each critical data domain?
3. Quality: Is data quality measured against defined thresholds, monitored continuously, and reported to leadership?
4. Metadata: Does the organization maintain a data catalog with business definitions, lineage, and quality scores?
5. Maturity: What level has the organization reached, and what is the plan for advancement?
6. Culture: Does the organization genuinely value data stewardship — or merely claim to?
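The six evaluation questions lend themselves to a simple pass/fail scorecard. The equal weighting and dimension names below are an illustrative assumption, not a standard instrument:

```python
def governance_scorecard(answers: dict) -> float:
    """Fraction of the six evaluation questions answered 'yes'.

    `answers` maps each dimension name to a boolean; missing dimensions
    count as 'no'.
    """
    dimensions = ["authority", "roles", "quality", "metadata", "maturity", "culture"]
    return sum(bool(answers.get(d)) for d in dimensions) / len(dimensions)

# Example: formal roles and measured quality, but no real authority or culture.
score = governance_scorecard({
    "authority": False, "roles": True, "quality": True,
    "metadata": True, "maturity": True, "culture": False,
})
print(f"{score:.2f}")  # → 0.67
```

A score like this is a conversation starter, not a verdict: as the chapter stresses, the judgment about what the numbers mean belongs to governance, not to the tool.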
What's Next
In Chapter 23: Cross-Border Data Flows and Digital Sovereignty, we examine what happens when data crosses national borders — a challenge of growing urgency in a world of cloud computing, multinational organizations, and data localization mandates. We'll trace the dramatic history of EU-US data transfer frameworks from Safe Harbor through Privacy Shield to the current Data Privacy Framework, analyze the Schrems decisions, and explore the concept of digital sovereignty.
Before moving on, complete the exercises and quiz to solidify your understanding of data governance frameworks.