Exercises: Data Governance Frameworks and Institutions

DataField.Dev

These exercises progress from concept checks to challenging applications. This is a Python chapter — Parts B and C include programming tasks using the DataQualityAuditor class. Estimated completion time: 4-5 hours.

Difficulty Guide:

⭐ Foundational (5-10 min each)
⭐⭐ Intermediate (10-20 min each)
⭐⭐⭐ Challenging (20-40 min each)
⭐⭐⭐⭐ Advanced/Research (40+ min each)

Part A: Conceptual Understanding ⭐

Test your grasp of core concepts from Chapter 22.

A.1. Distinguish between data governance and data management as defined in Section 22.1.2. Provide a concrete example for each: one governance decision and one management activity that would follow from it.

A.2. Section 22.1.3 distinguishes data governance from data protection law. Explain why an organization that is fully compliant with the GDPR might still have poor data governance. What does data governance provide that legal compliance alone does not?

A.3. The DAMA-DMBOK framework identifies eleven knowledge areas (Section 22.2). List at least eight of them. For each, write one sentence describing what it encompasses.

A.4. Define the role of a "data steward" as described in Section 22.3. How does a data steward differ from a database administrator? Why does the chapter argue that data stewardship is a governance function, not a technical function?

A.5. List the six data quality dimensions introduced in Section 22.4 (accuracy, completeness, consistency, timeliness, validity, and uniqueness). For each, provide a one-sentence definition and an example of a data quality problem that dimension would detect.

A.6. Explain the concept of "data lineage" from Section 22.5. Why is tracking where data comes from, how it has been transformed, and where it goes important for governance?

A.7. Section 22.6 describes data governance maturity models. What are the typical maturity levels, and what characterizes an organization at the lowest versus the highest level?

Part B: Applied Analysis and Python Tasks ⭐⭐

Analyze scenarios and implement Python solutions using concepts from Chapter 22.

B.1. Ray Zhao arrives at NovaCorp and finds the "governance framework" described in the chapter opening. Design an improved data governance council structure for NovaCorp. Your design should specify: (a) the council's membership and reporting lines, (b) at least five governance policies the council should establish in its first year, and (c) the metrics the council should track to measure governance effectiveness.

B.2. ⭐⭐ Python Task. The DataQualityAuditor class from Section 22.4 assesses data quality across six dimensions. Write a Python script that creates a sample dataset (a pandas DataFrame) of 500 customer records with intentionally introduced quality issues:

At least 20 missing email addresses (completeness)
At least 15 duplicate customer IDs (uniqueness)
At least 10 phone numbers in inconsistent formats (consistency)
At least 5 records with future dates of birth (validity)
At least 8 records where city and zip code do not match (accuracy)

Then instantiate the DataQualityAuditor and run a quality audit on your dataset. Print the results for each dimension.
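Hint: one way to inject controlled defects is to pick row positions with a seeded random generator and overwrite them. The sketch below covers only the completeness and uniqueness cases; the column names, the seed, and the variable names (customers, missing_rows, donor_rows, target_rows) are illustrative assumptions, not part of the chapter's code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed so the injected issues are reproducible
n = 500

# Start from a clean dataset: unique IDs and a well-formed email for every row.
customers = pd.DataFrame({
    "customer_id": np.arange(1, n + 1),
    "email": [f"user{i}@example.com" for i in range(1, n + 1)],
})

# Completeness: blank out 20 distinct, randomly chosen email addresses.
missing_rows = rng.choice(n, size=20, replace=False)
customers.loc[missing_rows, "email"] = None

# Uniqueness: copy the IDs of the first 15 rows onto 15 other rows.
donor_rows = np.arange(15)
target_rows = rng.choice(np.arange(15, n), size=15, replace=False)
customers.loc[target_rows, "customer_id"] = (
    customers.loc[donor_rows, "customer_id"].to_numpy()
)

print(customers["email"].isna().sum(), "missing emails")
print(customers["customer_id"].duplicated().sum(), "duplicate customer IDs")
```

Counting the injected defects before running the audit gives you a known baseline to compare the auditor's scores against. The remaining issue types (consistency, validity, accuracy) can follow the same select-then-overwrite pattern.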

B.3. ⭐⭐ Python Task. Extend the DataQualityAuditor class to include a timeliness_score() method that accepts a DataFrame, the name of a column containing timestamps, and a max_age_days parameter. The method should return the fraction of records (a float between 0.0 and 1.0) whose timestamps are within the specified age threshold. Test your method on a dataset of transaction records where some records are more than 90 days old.

# Starter code: add this method to DataQualityAuditor (assumes `import pandas as pd`)
def timeliness_score(self, df: pd.DataFrame, column: str, max_age_days: int) -> float:
    """
    Calculate the fraction of records where the timestamp in `column`
    is within `max_age_days` of the current date.

    Returns: float between 0.0 and 1.0
    """
    # Your implementation here
    pass
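Hint: the age comparison is easier to test when the reference time is passed in rather than read from the clock. The sketch below is a standalone helper, not the DataQualityAuditor method itself; the function name fraction_within_age, its now parameter, and the sample dates are assumptions made for illustration. It also assumes timezone-naive timestamps.

```python
import pandas as pd

def fraction_within_age(timestamps: pd.Series, max_age_days: int,
                        now: pd.Timestamp) -> float:
    """Share of records whose timestamp is no older than max_age_days relative to now."""
    parsed = pd.to_datetime(timestamps, errors="coerce")  # unparseable values become NaT
    if len(parsed) == 0:
        return 0.0  # assumption: an empty column scores 0.0 rather than raising
    age = now - parsed
    fresh = age <= pd.Timedelta(days=max_age_days)  # comparisons against NaT are False
    return float(fresh.mean())

# Two recent transactions and two that are well over 90 days old.
reference = pd.Timestamp("2024-06-30")
transactions = pd.Series(["2024-06-01", "2024-05-15", "2024-01-10", "2023-12-25"])
score = fraction_within_age(transactions, max_age_days=90, now=reference)
print(score)  # 0.5
```

Note that a timestamp in the future has a negative age and therefore counts as timely here; whether that is acceptable, or a validity problem to flag separately, is a design decision worth stating in your answer.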

B.4. Mira is tasked with designing a data governance framework for VitraMed as it prepares for its HIPAA audit and EU expansion. VitraMed has: 50 employees, patient health records, clinical trial data, employee HR data, and financial records. Design a governance framework for VitraMed that includes: (a) a governance council structure appropriate for a company of this size, (b) data classification categories with at least four sensitivity levels, (c) access control policies for each classification level, and (d) a data retention schedule for each data type.

B.5. ⭐⭐ Python Task. Write a Python function generate_quality_report() that accepts a DataFrame and produces a formatted quality report. The report should include:

Total records
Completeness score for each column (percentage of non-null values)
Uniqueness score for a specified key column
A summary rating: "Excellent" (all scores above 95%), "Acceptable" (all scores at least 85%), "Needs Improvement" (any score below 85%), or "Critical" (any score below 70%). Check the bands from most to least severe so that exactly one rating applies.

def generate_quality_report(df: pd.DataFrame, key_column: str) -> str:
    """
    Generate a formatted data quality report.

    Args:
        df: The DataFrame to audit
        key_column: The column that should contain unique identifiers

    Returns: A formatted string report
    """
    # Your implementation here
    pass
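Hint: the scores and the rating can be computed separately from the report formatting. The sketch below shows one possible scoring core; the helper name summary_rating, the treatment of boundary values, and the definition of uniqueness as distinct keys divided by total rows are assumptions, and the sample data is invented.

```python
import pandas as pd

def summary_rating(scores: list[float]) -> str:
    """Map percentage scores (0-100) to one rating, checking the most severe band first."""
    worst = min(scores)
    if worst < 70:
        return "Critical"
    if worst < 85:
        return "Needs Improvement"
    if worst > 95:
        return "Excellent"
    return "Acceptable"  # assumption: boundary values such as exactly 85% or 95% land here

records = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "c@example.com", "d@example.com"],
})

completeness = records.notna().mean() * 100  # percent non-null, one entry per column
uniqueness = records["customer_id"].nunique() / len(records) * 100
rating = summary_rating(list(completeness) + [uniqueness])

print(completeness.to_dict(), uniqueness, rating)
```

Rating on the single worst score keeps the bands mutually exclusive: a dataset with one score below 70% is "Critical" even though it also satisfies the "any score below 85%" condition.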

B.6. Section 22.5 describes metadata management and data catalogs. Design a metadata schema for VitraMed's patient records dataset. Your schema should include: (a) technical metadata (column names, data types, constraints), (b) business metadata (descriptions, data owners, sensitivity classification), and (c) operational metadata (creation date, last updated, update frequency, data lineage). Present your schema as a table.

B.7. ⭐⭐ Python Task. Write a Python function consistency_check() that accepts two DataFrames (representing data from two different systems) and a list of common columns, and returns a report of inconsistencies between corresponding records. The function should match records by a shared key column and flag any differences in the common columns.

def consistency_check(
    df1: pd.DataFrame,
    df2: pd.DataFrame,
    key_column: str,
    compare_columns: list
) -> pd.DataFrame:
    """
    Compare two DataFrames on shared columns, returning inconsistencies.

    Returns: DataFrame of records with mismatched values
    """
    # Your implementation here
    pass
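Hint: matching records by key is a join, and the suffixes argument of DataFrame.merge keeps the two systems' values side by side. The sketch below compares a single column on invented data; the system names (crm, billing) and column names are illustrative assumptions.

```python
import pandas as pd

crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Austin", "Boston", "Denver"],
})
billing = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "city": ["Austin", "Chicago", "Miami"],
})

# An inner join keeps only keys present in both systems; suffixes label each source.
matched = crm.merge(billing, on="customer_id", how="inner",
                    suffixes=("_crm", "_billing"))

# Flag rows whose values differ. NaN != NaN evaluates to True in pandas, so two
# missing values would be reported as a mismatch unless they are handled separately.
mismatches = matched[matched["city_crm"] != matched["city_billing"]]
print(mismatches)
```

An inner join silently drops customers 3 and 4, which exist in only one system. If your report should also list those, an outer join with indicator=True marks each row as "left_only", "right_only", or "both".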

Part C: Programming Challenges ⭐⭐-⭐⭐⭐

More advanced Python exercises that build on the DataQualityAuditor.

C.1. ⭐⭐ DataQualityAuditor Enhancement: Validity Rules. Extend the DataQualityAuditor to support custom validity rules. Implement a method add_validity_rule(column, rule_name, rule_func) that allows users to register validation functions for specific columns. Then implement run_validity_checks() that applies all registered rules and returns a report.

Example usage:

auditor = DataQualityAuditor(df)
auditor.add_validity_rule("age", "positive_age", lambda x: x > 0)
auditor.add_validity_rule("age", "reasonable_age", lambda x: x < 150)
auditor.add_validity_rule("email", "has_at_sign", lambda x: "@" in str(x))
report = auditor.run_validity_checks()
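Hint: rules such as lambda x: x > 0 raise a TypeError when they meet a missing value stored as None, so it helps to decide up front how such values are scored. The sketch below is a standalone helper, not part of DataQualityAuditor; the name apply_rule and the choice to count unevaluable values as invalid are assumptions.

```python
import pandas as pd

def apply_rule(values: pd.Series, rule_func) -> pd.Series:
    """Return a boolean Series that is True where rule_func accepts the value."""
    def safe(value) -> bool:
        try:
            return bool(rule_func(value))
        except (TypeError, ValueError):
            return False  # assumption: a value the rule cannot evaluate counts as invalid
    return values.map(safe)

ages = pd.Series([34, -2, None, 151], dtype="object")
passed = apply_rule(ages, lambda x: x > 0)
print(list(passed))  # [True, False, False, True]
```

Counting missing values as validity failures overlaps with the completeness dimension. An alternative is to skip them in validity checks and report them only under completeness; either choice is defensible if your report states it.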

C.2. ⭐⭐⭐ Data Quality Dashboard. Build a Python script that reads a CSV file, runs the DataQualityAuditor across all six quality dimensions, and outputs a formatted text-based dashboard showing:

A bar chart (using text characters) for each dimension's score
Records that fail multiple quality checks (the "worst offenders")
A prioritized list of recommended remediation actions

Test your dashboard on a sample dataset of at least 1,000 records with various quality issues.
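Hint: a text bar chart needs only string repetition and format specifiers. The sketch below renders one row; the function name text_bar, the block characters, and the sample scores are illustrative assumptions, and scores are taken to be fractions between 0 and 1.

```python
def text_bar(label: str, score: float, width: int = 20) -> str:
    """Render a score in [0, 1] as a fixed-width bar followed by a percentage."""
    clamped = max(0.0, min(1.0, score))  # guard against out-of-range inputs
    filled = round(clamped * width)
    return f"{label:<14}{'█' * filled}{'░' * (width - filled)} {clamped:6.1%}"

# Hypothetical scores, used only to demonstrate the layout.
sample_scores = {"completeness": 0.96, "uniqueness": 0.5, "validity": 0.82}
for dimension, value in sample_scores.items():
    print(text_bar(dimension, value))
```

Padding the label to a fixed width keeps the bars aligned, which makes the weakest dimension visible at a glance.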

C.3. ⭐⭐⭐ Data Lineage Tracker. Implement a DataLineageTracker class that records transformations applied to a DataFrame. Each time the DataFrame is modified (filtered, joined, aggregated, columns renamed, etc.), the tracker should log: (a) the operation performed, (b) the timestamp, (c) the number of rows before and after, and (d) the columns affected. Implement a get_lineage() method that returns the complete transformation history.

from typing import Optional

import pandas as pd


class DataLineageTracker:
    def __init__(self, df: pd.DataFrame, source_name: str):
        """Initialize tracker with source DataFrame and name."""
        pass

    def log_transform(self, operation: str, result_df: pd.DataFrame,
                      columns_affected: Optional[list] = None) -> pd.DataFrame:
        """Log a transformation and return the result DataFrame."""
        pass

    def get_lineage(self) -> list:
        """Return the complete lineage as a list of transformation records."""
        pass

    def print_lineage(self) -> None:
        """Print a formatted lineage report."""
        pass
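Hint: before writing the tracker, it can help to fix the shape of a single log entry covering items (a) through (d). The sketch below uses a frozen dataclass so that recorded history cannot be edited afterwards; the class name LineageRecord, its field names, and the sample values are assumptions, not part of the skeleton above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class LineageRecord:
    """One entry in a transformation history (hypothetical structure)."""
    operation: str
    rows_before: int
    rows_after: int
    columns_affected: tuple = ()
    # Timezone-aware UTC timestamp, captured when the record is created.
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

record = LineageRecord("filter: age >= 18", rows_before=1000, rows_after=870,
                       columns_affected=("age",))
print(record.operation, record.rows_before - record.rows_after, "rows removed")
```

With this shape, get_lineage() can return a list of such records, and print_lineage() only has to format them.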

C.4. ⭐⭐⭐ Governance Maturity Assessor. Build a Python tool that implements a simplified data governance maturity assessment. The tool should:

Present a series of yes/no and scored (1-5) questions across the DAMA-DMBOK knowledge areas
Calculate a maturity score for each knowledge area
Generate an overall maturity level (Initial, Managed, Defined, Measured, Optimized)
Produce a prioritized recommendation list based on the lowest-scoring areas
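Hint: the scoring core reduces to averaging answers per knowledge area, mapping the mean to a level, and sorting areas by score. The sketch below assumes every question is scored 1-5 and that each whole-number band maps to one level (1.x is Initial, 2.x is Managed, and so on); that banding, the function name maturity_level, and the sample answers are assumptions you may want to replace.

```python
LEVELS = ["Initial", "Managed", "Defined", "Measured", "Optimized"]

def maturity_level(mean_score: float) -> str:
    """Map a mean score on the 1-5 scale to a level name (assumed whole-number bands)."""
    if not 1.0 <= mean_score <= 5.0:
        raise ValueError("mean_score must be between 1 and 5")
    index = min(int(mean_score) - 1, len(LEVELS) - 1)
    return LEVELS[index]

# Hypothetical answers for three knowledge areas.
area_scores = {
    "Data Quality": [2, 3, 2],
    "Metadata": [4, 4, 5],
    "Data Security": [1, 2, 1],
}
area_means = {area: sum(s) / len(s) for area, s in area_scores.items()}

# Lowest-scoring areas come first in the recommendation list.
priorities = sorted(area_means, key=area_means.get)

for area in priorities:
    print(f"{area}: {area_means[area]:.2f} ({maturity_level(area_means[area])})")
```

Yes/no questions can be folded into the same scale (for example, no as 1 and yes as 5), but that conversion is itself a modeling choice to justify in your write-up.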

Part D: Synthesis & Critical Thinking ⭐⭐⭐

These questions require integration of multiple concepts.

D.1. Ray Zhao's experience at NovaCorp reveals a common pattern: organizations that claim to value data as a "strategic asset" but have no governance structures to back up that claim. Why does this gap between rhetoric and practice exist? Identify at least three organizational, cultural, or economic factors that prevent organizations from implementing effective data governance, even when they recognize its importance.

D.2. Dr. Adeyemi argues that data governance is inherently political — that decisions about data quality standards, access controls, and metadata definitions reflect and reinforce power structures within organizations. Develop this argument using VitraMed as an example. Who gets to define what counts as "quality" patient data? Whose categories are embedded in the metadata schema? How might different stakeholders (clinicians, data scientists, patients, regulators) define data quality differently?

D.3. Section 22.6 presents data governance maturity models as a progression from "initial" (ad hoc, reactive) to "optimized" (continuous improvement, quantitatively managed). Critique this model. Is governance maturity always a linear progression? Can an organization be "optimized" in some knowledge areas and "initial" in others? What external factors (regulatory changes, mergers, data breaches) might cause a mature organization to regress?

D.4. Eli's work on Detroit's data governance ordinance raises the question of whether data governance frameworks designed for corporations (like DAMA-DMBOK) can be adapted for municipal government. Identify three specific ways in which a city government's data governance needs differ from a corporation's. For each difference, explain how the DAMA-DMBOK framework would need to be adapted.

D.5. The chapter argues that data governance and data protection law are distinct but complementary. Using VitraMed's preparation for EU expansion as a case study, explain how implementing a strong data governance framework would make GDPR compliance easier. Identify at least three specific GDPR requirements that are better served by systematic data governance than by ad hoc compliance efforts.

Part E: Research & Extension ⭐⭐⭐⭐

These are open-ended projects for students seeking deeper engagement.

E.1. DAMA-DMBOK Deep Dive. Research the DAMA-DMBOK 2nd edition in detail. Select one of the eleven knowledge areas not covered extensively in the chapter (e.g., Reference and Master Data Management, Data Warehousing and Business Intelligence, or Document and Content Management). Write a 1,000-word summary of that knowledge area, including its key activities, roles, deliverables, and governance implications.

E.2. Data Governance Tools Assessment. Research three commercial or open-source data governance tools (e.g., Collibra, Alation, Apache Atlas, Amundsen, DataHub). For each, describe: (a) its core features, (b) the DAMA-DMBOK knowledge areas it addresses, (c) its strengths and limitations, and (d) the type of organization it is best suited for. Write a comparative analysis (800-1,200 words) recommending one tool for VitraMed and justifying your choice.

E.3. Data Governance in Healthcare. Research data governance practices in the healthcare sector specifically. Examine how organizations like the NHS (UK), Kaiser Permanente (US), or a health system in another country implement data governance. Write a 1,000-word report covering: (a) the unique governance challenges of health data, (b) the governance structures these organizations use, (c) how they balance data quality, patient privacy, and research utility, and (d) what VitraMed could learn from their experience.

Solutions

Selected solutions are available in appendices/answers-to-selected.md.