Quiz: Data Governance Frameworks and Institutions
Test your understanding before moving to the next chapter. Target: 70% or higher to proceed.
Section 1: Multiple Choice (1 point each)
1. Data governance is best defined as:
- A) The technical administration of databases, including backup, recovery, and performance tuning.
- B) The exercise of authority, control, and shared decision-making over the management of data assets, encompassing policies, standards, processes, roles, and metrics.
- C) Compliance with data protection regulations such as the GDPR and HIPAA.
- D) The process of cleaning and transforming data for use in analytics and reporting.
Answer
**B)** The exercise of authority, control, and shared decision-making over the management of data assets, encompassing policies, standards, processes, roles, and metrics. *Explanation:* Section 22.1.1 defines data governance as fundamentally about authority and accountability — who decides how data is managed, according to what rules, and how compliance with those rules is monitored. Option A describes database administration (a data management function). Option C describes a subset of governance (regulatory compliance). Option D describes data engineering/preparation. Governance sets the *rules*; management *implements* them.
2. The DAMA-DMBOK framework identifies how many knowledge areas?
- A) 5
- B) 8
- C) 11
- D) 15
Answer
**C)** 11 *Explanation:* Section 22.2 describes the DAMA-DMBOK as organizing data management into eleven knowledge areas: Data Governance, Data Architecture, Data Modeling and Design, Data Storage and Operations, Data Security, Data Integration and Interoperability, Document and Content Management, Reference and Master Data Management, Data Warehousing and Business Intelligence, Metadata Management, and Data Quality Management. Data Governance sits at the center of the "wheel" diagram, with the other ten areas surrounding it.
3. When Ray Zhao arrived at NovaCorp, the existing "data governance framework" consisted of:
- A) A comprehensive set of policies, trained stewards, and an active governance council.
- B) An outdated classification policy, a database administrator contact spreadsheet, and a two-page aspirational principles document.
- C) Nothing — the company had no data governance documentation of any kind.
- D) A fully implemented DAMA-DMBOK framework with automated quality monitoring.
Answer
**B)** An outdated classification policy, a database administrator contact spreadsheet, and a two-page aspirational principles document. *Explanation:* The chapter opening describes Ray's discovery of NovaCorp's governance gap: a four-year-old classification policy that was never updated, a contact spreadsheet, and a "principles" document that consisted entirely of aspirational statements without implementation details. Ray's assessment — "This is not data governance. This is a wish list." — illustrates the common gap between recognizing data's importance and implementing the governance structures necessary to manage it.
4. A data steward's primary responsibility is:
- A) Writing SQL queries and optimizing database performance.
- B) Serving as the accountable authority for data quality, compliance, and appropriate use within a defined domain.
- C) Approving all data access requests from external parties.
- D) Developing machine learning models using organizational data.
Answer
**B)** Serving as the accountable authority for data quality, compliance, and appropriate use within a defined domain. *Explanation:* Section 22.3 defines data stewards as the individuals accountable for data within their domain — ensuring data quality, enforcing policies, resolving data issues, and acting as the bridge between governance policies and operational practice. Stewardship is a governance function, not a technical one. A data steward for patient records, for example, may be a clinician or health information manager, not a database administrator.
5. Which of the following is NOT one of the six data quality dimensions described in this chapter?
- A) Accuracy
- B) Completeness
- C) Profitability
- D) Timeliness
Answer
**C)** Profitability *Explanation:* Section 22.4 defines six data quality dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness. Profitability is a business metric, not a data quality dimension. While high-quality data may contribute to profitability, the quality dimensions themselves measure the fitness of data for its intended purpose along specific technical and semantic axes.
6. The "completeness" dimension of data quality measures:
- A) Whether data values conform to business rules and constraints.
- B) The proportion of required data fields that are populated (not null, not blank) in a dataset.
- C) Whether data values are identical across different systems.
- D) Whether data reflects the true state of the real-world entity it represents.
Answer
**B)** The proportion of required data fields that are populated (not null, not blank) in a dataset. *Explanation:* Section 22.4 defines completeness as the degree to which all required data is present. A patient record missing a date of birth, an address field left blank, or an email column with null values are all completeness issues. Option A describes validity, option C describes consistency, and option D describes accuracy.
7. The DataQualityAuditor class introduced in this chapter is designed to:
- A) Automatically fix all data quality issues in a dataset.
- B) Programmatically assess data quality across multiple dimensions, producing scores that quantify the state of a dataset.
- C) Replace human data stewards with automated governance.
- D) Enforce GDPR compliance through code.
Answer
**B)** Programmatically assess data quality across multiple dimensions, producing scores that quantify the state of a dataset. *Explanation:* Section 22.4 introduces the `DataQualityAuditor` as a tool for measuring data quality — not for fixing it. The class accepts a DataFrame and calculates scores for each quality dimension (completeness, uniqueness, validity, etc.). It makes abstract quality concepts concrete and measurable. Automated assessment is a governance input — it informs decisions about remediation — but it does not replace human judgment about what quality standards are appropriate or how issues should be resolved.
8. Metadata management, as described in this chapter, involves:
- A) Deleting all metadata from datasets to protect privacy.
- B) Creating and maintaining a comprehensive inventory of information about data assets — including technical metadata, business metadata, and operational metadata.
- C) Converting metadata into structured data for analysis.
- D) Encrypting metadata to prevent unauthorized access.
Answer
**B)** Creating and maintaining a comprehensive inventory of information about data assets — including technical metadata, business metadata, and operational metadata. *Explanation:* Section 22.5 describes metadata management as one of the DAMA-DMBOK's eleven knowledge areas. Effective metadata management includes technical metadata (data types, schemas, constraints), business metadata (definitions, owners, sensitivity classifications), and operational metadata (lineage, update schedules, quality scores). A data catalog is the primary tool for metadata management, providing a searchable inventory that helps users find, understand, and trust data assets.
9. Data lineage refers to:
- A) The age of a dataset, measured in years since creation.
- B) The complete record of where data came from, how it was transformed, and where it has been shared or copied.
- C) The genetic relationship between different versions of a database schema.
- D) The chain of custody for physical storage media.
Answer
**B)** The complete record of where data came from, how it was transformed, and where it has been shared or copied. *Explanation:* Section 22.5 defines data lineage as the traceable history of a data asset through its lifecycle. Lineage answers questions like: Where did this data originate? What transformations were applied? Which systems has it passed through? Who has accessed or modified it? Lineage is essential for compliance (demonstrating that data has been handled lawfully), quality (tracing errors to their source), and accountability (knowing who is responsible at each stage).
10. A data governance maturity model serves to:
- A) Rank companies against their competitors on data governance performance.
- B) Assess an organization's current governance capabilities, identify gaps, and provide a roadmap for improvement across defined maturity levels.
- C) Certify that an organization's data governance meets GDPR requirements.
- D) Determine the market value of an organization's data assets.
Answer
**B)** Assess an organization's current governance capabilities, identify gaps, and provide a roadmap for improvement across defined maturity levels. *Explanation:* Section 22.6 describes maturity models as diagnostic and aspirational tools. They define levels — typically from "Initial" (ad hoc, reactive) through "Managed," "Defined," and "Measured" to "Optimized" (continuous improvement, quantitatively managed) — and allow organizations to assess their current state, identify priority improvement areas, and set realistic targets. They are internal assessment tools, not competitive rankings or regulatory certifications.
Section 2: True/False with Justification (1 point each)
11. "An organization that is fully compliant with the GDPR has, by definition, implemented effective data governance."
Answer
**False.** *Explanation:* Section 22.1.3 explicitly addresses this misconception. Data protection law compliance and data governance are related but distinct. An organization can achieve GDPR compliance through ad hoc, reactive measures — appointing a DPO, writing a privacy policy, responding to DSARs — without implementing the systematic governance structures (governance council, data stewards, quality metrics, metadata management, maturity assessment) that ensure data is managed consistently and sustainably. Compliance is the floor; governance is the framework that sustains compliance and extends beyond it to data quality, usability, and strategic value.
12. "The DAMA-DMBOK framework is a legal requirement that organizations must implement to comply with data protection regulations."
Answer
**False.** *Explanation:* Section 22.2 describes DAMA-DMBOK as an industry standard and body of knowledge — not a legal requirement. No data protection regulation mandates the adoption of DAMA-DMBOK specifically. However, the framework provides a structured approach to implementing the data management practices that regulations require in principle (data quality, security, access control, retention management). Organizations may choose DAMA-DMBOK, COBIT, ISO 8000, or other frameworks — or develop custom approaches. The value is in the structured thinking, not the specific framework.
13. "Data quality can be fully automated — once the right tools are in place, human judgment is no longer needed."
Answer
**False.** *Explanation:* Section 22.4 makes clear that while tools like the `DataQualityAuditor` can automate the *measurement* of data quality, defining what quality *means* — which dimensions matter most, what thresholds are acceptable, how to resolve conflicts between quality dimensions — requires human judgment. A completeness score of 95% might be excellent for one dataset and unacceptable for another, depending on the business context, regulatory requirements, and the consequences of missing data. Tools measure; humans decide.
14. "In a well-governed organization, the data governance council should include only IT and data professionals."
Answer
**False.** *Explanation:* Section 22.3 emphasizes that data governance is a cross-functional responsibility. A governance council composed only of IT and data professionals would lack the business context, legal expertise, and domain knowledge necessary to make effective governance decisions. Effective councils typically include representatives from business units (who understand data use and needs), legal/compliance (who understand regulatory requirements), senior management (who provide executive sponsorship and authority), and IT/data management (who understand technical capabilities and constraints). At VitraMed, for example, clinical staff, regulatory compliance officers, and patient advocates would all have valuable perspectives.
15. "The 'uniqueness' dimension of data quality refers to the absence of duplicate records in a dataset."
Answer
**True.** *Explanation:* Section 22.4 defines uniqueness as the degree to which each entity in a dataset is represented once and only once. Duplicate records — the same patient appearing twice with different IDs, the same transaction recorded in two systems with slight variations — violate uniqueness. Duplicates can cause incorrect aggregations, inaccurate reporting, and compliance problems (e.g., a patient receiving duplicate communications or treatments). The `DataQualityAuditor` measures uniqueness by checking for duplicate values in key columns.
Section 3: Short Answer (2 points each)
16. Explain why the chapter describes data governance as "not a project but a permanent function." What goes wrong when organizations treat governance as a one-time project?
Sample Answer
Data governance is described as a permanent function because data, and the systems that manage it, are continuously changing. New data sources are added, regulations evolve, organizational structures shift, and data quality degrades over time without ongoing maintenance. Organizations that treat governance as a project — implementing policies, setting up a council, and then moving on — typically experience a predictable pattern: initial compliance followed by gradual erosion as policies become outdated, stewardship roles go unfilled, quality monitoring lapses, and institutional memory fades. NovaCorp's four-year-old classification policy that was "never updated" illustrates this perfectly. Governance requires ongoing authority, ongoing accountability, ongoing measurement, and ongoing adaptation — which is why it must be a permanent organizational function, not a one-time initiative.
*Key points for full credit:*
- Explains that data and systems change continuously
- Identifies the pattern of post-project erosion
- Connects to the NovaCorp example or a similar illustration
17. Describe how the DataQualityAuditor class translates abstract data quality dimensions into measurable scores. Use the "completeness" dimension as a specific example.
Sample Answer
The `DataQualityAuditor` takes each abstract quality dimension and operationalizes it as a calculable metric applied to a pandas DataFrame. For completeness, the concept "all required data fields should be populated" is translated into a function that counts non-null values in each column and divides by the total number of records. If a "date_of_birth" column has 480 non-null values out of 500 records, the completeness score for that column is 96%. The auditor can calculate completeness per column, producing a granular view that reveals which fields are most problematic, and an overall completeness score across the entire dataset. This translation from abstract principle to concrete measurement is the core value of the class: it makes quality visible, comparable across time periods, and actionable for remediation prioritization.
*Key points for full credit:*
- Explains the translation from abstract concept to calculable metric
- Uses completeness as a specific, worked example
- Notes the practical value (visibility, comparability, actionability)
18. Section 22.5 describes three types of metadata: technical, business, and operational. For a patient health records dataset at VitraMed, provide one example of each type and explain why it matters for governance.
Sample Answer
**Technical metadata:** The column "blood_pressure_systolic" has data type INTEGER, allowable range 50-300, and is a required field (NOT NULL constraint). This matters for governance because it defines what constitutes a valid entry, enabling automated validation and preventing data entry errors that could affect clinical decisions.
**Business metadata:** The definition of "active patient" is "any individual who has had a clinical encounter within the past 24 months," and the data owner is the Chief Medical Officer. This matters for governance because without a shared definition, different departments might use different criteria, producing inconsistent reports and potentially affecting compliance with regulatory reporting requirements.
**Operational metadata:** The patient records table was last refreshed on 2025-12-15 from the EHR system via a nightly ETL process, with an average latency of 6 hours. This matters for governance because it tells users how current the data is, enabling them to assess whether it is timely enough for their intended use — a clinician making a treatment decision needs more current data than a researcher analyzing annual trends.
*Key points for full credit:*
- Provides a clear example for each of the three metadata types
- Explains the governance significance of each
- Uses the VitraMed health records context
19. Compare the governance challenges facing NovaCorp (a mid-size technology company) and the City of Detroit (a municipal government, as represented by Eli's work). Identify two challenges they share and two challenges unique to each.
Sample Answer
**Shared challenges:** (1) Both lack a starting foundation — NovaCorp's governance is a "wish list" and Detroit's data governance for municipal systems is nascent. Both must build governance structures from scratch, including establishing authority, defining roles, and creating policies. (2) Both face the challenge of governing data across multiple systems and departments, requiring cross-functional coordination and shared standards.
**Unique to NovaCorp:** (1) NovaCorp faces competitive pressure to monetize data quickly, creating tension between governance (which takes time and limits flexibility) and business growth. (2) NovaCorp must satisfy sector-specific requirements for its commercial products, potentially different for each customer's industry.
**Unique to Detroit:** (1) Detroit must govern data collected from citizens who have no contractual relationship with the city and often no knowledge that data is being collected (e.g., Smart City sensors). This creates a democratic accountability requirement that does not apply to NovaCorp's voluntary commercial relationships. (2) Detroit operates under public records laws and transparency requirements that may conflict with data minimization principles — the city may be legally required to retain data that a private company could delete.
*Key points for full credit:*
- Identifies at least two shared challenges with explanation
- Identifies at least two challenges unique to each entity
- Demonstrates understanding of the public/private governance distinction
Section 4: Applied Scenario (5 points)
20. Read the following scenario and answer all parts.
Scenario: HealthFirst Analytics
HealthFirst Analytics is a health-tech startup (35 employees) that has developed a patient readmission prediction model. The company ingests data from six hospital partners, each using a different EHR system. Data arrives in different formats, with different field names for the same concepts (e.g., "DOB," "date_of_birth," "birth_date"), different coding standards (ICD-10 vs. ICD-9 in legacy records), and different levels of completeness.
A recent audit revealed: 12% of patient records are missing diagnosis codes, 8% have duplicate patient IDs (the same patient appearing with different IDs across hospitals), and 15% of records have timestamps more than 48 hours old at the time of model input. The company's CEO says: "We'll clean the data when we have time. Right now, the model needs to ship."
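Audit figures like these are mechanical to compute once the data is in a DataFrame. A minimal sketch, assuming invented column names (`diagnosis_code`, `ingested_at`) and a toy dataset whose percentages differ from the scenario's; note that the exact-ID duplicate check here is the naive version, which misses the cross-hospital duplicates discussed in part (d):

```python
import pandas as pd

# Invented miniature dataset (the real audit percentages would differ)
records = pd.DataFrame({
    "patient_id": ["P1", "P2", "P2", "P4", "P5"],
    "diagnosis_code": ["I10", None, "E11", "I50", None],
    "ingested_at": pd.to_datetime([
        "2025-06-01 00:00", "2025-06-02 00:00", "2025-06-02 00:00",
        "2025-05-28 00:00", "2025-06-02 12:00",
    ]),
})
model_run = pd.Timestamp("2025-06-02 18:00")

# Completeness gap: share of records with no diagnosis code
missing_dx = records["diagnosis_code"].isna().mean()
# Uniqueness gap: share of exact repeat patient IDs (naive check)
dup_ids = records.duplicated(subset=["patient_id"]).mean()
# Timeliness gap: share of records older than 48 hours at model input
stale = (model_run - records["ingested_at"] > pd.Timedelta(hours=48)).mean()

print(f"missing diagnoses: {missing_dx:.0%}, "
      f"duplicate IDs: {dup_ids:.0%}, stale: {stale:.0%}")
```

Each score is a fraction between 0 and 1, so the same code scales from this five-row toy to the full six-hospital feed.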
(a) Using the six data quality dimensions from Section 22.4, identify and classify each quality issue described in the scenario. (1 point)
(b) Explain why the CEO's approach — "clean the data when we have time" — is especially risky in a healthcare context. What specific harms could result from deploying a prediction model trained on data with these quality issues? (1 point)
(c) Design a data governance structure appropriate for a 35-person health-tech company. Specify roles, responsibilities, and reporting lines. (1 point)
(d) Write pseudocode (or Python code) for a DataQualityAuditor method that would detect the duplicate patient ID problem described in the scenario — patients appearing with different IDs across hospitals. Explain why this problem is harder to detect than simple duplicate checking. (1 point)
(e) Propose a minimum viable data quality improvement plan that HealthFirst could implement within 30 days, prioritizing the issues most likely to cause patient harm. Justify your prioritization. (1 point)
Sample Answer
**(a)** Quality issues classified by dimension:
- **Completeness:** 12% of records missing diagnosis codes — required data is absent.
- **Uniqueness:** 8% duplicate patient IDs (same patient, different IDs across hospitals) — entities are represented multiple times, violating the one-entity-one-record principle.
- **Timeliness:** 15% of records have timestamps more than 48 hours old at model input — data may not reflect current patient status.
- **Consistency:** Different field names for the same concept across hospitals (DOB vs. date_of_birth vs. birth_date) — data is represented differently across systems.
- **Validity:** Mixed coding standards (ICD-10 vs. ICD-9) — records may not conform to the expected standard.
**(b)** In healthcare, data quality directly affects patient outcomes. A readmission prediction model trained on incomplete data (12% missing diagnoses) may systematically underpredict risk for patients whose diagnoses were not recorded — potentially leading to premature discharge and preventable readmissions. Duplicate patient records (8%) mean the model may treat one patient's history as two separate patients, losing critical context (prior conditions, medications, allergies). Stale data (15% > 48 hours old) means the model may not reflect recent deterioration or improvement. In a non-healthcare context, these issues cause business inefficiency; in healthcare, they can cause patient harm or death.
**(c)** Governance structure for a 35-person company:
- **Executive sponsor:** CEO or CTO, providing authority and budget for governance activities.
- **Data Governance Lead (part-time):** A senior technical employee responsible for policy development, quality monitoring, and coordination. Reports to the executive sponsor.
- **Domain stewards (2-3):** Clinical data steward (responsible for patient record quality and clinical coding standards), Engineering data steward (responsible for data pipeline integrity and technical metadata), and Compliance steward (responsible for HIPAA compliance and hospital partner agreements).
- **Data Governance Working Group:** Meets biweekly; includes the governance lead, domain stewards, and one representative from each hospital partnership. Reviews quality metrics, resolves data issues, and updates policies.
- **Reporting:** Monthly quality dashboard to the executive team; quarterly governance review.
**(d)** Detecting cross-hospital duplicates is harder than simple duplicate checking because the same patient may have different IDs, different name spellings (e.g., "Robert Smith" vs. "Bob Smith"), and different demographic details across hospitals. A simple `df.duplicated(subset=['patient_id'])` will not catch these cases.

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd


def fuzzy_ratio(a: str, b: str) -> float:
    # Name similarity on a 0-100 scale; a stand-in for a dedicated
    # fuzzy-matching library such as rapidfuzz
    return SequenceMatcher(None, a, b).ratio() * 100


class DataQualityAuditor:
    def detect_cross_hospital_duplicates(self, df, name_col, dob_col, hospital_col):
        """
        Detect probable duplicate patients across hospitals using
        fuzzy matching on name plus an exact match on date of birth.
        """
        potential_dupes = []
        # Group by DOB as an anchor — same birthday is a strong signal
        for dob, group in df.groupby(dob_col):
            if len(group) < 2:
                continue
            # Check pairs within this DOB group that come from different hospitals
            for i, j in combinations(group.index, 2):
                if group.loc[i, hospital_col] != group.loc[j, hospital_col]:
                    # Fuzzy name comparison
                    name_similarity = fuzzy_ratio(
                        str(group.loc[i, name_col]).lower(),
                        str(group.loc[j, name_col]).lower(),
                    )
                    if name_similarity > 85:  # match threshold
                        potential_dupes.append({
                            'record_1': i, 'record_2': j,
                            'name_similarity': name_similarity,
                            'dob': dob,
                        })
        return pd.DataFrame(potential_dupes)
```
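A tiny self-contained demonstration of the failure mode described above (the two rows and their IDs are invented): exact duplicate checking on `patient_id` reports nothing, even though the records almost certainly describe one patient.

```python
import pandas as pd

# Same real-world patient, different IDs and name spellings across hospitals
df = pd.DataFrame({
    "patient_id": ["H1-0042", "H2-7731"],
    "name": ["Robert Smith", "Bob Smith"],
    "dob": ["1980-03-14", "1980-03-14"],
    "hospital": ["General", "St. Mary"],
})

# The naive check sees two distinct patients and reports zero duplicates
exact_dupes = int(df.duplicated(subset=["patient_id"]).sum())
print(exact_dupes)  # 0
```

This is why the fuzzy approach anchors on date of birth and tolerates name variation rather than relying on identifiers.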
**(e)** 30-day priority plan:
1. **Week 1 (Highest priority — patient safety):** Address the duplicate patient ID problem. Implement a master patient index using DOB + name fuzzy matching to link records across hospitals. Duplicates affect model accuracy most dangerously because they fragment patient histories. Justification: A fragmented patient history means the model lacks critical context (allergies, prior conditions, medication interactions).
2. **Weeks 1-2 (High priority):** Implement diagnosis code validation. Flag and route records with missing diagnosis codes for manual review. Establish a process with hospital partners to request missing codes. Justification: Missing diagnoses directly affect the model's ability to predict readmission risk.
3. **Weeks 2-3:** Implement data pipeline timeliness monitoring. Set a maximum acceptable latency threshold (e.g., 24 hours) and alert when data falls below it. Work with hospital partners to reduce data transfer delays. Justification: Stale data means the model may not reflect recent clinical changes.
4. **Weeks 3-4:** Standardize field names and coding standards. Create a mapping layer that normalizes all hospital data to a common schema (ICD-10, standardized field names). Justification: Consistency issues create engineering overhead and increase error risk across all quality dimensions.
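The latency threshold in step 3 could be enforced with a check along these lines (a sketch: the 24-hour threshold comes from the plan, while the column names and timestamps are invented):

```python
import pandas as pd

MAX_LATENCY = pd.Timedelta(hours=24)


def flag_stale_records(df: pd.DataFrame, ts_col: str, now: pd.Timestamp) -> pd.DataFrame:
    """Return the records whose ingestion timestamp exceeds the latency threshold."""
    return df[now - df[ts_col] > MAX_LATENCY]


feed = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3"],
    "ingested_at": pd.to_datetime([
        "2025-06-01 06:00", "2025-05-30 06:00", "2025-06-01 20:00",
    ]),
})
stale = flag_stale_records(feed, "ingested_at", now=pd.Timestamp("2025-06-02 00:00"))
print(stale["patient_id"].tolist())  # records that should trigger an alert
```

In production the same predicate would feed an alerting job rather than a print statement, so hospital partners can be notified when their feed falls behind.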
Scoring & Review Recommendations
| Score Range | Assessment | Next Steps |
|---|---|---|
| Below 50% (< 14 pts) | Needs review | Re-read Sections 22.1-22.4, redo Part A exercises and Python tasks |
| 50-69% (14-19 pts) | Partial understanding | Review specific weak areas, focus on Python exercises in Parts B-C |
| 70-85% (20-23 pts) | Solid understanding | Ready to proceed to Chapter 23 |
| Above 85% (24+ pts) | Strong mastery | Proceed to Chapter 23: Cross-Border Data Flows and Digital Sovereignty |

| Section | Points Available |
|---|---|
| Section 1: Multiple Choice | 10 points (10 questions x 1 pt) |
| Section 2: True/False with Justification | 5 points (5 questions x 1 pt) |
| Section 3: Short Answer | 8 points (4 questions x 2 pts) |
| Section 4: Applied Scenario | 5 points (5 parts x 1 pt) |
| Total | 28 points |