Quiz: Data Stewardship and the Chief Data Officer

Test your understanding before moving to the next chapter. Target: 70% or higher to proceed.


Section 1: Multiple Choice (1 point each)

1. According to the chapter, which three developments converged to make the CDO role necessary?

  • A) Cloud computing, mobile devices, and social media
  • B) The data explosion, the regulatory surge, and the analytics ambition
  • C) GDPR, CCPA, and HIPAA
  • D) AI deployment, cybersecurity threats, and remote work
Answer **B)** The data explosion, the regulatory surge, and the analytics ambition. *Explanation:* Section 27.1.1 identifies these three forces. The data explosion meant organizations had more data than anyone could track. The regulatory surge (GDPR, CCPA, HIPAA enforcement) created legal obligations requiring organization-wide visibility. The analytics ambition meant data needed to be governed and integrated for reliable analysis. Together, these made fragmented data management untenable.

2. The average CDO tenure, according to the NewVantage Partners survey cited in the chapter, is approximately:

  • A) 6 months
  • B) 2.5 years
  • C) 5 years
  • D) 8 years
Answer **B)** 2.5 years. *Explanation:* Section 27.1.3 cites the NewVantage Partners survey: the average CDO tenure is 2.5 years, shorter than any other C-suite role. This short tenure reflects the structural difficulty of the position -- responsible for everything data-related but with direct authority over almost nothing.

3. Which of the following best describes the difference between a data owner, a data steward, and a data custodian?

  • A) They are three names for the same role at different organizational levels
  • B) The owner determines purpose and use; the steward ensures compliance with policies and standards; the custodian manages technical storage and infrastructure
  • C) The owner stores the data; the steward uses the data; the custodian deletes the data
  • D) The owner is the CDO; the steward is the department head; the custodian is the IT administrator
Answer **B)** The owner determines purpose and use; the steward ensures compliance with policies and standards; the custodian manages technical storage and infrastructure. *Explanation:* Section 27.2.1 defines these three distinct roles. The data owner is the business function that determines the purpose and conditions of data use. The data steward is responsible for ensuring that data is managed in accordance with organizational policies and ethical standards. The data custodian is responsible for the physical and technical aspects of data storage and security. In practice these roles may overlap, but the conceptual distinction is important for governance clarity.

4. In a federated stewardship model, who is primarily responsible for data governance?

  • A) A centralized data governance team reporting to the CDO
  • B) Individual business units, each with their own data stewards
  • C) An external auditing firm
  • D) The organization's legal department
Answer **B)** Individual business units, each with their own data stewards. *Explanation:* Section 27.2.3 describes the federated model: data governance responsibility is distributed to individual business units, each with its own data stewards who manage data within their domain. The CDO provides guidelines and coordination but does not directly control departmental practices. This model leverages domain expertise and local buy-in but risks inconsistency and gaps in cross-departmental oversight.

5. The chapter argues that a data catalog is "ethical infrastructure" because:

  • A) It contains the organization's data ethics policy
  • B) Without it, organizations cannot fulfill fundamental data governance obligations like access requests, minimization, and impact assessments
  • C) It is required by GDPR Article 30
  • D) It replaces the need for an ethics committee
Answer **B)** Without it, organizations cannot fulfill fundamental data governance obligations like access requests, minimization, and impact assessments. *Explanation:* Section 27.3.3 argues that a data catalog is ethical infrastructure because it provides the organizational visibility necessary for ethical practice. Without knowing what data exists, where it lives, and how it's used, an organization cannot: fulfill data subject access requests (GDPR Article 15), implement data minimization, conduct meaningful privacy impact assessments, detect unauthorized access, or honor retention commitments. The catalog is not the ethics policy itself but the foundation without which ethical policy is unenforceable.

6. Ray Zhao says: "A data catalog is the conscience of the organization." He means:

  • A) The catalog makes moral judgments about data practices
  • B) The catalog makes ethical ignorance impossible by documenting every asset, eliminating the defense that "we didn't know we had that data"
  • C) The catalog replaces the need for ethical review
  • D) The catalog is only useful for regulatory compliance
Answer **B)** The catalog makes ethical ignorance impossible by documenting every asset, eliminating the defense that "we didn't know we had that data." *Explanation:* Ray's statement (Section 27.3.3) emphasizes that the catalog's value is in making data practices visible. When every asset is documented with its classification, retention policy, access patterns, and lineage, the organization can no longer claim ignorance about what data it holds or how it's used. This visibility is a precondition for accountability -- though, as the chapter notes, visibility alone is insufficient without governance mechanisms that act on what the catalog reveals.

7. The DataLineageTracker's check_retention() method returns three possible statuses. Which of the following is NOT one of them?

  • A) ACTIVE
  • B) EXPIRED
  • C) ARCHIVED
  • D) NO_EXPIRY_SET
Answer **C)** ARCHIVED. *Explanation:* Section 27.4.2 shows the `check_retention()` method returning three statuses: ACTIVE (data is within its retention period), EXPIRED (data has exceeded its retention period), and NO_EXPIRY_SET (no retention expiry date has been defined). "ARCHIVED" is not a status returned by this method -- though in a production system, it might be a useful addition.
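The three statuses can be illustrated with a minimal standalone sketch. This is not the chapter's `DataLineageTracker` code: the function name `retention_status`, its parameters, the plain-string return value, and the choice to treat the expiry date itself as still active are all assumptions made for illustration.

```python
from datetime import date
from typing import Optional


def retention_status(expiry: Optional[date], today: date) -> str:
    """Return one of the three statuses described in Section 27.4.2.

    Hypothetical helper: the chapter's check_retention() may differ in
    signature and return type.
    """
    if expiry is None:
        return "NO_EXPIRY_SET"  # no retention expiry date was defined
    if today > expiry:
        return "EXPIRED"        # data has exceeded its retention period
    return "ACTIVE"             # still within retention (expiry day included, by assumption)


today = date(2025, 1, 15)
print(retention_status(date(2026, 1, 1), today))
print(retention_status(date(2024, 1, 1), today))
print(retention_status(None, today))
```

Note that no branch returns "ARCHIVED"; supporting it would require an extra input, such as an archive flag on the asset.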

8. The DataLineageTracker flags ethical concerns in its report when:

  • A) Any data is classified as "public"
  • B) The data asset has been accessed more than 100 times
  • C) Data classified as confidential or restricted has been exported without documented approval
  • D) The data asset is more than one year old
Answer **C)** Data classified as confidential or restricted has been exported without documented approval. *Explanation:* The `generate_lineage_report()` method (Section 27.4.2) includes three ethical review checks: (1) whether the asset contains sensitive data (confidential or restricted classification), (2) whether data has been exported (and how many times), and (3) whether any write or export access occurred without documented approval. Unapproved exports of sensitive data are flagged as a warning requiring review.
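The three checks described above could look roughly like the following sketch. It is a simplified stand-in, not the chapter's `generate_lineage_report()`: the `AccessEvent` dataclass, the `ethical_review` function, and the keys of the returned dictionary are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

SENSITIVE = {"confidential", "restricted"}


@dataclass
class AccessEvent:
    accessed_by: str
    access_type: str                   # "read", "write", or "export"
    approved_by: Optional[str] = None  # approval is optional, as in log_access()


def ethical_review(classification: str, events: List[AccessEvent]) -> dict:
    """Summarize the three checks: sensitivity, exports, unapproved changes."""
    is_sensitive = classification.lower() in SENSITIVE
    exports = [e for e in events if e.access_type == "export"]
    unapproved = [
        e for e in events
        if e.access_type in {"write", "export"} and e.approved_by is None
    ]
    unapproved_exports = [e for e in unapproved if e.access_type == "export"]
    return {
        "contains_sensitive_data": is_sensitive,
        "export_count": len(exports),
        "unapproved_write_or_export": len(unapproved),
        # Warning condition: sensitive data exported without documented approval.
        "needs_review": is_sensitive and len(unapproved_exports) > 0,
    }


events = [
    AccessEvent("analyst_a", "read"),
    AccessEvent("analyst_b", "export"),               # no approval recorded
    AccessEvent("engineer_c", "write", "data_owner"),
]
report = ethical_review("restricted", events)
print(report)
```

In this sketch an unapproved read is not counted, which mirrors the point that read access may not always require explicit approval.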

9. The hybrid stewardship model combines:

  • A) Internal and external governance
  • B) Centralized standards with federated execution
  • C) Data governance with data analytics
  • D) Manual and automated data management
Answer **B)** Centralized standards with federated execution. *Explanation:* Section 27.2.4 describes the hybrid model as combining the consistency and visibility of centralization with the context and buy-in of federation. The central governance team defines organization-wide policies, while departmental stewards implement these standards within their domains. A data governance council coordinates across boundaries, and the CDO retains authority on cross-departmental and regulatory matters.

10. The CDO role has evolved through four phases. In which phase does the CDO become focused on ethical governance and AI oversight?

  • A) Phase 1: Technical (2005-2012)
  • B) Phase 2: Analytical (2012-2017)
  • C) Phase 3: Regulatory (2017-2021)
  • D) Phase 4: Strategic (2021-present)
Answer **D)** Phase 4: Strategic (2021-present). *Explanation:* Section 27.1.2 traces the CDO's evolution through four phases. The Strategic phase (2021-present) marks the CDO's elevation to data strategy, ethical governance, AI oversight, and organizational transformation, with the CDO reporting to the CEO or board. In this phase, data is seen as a "strategic responsibility" rather than merely IT infrastructure, a business asset, or a legal liability.

Section 2: True/False with Justification (1 point each)

11. "A centralized data stewardship model is always superior to a federated model because it ensures consistency."

Answer **False.** *Explanation:* Section 27.2.2 identifies both strengths and limitations of centralized stewardship. While centralization ensures consistency, it creates bottlenecks, generates resentment from business units, can lose domain-specific context, and scales poorly in large organizations. Federated and hybrid models address these limitations. The chapter presents no model as universally superior -- the best choice depends on organizational size, culture, regulatory environment, and complexity.

12. "Data quality is a technical concern, not an ethical one."

Answer **False.** *Explanation:* Section 27.5 argues explicitly that data quality is an ethical issue. Poor data quality can produce biased algorithmic decisions (training on inaccurate data), misinform clinical decisions (corrupted health records), violate data subject rights (incorrect personal information that the subject cannot correct), and reproduce structural inequalities (systematic undercounting of marginalized populations). The chapter connects data quality directly to ethical outcomes.

13. "The DataLineageTracker's access log requires every access event to have documented approval."

Answer **False.** *Explanation:* The `log_access()` method (Section 27.4.2) includes an optional `approved_by` parameter -- approval is recorded when provided but is not required for all access types. The tracker does, however, flag unapproved write and export access as a concern in the ethical review section of its report. Read access, for instance, may not always require explicit approval.

14. "According to the chapter, where the CDO sits in the organizational structure significantly shapes what they can accomplish."

Answer **True.** *Explanation:* Section 27.6 addresses organizational design explicitly, and this principle is stated in the chapter's learning objectives. A CDO reporting to the CIO is constrained to technical data management. A CDO reporting to the General Counsel is focused on compliance. A CDO reporting to the CEO or board has strategic influence over the full range of data practices. The reporting line determines the CDO's scope, authority, and ability to influence organizational culture.

15. "Data lineage tracking is only necessary for organizations subject to GDPR."

Answer **False.** *Explanation:* While GDPR creates specific legal requirements for data documentation, Section 27.4.1 argues that lineage tracking is an ethical obligation for any organization handling personal data. Lineage enables accountability (knowing who accessed what and why), supports impact assessments, facilitates incident response (knowing what data was affected in a breach), and ensures that transformations that introduce bias or distortion can be identified and corrected. These benefits apply regardless of regulatory jurisdiction.

Section 3: Short Answer (2 points each)

16. Explain the concept of "shadow IT" and why it is a significant governance challenge for CDOs. How does a data catalog help address it?

Sample Answer

Shadow IT refers to databases, spreadsheets, applications, and data repositories maintained by individuals or teams outside official IT channels -- without the knowledge or oversight of the central data governance function.

Shadow IT is a governance challenge because it creates data assets that are invisible to the CDO: they are not documented in any inventory, not subject to organizational security policies, not included in retention schedules, and not discoverable during data subject access requests or impact assessments.

A data catalog helps address shadow IT by establishing a comprehensive inventory that makes the gap between documented and undocumented data visible. When the catalog is the authoritative source of what data the organization holds, shadow IT systems become conspicuous by their absence -- and governance processes can target their integration or elimination.

*Key points for full credit:*
- Defines shadow IT accurately
- Explains the governance risks (invisibility, non-compliance)
- Describes how a data catalog addresses the problem

17. The chapter argues that data lineage matters for ethics because "transformations can introduce bias, distort meaning, and obscure provenance." Provide one example of each mechanism using a healthcare data scenario.

Sample Answer

**Transformation introducing bias:** A patient dataset is cleaned by removing records with missing values. If patients from rural clinics are more likely to have incomplete records (due to older data entry systems), the cleaned dataset underrepresents rural patients -- introducing demographic bias into any model trained on it.

**Transformation distorting meaning:** Patient lab results are normalized to a 0-1 scale for model input. The normalization uses reference ranges derived from a predominantly male patient population. For female patients, whose normal reference ranges differ for some tests, the normalization distorts the clinical meaning of the values -- a result that is normal for a woman may appear abnormal after normalization.

**Transformation obscuring provenance:** Patient data from three clinics is merged into a single training dataset. During the merge, the clinic-of-origin field is dropped to simplify the schema. When the resulting model performs poorly for patients from one clinic (because that clinic's data had different coding conventions), the source of the problem is invisible because the lineage was not preserved.

*Key points for full credit:*
- One example per mechanism (bias, distortion, obscured provenance)
- Healthcare context for each
- Clear explanation of the mechanism of harm
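The first mechanism, dropping records with missing values, can be made concrete with a short sketch. The records below are synthetic and invented purely for illustration; the field names `clinic` and `hba1c` and the helper `clinic_share` are assumptions, not material from the chapter.

```python
from collections import Counter

# Synthetic records, invented for illustration only.
records = (
    [{"clinic": "urban", "hba1c": 6.1}] * 60
    + [{"clinic": "rural", "hba1c": 6.4}] * 20
    + [{"clinic": "rural", "hba1c": None}] * 20  # incomplete rural records
)


def clinic_share(rows):
    """Return each clinic type's share of the given rows."""
    counts = Counter(r["clinic"] for r in rows)
    total = sum(counts.values())
    return {clinic: n / total for clinic, n in counts.items()}


before = clinic_share(records)

# The "cleaning" step: drop every record with a missing lab value.
cleaned = [r for r in records if r["hba1c"] is not None]
after = clinic_share(cleaned)

print(f"rural share before cleaning: {before['rural']:.2f}")
print(f"rural share after cleaning:  {after['rural']:.2f}")
```

Rural patients make up 40% of the synthetic records before cleaning and 25% afterwards. Recording both figures alongside the transformation is the kind of lineage detail that lets a later reviewer spot the shift.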

18. What is the "visibility problem" (Section 27.3.1), and why does the chapter describe it as the "single most common data governance failure"?

Sample Answer

The visibility problem is the inability of organizations to know what data they possess: where it is stored, how it is structured, who has access, how long it has been retained, and how it is being used.

The chapter describes it as the most common governance failure because it is both fundamental and widespread. Organizations accumulate data over years across dozens of systems, often without central documentation. Legacy databases persist after the teams that created them move on. Acquisitions bring entire data ecosystems that are never cataloged. Shadow IT creates invisible repositories.

Without visibility, every other governance practice fails: you cannot minimize data you do not know you have, protect data you cannot locate, or delete data you cannot find. The GDPR implementation rush of 2018 exposed this problem when many organizations discovered they could not answer basic questions about their own data holdings.

*Key points for full credit:*
- Defines the visibility problem accurately
- Explains why it is fundamental (precondition for all other governance)
- Provides at least one concrete consequence

19. Ray Zhao says that when he asked NovaCorp "How many customer records do we have?" he received four different answers from four different departments. Explain why this is not merely an inconvenience but a governance and ethical concern.

Sample Answer

Four different answers to a basic question about customer data reveal that NovaCorp has no authoritative, shared understanding of its own data.

This is a governance concern because it means the organization cannot: accurately scope impact assessments (how many people are affected?), respond completely to regulatory inquiries (what data do you hold?), or consistently apply retention and access policies across departments.

It is an ethical concern because it means NovaCorp cannot know whether it is honoring its obligations to data subjects. If a customer requests deletion of their data under GDPR Article 17, can NovaCorp be confident that all instances have been deleted when it doesn't even know how many instances exist? If a breach occurs, can NovaCorp accurately notify all affected individuals when it doesn't know which records overlap across departments? The inconsistency signals a deeper failure: the absence of organizational accountability for data stewardship.

*Key points for full credit:*
- Identifies governance implications (compliance, impact assessment)
- Identifies ethical implications (data subject rights, breach response)
- Connects to the visibility problem

Section 4: Applied Scenario (5 points)

20. Read the following scenario and answer all parts.

Scenario: HealthSync Analytics

HealthSync Analytics is a health-tech startup that aggregates patient data from 200 clinics to build predictive models for chronic disease management. The company has grown rapidly from 30 to 250 employees in two years. It has no CDO and no data catalog. Data is managed by individual teams:

  • The Data Engineering team manages the data warehouse and ETL pipelines.
  • The Data Science team builds predictive models using data they copy from the warehouse into their own analytics environment.
  • The Product team maintains a customer-facing dashboard that pulls from both the warehouse and the analytics environment.
  • The Compliance team maintains a spreadsheet tracking which clinics have signed data processing agreements.

Last month, a clinic requested deletion of all its patient data under a state privacy law. The Compliance team sent the request to Data Engineering, who deleted the records from the warehouse. Two weeks later, the Data Science team ran an analysis using a copy of the data that still contained the deleted clinic's records. The analysis was shared with a pharmaceutical partner as part of a research collaboration.

(a) Using the data lifecycle framework and the visibility problem concept from Section 27.3, explain what went wrong. Identify at least three governance failures. (1 point)

(b) What stewardship model (centralized, federated, or hybrid) would you recommend for HealthSync, and why? Consider the company's size, growth rate, and regulatory obligations. (1 point)

(c) Design a minimal data catalog entry for "Patient Demographics" at HealthSync. Include: name, source, classification, location(s), retention policy, data owner, data steward, and applicable regulations. (1 point)

(d) Write Python pseudocode (or actual code using the DataLineageTracker) demonstrating how lineage tracking could have prevented the deletion failure. Show the key operations that would have flagged the problem. (1 point)

(e) The pharmaceutical partner now holds data from a clinic that has exercised its deletion rights and whose patients never consented to pharmaceutical research. What are HealthSync's ethical obligations at this point? Consider notification, remediation, and systemic prevention. (1 point)

Sample Answer

**(a)** Three governance failures:

1. **No data catalog.** HealthSync had no inventory of where patient data existed across the organization. When the deletion request came, no one knew that copies of the data lived in the Data Science team's analytics environment.
2. **No lineage tracking.** The data's journey from the warehouse to the analytics environment was not documented. There was no record that the Data Science team had copied the data, so the deletion process could not identify all instances.
3. **No cross-team deletion protocol.** The deletion request was handled by a single team (Data Engineering) without a process for verifying that all copies across the organization had been located and deleted. The Compliance team's spreadsheet tracked clinic agreements but not data locations.

**(b)** A **hybrid model** is recommended. HealthSync is growing rapidly and already has specialized teams with domain expertise -- which favors some federation. But the deletion failure demonstrates the need for centralized standards, a unified data catalog, and cross-team coordination -- which requires centralization. A hybrid model would: (1) establish a central data governance function (led by a CDO hire) that defines standards, maintains the catalog, and coordinates cross-team processes, (2) embed data stewards within each team to implement standards locally, and (3) create a data governance council with representatives from all teams.

**(c)** Data catalog entry:

| Field | Value |
|-------|-------|
| **Name** | Patient Demographics |
| **Source** | Clinic intake forms via HealthSync API integration |
| **Classification** | Restricted (contains PHI) |
| **Locations** | vitramed-prod-warehouse (primary); ds-analytics-env (copy); product-dashboard-cache (read replica) |
| **Retention Policy** | 7 years from last clinical visit per HIPAA; clinic-level deletion upon clinic termination |
| **Data Owner** | VP of Clinical Partnerships |
| **Data Steward** | Data Governance Lead (to be hired) |
| **Applicable Regulations** | HIPAA, state privacy laws (varies by clinic location), data processing agreements with each clinic |

**(d)** Python pseudocode:
```python
# Assumes the DataLineageTracker class from Section 27.4.2 is defined or imported.

# When Data Science copies data, the copy is registered and the export is logged:
patient_demographics = DataLineageTracker(
    name="Patient Demographics - DS Copy",
    source="Copied from vitramed-prod-warehouse",
    description="Copy for chronic disease modeling",
    data_classification="restricted",
    current_location="ds-analytics-env",
    retention_policy="Same as source: 7 years from last visit",
)
patient_demographics.log_access(
    accessed_by="data_science_team",
    purpose="Copy for model training",
    access_type="export",
    approved_by="data_engineering_lead",
)

# When a deletion request arrives, a catalog search finds ALL copies
# (catalog.search is pseudocode, not a DataLineageTracker method):
# catalog.search("Patient Demographics")
# Returns: [warehouse instance, DS copy, dashboard cache]
# The deletion process iterates over all instances, not just the warehouse.

# After deletion, a verification pass confirms every instance is gone.
# Note: "DELETED" is a hypothetical extra status; check_retention() as shown in
# Section 27.4.2 returns only ACTIVE, EXPIRED, or NO_EXPIRY_SET.
# for tracker in all_instances:
#     assert tracker.check_retention()["status"] == "DELETED"
```
**(e)** HealthSync's ethical obligations:

**Notification:** HealthSync must notify the pharmaceutical partner that the shared dataset contains data from a clinic that has exercised deletion rights, and that the data must be deleted from the partner's systems. HealthSync should also notify the clinic that its data was inadvertently shared after the deletion request.

**Remediation:** The pharmaceutical partner must delete the affected records. Any analysis already conducted using those records should be reviewed and, if published or acted upon, corrected. HealthSync should offer to re-run the analysis with the correct dataset.

**Systemic prevention:** HealthSync must implement a data catalog and lineage tracking system, establish a cross-team deletion protocol that identifies all copies before confirming deletion, and require data processing agreements with research partners that include deletion obligations for recalled data.
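A cross-team deletion protocol of the kind described under systemic prevention might be sketched as follows. This is an illustrative outline under stated assumptions, not the chapter's implementation: the `COPIES` mapping, both function names, and the `partner-research-share` location are hypothetical, while the other location names are borrowed from the sample catalog entry in part (c).

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage edges: source location -> locations copied from it.
COPIES: Dict[str, List[str]] = {
    "vitramed-prod-warehouse": ["ds-analytics-env", "product-dashboard-cache"],
    "ds-analytics-env": ["partner-research-share"],  # invented for this sketch
}


def all_locations(root: str, copies: Dict[str, List[str]]) -> List[str]:
    """Breadth-first walk returning the root plus every downstream copy."""
    seen: Set[str] = {root}
    order: List[str] = [root]
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in copies.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def unconfirmed_deletions(
    root: str, copies: Dict[str, List[str]], confirmed: Set[str]
) -> List[str]:
    """List every location that still has to confirm deletion."""
    return [loc for loc in all_locations(root, copies) if loc not in confirmed]


# In the scenario, only the warehouse deletion was carried out.
pending = unconfirmed_deletions(
    "vitramed-prod-warehouse", COPIES, confirmed={"vitramed-prod-warehouse"}
)
print(pending)
```

A deletion request would be closed only once `pending` is empty. In the HealthSync scenario the list would still have named the analytics environment, which is the copy that later reached the pharmaceutical partner.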

Scoring & Review Recommendations

| Score Range | Assessment | Next Steps |
|-------------|------------|------------|
| Below 50% (< 14 pts) | Needs review | Re-read Sections 27.1-27.3, redo Part A exercises |
| 50-69% (14-19 pts) | Partial understanding | Review specific weak areas, attempt Python exercises |
| 70-85% (20-23 pts) | Solid understanding | Ready to proceed to Chapter 28 |
| Above 85% (> 23 pts) | Strong mastery | Proceed to Chapter 28: Privacy Impact Assessments and Ethical Reviews |

| Section | Points Available |
|---------|------------------|
| Section 1: Multiple Choice | 10 points (10 questions x 1 pt) |
| Section 2: True/False with Justification | 5 points (5 questions x 1 pt) |
| Section 3: Short Answer | 8 points (4 questions x 2 pts) |
| Section 4: Applied Scenario | 5 points (5 parts x 1 pt) |
| Total | 28 points |