Exercises: Data Stewardship and the Chief Data Officer

These exercises progress from concept checks to challenging applications, including Python coding exercises. Estimated completion time: 4-5 hours.

Difficulty Guide:
- * Foundational (5-10 min each)
- ** Intermediate (10-20 min each)
- *** Challenging (20-40 min each)
- **** Advanced/Research (40+ min each)

Part A: Conceptual Understanding *

Test your grasp of core concepts from Chapter 27.

A.1. Section 27.1.1 identifies three developments that made the CDO role necessary: the data explosion, the regulatory surge, and the analytics ambition. For each, explain in one or two sentences why fragmented data management (each department managing its own data) became untenable.

A.2. Ray Zhao describes the CDO role as "being the person who's responsible for the roof but doesn't own any of the walls." Explain this metaphor. What is the "roof"? What are the "walls"? Why is this structure problematic for effective governance?

A.3. Define the terms data owner, data steward, and data custodian (Section 27.2.1). Using a hospital as an example, identify a plausible person or role for each and explain their distinct responsibilities.

A.4. Compare the three stewardship models -- centralized, federated, and hybrid (Sections 27.2.2-27.2.4). For each, identify one organizational context where it would be the best fit and explain why.

A.5. Section 27.3.2 defines a data catalog and lists six categories of information it should contain. Explain why the absence of a data catalog makes it impossible to comply with GDPR Article 15 (right of access).

A.6. What is data lineage (Section 27.4.1), and why does the chapter argue it is essential for ethics? Use the example of a patient record that has been cleaned, normalized, de-identified, aggregated, and fed into a predictive model.

A.7. Section 27.5 discusses data quality as an ethical issue, not just a technical one. Explain why poor data quality can produce ethical harms. Provide one example not found in the chapter.

Part B: Applied Analysis **

Analyze scenarios, arguments, and real-world situations using concepts from Chapter 27.

B.1. Consider the following scenario:

A mid-size retail company has 12 departments, each with its own customer database. The marketing database contains 2.3 million records with email addresses and purchase history. The loyalty program database contains 1.8 million records with demographic data and reward points. The customer service database contains 900,000 records with complaint histories and resolution data. There is no master customer record.

Using the stewardship frameworks from Section 27.2, diagnose the governance problems this creates. Then propose a migration strategy -- which stewardship model would you recommend, and how would you implement it?

B.2. Ray Zhao's anecdote about NovaCorp (Section 27.1.1) -- getting four different answers to "How many customer records do we have?" -- illustrates the visibility problem. Construct a scenario in a healthcare context where the same visibility problem could produce direct patient harm. Be specific about the mechanism of harm.

B.3. Section 27.3.3 argues that a data catalog is "ethical infrastructure." Apply this argument to the consent fiction theme. How does the absence of a data catalog enable consent fictions? How does its presence make those fictions harder to sustain?

B.4. The chapter presents the CDO's evolution through four phases: Technical (2005-2012), Analytical (2012-2017), Regulatory (2017-2021), and Strategic (2021-present). For each phase, identify the key stakeholder the CDO primarily served (IT, business units, legal, the board) and the primary risk of failing in that phase.

B.5. A CDO at a large financial institution reports to the CIO. Explain, using the organizational design principles from Section 27.6, why this reporting structure may constrain the CDO's effectiveness. Propose an alternative and explain its advantages.

B.6. The DataLineageTracker includes an ethical review notes section in its generated report. Review the logic that produces these notes (Section 27.4.2). Identify at least two additional ethical checks that the tracker should perform and explain why they would be valuable.

Part C: Python Coding Exercises **-****

These exercises require you to work with the DataLineageTracker and related dataclasses from Chapter 27.

C.1. ** Basic DataLineageTracker Usage. Create a DataLineageTracker instance for the following data asset: VitraMed's patient demographics dataset, sourced from clinic intake forms, classified as "restricted," currently stored in the vitramed-prod-db-east database, with a retention policy of "7 years from last clinical visit" and a retention expiry date of December 31, 2032. Add at least two transformations and three access log entries. Generate the lineage report and review it.

# Your code here
# Hint: Use the TransformationRecord and AccessRecord dataclasses
# as shown in Section 27.4.2

C.2. ** Retention Compliance Check. Create three DataLineageTracker instances with different retention scenarios:
(a) An asset that expired 90 days ago
(b) An asset that expires in 30 days
(c) An asset with no expiry date set

Call check_retention() on each and print the results. Then write a function retention_audit(trackers: list[DataLineageTracker]) -> dict that takes a list of trackers and returns a summary dictionary with counts of expired, expiring-soon (within 90 days), active, and no-expiry assets.
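The three scenarios are easiest to set up relative to a reference date, and the four audit categories reduce to a few date comparisons. The sketch below illustrates that arithmetic only. The helper `classify_expiry` and its labels are hypothetical, not part of the chapter's DataLineageTracker, and it assumes the expiry is stored as a `datetime.date` (or `None` when unset); adapt it to whatever check_retention() actually returns.

```python
from datetime import date, timedelta
from typing import Optional


def classify_expiry(expiry: Optional[date], today: date, window_days: int = 90) -> str:
    """Hypothetical helper: bucket an expiry date into one of four audit categories."""
    if expiry is None:
        return "no_expiry"
    if expiry < today:
        return "expired"
    if expiry <= today + timedelta(days=window_days):
        return "expiring_soon"
    return "active"


# A fixed reference date keeps the example reproducible; use date.today() in practice.
today = date(2025, 1, 15)
scenarios = {
    "a": today - timedelta(days=90),  # expired 90 days ago
    "b": today + timedelta(days=30),  # expires in 30 days
    "c": None,                        # no expiry date set
}
labels = {key: classify_expiry(expiry, today) for key, expiry in scenarios.items()}
print(labels)  # {'a': 'expired', 'b': 'expiring_soon', 'c': 'no_expiry'}
```

In this sketch an asset that expires today counts as expiring soon rather than expired; whether that boundary is right depends on how the retention policy is worded.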

C.3. *** Access Pattern Analysis. Write a function analyze_access_patterns(tracker: DataLineageTracker) -> dict that analyzes a tracker's access log and returns:
- Total access count by type (read, write, export, delete)
- Number of unapproved writes and exports
- List of unique users who accessed the data
- The most frequent accessor
- Whether any export occurred without approval (boolean flag)

Test your function with a tracker that has at least 10 access log entries, including some unapproved exports.
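Most of the required tallies can be built with `collections.Counter` from the standard library. The sketch below uses plain dictionaries as a stand-in for the chapter's AccessRecord entries; the keys `user`, `access_type`, and `approved` are assumptions and may not match the real field names in Section 27.4.2.

```python
from collections import Counter

# Stand-in access log; the real entries are AccessRecord instances,
# and these key names are assumed for illustration only.
access_log = [
    {"user": "alice", "access_type": "read", "approved": True},
    {"user": "bob", "access_type": "export", "approved": False},
    {"user": "alice", "access_type": "write", "approved": True},
    {"user": "alice", "access_type": "read", "approved": True},
]

# Count accesses per type and per user.
by_type = Counter(entry["access_type"] for entry in access_log)
by_user = Counter(entry["user"] for entry in access_log)

# most_common(1) returns a list holding one (user, count) pair.
most_frequent_user, _ = by_user.most_common(1)[0]

# True if at least one export lacks approval.
unapproved_export = any(
    entry["access_type"] == "export" and not entry["approved"]
    for entry in access_log
)
```

A full solution would read the same values from the tracker's access log and would need to handle an empty log, where `most_common(1)` returns an empty list.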

C.4. *** Data Catalog Builder. Using the DataLineageTracker as a building block, create a DataCatalog class that manages a collection of tracked data assets. The catalog should support:
- add_asset(tracker: DataLineageTracker) -- add a new asset
- search(keyword: str) -> list[DataLineageTracker] -- search assets by name or description
- filter_by_classification(level: str) -> list[DataLineageTracker] -- filter by classification level
- retention_audit() -> str -- generate a summary report of all assets' retention status
- generate_catalog_report() -> str -- produce a formatted inventory of all assets

# Your code here
# The DataCatalog should work like this:
# catalog = DataCatalog(organization="VitraMed")
# catalog.add_asset(patient_demographics_tracker)
# catalog.add_asset(lab_results_tracker)
# print(catalog.generate_catalog_report())

C.5. *** Lineage Chain Visualization. Write a function trace_lineage_chain(trackers: list[DataLineageTracker]) -> str that takes a list of trackers representing connected data assets (where one asset's output feeds into another's input) and produces a text-based visualization of the data flow:

[Patient Intake Forms]
    --> [Patient Demographics (restricted)]
        --> Transformation: De-identification (by Dr. Khoury)
        --> Transformation: Aggregation by clinic
    --> [Aggregate Risk Scores (confidential)]
        --> Transformation: Statistical smoothing
    --> [Insurance Partner Reports (internal)]

Your function should use the tracker names, classifications, and transformation histories to build the chain.
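The indentation pattern in the example can be produced by building a list of lines and joining them. The sketch below covers only that formatting step. It takes plain tuples of (name, classification, transformation descriptions) instead of trackers, and the function name `render_chain` is hypothetical. Extracting those values from each DataLineageTracker, and working out the order of the chain, is the part left to the exercise.

```python
def render_chain(source: str, assets: list[tuple[str, str, list[str]]]) -> str:
    """Format a source and its downstream assets as an indented text chain."""
    lines = [f"[{source}]"]
    for name, classification, transformations in assets:
        # One line per asset, indented four spaces.
        lines.append(f"    --> [{name} ({classification})]")
        # One line per transformation applied to that asset, indented eight spaces.
        for description in transformations:
            lines.append(f"        --> Transformation: {description}")
    return "\n".join(lines)


chain_text = render_chain(
    "Patient Intake Forms",
    [
        ("Patient Demographics", "restricted",
         ["De-identification (by Dr. Khoury)", "Aggregation by clinic"]),
        ("Aggregate Risk Scores", "confidential", ["Statistical smoothing"]),
        ("Insurance Partner Reports", "internal", []),
    ],
)
print(chain_text)
```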

C.6. **** Enhanced DataLineageTracker. Extend the DataLineageTracker with the following features:

(a) A purpose_limitation field that records the specific purpose for which the data was collected, and a check_purpose_compliance(current_use: str) -> bool method that flags when data is being used for a purpose different from the one recorded.
(b) A data_subjects_count field and a check_proportionality(purpose: str) -> str method that evaluates whether the volume of data subjects is proportionate to the stated purpose.
(c) A cross_border_transfers list that logs any transfers to other jurisdictions, and a method that checks whether appropriate safeguards are documented for each transfer.

Test your enhanced tracker with a VitraMed scenario where patient data is transferred to an EU-based analytics partner.
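For part (c), one possible starting point is a small record type for each transfer plus a check that returns the transfers with no documented safeguard. Everything in this sketch is hypothetical: the `TransferRecord` class, its fields, and the `undocumented_transfers` function are illustrations, not part of the chapter's code, and the safeguard text is example data only.

```python
from dataclasses import dataclass


@dataclass
class TransferRecord:
    """Hypothetical record of one cross-border transfer."""
    destination: str      # jurisdiction receiving the data
    recipient: str        # organization receiving the data
    safeguard: str = ""   # documented safeguard; empty means none recorded


def undocumented_transfers(transfers: list[TransferRecord]) -> list[TransferRecord]:
    """Return the transfers that have no safeguard documented."""
    return [t for t in transfers if not t.safeguard.strip()]


transfers = [
    TransferRecord("EU", "Analytics Partner", "standard contractual clauses"),
    TransferRecord("EU", "Backup Vendor"),
]
missing = undocumented_transfers(transfers)
for transfer in missing:
    print(f"No safeguard documented: {transfer.recipient} ({transfer.destination})")
```

This check only confirms that something was written down. Judging whether the documented safeguard is adequate for the destination jurisdiction is a separate question, and one the enhanced tracker should make visible rather than hide.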

Part D: Synthesis & Critical Thinking ***

These questions require you to integrate multiple concepts and think beyond the material presented.

D.1. The chapter presents data stewardship as a governance practice. But critics argue that the stewardship metaphor is itself problematic -- a "steward" manages someone else's property, implying that data subjects are passive beneficiaries of organizational care rather than rights-holders with agency. Evaluate this critique. Does the stewardship model adequately represent the interests of data subjects? What alternative metaphors might better capture the relationship between organizations and the people whose data they manage?

D.2. Ray Zhao notes that the average CDO tenure is 2.5 years -- shorter than any other C-suite role (Section 27.1.3). Drawing on the organizational design principles from Section 27.6 and the culture change discussion from Chapter 26, analyze why CDO tenure is so short. What organizational conditions would be necessary to extend it? Is the high turnover rate a sign that the role is failing, or that it is succeeding in confronting uncomfortable truths?

D.3. The DataLineageTracker is a model implementation -- a simplified version of real enterprise lineage tracking systems. Identify at least three limitations of the implementation as presented and explain how a production system would need to address them. Consider scale, integration, automation, and real-time requirements.

D.4. Section 27.3.3 argues that data catalogs make "ethical ignorance impossible." But is this true? Can an organization have a comprehensive data catalog and still behave unethically? Identify at least two mechanisms through which an organization could maintain a data catalog while circumventing its ethical implications. What additional governance mechanisms are needed to ensure that visibility translates into accountability?

Part E: Research & Extension ****

These are open-ended projects for students seeking deeper engagement.

E.1. CDO in Practice. Research the CDO role at a specific organization (public information is available for many Fortune 500 companies). Write a 1,000-word profile covering: the CDO's reporting line, their mandate, the stewardship model employed, and any publicly available evidence of the CDO's impact on data governance. Evaluate the organization's structure against the principles from Section 27.6.

E.2. Data Catalog Comparison. Research two data catalog tools, commercial or open source (e.g., Alation, Collibra, Apache Atlas, Google Data Catalog). Compare their features against the requirements described in Section 27.3.2. Which tool better supports ethical governance? What features are missing from both?

E.3. Build a Production-Ready DataLineageTracker. Extend the chapter's DataLineageTracker into a more production-ready tool: add database persistence (using SQLite), a web interface (using Flask or Streamlit), and automated retention monitoring that sends email-style alerts for expired or expiring assets. Document your implementation and write a 500-word reflection on what you learned about the gap between educational code and production systems.

Solutions

Selected solutions are available in appendices/answers-to-selected.md.