Case Study: Building a Data Catalog from Scratch

"When we started the catalog project, the most common response was: 'Why do you need to know about my data?' By the end, the most common response was: 'Why didn't we do this years ago?'" -- Data Governance Lead, regional healthcare network

Overview

Building a data catalog is one of the most valuable and most underestimated tasks in data governance. It sounds simple -- inventory your data assets -- but in practice it involves navigating organizational politics, uncovering shadow IT, confronting data classification decisions with real consequences, and building the foundation for every subsequent governance activity. This case study follows a fictional but realistic data catalog implementation at MedBridge Health Partners, a regional healthcare network with 45 clinics, 2,800 employees, and no chief data officer.

Skills Applied:

  • Applying data catalog design principles from Section 27.3
  • Navigating organizational resistance to data governance initiatives
  • Making data classification decisions with ethical implications
  • Connecting catalog creation to practical governance outcomes (access requests, breach response, impact assessments)


The Organization

MedBridge Health Partners

MedBridge is a healthcare network serving approximately 380,000 patients across a metropolitan area and surrounding rural communities. It was formed through successive acquisitions: three formerly independent clinic groups merged over five years, bringing three different electronic health record (EHR) systems, three different patient ID schemes, and three different approaches to data governance (which is to say, three variations of "no formal approach").

Key systems:

  • EHR System A (Epic): Used by 28 clinics, covering approximately 250,000 patients
  • EHR System B (Cerner/Oracle Health): Used by 12 clinics, covering approximately 95,000 patients
  • EHR System C (eClinicalWorks): Used by 5 rural clinics, covering approximately 35,000 patients
  • Data warehouse: A partially populated SQL Server data warehouse that receives feeds from Systems A and B but not System C
  • Analytics environment: A Databricks workspace used by a four-person data science team, containing copies of data from the warehouse plus supplementary datasets (census data, social determinants of health data, payer data)
  • Research database: A de-identified research dataset maintained by the clinical research department
  • Billing systems: Two billing platforms (one from the legacy systems, one current)
  • HR/Payroll: Workday

Additionally, MedBridge's marketing department maintained a customer relationship management (CRM) system with patient contact information, community health event registrations, and marketing campaign response data -- a system that the IT department did not know existed.

The Catalyst

In January 2025, a patient submitted a data subject access request under the state's newly enacted Consumer Health Data Privacy Act. The patient wanted: all health data MedBridge held about them, all parties with whom MedBridge had shared their data, and the purposes for which their data had been used.

The Compliance team spent three weeks attempting to fulfill the request. They could not.

"We found records in System A," the Compliance Director reported to the CEO. "Then we found additional records in the warehouse. Then the data science team told us they might have a copy. Then Research said they had a de-identified version. Then -- I'm not making this up -- Marketing had records from a community health screening event the patient attended two years ago. We have no way to confirm that we've found everything. We cannot honestly certify that our response is complete."

The CEO authorized a data catalog project the following week.


The Project

Phase 1: Discovery (Weeks 1-6)

The catalog project was led by Sarah Chen, a newly hired Data Governance Manager -- MedBridge's first. Sarah had 18 months of experience in data governance at a pharmaceutical company and a clear sense of the challenge.

The discovery process:

Sarah's team conducted structured interviews with the data owners and technical administrators of every known system. For each system, they documented:

| Field | Description |
|---|---|
| System name | The official and informal names |
| Purpose | The primary purpose of the system |
| Data types | Categories of data stored (demographics, clinical, financial, etc.) |
| Data subjects | Who the data is about (patients, employees, vendors) |
| Volume | Approximate number of records |
| Classification | Public, internal, confidential, or restricted |
| Owner | The business function responsible |
| Technical custodian | The IT team or vendor managing the infrastructure |
| Data flows | Where data comes from and where it goes |
| Retention policy | How long data is kept (or "unknown") |
| Regulatory requirements | HIPAA, state laws, contractual obligations |
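The interview fields above map naturally onto a structured catalog record. A minimal sketch in Python -- all class and field names here are illustrative, not MedBridge's actual tooling:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One row of the data catalog, mirroring the discovery interview fields."""
    system_name: str                  # official and informal names
    purpose: str
    data_types: list[str]             # demographics, clinical, financial, ...
    data_subjects: list[str]          # patients, employees, vendors
    volume: int                       # approximate record count
    classification: str               # public / internal / confidential / restricted
    owner: str                        # responsible business function
    technical_custodian: str          # IT team or vendor managing the infrastructure
    sources: list[str] = field(default_factory=list)       # where data comes from
    destinations: list[str] = field(default_factory=list)  # where data goes
    retention_policy: str = "unknown"                      # "unknown" is a legal answer
    regulatory_requirements: list[str] = field(default_factory=list)

# One entry, using figures from the case study
entry = CatalogEntry(
    system_name="EHR System A (Epic)",
    purpose="Primary clinical record for 28 clinics",
    data_types=["demographics", "clinical"],
    data_subjects=["patients"],
    volume=250_000,
    classification="restricted",
    owner="Clinical Operations",
    technical_custodian="EHR support team",
    destinations=["Data Warehouse", "Patient Portal"],
    regulatory_requirements=["HIPAA"],
)
```

Defaulting `retention_policy` to "unknown" deliberately records ignorance rather than hiding it, which is what made MedBridge's retention gap (61 of 67 assets undocumented) visible in the first place.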

What they found:

The discovery process identified 67 data assets -- significantly more than the 15 "major systems" that IT had documented. The gap was filled by:

  • Shadow IT systems (14 discovered): Departmental Access databases, shared Google Sheets with patient lists, a FileMaker Pro database maintained by one clinic's billing specialist for 11 years, an Excel workbook containing patient satisfaction survey data with identifiable comments.
  • Vendor-managed systems (8 discovered): Third-party platforms for telehealth, patient scheduling, prescription management, and patient portal services -- each holding patient data in the vendor's cloud environment.
  • Analytics copies (6 discovered): Datasets extracted from production systems and stored in the analytics environment, research database, or individual team members' local machines.
  • Decommissioned-but-not-deleted systems (4 discovered): Legacy databases from pre-merger systems that were no longer actively used but still contained patient records, accessible to anyone with the old credentials.

The CRM system that Marketing maintained independently was the most significant shadow IT discovery. It contained 45,000 patient records with names, addresses, phone numbers, email addresses, and health screening results -- stored on a platform with no encryption at rest and no access controls beyond a shared username and password.

Phase 2: Classification and Risk Assessment (Weeks 7-10)

With the inventory complete, Sarah's team classified every asset:

| Classification | Criteria | Number of Assets |
|---|---|---|
| Restricted | Contains PHI (Protected Health Information) or PII with health data | 38 |
| Confidential | Contains PII without health data (employee records, financial) | 14 |
| Internal | Non-personal operational data | 11 |
| Public | Information intended for public access | 4 |
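The criteria in the table can be expressed as a decision function that applies the most protective level first. A sketch only -- real classification policies carry more nuance than three booleans:

```python
def classify(has_health_data: bool, has_pii: bool, public_by_intent: bool = False) -> str:
    """Apply the classification criteria in priority order, most protective first."""
    if has_health_data:        # PHI, or PII combined with health data
        return "restricted"
    if has_pii:                # PII without health data
        return "confidential"
    if public_by_intent:       # information intended for public access
        return "public"
    return "internal"          # non-personal operational data

# The marketing CRM: names and contact details plus health screening results
assert classify(has_health_data=True, has_pii=True) == "restricted"
```

Ordering matters: checking the restricted criterion first encodes Sarah's principle that when criteria overlap, the data gets the stronger protection.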

The classification process surfaced difficult decisions:

The research database dilemma. The research database contained "de-identified" patient data. Under HIPAA's Safe Harbor method, the data had been stripped of 18 specified identifiers. But the dataset included detailed clinical information (diagnosis codes, procedure codes, medication lists, lab result ranges) combined with geographic data (three-digit ZIP code) and temporal data (year of service). Research by Sweeney and others has demonstrated that such combinations can enable re-identification. Should the research database be classified as "restricted" (treating the re-identification risk as real) or "confidential" (treating the de-identification as adequate)?

Sarah classified it as restricted. "If there's a plausible re-identification path, we treat the data as if it can identify people. The cost of over-protection is inconvenience. The cost of under-protection is a patient's medical history becoming public."
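The re-identification concern is easy to demonstrate concretely: count how many records share each quasi-identifier combination, and any group of size one is unique, hence potentially re-identifiable. A sketch with toy rows (the field names and values are made up for illustration):

```python
from collections import Counter

def unique_combinations(records, quasi_identifiers):
    """Return quasi-identifier combinations that appear exactly once in the data."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return [combo for combo, n in counts.items() if n == 1]

# Toy "de-identified" rows: 3-digit ZIP + year of service + diagnosis code
rows = [
    {"zip3": "441", "year": 2023, "dx": "E11.9"},
    {"zip3": "441", "year": 2023, "dx": "E11.9"},
    {"zip3": "442", "year": 2022, "dx": "C50.9"},  # unique combination -> risky
]
risky = unique_combinations(rows, ["zip3", "year", "dx"])
```

This is the intuition behind k-anonymity: Safe Harbor removes direct identifiers, but rare combinations of the remaining fields can still single a person out, which is exactly the risk Sarah's "restricted" call hedges against.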

The marketing CRM decision. The CRM system containing 45,000 patient records with health screening results was classified as restricted -- and flagged for immediate remediation. The records were migrated to a secure system with proper access controls within two weeks. The shared credentials were revoked. The marketing department was required to complete HIPAA training.

Phase 3: Lineage Mapping (Weeks 11-16)

Sarah's team documented the data flows between systems, creating a lineage map that showed how patient data moved through MedBridge's infrastructure:

Patient encounter at clinic
    --> EHR System (A, B, or C)
        --> Data Warehouse (Systems A and B only)
            --> Analytics Environment (copies)
                --> Model training datasets
                --> Research Database (de-identified)
            --> Billing Systems
                --> Insurance claims
                --> Collections agencies (for unpaid bills)
        --> Patient Portal (view-only)
        --> Telehealth Platform (for virtual visits)
    --> Marketing CRM (some patients, via health screening events)
    --> Third-party prescription management
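A lineage map like the one above is just a directed graph, and "everywhere this data can end up" is a reachability traversal. A sketch with abbreviated node names (the edge list is illustrative, condensed from the diagram):

```python
# Downstream edges condensed from the lineage map (names abbreviated)
FLOWS = {
    "Clinic encounter": ["EHR", "Marketing CRM", "Rx management"],
    "EHR": ["Data Warehouse", "Patient Portal", "Telehealth"],
    "Data Warehouse": ["Analytics", "Billing"],
    "Analytics": ["Model training", "Research DB"],
    "Billing": ["Insurance claims", "Collections"],
}

def downstream(graph, start):
    """All systems reachable from `start`, i.e. everywhere its data can flow."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# downstream(FLOWS, "Data Warehouse") reaches Collections and Research DB,
# but not Patient Portal -- the kind of fact a breach scoping exercise needs fast.
```

Note that "Collections" is reachable from a warehouse record in two hops, which is how the lineage exercise made flow (3) below visible.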

The lineage map revealed several concerning flows:

  1. System C data was invisible. The five rural clinics using eClinicalWorks were not connected to the data warehouse. Their patient data existed in an isolated silo. If a patient seen at a rural clinic was also seen at an urban clinic, MedBridge had no way to connect those records or ensure consistent governance.

  2. Analytics copies had no expiration. Data extracted to the analytics environment was never deleted. The data science team had copies of patient data dating back to 2019, some of which should have been subject to deletion requests or retention limits.

  3. Insurance claims included clinical detail. The data flow from billing to insurance claims included diagnosis codes and procedure codes -- clinical information transmitted to commercial entities. While this was legally required for reimbursement, the lineage documentation made visible just how much clinical detail left MedBridge's control through routine billing.

Phase 4: Remediation and Governance (Weeks 17-24)

Based on the catalog findings, MedBridge implemented:

Immediate remediation:

  • Marketing CRM migrated to secure platform (2 weeks)
  • Decommissioned legacy systems permanently deleted (4 weeks, after legal review)
  • Analytics environment retention policy implemented: copies expire after 12 months unless renewed with documented justification
  • System C integration project initiated (ongoing, 6-month timeline)

Governance infrastructure:

  • Data governance council formed (representatives from each department, meeting monthly)
  • Access review process: quarterly reviews of who has access to restricted and confidential assets
  • Data sharing agreements: template created for all third-party data sharing, requiring governance review
  • Incident response data map: a simplified version of the catalog optimized for breach response, showing what data exists where and who is affected


Outcomes

The Original Access Request

Six months after the catalog project began, MedBridge received another data subject access request -- from a different patient. This time, the response took three days, not three weeks. The catalog identified every system containing the patient's data. The lineage map traced every flow. The response was complete and certified.
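With a catalog in place, scoping an access request reduces to filtering assets by data subject type and then searching only those systems. A minimal sketch (the catalog rows are illustrative, drawn from the case study's inventory):

```python
# Each catalog row: (asset name, data subject types, classification) -- illustrative
CATALOG = [
    ("EHR System A", {"patients"}, "restricted"),
    ("Data Warehouse", {"patients"}, "restricted"),
    ("Marketing CRM", {"patients"}, "restricted"),
    ("Workday", {"employees"}, "confidential"),
    ("Public website", set(), "public"),
]

def assets_to_search(catalog, subject_type="patients"):
    """Every asset that may hold data about the given subject type."""
    return [name for name, subjects, _ in catalog if subject_type in subjects]

# A patient access request never needs to touch Workday or the public site:
assert "Workday" not in assets_to_search(CATALOG)
```

This filtering step is what collapsed the response time from three weeks of ad hoc searching to three days of targeted retrieval.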

Quantified Impact

| Metric | Before Catalog | After Catalog |
|---|---|---|
| Time to fulfill data access request | 3 weeks | 3 days |
| Known data assets | 15 | 67 |
| Assets with documented retention policy | 4 | 61 |
| Assets with identified data owner | 8 | 67 |
| Shadow IT systems identified and remediated | 0 | 14 |
| Time to scope a potential breach | Unknown | ~4 hours |

Cultural Impact

Sarah reflected on the project's most unexpected outcome: cultural change. "The catalog forced conversations that had never happened. When I asked the marketing director to classify her CRM data, she hadn't thought of it as 'patient data' -- it was 'event registration data.' The classification process made her see it differently. When the research team had to document the lineage of their de-identified dataset, they realized they hadn't thought carefully about the re-identification risk. The catalog didn't just inventory data. It changed how people thought about data."


Discussion Questions

  1. The shadow IT problem. Fourteen shadow IT systems were discovered -- roughly a fifth of all data assets. Why do shadow IT systems proliferate, and what organizational conditions would reduce their occurrence without stifling legitimate departmental needs?

  2. Classification as governance. Sarah's decision to classify the research database as "restricted" despite its de-identification had real consequences: it required additional access controls, training, and oversight. Was this the right call? What framework should guide classification decisions when re-identification risk is uncertain?

  3. The rural clinic gap. System C's isolation meant that rural clinic patients were effectively invisible to MedBridge's governance infrastructure. What are the ethical implications of this invisibility? How might it affect the quality and equity of care?

  4. The marketing CRM. The discovery that 45,000 patient records were stored with a shared password and no encryption is alarming. But the marketing team was not acting maliciously -- they were doing their job with the tools available. How should governance programs address shadow IT without creating a blame culture?

  5. Ongoing maintenance. A data catalog is only useful if it stays current. New systems are adopted, data flows change, assets are created and retired. How should MedBridge ensure the catalog remains accurate over time? What governance processes are needed?


Your Turn: Mini-Project

Option A: Catalog Your Environment. Conduct a mini data catalog exercise for an organization you interact with (your university department, a student organization, a workplace). Identify at least 10 data assets, classify each, and document the data flows between them. Note what was easy to discover and what was hidden.

Option B: Classification Framework. Design a data classification framework for a healthcare organization, specifying criteria for each level (public, internal, confidential, restricted). Include decision rules for edge cases -- like de-identified datasets with re-identification risk, or marketing data that includes health information.

Option C: Lineage Map. Using the DataLineageTracker from Chapter 27, create tracker instances for five related data assets in a healthcare scenario. Add transformations and access records to each. Then write a function that generates a consolidated lineage report showing the relationships between assets.


References

  • Ladley, John. Data Governance: How to Design, Deploy, and Sustain an Effective Data Governance Program. 2nd ed. Amsterdam: Academic Press, 2019.

  • Seiner, Robert S. Non-Invasive Data Governance: The Path of Least Resistance and Greatest Success. Basking Ridge, NJ: Technics Publications, 2014.

  • Sweeney, Latanya. "Simple Demographics Often Identify People Uniquely." Carnegie Mellon University Data Privacy Working Paper 3, 2000.

  • Office for Civil Rights, U.S. Department of Health and Human Services. "Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule." HHS, 2012 (updated 2022).

  • DAMA International. DAMA-DMBOK: Data Management Body of Knowledge. 2nd ed. Bradley Beach, NJ: Technics Publications, 2017.

  • Plotkin, David. Data Stewardship: An Actionable Guide to Effective Data Management and Data Governance. Amsterdam: Academic Press, 2020.