
> "Privacy is not something that I'm merely entitled to, it's an absolute prerequisite."

Learning Objectives

  • Explain Ann Cavoukian's seven foundational principles of Privacy by Design and apply them to system architecture decisions
  • Define data minimization and articulate its role in responsible data governance
  • Distinguish between anonymization, pseudonymization, and de-identification and evaluate the limitations of each
  • Define k-anonymity, implement a check for it in Python, and explain its vulnerabilities
  • Describe differential privacy at a conceptual level and explain the role of the epsilon parameter
  • Identify and compare major privacy-enhancing technologies including homomorphic encryption, secure multi-party computation, and federated learning
  • Apply Privacy by Design principles to evaluate a real-world data system

Chapter 10: Privacy by Design and Data Minimization

"Privacy is not something that I'm merely entitled to, it's an absolute prerequisite." — Marlon Brando

Chapter Overview

In Chapter 9, we examined the troubled state of data collection and consent -- the notice-and-consent model that supposedly protects privacy but in practice often serves as a legal fig leaf. We documented consent fatigue, dark patterns, and the fundamental asymmetry between the organizations that draft privacy policies and the individuals who "agree" to them without reading a word.

This chapter asks a different question: What if, instead of relying on users to protect themselves, we built privacy into the architecture of data systems from the start?

This is the core insight of Privacy by Design -- a framework developed by Ann Cavoukian that shifts the burden from data subjects (who are asked to read, understand, and negotiate terms) to data controllers (who design the systems that collect, store, and process data). It is a move from reactive privacy protection (responding to breaches and complaints) to proactive privacy architecture (preventing harm before it occurs).

But principles require implementation. This chapter will take you from the philosophical foundations of Privacy by Design through the technical mechanisms that make it possible -- anonymization, pseudonymization, k-anonymity, differential privacy, and a suite of privacy-enhancing technologies. Along the way, you'll write Python code to test whether a dataset meets anonymization standards, and you'll discover why even sophisticated technical approaches have limits that demand continued ethical judgment.

In this chapter, you will learn to:

  • Apply Privacy by Design as an architectural philosophy, not just a compliance checklist
  • Implement data minimization practices across the data lifecycle
  • Evaluate the strengths and weaknesses of anonymization techniques
  • Write Python code to check k-anonymity and understand why it can fail
  • Explain differential privacy to a non-technical audience
  • Assess when privacy-enhancing technologies are appropriate and when they are insufficient


10.1 Ann Cavoukian's Privacy by Design

10.1.1 Origins and Context

In 1995, Ann Cavoukian -- then the Information and Privacy Commissioner of Ontario, Canada -- introduced Privacy by Design (PbD) as a response to what she saw as a fundamental failure of the prevailing approach to data protection. That approach, which we examined in Chapter 9, was reactive: it set rules about what organizations could do with data after collection, required consent, and imposed penalties for violations. Cavoukian argued that reactive regulation would always be outpaced by technological change.

Her alternative was proactive: embed privacy into the design of systems, business practices, and organizational structures from the outset. Don't ask users to protect themselves. Don't wait for something to go wrong. Build privacy in.

The idea was initially met with skepticism, but it gradually won institutional backing. In 2010, the International Conference of Data Protection and Privacy Commissioners adopted Privacy by Design as an international standard, giving it significant institutional legitimacy. In 2018, the European Union's General Data Protection Regulation (GDPR) codified a version of PbD in Article 25, requiring "data protection by design and by default." The concept went from visionary aspiration to legal obligation.

10.1.2 The Seven Foundational Principles

Cavoukian articulated Privacy by Design through seven principles. Each represents a shift from conventional thinking:

Principle 1: Proactive, Not Reactive; Preventive, Not Remedial

Privacy by Design does not wait for privacy risks to materialize. It anticipates them and prevents them. This means conducting privacy analysis before building a system, not after a breach has occurred.

Connection to Chapter 9: Recall the notice-and-consent model's fundamental limitation -- it intervenes at the point of collection, after the system has already been designed to collect. PbD intervenes at the point of design, before any data flows at all.

Principle 2: Privacy as the Default Setting

If a user does nothing -- takes no action, changes no settings, reads no policies -- they should still have privacy. This is the opposite of how most systems work today, where default settings maximize data collection and users must actively opt out.

Mira thought about this principle while reviewing VitraMed's patient onboarding flow. "Right now," she told Dr. Adeyemi, "patients have to check a box to opt out of data sharing with our research partners. If they miss it, their data gets shared. That's privacy as an afterthought, not a default."

"What would the default look like if you followed Cavoukian's principle?" Dr. Adeyemi asked.

"No sharing unless they explicitly opt in. Which means fewer research datasets. Which means some analytics we're doing would have to stop."

"Now you see why this principle is difficult," Dr. Adeyemi said. "It has costs."

Principle 3: Privacy Embedded into Design

Privacy should be a core component of system architecture, not a bolt-on feature added after the fact. This means privacy engineers and ethicists participate in system design from day one -- not just legal teams doing a compliance review at the end.

Principle 4: Full Functionality -- Positive-Sum, Not Zero-Sum

PbD rejects the assumption that privacy must come at the expense of other values (security, functionality, usability). It insists on positive-sum outcomes: privacy and functionality, privacy and security. This is aspirational, and sometimes genuinely achievable -- but as we'll see in Sections 10.4 and 10.5, real trade-offs exist.

Principle 5: End-to-End Security -- Full Lifecycle Protection

Privacy protections must extend across the entire data lifecycle (introduced in Chapter 1): collection, storage, processing, sharing, and deletion. Data that is carefully anonymized at collection but stored indefinitely in an unprotected database has not achieved Privacy by Design.

Principle 6: Visibility and Transparency -- Keep It Open

Organizations should be transparent about their data practices. Users, regulators, and auditors should be able to verify that privacy commitments are being honored. This connects directly to the accountability frameworks we'll examine in Chapter 17.

Principle 7: Respect for User Privacy -- Keep It User-Centric

Above all, Privacy by Design requires organizations to prioritize the interests of the individual. This means offering strong privacy defaults, providing clear and accessible information, and designing systems that empower users rather than extracting from them.

Critical Note: Critics of PbD observe that the seven principles are aspirational rather than operational. They tell you what to aim for but not how to get there. "Embed privacy into design" sounds excellent, but what specific technical and organizational choices does it require? The remainder of this chapter provides some answers.

10.1.3 From Aspiration to Regulation: GDPR Article 25

Privacy by Design remained an influential but voluntary framework for over two decades. That changed in 2018, when the GDPR transformed it from an aspiration into a legal requirement.

Article 25 of the GDPR mandates:

"Taking into account the state of the art, the cost of implementation and the nature, scope, context and purposes of processing as well as the risks of varying likelihood and severity for rights and freedoms of natural persons posed by the processing, the controller shall, both at the time of the determination of the means for processing and at the time of the processing itself, implement appropriate technical and organizational measures... which are designed to implement data-protection principles... in an effective manner."

Article 25 also requires data protection by default -- personal data should be processed only to the extent necessary for each specific purpose, and by default should not be made accessible to an indefinite number of people.

The inclusion of "the cost of implementation" is significant. It acknowledges that PbD is not an absolute requirement irrespective of expense -- organizations must implement measures that are appropriate given their resources and the nature of the processing. This provides flexibility but also creates ambiguity: how much expenditure is "appropriate"? When is the cost of privacy protection justified, and when is it excessive? These are judgment calls that courts and data protection authorities are still working out.

Dr. Adeyemi highlighted this tension. "Article 25 is both the greatest triumph and the greatest challenge of Privacy by Design. It made privacy-by-design law. But it also made it law -- which means it's now subject to all the limitations of legal enforcement that we've been discussing. How do you audit compliance with 'appropriate technical and organizational measures'? How do you determine whether an organization genuinely embedded privacy or merely documented that they considered it?"

10.1.4 PbD in Practice: What It Looks Like

A system designed under PbD principles might include:

| Design Choice | PbD Principle |
| --- | --- |
| Collect only the data fields strictly necessary for the stated purpose | Principles 1, 2 |
| Encrypt data at rest and in transit using current standards | Principle 5 |
| Auto-delete data after its purpose has been fulfilled | Principles 1, 5 |
| Default all sharing settings to "off" | Principle 2 |
| Include privacy engineers in the design team from project inception | Principle 3 |
| Publish regular transparency reports | Principle 6 |
| Provide clear, accessible privacy dashboards for users | Principle 7 |
| Use differential privacy for analytics (Section 10.5) | Principles 1, 3, 4 |

10.2 Data Minimization: Collect Only What You Need

10.2.1 The Principle

Data minimization is the practice of collecting only the data that is strictly necessary for a specified purpose, retaining it only as long as needed, and deleting it when its purpose has been fulfilled. It is arguably the most consequential of PbD's implications, because it addresses the root cause of many privacy harms: you cannot breach data you never collected.

The GDPR enshrines data minimization in Article 5(1)(c): personal data shall be "adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed."

This sounds simple. It is not.

10.2.2 Why Organizations Over-Collect

If data minimization is so obviously beneficial for privacy, why do organizations routinely collect far more data than they need? Several forces push in the direction of over-collection:

  1. The "just in case" mentality. Engineers and analysts argue that data might prove useful later. Deleting it forecloses future possibilities. This reasoning treats data as a free asset with no carrying cost -- ignoring the privacy risks, security obligations, and storage costs that come with every additional data field.

  2. The business model. As Chapter 4 documented, many digital platforms are funded by advertising, and advertising is fueled by behavioral data. For these companies, data minimization is in fundamental tension with the revenue model.

  3. The analytics imperative. Machine learning models generally perform better with more data. A VitraMed data scientist arguing for access to patients' social media activity is not being malicious -- they genuinely believe the additional data will improve predictions. The ethical question is whether that improvement justifies the collection.

  4. Path dependency. Existing systems were built to collect certain data, and changing them requires engineering effort that competes with other priorities. "We've always collected date of birth" is not a justification, but it is a common explanation.

10.2.3 Implementing Data Minimization

Effective data minimization requires discipline across the data lifecycle:

At Collection:

  • For every data field, ask: "Is this necessary for the stated purpose?" Not useful, not interesting -- necessary.
  • Distinguish between primary data (needed for the service) and secondary data (useful for other purposes). Collect the primary; scrutinize the secondary.
  • Use progressive collection: ask for data only when it's needed, not all at once during registration.

At Storage:

  • Implement retention schedules: define how long each type of data will be kept and automate deletion when the period expires.
  • Store data at the lowest level of identifiability needed for each purpose. If aggregate statistics suffice, don't store individual records.

At Processing:

  • Apply the principle of least privilege: grant access to data only to those who need it for their specific role.
  • Use privacy-preserving analytics (Sections 10.5 and 10.6) where possible.

At Deletion:

  • Ensure deletion is real, not just a UI change. Data marked as "deleted" but retained in backups, logs, or shadow databases is not minimized.
  • Verify deletion: audit your systems to confirm that data scheduled for deletion has actually been removed.
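
The retention-schedule idea can be sketched in a few lines of code. This is a minimal illustration, not a production deletion pipeline: the record types and retention periods below are hypothetical, and a real system would also have to reach backups and logs.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

# Hypothetical retention schedule: how long each record type may be kept.
RETENTION = {
    "support_ticket": timedelta(days=365),
    "marketing_event": timedelta(days=90),
}

def expired_records(df: pd.DataFrame, now: datetime) -> pd.DataFrame:
    """Return the rows whose retention period has elapsed."""
    limits = pd.to_timedelta(df["record_type"].map(RETENTION))
    return df[now - df["collected_at"] > limits]

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
records = pd.DataFrame({
    "record_type": ["support_ticket", "marketing_event", "marketing_event"],
    "collected_at": [
        datetime(2025, 1, 10, tzinfo=timezone.utc),  # well within 365 days
        datetime(2025, 1, 10, tzinfo=timezone.utc),  # older than 90 days
        datetime(2025, 5, 20, tzinfo=timezone.utc),  # within 90 days
    ],
})

to_delete = expired_records(records, now)
print(f"{len(to_delete)} record(s) due for deletion")  # 1 record(s) due for deletion
```

The point of automating the check is that deletion stops depending on anyone remembering to do it -- the schedule, once defined, enforces itself.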

Eli brought up a pointed example in class. "The city of Detroit installed Smart City sensors at major intersections to measure traffic flow. That's the stated purpose. But the sensors also capture license plates, pedestrian movements, and -- because they have microphones -- ambient sound. Is any of that necessary to count cars?"

"The traffic engineers would say the license plates help distinguish individual vehicles from repeat passes," another student offered.

"Then anonymize the plates," Eli said. "Hash them so you can count unique vehicles without knowing who they belong to. The pedestrian data? The audio? That's over-collection. Period."

Applied Framework: The Data Minimization Audit asks five questions for every data field: (1) What is the stated purpose? (2) Is this field necessary for that purpose? (3) Could the purpose be achieved with less identifying data? (4) How long is the data retained? (5) What happens when the retention period expires? Any field that fails questions 2 or 3 is a candidate for elimination.

10.2.4 Data Minimization in Practice: Tensions and Trade-offs

Data minimization sounds straightforward in principle. In practice, it generates genuine tensions that data governance professionals must navigate.

Tension 1: Minimization vs. Machine Learning. Modern machine learning systems are often described as "data-hungry" -- they improve with more data, more features, and more diverse training examples. A VitraMed data scientist argued in an internal memo: "If we had access to patients' dietary habits, exercise patterns, and sleep data in addition to their clinical records, our predictive models for cardiovascular risk would improve by an estimated 15%. That improvement translates to earlier interventions and better outcomes." The argument is not frivolous. But collecting dietary habits, exercise patterns, and sleep data from patients dramatically expands the privacy surface -- and each new data source creates new vectors for breach, misuse, and re-identification.

The PbD response is to ask whether the analytical goal can be achieved through privacy-preserving means: federated learning (Section 10.6), synthetic data, or differential privacy (Section 10.5). If so, the data need not be centralized, and the minimization principle is preserved. If not, the organization must make a judgment about whether the analytical benefit justifies the privacy cost -- and document that judgment transparently.

Tension 2: Minimization vs. Accountability. Sometimes retaining data is necessary to demonstrate compliance, resolve disputes, or enable auditing. A company that deletes all customer interaction records after 30 days has achieved excellent data minimization but may be unable to demonstrate that it obtained valid consent, responded to complaints, or complied with regulatory requirements. Data governance professionals must balance minimization against retention requirements, keeping data long enough to meet legal and accountability obligations but no longer.

Tension 3: Minimization vs. Serendipity. Some of the most important insights come from unexpected analyses of existing data. The discovery that aspirin reduces heart attack risk came from re-analysis of data originally collected for other purposes. A strict minimization regime that deletes data as soon as its original purpose is fulfilled forecloses such discoveries. This is a genuine loss. But it must be weighed against the privacy costs of indefinite retention -- and against the availability of privacy-preserving techniques that can enable secondary analysis without retaining identifiable records.


10.3 Anonymization, Pseudonymization, and Their Limits

10.3.1 Three Approaches to Reducing Identifiability

When organizations need to use data for purposes beyond the original collection -- research, analytics, reporting -- they must reduce the risk that individuals can be identified. Three approaches exist, differing in their strength and limitations:

De-identification is the general process of removing or obscuring identifiers from data. It is the broadest term and the weakest commitment -- it may involve nothing more than stripping names and Social Security numbers.

Pseudonymization replaces direct identifiers with pseudonyms (codes, tokens, or hashes). A patient named "Sarah Chen" becomes "Patient #4782." Crucially, a key exists that maps pseudonyms back to identities. The data can be re-identified if necessary (for clinical follow-up, for example), but day-to-day access is to the pseudonymized version. The GDPR explicitly recognizes pseudonymization as a risk-reduction measure but does not treat pseudonymized data as anonymous -- it is still personal data subject to GDPR requirements, because re-identification is possible.
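
The mechanics can be sketched as a token table. This is an illustrative toy: in practice the key table would live in a separately secured, access-controlled system, not alongside the pseudonymized data.

```python
import secrets

# The "key": maps tokens back to identities. Its existence is exactly why
# pseudonymized data is still personal data under the GDPR.
key_table: dict[str, str] = {}

def pseudonymize(name: str) -> str:
    """Replace a direct identifier with a random token, recording the mapping."""
    token = "Patient-" + secrets.token_hex(4)
    key_table[token] = name
    return token

def re_identify(token: str) -> str:
    """Reverse the mapping -- possible only for holders of the key table."""
    return key_table[token]

token = pseudonymize("Sarah Chen")
print(token)               # e.g., Patient-9f3a2c81 (random each run)
print(re_identify(token))  # Sarah Chen
```

Day-to-day analytics would see only the token; clinical follow-up, with authorized access to the key table, can recover the identity.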

Anonymization is the irreversible transformation of data such that individuals can no longer be identified, directly or indirectly, by any reasonably available means. Truly anonymized data falls outside the scope of the GDPR entirely, because it is no longer personal data.

The problem is that true anonymization is far harder to achieve than most people assume.

10.3.2 The Illusion of Anonymity

In a landmark 2000 study, computer scientist Latanya Sweeney demonstrated that 87% of the U.S. population could be uniquely identified by just three data fields: five-digit zip code, date of birth, and gender. These are not direct identifiers -- they are quasi-identifiers, combinations of seemingly innocuous attributes that together become uniquely identifying.

Sweeney proved her point by re-identifying the medical records of then-Massachusetts Governor William Weld from a publicly released "anonymous" health insurance dataset. She purchased the voter registration rolls for Cambridge, Massachusetts (which included name, address, zip code, date of birth, and gender), linked them with the anonymized health records (which included zip code, date of birth, gender, and medical diagnoses), and found exactly one person matching Weld's demographic profile. His "anonymous" medical records were now attached to his name.

This attack -- the linkage attack -- is the fundamental threat to anonymization. It demonstrates that whether data is identifiable depends not on the data itself but on what other data is available to an attacker.
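
A linkage attack is, mechanically, nothing more than a join. The toy data below is invented for illustration, but its structure mirrors Sweeney's attack: a public voter roll with names is merged with "anonymized" health records on the shared quasi-identifiers.

```python
import pandas as pd

# "Anonymized" health records: names stripped, quasi-identifiers retained.
health = pd.DataFrame({
    "zip":        ["02138", "02138", "02139"],
    "birth_date": ["1945-07-31", "1962-03-14", "1945-07-31"],
    "gender":     ["M", "F", "F"],
    "diagnosis":  ["Hypertension", "Asthma", "Diabetes"],
})

# Public voter roll: names attached to the same quasi-identifiers.
voters = pd.DataFrame({
    "name":       ["W. Weld", "J. Smith"],
    "zip":        ["02138", "02139"],
    "birth_date": ["1945-07-31", "1980-01-02"],
    "gender":     ["M", "M"],
})

# The attack is a simple inner join on the quasi-identifiers.
linked = voters.merge(health, on=["zip", "birth_date", "gender"])
print(linked[["name", "diagnosis"]].to_string(index=False))
# W. Weld is the only match -- his "anonymous" record, with its
# diagnosis, is now attached to his name.
```

Note that the health dataset contained no names at all; one auxiliary dataset was enough to undo the "anonymization."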

Connection to Chapter 5: Linkage attacks illustrate the power asymmetry at the heart of data governance. The organizations that release "anonymized" datasets often have no idea what auxiliary information an attacker might possess. The re-identification risk is fundamentally unknowable, because it depends on the external knowledge of a hypothetical adversary.

10.3.3 The Netflix Prize Dataset

In 2006, Netflix released a dataset of 100 million movie ratings from approximately 480,000 subscribers as part of a competition to improve its recommendation algorithm. The dataset was "anonymized" -- subscriber names were replaced with random IDs. Netflix believed this was sufficient.

Researchers Arvind Narayanan and Vitaly Shmatikov demonstrated otherwise. By cross-referencing the Netflix dataset with public movie ratings on the Internet Movie Database (IMDb), they were able to re-identify specific Netflix subscribers. Even a few publicly known movie preferences were sufficient to uniquely identify an individual in the "anonymous" dataset, revealing their complete viewing history -- including potentially sensitive genres.

The consequences were concrete. A class-action lawsuit was filed, and Netflix cancelled its planned second competition. A woman identified only as "Jane Doe" in the lawsuit argued that her re-identified viewing history could reveal her sexual orientation -- a harm with real social consequences.

10.3.4 When Anonymization Works -- and When It Doesn't

Anonymization is not useless. It works reasonably well when:

  • The dataset is small and the population is large (low risk of uniqueness)
  • The data is highly aggregated (county-level statistics rather than individual records)
  • The quasi-identifiers have been generalized (age ranges instead of exact ages, state instead of zip code)
  • The data has limited external linkage potential (no easy auxiliary datasets to match against)

Anonymization fails when:

  • The dataset is rich (many attributes per record)
  • Quasi-identifiers are specific (exact dates, precise locations)
  • Auxiliary data is readily available (social media, voter rolls, public records)
  • The adversary is motivated and knowledgeable

The lesson is not that anonymization should be abandoned, but that it should be treated as a risk-reduction measure rather than a guarantee. True anonymity in rich datasets is, in many practical contexts, a myth.


10.4 K-Anonymity and Its Descendants

10.4.1 The Definition

K-anonymity, introduced by Latanya Sweeney in 2002, formalizes one approach to preventing linkage attacks. A dataset satisfies k-anonymity if, for every combination of quasi-identifiers in the dataset, at least k individuals share that combination. In other words, no individual can be distinguished from at least k-1 other individuals based on their quasi-identifiers.

If a dataset has 3-anonymity for the quasi-identifiers {zip code, age, gender}, then every person in the dataset has at least two other people with the same zip code, age, and gender. An attacker who knows your zip code, age, and gender cannot narrow you down to fewer than three records.

10.4.2 Achieving K-Anonymity

Two primary techniques are used to achieve k-anonymity:

Generalization: Replace specific values with broader categories. Instead of exact age (27), use an age range (25-30). Instead of a five-digit zip code (48201), use a three-digit prefix (482**).

Suppression: Remove records that cannot be made k-anonymous through generalization without unacceptable information loss. If only one person in the dataset is a 92-year-old woman in zip code 04401, and generalizing further would destroy the utility of the data, that record might be suppressed entirely.

Both techniques involve trade-offs: more generalization means more privacy but less analytical utility. The art lies in finding the balance.
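
Generalization is mechanical enough to sketch. The helper below converts exact ages into 10-year bands and five-digit zip codes into three-digit prefixes; the bucket widths are illustrative choices, not standards, and a real pipeline would tune them against the utility requirements of the analysis.

```python
import pandas as pd

def generalize(df: pd.DataFrame) -> pd.DataFrame:
    """Generalize quasi-identifiers: ages to 10-year bands, zips to 3-digit prefixes."""
    out = df.copy()
    lo = (out["age"] // 10) * 10
    out["age"] = lo.astype(str) + "-" + (lo + 9).astype(str)  # e.g., 27 -> "20-29"
    out["zip"] = out["zip"].str[:3] + "**"                    # e.g., "48201" -> "482**"
    return out

df = pd.DataFrame({"age": [27, 23, 64], "zip": ["48201", "48227", "10001"]})
print(generalize(df))
# The first two rows now share the profile ("20-29", "482**"), forming a
# 2-anonymous group. The third row is still unique -- it would need further
# generalization or suppression.
```

Notice the trade-off in action: the first two records gained anonymity at the cost of precision, while the third shows that generalization alone does not always suffice.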

10.4.3 Python Implementation: Checking K-Anonymity

The following Python function checks whether a given DataFrame satisfies k-anonymity for a specified set of quasi-identifiers. This is a practical tool for data governance -- before releasing or sharing a dataset, you can verify whether it meets your anonymization standard.

import pandas as pd


def check_k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> dict:
    """
    Check whether a DataFrame satisfies k-anonymity for given quasi-identifiers.

    K-anonymity requires that every combination of quasi-identifier values
    appears in at least k records. If any combination appears fewer than k
    times, individuals with that combination may be uniquely (or nearly
    uniquely) identifiable through a linkage attack.

    Parameters
    ----------
    df : pd.DataFrame
        The dataset to evaluate.
    quasi_identifiers : list[str]
        Column names that serve as quasi-identifiers (e.g., age, zip code,
        gender -- attributes that could be used in a linkage attack).
    k : int
        The minimum group size required for k-anonymity.

    Returns
    -------
    dict
        A dictionary with three keys:
        - "satisfies_k_anonymity" (bool): True if every quasi-identifier
          group has at least k records.
        - "smallest_group_size" (int): The size of the smallest equivalence
          class (the most vulnerable group).
        - "violating_groups" (pd.DataFrame): A DataFrame of quasi-identifier
          combinations that fail the k-anonymity check, with their counts.
          Empty if the dataset satisfies k-anonymity.

    Example
    -------
    >>> data = {
    ...     "age":    [29, 29, 35, 35, 42, 42, 42, 29],
    ...     "gender": ["F", "F", "M", "M", "F", "F", "F", "F"],
    ...     "zip":    ["48201", "48201", "48201", "48201", "10001", "10001", "10001", "48201"],
    ...     "diagnosis": ["Flu", "Asthma", "Diabetes", "Flu", "Asthma", "Flu", "Diabetes", "Flu"]
    ... }
    >>> df = pd.DataFrame(data)
    >>> result = check_k_anonymity(df, quasi_identifiers=["age", "gender", "zip"], k=2)
    >>> print(result["satisfies_k_anonymity"])
    True
    """
    # Group by quasi-identifiers and count records in each equivalence class
    group_sizes = df.groupby(quasi_identifiers).size().reset_index(name="count")

    # Find the smallest equivalence class
    smallest = group_sizes["count"].min()

    # Identify any groups that violate k-anonymity
    violating = group_sizes[group_sizes["count"] < k]

    return {
        "satisfies_k_anonymity": bool(smallest >= k),
        "smallest_group_size": int(smallest),
        "violating_groups": violating,
    }


# === Demonstration: A dataset that SEEMS anonymous but FAILS k-anonymity ===

# Suppose a hospital releases "de-identified" patient records.
# They removed names and IDs but kept age, gender, zip code, and diagnosis.
patient_data = pd.DataFrame({
    "age":       [29, 29, 34, 52, 52, 41, 41, 41, 63, 29],
    "gender":    ["F", "F", "M", "F", "F", "M", "M", "M", "F", "F"],
    "zip_code":  ["48201", "48201", "48201", "10001", "10001", "60601", "60601", "60601", "10001", "48201"],
    "diagnosis": ["Flu", "Asthma", "HIV", "Diabetes", "Hypertension", "Flu", "Flu", "Asthma", "Cancer", "Flu"],
})

print("=== Patient Dataset (names removed) ===")
print(patient_data.to_string(index=False))
print()

# Check for k=2 anonymity using age, gender, and zip code as quasi-identifiers
quasi_ids = ["age", "gender", "zip_code"]
result = check_k_anonymity(patient_data, quasi_ids, k=2)

print(f"Satisfies 2-anonymity? {result['satisfies_k_anonymity']}")
print(f"Smallest group size:   {result['smallest_group_size']}")
print()

if not result["satisfies_k_anonymity"]:
    print("FAILING GROUPS (vulnerable to linkage attack):")
    print(result["violating_groups"].to_string(index=False))
    print()
    print("These individuals can be uniquely identified if an attacker")
    print("knows their age, gender, and zip code from another source")
    print("(e.g., voter registration records, social media profiles).")

Walking through the output:

When you run this code, you'll find that the dataset fails 2-anonymity. Some combinations of age, gender, and zip code appear only once:

  • The 34-year-old male in zip code 48201 is unique in the dataset. If an attacker knows someone matching that demographic profile is in the dataset, they can identify that person's record -- and learn their HIV diagnosis.
  • The 63-year-old female in zip code 10001 is similarly unique. Her cancer diagnosis is exposed.

This is precisely the kind of vulnerability that Sweeney demonstrated with Governor Weld's medical records. The data looks anonymous (no names, no IDs), but the quasi-identifiers make it identifiable.

The fix: To achieve 2-anonymity, you could generalize the data -- replace exact ages with ranges (e.g., 30-39 instead of 34), broaden zip codes (e.g., 482** instead of 48201), or suppress the uniquely identifiable records. Each choice has costs: generalization reduces analytical precision, and suppression removes data points entirely.

Try it yourself: Download the code, modify the dataset, and experiment. What happens if you add more patients with the same quasi-identifier profile? What if you generalize ages to 10-year ranges? At what point does the dataset achieve 3-anonymity? 5-anonymity?

10.4.4 Attacks on K-Anonymity

K-anonymity was an important advance, but it has known vulnerabilities:

Homogeneity Attack: If everyone in a k-anonymous group has the same sensitive attribute, k-anonymity doesn't help. Suppose three people share the quasi-identifier profile {age: 29, gender: F, zip: 48201}, and all three have a diagnosis of "HIV." An attacker who links you to that group now knows your diagnosis with certainty, even though they can't identify your specific record.

Background Knowledge Attack: An attacker may have auxiliary information that eliminates possibilities within a group. Suppose an attacker knows that a 52-year-old woman in zip code 10001 is in the dataset and that she is Japanese, and also knows that heart disease occurs at very low rates among Japanese women. If her k-anonymous group contains several heart-disease records and one viral-infection record, the attacker can infer her likely diagnosis -- without ever breaking k-anonymity.

Skewness Attack: Even if sensitive values in a group are diverse, their distribution may reveal information. If a group contains 4 records, and 3 out of 4 have "Cancer" as their diagnosis, an attacker can infer with 75% probability that the target has cancer.

These vulnerabilities motivated the development of stronger privacy definitions.

10.4.5 L-Diversity

L-diversity, proposed by Machanavajjhala et al. (2007), extends k-anonymity by requiring that each equivalence class contains at least l "well-represented" values of the sensitive attribute. This addresses the homogeneity attack: if every group contains at least l different diagnoses, knowing someone's group doesn't reveal their diagnosis.

Several variants of l-diversity exist, from the simple requirement of l distinct values to more sophisticated entropy-based measures. The trade-off is that achieving l-diversity requires more generalization or suppression than k-anonymity alone, further reducing data utility.
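
A distinct-l check can be written in the same style as the k-anonymity checker above. This sketch implements only the simplest variant -- at least l distinct sensitive values per equivalence class -- not the entropy-based measures.

```python
import pandas as pd

def check_l_diversity(df: pd.DataFrame, quasi_identifiers: list[str],
                      sensitive: str, l: int) -> bool:
    """Check distinct l-diversity: every equivalence class must contain
    at least l distinct values of the sensitive attribute."""
    distinct = df.groupby(quasi_identifiers)[sensitive].nunique()
    return bool(distinct.min() >= l)

df = pd.DataFrame({
    "age_range": ["20-29", "20-29", "20-29", "30-39", "30-39"],
    "zip":       ["482**", "482**", "482**", "482**", "482**"],
    "diagnosis": ["HIV", "HIV", "HIV", "Flu", "Asthma"],
})

# The 20-29 group is 3-anonymous but homogeneous: everyone has HIV,
# so the group fails 2-diversity even though it passes 3-anonymity.
print(check_l_diversity(df, ["age_range", "zip"], "diagnosis", l=2))  # False
```

The example makes the homogeneity attack concrete: the 20-29 group would pass a k-anonymity check with k=3, yet membership in the group alone reveals the diagnosis.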

10.4.6 T-Closeness

T-closeness, proposed by Li, Li, and Venkatasubramanian (2007), goes further still. It requires that the distribution of sensitive attributes within each equivalence class is close to the distribution in the overall dataset, where "close" is measured by a distance metric no greater than a threshold t.

This addresses the skewness attack: if a group's distribution of diagnoses matches the overall population's distribution, knowing someone's group reveals nothing beyond what was already publicly known about the population.

The progression from k-anonymity to l-diversity to t-closeness illustrates a recurring pattern in privacy engineering: every defense creates new attacks, and every new defense adds complexity and reduces utility. This arms race is one reason researchers began exploring fundamentally different approaches to privacy -- which brings us to differential privacy.


10.5 Differential Privacy: A Mathematical Guarantee

10.5.1 The Core Idea

Differential privacy, introduced by Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith in 2006, takes a fundamentally different approach from k-anonymity and its descendants. Instead of trying to make individual records unidentifiable within a released dataset, differential privacy adds carefully calibrated noise to the output of queries or analyses, providing a mathematical guarantee about how much any single individual's data can influence the result.

Here is the intuition: Imagine a database of medical records. You want to know the average blood pressure of patients in a certain age group. Differential privacy adds random noise to the answer -- perhaps reporting 127.3 instead of the true value of 126.8. The noise is large enough that the result would be essentially the same whether or not any specific individual was in the database, but small enough that the answer is still useful for research purposes.

The formal definition is precise: a mechanism M satisfies epsilon-differential privacy if, for any two datasets D and D' that differ in a single individual's record, and for any set of outputs S, Pr[M(D) in S] <= e^epsilon * Pr[M(D') in S]. In words: the probability of any output changes by at most a factor of e^epsilon when a single individual is added to or removed from the dataset.

Don't worry if the math is opaque. What matters is the guarantee: your participation in the dataset cannot significantly change the output. An attacker who sees the result cannot determine whether you were in the dataset at all.
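The blood-pressure intuition can be made concrete with the classic Laplace mechanism. The readings below are invented, and the function names are illustrative; the key fact is that the sensitivity of a mean over n records clipped to [lo, hi] is (hi - lo) / n.

```python
import math
import random

def laplace_noise(scale):
    """Draw from Laplace(0, scale) by inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def private_mean(values, lo, hi, epsilon):
    """Release a mean with epsilon-DP. Clipping to [lo, hi] bounds each
    record's influence, so the sensitivity of the mean over n records
    is (hi - lo) / n."""
    clipped = [min(max(v, lo), hi) for v in values]
    sensitivity = (hi - lo) / len(clipped)
    return sum(clipped) / len(clipped) + laplace_noise(sensitivity / epsilon)

# Invented blood-pressure readings, clipped to a plausible range
readings = [118, 126, 131, 140, 122, 135, 128, 119]
print(round(private_mean(readings, 90, 180, epsilon=1.0), 1))
```

With only eight records and epsilon = 1, the noise scale is 90/8 = 11.25, so a single answer is quite noisy; larger datasets or a larger epsilon sharpen the result.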

10.5.2 The Epsilon Parameter

The strength of the privacy guarantee is controlled by epsilon -- a parameter that represents the privacy "budget." Smaller epsilon means stronger privacy (more noise, less precision). Larger epsilon means weaker privacy (less noise, more precision).

Think of epsilon as a dial:

  • Epsilon near 0: Extremely strong privacy. The output is nearly random -- almost no information about any individual leaks. But the results may be too noisy to be useful.
  • Epsilon around 1: A common benchmark for "reasonable" privacy. Some information leaks, but the risk is bounded.
  • Epsilon of 10 or more: Weak privacy guarantee. The noise is so small relative to the signal that individuals may be distinguishable.

Analogy: Imagine you're in a room with 100 people, and someone asks, "How many people in this room have been diagnosed with depression?" With strong differential privacy (low epsilon), the answer might be "somewhere between 10 and 25" -- true enough to be useful for public health research, but imprecise enough that no one can determine whether you are one of the people counted. With weak differential privacy (high epsilon), the answer might be "17 or 18" -- more precise, but potentially revealing to an attacker who already knows the answer for everyone in the room except you.

10.5.3 The Privacy Budget

One crucial subtlety: every query against a differentially private system "spends" some of the privacy budget. If you ask one question with epsilon = 1, the individual's privacy loss is bounded by epsilon = 1. If you ask ten questions, each with epsilon = 1, the cumulative privacy loss is bounded by epsilon = 10 (a much weaker guarantee).

This means differential privacy is not a magic shield that allows unlimited queries on sensitive data. Organizations must manage their privacy budget carefully, deciding which queries are important enough to spend on and when the budget is exhausted.
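Budget management can be sketched as a tiny accountant using basic sequential composition (total loss is the sum of per-query epsilons). The class and API are invented for illustration; production systems use tighter accounting methods such as advanced or Rényi composition.

```python
class PrivacyBudget:
    """Track cumulative privacy loss under basic sequential
    composition: total epsilon spent is the sum over queries."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Record a query's cost; refuse it if the budget would be
        exceeded. Returns the remaining budget."""
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return self.total - self.spent

budget = PrivacyBudget(total_epsilon=3.0)
print(budget.charge(1.0))   # 2.0 remaining
print(budget.charge(1.5))   # 0.5 remaining
try:
    budget.charge(1.0)      # would push total loss past epsilon = 3
except RuntimeError as e:
    print(e)                # privacy budget exhausted
```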

10.5.4 Real-World Implementations

Apple was one of the first major technology companies to deploy differential privacy at scale, beginning in 2016. Apple uses local differential privacy to collect usage statistics (popular emoji, frequently visited websites, health data trends) from millions of iPhones. The noise is added on the device before data leaves the user's phone, so Apple's servers never see individual data. Apple set epsilon values between 1 and 8 for different data types, though researchers have debated whether the higher end provides meaningful protection.

The U.S. Census Bureau adopted differential privacy for the 2020 Census, adding noise to published statistics to prevent re-identification of individuals in small geographic areas. This was controversial: some state officials argued that the noise distorted population counts used for legislative redistricting and federal funding allocation. The Bureau responded that without differential privacy, modern re-identification techniques could compromise the confidentiality promise that underpins public trust in the Census.

Google developed RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response) to collect browser statistics from Chrome users with differential privacy guarantees.

Reflection: Apple's implementation raises a design question relevant to PbD Principle 4 (positive-sum, not zero-sum). Is differential privacy a genuine positive-sum solution -- providing both analytics and privacy -- or does it simply shift the trade-off to a different point? The answer depends on how much analytical precision you need and how much privacy you owe the people in your data.

10.5.5 Local vs. Global Differential Privacy

An important distinction separates two approaches to differential privacy:

Local differential privacy adds noise before data leaves the individual's device. Each person's data is perturbed locally, and only the noisy version is sent to the server. Apple's implementation uses this approach. The advantage is that the server never sees true individual data, so even a compromised or malicious server cannot expose individuals. The disadvantage is that local noise reduces the accuracy of the aggregate results significantly, because the noise compounds across individuals.

Global (central) differential privacy collects true individual data on a trusted server, then adds noise to the aggregate output. The U.S. Census Bureau's approach uses this model. The advantage is greater accuracy: because noise is added once to the aggregate rather than independently to each record, the signal-to-noise ratio is much better. The disadvantage is that the server holds unperturbed individual data, creating a centralized point of vulnerability.

The choice between local and global differential privacy reflects a fundamental trust decision. Local DP trusts no one; global DP trusts the server operator. For organizations like Apple, which has built its brand on privacy, local DP reinforces the trust proposition. For government statistical agencies with strong confidentiality mandates and established trust, global DP offers better data quality.
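Randomized response -- the oldest local-DP mechanism -- makes the local model concrete: each respondent's answer is individually noisy, yet the aggregate can be debiased. The survey scenario and function names below are illustrative.

```python
import math
import random

def randomized_response(truth, epsilon):
    """Answer a yes/no question truthfully with probability
    e^eps / (e^eps + 1); otherwise flip the answer. Each respondent
    individually satisfies epsilon-local-DP."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return truth if random.random() < p_truth else not truth

def estimate_rate(noisy_answers, epsilon):
    """Debias the observed yes-rate using the known flip probability."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    observed = sum(noisy_answers) / len(noisy_answers)
    return (observed - (1 - p)) / (2 * p - 1)

# Simulate 50,000 respondents, 30% of whom truly answer "yes"
truths = [random.random() < 0.30 for _ in range(50_000)]
noisy = [randomized_response(t, epsilon=1.0) for t in truths]
print(round(estimate_rate(noisy, 1.0), 2))  # close to 0.30
```

Notice the local-DP accuracy penalty: each answer is flipped with probability about 27% at epsilon = 1, so recovering a sharp aggregate requires tens of thousands of respondents.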

Mira considered this distinction for VitraMed. "Our clinics need to trust us with their patient data -- that's the nature of being a business associate under HIPAA. So we're already in a global-trust model. But if we ever expand into patient-facing analytics -- showing patients their own health trends compared to population averages -- local DP would be the right choice, because patients shouldn't have to trust us not to peek."


10.6 Privacy-Enhancing Technologies (PETs)

10.6.1 Beyond Anonymization

K-anonymity, l-diversity, and differential privacy all address the same problem: how to extract useful information from data while protecting individual privacy. But they all operate on data that has already been collected and centralized. A more radical approach asks: what if the data never had to be collected in one place at all?

Privacy-enhancing technologies (PETs) are a family of cryptographic and computational techniques that enable useful computation on data while limiting exposure. Three are particularly important for data governance practitioners to understand.

10.6.2 Homomorphic Encryption

The concept: Homomorphic encryption allows computation on encrypted data without decrypting it. A hospital could encrypt patient records, send them to a cloud provider for analysis, receive encrypted results, and decrypt them -- without the cloud provider ever seeing the underlying data.

How it works (simplified): Special encryption schemes preserve mathematical relationships. If you encrypt the numbers 3 and 5, and then add the encrypted values, the result, when decrypted, is 8 -- even though the computation was performed entirely on ciphertext.
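The 3-plus-5 example can be reproduced with a toy Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The primes here are tiny and the code is insecure by design -- real deployments use 2048-bit moduli and a vetted library.

```python
import math
import random

# Toy Paillier parameters (requires Python 3.9+ for math.lcm).
# Tiny primes for illustration only -- NOT secure.
p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)  # modular inverse (Python 3.8+)

def encrypt(m):
    r = random.randrange(1, n)       # fresh randomness per ciphertext
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Additive homomorphism: multiplying ciphertexts adds plaintexts
c = (encrypt(3) * encrypt(5)) % n2
print(decrypt(c))  # 8
```

The computation on the last two lines happens entirely on ciphertext; only the holder of the private values (lam, mu) can recover the result.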

Current limitations: Fully homomorphic encryption (FHE) exists but is currently orders of magnitude slower than computation on plaintext. A calculation that takes seconds on unencrypted data might take hours or days with FHE. Research is narrowing this gap, but practical deployment at scale remains limited to specific use cases.

Governance relevance: Homomorphic encryption could transform health data analytics, financial risk assessment, and other domains where data sensitivity limits sharing. But it does not eliminate the need for governance -- questions about who conducts the computation, what questions are asked, and how results are used remain.

10.6.3 Secure Multi-Party Computation (SMPC)

The concept: SMPC enables multiple parties to jointly compute a function over their combined data without revealing their individual inputs to each other. Three hospitals could compute the average treatment outcome across all their patients without any hospital sharing its patient records with the others.

How it works (simplified): Each party "secret-shares" its data -- splits it into fragments that individually reveal nothing. The computation is performed on these fragments through a carefully designed protocol. The final result is assembled from the fragments and revealed to the participants, but no party ever sees another party's raw data.
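Additive secret sharing, the simplest SMPC building block, can be sketched directly. The three-hospital scenario and the counts are invented; a real protocol would also need secure channels and a way to handle malicious parties.

```python
import random

PRIME = 2 ** 61 - 1  # shares live in a finite field

def share(secret, n_parties):
    """Split a value into n additive shares summing to it mod PRIME.
    Any strict subset of shares is uniformly random, so it reveals
    nothing about the secret."""
    parts = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    parts.append((secret - sum(parts)) % PRIME)
    return parts

# Invented scenario: three hospitals secret-share their patient counts
counts = [120, 250, 95]
all_shares = [share(c, 3) for c in counts]

# Party i locally sums the i-th share from every hospital...
partial_sums = [sum(column) % PRIME for column in zip(*all_shares)]

# ...and only the recombined total is ever revealed
total = sum(partial_sums) % PRIME
print(total)  # 465
```

Each party ends up holding one meaningless fragment per hospital; the true counts are never reconstructed anywhere, only their sum.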

Practical example: Researchers studying the effectiveness of a drug across multiple hospital systems face a dilemma: pooling patient data would produce better statistical power, but data-sharing agreements are complex, slow, and raise privacy concerns. SMPC allows the computation to proceed as if the data were pooled, without the data ever leaving its home institution.

Governance relevance: SMPC is particularly valuable where data minimization would prevent useful analysis. It aligns with PbD Principle 4 (positive-sum) by enabling collaboration without centralization.

10.6.4 Federated Learning

The concept: Federated learning is a machine learning approach where a model is trained across multiple decentralized devices or servers holding local data, without exchanging the data itself. Instead of sending data to a central server, the algorithm sends the model to the data, trains locally, and sends back only the model updates (gradients).

How it works (simplified): A central server distributes a machine learning model to thousands of devices (phones, hospital servers, bank branches). Each device trains the model on its local data and sends back only the learned parameters -- not the underlying data. The central server aggregates these updates into an improved model and distributes it again.
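A toy federated averaging (FedAvg) round on a one-parameter linear model shows the flow: only weights travel, never the data. The client datasets and learning rate below are invented for this sketch.

```python
def local_train(w, data, lr=0.1):
    """One pass of gradient descent on a one-parameter model y = w*x.
    Stands in for the full model a real deployment would train."""
    for x, y in data:
        w -= lr * 2 * (w * x - y) * x  # gradient of squared error
    return w

def federated_round(global_w, client_datasets):
    """FedAvg: each client trains locally on its own data; only the
    updated weight travels back, and the server takes a size-weighted
    average of the updates."""
    updates = [local_train(global_w, d) for d in client_datasets]
    sizes = [len(d) for d in client_datasets]
    return sum(u * s for u, s in zip(updates, sizes)) / sum(sizes)

# Invented local datasets, both roughly following y = 2x; the raw
# (x, y) pairs never leave their "device"
clients = [
    [(1.0, 2.1), (2.0, 3.9)],
    [(1.5, 3.0), (3.0, 6.2)],
]
w = 0.0
for _ in range(20):
    w = federated_round(w, clients)
print(round(w, 2))  # converges near 2.0
```

The server sees only the scalar updates, never the training pairs -- though as the Limitations paragraph below notes, even those updates can leak information without additional protection.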

Practical example: Google uses federated learning to improve the next-word prediction on Gboard (its mobile keyboard). Your phone learns from your typing patterns and contributes to improving the model, but your actual text never leaves your device.

Limitations: Federated learning is not immune to privacy attacks. Model updates can sometimes reveal information about the underlying data (gradient inversion attacks). Combining federated learning with differential privacy -- adding noise to the gradient updates -- provides stronger protection at the cost of some model accuracy.

Governance relevance: Federated learning embodies the data minimization principle in a machine learning context: the data stays local, only insights travel. It is especially promising for healthcare, where patient data cannot easily cross institutional boundaries.

PETs Comparison Summary:

Technology                     | What It Protects             | Key Limitation                              | Maturity
Homomorphic Encryption         | Data during computation      | Very slow for complex operations            | Research/early commercial
Secure Multi-Party Computation | Data from other participants | Communication overhead; protocol complexity | Commercial in specific sectors
Federated Learning             | Data from centralization     | Gradient inversion attacks possible         | Deployed at scale (Google, Apple)

10.7 Privacy by Design at VitraMed and in Detroit

10.7.1 VitraMed: Applying PbD to Patient Data

Mira spent a week mapping VitraMed's data flows against Cavoukian's seven principles. The results were sobering.

"We fail on almost every principle," she told her father over dinner. "We collect way more data than we need for EHR management -- Principle 2 violation. Our analytics pipeline was built first and privacy was added later as a compliance layer -- Principle 3 violation. We retain data indefinitely 'in case it's useful' -- Principle 5 violation. Our privacy dashboard for patients is a 12-page PDF that nobody reads -- Principle 7 violation."

Vikram Chakravarti set down his fork. "Mira, we're a startup. We had five engineers when we built that system. Privacy by design sounds nice in a textbook, but we were trying to survive."

"I know," Mira said. "I'm not saying you were malicious. I'm saying the system wasn't designed with privacy in mind, and now we have 50,000 patients whose data is more exposed than it needs to be. The question is what we do now."

This is the tension at the heart of PbD in practice. Retrofitting privacy into an existing system is harder and more expensive than building it in from the start. But for the thousands of organizations that already have operational data systems, "start over" is not a realistic option. The practical challenge is incremental improvement: applying PbD principles to the next feature, the next data flow, the next architectural decision -- while remediating the most critical gaps in the existing system.

Mira's proposed remediation plan prioritized three changes:

  1. Data minimization audit: Review every data field collected and justify or eliminate it.
  2. Retention policy: Implement automated deletion schedules, starting with the highest-risk data categories.
  3. Privacy defaults: Flip the data-sharing toggle from opt-out to opt-in for research partnerships.

Connection to Chapter 9: Mira's third proposal -- changing from opt-out to opt-in -- directly addresses the consent architecture problems we examined in Chapter 9. The design of the default is a privacy decision, not a neutral one.

10.7.2 Eli's Detroit: PbD for Smart City Sensors

Eli's critique of the Detroit Smart City sensors took on new precision after studying Privacy by Design.

"Here's what I'd require," he wrote in a policy brief for Dr. Adeyemi's class. "If the city wants to deploy sensors in public spaces, those sensors need to be designed with PbD from day one. That means:

  1. Purpose limitation -- traffic sensors collect traffic data, not faces, not conversations, not license plate numbers.
  2. Edge processing -- data is processed at the sensor and only aggregate counts are transmitted. No raw video leaves the camera.
  3. Automatic deletion -- raw data is overwritten every 24 hours. Only aggregated, de-identified statistics are retained.
  4. Public dashboard -- residents can see what data is being collected, in what form, and how it's being used. In real time.
  5. Community consent -- before any sensor is deployed, the affected community votes on whether to accept it. Not the city council. The community."

Dr. Adeyemi gave the brief a high mark but noted one challenge: "Your proposal is strong on principles but thin on enforcement. What happens when the city or its vendor violates these requirements? Who monitors the monitors?"

Eli wrote in the margin: "Chapter 17, I guess."

Debate Box: Eli's fifth proposal -- community consent for sensor deployment -- raises a question about the scope of Privacy by Design. Cavoukian's framework focuses on technical and organizational design. Eli is extending it to democratic design -- the idea that privacy-protective systems should be authorized by the communities they affect. Is this a natural extension of PbD, or a different kind of claim entirely? Dr. Adeyemi posed this question to the class, and the debate filled an entire session.

10.7.3 The Organizational Challenge: Who Owns Privacy?

Both the VitraMed and Detroit examples reveal an organizational question that technical frameworks alone cannot answer: who within an organization is responsible for Privacy by Design?

In many companies, privacy is treated as a legal function -- the general counsel's office ensures compliance with applicable regulations. In more mature organizations, a Data Protection Officer (DPO) or Chief Privacy Officer (CPO) has dedicated responsibility. In the most advanced organizations, privacy engineers work alongside product engineers, participating in design reviews and architectural decisions from the earliest stages.

The placement of the privacy function matters. If privacy reports to legal, it tends to be framed as compliance -- meeting minimum requirements at minimum cost. If privacy reports to engineering, it tends to be framed as a technical challenge -- implementing mechanisms without questioning whether the collection should happen at all. If privacy has its own organizational authority, it can serve as a cross-functional check on both product development and business strategy.

Ray Zhao described NovaCorp's evolution: "When I joined, privacy was one lawyer reviewing data practices after they were built. Now we have a privacy engineering team that participates in every product design review. They don't have veto power -- that would slow us down too much -- but they have advisory authority and escalation rights. If the privacy team flags a concern, it goes to me, and I make the call. That's imperfect, but it's vastly better than what we had."

Mira noted the contrast with VitraMed: "We don't have a privacy team. We don't even have a DPO. My dad is the CEO, the lead developer, and effectively the privacy officer. That's not sustainable."

Connection to Chapter 6: The organizational placement of privacy responsibility reflects the ethical frameworks at work. A compliance-focused approach (legal ownership) aligns with a deontological minimum: what duties does the law impose? A value-focused approach (dedicated privacy authority) aligns with virtue ethics: what kind of organization do we want to be? The structural choice reveals the ethical commitment.


10.8 Case Study References

Apple's Differential Privacy Implementation

Apple's adoption of differential privacy beginning in iOS 10 (2016) represents one of the most significant corporate deployments of a privacy-enhancing technology. Apple uses local differential privacy -- noise is added on the user's device before data is transmitted to Apple's servers -- for collecting emoji usage statistics, Safari web domain suggestions, health data trends, and keyboard usage patterns. The technical details are documented in Apple's machine learning research publications.

Key questions for analysis:

  • Apple's epsilon values reportedly range from 1 to 8 per data type, with a daily per-user budget of 1 to 4. Are these values sufficient for meaningful privacy protection?
  • Apple's model is "local" differential privacy (noise added on device). How does this differ from "global" differential privacy (noise added by a central server), and what are the trade-offs?
  • Does Apple's implementation satisfy PbD Principle 4 (positive-sum), or does it sacrifice analytical precision for privacy?

Full case study analysis: case-study-01.md

De-identification Failures: The Netflix Prize Dataset

The Netflix Prize dataset release (2006) and subsequent re-identification by Narayanan and Shmatikov remains one of the most cited examples of anonymization failure. Netflix removed subscriber names and replaced them with random IDs, believing this was sufficient. The researchers demonstrated that cross-referencing with public IMDb ratings could re-identify subscribers, revealing their complete (and potentially sensitive) viewing histories.

Key questions for analysis:

  • What quasi-identifiers existed in the Netflix dataset that Netflix failed to account for?
  • How does the Netflix case illustrate the difference between de-identification and true anonymization?
  • The resulting lawsuit raised the question of whether viewing habits can reveal sensitive information (sexual orientation, political views, health conditions). How does this connect to Nissenbaum's contextual integrity framework from Chapter 7?
  • What would a PbD-informed approach to the Netflix Prize have looked like?

Full case study analysis: case-study-02.md


10.9 Chapter Summary

Key Concepts

  • Privacy by Design (Cavoukian, 1995/2010): Seven principles for embedding privacy into system architecture proactively, codified in GDPR Article 25
  • Data minimization: Collect only what is necessary, retain only as long as needed, delete when done
  • Anonymization vs. pseudonymization: Anonymization (irreversible, no longer personal data) vs. pseudonymization (reversible, still personal data) -- with anonymization far harder to achieve than commonly assumed
  • K-anonymity: Every combination of quasi-identifiers appears in at least k records; vulnerable to homogeneity, background knowledge, and skewness attacks
  • L-diversity and t-closeness: Extensions of k-anonymity addressing its specific vulnerabilities
  • Differential privacy: Mathematical guarantee that no individual's participation significantly changes the output; controlled by the epsilon parameter
  • PETs: Homomorphic encryption, secure multi-party computation, and federated learning enable computation on sensitive data without centralized exposure

Key Debates

  • Is Privacy by Design achievable in practice, or is it an aspirational ideal that organizations can never fully realize?
  • Does the "just in case" argument for data retention ever justify deviation from data minimization?
  • Can anonymization ever provide meaningful protection in an era of big data and linkage attacks, or is it an inherently false promise?
  • Should differential privacy be mandatory for all public data releases?
  • Do PETs solve the privacy problem, or do they simply shift it to a different technical layer while leaving the governance questions unchanged?

Applied Framework

The Privacy by Design Evaluation asks: (1) Is the system proactive or reactive? (2) What are the defaults? (3) Is privacy embedded or bolted on? (4) Is functionality maintained? (5) Is data protected across the full lifecycle? (6) Is the system transparent and verifiable? (7) Is it user-centric? Score each principle on a scale from "not implemented" to "fully implemented" to identify priorities for remediation.
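The scoring step can be operationalized as a small helper that ranks principles weakest-first. The scale labels and the example assessment (loosely based on Mira's audit earlier in the chapter) are invented for this sketch, not part of the framework itself.

```python
# Illustrative three-point scale for the PbD evaluation
SCALE = {"not implemented": 0, "partial": 1, "fully implemented": 2}

def pbd_score(assessment):
    """Sort the seven principles weakest-first to surface
    remediation priorities."""
    return sorted(assessment.items(), key=lambda kv: SCALE[kv[1]])

# Hypothetical assessment in the spirit of Mira's VitraMed audit
vitramed = {
    "1 Proactive not reactive": "not implemented",
    "2 Privacy as the default": "not implemented",
    "3 Privacy embedded in design": "partial",
    "4 Full functionality (positive-sum)": "partial",
    "5 End-to-end lifecycle protection": "not implemented",
    "6 Visibility and transparency": "partial",
    "7 Respect for user privacy": "not implemented",
}
for principle, status in pbd_score(vitramed):
    print(f"{principle}: {status}")
```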


What's Next

In Chapter 11: The Economics of Privacy, we shift from the technical dimension of privacy to the economic one. If privacy is valuable, why do people trade it so cheaply? If data breaches are costly, why do they keep happening? We'll examine the privacy paradox, the hidden economy of data brokers, the true cost of breaches, and the economic models that shape -- and constrain -- privacy regulation. Understanding the economics is essential because, as Ray Zhao told Dr. Adeyemi's class, "You can design the most privacy-protective system in the world, but if the business model punishes you for using it, nobody will."

Before moving on, complete the exercises and quiz.


Chapter 10 Exercises -> exercises.md

Chapter 10 Quiz -> quiz.md

Case Study: Apple's Differential Privacy Implementation -> case-study-01.md

Case Study: De-identification Failures — The Netflix Prize Dataset -> case-study-02.md