Exercises: Privacy by Design and Data Minimization

These exercises progress from concept checks through applied analysis to Python coding challenges and open-ended research. Estimated completion time: 4-5 hours.

Difficulty Guide:

  • ⭐ Foundational (5-10 min each)
  • ⭐⭐ Intermediate (10-20 min each)
  • ⭐⭐⭐ Challenging (20-40 min each)
  • ⭐⭐⭐⭐ Advanced/Research (40+ min each)


Part A: Conceptual Understanding ⭐

Test your grasp of core concepts from Chapter 10.

A.1. List Cavoukian's seven foundational principles of Privacy by Design (Section 10.1). For each principle, write one sentence explaining how it differs from a purely reactive approach to privacy protection.

A.2. Section 10.2 distinguishes between data minimization, purpose limitation, and storage limitation. A startup collects user email addresses, browsing history, purchase history, and GPS location data in order to send personalized product recommendations via email. Identify which data elements violate the principle of data minimization given the stated purpose, and explain your reasoning.

A.3. Explain the difference between anonymization and pseudonymization as described in Section 10.3. Why does the GDPR treat pseudonymized data as personal data but anonymized data as outside the regulation's scope? Under what circumstances might this distinction break down in practice?

A.4. Define k-anonymity in your own words. Using the example dataset from Section 10.4.1, explain what it means for a dataset to satisfy 3-anonymity and why 1-anonymity is effectively no anonymity at all.

A.5. Section 10.4 describes three progressively stronger privacy models: k-anonymity, l-diversity, and t-closeness. Explain the specific vulnerability that l-diversity addresses which k-anonymity does not. Then explain the vulnerability that t-closeness addresses which l-diversity does not.

A.6. In your own words, explain differential privacy as described in Section 10.5. What does it mean to say that differential privacy provides a mathematical guarantee, and how does that guarantee differ from the guarantees provided by k-anonymity?

A.7. Section 10.6 introduces three Privacy-Enhancing Technologies (PETs): homomorphic encryption, federated learning, and secure multi-party computation. For each, describe one real-world use case where the technology addresses a privacy need that traditional encryption cannot.


Part B: Applied Analysis ⭐⭐

Analyze scenarios, arguments, and real-world situations using concepts from Chapter 10.

B.1. Consider the following scenario:

A city transit authority proposes a new system to reduce bus wait times. The system would track riders' smartphones via Bluetooth beacons at every bus stop, recording device IDs, arrival times, boarding times, and destinations. The transit authority argues that this data is necessary to optimize bus routes and schedules. They promise to "anonymize" the data by hashing device IDs.

Using Cavoukian's Privacy by Design principles, evaluate this proposal. Which principles does it satisfy? Which does it violate? Propose a redesigned system that achieves the same operational goal while better adhering to Privacy by Design.

B.2. Mira is reviewing VitraMed's patient intake process. Currently, the intake form collects 47 data fields including full name, date of birth, Social Security number, home address, employment history, emergency contacts (with their own personal details), insurance information, and a free-text "reason for visit" field. The data is stored indefinitely "in case it's needed for future care."

Apply the principle of data minimization to redesign the intake process. For each of the data elements listed, determine whether it is (a) strictly necessary for clinical care, (b) necessary for billing/insurance, (c) useful but not necessary, or (d) not justified. Propose a retention schedule that limits storage to what is required.

B.3. Section 10.4.1 presents a dataset that satisfies 3-anonymity but is vulnerable to a homogeneity attack. Construct your own example of a dataset with at least 12 records and four quasi-identifiers that satisfies 4-anonymity but remains vulnerable to a homogeneity attack. Explain the attack.

B.4. Eli discovers that the Detroit Police Department uses a facial recognition system that stores biometric templates of all individuals captured by public surveillance cameras. The department claims the system satisfies Privacy by Design because templates are "encrypted at rest." Evaluate this claim against all seven of Cavoukian's principles. Is encryption sufficient to constitute Privacy by Design?

B.5. Dr. Adeyemi presents the following argument in class:

"Data minimization sounds good in theory, but in practice it conflicts with the goals of medical research. If hospitals only collect the minimum data needed for each patient's treatment, researchers lose access to the large, rich datasets that enable breakthroughs in understanding disease. Data minimization is a luxury that costs lives."

Write a structured response that (a) acknowledges the legitimate tension this argument identifies, (b) challenges at least one of its assumptions, and (c) proposes a framework for balancing minimization with research needs — drawing on concepts from this chapter such as differential privacy, federated learning, and purpose limitation.

B.6. A social media company claims it has "anonymized" a dataset of 10 million user posts by removing usernames and replacing them with random IDs. However, the dataset retains: the full text of each post, the timestamp of each post (to the second), the user's self-reported city, and the number of followers. Using concepts from Section 10.3 and Section 10.4, assess the re-identification risk. What additional steps would be needed to meaningfully reduce this risk?


Part C: Python Coding Challenges ⭐⭐-⭐⭐⭐⭐

These exercises require Python. Use the k-anonymity checker and differential privacy implementations from Sections 10.4 and 10.5 as starting points.

C.1. ⭐⭐ k-Anonymity Checker. Write a Python function check_k_anonymity(df, quasi_identifiers) that takes a pandas DataFrame and a list of column names (the quasi-identifiers), and returns the minimum group size — that is, the k-value the dataset achieves. Test your function on the following dataset:

import pandas as pd

data = {
    'age': [25, 25, 30, 30, 30, 35, 35, 35, 40, 40],
    'zipcode': ['10001', '10001', '10002', '10002', '10002', '10003', '10003', '10003', '10004', '10004'],
    'gender': ['M', 'M', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'F'],
    'diagnosis': ['Flu', 'Cold', 'Flu', 'Diabetes', 'Cold', 'Flu', 'Flu', 'Cancer', 'Cold', 'Flu']
}
df = pd.DataFrame(data)

What k-value does this dataset satisfy for quasi-identifiers ['age', 'zipcode', 'gender']?
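As a hint (one approach among several), pandas can compute equivalence-class sizes directly. The toy DataFrame below is hypothetical and deliberately not the exercise dataset:

```python
import pandas as pd

# Hypothetical 3-record toy example, not the exercise dataset above.
toy = pd.DataFrame({
    'age': [25, 25, 30],
    'zipcode': ['10001', '10001', '10002'],
    'gender': ['M', 'M', 'F'],
})

# Each unique combination of quasi-identifier values is one equivalence
# class; a dataset's k-value is the size of its smallest class.
class_sizes = toy.groupby(['age', 'zipcode', 'gender']).size()
k = int(class_sizes.min())
print(k)  # 1 -- the lone (30, '10002', 'F') record makes this toy 1-anonymous
```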

C.2. ⭐⭐ Generalization for k-Anonymity. Write a function generalize_age(df, bin_size) that replaces exact ages with age ranges (e.g., an age of 27 with bin_size=10 becomes "20-29"). Apply this function to the dataset from C.1 with different bin sizes (5, 10, 20) and recalculate the k-value each time using your function from C.1. What is the trade-off between bin size and k-value? At what bin size do you first achieve 3-anonymity?
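The binning arithmetic in the example above (age 27 with bin_size=10 becomes "20-29") can be sketched with integer division; this fragment assumes integer ages:

```python
age = 27
bin_size = 10

lo = (age // bin_size) * bin_size    # floor to the bin's lower edge: 20
label = f"{lo}-{lo + bin_size - 1}"  # upper edge is inclusive: "20-29"
print(label)  # 20-29
```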

C.3. ⭐⭐ l-Diversity Checker. Extend your k-anonymity checker to also verify l-diversity. Write a function check_l_diversity(df, quasi_identifiers, sensitive_attribute) that returns the minimum number of distinct values of the sensitive attribute across all equivalence classes. Test it on the dataset from C.1 with sensitive_attribute='diagnosis'. Does the dataset satisfy 2-diversity?

C.4. ⭐⭐⭐ Laplace Mechanism for Differential Privacy. Implement the Laplace mechanism for differential privacy as described in Section 10.5.2:

import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """
    Add Laplace noise to a true value.

    Parameters:
        true_value: The actual query result (e.g., a count or average)
        sensitivity: The maximum change in the query result
                     from adding or removing one record
        epsilon: Privacy budget (smaller = more private, noisier)

    Returns:
        The noisy result
    """
    # Your implementation here
    pass

Then simulate the following: a hospital has 1,000 patients, and the true count of patients with diabetes is 127. Use your Laplace mechanism with sensitivity=1 (counting query) to produce noisy answers for epsilon values of 0.01, 0.1, 0.5, 1.0, and 5.0. For each epsilon, run 1,000 trials and report the mean and standard deviation of the noisy answers. Plot the results. What happens to accuracy as privacy increases (epsilon decreases)?
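As a hint, NumPy already ships a Laplace sampler, so the mechanism reduces to choosing the right scale parameter, b = sensitivity / epsilon (a sketch under that standard parameterization, not a complete solution):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

true_value = 127   # true diabetes count from the exercise
sensitivity = 1    # a counting query changes by at most 1 per record
epsilon = 0.1

# Laplace noise with scale b = sensitivity / epsilon; a smaller epsilon
# means a larger scale and therefore noisier, more private answers.
noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
noisy_answer = true_value + noise

# Useful for interpreting your simulation: Laplace(b) noise has standard
# deviation b * sqrt(2), so epsilon = 0.1 gives errors of roughly +/- 14.
expected_std = (sensitivity / epsilon) * np.sqrt(2)
print(round(expected_std, 2))  # 14.14
```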

C.5. ⭐⭐⭐ Privacy Budget Tracker. Section 10.5.3 discusses the concept of a privacy budget. Write a class PrivacyBudgetTracker that:

  • Is initialized with a total privacy budget (epsilon_total)
  • Has a method query(true_value, sensitivity, epsilon_cost) that checks whether sufficient budget remains, and if so, applies the Laplace mechanism and deducts the cost from the remaining budget
  • Has a method remaining_budget() that returns the remaining epsilon
  • Raises an exception if a query would exceed the total budget

Demonstrate its use with a total budget of 1.0 and a sequence of five queries, showing what happens when the budget is exhausted.
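The budget arithmetic behind the tracker (sequential composition: epsilon costs add) can be sketched without the class itself; the five costs below are illustrative, not prescribed:

```python
# Hypothetical query costs that sum past the total budget of 1.0.
total = 1.0
spent = 0.0
for cost in [0.3, 0.3, 0.2, 0.1, 0.2]:
    if spent + cost > total:
        # A real tracker would raise an exception here instead of printing.
        print(f"query refused: cost {cost} exceeds remaining {total - spent:.1f}")
        break
    spent += cost
print(f"remaining budget: {total - spent:.1f}")
```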

C.6. ⭐⭐⭐ Suppression Strategy. Write a function suppress_for_k_anonymity(df, quasi_identifiers, k) that achieves k-anonymity by suppressing (removing) records that belong to equivalence classes smaller than k. Apply it to the dataset from C.1 with k=3. How many records are suppressed? What is the information loss? Discuss the ethical implications of systematically removing records from minority groups.

C.7. ⭐⭐⭐⭐ Randomized Response. Section 10.5 briefly mentions randomized response as a local differential privacy technique. Implement a randomized response protocol for a yes/no question:

def randomized_response(true_answer, p_truth=0.75):
    """
    With probability p_truth, report the true answer.
    With probability (1 - p_truth), report a random answer.
    """
    pass

def estimate_true_proportion(responses, p_truth=0.75):
    """
    Given a list of randomized responses, estimate the true proportion
    of 'yes' answers using the correction formula.
    """
    pass

Simulate a population of 10,000 people where the true proportion answering "yes" to a sensitive question is 0.30. Run the randomized response protocol with p_truth values of 0.5, 0.75, and 0.9. For each, compare the estimated proportion to the true proportion and measure the error. How does the privacy-accuracy trade-off manifest?
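The correction formula behind estimate_true_proportion follows from the protocol's response probabilities. The sketch below assumes "a random answer" means a fair coin flip between yes and no:

```python
# P(report yes) = p_truth * pi + (1 - p_truth) * 0.5, where pi is the true
# proportion of "yes" answers. Solving for pi gives an unbiased estimator.
def estimate(observed_yes_rate, p_truth):
    return (observed_yes_rate - (1 - p_truth) / 2) / p_truth

# Sanity checks: with p_truth = 1.0 (no noise) the estimate equals the
# observed rate; with p_truth = 0.75 and true pi = 0.30, the expected
# observed rate is 0.75 * 0.30 + 0.25 * 0.5 = 0.35, and plugging it back
# into the estimator recovers 0.30.
print(estimate(0.30, 1.00))            # 0.3
print(round(estimate(0.35, 0.75), 4))  # 0.3
```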


Part D: Synthesis & Critical Thinking ⭐⭐⭐

These questions require you to integrate multiple concepts from Chapter 10 and think beyond the material presented.

D.1. The chapter presents Privacy by Design and data minimization as complementary principles. But consider a scenario where they conflict: a self-driving car company argues that Privacy by Design requires them to collect as much sensor data as possible — including images of pedestrians, license plates of nearby vehicles, and precise GPS trails — because the safest possible system requires the richest possible training data. "Privacy by Design means designing the safest product," they argue. "And safety requires data."

Evaluate this argument. Does Privacy by Design require maximizing safety even at the cost of minimization? How would you resolve the tension? Reference Cavoukian's principles in your analysis.

D.2. Section 10.5 introduces differential privacy as a mathematically rigorous privacy framework. Critics argue that differential privacy is too abstract for most organizations to implement, that the choice of epsilon is inherently subjective, and that the noise required for strong privacy guarantees often destroys the utility of the data for the purposes organizations need. Defenders respond that differential privacy is the only framework that provides a formal, provable privacy guarantee and that all other approaches (k-anonymity, suppression, generalization) have been repeatedly broken.

Write a 400-500 word analysis evaluating both sides. Under what circumstances would you recommend differential privacy over k-anonymity, and vice versa? Is there a role for combining multiple approaches?

D.3. Ray Zhao tells Mira: "Look, I understand the theory. But at NovaCorp, we have 200 million customer records. Every time we try to minimize data collection, a business unit pushes back — marketing needs behavioral data, risk needs transaction histories, compliance needs everything for seven years. Privacy by Design sounds great in a textbook, but in a real company, it's a constant negotiation with people whose bonuses depend on having more data, not less."

Write a response that takes Ray's practical concerns seriously while still arguing for the feasibility of Privacy by Design in a large organization. Propose three specific, actionable steps that NovaCorp could take to move toward Privacy by Design without requiring every business unit to agree simultaneously.

D.4. The chapter discusses how Apple uses local differential privacy to collect usage statistics from iPhones without learning individual user behavior. Consider the following critique:

"Apple's differential privacy implementation is a form of privacy theater. The epsilon values Apple uses have been shown to be quite large, meaning the privacy protection is weaker than the mathematical framework suggests. More importantly, Apple still collects vast quantities of non-differentially-private data — iCloud backups, Siri recordings, App Store purchase histories. Differential privacy on emoji usage is a distraction from the real data collection."

Evaluate this critique. Is it fair? What would a genuinely comprehensive Privacy by Design approach look like for a company of Apple's scope?


Part E: Research & Extension ⭐⭐⭐⭐

These are open-ended projects for students seeking deeper engagement. Each requires independent research beyond the textbook.

E.1. The GDPR and Privacy by Design (Article 25). Research Article 25 of the GDPR, which codifies "data protection by design and by default" into law. Write a 1,000-word analysis covering: (a) what Article 25 requires, (b) how it translates Cavoukian's principles into legal obligations, (c) at least two enforcement actions where regulators cited Article 25 violations, and (d) whether the legal requirements go far enough or whether they water down the original Privacy by Design vision. Use at least three sources beyond this textbook.

E.2. Differential Privacy in Practice: The U.S. Census. The 2020 U.S. Census used differential privacy for the first time to protect respondent data in published statistics. This decision was controversial — some states argued that the added noise would reduce the accuracy of redistricting data. Research this case and write a report (800-1,200 words) covering: (a) how the Census Bureau implemented differential privacy, (b) the specific objections raised by states, civil rights organizations, and demographers, (c) the trade-offs between privacy protection and data accuracy, and (d) your own evaluation of whether the Census Bureau made the right choice.

E.3. Federated Learning: Promise and Limitations. Google's Gboard keyboard uses federated learning to improve autocomplete predictions without sending individual keystrokes to a central server. Research federated learning and write a report (800-1,200 words) addressing: (a) how federated learning works technically, (b) what privacy guarantees it provides and what it does not protect against, (c) at least two real-world deployments beyond Gboard, and (d) the relationship between federated learning and differential privacy (can they be combined, and if so, how?).

E.4. Privacy by Design Audit. Select a product, service, or system you use regularly (a fitness tracker, a banking app, a university learning management system). Conduct a Privacy by Design audit by evaluating it against all seven of Cavoukian's principles. For each principle, assess whether the system satisfies, partially satisfies, or fails to satisfy the principle, providing specific evidence. Write a 1,500-word audit report with recommendations for improvement.


Solutions

Selected solutions are available in appendices/answers-to-selected.md.