Quiz: Privacy by Design and Data Minimization
Test your understanding before moving to the next chapter. Target: 70% or higher to proceed.
Section 1: Multiple Choice (1 point each)
1. Ann Cavoukian's Privacy by Design framework originated in which decade and context?
- A) The 2010s, in response to the Cambridge Analytica scandal.
- B) The 1990s, developed by the Information and Privacy Commissioner of Ontario, Canada.
- C) The 2000s, as part of the U.S. National Institute of Standards and Technology (NIST) privacy framework.
- D) The 1980s, as an extension of the OECD Fair Information Practice Principles.
Answer
**B)** The 1990s, developed by the Information and Privacy Commissioner of Ontario, Canada. *Explanation:* Section 10.1 explains that Ann Cavoukian developed the Privacy by Design framework in the 1990s during her tenure as Information and Privacy Commissioner of Ontario. It was formally published as seven foundational principles in 2009 and later recognized by the International Conference of Data Protection and Privacy Commissioners. Options A, C, and D misattribute the origin to different decades and institutions. The framework predates both the Cambridge Analytica scandal and the NIST Privacy Framework.

2. Which of the following best captures the meaning of Cavoukian's principle "Proactive not Reactive; Preventative not Remedial"?
- A) Organizations should respond quickly to privacy breaches when they occur.
- B) Privacy protections should be built into systems before they are deployed, anticipating and preventing privacy violations rather than waiting for them to happen.
- C) Users should proactively manage their own privacy settings rather than relying on organizations.
- D) Regulators should impose penalties severe enough to deter future privacy violations.
Answer
**B)** Privacy protections should be built into systems before they are deployed, anticipating and preventing privacy violations rather than waiting for them to happen. *Explanation:* Section 10.1.1 describes this as the first and foundational principle: privacy should be anticipated and prevented rather than detected and remedied after the fact. Option A describes incident response, which is reactive by definition. Option C shifts responsibility to users, contradicting the organizational focus of PbD. Option D describes regulatory deterrence, a different mechanism entirely. The principle requires organizations to design systems with privacy protections embedded from the outset.

3. Data minimization, as described in this chapter, requires:
- A) Collecting no data at all until a specific use case has been approved by a regulator.
- B) Collecting only the data that is adequate, relevant, and limited to what is necessary for the specified purpose.
- C) Minimizing the cost of data storage by deleting old records.
- D) Reducing the number of data fields in a database to improve query performance.
Answer
**B)** Collecting only the data that is adequate, relevant, and limited to what is necessary for the specified purpose. *Explanation:* Section 10.2 defines data minimization using language drawn from the GDPR (Article 5(1)(c)): personal data must be "adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed." Option A overstates the requirement — minimization does not mean collecting nothing, but collecting only what is justified. Option C confuses data minimization with cost management. Option D describes database optimization, a technical concern unrelated to privacy.

4. A hospital replaces patient names with random alphanumeric codes but retains dates of birth, ZIP codes, and treatment records linked to those codes. This process is best described as:
- A) Anonymization
- B) Pseudonymization
- C) Differential privacy
- D) Data deletion
Answer
**B)** Pseudonymization *Explanation:* Section 10.3.1 distinguishes pseudonymization from anonymization. Pseudonymization replaces direct identifiers (names) with codes or tokens, but the data can still be re-linked to the individual if the mapping is available or if quasi-identifiers (date of birth, ZIP code) enable re-identification. Because the hospital retains quasi-identifiers that could enable re-identification, the data is pseudonymized, not anonymized. Anonymization (A) would require making re-identification reasonably impossible. Differential privacy (C) involves adding calibrated noise. Data deletion (D) is the removal of data entirely.

5. A dataset satisfies 5-anonymity. This means:
- A) Exactly five individuals appear in every equivalence class.
- B) Every combination of quasi-identifier values in the dataset is shared by at least five records.
- C) Five percent of the records have been suppressed to prevent re-identification.
- D) The dataset has been reduced to five data fields.
Answer
**B)** Every combination of quasi-identifier values in the dataset is shared by at least five records. *Explanation:* Section 10.4.1 defines k-anonymity: a dataset satisfies k-anonymity if every combination of quasi-identifier values (the "equivalence class") appears in at least k records. For 5-anonymity, no individual can be distinguished from at least four other people sharing the same quasi-identifier values. Option A is too restrictive — equivalence classes can contain more than five records. Options C and D describe unrelated concepts.

6. The homogeneity attack against k-anonymity exploits which specific weakness?
- A) The quasi-identifiers are not sufficiently generalized.
- B) All records in an equivalence class share the same value for the sensitive attribute, so an attacker learns the sensitive value despite not identifying the specific individual.
- C) The dataset contains too few records to achieve meaningful anonymity.
- D) The encryption key used to protect the data is too short.
Answer
**B)** All records in an equivalence class share the same value for the sensitive attribute, so an attacker learns the sensitive value despite not identifying the specific individual. *Explanation:* Section 10.4.2 explains the homogeneity attack: even if a dataset satisfies k-anonymity, an attacker who knows a target belongs to a specific equivalence class can learn their sensitive attribute if all members of that class share the same sensitive value. For example, if all five people in a 5-anonymous group have "Cancer" as their diagnosis, knowing someone is in that group reveals their diagnosis with certainty. This vulnerability motivated the development of l-diversity.

7. l-Diversity addresses the homogeneity attack by requiring:
- A) That each equivalence class contains at least l distinct values for the sensitive attribute.
- B) That the dataset is divided into l separate tables.
- C) That each record is duplicated l times to increase the group size.
- D) That at least l different encryption algorithms are used.
Answer
**A)** That each equivalence class contains at least l distinct values for the sensitive attribute. *Explanation:* Section 10.4.2 defines l-diversity: each equivalence class must contain at least l "well-represented" values for the sensitive attribute. This ensures that an attacker cannot learn the sensitive value simply by identifying the equivalence class. If a group of five people contains three different diagnoses (e.g., Flu, Cold, Cancer), the attacker cannot determine which diagnosis belongs to a target individual. Options B, C, and D describe unrelated mechanisms.

8. Differential privacy provides its guarantee by:
- A) Removing all identifying information from datasets before analysis.
- B) Adding carefully calibrated random noise to query results, so that it is nearly impossible to determine whether any single individual's data was included.
- C) Encrypting data so that only authorized users can read it.
- D) Limiting the number of people who can access a database.
Answer
**B)** Adding carefully calibrated random noise to query results, so that it is nearly impossible to determine whether any single individual's data was included. *Explanation:* Section 10.5 explains that differential privacy works by adding noise drawn from a specific probability distribution (typically Laplace or Gaussian) to the output of database queries. The noise is calibrated so that the probability of any output is nearly the same whether or not a given individual's data is in the dataset. This provides a formal, mathematical privacy guarantee expressed through the parameter epsilon. Options A, C, and D describe other privacy techniques (anonymization, encryption, access control) that do not provide the same mathematical guarantee.

9. In differential privacy, a smaller value of epsilon (the privacy parameter) means:
- A) Less privacy and more accurate results.
- B) More privacy but noisier (less accurate) results.
- C) Faster computation time.
- D) A larger dataset is required.
Answer
**B)** More privacy but noisier (less accurate) results. *Explanation:* Section 10.5.2 explains that epsilon controls the privacy-accuracy trade-off. A smaller epsilon means stronger privacy protection because more noise is added to query results, making it harder to infer any individual's contribution. However, this added noise reduces the accuracy of the results. A larger epsilon means less noise (more accuracy) but weaker privacy protection. This trade-off is fundamental to differential privacy and cannot be eliminated — it can only be managed.

10. Which of the following best describes federated learning as a Privacy-Enhancing Technology?
- A) A technique that encrypts data so it can be processed without decryption.
- B) A machine learning approach where the model is trained across multiple devices or servers holding local data, without exchanging the raw data itself.
- C) A method for splitting a dataset into multiple parts so no single party sees the full picture.
- D) A regulatory framework requiring companies to store data in the country where it was collected.
Answer
**B)** A machine learning approach where the model is trained across multiple devices or servers holding local data, without exchanging the raw data itself. *Explanation:* Section 10.6.2 describes federated learning: instead of centralizing data for model training, the model is sent to where the data resides. Each device or server trains a local model on its own data and sends only the model updates (gradients or parameters) — not the raw data — back to a central server, which aggregates them. This reduces the privacy risk of data centralization. Option A describes homomorphic encryption. Option C describes secure multi-party computation. Option D describes data localization policy, not a technology.

Section 2: True/False with Justification (1 point each)
For each statement, determine whether it is true or false and provide a brief justification.
11. "Privacy by Design requires that privacy and functionality be treated as a zero-sum trade-off — more privacy necessarily means less functionality."
Answer
**False.** *Explanation:* Section 10.1.4 explains Cavoukian's principle of "Positive-Sum, not Zero-Sum" (also called "Full Functionality — Win-Win"). The principle explicitly rejects the assumption that privacy must come at the cost of functionality, security, or business objectives. Privacy by Design holds that it is possible — and required — to design systems that achieve both privacy and full functionality. While trade-offs may exist in specific implementations, the framework insists that organizations should pursue solutions that accommodate both rather than treating privacy as an obstacle to innovation.

12. "k-Anonymity guarantees that an individual's sensitive attribute (such as medical diagnosis) cannot be learned by an attacker who identifies their equivalence class."
Answer
**False.** *Explanation:* Section 10.4.2 demonstrates that k-anonymity does not protect against the homogeneity attack. If all records in an equivalence class share the same sensitive value (e.g., all five people in a 5-anonymous group have the diagnosis "Cancer"), then an attacker who identifies the equivalence class learns the sensitive value with certainty. k-Anonymity protects against re-identification of individuals but does not protect against attribute disclosure. This limitation motivated the development of l-diversity and t-closeness.

13. "Differential privacy's guarantees hold regardless of what auxiliary information an attacker possesses."
Answer
**True.** *Explanation:* Section 10.5.1 explains that one of differential privacy's most powerful properties is its robustness to auxiliary information. The mathematical guarantee states that an attacker's ability to infer information about any individual is bounded by epsilon, regardless of what other information the attacker has access to, including other databases and public records. Even the results of previous differentially-private queries on the same data do not break the guarantee; they simply consume additional privacy budget under composition. This is in contrast to k-anonymity and l-diversity, which can be defeated by attackers with sufficient auxiliary information about the quasi-identifiers.

14. "Pseudonymized data is considered anonymous under the GDPR and therefore falls outside the regulation's scope."
Answer
**False.** *Explanation:* Section 10.3.1 explicitly states that pseudonymized data remains personal data under the GDPR. Recital 26 of the GDPR clarifies that data that has been pseudonymized — where direct identifiers have been replaced but the data can still be attributed to a specific person using additional information — is subject to the regulation's full requirements. Only truly anonymized data, where re-identification is no longer reasonably possible, falls outside the GDPR's scope. This distinction is critical because many organizations incorrectly assume that replacing names with codes makes data "anonymous."

15. "Homomorphic encryption allows computations to be performed on encrypted data without ever decrypting it."
Answer
**True.** *Explanation:* Section 10.6.1 describes homomorphic encryption as a cryptographic technique that enables computations (addition, multiplication, and more complex operations in fully homomorphic encryption) to be performed directly on ciphertext. The results, when decrypted, match the results that would have been obtained from performing the same operations on the plaintext data. This means a cloud provider can process encrypted health records and return encrypted results to the hospital, without ever seeing the unencrypted patient data. While the chapter notes that fully homomorphic encryption remains computationally expensive, it represents a significant advance in privacy-preserving computation.

Section 3: Short Answer (2 points each)
16. Explain the concept of a "privacy budget" in differential privacy. Why is composition a concern, and what happens when the budget is exhausted?
Sample Answer
A privacy budget (typically denoted as the total epsilon) represents the cumulative privacy loss tolerated across all queries performed on a dataset. Each query consumes a portion of the budget — the epsilon allocated to that specific query. Under sequential composition, the total privacy loss is the sum of the individual epsilon values. This means that while a single query with epsilon = 0.1 provides strong privacy, running 100 such queries results in a total epsilon of 10.0, which provides very weak privacy. When the budget is exhausted, no further queries can be answered without violating the overall privacy guarantee. This is a fundamental constraint: differential privacy does not offer unlimited analysis. Organizations must decide in advance how many queries they will need and allocate the budget accordingly — a planning requirement that represents a significant departure from traditional data analysis practices.

*Key points for full credit:*
- Defines privacy budget as cumulative epsilon across all queries
- Explains composition (sequential queries accumulate privacy loss)
- Notes that an exhausted budget means no further queries without violating guarantees

17. Compare and contrast two approaches to achieving data minimization: (a) collecting less data at the point of collection, and (b) collecting data broadly but restricting access and deleting it promptly. What are the advantages and risks of each approach?
Sample Answer
Collecting less data at the point of collection (approach A) is the strongest form of minimization because data that was never collected cannot be breached, subpoenaed, or misused. It eliminates risk at the source. However, it requires organizations to define their purposes clearly *before* collection, which can be difficult in research or exploratory contexts where future needs are uncertain. The disadvantage is irreversibility — data not collected is permanently unavailable.

Collecting broadly but restricting access and deleting promptly (approach B) preserves flexibility. Organizations can use data for defined purposes, then discard it. Access controls ensure that only authorized personnel see specific data elements. However, this approach carries significant risks: data that exists can be breached (as the Equifax case shows), access controls can fail, "prompt deletion" often becomes indefinite retention in practice, and the mere existence of rich datasets creates temptation for function creep.

From a Privacy by Design perspective, approach A is preferred because it embodies the principle "privacy as the default setting" — the system does not collect more than necessary, regardless of whether technical controls exist to limit access after the fact.

*Key points for full credit:*
- Identifies that approach A eliminates risk at the source
- Identifies that approach B preserves flexibility but retains risk
- Notes the practical difficulty of ensuring "prompt deletion" in approach B
- Connects to Privacy by Design principles

18. Mira discovers that VitraMed's patient database satisfies 3-anonymity for the quasi-identifiers [age, gender, ZIP code] but does not satisfy 2-diversity for the sensitive attribute "primary diagnosis." Explain what privacy risk this creates and describe two concrete steps VitraMed could take to address it.
Sample Answer
The lack of 2-diversity means that at least one equivalence class in the database contains patients who all share the same diagnosis. An attacker who knows a patient's age, gender, and ZIP code can identify their equivalence class and, if that class lacks diversity, learn the patient's diagnosis with certainty — even without knowing which specific record belongs to the patient. For example, if all patients aged 35-40, male, in ZIP code 48201 have the diagnosis "HIV positive," anyone who knows a 37-year-old man lives in 48201 can infer his diagnosis.

Two steps to address this: First, VitraMed could further generalize the quasi-identifiers (e.g., broadening age ranges or ZIP code prefixes) to create larger equivalence classes that naturally contain more diagnostic diversity. This reduces specificity but improves privacy. Second, VitraMed could apply anatomization — separating the quasi-identifier table from the sensitive attribute table and linking them through group IDs rather than individual record IDs, ensuring that the mapping between specific individuals and specific diagnoses is obscured even within equivalence classes. Alternatively, VitraMed could apply differential privacy to query outputs rather than releasing the microdata directly.

*Key points for full credit:*
- Explains the homogeneity attack that results from failing l-diversity
- Provides a concrete example showing the risk
- Proposes two actionable remediation steps with reasoning

19. The chapter describes Apple's use of local differential privacy for collecting emoji usage statistics. Explain the difference between local and global (central) differential privacy. Why might Apple prefer the local model, and what are the trade-offs?
Sample Answer
In global (central) differential privacy, a trusted curator collects raw data from all individuals, stores it centrally, and adds noise only when answering queries on the aggregate data. In local differential privacy, each individual adds noise to their own data *before* sending it to the collector. The collector receives only noisy data and never sees any individual's true response.

Apple likely prefers local DP because it eliminates the need for users to trust Apple with their raw data. Even if Apple's servers were breached, the attacker would find only noisy data. This aligns with Apple's brand positioning around privacy and avoids the liability of centralizing sensitive usage data.

However, local DP has a significant trade-off: it requires substantially more noise than global DP to achieve the same level of privacy. To get accurate aggregate statistics, the collector needs data from many more users — the noise cancels out in aggregate but is large at the individual level. This means local DP is practical for Apple (with hundreds of millions of users) but would be impractical for an organization with a small user base. It also means that for any given privacy guarantee, global DP produces more accurate results.

*Key points for full credit:*
- Correctly distinguishes where noise is added (individual vs. curator level)
- Explains why Apple's trust model favors local DP
- Identifies the accuracy trade-off (local DP requires more noise/more data)

Section 4: Applied Scenario (5 points)
20. Read the following scenario and answer all parts.
Scenario: CareConnect Health Platform
CareConnect is a telehealth startup that connects patients with therapists via video calls. To improve its matching algorithm (which pairs patients with therapists based on specialty, availability, and patient preferences), CareConnect collects the following data: full name, date of birth, home address, insurance ID, diagnosis codes, therapist notes after each session (free-text), session recordings (video and audio), mood self-assessments (1-10 scale before and after each session), and device/browser metadata.
CareConnect's engineering team wants to release a research dataset to academic partners studying the effectiveness of telehealth for mental health treatment. They propose the following anonymization steps:
- Replace patient names with random UUIDs
- Remove insurance IDs
- Keep diagnosis codes, session dates, mood scores, and therapist notes
- Keep session recordings but blur faces using automated software
The dataset would contain 12,000 patient records spanning two years.
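Part (a) below asks how the proposal fares against k-anonymity and l-diversity; both properties can be checked mechanically. The following Python sketch uses hypothetical toy records with invented field values, purely to illustrate the definitions from Sections 10.4.1 and 10.4.2:

```python
from collections import defaultdict

# Hypothetical toy extract: (age_range, gender, zip_prefix, diagnosis).
# The first three fields are quasi-identifiers; the last is sensitive.
records = [
    ("30-39", "F", "482", "Depression"),
    ("30-39", "F", "482", "Anxiety"),
    ("30-39", "F", "482", "Depression"),
    ("40-49", "M", "483", "PTSD"),
    ("40-49", "M", "483", "PTSD"),
]

def equivalence_classes(rows, qi_slice=slice(0, 3)):
    """Group rows by their quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        classes[row[qi_slice]].append(row)
    return classes

def k_anonymity(rows):
    """The dataset's k: size of the smallest equivalence class."""
    return min(len(group) for group in equivalence_classes(rows).values())

def l_diversity(rows, sensitive_index=3):
    """The dataset's l: fewest distinct sensitive values in any class."""
    return min(len({row[sensitive_index] for row in group})
               for group in equivalence_classes(rows).values())

print(k_anonymity(records))  # 2: the smallest group has two records
print(l_diversity(records))  # 1: one group is homogeneous ("PTSD" only)
```

Note how the toy data reproduces the homogeneity attack from question 6: the dataset is 2-anonymous, yet anyone who knows a target falls in the second group learns the diagnosis with certainty, because that group fails 2-diversity.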
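Part (c) invites alternatives built on PETs such as differential privacy; one such design is an interactive query interface governed by a privacy budget (questions 8, 9, and 16). The sketch below shows the core idea for counting queries using the Laplace mechanism. The `PrivateCounter` class and its parameters are invented for illustration; this is a toy, not a hardened implementation:

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse transform sampling."""
    u = random.uniform(-0.5, 0.5)
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

class PrivateCounter:
    """Toy interactive query interface with a sequential-composition
    privacy budget (hypothetical class, for illustration only)."""

    def __init__(self, data, total_epsilon):
        self.data = data
        self.remaining = total_epsilon

    def count(self, predicate, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon  # sequential composition: epsilons add up
        true_count = sum(1 for x in self.data if predicate(x))
        # A count has sensitivity 1 (adding or removing one person changes
        # it by at most 1), so Laplace noise with scale 1/epsilon yields
        # epsilon-differential privacy for this query.
        return true_count + laplace_noise(1.0 / epsilon)
```

A query run with `epsilon=0.5` against a total budget of 1.0 leaves 0.5 for future queries; a further query requesting 0.6 raises the exhaustion error, mirroring the constraint described in question 16.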
(a) Evaluate CareConnect's proposed anonymization against the concepts of pseudonymization, anonymization, k-anonymity, and l-diversity. Is the resulting dataset truly anonymous? Identify at least three specific re-identification risks. (1 point)
(b) Apply the principle of data minimization to the research dataset. Which data elements are necessary for studying telehealth effectiveness, and which should be excluded? Justify each decision. (1 point)
(c) Propose an alternative approach using at least two Privacy-Enhancing Technologies from Section 10.6 that would allow the research goals to be achieved without releasing a de-identified dataset. Explain how each PET would work in this specific context. (1 point)
(d) CareConnect's CTO argues: "We need to keep session recordings in the dataset because automated sentiment analysis of patient tone of voice is a key research variable. Blurring faces should be sufficient." Using concepts from this chapter, evaluate this argument. What privacy risks remain even with face blurring, and what alternatives exist? (1 point)
(e) Draft a brief Privacy by Design recommendation (3-5 bullet points) for CareConnect to implement before any research data sharing occurs. Each recommendation should reference a specific Cavoukian principle. (1 point)
Sample Answer
**(a)** The proposed dataset is pseudonymized, not anonymized. Replacing names with UUIDs is pseudonymization by definition — if the UUID-to-name mapping exists (or can be reconstructed), the data is re-identifiable. Three re-identification risks:

1. **Therapist notes (free-text):** Therapists may reference patients by name, mention employers, family members, or specific life events in their notes. Free-text fields are notoriously difficult to de-identify reliably — automated NLP tools miss edge cases, abbreviations, and indirect references.
2. **Session recordings with blurred faces:** Voice is a biometric identifier. Even with faces blurred, voice prints can identify individuals. Additionally, face-blurring algorithms can fail on certain skin tones, angles, or lighting conditions, and background details (bookshelves, wall art, room layout) may be recognizable to people who know the patient.
3. **Diagnosis codes + session dates + demographic attributes:** The combination of a specific diagnosis code, session dates and times, and date of birth could enable re-identification by cross-referencing with insurance claims databases, especially for rare diagnoses in small geographic areas. With 12,000 records, k-anonymity for the combination of [DOB, diagnosis, session dates] is likely very low.

**(b)** For studying telehealth effectiveness:

- **Keep (with modifications):** Diagnosis category (generalized, not specific codes), mood scores, session count and duration, therapist specialty. These are directly relevant to measuring outcomes.
- **Exclude:** Full date of birth (replace with age range), home address (irrelevant to effectiveness), session recordings (extract sentiment scores algorithmically before anonymization, then discard recordings), therapist notes (replace with structured outcome codes completed by therapists specifically for research), device/browser metadata (irrelevant to clinical outcomes).
- **Transform:** Session dates should be converted to relative offsets (days since first session) rather than absolute dates, preventing temporal re-identification.

**(c)** Two PET-based alternatives:

First, **federated learning:** Instead of releasing a dataset, CareConnect could host the data on its own secure servers and allow researchers to submit model architectures. The models would be trained on CareConnect's servers against the real data, and only the trained model parameters (not patient data) would be shared with researchers. This allows research without any data leaving CareConnect's control.

Second, **differential privacy:** CareConnect could create an interactive query interface where researchers submit aggregate statistical queries (e.g., "What is the average mood improvement for patients with depression after 10 sessions?") and receive differentially-private responses. CareConnect would manage a privacy budget to limit the total information disclosed. This allows hypothesis testing without releasing microdata.

**(d)** The CTO's argument underestimates several residual risks. First, voice is a biometric identifier — speaker recognition technology can match voice samples to known individuals with high accuracy, and voice cannot be "blurred" without destroying the very acoustic features needed for sentiment analysis. Second, face-blurring algorithms have documented failure rates, particularly for darker skin tones and partially occluded faces — meaning the "blurring" may not work consistently for all patients. Third, session content (topics discussed, details shared) can itself be identifying even without visual or vocal identity.

An alternative: CareConnect could run the sentiment analysis *in-house* before any data sharing, extracting aggregate acoustic features (average pitch, pitch variability, speaking rate, pause frequency) and converting them to standardized scores. The scores — not the recordings — would be included in the research dataset. This preserves the research variable while eliminating the biometric risk.

**(e)** Privacy by Design recommendations:

- **Proactive, not Reactive (Principle 1):** Conduct a formal Privacy Impact Assessment before any research data sharing. Identify re-identification risks, assess the necessity of each data element, and implement mitigation measures before data leaves CareConnect's systems.
- **Privacy as the Default (Principle 2):** Set the default to minimum disclosure. Researchers should receive only the specific data elements justified by their approved research protocol — not a comprehensive dump. Access should require a formal data use agreement.
- **Privacy Embedded into Design (Principle 3):** Build privacy protections into the research data pipeline architecture. Automated de-identification (NLP for notes, voice stripping for recordings, date generalization) should be part of the technical pipeline, not a manual afterthought.
- **Full Functionality (Principle 4):** Use PETs (federated learning, differential privacy) to enable the research goals without compromising patient privacy — demonstrating that privacy and research value are not zero-sum.
- **End-to-End Security (Principle 5):** Ensure that research data, however shared, is encrypted in transit and at rest, with audit logging of all access. Define a data destruction protocol for when the research project concludes.

Scoring & Review Recommendations
| Score Range | Assessment | Next Steps |
|---|---|---|
| Below 50% (0-13 pts) | Needs review | Re-read Sections 10.1-10.4, focus on definitions and examples |
| 50-69% (14-19 pts) | Partial understanding | Review the k-anonymity/l-diversity/differential privacy distinctions, redo Part B exercises |
| 70-85% (20-23 pts) | Solid understanding | Ready to proceed to Chapter 11; review any missed concepts |
| Above 85% (24-28 pts) | Strong mastery | Proceed to Chapter 11: The Economics of Privacy |

| Section | Points Available |
|---|---|
| Section 1: Multiple Choice | 10 points (10 questions x 1 pt) |
| Section 2: True/False with Justification | 5 points (5 questions x 1 pt) |
| Section 3: Short Answer | 8 points (4 questions x 2 pts) |
| Section 4: Applied Scenario | 5 points (5 parts x 1 pt) |
| Total | 28 points |
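Finally, the local-model mechanism from question 19 can be made concrete with randomized response, the classic local differential privacy primitive: each user randomizes their own answer before reporting, so the collector never sees a true value, yet the aggregate can be de-biased. A minimal Python sketch under assumed coin biases (this is an illustration, not Apple's actual mechanism):

```python
import random

def randomized_response(truth):
    """Report the true answer with probability 1/2; otherwise report a
    fair coin flip. Each report satisfies local DP with epsilon = ln(3):
    P(report True | truth True) = 0.75, P(report True | truth False) = 0.25."""
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def estimate_true_fraction(reports):
    """De-bias the aggregate: E[reported True] = 0.25 + 0.5 * p,
    so p = 2 * mean - 0.5."""
    mean = sum(reports) / len(reports)
    return 2 * mean - 0.5

# Simulate 100,000 users, 30% of whom truly answer "yes".
reports = [randomized_response(i < 30_000) for i in range(100_000)]
print(round(estimate_true_fraction(reports), 2))  # close to 0.30
```

The estimate's error shrinks roughly as 1 over the square root of the number of users, which illustrates why the sample answer to question 19 notes that local DP is practical only with a very large user base.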