Chapter 32: Quiz

Test your understanding of privacy-preserving data science. Answers follow each question.


Question 1

Latanya Sweeney showed that 87% of the U.S. population can be uniquely identified by just three attributes. What are these attributes, and why does this result undermine "de-identification by removing identifiers" as a privacy strategy?

**Answer:** The three attributes are **zip code, date of birth, and gender**. These are quasi-identifiers — attributes that are not individually identifying but become identifying in combination. Because the intersection of a 5-digit zip code (~30,000 values), a birth date (~365 × 80 ≈ 29,000 values), and a gender (2 values) produces ~1.7 billion possible combinations for a population of ~330 million, the vast majority of individuals are uniquely identified by this combination alone. This undermines the "remove identifiers" approach because the data can be re-identified by linking it with any external dataset (voter rolls, hospital records, survey data) that contains these quasi-identifiers. The problem is not insufficient anonymization — it is that high-dimensional data is inherently identifiable.
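
The arithmetic behind this argument can be sanity-checked in a few lines (the value counts are rough assumptions used in the text, not exact census figures):

```python
# Back-of-envelope: why {zip, birth date, gender} is almost always unique.
zip_codes = 30_000        # assumed count of populated 5-digit ZIP codes
birth_dates = 365 * 80    # ~80 years of possible birth dates
genders = 2

combinations = zip_codes * birth_dates * genders
population = 330_000_000

print(f"{combinations:,} combinations for {population:,} people")
print(f"~{combinations / population:.1f} combinations per person")
```

With several combinations available per person, most combinations are occupied by at most one individual.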

Question 2

State the formal definition of $(\varepsilon, \delta)$-differential privacy. What does the parameter $\varepsilon$ control, and what does $\delta$ represent?

**Answer:** A randomized mechanism $\mathcal{M}: \mathcal{D} \to \mathcal{R}$ satisfies $(\varepsilon, \delta)$-differential privacy if for all neighboring datasets $D, D'$ (differing in at most one record) and all measurable subsets $S \subseteq \mathcal{R}$: $\Pr[\mathcal{M}(D) \in S] \leq e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S] + \delta$. **Epsilon** ($\varepsilon$) controls the maximum multiplicative change in the output probability distribution when one individual's data is added or removed. Smaller $\varepsilon$ means stronger privacy — the mechanism's output is less sensitive to any single record. **Delta** ($\delta$) is the probability that the pure $\varepsilon$ guarantee fails — a catastrophic privacy breach probability. It should be cryptographically small, typically $\delta < 1/n$ where $n$ is the dataset size. When $\delta = 0$, the guarantee is called *pure* differential privacy.

Question 3

What is the sensitivity of a function, and why is it the key quantity for calibrating noise in differential privacy? Compute the L1 sensitivity of: (a) a counting query, (b) the mean of $n$ values bounded in $[0, B]$.

**Answer:** The **sensitivity** of a function $f$ measures the maximum change in $f$'s output when one record is added or removed from the dataset: $\Delta f = \max_{D, D' \text{ neighbors}} \|f(D) - f(D')\|_1$. It is the key quantity because it determines how much noise is needed: higher sensitivity means the output changes more when one record changes, so more noise is needed to mask that change. **(a)** A counting query has L1 sensitivity **1**: adding or removing one record changes the count by exactly 1. **(b)** The mean of $n$ values in $[0, B]$ has L1 sensitivity $B/n$: in the worst case, one record changes from 0 to $B$ (or vice versa), changing the sum by $B$ and the mean by $B/n$. This explains why DP works well for aggregate queries on large datasets: the sensitivity of the mean shrinks as $n$ grows, requiring less noise.
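
The $B/n$ bound for the mean can be checked empirically — a sketch assuming values bounded in $[0, B]$:

```python
import numpy as np

def mean_sensitivity(n, B):
    """Worst-case L1 sensitivity of the mean of n values bounded in [0, B]."""
    return B / n

# Empirical check: replacing a single record moves the mean by at most B/n.
rng = np.random.default_rng(0)
n, B = 1_000, 100.0
data = rng.uniform(0, B, size=n)
neighbor = data.copy()
neighbor[0] = B - neighbor[0]          # push one record toward the far end

observed = abs(neighbor.mean() - data.mean())
assert observed <= mean_sensitivity(n, B)
```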

Question 4

Describe the Laplace mechanism and the Gaussian mechanism. When would you prefer one over the other?

**Answer:** The **Laplace mechanism** achieves $\varepsilon$-DP (pure DP) by adding noise $Z \sim \text{Lap}(\Delta f / \varepsilon)$ to the true output, where $\Delta f$ is the L1 sensitivity. The **Gaussian mechanism** achieves $(\varepsilon, \delta)$-DP (approximate DP) by adding noise $Z \sim \mathcal{N}(0, \sigma^2)$ with $\sigma = (\Delta_2 f / \varepsilon) \sqrt{2\ln(1.25/\delta)}$, where $\Delta_2 f$ is the L2 sensitivity. The Gaussian mechanism is preferred when **composing many queries** because Gaussian noise composes more tightly than Laplace noise under advanced composition and RDP accounting — the privacy loss grows as $O(\sqrt{k})$ for $k$ queries rather than $O(k)$. The Laplace mechanism is preferred for single queries or when a pure ($\delta = 0$) guarantee is required, since the Gaussian mechanism always requires $\delta > 0$.
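
A minimal sketch of both mechanisms (the classical Gaussian calibration shown here is valid for $\varepsilon < 1$):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_value, l1_sensitivity, eps):
    """eps-DP: add Laplace noise with scale Δf/ε."""
    return true_value + rng.laplace(scale=l1_sensitivity / eps)

def gaussian_mechanism(true_value, l2_sensitivity, eps, delta):
    """(eps, delta)-DP (classical calibration, eps < 1)."""
    sigma = (l2_sensitivity / eps) * np.sqrt(2 * np.log(1.25 / delta))
    return true_value + rng.normal(scale=sigma)

# A counting query (sensitivity 1) answered at eps = 0.5:
noisy_count = laplace_mechanism(true_value=1234, l1_sensitivity=1.0, eps=0.5)
```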

Question 5

What is the exponential mechanism, and when is it used instead of the Laplace or Gaussian mechanism?

**Answer:** The **exponential mechanism** selects an output $r$ from a discrete set $\mathcal{R}$ with probability proportional to $\exp(\varepsilon \cdot u(D, r) / (2\Delta u))$, where $u$ is a utility function and $\Delta u$ is the sensitivity of $u$. It is used when the output is **non-numerical** — selecting the best feature, choosing a hyperparameter, picking the most popular category — where adding continuous noise to the output is not meaningful. The Laplace and Gaussian mechanisms add noise to numerical answers; the exponential mechanism randomizes the *selection* of a discrete answer, biasing toward high-utility choices while maintaining differential privacy. Outputs with utility close to the maximum are exponentially more likely to be selected, so when the gap between the best and second-best candidate is large, the mechanism almost always selects the correct answer.
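
A sketch of the exponential mechanism for a hypothetical category-selection task (the candidate names and counts are made up):

```python
import numpy as np

def exponential_mechanism(utilities, eps, sensitivity, rng):
    """Sample a candidate index with probability ∝ exp(ε·u / (2Δu))."""
    scores = eps * np.asarray(utilities, dtype=float) / (2 * sensitivity)
    scores -= scores.max()                 # shift for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return rng.choice(len(utilities), p=probs)

rng = np.random.default_rng(0)
# Pick the most popular category; utility = count, so Δu = 1.
cats = ["A", "B", "C"]
counts = [500, 490, 10]
winner = cats[exponential_mechanism(counts, eps=1.0, sensitivity=1.0, rng=rng)]
```

With this utility gap, category "A" is selected the overwhelming majority of the time, illustrating the "large gap, almost always correct" behavior described above.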

Question 6

Explain the difference between basic composition and advanced composition. Why does the distinction matter for DP-SGD?

**Answer:** **Basic composition** states that $k$ mechanisms with $(\varepsilon, \delta)$-DP compose to $(k\varepsilon, k\delta)$-DP — the privacy loss grows linearly. **Advanced composition** states that for any $\delta' > 0$, they compose to approximately $(\varepsilon\sqrt{2k\ln(1/\delta')}, k\delta + \delta')$-DP — the privacy loss grows as $\sqrt{k}$. This distinction matters critically for DP-SGD because training involves thousands of gradient steps (each consuming privacy budget). Under basic composition, 1,000 steps at $\varepsilon = 0.01$ each would cost $\varepsilon_{\text{total}} = 10$; under advanced composition, approximately $\varepsilon_{\text{total}} \approx 1.5$; under RDP accounting, approximately $\varepsilon_{\text{total}} \approx 0.5$. The difference between a useful and useless privacy guarantee for a trained model depends entirely on which composition theorem is used.
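
The numbers above can be reproduced directly (using the simplified advanced-composition bound, which drops the lower-order $k\varepsilon(e^\varepsilon - 1)$ term):

```python
import math

eps_step, k, delta_prime = 0.01, 1_000, 1e-5

basic = k * eps_step                                         # linear growth
advanced = eps_step * math.sqrt(2 * k * math.log(1 / delta_prime))  # sqrt growth

print(f"basic: {basic:.1f}, advanced: {advanced:.2f}")  # basic: 10.0, advanced: 1.52
```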

Question 7

Describe the three modifications that DP-SGD makes to standard stochastic gradient descent. What is the role of each?

**Answer:** DP-SGD modifies standard SGD in three ways: **(1) Per-example gradient computation:** Instead of computing the average gradient over a mini-batch, DP-SGD computes the gradient for each individual example separately. This is necessary because the privacy guarantee requires bounding each individual's contribution. **(2) Gradient clipping:** Each per-example gradient is clipped to a maximum L2 norm $C$: $\bar{g}_i = g_i \cdot \min(1, C/\|g_i\|_2)$. This bounds the sensitivity of the gradient sum to $C$, which determines how much noise is needed. Without clipping, one outlier example could produce a very large gradient, requiring enormous noise to mask it. **(3) Gaussian noise addition:** Calibrated Gaussian noise $\mathcal{N}(0, \sigma^2 C^2 I)$ is added to the sum of clipped gradients. This noise makes the gradient update differentially private — the update is approximately the same whether or not any single example was in the batch.
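
The three modifications can be sketched in a few lines of NumPy (a toy illustration, not Opacus's implementation; the zero-norm guard is an added assumption):

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each gradient, sum, add Gaussian noise, average."""
    clipped = []
    for g in per_example_grads:                       # (1) per-example grads
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))  # (2) clip
    total = np.sum(clipped, axis=0)
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)   # (3) noise, then average

rng = np.random.default_rng(0)
grads = [rng.normal(size=10) for _ in range(32)]      # stand-in gradients
update = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```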

Question 8

Why is BatchNorm incompatible with Opacus/DP-SGD, and what is the standard replacement?

**Answer:** `BatchNorm` computes the mean and variance of activations **across all examples in the batch**, then normalizes each example using these batch-level statistics. This means that each example's normalized activation depends on *every other example in the batch* — violating the per-example independence required by DP-SGD. If one example's data changes, the batch statistics change, which affects the gradients of all other examples in the batch, making it impossible to bound the sensitivity of any single example's gradient. The standard replacement is **`GroupNorm`** (or `LayerNorm`), which normalizes activations within each example independently — the statistics are computed over groups of channels (GroupNorm) or over all features (LayerNorm) for a single example, so no information flows between examples in a batch.
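
Per-example normalization can be illustrated without any deep-learning framework — changing one example leaves every other example's output untouched:

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Per-example GroupNorm sketch: statistics are computed within each
    example only, so no information crosses the batch axis.
    x has shape (batch, channels)."""
    b, c = x.shape
    g = x.reshape(b, num_groups, c // num_groups)
    mean = g.mean(axis=2, keepdims=True)      # per-example, per-group mean
    var = g.var(axis=2, keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(b, c)

x = np.random.default_rng(0).normal(size=(4, 8))
y = group_norm(x, num_groups=2)

# Perturb example 0 only: every other example's output is unchanged.
x2 = x.copy()
x2[0, 0] += 100.0
assert np.allclose(group_norm(x2, 2)[1:], y[1:])
```

The same experiment with batch-level statistics would change the output of all four examples, which is exactly the coupling that breaks per-example sensitivity bounds.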

Question 9

What is RDP (Rényi Differential Privacy), and why does Opacus use it for privacy accounting instead of basic or advanced composition?

**Answer:** Rényi Differential Privacy defines privacy using the Rényi divergence of order $\alpha$: a mechanism satisfies $(\alpha, \hat{\varepsilon})$-RDP if the Rényi divergence between its output distributions on any neighboring datasets is at most $\hat{\varepsilon}$. RDP composes linearly (like basic composition) but converts to $(\varepsilon, \delta)$-DP with an optimization over $\alpha$ that produces much tighter bounds. For the Gaussian mechanism, the RDP cost at order $\alpha$ is $\alpha/(2\sigma^2)$, which gives a closed-form expression that can be optimized efficiently. Opacus uses RDP because it produces **substantially tighter privacy bounds** for DP-SGD training — often 5-10x tighter than advanced composition for typical training runs (thousands of steps). This means that a model can be reported as having a much smaller $\varepsilon$ (stronger privacy) than basic or advanced composition would suggest, without changing any noise or training parameters.
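
A toy RDP accountant for the Gaussian mechanism makes the "compose linearly, then convert" recipe concrete ($\sigma = 50$ is an arbitrary choice, and this sketch ignores the subsampling amplification that Opacus also exploits):

```python
import math

def rdp_gaussian(alpha, sigma):
    """RDP cost of one sensitivity-1 Gaussian mechanism at order alpha."""
    return alpha / (2 * sigma ** 2)

def rdp_to_dp(rdp_eps_fn, k, delta, alphas=range(2, 256)):
    """Compose k steps linearly in RDP, then convert to (ε, δ)-DP via
    ε = min over α of [k·ε_RDP(α) + ln(1/δ)/(α - 1)]."""
    return min(k * rdp_eps_fn(a) + math.log(1 / delta) / (a - 1) for a in alphas)

eps = rdp_to_dp(lambda a: rdp_gaussian(a, sigma=50.0), k=1_000, delta=1e-5)
print(f"(ε={eps:.2f}, δ=1e-5)")  # ε ≈ 3.24 for these parameters
```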

Question 10

Describe the FedAvg algorithm. What are its four main steps in each round?

**Answer:** FedAvg (Federated Averaging) is the foundational federated learning algorithm. Each round consists of four steps: **(1) Client selection:** The server selects a subset of $K$ clients to participate in this round. **(2) Model distribution:** The server sends the current global model parameters $\theta^{(t)}$ to each selected client. **(3) Local training:** Each client $k$ trains the model on its local data for $E$ local epochs, producing updated parameters $\theta_k^{(t+1)}$. **(4) Aggregation:** The server computes the weighted average of client models: $\theta^{(t+1)} = \sum_{k=1}^{K} (n_k/n) \theta_k^{(t+1)}$, where $n_k$ is client $k$'s dataset size and $n = \sum n_k$. The key property is that raw data never leaves the client — only model parameters (or gradients) are communicated.
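
The four steps can be sketched as a single round (the client data and the `local_train` stub are placeholders for real datasets and training loops):

```python
import numpy as np

def fedavg_round(global_params, clients, local_train, rng, k=2):
    """One FedAvg round: (1) select clients, (2) distribute the model,
    (3) train locally, (4) aggregate by dataset-size-weighted average."""
    selected = rng.choice(len(clients), size=min(k, len(clients)), replace=False)
    updates, sizes = [], []
    for i in selected:
        local = local_train(global_params.copy(), clients[i])   # E local epochs
        updates.append(local)
        sizes.append(len(clients[i]))
    weights = np.array(sizes, dtype=float) / sum(sizes)
    return sum(w * u for w, u in zip(weights, updates))   # θ ← Σ (n_k/n) θ_k

# Placeholder client datasets and a trivial "training" stub:
rng = np.random.default_rng(0)
clients = [list(range(n)) for n in (50, 120, 30)]
new_global = fedavg_round(np.zeros(4), clients, lambda p, d: p + 0.1, rng)
```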

Question 11

What is the "non-IID problem" in federated learning, and why does it cause client drift?

**Answer:** In federated learning, each client's data is drawn from a different distribution — hospitals have different patient populations, users in different markets have different behaviors. This is the **non-IID (non-Independent and Identically Distributed) problem.** It causes **client drift** because each client's local training optimizes toward a different local optimum: a hospital with mostly elderly patients trains the model to be good at predicting elderly outcomes, while a hospital with mostly young patients trains toward young-patient outcomes. After multiple local epochs, these local models diverge substantially from each other. When the server averages them, the updates partially cancel out, slowing convergence. In extreme non-IID cases, the averaged model may oscillate between client-specific optima across rounds, never converging to a single good solution. Mitigations include reducing local epochs (more frequent communication), FedProx (proximal regularization penalizing divergence from the global model), and SCAFFOLD (control variates to correct drift direction).

Question 12

Federated learning provides "communication privacy" — the server does not see raw data. Why is this insufficient for formal privacy, and what additional technique is needed?

**Answer:** Communication privacy means the raw data never leaves the client, but the model updates (parameters or gradients) that are communicated can still leak information about individual training examples. Research has shown that model gradients can be "inverted" to reconstruct individual training examples (Zhu et al., 2019, "Deep Leakage from Gradients"), and model parameters can be probed with membership inference attacks to determine whether a specific individual was in the training data. Therefore, federated learning alone provides no formal privacy guarantee about what the server or other participants can infer. To achieve formal privacy, federated learning must be combined with **DP-SGD** (each client adds noise to its local updates before sending them) and optionally **secure aggregation** (the server sees only the noisy aggregate, not individual noisy updates). DP-SGD provides output privacy (the model cannot leak individual data); secure aggregation provides communication privacy (the server cannot see individual updates).

Question 13

What is CTGAN, and what specific challenges of tabular data does it address that standard GANs do not handle well?

**Answer:** **CTGAN** (Conditional Tabular GAN, Xu et al., 2019) is a generative adversarial network designed specifically for tabular data. It addresses three challenges: **(1) Mixed data types:** Tabular data contains both continuous columns (income, age) and categorical columns (gender, zip code). Standard GANs assume all inputs are continuous. CTGAN uses mode-specific normalization for continuous columns (fitting a Gaussian mixture, then normalizing within each mode) and conditional generation for categorical columns. **(2) Imbalanced categories:** Categorical columns often have highly imbalanced value distributions (e.g., 95% of records are "non-fraud"). Standard GANs generate samples from the learned distribution, under-representing minority categories. CTGAN uses training-by-sampling, conditioning on each category with equal probability to ensure minority categories are well-represented. **(3) Multi-modal continuous distributions:** Continuous features often have non-Gaussian, multi-modal distributions. CTGAN's mode-specific normalization captures these modes, producing synthetic values that match the real distribution's shape.

Question 14

A colleague says: "We generated synthetic data, so it's private — no real individuals are in the synthetic dataset." Why is this statement potentially false, and under what conditions is synthetic data genuinely private?

**Answer:** The statement conflates "synthetic" with "private." A generative model trained without differential privacy can **memorize** individual training records and reproduce them (or near-copies) in its output. If a synthetic record closely matches a real individual — especially an outlier — anyone with auxiliary knowledge could identify that person. The mere fact that the data was generated by a model does not prevent this. Synthetic data is genuinely private only in two cases: **(1)** The generative model was trained with **differential privacy** (e.g., DP-SGD with Opacus), in which case the DP guarantee transfers to the synthetic output by the post-processing theorem. **(2)** The synthetic data was generated from **differentially private marginal distributions** (adding DP noise to marginals, then sampling from them), as in the 2020 U.S. Census TopDown algorithm. Without one of these formal mechanisms, synthetic data provides *heuristic* privacy (it may be harder to re-identify than the original) but no *formal* guarantee.

Question 15

What is the TSTR (train-on-synthetic, test-on-real) evaluation methodology, and what does a TSTR/TRTR ratio close to 1.0 indicate?

**Answer:** **TSTR** evaluates synthetic data utility by training a predictive model on the synthetic data and evaluating it on held-out *real* data. **TRTR** (train-on-real, test-on-real) trains on real data and evaluates on the same real test set, serving as the upper bound on achievable performance. The **TSTR/TRTR ratio** measures what fraction of the real model's performance is preserved when training on synthetic data. A ratio close to **1.0** indicates that the synthetic data preserves the statistical patterns needed for predictive modeling — the model trained on synthetic data performs nearly as well as the model trained on real data. A ratio significantly below 1.0 indicates that the synthetic data is missing important patterns (correlations, interactions, tail behavior) that the model needs. This ratio is the most practically meaningful evaluation of synthetic data quality, because it directly answers the question: "can I replace real data with synthetic data for model development without losing model quality?"
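
A TSTR/TRTR sketch using a deliberately simple stand-in classifier (any model works; nearest-centroid and the Gaussian-blob "synthetic" data are just to keep the example self-contained):

```python
import numpy as np

def nearest_centroid_acc(train_X, train_y, test_X, test_y):
    """Stand-in classifier: predict the class of the nearest class centroid."""
    classes = np.unique(train_y)
    centroids = np.stack([train_X[train_y == c].mean(axis=0) for c in classes])
    dists = np.stack([np.linalg.norm(test_X - m, axis=1) for m in centroids])
    preds = classes[dists.argmin(axis=0)]
    return float((preds == test_y).mean())

def tstr_ratio(real_train, synth_train, real_test):
    """TSTR/TRTR: train on synthetic vs. real, evaluate both on real test data."""
    trtr = nearest_centroid_acc(*real_train, *real_test)
    tstr = nearest_centroid_acc(*synth_train, *real_test)
    return tstr / trtr

rng = np.random.default_rng(0)
def two_blobs(n):
    X = np.vstack([rng.normal(0, 1, (n, 2)), rng.normal(5, 1, (n, 2))])
    return X, np.repeat([0, 1], n)

real_train, real_test = two_blobs(200), two_blobs(200)
# A faithful "synthetic" set: real data plus small perturbations.
synth = (real_train[0] + rng.normal(0, 0.3, real_train[0].shape), real_train[1])
ratio = tstr_ratio(real_train, synth, real_test)
```

Because the stand-in synthetic data preserves the class structure, the ratio here lands near 1.0; a degenerate synthetic set would pull it toward 0.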

Question 16

Explain the privacy-utility tradeoff. Why is it fundamental and not merely a limitation of current algorithms?

**Answer:** The privacy-utility tradeoff states that stronger privacy guarantees (smaller $\varepsilon$) always come at the cost of reduced data utility (more noise, less accurate models, less informative statistics). This tradeoff is **fundamental** — it is a mathematical consequence, not an engineering limitation. Information-theoretically, if a mechanism reveals zero information about any individual (perfect privacy), it also reveals zero information about the population (zero utility). Useful analysis requires extracting some information from the data; privacy requires bounding how much information about any individual is extracted. These two objectives are inherently in tension. Concretely, the noise added by DP mechanisms must scale with sensitivity and inversely with $\varepsilon$; smaller $\varepsilon$ means more noise means less accurate results. No algorithm can circumvent this — it is the price of formal privacy. The practitioner's task is not to eliminate the tradeoff but to **navigate** it: choose $\varepsilon$ appropriate to the application, use tight privacy accounting (RDP, zCDP) to avoid wasting budget on loose bounds, and design computations that extract maximum utility from each unit of budget.

Question 17

In Case Study 2, Recall@20 degraded 29.5% at $\varepsilon = 1$ while accuracy degraded only 13.0%. Explain why ranking metrics are more sensitive to DP noise than classification metrics.

**Answer:** **Classification accuracy** measures whether the model gets the binary decision right — does the predicted class (click vs. no-click) match the true class? This requires only that the predicted probability is on the correct side of the 0.5 threshold. DP noise perturbs predicted probabilities, but as long as the perturbation does not cross the threshold, the classification is unchanged. **Recall@20** measures whether the correct items appear in the top 20 of a ranked list of thousands of items. This requires accurate *relative ordering* — item A must score higher than items B, C, D, etc. DP noise perturbs all scores by a similar magnitude, which has a small effect on the score of any single item but a large effect on the *ranking*, because items with similar true scores are constantly swapped by the noise. The denser the score distribution (many items with similar scores, as in recommendation), the more rankings are disrupted by a given noise level. This is why ranking metrics degrade disproportionately.
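
A small simulation makes the asymmetry concrete (the item count and noise scale are arbitrary choices, not the case-study parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, sigma = 2_000, 0.05

true_scores = rng.uniform(0, 1, n_items)
noisy_scores = true_scores + rng.normal(scale=sigma, size=n_items)

# Classification-style metric: how often does noise flip the 0.5 threshold?
flip_rate = float(np.mean((true_scores > 0.5) != (noisy_scores > 0.5)))

# Ranking-style metric: overlap of the true and noisy top-20 sets.
top_true = set(np.argsort(true_scores)[-20:])
top_noisy = set(np.argsort(noisy_scores)[-20:])
recall_at_20 = len(top_true & top_noisy) / 20

print(f"threshold flips: {flip_rate:.1%}, top-20 overlap: {recall_at_20:.1%}")
```

The same noise that flips only a few percent of threshold decisions scrambles a large fraction of the top-20 list, because the top items' true scores differ by far less than the noise scale.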

Question 18

The MediCore case study found that federated DP at $\varepsilon = 3$ outperformed per-site analysis without DP on CATE estimation. Explain this seemingly paradoxical result.

**Answer:** The per-site analysis estimates treatment effects at each hospital independently, then meta-analyzes the results. Small hospitals (n < 200) produce estimates with very high variance — the standard error of a treatment effect estimate scales as $1/\sqrt{n}$, so a hospital with 100 patients has a standard error 5x larger than one with 2,500 patients. This high-variance estimation at small sites degrades the overall CATE estimate. Federated learning pools data across all 3,600 patients, giving the treatment effect model access to the full statistical power of the combined dataset. The noise added by differential privacy ($\varepsilon = 3$) increases variance, but *less* than the variance already present in the per-site estimates from small hospitals. In other words, the **information gained from pooling exceeds the information lost to DP noise.** This is not paradoxical — it reflects the fundamental relationship between sample size and estimation variance. For small-to-moderate datasets, the benefit of pooling dominates the cost of privacy.
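
The variance argument can be checked with hypothetical numbers (the site sizes, outcome noise, and DP noise scale below are all assumptions, not the case-study values):

```python
import numpy as np

sigma_y = 10.0                                           # assumed outcome noise
site_sizes = np.array([100, 150, 200, 400, 750, 2_000])  # hypothetical sites

# Per-site meta-analysis: unweighted average of K independent site estimates.
site_vars = sigma_y**2 / site_sizes           # variance of each site's estimate
per_site_se = np.sqrt(site_vars.mean() / len(site_sizes))

# Pooled (federated) estimate on all patients, plus assumed DP noise.
dp_noise_std = 0.1
pooled_se = np.sqrt(sigma_y**2 / site_sizes.sum() + dp_noise_std**2)

print(f"per-site SE: {per_site_se:.3f}, pooled+DP SE: {pooled_se:.3f}")
```

Even after paying for the DP noise, the pooled standard error is smaller — the variance removed by pooling exceeds the variance the privacy mechanism adds.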

Question 19

What is secure aggregation, and what threat does it address that differential privacy does not?

**Answer:** **Secure aggregation** is a cryptographic protocol that allows a server to compute the weighted average of client model updates $\sum w_k \theta_k$ without learning any individual update $\theta_k$. It uses secret sharing: each client splits its update into shares distributed among other clients, and the server can reconstruct only the aggregate. Secure aggregation addresses the threat of a **malicious or curious server** — a server that honestly computes the aggregate but also inspects individual client updates to infer information about their local data. Differential privacy addresses a different threat: an adversary who sees the *output* (the trained model) and tries to infer information about the training data. DP does not protect against a server that sees individual client updates before aggregation. Conversely, secure aggregation does not protect against an adversary who analyzes the final model. The two techniques are complementary: secure aggregation provides **communication privacy** (individual updates are hidden), and DP provides **output privacy** (the final model does not leak individual data).
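
A toy version of the pairwise-masking idea behind secure aggregation (a real protocol derives the masks from shared secrets and handles client dropouts; this sketch only shows the cancellation):

```python
import numpy as np

def masked_updates(updates, rng):
    """Client i adds mask m_ij for each j > i and subtracts m_ji for each
    j < i; every mask appears once with each sign, so they cancel in the sum."""
    k = len(updates)
    masks = {(i, j): rng.normal(size=updates[0].shape)
             for i in range(k) for j in range(i + 1, k)}
    out = []
    for i, u in enumerate(updates):
        m = (sum(masks[(i, j)] for j in range(i + 1, k))
             - sum(masks[(j, i)] for j in range(i)))
        out.append(u + m)          # what the server actually receives
    return out

rng = np.random.default_rng(0)
updates = [rng.normal(size=5) for _ in range(4)]
received = masked_updates(updates, rng)
# The server sees only masked vectors, yet the sum equals the true sum:
assert np.allclose(sum(received), sum(updates))
```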

Question 20

You are designing a privacy-preserving ML pipeline. For each scenario, state which technique(s) you would use and justify your choice.

(a) A hospital wants to train a disease prediction model on its own data and publish the model weights for other hospitals to use.

(b) Ten hospitals want to collaboratively train a model, but none can share patient data.

(c) A data analytics team needs a dataset for exploratory analysis, but the real data is too sensitive to share.

(d) A tech company collects telemetry from user devices and wants to compute aggregate statistics (e.g., most popular features used) without collecting individual-level data.

**Answer:** **(a) DP-SGD.** The hospital trains the model with differential privacy (Opacus), ensuring the published model weights do not leak information about any individual patient. The DP guarantee applies to the model itself — anyone can analyze the weights without compromising patient privacy. **(b) Federated learning + DP-SGD + secure aggregation.** Federated learning keeps raw data at each hospital. DP-SGD ensures the model does not memorize individual records. Secure aggregation ensures the coordination server cannot inspect individual hospital updates. This is the MediCore case study approach. **(c) DP synthetic data.** Generate synthetic data using a generative model (CTGAN or TVAE) trained with DP-SGD. The synthetic data inherits the DP guarantee by the post-processing theorem and can be freely shared. Non-DP synthetic data would be insufficient because memorization could compromise privacy. **(d) Local differential privacy (LDP).** Each user device adds DP noise to its telemetry *before* sending it to the server. The server receives only noisy individual reports and computes aggregate statistics from them. This provides the strongest trust model — the server never sees true individual data. LDP is the approach used by Apple and Google for telemetry collection. The tradeoff is that LDP requires much more noise than centralized DP for the same accuracy, because the noise is added per-user rather than per-aggregate.