Chapter 32: Exercises

Exercises are graded by difficulty:

  - One star (*): Apply the technique from the chapter to a new dataset or scenario
  - Two stars (**): Extend the technique or combine it with a previous chapter's methods
  - Three stars (***): Derive a result, implement from scratch, or design a system component
  - Four stars (****): Research-level problems that connect to open questions in the field


Differential Privacy Fundamentals

Exercise 32.1 (*)

A hospital wants to release a count of patients diagnosed with a rare disease in 2025. The true count is 47 out of 12,000 total patients.

(a) Compute the L1 sensitivity of this counting query.

(b) Using the Laplace mechanism with $\varepsilon = 0.1$, what is the scale parameter of the noise? What is the expected absolute error?

(c) Repeat with $\varepsilon = 1.0$. How much more accurate is the answer?

(d) The hospital also wants to release the fraction of patients with the disease (47/12,000 = 0.00392). What is the L1 sensitivity of this fraction query? Which query — count or fraction — requires less noise for a given $\varepsilon$, and why?
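A quick numeric sanity check for parts (b) and (c), assuming the standard Laplace calibration $b = \Delta_1/\varepsilon$ and the fact that the expected absolute error of Laplace noise equals its scale $b$:

```python
# Laplace scale for a query with given L1 sensitivity.
# For X ~ Laplace(0, b), E|X| = b, so the expected absolute
# error of the release equals the scale parameter itself.
def laplace_scale(sensitivity, epsilon):
    return sensitivity / epsilon

for eps in (0.1, 1.0):
    b = laplace_scale(1.0, eps)  # counting query: sensitivity 1
    print(f"epsilon={eps}: scale b={b}, expected |error|={b}")
```

Note that a tenfold increase in $\varepsilon$ shrinks the expected error tenfold, which is the tradeoff part (c) asks you to interpret.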


Exercise 32.2 (*)

A streaming platform wants to release a histogram of daily viewing hours across five buckets: [0, 1), [1, 2), [2, 4), [4, 8), [8, ∞). Each user falls in exactly one bucket. The true histogram is [120,000, 85,000, 62,000, 28,000, 5,000].

(a) What is the L1 sensitivity of this histogram query?

(b) Apply the Laplace mechanism with $\varepsilon = 0.5$ to release the histogram. Write Python code that adds noise and ensures all noisy counts are non-negative (post-processing does not affect the DP guarantee — explain why).

(c) The marketing team wants the proportion histogram (normalize counts to sum to 1.0). They propose two approaches: (i) add noise to counts, then normalize; (ii) add noise to proportions directly. Which approach is correct for DP, and why is the other problematic?
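One possible starting sketch for part (b), using only the standard library. It assumes the add/remove neighboring convention for the sensitivity (part (a) asks you to justify that value), and clamps at zero as post-processing:

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of Laplace(0, scale).
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_histogram(counts, epsilon):
    # Each user contributes to exactly one bucket, so removing one
    # user changes exactly one count by 1 (L1 sensitivity 1 under
    # the add/remove convention).
    scale = 1.0 / epsilon
    noisy = [c + laplace_noise(scale) for c in counts]
    # Clamping is post-processing: it touches only the noisy output,
    # never the raw data, so the epsilon-DP guarantee is unchanged.
    return [max(0.0, c) for c in noisy]

true_hist = [120_000, 85_000, 62_000, 28_000, 5_000]
print(dp_histogram(true_hist, epsilon=0.5))
```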


Exercise 32.3 (*)

An analyst has a privacy budget of $\varepsilon_{\text{total}} = 2.0$ for a dataset of 50,000 customer records. They want to answer the following queries:

  1. Mean age (bounded in [18, 90])
  2. Mean annual income (bounded in [$0, $500,000])
  3. Count of customers in each of 5 income brackets

(a) Using basic composition, allocate the budget equally across the three queries. What $\varepsilon$ does each query receive?

(b) Compute the Laplace scale for each query under the allocation from (a).

(c) The analyst realizes that the income bracket histogram is the most important query (it will be published in a report). Propose a non-uniform budget allocation that gives the histogram twice the budget of each other query. What are the new $\varepsilon$ values?

(d) After answering all three queries, the analyst's manager asks for the median age. Can the analyst answer this query? Why or why not?
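The arithmetic for parts (a) and (b) can be sketched directly, assuming the usual sensitivity of a bounded mean over a dataset of fixed size $n$, namely $\Delta = (b - a)/n$:

```python
n = 50_000
eps_total = 2.0
eps_each = eps_total / 3          # basic composition: per-query budgets add up

# L1 sensitivities of the three queries
sens_age = (90 - 18) / n          # bounded mean: (upper - lower) / n
sens_income = (500_000 - 0) / n
sens_hist = 1                     # disjoint brackets, add/remove neighbors

scales = {q: s / eps_each for q, s in
          [("mean_age", sens_age), ("mean_income", sens_income), ("histogram", sens_hist)]}
for q, b in scales.items():
    print(f"{q}: Laplace scale = {b:.5f}")
```

Note how the income query's large range dominates the noise, which is worth keeping in mind for the non-uniform allocation in part (c).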


Exercise 32.4 (**)

Implement the Gaussian mechanism from scratch for a vector-valued query.

(a) Write a function gaussian_mechanism_vector(true_values, l2_sensitivity, epsilon, delta) that adds calibrated Gaussian noise to each component of a vector query.

(b) Apply it to release the mean of 5 features from a dataset of 10,000 records, where each feature is bounded in [0, 1]. What is the L2 sensitivity of this 5-dimensional mean query?

(c) Compare the noise magnitude to the Laplace mechanism applied to each component independently with the same total $\varepsilon$. Which mechanism adds less noise, and why?
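A minimal sketch of part (a), assuming the classic analytic calibration $\sigma = \Delta_2\sqrt{2\ln(1.25/\delta)}/\varepsilon$ (valid for $\varepsilon \le 1$):

```python
import math
import random

def gaussian_mechanism_vector(true_values, l2_sensitivity, epsilon, delta):
    # Classic Gaussian-mechanism calibration; valid for epsilon <= 1.
    sigma = l2_sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return [v + random.gauss(0.0, sigma) for v in true_values]

# 5-dimensional mean over n = 10,000 records with features in [0, 1]:
# one record moves each coordinate of the mean by at most 1/n, so
# the L2 sensitivity is at most sqrt(5)/n.
n = 10_000
l2_sens = math.sqrt(5) / n
release = gaussian_mechanism_vector([0.4, 0.5, 0.6, 0.3, 0.7], l2_sens,
                                    epsilon=1.0, delta=1e-5)
print(release)
```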


Exercise 32.5 (**)

The exponential mechanism selects an output from a discrete set with probability proportional to $\exp(\varepsilon \cdot u / (2\Delta u))$.

(a) A recommendation system must privately select the top content category from 8 categories. The true engagement scores are [4200, 3800, 3100, 2900, 2200, 1500, 800, 300]. Write Python code implementing the exponential mechanism with $\varepsilon = 1.0$ and sensitivity = 1.

(b) Run the mechanism 10,000 times and report the selection frequency for each category. Does the mechanism reliably select the best category?

(c) Now reduce the gap: the top two categories have scores [4200, 4150]. How does this affect the selection distribution? Explain why this makes sense from a privacy perspective.

(d) Prove that the exponential mechanism satisfies $\varepsilon$-differential privacy. (Hint: bound the ratio of selection probabilities for neighboring datasets.)
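A compact sketch for part (a). Scores are shifted by their maximum before exponentiating, a standard numerical-stability trick that does not change the sampling distribution:

```python
import math
import random

def exponential_mechanism(scores, epsilon, sensitivity=1.0):
    # P(select i) proportional to exp(epsilon * u_i / (2 * sensitivity)).
    m = max(scores)
    weights = [math.exp(epsilon * (u - m) / (2 * sensitivity)) for u in scores]
    return random.choices(range(len(scores)), weights=weights, k=1)[0]

scores = [4200, 3800, 3100, 2900, 2200, 1500, 800, 300]
picks = [exponential_mechanism(scores, epsilon=1.0) for _ in range(10_000)]
freq = [picks.count(i) / len(picks) for i in range(len(scores))]
print(freq)  # with a 400-point gap, category 0 dominates
```

Rerunning with the near-tied scores of part (c) should spread the selection mass across the top two categories, which is exactly the behavior the privacy argument requires.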


Exercise 32.6 (***)

Implement a complete privacy budget manager.

(a) Write a PrivacyBudgetManager class that tracks individual queries, supports both basic and advanced composition, and prevents over-spending.

(b) The class should support "reservations" — an analyst can reserve a portion of the budget for a future query, preventing other analysts from spending it.

(c) Add a method that, given the current spending, estimates the tightest $(\varepsilon, \delta)$-DP guarantee under RDP composition (optimize over $\alpha$).

(d) Add logging and audit trail functionality: every query is recorded with a timestamp, analyst ID, query description, and privacy cost. Design the audit report format.
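A minimal starting skeleton for part (a); the class and method names here are illustrative, it tracks only basic composition, and the reservation, RDP, and audit features of parts (b)-(d) are left to build on top:

```python
from dataclasses import dataclass, field

@dataclass
class QueryRecord:
    analyst: str
    description: str
    epsilon: float

@dataclass
class PrivacyBudgetManager:
    total_epsilon: float
    history: list = field(default_factory=list)

    @property
    def spent(self):
        # Basic composition: per-query epsilons simply add up.
        return sum(q.epsilon for q in self.history)

    @property
    def remaining(self):
        return self.total_epsilon - self.spent

    def charge(self, analyst, description, epsilon):
        # Refuse the query before running it if it would over-spend.
        if epsilon > self.remaining:
            raise ValueError(f"budget exceeded: requested {epsilon}, "
                             f"remaining {self.remaining}")
        self.history.append(QueryRecord(analyst, description, epsilon))

mgr = PrivacyBudgetManager(total_epsilon=2.0)
mgr.charge("alice", "mean age", 0.5)
print(mgr.remaining)
```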


DP-SGD and Opacus

Exercise 32.7 (*)

Explain why each of the following steps in DP-SGD is necessary. What privacy violation would occur if the step were skipped?

(a) Per-example gradient computation (instead of batch-average gradients)

(b) Gradient clipping to norm $C$

(c) Gaussian noise addition to the clipped gradient sum

(d) Poisson subsampling of training examples


Exercise 32.8 (*)

A data scientist trains a model with DP-SGD using the following parameters: $n = 100{,}000$ examples, batch size $B = 1{,}024$, $\sigma = 1.1$, $C = 1.0$, 20 epochs. The subsampling rate is $q = B/n = 0.01024$.

(a) How many gradient steps does training involve? (Total = $n/B \times \text{epochs}$, rounded.)

(b) Using basic composition (each step is $(\varepsilon_{\text{step}}, \delta_{\text{step}})$-DP with $\varepsilon_{\text{step}} = q\sqrt{2\ln(1.25/\delta_{\text{step}})}/\sigma$ at $\delta_{\text{step}} = 10^{-6}$; the clipping norm $C$ cancels because the noise standard deviation is $\sigma C$), estimate the total $\varepsilon$ under basic composition.

(c) Explain why the RDP accountant (as used by Opacus) produces a much tighter bound. You do not need to compute the RDP bound — just explain the conceptual reason.


Exercise 32.9 (**)

The StreamRec team discovers that their model uses BatchNorm layers, which are incompatible with Opacus.

(a) Explain why BatchNorm violates the assumptions of DP-SGD. (Hint: consider what information flows between examples in a batch during the BatchNorm forward pass.)

(b) Opacus's ModuleValidator.fix() replaces BatchNorm with GroupNorm. Explain why GroupNorm is compatible with DP-SGD.

(c) Write a function that takes a PyTorch model, replaces all BatchNorm1d and BatchNorm2d layers with GroupNorm (using num_groups=min(32, num_channels)), and returns the modified model. Test it on a model with 3 BatchNorm layers.

(d) After replacing BatchNorm with GroupNorm, the non-DP model's accuracy drops by 0.8%. The team is concerned. Is this expected? What can they do to recover the lost accuracy?


Exercise 32.10 (**)

Implement a gradient clipping norm profiler.

(a) Write a function that hooks into a PyTorch training loop, computes the L2 norm of each per-example gradient for 5 batches, and returns the distribution of gradient norms (mean, median, 25th, 75th, 95th, 99th percentiles).

(b) Based on the profiler output, propose a strategy for selecting the clipping norm $C$. Your strategy should be stated as a rule (e.g., "set $C$ to the Pth percentile of observed gradient norms") with a justification.

(c) Run the profiler on a 3-layer MLP trained on a synthetic binary classification dataset (10,000 examples, 20 features). Report the gradient norm distribution. What $C$ would you choose?


Exercise 32.11 (***)

Compare DP-SGD with and without learning rate warmup.

(a) Train a 3-layer MLP on a synthetic dataset with DP-SGD ($\varepsilon = 3$, $C = 1.0$) using a constant learning rate of 0.001.

(b) Repeat with a linear warmup schedule: learning rate increases from 0 to 0.001 over the first 10% of steps, then decays linearly to 0.

(c) Compare the final accuracy and training loss curves. Explain why warmup helps DP-SGD. (Hint: early in training, gradients are large and frequently clipped; the noise-to-signal ratio is high.)


Federated Learning

Exercise 32.12 (*)

Five hospitals participate in a federated learning study. Their dataset sizes are [1000, 500, 200, 100, 50]. In FedAvg, the server computes a weighted average of client models, with weights proportional to dataset size.

(a) Compute the weight for each hospital.

(b) Hospital 5 (n=50) has a patient population that is very different from the others (e.g., a specialized pediatric hospital). The global model performs poorly for their patients. Explain why this happens with dataset-size weighting.

(c) Propose an alternative weighting scheme that gives Hospital 5 more influence. What are the tradeoffs?
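The weighting in part (a) is a one-liner; a quick check makes part (b)'s problem concrete:

```python
sizes = [1000, 500, 200, 100, 50]
total = sum(sizes)                      # 1,850 patients across all hospitals
weights = [s / total for s in sizes]    # FedAvg: weight proportional to size
print([round(w, 4) for w in weights])   # hospital 5 gets under 3% of the vote
```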


Exercise 32.13 (**)

Implement a simulation comparing FedAvg with centralized training on non-IID data.

(a) Generate a synthetic binary classification dataset with 5,000 examples and 10 features. Partition it across 5 clients:

  - IID partition: random assignment, each client gets 1,000 examples
  - Non-IID partition: sort by label, then assign (client 1 gets mostly label 0, client 5 gets mostly label 1)

(b) Train a logistic regression model using centralized SGD (10 epochs) and FedAvg (10 rounds, 3 local epochs each). Compare test accuracy for both IID and non-IID partitions.

(c) Quantify the accuracy gap between centralized and federated training. How much worse is the non-IID case compared to the IID case?

(d) Add FedProx with $\mu = 0.01$. Does it reduce the gap?


Exercise 32.14 (**)

A fintech company wants to train a credit scoring model across 3 partner banks without sharing customer data.

(a) Design the federated learning architecture: what model is trained, what data stays on each bank's servers, what is communicated to the central server, and what security measures are needed?

(b) Bank A has 500,000 customers, Bank B has 50,000, and Bank C has 10,000. What challenges does this extreme data imbalance create for FedAvg? Propose solutions.

(c) The banks have different feature sets (Bank A has 80 features, Bank B has 60, Bank C has 45) with partial overlap. This is a vertical federated learning scenario. Explain why standard FedAvg does not work and describe an alternative approach.


Exercise 32.15 (***)

Implement secure aggregation for a simplified 3-client federated learning setup.

(a) Implement a basic additive secret sharing scheme: each client splits its model update into 3 shares such that the shares sum to the original update. Each client sends one share to each other client and one to the server. The server reconstructs the aggregate without seeing any individual update.

(b) Verify that the server can compute the correct weighted average from the shares.

(c) What happens if one of the 3 clients drops out mid-round? Design a protocol that handles client dropout gracefully (this is a simplified version of the Bonawitz et al. protocol).


Synthetic Data

Exercise 32.16 (*)

A healthcare organization has a dataset of 10,000 patient records with 5 columns: age (continuous), gender (binary), diagnosis (categorical, 8 values), length_of_stay (continuous), and readmitted_30_days (binary).

(a) Using the CTGAN approach (you may use the sdv library), generate 10,000 synthetic records.

(b) Evaluate the synthetic data on three dimensions:

  - Marginal distributions: compare the age distribution (KS test) and gender distribution (chi-squared test) between real and synthetic
  - Pairwise correlations: compare the correlation matrix
  - Utility: train a classifier to predict readmitted_30_days on real data and on synthetic data, then test both on held-out real data

(c) Report the TSTR (train-on-synthetic, test-on-real) F1 score relative to the TRTR (train-on-real, test-on-real) F1 score.


Exercise 32.17 (**)

A colleague claims: "Our synthetic data is private because it doesn't contain any real individuals." Construct a counterexample that demonstrates this claim is false.

(a) Create a dataset of 100 records with 3 features. Include one record that is a clear outlier (e.g., age=105, income=$2M).

(b) Train a CTGAN on this data (no DP) and generate 1,000 synthetic records.

(c) Search for synthetic records that are "close" to the outlier (within some distance threshold). If you find near-copies, the generative model has memorized the outlier — and anyone with knowledge that a 105-year-old millionaire was in the dataset could identify them.

(d) Repeat the experiment with DP-SGD training of the CTGAN (if using smartnoise-synth or manual DP training). Does the near-copy disappear?


Exercise 32.18 (**)

Implement and compare three synthetic data methods on a tabular dataset:

(a) Marginal sampling: estimate each column's marginal distribution independently and sample from it. This preserves marginals but destroys correlations.

(b) Gaussian copula: fit a Gaussian copula to the data and sample from it. This preserves pairwise correlations.

(c) CTGAN: train a conditional GAN on the data.

Compare all three on: (i) marginal fidelity (KS test per column), (ii) correlation fidelity (Frobenius norm of correlation matrix difference), and (iii) ML utility (TSTR F1 score). Which method wins on each dimension?
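A tiny demonstration of why part (a)'s marginal sampling fails on dimension (ii): resampling each column independently preserves the marginals exactly but drives the Pearson correlation toward zero.

```python
import random

def pearson(xs, ys):
    # Plain Pearson correlation coefficient, stdlib only.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)
x = [random.gauss(0, 1) for _ in range(2000)]
y = [xi + 0.3 * random.gauss(0, 1) for xi in x]   # strongly correlated pair

# "Marginal sampling": shuffle the y column independently of x.
y_indep = random.sample(y, len(y))

print(round(pearson(x, y), 3), round(pearson(x, y_indep), 3))
```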


Exercise 32.19 (***)

Design a DP synthetic data generation pipeline.

(a) Train a TVAE on a tabular dataset with Opacus-wrapped training (DP-SGD with $\varepsilon = 5$).

(b) Generate synthetic data from the DP-trained TVAE.

(c) By the post-processing theorem, the synthetic data inherits the $\varepsilon = 5$ guarantee. Verify this claim: state the post-processing theorem and explain why it applies here.

(d) Compare the synthetic data quality (marginal fidelity, correlation fidelity, ML utility) against non-DP TVAE. Quantify the additional cost of the DP guarantee.


Privacy-Utility Tradeoff

Exercise 32.20 (*)

A company operates in three markets with different privacy requirements:

Market          | Data sensitivity | Regulatory requirement     | Dataset size
EU (healthcare) | High             | GDPR, strong DP expected   | 50,000 records
US (e-commerce) | Moderate         | No specific DP regulation  | 2,000,000 records
Japan (finance) | High             | APPI, moderate DP expected | 200,000 records

(a) Recommend an $\varepsilon$ for each market. Justify each choice based on sensitivity, regulation, and dataset size.

(b) For the EU healthcare market, the model achieves 85% accuracy without DP. Using the empirical relationship from Case Study 2 (approximately 13% accuracy loss at $\varepsilon = 1$, 5% at $\varepsilon = 3$, 2% at $\varepsilon = 8$), estimate the expected accuracy at your recommended $\varepsilon$.

(c) The US e-commerce market has 40x more data than the EU healthcare market. Explain why the same $\varepsilon$ would result in less utility loss for the US market.


Exercise 32.21 (**)

Reproduce the privacy-utility tradeoff experiment from the progressive project.

(a) Generate a synthetic dataset with 50,000 training examples and 20 features. Train a 3-layer MLP at $\varepsilon \in \{0.5, 1, 2, 3, 5, 8, 15, \infty\}$ using Opacus.

(b) Plot the privacy-utility curve: $\varepsilon$ on the x-axis, test accuracy on the y-axis.

(c) Identify the "knee" of the curve — the $\varepsilon$ at which further tightening produces diminishing returns in privacy but accelerating loss in utility. Is this consistent with the StreamRec results in Case Study 2?

(d) Repeat the experiment with dataset sizes of 5,000, 50,000, and 500,000. How does the knee shift as dataset size increases?


Exercise 32.22 (**)

You are a data scientist at a health insurance company. The actuarial team wants to train a claims prediction model on 200,000 patient records. The legal team requires $\varepsilon \leq 3$.

(a) The model uses 50 features. You have a choice: use all 50 features with $\varepsilon = 3$, or select the top 20 features (using DP feature selection, consuming $\varepsilon = 0.5$) and train with $\varepsilon = 2.5$. Which approach is likely to produce a better model? Explain your reasoning. (Hint: DP-SGD noise scales with the number of parameters, and fewer features mean a smaller model.)

(b) Write Python code that uses the exponential mechanism to privately select the top-$k$ features by mutual information with the target.

(c) Implement the two approaches (all features at $\varepsilon = 3$ vs. selected features at $\varepsilon = 2.5$) and compare. Use a synthetic dataset if real data is unavailable.


System Design and Integration

Exercise 32.23 (**)

Design a privacy-preserving ML pipeline for a ride-sharing platform that operates in the EU.

(a) The platform trains a surge pricing model on trip data (pickup/dropoff locations, time, demand). Identify which fields are PII, which are quasi-identifiers, and which are non-sensitive.

(b) Propose a pipeline that combines: (i) DP aggregation for feature engineering (computing demand statistics per zone), (ii) DP-SGD for model training, and (iii) synthetic data for the analytics team's exploratory work. Allocate a total budget of $\varepsilon = 10$ across the three components.

(c) Draw a system architecture diagram (described in text) showing data flow, privacy boundaries, and where each DP mechanism is applied.


Exercise 32.24 (***)

Implement a complete federated DP-SGD training loop.

(a) Combine the FedAvgServer from Section 32.10 with the DPSGDTrainer from Section 32.7. Each client should clip and noise its local gradients before sending the update to the server.

(b) The privacy guarantee composes across both local DP-SGD steps and federated rounds. Using RDP composition, compute the total $\varepsilon$ for: 50 federated rounds, 5 local epochs per round, batch size 64, dataset size 1,000 per client, $\sigma = 1.0$, $C = 1.0$, 10 clients.

(c) Compare the quality of federated DP-SGD against: (i) centralized DP-SGD at the same total $\varepsilon$, and (ii) federated learning without DP. Which is better, and why?


Exercise 32.25 (***)

A national health agency wants to train a disease risk model across 50 hospitals. Each hospital has between 500 and 50,000 patient records (highly imbalanced). The model must satisfy $\varepsilon = 1$ differential privacy.

(a) Design the federated learning protocol. Address: client weighting strategy for imbalanced data, communication frequency, dropout handling, and secure aggregation.

(b) The smallest hospitals (n < 1,000) contribute very noisy local updates. Propose a scheme that allows these hospitals to participate without degrading the global model. (Hint: consider local differential privacy for small clients and centralized DP for the aggregation step.)

(c) The agency also wants to release a synthetic version of the dataset for public research. Design a pipeline that generates DP synthetic data using the federated model's learned representation (not the raw data). Explain the privacy accounting.


Exercise 32.26 (***)

The credit scoring anchor example faces a regulatory audit. The auditor asks:

(a) "How do you ensure that no individual applicant's data can be reconstructed from the model?" Write a 2-paragraph response explaining the DP guarantee in non-technical language.

(b) "What is the impact of your privacy measures on model accuracy for protected groups?" Design an experiment that measures accuracy, AUC, and calibration error for each demographic group at $\varepsilon \in \{1, 3, 8, \infty\}$. What would you do if DP disproportionately harms a protected group?

(c) "Can you prove that the model satisfies $\varepsilon = 3$ differential privacy?" Write the formal proof outline, referencing the privacy accounting chain: DP-SGD per step $\to$ subsampling amplification $\to$ RDP composition across steps $\to$ RDP-to-DP conversion.


Research-Level Problems

Exercise 32.27 (****)

The "privacy onion effect" (Carlini et al., 2022) shows that DP protects the most vulnerable (most memorizable) training examples at the cost of degrading utility for the average example. Investigate this phenomenon.

(a) Train a model without DP on a dataset containing 100 "canary" records — duplicated entries with a unique, identifiable pattern. Measure the model's memorization of these canaries using the membership inference attack success rate.

(b) Retrain with DP-SGD at $\varepsilon = 3$. Does memorization of the canaries decrease? By how much?

(c) After removing the 100 canaries and retraining without DP, identify the next 100 most memorized records. These are the second "layer of the onion." Do they become more memorized after the first layer's canaries are removed?

(d) Discuss the implications for DP: if protecting the most vulnerable examples causes the next layer to become the most vulnerable, is the DP guarantee misleading? Why or why not?


Exercise 32.28 (****)

Rényi differential privacy (RDP) composes linearly, and the conversion from RDP to $(\varepsilon, \delta)$-DP involves an optimization over the Rényi order $\alpha$.

(a) For the Gaussian mechanism with noise multiplier $\sigma = 1.0$, compute the RDP cost at orders $\alpha \in \{2, 5, 10, 20, 50, 100, 500\}$ for a single step.

(b) Compose 1,000 steps and convert to $(\varepsilon, \delta)$-DP with $\delta = 10^{-5}$. Report the optimal $\alpha$ and the resulting $\varepsilon$.

(c) Repeat for $\sigma \in \{0.5, 1.0, 2.0, 5.0\}$. How does the optimal $\alpha$ depend on $\sigma$ and the number of steps?

(d) Compare RDP to zero-concentrated DP (zCDP). For the same Gaussian mechanism, compute the zCDP parameter $\rho$ and convert to $(\varepsilon, \delta)$-DP. Is the zCDP bound tighter or looser than the optimized RDP bound?


Exercise 32.29 (****)

Federated learning with non-IID data is an active research area. Implement and compare three strategies for handling non-IID data:

(a) FedAvg with dataset-size weighting (baseline)

(b) FedProx with proximal parameter $\mu \in \{0.001, 0.01, 0.1, 1.0\}$

(c) SCAFFOLD with control variates

Use a 10-class classification task partitioned across 10 clients with a Dirichlet distribution ($\alpha \in \{0.1, 0.5, 1.0, 10.0\}$) controlling the degree of non-IID-ness. Report convergence speed (rounds to reach 80% accuracy) and final accuracy for each method at each $\alpha$.


Exercise 32.30 (****)

The U.S. Census Bureau used a mechanism called TopDown for the 2020 Census that combines differential privacy with a hierarchical constraint system (national totals must equal state totals, which must equal county totals, etc.).

(a) Read the TopDown algorithm description in Abowd (2018). Explain why simple Laplace noise on each geographic level would produce inconsistent statistics (e.g., state counts that do not sum to the national count).

(b) TopDown uses a post-processing step that projects noisy counts onto the constraint set. Explain why this post-processing preserves the DP guarantee (cite the post-processing theorem).

(c) The Census Bureau allocated $\varepsilon = 19.61$ across all published statistics. Critics argued this was too large (weak privacy); the Bureau argued it was necessary for data utility. Using the framework from Section 32.12, analyze this tradeoff. Consider: the Census publishes statistics at the block level (~8 million blocks, some with <10 people). What is the effective per-person privacy budget?

(d) Implement a simplified version of TopDown for a 3-level hierarchy (nation → 5 states → 25 counties). Add Laplace noise at each level, then project onto the constraint set using least-squares. Verify that the projection preserves the DP guarantee.
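A sketch of the projection step for part (d) on a single parent-children level. The least-squares projection onto the constraint "children sum to the parent" spreads the discrepancy equally across the children; it touches only noisy values, so by the post-processing theorem the DP guarantee is preserved:

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of Laplace(0, scale).
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def project_children(parent_noisy, children_noisy):
    # Least-squares projection onto {sum(children) == parent}:
    # distribute the discrepancy equally across the children.
    gap = parent_noisy - sum(children_noisy)
    k = len(children_noisy)
    return [c + gap / k for c in children_noisy]

random.seed(0)
true_states = [120, 80, 50, 30, 20]
scale = 1 / 0.5                                    # Laplace scale at epsilon = 0.5 per level
nation_noisy = sum(true_states) + laplace_noise(scale)
states_noisy = [s + laplace_noise(scale) for s in true_states]
states_consistent = project_children(nation_noisy, states_noisy)
print(abs(sum(states_consistent) - nation_noisy))  # consistency restored
```

Extending this to the full nation → states → counties hierarchy means applying the same projection at each level, top down.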