Chapter 32: Further Reading

Essential Sources

1. Cynthia Dwork and Aaron Roth, "The Algorithmic Foundations of Differential Privacy" (Foundations and Trends in Theoretical Computer Science, 2014)

The definitive reference on differential privacy theory. Dwork and Roth develop the subject from first principles: the formal definition of DP, the Laplace mechanism, the Gaussian mechanism, the exponential mechanism, composition theorems (basic and advanced), and applications to query release, histogram publication, and private learning. The monograph is mathematically rigorous but accessible to readers with an undergraduate background in probability and algorithms.

Reading guidance: Chapters 1-3 cover the core definitions, mechanisms, and composition theorems that correspond to Sections 32.2-32.6 of this chapter. Chapter 1's motivating examples (the linkage attack on the Massachusetts hospital discharge dataset, the Netflix Prize de-anonymization) provide compelling context for why DP exists. Chapter 3's treatment of the exponential mechanism is the most thorough account in print — it proves the privacy guarantee and provides utility bounds that Section 32.5 only summarizes. Chapter 11 covers the connection between DP and learning theory, establishing that any concept class learnable in the statistical query model is learnable with differential privacy — a theoretical foundation for DP-SGD. For a shorter introduction, Dwork's survey "Differential Privacy: A Survey of Results" (TAMC, 2008) covers the key ideas in 30 pages. For the connection to information theory, see Bun and Steinke, "Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds" (TCC, 2016), which introduces zCDP (Section 32.9).
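The Laplace mechanism covered in Chapter 3 fits in a few lines; a minimal Python sketch for a sensitivity-1 count query (function and variable names are illustrative, not from the monograph):

```python
import math
import random

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release true_value plus Laplace noise with scale b = sensitivity / epsilon."""
    b = sensitivity / epsilon
    # Inverse-transform sampling of Laplace(0, b) from a uniform on (-1/2, 1/2).
    u = random.random() - 0.5
    noise = -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise

# A counting query has sensitivity 1: adding or removing one person's record
# changes the count by at most 1, so epsilon = 0.5 gives Laplace(0, 2) noise.
noisy_count = laplace_mechanism(true_value=1000.0, sensitivity=1.0, epsilon=0.5)
```

Smaller $\varepsilon$ means a larger scale $b$ and noisier answers; the release is unbiased, but repeating the same query to average away the noise spends additional budget under the composition theorems.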

2. Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang, "Deep Learning with Differential Privacy" (CCS, 2016)

The paper that introduced DP-SGD — the algorithm that makes differential privacy practical for deep learning. Abadi et al. define the three modifications to SGD (per-example gradient computation, gradient clipping, Gaussian noise addition), introduce the moments accountant (a precursor to RDP accounting) for tight privacy composition, and demonstrate that differentially private training of deep networks is feasible on MNIST and CIFAR-10 with modest accuracy loss. This paper is the direct theoretical foundation for Opacus.

Reading guidance: Section 3 defines DP-SGD with formal pseudocode — compare directly to the DPSGDTrainer class in Section 32.7. Section 4 introduces the moments accountant, which tracks the privacy loss random variable's moment-generating function rather than bounding it with worst-case inequalities. This is the key innovation: the moments accountant provides bounds that are 10-100x tighter than advanced composition for typical training configurations. The experimental results (Section 5) on MNIST and CIFAR-10 are now dated — modern results with Opacus and better hyperparameter tuning achieve significantly better privacy-accuracy tradeoffs — but the methodology (training at multiple $\varepsilon$, plotting the privacy-utility curve, reporting wall-clock overhead) remains the standard experimental template. For the RDP refinement of the moments accountant, see Mironov, "Rényi Differential Privacy" (CSF, 2017). For Opacus-specific implementation details and optimizations, see Yousefpour et al., "Opacus: User-Friendly Differential Privacy Library in PyTorch" (arXiv, 2021).
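The three modifications to SGD can be sketched without any framework; a hypothetical pure-Python version of one DP-SGD step (gradients as flat lists of floats, names illustrative — this is a sketch of the algorithm, not the Opacus implementation):

```python
import math
import random

def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr):
    """One DP-SGD update: clip each example's gradient to L2 norm <= clip_norm,
    sum, add Gaussian noise with std noise_multiplier * clip_norm, average, step."""
    d = len(params)
    summed = [0.0] * d
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0  # per-example clipping
        for i in range(d):
            summed[i] += grad[i] * scale
    sigma = noise_multiplier * clip_norm  # noise calibrated to the clipping bound
    batch_size = len(per_example_grads)
    noisy_mean = [(summed[i] + random.gauss(0.0, sigma)) / batch_size for i in range(d)]
    return [params[i] - lr * noisy_mean[i] for i in range(d)]
```

With noise_multiplier = 0 and a clip norm larger than every gradient this reduces to ordinary minibatch SGD; the accountant (moments accountant or RDP) then converts noise_multiplier, sampling rate, and step count into an $(\varepsilon, \delta)$ guarantee.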

3. H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas, "Communication-Efficient Learning of Deep Networks from Decentralized Data" (AISTATS, 2017)

The paper that introduced Federated Averaging (FedAvg) — the foundational algorithm for federated learning. McMahan et al. demonstrate that training deep networks across decentralized clients with limited communication is not only feasible but can approach the quality of centralized training, even when client data is highly non-IID. The paper introduces the IID vs. non-IID data partition taxonomy that has become standard in federated learning research and establishes the key hyperparameters: number of local epochs $E$, client fraction $C$, and batch size $B$.

Reading guidance: Section 2 defines the FedAvg algorithm with pseudocode that maps directly to the FedAvgServer class in Section 32.10. Section 3's non-IID experiments are particularly valuable: the authors partition MNIST by digit label (each client has at most 2 digits) and show that FedAvg still converges, albeit more slowly than the IID case. This motivates the non-IID challenges discussed in Section 32.10 and the MediCore case study. Section 4's large-scale experiments on a next-word prediction task with real mobile keyboard data demonstrate federated learning at production scale. For the non-IID convergence theory, see Li et al., "On the Convergence of FedAvg on Non-IID Data" (ICLR, 2020) and the FedProx extension in Li et al., "Federated Optimization in Heterogeneous Networks" (MLSys, 2020). For secure aggregation, see Bonawitz et al., "Practical Secure Aggregation for Privacy-Preserving Machine Learning" (CCS, 2017). For a comprehensive survey of the federated learning landscape, see Kairouz et al., "Advances and Open Problems in Federated Learning" (Foundations and Trends in ML, 2021) — a 200-page survey covering privacy, communication efficiency, heterogeneity, and systems challenges.
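The server-side aggregation at the heart of FedAvg is just a data-size-weighted mean of client models; a minimal sketch (illustrative function name, models flattened to lists of floats):

```python
def fedavg_aggregate(client_weights, client_sizes):
    """FedAvg server step: average client models weighted by local dataset size n_k."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    global_weights = [0.0] * dim
    for weights, n_k in zip(client_weights, client_sizes):
        for i in range(dim):
            global_weights[i] += (n_k / total) * weights[i]
    return global_weights
```

Each round, the server samples a fraction $C$ of clients, each runs $E$ local epochs with batch size $B$, and the returned models are averaged as above; the weighting by $n_k$ is what keeps the objective an unbiased estimate of the global loss.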

4. Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni, "Modeling Tabular Data Using Conditional GAN" (NeurIPS, 2019)

The paper that introduced CTGAN — the conditional tabular GAN architecture designed to handle the specific challenges of tabular data synthesis. Xu et al. identify three problems that make standard GANs fail on tabular data (mixed continuous-categorical columns, imbalanced categorical distributions, multi-modal continuous distributions) and propose architectural solutions: mode-specific normalization for continuous columns, conditional generation with training-by-sampling for categorical columns, and a PacGAN discriminator to avoid mode collapse.

Reading guidance: Section 3.1 on mode-specific normalization is the key contribution: rather than fitting a single Gaussian to each continuous column (which fails for multi-modal distributions like income), CTGAN fits a variational Gaussian mixture and normalizes each value within its own mode. This produces dramatically better synthetic continuous distributions. Section 3.2's training-by-sampling procedure addresses categorical imbalance by conditioning the generator on categories sampled by log-frequency during training (so rare categories are seen far more often than their raw frequency would suggest), then sampling from the true categorical distribution at generation time. Section 4's experiments on 5 real-world datasets establish the TSTR evaluation methodology (Section 32.11) that has become the standard for synthetic data evaluation. For the VAE-based alternative, see the TVAE model in the same paper (Section 3.3), which often outperforms CTGAN on smaller datasets. For DP synthetic data specifically, see Torkzadehmahani et al., "DP-CGAN: Differentially Private Synthetic Data and Label Generation" (CVPR Workshop, 2019), which trains the GAN with DP-SGD, and Jordon et al., "PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees" (ICLR, 2019), which instead uses the PATE framework.
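Mode-specific normalization can be sketched given already-fitted mixture components; a hypothetical sketch in which the (weight, mu, sigma) tuples are assumed to come from a variational Gaussian mixture fitted elsewhere (the fitting step is omitted):

```python
import math

def mode_specific_normalize(x, modes):
    """Normalize a continuous value within its most responsible mixture mode.
    `modes` is a list of (weight, mu, sigma) tuples from a pre-fitted Gaussian
    mixture. Returns (alpha, one-hot mode indicator), where alpha is the value
    scaled by 4 standard deviations of the chosen mode and clipped to [-1, 1]."""
    def responsibility(w, mu, sigma):
        # Unnormalized posterior: mixture weight times Gaussian density at x.
        return w * math.exp(-0.5 * ((x - mu) / sigma) ** 2) / sigma
    k = max(range(len(modes)), key=lambda i: responsibility(*modes[i]))
    _, mu, sigma = modes[k]
    alpha = max(-1.0, min(1.0, (x - mu) / (4.0 * sigma)))
    return alpha, [1 if i == k else 0 for i in range(len(modes))]
```

The generator then learns to produce the scalar and the one-hot mode indicator jointly, so a bimodal income column is represented as two well-scaled unimodal problems instead of one badly-scaled bimodal one.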

5. Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova, "RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response" (CCS, 2014)

The paper that introduced Google's deployment of local differential privacy at scale. RAPPOR collects frequency estimates of categorical data (e.g., which homepage URL is set in Chrome) from millions of users, with each user's response individually randomized before leaving the device. This is the purest form of privacy: the server never sees any user's true answer.

Reading guidance: This paper complements the chapter's focus on centralized and federated DP by showing the local DP alternative, where noise is added on the user's device before data collection. Section 2 defines the RAPPOR mechanism: each user encodes their response as a bit vector, then flips each bit independently with probability $f/2$ to produce a memoized permanent randomization; on every report, a second, instantaneous randomization sends each bit as 1 with probability $q$ if the permanent bit is 1 and probability $p$ if it is 0. Section 3 shows how the server recovers approximate frequency estimates from millions of noisy responses using LASSO regression on the hash space. The key tradeoff is that local DP requires much more noise per user than centralized DP (because the guarantee applies to each individual response, not just the aggregate), and thus requires much larger populations to achieve useful accuracy. Apple's similar deployment for iOS and macOS telemetry (described in "Learning with Privacy at Scale," Apple Machine Learning Journal, 2017, and analyzed in Tang et al., "Privacy Loss in Apple's Implementation of Differential Privacy on MacOS 10.12," arXiv, 2017) provides a complementary industry perspective. For the formal comparison between local and centralized DP models, see Kasiviswanathan et al., "What Can We Learn Privately?" (FOCS, 2008).
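The two randomization stages can be sketched directly; a hypothetical pure-Python sketch over a bit vector (function names are illustrative; the parameters $f$, $p$, $q$ follow the paper's notation):

```python
import random

def permanent_randomize(true_bits, f):
    """Permanent randomized response: with probability f/2 force each bit to 1,
    with probability f/2 force it to 0, otherwise keep the true bit. In RAPPOR
    this result is memoized and reused for the lifetime of the client."""
    out = []
    for bit in true_bits:
        r = random.random()
        if r < f / 2:
            out.append(1)
        elif r < f:
            out.append(0)
        else:
            out.append(bit)
    return out

def instantaneous_randomize(perm_bits, p, q):
    """Instantaneous randomized response, re-drawn on every report: send 1 with
    probability q if the permanent bit is 1, with probability p if it is 0."""
    return [1 if random.random() < (q if bit == 1 else p) else 0 for bit in perm_bits]
```

Memoizing the permanent stage is what bounds the lifetime privacy loss: an adversary who observes many reports from the same user can at best recover the permanent bits, never the true ones.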