Chapter 32: Key Takeaways
- Anonymization by removing identifiers does not work — differential privacy provides the mathematical alternative. Re-identification research (Sweeney; Narayanan and Shmatikov; Rocher et al.) has demonstrated that high-dimensional data is inherently identifiable: the intersection of a few quasi-identifiers is almost always unique. Differential privacy provides a formal, mathematical guarantee: a computation satisfies $(\varepsilon, \delta)$-DP if its output distribution changes by at most a factor of $e^{\varepsilon}$ (plus $\delta$) when any single record is added or removed. This guarantee holds against any adversary with any auxiliary information — it is not a heuristic but a theorem. The Laplace mechanism (for numerical queries, pure DP), Gaussian mechanism (for numerical queries, approximate DP with tighter composition), and exponential mechanism (for discrete selection) are the three fundamental tools for achieving it.
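  The Laplace mechanism is short enough to sketch directly. The function below is illustrative (the name `laplace_mechanism` and its signature are not from any particular library): it adds Laplace noise with scale $\Delta f / \varepsilon$, which is exactly the calibration that yields pure $\varepsilon$-DP for a query with $L_1$ sensitivity $\Delta f$.

  ```python
  import numpy as np

  def laplace_mechanism(true_value, sensitivity, epsilon, rng):
      """Release true_value with (epsilon, 0)-DP by adding Laplace noise.

      The noise scale sensitivity / epsilon calibrates the mechanism to the
      query's L1 sensitivity: smaller epsilon means wider noise.
      """
      scale = sensitivity / epsilon
      return true_value + rng.laplace(loc=0.0, scale=scale)

  # Example: a counting query (sensitivity 1) released with epsilon = 0.5.
  rng = np.random.default_rng(42)
  noisy_count = laplace_mechanism(1000.0, 1.0, 0.5, rng)
  ```

  The release is unbiased — across many hypothetical runs the noise averages out — but any single release hides whether any one individual is in the data.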
- DP-SGD makes deep learning differentially private at the cost of accuracy, compute, and a new hyperparameter (the clipping norm). DP-SGD modifies standard gradient descent in three ways: per-example gradient computation (to isolate each individual's contribution), gradient clipping to norm $C$ (to bound sensitivity), and Gaussian noise addition proportional to $C$ (to mask individual contributions). Opacus provides a production-grade implementation for PyTorch, automatically calibrating noise to achieve a target $\varepsilon$ and tracking privacy expenditure via RDP accounting. The costs are real: 2-4x training time overhead, 2-30% accuracy degradation depending on $\varepsilon$, and the need to tune $C$ (too small clips useful signal; too large amplifies noise). Batch size is the single most impactful hyperparameter — larger batches improve the signal-to-noise ratio of the aggregated gradient.
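  The three modifications — per-example gradients, clipping to norm $C$, Gaussian noise proportional to $C$ — can be shown in a minimal NumPy sketch of a single aggregation step. This is an illustration of the algorithm's structure, not a substitute for Opacus, which handles per-example gradient computation inside PyTorch and the accounting automatically; the function name and arguments are hypothetical.

  ```python
  import numpy as np

  def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng):
      """One DP-SGD aggregation: clip each example's gradient, sum, add noise.

      per_example_grads: array of shape (batch_size, n_params).
      Noise std is noise_multiplier * clip_norm, sized to mask any single
      (clipped) example's contribution to the sum.
      """
      norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
      # Scale each gradient down so its L2 norm is at most clip_norm.
      factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
      clipped = per_example_grads * factors
      noisy_sum = clipped.sum(axis=0) + rng.normal(
          0.0, noise_multiplier * clip_norm, size=per_example_grads.shape[1]
      )
      return noisy_sum / len(per_example_grads)  # averaged noisy gradient
  ```

  The sketch makes the batch-size effect visible: the noise term is added once per batch, so averaging over a larger batch shrinks its relative contribution.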
- Privacy accounting determines whether a DP guarantee is useful or vacuous — RDP is 5-10x tighter than basic composition. A DP-SGD training run involves thousands of gradient steps, each consuming privacy budget. Under basic composition, budget grows linearly with steps ($k\varepsilon$); under advanced composition, as $\sqrt{k}$; under Rényi DP accounting, substantially tighter still. For a typical training run, basic composition might report $\varepsilon = 50$ (useless), while RDP reports $\varepsilon = 5$ (useful) for identical training — same noise, same model, just tighter analysis. Always use the tightest available accountant (Opacus defaults to RDP). The privacy budget is a finite resource: once spent, no more queries can be answered from the same dataset without a new budget allocation.
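  The gap between accountants is easy to demonstrate for the first two regimes. The snippet below compares basic composition ($k\varepsilon$) with the Dwork–Rothblum–Vadhan advanced composition bound, $\varepsilon' = \varepsilon\sqrt{2k\ln(1/\delta')} + k\varepsilon(e^{\varepsilon}-1)$; a full RDP accountant is tighter still but requires the moments machinery that Opacus implements.

  ```python
  import math

  def basic_composition(eps_step, k):
      """Total budget under basic composition: linear in the step count."""
      return k * eps_step

  def advanced_composition(eps_step, k, delta_slack):
      """Dwork-Rothblum-Vadhan advanced composition: sqrt(k) growth in the
      dominant term, at the price of an extra delta_slack in delta."""
      return (eps_step * math.sqrt(2 * k * math.log(1 / delta_slack))
              + k * eps_step * (math.exp(eps_step) - 1))

  # 10,000 steps at eps = 0.01 each: linear accounting reports eps = 100,
  # advanced composition reports roughly eps = 5.8 for the same mechanism.
  basic = basic_composition(0.01, 10_000)
  advanced = advanced_composition(0.01, 10_000, delta_slack=1e-5)
  ```

  Same mechanism, same noise — an order of magnitude difference in the reported guarantee, which is the whole argument for using the tightest accountant available.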
- Federated learning enables model training without centralizing data, but it does not inherently provide privacy — combining it with DP-SGD and secure aggregation does. FedAvg coordinates local training across decentralized clients, aggregating model updates at a central server. Raw data never leaves the client. However, model updates can leak information about local data (gradient inversion attacks), so federated learning alone is insufficient for formal privacy. DP-SGD at each client provides output privacy (the trained model does not memorize individuals); secure aggregation provides communication privacy (the server cannot inspect individual updates). The main technical challenge is non-IID data: heterogeneous client distributions cause client drift, which is addressed by FedProx, SCAFFOLD, or simply more frequent communication rounds.
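  The FedAvg server step itself is a single weighted average. The sketch below shows that aggregation rule in isolation — local training loops, secure aggregation, and per-client DP-SGD are deliberately omitted, and the function name is illustrative.

  ```python
  import numpy as np

  def fedavg(client_weights, client_sizes):
      """FedAvg server step: average client parameter vectors weighted by
      local dataset size, so larger clients contribute proportionally more."""
      total = sum(client_sizes)
      stacked = np.stack(client_weights)               # (n_clients, n_params)
      coeffs = np.array(client_sizes, dtype=float) / total
      return coeffs @ stacked                          # weighted average
  ```

  Note what the server sees here: each client's full update. That visibility is exactly what secure aggregation removes, by letting the server learn only the weighted sum.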
- Synthetic data is not inherently private — only synthetic data from a DP-trained generative model carries formal guarantees. A CTGAN or TVAE trained without differential privacy can memorize and reproduce individual training records, especially outliers. The fact that synthetic records do not directly correspond to real individuals provides no formal protection against re-identification. Synthetic data is formally private only when the generative model was trained with DP-SGD (the guarantee transfers via the post-processing theorem) or when the synthesis is based on differentially private marginal statistics. Evaluation requires three dimensions: statistical fidelity (do marginals and correlations match?), ML utility (TSTR/TRTR ratio close to 1.0?), and privacy (nearest-neighbor distance ratio close to 1.0, membership inference AUC close to 0.5?).
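  The nearest-neighbor distance ratio from the privacy dimension can be computed in a few lines. This is one common formulation — compare how close synthetic records sit to the training set versus how close genuinely unseen holdout records sit to it — and the function name is illustrative, not from a specific library.

  ```python
  import numpy as np

  def nn_distance_ratio(synthetic, train, holdout):
      """Memorization check for synthetic data.

      A ratio near 1.0 means synthetic records are no closer to the training
      set than fresh samples from the same distribution would be; a ratio
      well below 1.0 suggests the generator copied training records.
      """
      def mean_nn_dist(queries, reference):
          # Pairwise Euclidean distances, then nearest neighbor per query.
          d = np.linalg.norm(queries[:, None, :] - reference[None, :, :], axis=2)
          return d.min(axis=1).mean()
      return mean_nn_dist(synthetic, train) / mean_nn_dist(holdout, train)
  ```

  A generator that simply reproduces training rows drives the numerator toward zero, which is exactly the failure mode this metric is designed to flag.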
- The privacy-utility tradeoff is fundamental and irreducible — the practitioner's task is to navigate it, not eliminate it. Stronger privacy (smaller $\varepsilon$) always costs more utility (more noise, less accurate models). This is a mathematical consequence: information about the population necessarily contains some information about individuals. Ranking metrics (Recall@20, NDCG) are more sensitive to DP noise than classification metrics (accuracy) because noise disrupts relative ordering more than binary decisions. The choice of $\varepsilon$ depends on regulatory requirements, data sensitivity, dataset size, number of queries, and the adversary model. There is no universal "correct" $\varepsilon$, but guidelines exist: $\varepsilon < 1$ for highly sensitive medical data, $\varepsilon = 1\text{--}8$ for moderate-sensitivity user data, $\varepsilon = 3\text{--}10$ for regulated model training.
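  The tradeoff is visible directly in the noise calibration. For the classic Gaussian mechanism (the analysis that requires $\varepsilon < 1$), the noise standard deviation is $\sigma = \Delta_2 \sqrt{2\ln(1.25/\delta)}/\varepsilon$ — inversely proportional to $\varepsilon$, so halving the budget doubles the noise. A quick sketch, with an illustrative function name:

  ```python
  import math

  def gaussian_sigma(sensitivity, epsilon, delta):
      """Noise std for the classic Gaussian mechanism (analysis valid for
      epsilon < 1): sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon.
      The 1/epsilon dependence is the privacy-utility tradeoff in one line."""
      return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

  # Smaller epsilon (stronger privacy) forces a strictly larger noise scale.
  sigmas = {eps: gaussian_sigma(1.0, eps, 1e-5) for eps in (0.1, 0.5, 0.9)}
  ```

  There is no setting of the parameters that escapes this: any mechanism with meaningful $\varepsilon$ must inject noise commensurate with the sensitivity of the query.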
- Privacy-preserving data science is a software engineering investment, not a one-time analysis decision. Integrating DP-SGD requires pipeline changes (Opacus integration, gradient clipping tuning, batch size optimization), new infrastructure (privacy budget tracking, audit trails, privacy accounting dashboards), and updated testing (DP-aware behavioral test thresholds, segment-level quality analysis). Federated learning requires distributed coordination infrastructure. Synthetic data requires a separate evaluation pipeline. These investments are reusable across models and use cases — the StreamRec Opacus integration directly benefits future credit scoring DP training — but they require deliberate engineering effort. Organizations that treat privacy as infrastructure rather than an afterthought ship privacy-preserving systems faster and with fewer regulatory incidents.
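  The budget-tracking infrastructure mentioned above can start very small. The class below is a purely illustrative sketch (name and interface hypothetical): every release against a dataset must reserve $\varepsilon$ up front, and the ledger refuses releases once the budget is exhausted. A real deployment would also track $\delta$, persist state, and integrate with the accountant.

  ```python
  class PrivacyBudgetLedger:
      """Minimal sketch of per-dataset privacy budget tracking with an
      audit trail. Illustrative only -- production systems would persist
      state and track delta alongside epsilon."""

      def __init__(self, total_epsilon):
          self.total_epsilon = total_epsilon
          self.spent = 0.0
          self.log = []  # audit trail of (query_name, epsilon) entries

      def charge(self, query_name, epsilon):
          """Reserve budget for a release, or refuse if it would overspend."""
          if self.spent + epsilon > self.total_epsilon:
              raise RuntimeError(
                  f"Budget exhausted: {self.spent:.2f} of "
                  f"{self.total_epsilon:.2f} already spent"
              )
          self.spent += epsilon
          self.log.append((query_name, epsilon))

      @property
      def remaining(self):
          return self.total_epsilon - self.spent
  ```

  The point of the sketch is the enforcement pattern: the ledger, not the analyst, is the gatekeeper, which is what makes the budget a genuinely finite resource in practice.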