Chapter 12: Key Takeaways
- Generative models learn the data distribution, not just input-output mappings. Discriminative models learn $p(y \mid \mathbf{x})$ — enough for prediction, but insufficient for sampling, density evaluation, or understanding the structure of the data itself. Generative models learn $p(\mathbf{x})$, enabling synthetic data generation (Case Study 1: privacy-preserving EHR), anomaly detection (low-density inputs), data augmentation (oversampling rare classes), and stochastic simulation (Case Study 2: weather ensembles). The practical data scientist needs both: discriminative models for prediction, generative models for everything else.
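The anomaly-detection use of $p(\mathbf{x})$ can be made concrete with a deliberately tiny sketch: fit a density model to "normal" data and flag inputs that land in low-density regions. A 1-D Gaussian stands in for a real generative model here; the data and threshold logic are illustrative, not from the chapter's case studies.

```python
import math

# Toy sketch of density-based anomaly detection: fit a 1-D Gaussian p(x)
# to normal data, then compare log-densities of new inputs. In practice
# the density model would be a trained generative model, not a Gaussian.

def fit_gaussian(xs):
    """Maximum-likelihood mean and std of a 1-D Gaussian."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var)

def log_density(x, mu, sigma):
    """log p(x) under N(mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

data = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]   # illustrative "normal" data
mu, sigma = fit_gaussian(data)

# An in-distribution point has much higher log-density than an outlier.
print(log_density(5.0, mu, sigma) > log_density(9.0, mu, sigma))  # True
```

The same comparison is what a discriminative model cannot make: without $p(\mathbf{x})$ there is no notion of an input being unlikely.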
- The VAE ELBO is a lower bound on the log-likelihood, and understanding its two terms is the key to using VAEs effectively. The reconstruction term drives the decoder to produce faithful outputs; the KL regularizer shapes the latent space to match the prior. The tension between these terms creates the characteristic tradeoff: strong reconstruction ($\beta < 1$) produces good outputs but a tangled latent space; strong regularization ($\beta > 1$) produces a smooth, disentangled latent space but blurry outputs. The reparameterization trick — expressing a stochastic variable as a deterministic function of parameters plus external noise — is the technical insight that makes the entire framework trainable.
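The reparameterization trick and the KL term both have closed forms for a diagonal-Gaussian encoder, which a few lines can sketch. This assumes $q(z \mid \mathbf{x}) = \mathcal{N}(\mu, \sigma^2)$ with a standard-normal prior; the variable names (`mu`, `log_var`, `beta`) and values are illustrative.

```python
import math, random

# Sketch of the reparameterization trick and the analytic per-dimension
# KL term for a diagonal-Gaussian encoder with standard-normal prior.

def sample_z(mu, log_var, rng):
    """z = mu + sigma * eps: stochastic only through the external noise eps,
    deterministic in (mu, log_var), so gradients can flow to the encoder."""
    sigma = math.exp(0.5 * log_var)
    eps = rng.gauss(0.0, 1.0)          # external noise carries no parameters
    return mu + sigma * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) for one latent dimension."""
    return 0.5 * (math.exp(log_var) + mu**2 - 1.0 - log_var)

rng = random.Random(0)
mu, log_var, beta = 0.5, -1.0, 4.0     # beta > 1: stronger regularization
z = sample_z(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)

# A beta-VAE scales the regularizer: loss = reconstruction + beta * kl,
# which is exactly the tradeoff knob described above.
print(round(kl, 4), round(beta * kl, 4))
```

Note that the KL vanishes exactly when the encoder matches the prior ($\mu = 0$, $\log \sigma^2 = 0$), which is the posterior-collapse failure mode when $\beta$ is pushed too high.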
- GANs produce sharp outputs through adversarial training, but the minimax game is fundamentally harder to optimize than minimizing a single loss function. Mode collapse, training instability, and non-convergence are not bugs in the implementation — they are inherent properties of two-player games trained with gradient descent. Wasserstein distance and spectral normalization address the symptoms but not the cause. Use GANs when perceptual sharpness in a narrow domain justifies the engineering overhead; prefer stable alternatives (diffusion, flow matching) when mode coverage and training reliability matter more.
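The two sides of the minimax game can be written as plain functions of the discriminator's output probabilities. This sketch uses the standard non-saturating generator loss, a widely used practical variant; the function names are illustrative.

```python
import math

# The two objectives of the GAN minimax game, as functions of the
# discriminator's probability outputs (no networks, just the losses).

def d_loss(d_real, d_fake):
    """Discriminator: maximize log D(x) + log(1 - D(G(z))),
    written here as a quantity to minimize."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss_nonsaturating(d_fake):
    """Generator: minimize -log D(G(z)) instead of log(1 - D(G(z))),
    which keeps gradients alive early on when D rejects fakes easily."""
    return -math.log(d_fake)

# When D confidently rejects a fake (d_fake small), the non-saturating
# loss is large and steep rather than nearly flat.
print(g_loss_nonsaturating(0.01) > g_loss_nonsaturating(0.5))  # True
```

The instability described above comes from the fact that each player's loss depends on the other's current parameters: gradient steps on `d_loss` and `g_loss_nonsaturating` chase a moving target rather than descending one fixed surface.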
- Diffusion models achieve the best quality and diversity by converting generation into a denoising problem. The simplified DDPM objective — predict the noise that was added — is a regression loss with no adversarial dynamics, no posterior collapse, and no mode collapse. The mathematical connection to score matching ($\boldsymbol{\epsilon}_\theta$ is proportional to the negative score function) reveals why denoising works: the model learns the gradient of the log-density at every noise level, and sampling follows this gradient from noise to data. The cost is inference speed: 500-1000 denoising steps per sample.
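The simplified DDPM objective is compact enough to write down directly: the forward process has a closed form, and the training loss is plain MSE on the noise. The `alpha_bar` value below is an assumed mid-schedule noise level, not a real schedule.

```python
import math, random

# Sketch of the simplified DDPM training target:
#   x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps,
# and the network regresses eps from (x_t, t). No adversary, no game.

def noisy_sample(x0, alpha_bar, eps):
    """Forward process q(x_t | x_0) in closed form."""
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps

def ddpm_loss(eps_pred, eps):
    """Simplified objective: plain MSE between predicted and true noise."""
    return (eps_pred - eps) ** 2

rng = random.Random(0)
x0 = 1.0
alpha_bar = 0.5                       # assumed mid-schedule noise level
eps = rng.gauss(0.0, 1.0)
x_t = noisy_sample(x0, alpha_bar, eps)

# A perfect model predicts eps exactly, driving the regression loss to zero;
# by the score-matching connection, that same eps is -sqrt(1 - alpha_bar)
# times the score of the noised density at x_t.
print(ddpm_loss(eps, eps) == 0.0)     # True
```

The inference cost noted above comes from running the learned denoiser step by step backwards through the noise schedule, once per denoising step.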
- Flow matching is the simplest generative framework: learn the velocity that transports noise to data. By training on linear interpolation paths between noise and data, flow matching avoids both the noise schedule complexity of diffusion and the adversarial dynamics of GANs. Straighter transport paths enable fast sampling (10-50 ODE integration steps), and the continuous change-of-variables formula provides exact log-likelihoods. The field's progression from normalizing flows (complex invertible architectures) to diffusion (simple training, slow sampling) to flow matching (simple training, fast sampling) illustrates how deeper mathematical understanding leads to simpler methods, not more complex ones.
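The linear-path construction fits in a few lines. This sketch assumes the rectified-flow convention $x_t = (1 - t)\,x_0 + t\,x_1$ with $x_0$ noise and $x_1$ data, so the regression target for the velocity field is the constant $x_1 - x_0$; sampling is then plain Euler integration of the learned ODE.

```python
# Sketch of conditional flow matching with linear paths (rectified-flow
# convention assumed): x0 is noise, x1 is data, and the velocity target
# along the straight-line path is the constant v = x1 - x0.

def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 to data x1."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """d/dt of the linear path -- constant along the whole trajectory."""
    return x1 - x0

def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x += velocity_fn(x, i * dt) * dt
    return x

# With an exact (constant) velocity, even coarse Euler integration lands
# on the data point -- straight paths are why so few steps suffice.
x0, x1 = -2.0, 3.0
v = target_velocity(x0, x1)
print(euler_sample(x0, lambda x, t: v, steps=10))  # 3.0
```

A trained model's paths are only approximately straight, which is why practical samplers need tens of steps rather than one; curvier learned fields push the step count toward the upper end of the 10-50 range.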
- The choice between generative model families is an engineering decision, not a theoretical one. VAEs for structured latent spaces and real-time sampling. GANs for maximum sharpness in constrained domains. Diffusion for quality and diversity when inference speed is secondary. Flow matching for the best balance of simplicity, speed, and quality. No family dominates on all axes, and the right choice depends on the application's requirements — generation speed, mode coverage, training stability, density evaluation, and available compute.
- Evaluating generative models requires domain-specific metrics, not universal scores. FID captures quality and diversity for images but is meaningless for tabular EHR data or weather fields. Tabular data requires marginal distribution matching, correlation preservation, downstream utility, and privacy distance. Weather ensembles require ensemble spread calibration, rank histograms, and physical consistency checks. The temptation to reduce evaluation to a single number must be resisted — different metrics capture different failure modes, and a model that scores well on one can fail catastrophically on another.
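One of the tabular checks named above, marginal distribution matching, reduces to a per-column distance between real and synthetic samples. A common choice is the empirical 1-D Wasserstein-1 distance, which for equal-size samples is just the mean absolute difference of sorted values. The column names and numbers below are illustrative.

```python
# Sketch of per-column marginal matching via the empirical Wasserstein-1
# distance; for equal-size 1-D samples this is the mean absolute
# difference between sorted values. Columns and values are made up.

def wasserstein1(real, synth):
    """Empirical W1 between two equal-size 1-D samples."""
    assert len(real) == len(synth)
    return sum(abs(a - b) for a, b in zip(sorted(real), sorted(synth))) / len(real)

real = {"age": [34, 51, 29, 62, 45], "bmi": [22.0, 31.5, 27.3, 24.8, 29.1]}
synth = {"age": [36, 49, 30, 60, 47], "bmi": [23.1, 30.2, 26.8, 25.5, 28.4]}

# Report one distance per column rather than a single global score:
# different columns can fail in different ways, and averaging hides that.
for col in real:
    print(col, round(wasserstein1(real[col], synth[col]), 3))
```

Reporting the distances per column, instead of collapsing them into one number, is a small instance of the section's larger point: a single score would mask which marginal the generator got wrong.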