Chapter 12: Further Reading

Essential Sources

1. Diederik P. Kingma and Max Welling, "Auto-Encoding Variational Bayes" (ICLR, 2014)

The paper that introduced the variational autoencoder. Kingma and Welling show how to combine variational inference with neural network encoders and decoders, deriving the ELBO objective and the reparameterization trick that makes the entire framework trainable with standard backpropagation. The paper is concise (9 pages) and remarkably clear for such a foundational contribution.

Reading guidance: Section 2 derives the variational bound and introduces the reparameterization trick — read this carefully, as it is the mathematical core of the VAE. Section 2.3 explains the choice of recognition model (the encoder) and why a diagonal Gaussian is both sufficient for proof-of-concept and limiting in practice. Section 3 presents the specific architectural choices (MLP encoder and decoder) and the two datasets (MNIST and Frey Face). The experimental results are modest by modern standards, but the framework they introduced has spawned an enormous literature. For a more comprehensive treatment, see Kingma's PhD thesis, "Variational Inference and Deep Learning: A New Synthesis" (University of Amsterdam, 2017), which extends the original paper with importance-weighted autoencoders (IWAE), normalizing flows in the encoder, and connections to the wake-sleep algorithm.
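
The two ingredients singled out above — the reparameterization trick and the closed-form KL term for a diagonal Gaussian posterior against a standard normal prior — can be sketched in a few lines. This is a minimal illustration in plain Python (no deep learning framework); the function names are ours, not the paper's:

```python
import math
import random

def kl_diag_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over
    latent dimensions: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2)."""
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, I). The sample is a
    deterministic, differentiable function of (mu, log_var), so gradients
    flow through the encoder while the randomness lives only in eps."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]
```

With mu = 0 and log_var = 0 the KL term is exactly zero — the posterior coincides with the prior — and the negative ELBO is this KL plus the decoder's reconstruction loss.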

2. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative Adversarial Nets" (NeurIPS, 2014)

The paper that introduced generative adversarial networks. Goodfellow et al. frame generative modeling as a minimax game between a generator and discriminator, prove that the optimal discriminator computes a density ratio, and show that the generator objective (under the optimal discriminator) minimizes the Jensen-Shannon divergence. The theoretical analysis is elegant, and the experimental results — while visually dated — demonstrated that adversarial training could produce sharp samples without explicit density estimation.

Reading guidance: Section 1 motivates the adversarial framework with a memorable counterfeiter-police analogy. Section 4 contains the theoretical analysis: Proposition 1 (optimal discriminator) and Theorem 1 (convergence of the training algorithm) are worth working through in detail. The proof of Theorem 1 assumes infinite capacity and perfect optimization — Section 5 discusses the practical implications of these assumptions. The "non-saturating" generator loss $-\log D(G(z))$ (replacing $\log(1 - D(G(z)))$) is mentioned briefly in Section 3 but has become the default in practice. For the Wasserstein GAN extension, read Arjovsky, Chintala, and Bottou, "Wasserstein GAN" (ICML, 2017) and Gulrajani, Ahmed, Arjovsky, Dumoulin, and Courville, "Improved Training of Wasserstein GANs" (NeurIPS, 2017). For spectral normalization, see Miyato, Kataoka, Koyama, and Yoshida, "Spectral Normalization for Generative Adversarial Networks" (ICLR, 2018). A comprehensive survey of GAN variants and training techniques is Saxena and Cao, "Generative Adversarial Networks (GANs): Challenges, Solutions, and Future Directions" (ACM Computing Surveys, 2022).
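
The difference between the two generator losses is easy to see numerically. A minimal sketch of both, for a single fake-sample discriminator score $D(G(z))$ (function names are illustrative):

```python
import math

def generator_loss_saturating(d_g_z):
    """Original minimax generator loss log(1 - D(G(z))). Its slope
    -1/(1 - d) is tiny when D(G(z)) is near 0, i.e. early in training
    when the discriminator easily rejects fakes."""
    return math.log(1.0 - d_g_z)

def generator_loss_non_saturating(d_g_z):
    """Non-saturating alternative -log D(G(z)). Same fixed points, but
    its slope -1/d is large exactly where the generator is losing."""
    return -math.log(d_g_z)
```

At D(G(z)) = 0.01 the saturating loss is nearly flat (log 0.99 ≈ −0.01) while the non-saturating loss is ≈ 4.6 with a steep slope — which is why the latter became the practical default.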

3. Jonathan Ho, Ajay Jain, and Pieter Abbeel, "Denoising Diffusion Probabilistic Models" (NeurIPS, 2020)

The paper that made diffusion models practical. Ho et al. showed that a simple denoising objective — predict the noise added to a training image — produces a generative model competitive with (and soon surpassing) the best GANs. The key contributions are: (1) the simplified training objective (Equation 14), (2) the connection between noise prediction and score matching, and (3) the demonstration that diffusion models can produce high-quality images with stable training.

Reading guidance: Section 2 defines the forward and reverse processes. The critical insight is in Section 3.2, where the authors derive the simplified loss $\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2$ from the variational bound. Equation 14 is arguably the most important equation in generative modeling since the GAN objective — it turns a complex variational inference problem into a simple regression. Section 4 presents the experimental results on CIFAR-10 and LSUN, with FID scores competitive with the best GANs. For classifier-free guidance, which dramatically improved conditional generation, read Ho and Salimans, "Classifier-Free Diffusion Guidance" (NeurIPS Workshop, 2022). For faster sampling, see Song, Meng, and Ermon, "Denoising Diffusion Implicit Models" (ICLR, 2021), which introduces DDIM — a deterministic sampler that reduces the number of steps from 1000 to 50 without retraining. For the score-based unification, see Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole, "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR, 2021).
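
One Monte Carlo term of the simplified objective follows directly from the closed-form forward process $\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}$. In this plain-Python sketch, `eps_theta` stands in for the trained noise-prediction network and the names are ours:

```python
import math
import random

def ddpm_simple_loss_term(x0, alpha_bar_t, eps_theta, rng=random):
    """One sample of the DDPM simplified loss: corrupt x0 to the noise
    level given by alpha_bar_t, then regress the model's noise prediction
    onto the true noise eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
          for x, e in zip(x0, eps)]
    pred = eps_theta(xt)  # a real network would also receive the timestep t
    return sum((e - p) ** 2 for e, p in zip(eps, pred))
```

Training averages this term over random images, timesteps, and noise draws — a plain regression loop, which is the source of the stability the paper emphasizes.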

4. Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le, "Flow Matching for Generative Modeling" (ICLR, 2023)

This paper introduces flow matching as a simpler alternative to both normalizing flows and diffusion models. The core idea is to learn a velocity field that transports a simple distribution (Gaussian noise) to the data distribution along straight interpolation paths. The training objective — predict the velocity along the interpolation path — is even simpler than the DDPM noise-prediction loss, and the resulting ODE can be solved with fewer integration steps.

Reading guidance: Section 3 presents the conditional flow matching objective (Theorem 1), which is the key contribution. The paper shows that conditioning on individual data points (learning the velocity from noise to a specific data point) produces the same marginal velocity field as the intractable unconditional objective. This is analogous to how the DDPM simplified loss works: a per-sample regression loss approximates the global distribution-matching objective. Section 4 compares flow matching to continuous normalizing flows (CNFs) and shows that flow matching trains faster and produces straighter paths. For the connection to optimal transport, see Tong, Malkin, Fatras, Atanackovic, Zhang, Bengio, and Wolf, "Improving and Generalizing Flow-Based Generative Models with Minibatch Optimal Transport" (ICML, 2024), which introduces OT-CFM — coupling noise and data via the optimal transport assignment rather than random coupling.
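
The per-sample regression can be sketched for the straight-line special case: sample $t \sim U(0,1)$, interpolate $x_t = (1-t)\,x_0 + t\,x_1$, and regress the model onto the path's constant velocity $x_1 - x_0$. Here `v_theta` is a stand-in for the learned velocity network (names are ours; the paper's Gaussian conditional paths add a small $\sigma_{\min}$ that this sketch omits):

```python
import random

def cfm_loss_term(x0, x1, v_theta, rng=random):
    """One sample of the conditional flow matching loss for straight
    interpolation paths between a noise sample x0 and a data point x1."""
    t = rng.random()
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]  # velocity of the straight path
    pred = v_theta(xt, t)
    return sum((u - p) ** 2 for u, p in zip(target, pred))
```

An oracle that always returns $x_1 - x_0$ drives this loss to zero, which is exactly the sense in which the conditional objective is a simple regression.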

5. Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni, "Modeling Tabular Data using Conditional GAN" (NeurIPS, 2019)

While the core papers above focus on continuous data (images), the MediCore case study uses tabular data. This line of work adapts GANs for tabular data with mixed types: continuous features are modeled with mode-specific normalization (representing each column as a mixture of Gaussians), categorical features are modeled with Gumbel-Softmax or straight-through estimators, and the training procedure is modified to handle the heterogeneity. The resulting CTGAN framework has become a standard baseline for synthetic tabular data generation.

Reading guidance: Focus on the mode-specific normalization for continuous columns (Section 3.1), which is the key technical contribution for tabular data. The paper demonstrates that standard GAN techniques (WGAN-GP, PacGAN) do not work well for tabular data without these modifications. For the broader landscape of synthetic data generation, see Bowen and Snoke, "Comparative Study of Differentially Private Synthetic Data Algorithms from the NIST PUMS Challenge" (Journal of Privacy and Confidentiality, 2021), which compares multiple synthetic data methods including VAEs, GANs, and Bayesian networks on real Census data with formal privacy guarantees. For the specific application to EHR data, see Choi et al., "Generating Multi-label Discrete Patient Records using Generative Adversarial Networks" (Machine Learning for Healthcare, 2017).
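
The mode-specific encoding can be sketched as follows. This is a simplification, not CTGAN's implementation: CTGAN fits the mixture with a variational Gaussian mixture model and samples the responsible mode from the posterior, whereas this sketch just picks the nearest mode mean. The function name is ours:

```python
def mode_specific_normalize(value, modes):
    """Encode a continuous value as (alpha, beta): alpha is the value
    normalized by its mode's mean and std (scaled by 4 standard
    deviations, in the spirit of CTGAN), beta is a one-hot indicator of
    the chosen mode. `modes` is a list of (mean, std) pairs for the
    fitted Gaussian mixture."""
    k = min(range(len(modes)), key=lambda i: abs(value - modes[i][0]))
    mean, std = modes[k]
    alpha = (value - mean) / (4.0 * std)
    beta = [1.0 if i == k else 0.0 for i in range(len(modes))]
    return alpha, beta
```

The generator then produces an (alpha, beta) pair per continuous column and is decoded back to a raw value, which is what lets a single GAN handle multimodal continuous columns alongside categorical ones.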