Chapter 12: Generative Models — VAEs, GANs, Diffusion Models, and the Frontier of Data Generation

DataField.Dev

25 min read

> — Richard Feynman, written on his blackboard at the time of his death (1988)

In This Chapter

Learning Objectives
12.1 Why Generative Models?
12.2 The Generative Modeling Problem
12.3 Variational Autoencoders: Derivation from First Principles
12.4 Generative Adversarial Networks
12.5 Diffusion Models: Learning to Denoise
12.6 Flow Matching and Normalizing Flows
12.7 Comparing Generative Model Families
12.8 Progressive Project: VAE for StreamRec Item Embeddings
12.9 Evaluation of Generative Models
12.10 Chapter Summary

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Chapter 12: Generative Models — VAEs, GANs, Diffusion Models, and the Frontier of Data Generation

"What I cannot create, I do not understand." — Richard Feynman, written on his blackboard at the time of his death (1988)

Learning Objectives

By the end of this chapter, you will be able to:

Derive the variational autoencoder (VAE) objective from the evidence lower bound (ELBO) and implement it in PyTorch
Explain the GAN training dynamic (minimax game) and common failure modes (mode collapse, training instability)
Derive the forward and reverse diffusion processes and implement a simple denoising diffusion probabilistic model (DDPM)
Compare generative model families on quality, diversity, controllability, and computational cost
Apply generative models to practical tasks: data augmentation, synthetic data generation, and anomaly detection

12.1 Why Generative Models?

Every model we have built so far in this book is discriminative. Given input $\mathbf{x}$, we predict output $y$. The MLP in Chapter 6 predicts click probability. The CNN in Chapter 8 classifies images. The transformer in Chapter 10 predicts the next token. These models learn $p(y \mid \mathbf{x})$ — the conditional distribution of outputs given inputs — and they are spectacularly good at it.

But discriminative models have a fundamental limitation: they do not understand the data itself. A click predictor that achieves 0.85 AUC cannot generate a realistic user profile. An image classifier that achieves 95% accuracy cannot create a new image. A next-token predictor cannot tell you whether a given sentence is probable or improbable in absolute terms — only which token is most likely next.

Generative models learn the data distribution itself: $p(\mathbf{x})$, or a joint distribution $p(\mathbf{x}, y)$. Once you have an estimate of $p(\mathbf{x})$, you can do things that discriminative models cannot:

Sample new data. Draw $\mathbf{x}_{\text{new}} \sim p(\mathbf{x})$ to create synthetic examples — images, molecules, patient records, weather scenarios — that are statistically indistinguishable from real data.
Evaluate density. Compute $p(\mathbf{x})$ for a given observation, enabling anomaly detection (low-density points are anomalies), out-of-distribution detection, and model criticism.
Learn representations. Generative models with latent variables learn compressed representations that capture the factors of variation in data — a continuous "concept space" that can be navigated and interpolated.
Augment data. Generate synthetic training examples to address class imbalance, privacy constraints, or data scarcity.

This chapter covers three families of generative models — variational autoencoders (VAEs), generative adversarial networks (GANs), and diffusion models — each with its own mathematical framework, strengths, and failure modes. We also introduce flow matching and normalizing flows as complementary approaches. The chapter concludes by comparing all families and establishing when each is the right tool.

Understanding Why: Generative models are not a curiosity or an artistic tool. They are the mathematical answer to the question: "What does the data look like?" Understanding their derivations — why the ELBO works, why the GAN game has an equilibrium, why denoising learns the score — gives you the ability to diagnose failures, choose the right model, and adapt these ideas to new problems. The math is the tool; the intuition is the product.

12.2 The Generative Modeling Problem

Setup and Notation

We observe a dataset $\mathcal{D} = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(N)}\}$ of $N$ i.i.d. samples from an unknown data distribution $p_{\text{data}}(\mathbf{x})$. The goal is to learn a model $p_\theta(\mathbf{x})$, parameterized by $\theta$, such that $p_\theta(\mathbf{x}) \approx p_{\text{data}}(\mathbf{x})$.

The maximum likelihood objective is:

$$\theta^* = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta(\mathbf{x}^{(i)})$$

This is the foundation for all generative models, though each family approximates or optimizes it differently.

The Challenge of High-Dimensional Data

For high-dimensional data (images, text, tabular records with hundreds of features), directly parameterizing $p_\theta(\mathbf{x})$ is intractable. Consider a $28 \times 28$ grayscale image: the space of all possible pixel configurations has $256^{784} \approx 10^{1888}$ elements. The data distribution — the tiny subset of this space that contains recognizable digits — lives on a low-dimensional manifold embedded in this vast space. Generative models must somehow discover and parameterize this manifold.

The three families we study each solve this problem differently:

VAEs introduce latent variables $\mathbf{z}$ and learn both an encoder $q_\phi(\mathbf{z} \mid \mathbf{x})$ (data → latent) and a decoder $p_\theta(\mathbf{x} \mid \mathbf{z})$ (latent → data).
GANs train a generator to transform noise into data, using a discriminator as a learned loss function.
Diffusion models define a fixed forward process that adds noise and learn the reverse process that removes it.

12.3 Variational Autoencoders: Derivation from First Principles

The Latent Variable Model

We posit that each observed data point $\mathbf{x}$ was generated by a two-step process:

A latent variable $\mathbf{z}$ is drawn from a prior: $\mathbf{z} \sim p(\mathbf{z})$ (typically $\mathcal{N}(\mathbf{0}, \mathbf{I})$).
The observation is drawn from a conditional distribution: $\mathbf{x} \sim p_\theta(\mathbf{x} \mid \mathbf{z})$.

The marginal likelihood of an observation is:

$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z}$$

This integral is intractable for any nontrivial decoder $p_\theta(\mathbf{x} \mid \mathbf{z})$ because it requires integrating over the entire latent space. We cannot evaluate $p_\theta(\mathbf{x})$, so we cannot directly maximize the log-likelihood $\log p_\theta(\mathbf{x})$.

Deriving the Evidence Lower Bound (ELBO)

The key idea is to introduce a variational distribution $q_\phi(\mathbf{z} \mid \mathbf{x})$ — the encoder — that approximates the intractable true posterior $p_\theta(\mathbf{z} \mid \mathbf{x})$. We then derive a tractable lower bound on the log-likelihood.

Step 1: Write the log-likelihood in terms of the variational distribution.

$$\log p_\theta(\mathbf{x}) = \log \int p_\theta(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}$$

Multiply and divide by $q_\phi(\mathbf{z} \mid \mathbf{x})$:

$$= \log \int \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})} q_\phi(\mathbf{z} \mid \mathbf{x}) \, d\mathbf{z}$$

$$= \log \, \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \left[ \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})} \right]$$

Step 2: Apply Jensen's inequality. Since $\log$ is concave:

$$\log \, \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \left[ \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})} \right] \geq \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \left[ \log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})} \right]$$

The right-hand side is the Evidence Lower Bound (ELBO):

$$\text{ELBO}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \left[ \log p_\theta(\mathbf{x}, \mathbf{z}) - \log q_\phi(\mathbf{z} \mid \mathbf{x}) \right]$$

Step 3: Decompose into reconstruction and KL terms. Using $p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z})$:

$$\text{ELBO} = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \left[ \log p_\theta(\mathbf{x} \mid \mathbf{z}) \right] - D_{\text{KL}}\left( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z}) \right)$$

This decomposition is the heart of the VAE:

Term	Name	Intuition
$\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} [\log p_\theta(\mathbf{x} \mid \mathbf{z})]$	Reconstruction loss	How well can the decoder reconstruct $\mathbf{x}$ from its encoding $\mathbf{z}$?
$-D_{\text{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \\| p(\mathbf{z}))$	KL regularizer	How close is the encoder's posterior to the prior?

Maximizing the ELBO simultaneously trains the decoder to reconstruct data and the encoder to produce latent codes that match the prior.

Mathematical Foundation: The gap between the log-likelihood and the ELBO is exactly the KL divergence between the approximate and true posteriors: $\log p_\theta(\mathbf{x}) - \text{ELBO} = D_{\text{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z} \mid \mathbf{x}))$. Since KL divergence is non-negative, the ELBO is indeed a lower bound. As the variational approximation improves, this gap shrinks and the ELBO tightens. This relationship is not merely a theoretical nicety — it tells us that improving the encoder (better approximate posterior) directly improves the bound on the log-likelihood.

The Reparameterization Trick

The ELBO contains an expectation over $q_\phi(\mathbf{z} \mid \mathbf{x})$, which depends on $\phi$. To optimize with respect to $\phi$ using gradient descent, we need $\nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}[\cdot]$. But we cannot backpropagate through a sampling operation: if $\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})$, the sample $\mathbf{z}$ is a stochastic node that blocks gradient flow.

The reparameterization trick solves this by expressing the random variable as a deterministic function of $\phi$ and an independent noise variable:

$$\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

where $\boldsymbol{\mu}_\phi(\mathbf{x})$ and $\boldsymbol{\sigma}_\phi(\mathbf{x})$ are the mean and standard deviation output by the encoder network. Now $\mathbf{z}$ is a differentiable function of $\phi$ (through $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$), and the randomness comes from $\boldsymbol{\epsilon}$, which does not depend on $\phi$.

This is why the trick works: gradients flow through the deterministic path $\phi \to \boldsymbol{\mu}, \boldsymbol{\sigma} \to \mathbf{z} \to \text{decoder} \to \text{loss}$, while the stochasticity is "externalized" to $\boldsymbol{\epsilon}$.

The KL Divergence in Closed Form

When both the encoder posterior and the prior are Gaussian, the KL term has a closed-form expression. For a single data point with $d$-dimensional latent space:

$$D_{\text{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z})) = -\frac{1}{2} \sum_{j=1}^{d} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)$$

This formula is derived from the general KL divergence between two multivariate Gaussians with diagonal covariance. In practice, the encoder outputs $\log \sigma_j^2$ rather than $\sigma_j$ to ensure numerical stability and unconstrained optimization.

Implementation: VAE in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from typing import Tuple, Dict, List
import numpy as np


class VAE(nn.Module):
    """Variational Autoencoder with Gaussian encoder and Bernoulli decoder.

    The encoder maps input x to parameters (mu, log_var) of a Gaussian
    posterior q(z|x). The decoder maps latent z to reconstruction
    probabilities p(x|z). The ELBO objective combines reconstruction
    loss (binary cross-entropy) with KL regularization.

    Args:
        input_dim: Dimensionality of input data.
        hidden_dim: Hidden layer size for encoder and decoder.
        latent_dim: Dimensionality of the latent space.
    """

    def __init__(
        self, input_dim: int = 784, hidden_dim: int = 512, latent_dim: int = 32
    ) -> None:
        super().__init__()
        self.latent_dim = latent_dim

        # Encoder: x -> h -> (mu, log_var)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_log_var = nn.Linear(hidden_dim, latent_dim)

        # Decoder: z -> h -> x_recon
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid(),  # Output probabilities for Bernoulli decoder
        )

    def encode(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Encode input to latent distribution parameters.

        Args:
            x: Input tensor of shape (batch_size, input_dim).

        Returns:
            Tuple of (mu, log_var), each of shape (batch_size, latent_dim).
        """
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_log_var(h)

    def reparameterize(
        self, mu: torch.Tensor, log_var: torch.Tensor
    ) -> torch.Tensor:
        """Apply the reparameterization trick.

        z = mu + sigma * epsilon, where epsilon ~ N(0, I).
        This allows gradients to flow through the sampling step.

        Args:
            mu: Mean of the posterior, shape (batch_size, latent_dim).
            log_var: Log variance of the posterior, shape (batch_size, latent_dim).

        Returns:
            Sampled latent vector z, shape (batch_size, latent_dim).
        """
        std = torch.exp(0.5 * log_var)  # sigma = exp(0.5 * log(sigma^2))
        eps = torch.randn_like(std)      # epsilon ~ N(0, I)
        return mu + std * eps

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        """Decode latent vector to reconstruction.

        Args:
            z: Latent vector of shape (batch_size, latent_dim).

        Returns:
            Reconstruction probabilities, shape (batch_size, input_dim).
        """
        return self.decoder(z)

    def forward(
        self, x: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """Full forward pass: encode, sample, decode.

        Args:
            x: Input tensor, shape (batch_size, input_dim).

        Returns:
            Tuple of (reconstruction, mu, log_var).
        """
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        x_recon = self.decode(z)
        return x_recon, mu, log_var


def vae_loss(
    x_recon: torch.Tensor,
    x: torch.Tensor,
    mu: torch.Tensor,
    log_var: torch.Tensor,
    beta: float = 1.0,
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Compute the VAE ELBO loss (negative ELBO).

    Loss = -ELBO = reconstruction_loss + beta * KL_divergence.

    For a Bernoulli decoder, reconstruction loss is binary cross-entropy.
    KL divergence between q(z|x) = N(mu, sigma^2 I) and p(z) = N(0, I)
    has the closed-form expression.

    Args:
        x_recon: Reconstructed output, shape (batch_size, input_dim).
        x: Original input, shape (batch_size, input_dim).
        mu: Encoder mean, shape (batch_size, latent_dim).
        log_var: Encoder log variance, shape (batch_size, latent_dim).
        beta: Weight on the KL term (beta=1 is standard VAE, beta>1 is beta-VAE).

    Returns:
        Tuple of (total_loss, reconstruction_loss, kl_loss), all scalar tensors.
    """
    # Reconstruction: sum over features, mean over batch
    recon_loss = F.binary_cross_entropy(x_recon, x, reduction="sum") / x.size(0)

    # KL divergence: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) / x.size(0)

    total_loss = recon_loss + beta * kl_loss
    return total_loss, recon_loss, kl_loss

Training the VAE

def train_vae(
    model: VAE,
    train_loader: DataLoader,
    val_loader: DataLoader,
    epochs: int = 50,
    lr: float = 1e-3,
    beta: float = 1.0,
) -> Dict[str, List[float]]:
    """Train a VAE by maximizing the ELBO.

    Args:
        model: VAE instance.
        train_loader: Training data loader (expects flattened inputs).
        val_loader: Validation data loader.
        epochs: Number of training epochs.
        lr: Learning rate.
        beta: KL weight (1.0 for standard VAE).

    Returns:
        Dictionary with training history (losses per epoch).
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    history: Dict[str, List[float]] = {
        "train_loss": [], "train_recon": [], "train_kl": [],
        "val_loss": [], "val_recon": [], "val_kl": [],
    }

    for epoch in range(epochs):
        # Training
        model.train()
        epoch_losses = {"loss": [], "recon": [], "kl": []}
        for (x_batch,) in train_loader:
            x_batch = x_batch.to(device)
            optimizer.zero_grad()
            x_recon, mu, log_var = model(x_batch)
            loss, recon, kl = vae_loss(x_recon, x_batch, mu, log_var, beta)
            loss.backward()
            optimizer.step()
            epoch_losses["loss"].append(loss.item())
            epoch_losses["recon"].append(recon.item())
            epoch_losses["kl"].append(kl.item())

        history["train_loss"].append(np.mean(epoch_losses["loss"]))
        history["train_recon"].append(np.mean(epoch_losses["recon"]))
        history["train_kl"].append(np.mean(epoch_losses["kl"]))

        # Validation
        model.eval()
        val_losses = {"loss": [], "recon": [], "kl": []}
        with torch.no_grad():
            for (x_batch,) in val_loader:
                x_batch = x_batch.to(device)
                x_recon, mu, log_var = model(x_batch)
                loss, recon, kl = vae_loss(x_recon, x_batch, mu, log_var, beta)
                val_losses["loss"].append(loss.item())
                val_losses["recon"].append(recon.item())
                val_losses["kl"].append(kl.item())

        history["val_loss"].append(np.mean(val_losses["loss"]))
        history["val_recon"].append(np.mean(val_losses["recon"]))
        history["val_kl"].append(np.mean(val_losses["kl"]))

        if (epoch + 1) % 10 == 0:
            print(
                f"Epoch {epoch+1:3d} | "
                f"Train ELBO: {-history['train_loss'][-1]:.1f} "
                f"(recon: {history['train_recon'][-1]:.1f}, "
                f"KL: {history['train_kl'][-1]:.1f}) | "
                f"Val ELBO: {-history['val_loss'][-1]:.1f}"
            )

    return history

The beta-VAE and the Reconstruction-Regularization Tradeoff

The standard VAE sets $\beta = 1$, giving equal weight to reconstruction and KL terms. In practice, the KL term often "wins" early in training — the encoder collapses to the prior ($q_\phi(\mathbf{z} \mid \mathbf{x}) \approx p(\mathbf{z})$ for all $\mathbf{x}$), and the decoder ignores $\mathbf{z}$ entirely. This is called posterior collapse or KL vanishing.

The $\beta$-VAE (Higgins et al., 2017) modifies the objective:

$$\mathcal{L}_\beta = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} [\log p_\theta(\mathbf{x} \mid \mathbf{z})] - \beta \, D_{\text{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))$$

$\beta < 1$: Weakens the regularizer. The encoder uses the latent space more aggressively, improving reconstruction at the cost of a less structured latent space.
$\beta > 1$: Strengthens the regularizer. The latent space is more disentangled (each latent dimension tends to capture an independent factor of variation), but reconstructions are blurrier.

A common practical strategy is KL annealing: start training with $\beta = 0$ (pure autoencoder), then linearly increase $\beta$ to 1 over the first $K$ epochs. This allows the decoder to learn a useful mapping before the regularizer forces the latent codes toward the prior.

Common Misconception: "VAEs produce blurry outputs because of the Gaussian assumption." This is partially true but misleading. The primary cause of blur is the reconstruction loss: mean squared error (or binary cross-entropy) averaged over pixels rewards the model for predicting the mean of the output distribution, which is inherently blurry for multimodal distributions. Replacing the pixel-wise loss with a perceptual loss or adversarial loss (as in VAE-GAN hybrids) dramatically sharpens outputs — the Gaussian latent space is not the bottleneck.

12.4 Generative Adversarial Networks

The Minimax Game

GANs take a radically different approach to generative modeling. Instead of maximizing a likelihood bound, they train two networks in opposition:

The generator $G_\theta: \mathbb{R}^k \to \mathbb{R}^d$ maps noise $\mathbf{z} \sim p(\mathbf{z})$ to synthetic data $G_\theta(\mathbf{z})$.
The discriminator $D_\phi: \mathbb{R}^d \to [0, 1]$ classifies inputs as real (from $p_{\text{data}}$) or fake (from $G_\theta$).

The training objective is a minimax game:

$$\min_\theta \max_\phi \; V(G_\theta, D_\phi) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}} [\log D_\phi(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})} [\log(1 - D_\phi(G_\theta(\mathbf{z})))]$$

The discriminator wants to maximize $V$: assign high probability to real data ($D_\phi(\mathbf{x}) \to 1$) and low probability to fake data ($D_\phi(G_\theta(\mathbf{z})) \to 0$). The generator wants to minimize $V$: produce outputs that fool the discriminator ($D_\phi(G_\theta(\mathbf{z})) \to 1$).

Theoretical Equilibrium

Goodfellow et al. (2014) proved that for a fixed generator $G_\theta$, the optimal discriminator is:

$$D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_G(\mathbf{x})}$$

Substituting $D^*$ back into the minimax objective yields:

$$V(G, D^*) = -\log 4 + 2 \, D_{\text{JS}}(p_{\text{data}} \| p_G)$$

where $D_{\text{JS}}$ is the Jensen-Shannon divergence. At equilibrium, $p_G = p_{\text{data}}$, $D^*(\mathbf{x}) = 1/2$ for all $\mathbf{x}$, and $V(G, D^*) = -\log 4$.

Mathematical Foundation: The connection between the GAN objective and the Jensen-Shannon divergence is elegant but fragile. It holds only when the discriminator is optimal — and in practice, we update the discriminator for only a few steps before updating the generator. The theoretical guarantee that $p_G \to p_{\text{data}}$ assumes infinite capacity, infinite data, and perfect optimization. None of these hold in practice, which is why GAN training is notoriously unstable.

Training Dynamics and Failure Modes

In practice, GAN training is dominated by three failure modes:

1. Mode Collapse. The generator learns to produce a small number of outputs that fool the discriminator, ignoring the diversity of the real data distribution. In the extreme case, $G_\theta(\mathbf{z})$ maps all noise vectors to the same output. The discriminator eventually catches on, the generator switches to a different mode, and the cycle repeats without converging.

2. Training Instability. The discriminator becomes too strong too quickly, providing near-zero gradients to the generator (since $\log(1 - D_\phi(G_\theta(\mathbf{z}))) \to \log(1) = 0$ when $D_\phi \to 0$ on fake data). Alternatively, the generator fools the discriminator completely, and the discriminator's gradients become uninformative.

3. Non-convergence. The minimax game may not converge at all. Unlike supervised learning, where gradient descent minimizes a single loss function, GAN training involves two players with opposing objectives. The dynamics are governed by a system of coupled differential equations, and convergence is guaranteed only under restrictive conditions that rarely hold.

Practical Stabilization: Non-Saturating Loss

Goodfellow et al. (2014) proposed replacing the generator loss $\log(1 - D_\phi(G_\theta(\mathbf{z})))$ with $-\log D_\phi(G_\theta(\mathbf{z}))$. This does not change the equilibrium but provides stronger gradients early in training when the generator is poor:

class Generator(nn.Module):
    """GAN generator: maps noise to data space.

    Args:
        noise_dim: Dimensionality of the noise input z.
        hidden_dim: Hidden layer size.
        output_dim: Dimensionality of the generated output.
    """

    def __init__(
        self, noise_dim: int = 64, hidden_dim: int = 256, output_dim: int = 784
    ) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, hidden_dim * 2),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim * 2, hidden_dim * 2),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim * 2, output_dim),
            nn.Tanh(),  # Output in [-1, 1]
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


class Discriminator(nn.Module):
    """GAN discriminator: classifies real vs. fake data.

    Uses spectral normalization on all linear layers for
    training stability (Miyato et al., 2018).

    Args:
        input_dim: Dimensionality of the input data.
        hidden_dim: Hidden layer size.
    """

    def __init__(self, input_dim: int = 784, hidden_dim: int = 256) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.utils.spectral_norm(nn.Linear(input_dim, hidden_dim * 2)),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.utils.spectral_norm(nn.Linear(hidden_dim * 2, hidden_dim)),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.utils.spectral_norm(nn.Linear(hidden_dim, 1)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def train_gan(
    generator: Generator,
    discriminator: Discriminator,
    train_loader: DataLoader,
    epochs: int = 100,
    lr_g: float = 2e-4,
    lr_d: float = 2e-4,
    noise_dim: int = 64,
    n_critic: int = 1,
) -> Dict[str, List[float]]:
    """Train a GAN using the non-saturating loss.

    The discriminator and generator are updated alternately.
    The discriminator is updated n_critic times per generator update.

    Args:
        generator: Generator network.
        discriminator: Discriminator network.
        train_loader: Training data loader.
        epochs: Number of epochs.
        lr_g: Generator learning rate.
        lr_d: Discriminator learning rate.
        noise_dim: Dimensionality of noise input.
        n_critic: Discriminator updates per generator update.

    Returns:
        Training history with generator and discriminator losses.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    generator = generator.to(device)
    discriminator = discriminator.to(device)

    opt_g = torch.optim.Adam(generator.parameters(), lr=lr_g, betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr_d, betas=(0.5, 0.999))

    history: Dict[str, List[float]] = {"g_loss": [], "d_loss": [], "d_real": [], "d_fake": []}

    for epoch in range(epochs):
        g_losses, d_losses = [], []
        d_reals, d_fakes = [], []

        for (real_data,) in train_loader:
            real_data = real_data.to(device)
            batch_size = real_data.size(0)

            # ---- Train Discriminator ----
            for _ in range(n_critic):
                z = torch.randn(batch_size, noise_dim, device=device)
                fake_data = generator(z).detach()

                d_real = discriminator(real_data)
                d_fake = discriminator(fake_data)

                # Binary cross-entropy with logits
                d_loss = (
                    F.binary_cross_entropy_with_logits(
                        d_real, torch.ones_like(d_real)
                    )
                    + F.binary_cross_entropy_with_logits(
                        d_fake, torch.zeros_like(d_fake)
                    )
                )

                opt_d.zero_grad()
                d_loss.backward()
                opt_d.step()

            # ---- Train Generator ----
            z = torch.randn(batch_size, noise_dim, device=device)
            fake_data = generator(z)
            d_fake_for_g = discriminator(fake_data)

            # Non-saturating loss: maximize log(D(G(z)))
            g_loss = F.binary_cross_entropy_with_logits(
                d_fake_for_g, torch.ones_like(d_fake_for_g)
            )

            opt_g.zero_grad()
            g_loss.backward()
            opt_g.step()

            g_losses.append(g_loss.item())
            d_losses.append(d_loss.item())
            d_reals.append(torch.sigmoid(d_real).mean().item())
            d_fakes.append(torch.sigmoid(d_fake).mean().item())

        history["g_loss"].append(np.mean(g_losses))
        history["d_loss"].append(np.mean(d_losses))
        history["d_real"].append(np.mean(d_reals))
        history["d_fake"].append(np.mean(d_fakes))

        if (epoch + 1) % 20 == 0:
            print(
                f"Epoch {epoch+1:3d} | "
                f"D loss: {history['d_loss'][-1]:.4f} | "
                f"G loss: {history['g_loss'][-1]:.4f} | "
                f"D(real): {history['d_real'][-1]:.3f} | "
                f"D(fake): {history['d_fake'][-1]:.3f}"
            )

    return history

Wasserstein GAN (WGAN) and the Earth Mover's Distance

The original GAN objective optimizes the Jensen-Shannon divergence, which can be flat (zero gradient) when the real and generated distributions have disjoint supports — a common situation in high-dimensional spaces. Arjovsky et al. (2017) proposed replacing JS with the Wasserstein-1 distance (earth mover's distance):

$$W(p_{\text{data}}, p_G) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_G)} \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \gamma} [\|\mathbf{x} - \mathbf{y}\|]$$

The Kantorovich-Rubinstein duality gives a tractable formulation:

$$W(p_{\text{data}}, p_G) = \sup_{\|f\|_L \leq 1} \left( \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}} [f(\mathbf{x})] - \mathbb{E}_{\mathbf{x} \sim p_G} [f(\mathbf{x})] \right)$$

where the supremum is over 1-Lipschitz functions $f$. The WGAN replaces the discriminator with a critic (no sigmoid, unbounded output) and enforces the Lipschitz constraint via weight clipping (WGAN) or gradient penalty (WGAN-GP):

$$\mathcal{L}_{\text{GP}} = \lambda \, \mathbb{E}_{\hat{\mathbf{x}}} \left[ (\|\nabla_{\hat{\mathbf{x}}} D(\hat{\mathbf{x}})\|_2 - 1)^2 \right]$$

where $\hat{\mathbf{x}}$ is uniformly interpolated between real and fake samples.

Research Insight: Spectral normalization (Miyato et al., 2018) offers a simpler and more effective approach to enforcing the Lipschitz constraint: divide each weight matrix by its spectral norm (largest singular value) after every update. This is computationally cheap (one power iteration step per layer per update) and empirically produces more stable training than gradient penalty. It has become the default stabilization technique in modern GAN architectures.

12.5 Diffusion Models: Learning to Denoise

Diffusion models have emerged as the dominant generative model family, surpassing GANs in image quality while offering training stability and mode coverage that GANs cannot match. The core idea is simple and beautiful: define a process that gradually adds noise to data until it becomes pure noise, then learn the reverse process that gradually removes noise to generate data.

The Forward Process

Given a data point $\mathbf{x}_0 \sim p_{\text{data}}$, the forward process produces a sequence of increasingly noisy versions $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \, \mathbf{x}_{t-1}, \beta_t \mathbf{I})$$

where $\beta_1, \beta_2, \ldots, \beta_T$ is a noise schedule (a sequence of small positive values, typically increasing from $\beta_1 \approx 10^{-4}$ to $\beta_T \approx 0.02$).

At each step, the data is slightly scaled down and Gaussian noise is added. After $T$ steps (typically $T = 1000$), $\mathbf{x}_T$ is approximately $\mathcal{N}(\mathbf{0}, \mathbf{I})$ — the signal is completely destroyed.

A critical property is that we can sample $\mathbf{x}_t$ directly from $\mathbf{x}_0$ without iterating through all intermediate steps. Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \, \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I})$$

Equivalently:

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \, \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

This closed-form expression is essential for efficient training: we can jump to any timestep $t$ in one step.

The Reverse Process

The reverse process starts from noise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and iteratively denoises:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2 \mathbf{I})$$

The neural network $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ must predict the mean of the denoised sample at each timestep. Ho et al. (2020) showed that instead of predicting $\boldsymbol{\mu}_\theta$ directly, it is simpler and more effective to predict the noise $\boldsymbol{\epsilon}$ that was added:

$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right)$$

The Simplified Training Objective

The full variational lower bound for diffusion models decomposes into a sum of KL divergences at each timestep. Ho et al. (2020) showed that a simplified objective — predicting the noise — works just as well and is much simpler to implement:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2 \right]$$

where $t \sim \text{Uniform}\{1, \ldots, T\}$, $\mathbf{x}_0 \sim p_{\text{data}}$, and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.

The training procedure is:

Sample a clean data point $\mathbf{x}_0$.
Sample a random timestep $t$.
Sample noise $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
Construct the noisy version: $\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \, \boldsymbol{\epsilon}$.
Train the network to predict $\boldsymbol{\epsilon}$ from $(\mathbf{x}_t, t)$.

Understanding Why: The simplicity of the diffusion training procedure is remarkable. There is no adversarial game, no posterior collapse, no mode collapse. The model is simply trained to predict noise — a regression problem. The mathematical elegance lies in the derivation that shows this simple regression loss is equivalent (up to a weighting) to the variational lower bound on the log-likelihood. The complexity is deferred to sampling, which requires iterating through all $T$ steps of the reverse process.

Connection to Score Matching

The noise prediction $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ is directly related to the score function — the gradient of the log-density:

$$\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t) = -\frac{\boldsymbol{\epsilon}}{\sqrt{1 - \bar{\alpha}_t}}$$

The diffusion model is therefore learning the score at every noise level. Sampling via the reverse process is a form of annealed Langevin dynamics: following the score function (gradient of the log-density) from a high-noise distribution down to the data distribution.

This connection was made explicit by Song and Ermon (2019) in their score-based generative modeling framework, which unifies diffusion models and score matching under a single theoretical umbrella. Song et al. (2021) further generalized this to continuous-time stochastic differential equations (SDEs), where the forward process is:

$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t) \, dt + g(t) \, d\mathbf{w}$$

and the reverse process involves the time-reversed SDE with the score function.

Implementation: Simple DDPM

class SinusoidalPositionEmbedding(nn.Module):
    """Sinusoidal embedding for diffusion timestep.

    Maps integer timestep t to a continuous vector using
    sinusoidal functions at different frequencies, analogous
    to positional encoding in transformers (Chapter 10).

    Args:
        dim: Embedding dimensionality.
    """

    def __init__(self, dim: int) -> None:
        super().__init__()
        self.dim = dim

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        """Embed timestep t.

        Args:
            t: Integer timesteps, shape (batch_size,).

        Returns:
            Embedding vectors, shape (batch_size, dim).
        """
        device = t.device
        half_dim = self.dim // 2
        emb = np.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
        emb = t[:, None].float() * emb[None, :]
        return torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)


class DenoisingMLP(nn.Module):
    """Simple MLP-based denoising network for DDPM.

    Predicts the noise epsilon given the noisy input x_t and timestep t.
    For image data, a U-Net would be used instead (see Further Reading),
    but an MLP suffices for low-dimensional tabular data and pedagogy.

    Args:
        input_dim: Dimensionality of the data.
        hidden_dim: Hidden layer size.
        time_dim: Dimensionality of the timestep embedding.
    """

    def __init__(
        self, input_dim: int = 784, hidden_dim: int = 512, time_dim: int = 128
    ) -> None:
        super().__init__()
        self.time_embed = SinusoidalPositionEmbedding(time_dim)

        self.net = nn.Sequential(
            nn.Linear(input_dim + time_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        """Predict noise given noisy input and timestep.

        Args:
            x: Noisy input x_t, shape (batch_size, input_dim).
            t: Timesteps, shape (batch_size,).

        Returns:
            Predicted noise epsilon, shape (batch_size, input_dim).
        """
        t_emb = self.time_embed(t)
        return self.net(torch.cat([x, t_emb], dim=-1))


class SimpleDDPM:
    """Denoising Diffusion Probabilistic Model (Ho et al., 2020).

    Implements the forward (noising) and reverse (denoising) processes
    with a linear noise schedule.

    Args:
        model: Denoising network that predicts noise.
        n_timesteps: Number of diffusion timesteps T.
        beta_start: Initial noise level.
        beta_end: Final noise level.
        device: Computation device.
    """

    def __init__(
        self,
        model: nn.Module,
        n_timesteps: int = 1000,
        beta_start: float = 1e-4,
        beta_end: float = 0.02,
        device: str = "cpu",
    ) -> None:
        self.model = model.to(device)
        self.n_timesteps = n_timesteps
        self.device = device

        # Linear noise schedule
        self.betas = torch.linspace(beta_start, beta_end, n_timesteps, device=device)
        self.alphas = 1.0 - self.betas
        self.alpha_cumprod = torch.cumprod(self.alphas, dim=0)
        self.alpha_cumprod_prev = torch.cat(
            [torch.tensor([1.0], device=device), self.alpha_cumprod[:-1]]
        )

        # Precompute quantities for forward and reverse processes
        self.sqrt_alpha_cumprod = torch.sqrt(self.alpha_cumprod)
        self.sqrt_one_minus_alpha_cumprod = torch.sqrt(1.0 - self.alpha_cumprod)
        self.sqrt_recip_alpha = torch.sqrt(1.0 / self.alphas)
        self.posterior_variance = (
            self.betas * (1.0 - self.alpha_cumprod_prev) / (1.0 - self.alpha_cumprod)
        )

    def forward_process(
        self, x_0: torch.Tensor, t: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """Add noise to x_0 at timestep t (closed-form).

        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon

        Args:
            x_0: Clean data, shape (batch_size, input_dim).
            t: Timesteps, shape (batch_size,).

        Returns:
            Tuple of (x_t, epsilon) — noisy data and the noise that was added.
        """
        epsilon = torch.randn_like(x_0)
        sqrt_alpha_bar = self.sqrt_alpha_cumprod[t].unsqueeze(-1)
        sqrt_one_minus_alpha_bar = self.sqrt_one_minus_alpha_cumprod[t].unsqueeze(-1)
        x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * epsilon
        return x_t, epsilon

    def compute_loss(self, x_0: torch.Tensor) -> torch.Tensor:
        """Compute the simplified DDPM loss: ||epsilon - epsilon_theta||^2.

        Args:
            x_0: Clean data batch, shape (batch_size, input_dim).

        Returns:
            Scalar loss tensor.
        """
        batch_size = x_0.size(0)
        t = torch.randint(0, self.n_timesteps, (batch_size,), device=self.device)
        x_t, epsilon = self.forward_process(x_0, t)
        epsilon_pred = self.model(x_t, t)
        return F.mse_loss(epsilon_pred, epsilon)

    @torch.no_grad()
    def sample(self, n_samples: int, input_dim: int) -> torch.Tensor:
        """Generate samples by running the reverse process.

        Starts from x_T ~ N(0, I) and iteratively denoises
        through T steps to produce x_0.

        Args:
            n_samples: Number of samples to generate.
            input_dim: Dimensionality of each sample.

        Returns:
            Generated samples, shape (n_samples, input_dim).
        """
        self.model.eval()
        x = torch.randn(n_samples, input_dim, device=self.device)

        for t in reversed(range(self.n_timesteps)):
            t_batch = torch.full((n_samples,), t, device=self.device, dtype=torch.long)
            epsilon_pred = self.model(x, t_batch)

            # Reverse step mean: mu = (1/sqrt(alpha_t)) * (x_t - beta_t/sqrt(1-alpha_bar_t) * eps)
            mean = self.sqrt_recip_alpha[t] * (
                x - self.betas[t] / self.sqrt_one_minus_alpha_cumprod[t] * epsilon_pred
            )

            if t > 0:
                noise = torch.randn_like(x)
                sigma = torch.sqrt(self.posterior_variance[t])
                x = mean + sigma * noise
            else:
                x = mean  # No noise at t=0

        return x


def train_ddpm(
    ddpm: SimpleDDPM,
    train_loader: DataLoader,
    epochs: int = 100,
    lr: float = 2e-4,
) -> List[float]:
    """Train the DDPM denoising network.

    Args:
        ddpm: SimpleDDPM instance.
        train_loader: Training data loader.
        epochs: Number of training epochs.
        lr: Learning rate.

    Returns:
        List of per-epoch average losses.
    """
    optimizer = torch.optim.Adam(ddpm.model.parameters(), lr=lr)
    losses = []

    for epoch in range(epochs):
        epoch_losses = []
        ddpm.model.train()
        for (x_batch,) in train_loader:
            x_batch = x_batch.to(ddpm.device)
            optimizer.zero_grad()
            loss = ddpm.compute_loss(x_batch)
            loss.backward()
            optimizer.step()
            epoch_losses.append(loss.item())

        avg_loss = np.mean(epoch_losses)
        losses.append(avg_loss)

        if (epoch + 1) % 20 == 0:
            print(f"Epoch {epoch+1:3d} | Loss: {avg_loss:.6f}")

    return losses

Classifier-Free Guidance

Conditional generation — producing samples that satisfy a condition $y$ (e.g., "generate a cat image" or "generate a patient record with diabetes") — can be achieved by training a conditional denoising network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y)$. Classifier-free guidance (Ho & Salimans, 2022) improves conditional sample quality by interpolating between conditional and unconditional predictions:

$$\hat{\boldsymbol{\epsilon}} = (1 + w) \, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - w \, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing)$$

where $w > 0$ is the guidance scale and $\varnothing$ denotes the null condition (no class label). During training, the condition $y$ is randomly dropped (replaced with $\varnothing$) with some probability (e.g., 10%), so the same network learns both conditional and unconditional denoising. At inference, increasing $w$ sharpens the conditional distribution at the cost of diversity.

Production Reality: Classifier-free guidance is the mechanism behind the "CFG scale" slider in image generation tools. A scale of $w = 0$ produces diverse but loosely conditioned outputs. A scale of $w = 7\text{--}10$ produces sharply conditioned outputs that closely follow the prompt but with reduced variety. Very high scales ($w > 15$) produce oversaturated, artifact-heavy outputs. The tradeoff between quality/adherence and diversity is fundamental and cannot be eliminated — only managed.

12.6 Flow Matching and Normalizing Flows

Normalizing Flows: Exact Likelihood via Invertible Transformations

Normalizing flows take yet another approach: define an invertible transformation $f_\theta: \mathbb{R}^d \to \mathbb{R}^d$ that maps a simple base distribution $p_\mathbf{z}(\mathbf{z})$ (e.g., Gaussian) to the data distribution. The change-of-variables formula gives the exact density:

$$p_\theta(\mathbf{x}) = p_\mathbf{z}(f_\theta^{-1}(\mathbf{x})) \left| \det \frac{\partial f_\theta^{-1}}{\partial \mathbf{x}} \right|$$

or equivalently:

$$\log p_\theta(\mathbf{x}) = \log p_\mathbf{z}(f_\theta^{-1}(\mathbf{x})) + \log \left| \det \frac{\partial f_\theta^{-1}}{\partial \mathbf{x}} \right|$$

The key constraint is computational: the Jacobian determinant must be efficient to compute. This limits the architecture to special forms — coupling layers (RealNVP, Glow), autoregressive transforms (MAF, IAF) — where the Jacobian is triangular and its determinant is the product of diagonal entries.

Normalizing flows provide exact log-likelihoods (unlike VAEs, which provide a bound, or GANs, which provide no density at all). However, the invertibility constraint limits expressiveness, and flows have generally not matched GANs or diffusion models in sample quality for complex data like images.

Flow Matching: The Modern Alternative

Flow matching (Lipman et al., 2023) provides a simpler and more scalable approach to flow-based generation. Instead of constructing complex invertible architectures, flow matching learns a velocity field that transports samples from noise to data along straight paths.

Define a time-dependent probability path $p_t(\mathbf{x})$ that interpolates between noise ($t = 0$) and data ($t = 1$). The simplest choice is a linear interpolation:

$$\mathbf{x}_t = (1 - t) \boldsymbol{\epsilon} + t \mathbf{x}_1, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \quad \mathbf{x}_1 \sim p_{\text{data}}$$

The velocity (time derivative) along this path is:

$$\mathbf{v}^* = \frac{d\mathbf{x}_t}{dt} = \mathbf{x}_1 - \boldsymbol{\epsilon}$$

The flow matching objective trains a neural network $\mathbf{v}_\theta(\mathbf{x}_t, t)$ to predict this velocity:

$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t, \mathbf{x}_1, \boldsymbol{\epsilon}} \left[ \|\mathbf{v}_\theta(\mathbf{x}_t, t) - (\mathbf{x}_1 - \boldsymbol{\epsilon})\|^2 \right]$$

Sampling uses an ODE solver to integrate the learned velocity field from $t = 0$ to $t = 1$:

$$\frac{d\mathbf{x}}{dt} = \mathbf{v}_\theta(\mathbf{x}, t), \quad \mathbf{x}_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

Flow matching offers several advantages over DDPM-style diffusion:

Straighter paths require fewer integration steps at inference (10-50 vs. 1000 for DDPM).
No noise schedule to tune — the linear interpolation is parameter-free.
ODE sampling is deterministic (for a given initial noise), enabling exact likelihood computation via the continuous change-of-variables formula.

class FlowMatchingTrainer:
    """Flow matching training loop.

    Trains a velocity network to predict the conditional velocity
    field along linear interpolation paths from noise to data.

    Args:
        model: Velocity network v_theta(x_t, t) -> velocity.
        device: Computation device.
    """

    def __init__(self, model: nn.Module, device: str = "cpu") -> None:
        self.model = model.to(device)
        self.device = device

    def compute_loss(self, x_1: torch.Tensor) -> torch.Tensor:
        """Compute the flow matching loss.

        Args:
            x_1: Clean data samples, shape (batch_size, dim).

        Returns:
            Scalar loss tensor.
        """
        batch_size = x_1.size(0)
        # Sample time uniformly from [0, 1]
        t = torch.rand(batch_size, 1, device=self.device)
        # Sample noise
        epsilon = torch.randn_like(x_1)
        # Linear interpolation: x_t = (1 - t) * epsilon + t * x_1
        x_t = (1 - t) * epsilon + t * x_1
        # Target velocity: dx/dt = x_1 - epsilon
        target_v = x_1 - epsilon
        # Predict velocity
        pred_v = self.model(x_t, t.squeeze(-1))
        return F.mse_loss(pred_v, target_v)

    @torch.no_grad()
    def sample(
        self, n_samples: int, dim: int, n_steps: int = 50
    ) -> torch.Tensor:
        """Generate samples by integrating the velocity field (Euler method).

        Args:
            n_samples: Number of samples to generate.
            dim: Dimensionality of each sample.
            n_steps: Number of Euler integration steps.

        Returns:
            Generated samples, shape (n_samples, dim).
        """
        self.model.eval()
        dt = 1.0 / n_steps
        x = torch.randn(n_samples, dim, device=self.device)

        for i in range(n_steps):
            t = torch.full((n_samples,), i * dt, device=self.device)
            v = self.model(x, t)
            x = x + v * dt

        return x

Fundamentals > Frontier: Flow matching is conceptually the simplest generative model in this chapter: learn the velocity that transports noise to data. The simplicity is the point. The progression from normalizing flows (exact likelihood but complex architectures) to diffusion (simple training but 1000-step sampling) to flow matching (simple training AND fast sampling) illustrates how the field converges toward simpler formulations as understanding deepens. The math gets cleaner, not more complex.

12.7 Comparing Generative Model Families

The Comparison Framework

No single generative model family dominates on all axes. The right choice depends on which properties matter for the application:

Property	VAE	GAN	Diffusion (DDPM)	Flow Matching
Sample quality	Moderate (blurry)	High (sharp)	Highest	High
Mode coverage	Good	Poor (mode collapse)	Excellent	Excellent
Training stability	Stable	Unstable	Very stable	Very stable
Latent space	Continuous, structured	None (noise input)	None (noise input)	None (noise input)
Density evaluation	Approximate (ELBO)	None	Approximate	Exact (via ODE)
Sampling speed	Fast (one forward pass)	Fast (one forward pass)	Slow ($T$ steps)	Moderate (10-50 steps)
Controllability	Latent space manipulation	Conditional GAN	Classifier-free guidance	Conditional velocity
Training cost	Low	Moderate	High	Moderate-high

When to Use Each

VAEs are the right choice when you need: - A structured latent space for representation learning, interpolation, or clustering - Fast sampling (single forward pass through the decoder) - Anomaly detection via the ELBO (low ELBO = anomalous input) - Moderate quality is acceptable (e.g., synthetic tabular data, data augmentation)

GANs are the right choice when you need: - Maximum perceptual sharpness for a narrow domain (e.g., face generation, super-resolution) - Real-time generation (single forward pass through the generator) - You can invest in careful hyperparameter tuning and training monitoring

Diffusion models are the right choice when you need: - State-of-the-art sample quality and diversity - Mode coverage (no training examples should be "forgotten") - Conditional generation with flexible control (text-to-image, inpainting, editing) - Inference speed is not critical (or you can use distillation/fast samplers)

Flow matching is the right choice when you need: - Fast iterative sampling (10-50 steps vs. 1000 for DDPM) - Exact log-likelihoods for model comparison or anomaly detection - Simple, stable training without adversarial dynamics or noise schedules

Applications for the Practicing Data Scientist

Generative models are not only for image generation. Practical applications include:

1. Synthetic Data Generation. When real data is scarce, expensive, or privacy-restricted, generative models can produce synthetic datasets that preserve statistical properties. Case Study 1 (this chapter) shows this for electronic health records: a VAE trained on patient records generates synthetic patients whose marginal distributions, correlations, and clinical patterns match the real data — without containing any real patient's information.

2. Data Augmentation. For class-imbalanced problems, generative models can produce synthetic minority-class examples. A VAE trained on rare disease cases generates additional training examples that help the downstream classifier generalize.

3. Anomaly Detection. Generative models assign low probability (VAE: low ELBO; diffusion: high denoising loss) to out-of-distribution inputs. A VAE trained on normal manufacturing images flags defective parts as anomalous because they have low reconstruction probability.

4. Missing Data Imputation. Conditional generative models can fill in missing features by sampling from $p(\mathbf{x}_{\text{missing}} \mid \mathbf{x}_{\text{observed}})$. Multiple imputations from a single model provide uncertainty estimates for downstream analyses.

5. Stochastic Simulation. In scientific domains, generative models produce ensemble forecasts that capture the range of possible outcomes. Case Study 2 (this chapter) shows diffusion models generating stochastic weather ensemble members that capture the distribution of possible weather states.

12.8 Progressive Project: VAE for StreamRec Item Embeddings

In Chapter 8, we built content embeddings for StreamRec items using a 1D CNN. The CNN embeddings capture semantic similarity, but the embedding space has no generative structure — we cannot sample new items or smoothly interpolate between existing ones.

In this milestone, we build a VAE that learns a continuous latent space for StreamRec items. The key property is that the latent space is structured: nearby points correspond to similar items, interpolation between points produces meaningful "hybrid" items, and sampling from the prior generates realistic synthetic items.

Building the Item VAE

class ItemVAE(nn.Module):
    """VAE for StreamRec item embeddings.

    Takes item feature vectors (category one-hot + normalized metadata)
    and learns a continuous latent space where similar items are nearby
    and the decoder can generate realistic synthetic item features.

    Args:
        input_dim: Dimensionality of item feature vectors.
        hidden_dim: Hidden layer size.
        latent_dim: Dimensionality of the latent space.
    """

    def __init__(
        self, input_dim: int = 50, hidden_dim: int = 128, latent_dim: int = 16
    ) -> None:
        super().__init__()
        self.latent_dim = latent_dim

        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.BatchNorm1d(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.BatchNorm1d(hidden_dim),
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_log_var = nn.Linear(hidden_dim, latent_dim)

        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.BatchNorm1d(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.BatchNorm1d(hidden_dim),
            nn.Linear(hidden_dim, input_dim),
        )

    def encode(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_log_var(h)

    def reparameterize(
        self, mu: torch.Tensor, log_var: torch.Tensor
    ) -> torch.Tensor:
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + std * eps

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)

    def forward(
        self, x: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        x_recon = self.decode(z)
        return x_recon, mu, log_var


def generate_streamrec_items(
    n_items: int = 20000,
    n_categories: int = 20,
    n_numeric_features: int = 30,
    seed: int = 42,
) -> np.ndarray:
    """Generate synthetic StreamRec item features.

    Each item has a one-hot category vector plus numeric features
    (popularity score, recency, duration, engagement metrics, etc.)
    drawn from category-specific distributions.

    Args:
        n_items: Number of items to generate.
        n_categories: Number of content categories.
        n_numeric_features: Number of continuous features.
        seed: Random seed.

    Returns:
        Feature matrix of shape (n_items, n_categories + n_numeric_features).
    """
    rng = np.random.RandomState(seed)
    categories = rng.randint(0, n_categories, size=n_items)

    # One-hot encoding
    one_hot = np.zeros((n_items, n_categories), dtype=np.float32)
    one_hot[np.arange(n_items), categories] = 1.0

    # Category-specific numeric feature distributions
    numeric = np.zeros((n_items, n_numeric_features), dtype=np.float32)
    for cat in range(n_categories):
        mask = categories == cat
        n_cat = mask.sum()
        # Each category has different mean/std for each feature
        cat_means = rng.uniform(-1, 1, size=n_numeric_features)
        cat_stds = rng.uniform(0.3, 1.0, size=n_numeric_features)
        numeric[mask] = rng.normal(
            cat_means, cat_stds, size=(n_cat, n_numeric_features)
        ).astype(np.float32)

    features = np.hstack([one_hot, numeric])
    return features


def item_vae_loss(
    x_recon: torch.Tensor,
    x: torch.Tensor,
    mu: torch.Tensor,
    log_var: torch.Tensor,
    n_categories: int = 20,
    beta: float = 1.0,
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Compute VAE loss for item features.

    Uses cross-entropy for the categorical part (first n_categories dims)
    and MSE for the numeric part (remaining dims).

    Args:
        x_recon: Reconstructed features, shape (batch, input_dim).
        x: Original features, shape (batch, input_dim).
        mu: Encoder mean, shape (batch, latent_dim).
        log_var: Encoder log-variance, shape (batch, latent_dim).
        n_categories: Number of category dimensions (one-hot encoded).
        beta: KL weight.

    Returns:
        Tuple of (total_loss, reconstruction_loss, kl_loss).
    """
    # Categorical reconstruction: cross-entropy on the one-hot category part
    cat_recon = F.cross_entropy(
        x_recon[:, :n_categories], x[:, :n_categories].argmax(dim=1)
    )

    # Numeric reconstruction: MSE on the remaining features
    num_recon = F.mse_loss(x_recon[:, n_categories:], x[:, n_categories:])

    recon_loss = cat_recon + num_recon

    # KL divergence
    kl_loss = -0.5 * torch.mean(
        torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1)
    )

    return recon_loss + beta * kl_loss, recon_loss, kl_loss


def explore_latent_space(
    model: ItemVAE,
    features: np.ndarray,
    categories: np.ndarray,
    device: str = "cpu",
) -> Dict[str, np.ndarray]:
    """Encode items and analyze the latent space.

    Encodes all items, computes nearest neighbors in latent space,
    and generates synthetic items by sampling from the prior and
    interpolating between existing items.

    Args:
        model: Trained ItemVAE.
        features: Item feature matrix.
        categories: Category labels for each item.
        device: Computation device.

    Returns:
        Dictionary with latent embeddings, synthetic items, and
        interpolation results.
    """
    model.eval()
    x = torch.tensor(features, dtype=torch.float32, device=device)

    with torch.no_grad():
        mu, log_var = model.encode(x)
        z = mu  # Use mean (no stochasticity) for analysis

    z_np = z.cpu().numpy()

    # Nearest neighbors in latent space
    from sklearn.metrics.pairwise import cosine_similarity
    n_queries = 5
    rng = np.random.RandomState(42)
    query_indices = rng.choice(len(categories), size=n_queries, replace=False)

    print("=== Nearest Neighbors in Latent Space ===")
    for qi in query_indices:
        sims = cosine_similarity(z_np[qi:qi + 1], z_np)[0]
        sims[qi] = -1
        top_3 = np.argsort(sims)[-3:][::-1]
        query_cat = categories[qi]
        neighbor_cats = categories[top_3]
        print(
            f"Query (cat={query_cat}): neighbors are cats {neighbor_cats} "
            f"(sims: {sims[top_3].round(3)})"
        )

    # Generate synthetic items by sampling from the prior
    with torch.no_grad():
        z_sample = torch.randn(10, model.latent_dim, device=device)
        synthetic = model.decode(z_sample).cpu().numpy()

    print("\n=== Synthetic Item Statistics ===")
    print(f"Predicted categories: {synthetic[:, :20].argmax(axis=1)}")
    print(f"Numeric feature means: {synthetic[:, 20:].mean(axis=0)[:5].round(3)}")

    # Latent space interpolation between two items
    with torch.no_grad():
        z_a = mu[0:1]
        z_b = mu[100:101]
        alphas = torch.linspace(0, 1, 7, device=device).unsqueeze(1)
        z_interp = (1 - alphas) * z_a + alphas * z_b
        interp_items = model.decode(z_interp).cpu().numpy()

    print("\n=== Interpolation: Item 0 -> Item 100 ===")
    print(f"Categories along path: {interp_items[:, :20].argmax(axis=1)}")

    return {
        "latent_embeddings": z_np,
        "synthetic_items": synthetic,
        "interpolated_items": interp_items,
    }


# ---- Run the progressive project ----

# Generate item features
item_features = generate_streamrec_items(n_items=20000, n_categories=20)
categories = item_features[:, :20].argmax(axis=1)

# Prepare data
dataset = TensorDataset(torch.tensor(item_features, dtype=torch.float32))
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_ds, val_ds = torch.utils.data.random_split(
    dataset, [train_size, val_size], generator=torch.Generator().manual_seed(42)
)
train_loader = DataLoader(train_ds, batch_size=256, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=256, shuffle=False)

# Train the Item VAE
item_vae = ItemVAE(input_dim=50, hidden_dim=128, latent_dim=16)
device = "cuda" if torch.cuda.is_available() else "cpu"
item_vae = item_vae.to(device)
optimizer = torch.optim.Adam(item_vae.parameters(), lr=1e-3)

print("=== Training Item VAE ===")
for epoch in range(50):
    item_vae.train()
    epoch_loss = []
    for (batch,) in train_loader:
        batch = batch.to(device)
        optimizer.zero_grad()
        x_recon, mu, log_var = item_vae(batch)
        loss, recon, kl = item_vae_loss(x_recon, batch, mu, log_var, beta=0.5)
        loss.backward()
        optimizer.step()
        epoch_loss.append(loss.item())

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1:3d} | Loss: {np.mean(epoch_loss):.4f}")

# Explore the latent space
results = explore_latent_space(item_vae, item_features, categories, device)

The latent space reveals structure: items from the same category cluster together, nearest neighbors in latent space share categories, and interpolation between items produces smooth category transitions. This latent space becomes an input feature for the StreamRec recommendation model — a continuous, low-dimensional representation of each item that captures both category membership and numeric feature patterns.

In Chapter 13, we will build a two-tower retrieval model where the item tower uses these latent embeddings (or embeddings from a contrastive learning objective) to match items with users.

12.9 Evaluation of Generative Models

Evaluating generative models is harder than evaluating discriminative models. There is no single metric analogous to accuracy or AUC. Different metrics capture different aspects of generation quality.

Quantitative Metrics

Frechet Inception Distance (FID) measures the distance between the distribution of real and generated samples in the feature space of a pretrained classifier (InceptionV3 for images). Lower is better. FID captures both quality (are individual samples realistic?) and diversity (does the generated distribution cover the real distribution?).

$$\text{FID} = \|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|^2 + \text{Tr}\left(\boldsymbol{\Sigma}_r + \boldsymbol{\Sigma}_g - 2(\boldsymbol{\Sigma}_r \boldsymbol{\Sigma}_g)^{1/2}\right)$$

where $(\boldsymbol{\mu}_r, \boldsymbol{\Sigma}_r)$ and $(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$ are the mean and covariance of InceptionV3 features for real and generated samples.

Log-likelihood (or ELBO for VAEs) measures how well the model explains the data. Higher is better. Available for VAEs (approximate) and normalizing flows/flow matching (exact), but not for GANs.

Precision and Recall (Kynkaanniemi et al., 2019) separately measure quality and diversity: - Precision: fraction of generated samples that fall within the support of the real distribution (quality). - Recall: fraction of real samples that fall within the support of the generated distribution (coverage/diversity).

For Tabular and Structured Data

When generating tabular data (as in the pharma synthetic EHR application), domain-specific metrics are essential:

Marginal distribution matching: Compare histograms and summary statistics (mean, variance, quantiles) of each feature between real and synthetic data.
Correlation preservation: Compare the correlation matrix of real and synthetic data.
Downstream utility: Train a classifier on synthetic data and evaluate on real data. Good synthetic data should yield performance close to training on real data.
Privacy: Measure distance from each synthetic record to its nearest real record. Synthetic records that are too close to real records indicate memorization, not generalization.

Implementation Note: For tabular data, always report all four evaluation categories. A model that matches marginals but not correlations generates plausible individual features with wrong joint structure. A model that achieves high downstream utility but low privacy distance is memorizing the training set. No single metric tells the full story.

12.10 Chapter Summary

This chapter derived three generative model families from first principles:

Variational Autoencoders maximize the ELBO — a tractable lower bound on the log-likelihood obtained by introducing a variational posterior and applying Jensen's inequality. The reparameterization trick enables gradient-based optimization through the stochastic sampling step. VAEs provide structured latent spaces and fast sampling but produce blurry outputs due to the pixel-wise reconstruction loss.
Generative Adversarial Networks frame generation as a minimax game between a generator and discriminator, implicitly minimizing the Jensen-Shannon divergence between real and generated distributions. GANs produce sharp outputs but suffer from mode collapse and training instability. Wasserstein GANs and spectral normalization partially address these issues.
Diffusion Models define a forward process that adds noise and learn the reverse process that removes it. The simplified DDPM objective — predicting the noise — is a regression problem with no adversarial dynamics. Diffusion models achieve the best quality and mode coverage but require many sampling steps. Classifier-free guidance provides fine-grained control over conditional generation.
Flow Matching learns a velocity field that transports noise to data along straight paths, enabling fast sampling with fewer steps and exact likelihood computation.

The choice between families depends on the application: VAEs for representation learning and fast sampling, GANs for maximum sharpness in narrow domains, diffusion models for quality and diversity, and flow matching for simplicity and speed.

Understanding Why: The progression from VAEs to GANs to diffusion models is not merely chronological — it reflects different mathematical strategies for the same fundamental problem: approximating a complex, high-dimensional data distribution. VAEs bound the likelihood from below. GANs bypass likelihood entirely and use an adversarial signal. Diffusion models construct a path from noise to data and learn to traverse it. Understanding these strategies — not just the architectures — is what enables you to adapt them to new domains and recognize when a new generative approach is truly novel versus a rearrangement of existing ideas.